Global ETD Search

1	Longitudinal Data Clustering Via Kernel Mixture Models Zhang, Xi January 2021 (has links) Kernel mixture models are proposed to cluster univariate, independent multivariate and dependent bivariate longitudinal data. The Gaussian distribution in finite mixture models is replaced by the Gaussian and gamma kernel functions, and the expectation-maximization algorithm is used to estimate bandwidths and compute log-likelihood scores. For dependent bivariate longitudinal data, the bivariate Gaussian copula is used to reveal the correlation between two attributes. After that, we use AIC, BIC and ICL to select the best model. In addition, we also introduce a kernel distance-based clustering method to compare with the kernel mixture models. A simulation is performed to illustrate the performance of this mixture model, and results show that the gamma kernel mixture model performs better than the kernel distance-based clustering method based on misclassification rates. Finally, these two models are applied to COVID-19 data, and sixty countries are classified into ten clusters based on growth rates and death rates. / Thesis / Master of Science (MSc) kernel mixture model longitudinal data clustering
2	PV Hosting Analysis and Demand Response Selection for handling Modern Grid Edge Capability Abraham, Sherin Ann 27 June 2019 (has links) Recent technological developments have led to significant changes in the power grid. Increasing consumption, widespread adoption of Distributed Energy Resources (DER), and installation of smart meters are some of the many factors that characterize the changing distribution network. These transformations taking place at the edge of the grid call for improved planning and operation practices. In this context, this thesis aims to improve grid edge functionality by putting forth a method that addresses high demand during the peak period by identifying customer groups for participation in demand response programs, which can lead to significant peak shaving for the utility. A possible demand response strategy for peak shaving makes use of photovoltaic (PV) generation and battery energy storage systems (BESS). In the process, this work also examines the approach to computing hosting capacity (HC) for small PV and quantifies the difference in HC obtained when a detailed low-voltage (LV) network is available and included in HC studies. Most PV hosting studies assess the impact on system feeders with aggregated LV loads. However, as more residential customers adopt rooftop solar, the need to include secondary network models in the analysis is studied through a comparative study of hosting capacity for a feeder with varying levels of loading information available. / Master of Science / Today, with significant technological advancements, as we proceed towards a modern grid, a mere change in physical infrastructure will not be enough. With the changes in the kinds of equipment installed on the grid, a wave of transformation has also begun in the planning and operation practices for a smarter grid. Today, the edge of the grid, where the customer interfaces with the power system, has become extremely complex. 
Customers can use rooftop solar PV to generate their own electricity, they are more informed about their consumption behavior thanks to the installation of smart meters, and they have options to integrate other technologies such as battery energy storage systems and electric vehicles. As with any good technology, adopting these advancements brings with it a greater need for reform in the operation and planning of the system. For instance, the increasing installation of rooftop solar at the customer end calls for a review of existing methods that determine the maximum level of PV deployment possible in the network without violating operating conditions. So, in this work, a comparative study is done to review the PV hosting capacity of a network with varying levels of information available, and the importance of utilities having secondary network models available is emphasized. With PV deployed in the system, utilities can formulate enhanced demand response strategies to tackle high demand during the peak period. To identify customers for participation in such programs, this work develops a computationally efficient strategy for identifying customers with high demand during the peak period, who can be incentivized to participate in demand response programs. With this, significant peak shaving can be achieved by the utility, which in turn reduces stress on the distribution network during peak hours. Smart grid AMI data clustering PV hosting
3	A Data Clustering Approach to Support Modular Product Family Design Sahin, Asli 14 November 2007 (has links) Product Platform Planning is an emerging philosophy that calls for the planned development of families of related products. It is markedly different from the traditional product development process and relatively new in engineering design. Product families and platforms can offer a multitude of benefits when applied successfully such as economies of scale from producing larger volumes of the same modules, lower design costs from not having to redesign similar subsystems, and many other advantages arising from the sharing of modules. While advances in this are promising, there still remain significant challenges in designing product families and platforms. This is particularly true for defining the platform components, platform architecture, and significantly different platform and product variants in a systematic manner. Lack of precise definition for platform design assets in terms of relevant customer requirements, distinct differentiations, engineering functions, components, component interfaces, and relations among all, causes a major obstacle for companies to take full advantage of the potential benefits of product platform strategy. The main purpose of this research is to address the above mentioned challenges during the design and development of modular platform-based product families. It focuses on providing answers to a fundamental question, namely, how can a decision support approach from product module definition to the determination of platform alternatives and product variants be integrated into product family design? The method presented in this work emphasizes the incorporation of critical design requirements and specifications for the design of distinctive product modules to create platform concepts and product variants using a data clustering approach. 
A case application developed in collaboration with a tire manufacturer is used to verify that this research approach is suitable for reducing the complexity of design results by determining design commonalities across multiple design characteristics. The method was found helpful for determining and integrating critical design information (i.e., component dimensions, material properties, modularization driving factors, and functional relations) systematically into the design of product families and platforms. It supported decision-makers in defining distinctive product modules within the families and in determining multiple platform concepts and derivative product variants. / Ph. D. Modular Product Product Platform Design Data Clustering
4	Multivariate longitudinal data clustering with a copula kernel mixture model Zhang, Xi January 2024 (has links) Many common clustering methods cannot be used for clustering multivariate longitudinal data when the covariance of random variables is a function of the time points. For this reason, a copula kernel mixture model (CKMM) is proposed for clustering such data. The CKMM is a finite mixture model that decomposes each mixture component’s joint density function into a copula and marginal distribution functions, where a Gaussian copula is used for its mathematical tractability. This thesis considers three scenarios: first, the CKMM is developed for balanced multivariate longitudinal data with known eigenfunctions; second, the CKMM is used to fit unbalanced data where trajectories are aligned on the time axis, and eigenfunctions are unknown; and lastly, a dynamic CKMM (DCKMM) is applied to unbalanced data where trajectories are misaligned, and eigenfunctions are unknown. Expectation-maximization type algorithms are used for parameter estimation. The performance of CKMM is demonstrated on both simulated and real data. / Thesis / Candidate in Philosophy model-based clustering longitudinal data clustering
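The decomposition mentioned in the abstract above can be written out in generic notation. This is only a sketch of the standard copula factorization (Sklar's theorem) applied to a finite mixture; the symbols G, \pi_g, F_{gj}, f_{gj} and R_g are assumed here for illustration and are not taken from the thesis, which may parameterize the model differently.

```latex
\begin{align*}
  % Finite mixture with G components; the mixing proportions \pi_g sum to one.
  f(\mathbf{x}) &= \sum_{g=1}^{G} \pi_g \, f_g(\mathbf{x}), \\
  % Each component density factors into a copula density and its p marginals.
  f_g(\mathbf{x}) &= c_g\bigl(F_{g1}(x_1), \dots, F_{gp}(x_p)\bigr)
                     \prod_{j=1}^{p} f_{gj}(x_j), \\
  % Gaussian copula density with correlation matrix R_g,
  % where z_j = \Phi^{-1}(u_j) and \Phi is the standard normal CDF.
  c_g(\mathbf{u}) &= \lvert R_g \rvert^{-1/2}
                     \exp\Bigl\{ -\tfrac{1}{2}\, \mathbf{z}^{\top}
                     \bigl(R_g^{-1} - I\bigr)\, \mathbf{z} \Bigr\}.
\end{align*}
```

The Gaussian copula has this closed form, which is the usual reason it is considered convenient to work with.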
5	A clustering scheme for large high-dimensional document datasets Chen, Jing-wen 09 August 2007 (has links) Document clustering methods are attracting more and more attention. Because of the high dimensionality and the large volume of data, clustering methods usually need a lot of time to run. We propose a scheme that makes a clustering algorithm much faster than the original. We partition the whole dataset into several parts. First, one of these parts is clustered. Then, according to the labels obtained from clustering, we reduce the number of features by a certain ratio. Next, we add another part of the data, convert it to the lower dimension, and cluster again. This is repeated until all partitions have been used. According to the experimental results, this scheme may run twice as fast as the original clustering method. Dimension reduction high-dimensional data clustering text mining Document clustering
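The partition-cluster-reduce loop described in the abstract above might be sketched as follows. The abstract does not say which clustering algorithm or feature-scoring rule the thesis uses, so this sketch assumes plain k-means with farthest-first initialisation, assumes that each round re-clusters all data seen so far, and assumes that features are ranked by how much their cluster means differ. All function names (`kmeans`, `select_features`, `incremental_cluster`) and the `keep_ratio` parameter are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Lloyd's k-means with farthest-first initialisation; returns labels."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def select_features(points, labels, k, keep_ratio):
    """Keep the features whose cluster means differ most (assumed criterion)."""
    dim = len(points[0])
    scores = []
    for j in range(dim):
        means = []
        for c in range(k):
            vals = [p[j] for p, lab in zip(points, labels) if lab == c]
            if vals:
                means.append(sum(vals) / len(vals))
        overall = sum(means) / len(means)
        scores.append(sum((m - overall) ** 2 for m in means))
    n_keep = max(1, int(dim * keep_ratio))
    best = sorted(range(dim), key=lambda j: scores[j], reverse=True)[:n_keep]
    return sorted(best)


def incremental_cluster(data, k, n_parts, keep_ratio):
    """Cluster part by part, shrinking the feature set after each round."""
    size = -(-len(data) // n_parts)  # ceiling division
    parts = [data[i:i + size] for i in range(0, len(data), size)]
    kept = list(range(len(data[0])))  # indices into the original features
    seen, labels = [], []
    for idx, part in enumerate(parts):
        seen.extend(part)
        # Project everything seen so far onto the currently kept features.
        projected = [[row[j] for j in kept] for row in seen]
        labels = kmeans(projected, k)
        if idx < len(parts) - 1:  # no reduction needed after the last part
            local = select_features(projected, labels, k, keep_ratio)
            kept = [kept[j] for j in local]
    return labels, kept
```

Each round works in a smaller feature space than the one before, which is where a speed-up of the kind the abstract reports would come from; how large it is depends on the data and on `keep_ratio`.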
6	Model-based clustering of high-dimensional binary data Tang, Yang 05 September 2013 (has links) We present a mixture of latent trait models with common slope parameters (MCLT) for high dimensional binary data, a data type for which few established methods exist. Recent work on clustering of binary data, based on a d-dimensional Gaussian latent variable, is extended by implementing common factor analyzers. We extend the model further by the incorporation of random block effects. The dependencies in each block are taken into account through block-specific parameters that are considered to be random variables. A variational approximation to the likelihood is exploited to derive a fast algorithm for determining the model parameters. The Bayesian information criterion is used to select the number of components and the covariance structure as well as the dimensions of latent variables. Our approach is demonstrated on U.S. Congressional voting data and on a data set describing the sensory properties of orange juice. Our examples show that our model performs well even when the number of observations is not very large relative to the data dimensionality. In both cases, our approach yields intuitive clustering results. Additionally, our dimensionality-reduction method allows data to be displayed in low-dimensional plots. / Early Researcher Award from the Government of Ontario (McNicholas); NSERC Discovery Grants (Browne and McNicholas).
7	Emotional Impacts on Driver Behavior: An Emo-Psychophysical Car-Following Model Higgs, Bryan James 09 September 2014 (has links) This research effort aims to create a new car-following model that accounts for the effects of emotion on driver behavior. This research effort is divided into eight research milestones: (1) the development of a segmentation and clustering algorithm to perform new investigations into driver behavior; (2) the finding that driver behavior is different between drivers, between car-following periods, and within a car-following period; (3) the finding that there are patterns in the distribution of driving behaviors; (4) the finding that driving states can result in different driving actions and that the same driving action can be the result of multiple driving states; (5) the finding that the performance of car-following models can be improved by calibration to state-action clusters; (6) the development of a psychophysiological driving simulator study; (7) the finding that the distribution of driving behavior is affected by emotional states; and (8) the development of a car-following model that incorporates the influence of emotions. / Ph. D. naturalistic data psychophysiology data clustering driving simulator emotion
8	Geometric Methods for Mining Large and Possibly Private Datasets Chen, Keke 07 July 2006 (has links) With the wide deployment of data-intensive Internet applications and continued advances in sensing technology and biotechnology, large multidimensional datasets, possibly containing privacy-sensitive information, have been emerging. Mining such datasets has become increasingly common in business integration, large-scale scientific data analysis, and national security. The proposed research aims at exploring the geometric properties of the multidimensional datasets utilized in statistical learning and data mining, and providing novel techniques and frameworks for mining very large datasets while protecting the desired data privacy. The first main contribution of this research is the development of the iVIBRATE interactive visualization-based approach for clustering very large datasets. The iVIBRATE framework uniquely addresses the challenges in handling irregularly shaped clusters, domain-specific cluster definition, and cluster-labeling of the data on disk. It consists of the VISTA visual cluster rendering subsystem and the Adaptive ClusterMap Labeling subsystem. The second main contribution is the development of the "Best K Plot" (BKPlot) method for determining the critical clustering structures in multidimensional categorical data. The BKPlot method uniquely addresses two challenges in clustering categorical data: how to determine the number of clusters (the best K) and how to identify the existence of significant clustering structures. The method consists of the basic theory, the sample BKPlot theory for large datasets, and the testing method for identifying no-cluster datasets. The third main contribution of this research is the development of the theory of geometric data perturbation and its application in privacy-preserving data classification involving single party or multiparty collaboration. 
The key to geometric data perturbation is finding a good randomly generated rotation matrix and an appropriate noise component that together provide a satisfactory balance between privacy guarantee and data quality, considering possible inference attacks. When geometric perturbation is applied to collaborative multiparty data classification, it is challenging to unify the different geometric perturbations used by different parties. We study three protocols under the data-mining-service oriented framework for unifying the perturbations: 1) the threshold-satisfied voting protocol, 2) the space adaptation protocol, and 3) the space adaptation protocol with a trusted party. The tradeoffs between the privacy guarantee, the model accuracy and the cost are studied for the protocols. Geometric methods Information visualization Data mining Privacy-preserving data mining Data clustering Data classification Distributed collaborative data mining Categorical data clustering
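The two ingredients the abstract names, a randomly generated rotation matrix and a noise component, can be illustrated with a minimal two-dimensional sketch. This is not the thesis's method: it assumes a uniformly random rotation angle and independent Gaussian noise, and it makes no attempt to search for a "good" rotation or to model inference attacks. The names `random_rotation_2d`, `perturb`, and `noise_sd` are hypothetical.

```python
import math
import random


def random_rotation_2d(rng):
    """Draw a 2-D rotation matrix with a uniformly random angle."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def perturb(points, rotation, noise_sd, rng):
    """Rotate each 2-D point, then add independent Gaussian noise."""
    out = []
    for x, y in points:
        rx = rotation[0][0] * x + rotation[0][1] * y
        ry = rotation[1][0] * x + rotation[1][1] * y
        out.append((rx + rng.gauss(0.0, noise_sd),
                    ry + rng.gauss(0.0, noise_sd)))
    return out


rng = random.Random(0)  # fixed seed so the sketch is repeatable
records = [(0.0, 0.0), (3.0, 4.0), (1.0, -2.0)]  # made-up example data
rotation = random_rotation_2d(rng)
released = perturb(records, rotation, noise_sd=0.1, rng=rng)
```

With `noise_sd=0.0` the transformation is a pure rotation, so pairwise Euclidean distances between records are unchanged. Raising `noise_sd` distorts those distances, which is one simple way to picture the privacy versus data quality balance the abstract refers to.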
9	Algoritmos e técnicas de validação em agrupamento de dados multi-representados, agrupamento possibilístico e bi-agrupamento / Algorithms and validation techniques in multi-represented data clustering, possibilistic clustering and bi-clustering Horta, Danilo 25 November 2013 (has links) There are data sets for which the instances are naturally represented by more than one view. For example, images can be described by attributes of color, texture, and shape. Proteins can be characterized by the amino acid sequence and by their three-dimensional representation. The unification of different views of a data set can be problematic because they may not be comparable or may have different degrees of importance. These degrees of importance may even manifest themselves locally, according to the data substructures. This prompted the emergence of clustering algorithms capable of handling multi-represented data sets (i.e., data sets having more than one view), such as the SCAD algorithm. 
This algorithm has shown promising results in experiments reported in the literature, but it has critical problems identified in this work that hinder its application in certain scenarios. These problems were solved here by proposing a new version of the algorithm, called ASCAD, based on formal proofs of its correctness. We developed relational versions of ASCAD, capable of handling data sets described only by the proximities between the instances. We also developed an index for internal and relative validation of multi-represented data clusterings. The evaluation of possibilistic clustering and bi-clustering by comparing the found and reference solutions (external validation) was also explored. Bi-clustering algorithms have gained increasing interest from the gene expression analysis community. However, little is known about the behavior and properties of the measures aimed at external validation of bi-clustering, which motivated a theoretical and empirical analysis of these measures in this work. This analysis showed that most bi-clustering measures have critical issues and highlighted two of the measures as being the most promising. We included in this analysis three measures of non-exclusive partitional clustering, whose use in comparing bi-clusterings is possible through a new approach proposed in this thesis. Non-exclusive partitional clustering belongs to a more general domain of solutions, i.e., the domain of possibilistic clusterings. There are some important conceptual flaws in the measures of possibilistic clustering, which motivated us to develop new measures and to conceptually and empirically analyse 34 measures. 
One of the proposed measures stood out as the only one that provided unbiased evaluations regarding the number of clusters, the maximum similarity when comparing the optimal solution with the reference one, and evaluations sensitive to solution differences in all scenarios considered. Agrupamento de dados Clustering validation Data clustering Multi-represented data Validação de agrupamento
10	Desenvolvimento de modelos dinâmicos para a formação de clusters aplicados em dados biológicos / Developing dynamical systems for data clustering applied to biological data Damiance Junior, Antonio Paulo Galdeano 16 October 2006 (has links) With the advent of microarray technology, a large amount of gene expression data is now available. Clustering is the computational technique usually employed to analyze and explore the data produced by microarrays. Due to the variety of information that can be extracted from the expression data, many clustering techniques with different approaches are needed. The dynamical model for data clustering proposed in (Zhao et al. 2003a) has several features of interest for the clustering task: the number of clusters does not need to be known in advance, it has a multi-scale property, it is highly parallel, and it is flexible enough to accommodate more complex rules and mechanisms for forming clusters. However, two desirable features for clustering techniques are not present: the ability to detect clusters of arbitrary size and shape, and a hierarchical representation of the clusters. This project presents three techniques that overcome these limitations of the dynamical model proposed in (Zhao et al. 2003a). The first technique, called Model1, is a simplification of the original model that is more efficient. The second technique, called Model2, adds a new set of elements to the dynamical model and is capable of detecting clusters of arbitrary size and shape. The third technique is a hierarchical clustering algorithm that uses Model1 as a building block. The techniques developed here were applied to biological data: microarray image segmentation was performed, and the St. Jude Leukemia gene expression data set was analyzed and explored. Auto-organização Clusterização de dados Data clustering Dynamical model Expressão de genes Gene expression Modelos dinâmicos Self-organizing
