1 |
Longitudinal Data Clustering Via Kernel Mixture Models
Zhang, Xi, January 2021
Kernel mixture models are proposed to cluster univariate, independent multivariate and dependent bivariate longitudinal data. The Gaussian distribution in finite mixture models is replaced by the Gaussian and gamma kernel functions, and the expectation-maximization algorithm is used to estimate bandwidths and compute log-likelihood scores. For dependent bivariate longitudinal data, the bivariate Gaussian copula is used to reveal the correlation between two attributes. After that, we use AIC, BIC and ICL to select the best model. In addition, we also introduce a kernel distance-based clustering method to compare with the kernel mixture models. A simulation is performed to illustrate the performance of this mixture model, and results show that the gamma kernel mixture model performs better than the kernel distance-based clustering method based on misclassification rates. Finally, these two models are applied to COVID-19 data, and sixty countries are classified into ten clusters based on growth rates and death rates. / Thesis / Master of Science (MSc)
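The model-selection step mentioned above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the function name `information_criteria` and its arguments are hypothetical, and the "smaller is better" sign convention is an assumption, since the abstract does not state which convention is used. ICL is taken as BIC plus twice the entropy of the soft cluster memberships.

```python
import math


def information_criteria(log_lik, n_params, n_obs, posteriors):
    """Return (AIC, BIC, ICL), assuming the 'smaller is better' convention.

    log_lik    : maximized log-likelihood of the fitted mixture
    n_params   : number of free parameters in the model
    n_obs      : number of observations (e.g. trajectories)
    posteriors : one row per observation with its component membership
                 probabilities (hypothetical input layout)
    """
    aic = -2.0 * log_lik + 2.0 * n_params
    bic = -2.0 * log_lik + n_params * math.log(n_obs)
    # Entropy of the soft assignments; it is zero when every observation
    # belongs to one component with probability 1.
    entropy = -sum(z * math.log(z) for row in posteriors for z in row if z > 0.0)
    icl = bic + 2.0 * entropy
    return aic, bic, icl
```

Under this convention, each candidate model would be fitted separately and the one with the smallest criterion value retained; ICL additionally penalizes models whose clusters overlap heavily.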
|
2 |
PV Hosting Analysis and Demand Response Selection for handling Modern Grid Edge Capability
Abraham, Sherin Ann, 27 June 2019
Recent technological developments have led to significant changes in the power grid. Increasing consumption, widespread adoption of Distributed Energy Resources (DER), and installation of smart meters are some of the many factors that characterize the changing distribution network. These transformations taking place at the edge of the grid call for improved planning and operation practices. In this context, this thesis aims to improve grid edge functionality by putting forth a method that addresses high demand during peak periods by identifying customer groups for participation in demand response programs, which can lead to significant peak shaving for the utility. A possible demand response strategy for peak shaving makes use of photovoltaic (PV) generation and battery energy storage systems (BESS). In the process, this work also examines how hosting capacity (HC) is computed for small PV and quantifies the difference in HC obtained when a detailed low-voltage (LV) network is available and included in HC studies. Most PV hosting studies assess the impact on system feeders with aggregated LV loads. However, as more residential customers adopt rooftop solar, the need to include secondary network models in the analysis is studied through a comparative study of hosting capacity for a feeder with varying levels of loading information available. / Master of Science / Today, with significant technological advancements, as we proceed towards a modern grid, a mere change in physical infrastructure will not be enough. With the changes in the kinds of equipment installed on the grid, a wave of transformation has also begun in the planning and operation practices for a smarter grid. The edge of the grid, where the customer is interfaced to the power system, has become extremely complex. Customers can use rooftop solar PV to generate their own electricity, they are better informed about their consumption behavior thanks to smart meters, and they have options to integrate other technologies such as battery energy storage systems and electric vehicles. As with any good technology, adoption of these advancements brings with it a greater need for reform in the operation and planning of the system. For instance, increasing installation of rooftop solar at the customer end calls for a review of existing methods that determine the maximum level of PV deployment possible in the network without violating the operating conditions. So, in this work, a comparative study is done to review the PV hosting capacity of a network with varying levels of information available, and the importance of utilities having secondary network models available is emphasized. With PV deployed in the system, utilities can formulate enhanced demand response strategies to tackle high demand during peak periods. To identify customers for participation in such programs, this work develops a computationally efficient strategy for finding customers with high demand during peak periods, who can be incentivized to participate in demand response programs. With this, the utility can achieve significant peak shaving, which in turn reduces stress on the distribution network during peak hours.
|
3 |
Automatic K-Expectation-Maximization (K-EM) Clustering Algorithm for Data Mining Applications
Harsh, Archit, 12 August 2016
A non-parametric data clustering technique for achieving efficient data clustering and improving the selection of the number of clusters is presented in this thesis. K-Means and Expectation-Maximization algorithms have been widely deployed in data-clustering applications. Related work has shown that both algorithms have shortcomings. K-Means was found not to guarantee convergence, and the choice of clusters heavily influenced the results. Expectation-Maximization’s premature convergence does not assure the optimality of results and, as with K-Means, the choice of clusters influences the results. To overcome these shortcomings, a fast automatic K-EM algorithm is developed that provides the optimal number of clusters by employing various internal cluster validity metrics, yielding efficient and unbiased results. The algorithm is applied to a wide array of data sets to verify the accuracy of the results and the efficiency of the algorithm.
|
4 |
A Data Clustering Approach to Support Modular Product Family Design
Sahin, Asli, 14 November 2007
Product Platform Planning is an emerging philosophy that calls for the planned development of families of related products. It is markedly different from the traditional product development process and relatively new in engineering design. Product families and platforms can offer a multitude of benefits when applied successfully such as economies of scale from producing larger volumes of the same modules, lower design costs from not having to redesign similar subsystems, and many other advantages arising from the sharing of modules. While advances in this are promising, there still remain significant challenges in designing product families and platforms. This is particularly true for defining the platform components, platform architecture, and significantly different platform and product variants in a systematic manner. Lack of precise definition for platform design assets in terms of relevant customer requirements, distinct differentiations, engineering functions, components, component interfaces, and relations among all, causes a major obstacle for companies to take full advantage of the potential benefits of product platform strategy.
The main purpose of this research is to address the above mentioned challenges during the design and development of modular platform-based product families. It focuses on providing answers to a fundamental question, namely, how can a decision support approach from product module definition to the determination of platform alternatives and product variants be integrated into product family design?
The method presented in this work emphasizes the incorporation of critical design requirements and specifications for the design of distinctive product modules to create platform concepts and product variants using a data clustering approach.
A case application developed in collaboration with a tire manufacturer is used to verify that this research approach is suitable for reducing the complexity of design results by determining design commonalities across multiple design characteristics. The method was found helpful for determining and integrating critical design information (i.e., component dimensions, material properties, modularization driving factors, and functional relations) systematically into the design of product families and platforms. It supported decision-makers in defining distinctive product modules within the families and in determining multiple platform concepts and derivative product variants. / Ph. D.
|
5 |
Multivariate longitudinal data clustering with a copula kernel mixture model
Zhang, Xi, January 2024
Many common clustering methods cannot be used for clustering multivariate longitudinal data when the covariance of random variables is a function of the time points. For this reason, a copula kernel mixture model (CKMM) is proposed for clustering such data. The CKMM is a finite mixture model that decomposes each mixture component’s joint density function into a copula and marginal distribution functions, where a Gaussian copula is used for its mathematical tractability. This thesis considers three scenarios: first, the CKMM is developed for balanced multivariate longitudinal data with known eigenfunctions; second, the CKMM is used to fit unbalanced data where trajectories are aligned on the time axis and eigenfunctions are unknown; and lastly, a dynamic CKMM (DCKMM) is applied to unbalanced data where trajectories are misaligned and eigenfunctions are unknown. Expectation-maximization type algorithms are used for parameter estimation. The performance of CKMM is demonstrated on both simulated and real data. / Thesis / Candidate in Philosophy
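The decomposition of each component's joint density into a copula and marginals follows the standard copula factorization (Sklar's theorem). The notation below is generic and assumed for illustration; it is not taken from the thesis. Here G is the number of components, the π_g are mixing proportions, f_{gj} and F_{gj} are the marginal density and distribution function of variable j in component g, c_g is a copula density, R is the correlation matrix of a Gaussian copula, and Φ^{-1} is the standard normal quantile function.

```latex
f(\mathbf{x}) = \sum_{g=1}^{G} \pi_g \,
  c_g\bigl(F_{g1}(x_1), \dots, F_{gd}(x_d)\bigr)
  \prod_{j=1}^{d} f_{gj}(x_j),
\qquad \pi_g > 0, \quad \sum_{g=1}^{G} \pi_g = 1,

c(\mathbf{u}; R) = |R|^{-1/2}
  \exp\!\Bigl( -\tfrac{1}{2}\, \mathbf{q}^{\top} \bigl(R^{-1} - I\bigr)\, \mathbf{q} \Bigr),
\qquad q_j = \Phi^{-1}(u_j).
```

The closed form of the Gaussian copula density is what makes it convenient to work with inside an expectation-maximization type algorithm.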
|
6 |
A clustering scheme for large high-dimensional document datasets
Chen, Jing-wen, 09 August 2007
Document clustering methods are attracting more and more attention. Because the data are high-dimensional and numerous, clustering methods usually need a lot of computation time. We propose a scheme that makes the clustering algorithm much faster than the original. We partition the whole dataset into several parts. First, we cluster one of these parts. Then, according to the labels obtained from clustering, we reduce the number of features by a certain ratio. We add another part of the data, convert these data to the lower dimension, and cluster them again. We repeat this until all partitions are used. According to the experimental results, this scheme may run twice as fast as the original clustering method.
|
7 |
Model-based clustering of high-dimensional binary data
Tang, Yang, 05 September 2013
We present a mixture of latent trait models with common slope parameters (MCLT) for high dimensional binary data, a data type for which few established methods exist. Recent work on clustering of binary data, based on a d-dimensional Gaussian latent variable, is extended by implementing common factor analyzers. We extend the model further by the incorporation of random block effects. The dependencies in each block are taken into account through block-specific parameters that are considered to be random variables. A variational approximation to the likelihood is exploited to derive a fast algorithm for determining the model parameters. The Bayesian information criterion is used to select the number of components and the covariance structure as well as the dimensions of latent variables. Our approach is demonstrated on U.S. Congressional voting data and on a data set describing the sensory properties of orange juice. Our examples show that our model performs well even when the number of observations is not very large relative to the data dimensionality. In both cases, our approach yields intuitive clustering results. Additionally, our dimensionality-reduction method allows data to be displayed in low-dimensional plots. / Early Researcher Award from the Government of Ontario (McNicholas); NSERC Discovery Grants (Browne and McNicholas).
|
8 |
A data clustering algorithm for stratified data partitioning in artificial neural network
Sahoo, Ajit Kumar, Unknown Date
No description available.
|
9 |
Scalable Embeddings for Kernel Clustering on MapReduce
Elgohary, Ahmed, 14 February 2014
There is an increasing demand from businesses and industries to make the best use of their data. Clustering is a powerful tool for discovering natural groupings in data. The k-means algorithm is the most commonly-used data clustering method, having gained popularity for its effectiveness on various data sets and ease of implementation on different computing architectures. It assumes, however, that data are available in an attribute-value format, and that each data instance can be represented as a vector in a feature space where the algorithm can be applied. These assumptions are impractical for real data, and they hinder the use of complex data structures in real-world clustering applications.
The kernel k-means is an effective method for data clustering which extends the k-means algorithm to work on a similarity matrix over complex data structures. The kernel k-means algorithm is however computationally very complex as it requires the complete data matrix to be calculated and stored. Further, the kernelized nature of the kernel k-means algorithm hinders the parallelization of its computations on modern infrastructures for distributed computing. This thesis defines a family of kernel-based low-dimensional embeddings that allows for scaling kernel k-means on MapReduce via an efficient and unified parallelization strategy. Then, three practical methods for low-dimensional embedding that adhere to our definition of the embedding family are proposed. Combining the proposed parallelization strategy with any of the three embedding methods constitutes a complete scalable and efficient MapReduce algorithm for kernel k-means. The efficiency and the scalability of the presented algorithms are demonstrated analytically and empirically.
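For context, plain (non-distributed) kernel k-means can be sketched as below. This is the textbook Lloyd-style iteration on a precomputed kernel matrix, not the embedding-based MapReduce algorithm the thesis proposes; the function name `kernel_kmeans` and its arguments are hypothetical. It illustrates why the full kernel matrix is needed: the squared distance from a point to a cluster centroid in feature space is computed from kernel entries only.

```python
def kernel_kmeans(K, labels, n_clusters, max_iter=100):
    """Lloyd-style kernel k-means on a precomputed kernel matrix.

    K          : n x n kernel (similarity) matrix as a list of lists
    labels     : initial cluster index for each of the n points
    n_clusters : number of clusters
    Returns the final list of cluster indices.
    """
    n = len(K)
    labels = list(labels)
    for _ in range(max_iter):
        members = [[i for i in range(n) if labels[i] == c]
                   for c in range(n_clusters)]
        # Mean pairwise kernel value inside each cluster (centroid norm term).
        within = [
            sum(K[j][l] for j in m for l in m) / (len(m) ** 2) if m else 0.0
            for m in members
        ]
        new_labels = []
        for i in range(n):
            best_c, best_d = labels[i], float("inf")
            for c, m in enumerate(members):
                if not m:
                    continue  # an empty cluster cannot attract points
                # ||phi(x_i) - mu_c||^2 written with kernel entries only
                d = K[i][i] - 2.0 * sum(K[i][j] for j in m) / len(m) + within[c]
                if d < best_d:
                    best_c, best_d = c, d
            new_labels.append(best_c)
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels
```

Every iteration touches on the order of n² kernel entries, which is the storage and computation cost that motivates low-dimensional embeddings when scaling to distributed settings.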
|
10 |
A data clustering algorithm for stratified data partitioning in artificial neural network
Sahoo, Ajit Kumar, 06 1900
The statistical properties of training, validation and test data play an important role in assuring optimal performance in artificial neural networks (ANN). Researchers have proposed randomized data partitioning (RDP) and stratified data partitioning (SDP) methods for partition of input data into training, validation and test datasets. RDP methods based on genetic algorithm (GA) are computationally expensive as the random search space can be in the power of twenty or more for an average sized dataset. For SDP methods, clustering algorithms such as self organizing map (SOM) and fuzzy clustering (FC) are used to form strata. It is assumed that data points in any individual stratum are in close statistical agreement. Reported clustering algorithms are designed to form natural clusters. In the case of large multivariate datasets, some of these natural clusters can be big enough such that the furthest data vectors are statistically far away from the mean. Further, these algorithms are computationally expensive as well. Here a custom design clustering algorithm (CDCA) has been proposed to overcome these shortcomings. Comparisons have been made using three benchmark case studies, one each from classification, function approximation and prediction domain respectively. The proposed CDCA data partitioning method was evaluated in comparison with SOM, FC and GA based data partitioning methods. It was found that the CDCA data partitioning method not only performed well but also reduced the average CPU time. / Engineering Management
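Once strata have been formed by some clustering method, the stratified partitioning step itself can be sketched as follows. This is a generic illustration that does not reproduce CDCA, which the abstract does not specify; the function name `stratified_partition`, the 60/20/20 split, and the fixed seed are all assumptions.

```python
import random


def stratified_partition(strata, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split sample indices into train/validation/test sets, stratum by stratum.

    strata    : one cluster label per sample (from any clustering method)
    fractions : assumed train/validation/test proportions
    seed      : seed for reproducible shuffling
    """
    rng = random.Random(seed)
    by_stratum = {}
    for idx, label in enumerate(strata):
        by_stratum.setdefault(label, []).append(idx)

    train, val, test = [], [], []
    for label in sorted(by_stratum):
        members = by_stratum[label]
        rng.shuffle(members)
        n_train = round(fractions[0] * len(members))
        n_val = round(fractions[1] * len(members))
        # Each stratum contributes to all three sets in the same proportions;
        # whatever remains after train and validation goes to test.
        train.extend(members[:n_train])
        val.extend(members[n_train:n_train + n_val])
        test.extend(members[n_train + n_val:])
    return train, val, test
```

Because every stratum is sampled in the same proportions, the three subsets should have similar statistical properties provided the points within each stratum are themselves similar, which is the assumption the abstract highlights.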
|