Spelling suggestions: "subject:"clustering"" "subject:"klustering""
321 |
Structured clustering representations and methodsHeilbut, Adrian Mark 21 June 2016 (has links)
Rather than designing focused experiments to test individual hypotheses, scientists now commonly acquire measurements using massively parallel techniques, for post hoc interrogation. The resulting data is both high-dimensional and structured, in that observed variables are grouped and ordered into related subspaces, reflecting both natural physical organization and factorial experimental designs. Such structure encodes critical constraints and clues to interpretation, but typical unsupervised learning methods assume exchangeability and fail to account adequately for the structure of data in a flexible and interpretable way. In this thesis, I develop computational methods for exploratory analysis of structured high-dimensional data, and apply them to study gene expression regulation in Parkinson’s (PD) and Huntington’s diseases (HD).
BOMBASTIC (Block-Organized, Model-Based, Tree-Indexed Clustering) is a methodology to cluster and visualize data organized in pre-specified subspaces, by combining independent clusterings of blocks into hierarchies. BOMBASTIC provides a formal specification of the block-clustering problem and a modular implementation that facilitates integration, visualization, and comparison of diverse datasets and rapid exploration of alternative analyses.
These tools, along with standard methods, were applied to study gene expression in mouse models of neurodegenerative diseases, in collaboration with Dr. Myriam Heiman and Dr. Robert Fenster. In PD, I analyzed cell-type-specific expression following levodopa treatment to study mechanisms underlying levodopa-induced dyskinesia (LID). I identified likely regulators of the transcriptional changes leading to LID and implicated signaling pathways amenable to pharmacological modulation (Heiman, Heilbut et al, 2014). In HD, I analyzed multiple mouse models (Kuhn, 2007), cell-type specific profiles of medium spiny neurons (Fenster, 2011), and an RNA-Seq dataset profiling multiple tissue types over time and across an mHTT allelic series (CHDI, 2015). I found evidence suggesting that altered activity of the PRC2 complex significantly contributes to the transcriptional dysregulation observed in striatal neurons in HD.
|
322 |
Visualização, kernels e subespaços: um estudo prático / Visualization, kernels and subspace: a practical studyBarbosa, Adriano Oliveira 16 December 2016 (has links)
Dados de alta dimensão são tipicamente tratados como pertencentes a um único subespaço do espaço onde estão imersos. Entretanto, dados utilizados em aplicações reais estão usualmente distribuídos entre subespaços independentes e com dimensões distintas. Um objeto de estudo surge a partir dessa afirmação: como essa distribuição em subespaços independentes pode auxiliar tarefas de visualização? Por outro lado, se o dado parece estar embaralhado nesse espaço de alta dimensão, como visualizar seus padrões e realizar tarefas como classificação? Podemos, por exemplo, mapear esse dado num outro espaço utilizando uma função capaz de o desembaralhar, de modo que os padrões intrínsecos fiquem mais claros e, assim, facilitando nossa tarefa de visualização ou classificação. Essa Tese apresenta dois estudos que abordam ambos os problemas. Para o primeiro, utilizamos técnicas de subspace clustering para definir, quando existente, a estrutura de subespaços do dado e estudamos como essa informação pode auxiliar em visualizações utilizando projeções multidimensionais. Para o segundo problema, métodos de kernel, bastante conhecidos na literatura, são as ferramentas a nos auxiliar. Utilizamos a medida de similaridade do kernel para desenvolver uma nova técnica de projeção multidimensional capaz de lidar com dados imersos no espaço de características induzido implicitamente pelo kernel. / High-dimensional data are typically handled as laying in a single subspace of the original space. However, data involved in real applications are usually spread around in distinct subspaces which may have different dimensions. We would like to study how the subspace structure information can be used to improve visualization tasks. On the other hand, what if the data is tangled in this high-dimensional space, how to visualize its patterns or how to accomplish classification tasks? One could, for example, map the data in another high-dimensional space using amapping capable of untangle the data making the patterns clear, rendering the visualization or classification an easy task. This dissertation presents an study for both problems pointed out above. For the former, we use subspace clustering techniques to define, when it exists, a subspace structure, studying how this information can be used to support visualization tasks based on multidimensional projections. For the latter problem we employ kernel methods, well known in the literature, as a tool to assist visualization tasks. We use a similarity measure given by the kernel to develop acompletely new multidimensional projection technique capable of dealing with data embedded in the implicit feature space defined by the kernel.
|
323 |
Propriedades assintóticas e estimadores consistentes para a probabilidade de clustering / Asymptotic properties and consistent estimators for the clustering probabilityMariana Pereira de Melo 23 May 2014 (has links)
Considere um processo estocástico X_m em tempo discreto definido sobre o alfabeto finito A. Seja x_0^k-1 uma palavra fixa sobre A^k. No estudo das propriedades estatísticas na teoria de recorrência de Poincaré, é clássico o estudo do tempo decorrente até que a sequência fixa x_0^k-1 seja encontrada em uma realização do processo. Tipicamente, esta é uma quantidade exponencialmente grande com relação ao comprimento da palavra. Contrariamente, o primeiro tempo de retorno possível para uma sequência dada está definido como sendo o mínimo entre os tempos de entrada de todas as sequências que começam com a própria palavra e é uma quantidade tipicamente pequena, da ordem do tamanho da palavra. Neste trabalho estudamos o comportamento da probabilidade deste primeiro retorno possível de uma palavra x_0^k-1 dado que o processo começa com ela mesma. Esta quantidade mede a intensidade de que, uma vez observado um conjunto alvo, possam ser observados agrupamentos ou clusters. Provamos que, sob certas condições, a taxa de decaimento exponencial desta probabilidade converge para a entropia para quase toda a sequência quando k diverge. Apresentamos também um estimador desta probabilidade para árvores de contexto e mostramos sua consistência. / Considering a stochastic process X_m in a discrete defined time over a finite alphabet A and x_0^k-1 a fixed word over A^k. In the study of the statistical properties of the Poincaré recurrence theory, it is usual the study of the time elapsed until a fixed sequence x_0^k-1 appears in a given realization of process. This quantity is known as the hitting time and it is usually exponentially large in relation to the size of word. On the opposite, the first possible return time of a given word is defined as the minimum among all the hitting times of realizations that begins with the given word x_0^k-1. This quantity is tipically small that is of the order of the length of the sequence. In this work, we study the probability of the first possible return time given that the process begins of the target word. This quantity measures the intensity of that, once observed the target set, it can be observed in clusters. We show that, under certain conditions, the exponential decay rate of this probability converges to the entropy for all almost every word x_0^k-1 as k diverges. We also present an estimator of this probability for context trees and shows its consistency.
|
324 |
Clustering de trajetórias / Trajectory clusteringMarcio Takashi Iura Oshiro 16 September 2015 (has links)
Esta tese teve como objetivo estudar problemas cinéticos de clustering, ou seja, problemas de clustering nos quais os objetos se movimentam. O trabalho se concentrou no caso unidimensional, em que os objetos são pontos se movendo na reta real. Diversas variantes desse caso foram abordadas. Em termos do movimento, consideramos o caso em que cada ponto se move com uma velocidade constante num dado intervalo de tempo, o caso em que os pontos se movem arbitrariamente e temos apenas as suas posições em instantes discretos de tempo, o caso em que os pontos se movem com uma velocidade aleatória em que se conhece apenas o valor esperado da velocidade, e o caso em que, dada uma partição do intervalo de tempo, os pontos se movem com velocidades constantes em cada subintervalo. Em termos do tipo de clustering buscado, nos concentramos no caso em que o número de clusters é um dado do problema e consideramos diferentes medidas de qualidade para o clustering. Duas delas são tradicionais para problemas de clustering: a soma dos diâmetros dos clusters e o diâmetro máximo de um cluster. A terceira medida considerada leva em conta a característica cinética do problema, e permite, de uma maneira controlada, que o clustering mude com o tempo. Para cada uma das variantes do problema, são apresentados algoritmos, exatos ou de aproximação, alguns resultados de complexidade obtidos, e questões que ficaram em aberto. / This work aimed to study kinetic problems of clustering, i.e., clustering problems in which the objects are moving. The study focused on the unidimensional case, where the objects are points moving on the real line. Several variants of this case have been discussed. Regarding the movement, we consider the case where each point moves at a constant velocity in a given time interval, the case where the points move arbitrarily and we only know their positions in discrete time instants, the case where the points move at a random velocity in which only the expected value of the velocity is known, and the case where, given a partition of the time interval, the points move at constant velocities in each sub-interval. Regarding the kind of clustering sought, we focused in the case where the number of clusters is part of the input of the problem and we consider different measures of quality for the clustering. Two of them are traditional measures for clustering problems: the sum of the cluster diameters and the maximum diameter of a cluster. The third measure considered takes into account the kinetic characteristic of the problem, and allows, in a controlled manner, that a cluster change along time. For each of the variants of the problem, we present algorithms, exact or approximation, some obtained complexity results, and open questions.
|
325 |
Multi-Purpose Boundary-Based Clustering on Proximity Graphs for Geographical Data MiningLee, Ickjai Lee January 2002 (has links)
With the growth of geo-referenced data and the sophistication and complexity of spatial databases, data mining and knowledge discovery techniques become essential tools for successful analysis of large spatial datasets. Spatial clustering is fundamental and central to geographical data mining. It partitions a dataset into smaller homogeneous groups due to spatial proximity. Resulting groups represent geographically interesting patterns of concentrations for which further investigations should be undertaken to find possible causal factors. In this thesis, we propose a spatial-dominant generalization approach that mines multivariate causal associations among geographical data layers using clustering analysis. First, we propose a generic framework of multi-purpose exploratory spatial clustering in the form of the Template-Method Pattern. Based on an object-oriented framework, we design and implement an automatic multi-purpose exploratory spatial clustering tool. The first instance of this framework uses the Delaunay diagram as an underlying proximity graph. Our spatial clustering incorporates the peculiar characteristics of spatial data that make space special. Thus, our method is able to identify high-quality spatial clusters including clusters of arbitrary shapes, clusters of heterogeneous densities, clusters of different sizes, closely located high-density clusters, clusters connected by multiple chains, sparse clusters near to high-density clusters and clusters containing clusters within O(n log n) time. It derives values for parameters from data and thus maximizes user-friendliness. Therefore, our approach minimizes user-oriented bias and constraints that hinder exploratory data analysis and geographical data mining. Sheer volume of spatial data stored in spatial databases is not the only concern. The heterogeneity of datasets is a common issue in data-rich environments, but left open by exploratory tools. Our spatial clustering extends to the Minkowski metric in the absence or presence of obstacles to deal with situations where interactions between spatial objects are not adequately modeled by the Euclidean distance. The genericity is such that our clustering methodology extends to various spatial proximity graphs beyond the default Delaunay diagram. We also investigate an extension of our clustering to higher-dimensional datasets that robustly identify higher-dimensional clusters within O(n log n) time. The versatility of our clustering is further illustrated with its deployment to multi-level clustering. We develop a multi-level clustering method that reveals hierarchical structures hidden in complex datasets within O(n log n) time. We also introduce weighted dendrograms to effectively visualize the cluster hierarchies. Interpretability and usability of clustering results are of great importance. We propose an automatic pattern spotter that reveals high level description of clusters. We develop an effective and efficient cluster polygonization process towards mining causal associations. It automatically approximates shapes of clusters and robustly reveals asymmetric causal associations among data layers. Since it does not require domain-specific concept hierarchies, its applicability is enhanced. / PhD Doctorate
|
326 |
Filtering Social Tags for Songs based on Lyrics using Clustering MethodsChawla, Rahul 21 July 2011 (has links)
In the field of Music Data Mining, Mood and Topic information has been considered as a high level metadata. The extraction of mood and topic information is difficult but is regarded as very valuable. The immense growth of Web 2.0 resulted in Social Tags being a direct interaction with users (humans) and their feedback through tags can help in classification and retrieval of music. One of the major shortcomings of the approaches that have been employed so far is the improper filtering of social tags. This thesis delves into the topic of information extraction from songs’ tags and lyrics. The main focus is on removing all erroneous and unwanted tags with help of other features. The hierarchical clustering method is applied to create clusters of tags. The clusters are based on semantic information any given pair of tags share. The lyrics features are utilized by employing CLOPE clustering method to form lyrics clusters, and Naïve Bayes method to compute probability values that aid in classification process. The outputs from classification are finally used to estimate the accuracy of a tag belonging to the song. The results obtained from the experiments all point towards the success of the method proposed and can be utilized by other research projects in the similar field.
|
327 |
An Empirical Study On Fuzzy C-means Clustering For Turkish Banking SystemAltinel, Fatih 01 September 2012 (has links) (PDF)
Banking sector is very sensitive to macroeconomic and political instabilities and they are prone to crises. Since banks are integrated with almost all of the economic agents and with other banks, these crises affect entire societies. Therefore, classification or rating of banks with respect to their credibility becomes important. In this study we examine different models for classification of banks. Choosing one of those models, fuzzy c-means clustering, banks are grouped into clusters using 48 different ratios which can be classified under capital, assets quality, liquidity, profitability, income-expenditure structure, share in sector, share in group and branch ratios. To determine the inter-dependency between these variables, covariance and correlation between variables is analyzed. Principal component analysis is used to decrease the number of factors. As a a result, the representation space of data has been reduced from 48 variables to a 2 dimensional space. The observation is that 94.54% of total variance is produced by these two factors. Empirical results indicate that as the number of clusters is increased, the number of iterations required for minimizing the objective function fluctuates and is not monotonic. Also, as the number of clusters used increases, initial non-optimized maximum objective function values as well as optimized final minimum objective function values monotonically decrease together. Another observation is that the &lsquo / difference between initial non-optimized and final optimized values of objective function&rsquo / starts to diminish as number of clusters increases.
|
328 |
Flexible Mixed-Effect Modeling of Functional Data, with Applications to Process MonitoringMosesova, Sofia 29 May 2007 (has links)
High levels of automation in manufacturing industries are leading to data sets of increasing
size and dimension. The challenge facing statisticians and field professionals is to develop
methodology to help meet this demand.
Functional data is one example of high-dimensional data characterized by observations
recorded as a function of some continuous measure, such as time. An application considered
in this thesis comes from the automotive industry. It involves a production process in which
valve seats are force-fitted by a ram into cylinder heads of automobile engines. For each
insertion, the force exerted by the ram is automatically recorded every fraction of a second
for about two and a half seconds, generating a force profile. We can think of these profiles
as individual functions of time summarized into collections of curves.
The focus of this thesis is the analysis of functional process data such as the valve seat
insertion example. A number of techniques are set forth. In the first part, two ways to
model a single curve are considered: a b-spline fit via linear regression, and a nonlinear
model based on differential equations. Each of these approaches is incorporated into a
mixed effects model for multiple curves, and multivariate process monitoring techniques
are applied to the predicted random effects in order to identify anomalous curves. In the
second part, a Bayesian hierarchical model is used to cluster low-dimensional summaries
of the curves into meaningful groups. The belief is that the clusters correspond to distinct
types of processes (e.g. various types of “good” or “faulty” assembly). New observations
can be assigned to one of these by calculating the probabilities of belonging to each cluster.
Mahalanobis distances are used to identify new observations not belonging to any of the
existing clusters. Synthetic and real data are used to validate the results.
|
329 |
Computational Complexity Of Bi-clusteringWulff, Sharon Jay January 2008 (has links)
In this work we formalize a new natural objective (or cost) function
for bi-clustering - Monochromatic bi-clustering. Our objective function is
suitable for detecting meaningful homogenous clusters based on
categorical valued input matrices. Such problems have arisen recently in
systems biology where researchers have inferred functional classifications
of biological agents based on their pairwise interactions. We
analyze the computational complexity of the resulting optimization
problems. We show that finding optimal solutions is NP-hard and
complement this result by introducing a polynomial time
approximation algorithm for this bi-clustering task. This is the first positive
approximation guarantee for bi-clustering algorithms. We also show
that bi-clustering with our objective function can be viewed as a
generalization of correlation clustering.
|
330 |
Voting-Based Consensus of Data PartitionsAyad, Hanan 08 1900 (has links)
Over the past few years, there has been a renewed interest in the consensus
problem for ensembles of partitions. Recent work is primarily motivated by the
developments in the area of combining multiple supervised learners. Unlike the
consensus of supervised classifications, the consensus of data partitions is a
challenging problem due to the lack of globally defined cluster labels and to
the inherent difficulty of data clustering as an unsupervised learning problem.
Moreover, the true number of clusters may be unknown. A fundamental goal of
consensus methods for partitions is to obtain an optimal summary of an ensemble
and to discover a cluster structure with accuracy and robustness exceeding those
of the individual ensemble partitions.
The quality of the consensus partitions highly depends on the ensemble
generation mechanism and on the suitability of the consensus method for
combining the generated ensemble. Typically, consensus methods derive an
ensemble representation that is used as the basis for extracting the consensus
partition. Most ensemble representations circumvent the labeling problem. On
the other hand, voting-based methods establish direct parallels with consensus
methods for supervised classifications, by seeking an optimal relabeling of the
ensemble partitions and deriving an ensemble representation consisting of a
central aggregated partition. An important element of the voting-based
aggregation problem is the pairwise relabeling of an ensemble partition with
respect to a representative partition of the ensemble, which is refered to here
as the voting problem. The voting problem is commonly formulated as a weighted
bipartite matching problem.
In this dissertation, a general theoretical framework for the voting problem as
a multi-response regression problem is proposed. The problem is formulated as
seeking to estimate the uncertainties associated with the assignments of the
objects to the representative clusters, given their assignments to the clusters
of an ensemble partition. A new voting scheme, referred to as cumulative voting,
is derived as a special instance of the proposed regression formulation
corresponding to fitting a linear model by least squares estimation. The
proposed formulation reveals the close relationships between the underlying loss
functions of the cumulative voting and bipartite matching schemes. A useful
feature of the proposed framework is that it can be applied to model substantial
variability between partitions, such as a variable number of clusters.
A general aggregation algorithm with variants corresponding to
cumulative voting and bipartite matching is applied and a simulation-based
analysis is presented to compare the suitability of each scheme to different
ensemble generation mechanisms. The bipartite matching is found to be more
suitable than cumulative voting for a particular generation model, whereby each
ensemble partition is generated as a noisy permutation of an underlying
labeling, according to a probability of error. For ensembles with a variable
number of clusters, it is proposed that the aggregated partition be viewed as an
estimated distributional representation of the ensemble, on the basis of which,
a criterion may be defined to seek an optimally compressed consensus partition.
The properties and features of the proposed cumulative voting scheme are
studied. In particular, the relationship between cumulative voting and the
well-known co-association matrix is highlighted. Furthermore, an adaptive
aggregation algorithm that is suited for the cumulative voting scheme is
proposed. The algorithm aims at selecting the initial reference partition and
the aggregation sequence of the ensemble partitions the loss of mutual
information associated with the aggregated partition is minimized. In order to
subsequently extract the final consensus partition, an efficient agglomerative
algorithm is developed. The algorithm merges the aggregated clusters such that
the maximum amount of information is preserved. Furthermore, it allows the
optimal number of consensus clusters to be estimated.
An empirical study using several artificial and real-world datasets demonstrates
that the proposed cumulative voting scheme leads to discovering substantially
more accurate consensus partitions compared to bipartite matching, in the case
of ensembles with a relatively large or a variable number of clusters. Compared
to other recent consensus methods, the proposed method is found to be comparable
with or better than the best performing methods. Moreover, accurate estimates of
the true number of clusters are often achieved using cumulative voting, whereas
consistently poor estimates are achieved based on bipartite matching. The
empirical evidence demonstrates that the bipartite matching scheme is not
suitable for these types of ensembles.
|
Page generated in 0.0638 seconds