321 |
Parcellisation de la surface corticale basée sur la connectivité : vers une exploration multimodale / Connectivity-based structural parcellation: toward multimodal analysis. Lefranc, Sandrine, 09 September 2015.
L’IRM de diffusion est une modalité d’imagerie médicale qui suscite un intérêt croissant dans la recherche en neuro-imagerie. Elle permet de caractériser in vivo l’organisation neuronale et apporte par conséquent de nouvelles informations sur les fibres de la matière blanche. En outre, il a été montré que chaque région corticale a une signature spécifique pouvant être décrite par des mesures de connectivité. Notre travail de recherche a ainsi porté sur la conception d’une méthode de parcellisation du cortex entier à partir de ces métriques. En se basant sur de précédents travaux du domaine (thèse de P. Roca, 2011), ce travail propose une nouvelle analyse de groupe permettant l’obtention d’une segmentation individuelle ou moyennée sur la population d’étude. Il s’agit d’un problème difficile en raison de la variabilité interindividuelle présente dans les données. La méthode a été testée et évaluée sur les 80 sujets de la base ARCHI. Des aspects multimodaux ont été abordés pour comparer nos parcellisations structurelles avec d’autres parcellisations ou des caractéristiques morphologiques calculées à partir des modalités présentes dans la base de données. Une correspondance avec la variabilité de l’anatomie corticale, ainsi qu’avec des parcellisations de données d’IRM fonctionnelle, a pu être montrée, apportant une première validation neuroscientifique. / Résumé anglais : Diffusion MRI is a medical imaging modality of great interest in neuroimaging research. This modality enables the in vivo characterization of neuronal organization and thus provides information on the white matter fibers. In addition, each cortical region has been shown to have a specific signature, which can be described by connectivity measures. Our research has focused on the design of a whole-cortex parcellation method driven by these metrics. Based on the previous work of P. Roca (2011), a new group analysis is proposed to achieve an individual or population-averaged segmentation. This is a difficult problem due to the interindividual variability present in the data. The method was tested and evaluated on the 80 subjects of the ARCHI database. Multimodal aspects were investigated to compare the proposed structural parcellations with other parcellations or morphological characteristics derived from the modalities present in the database. A connection between the variability of cortical anatomy and parcellations of the functional MRI data was demonstrated, providing a first neuroscientific validation.
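A minimal sketch of the generic idea behind connectivity-driven parcellation, not the group-analysis method developed in the thesis: cortical vertices are grouped according to the similarity of their connectivity profiles. The array shapes, the use of k-means and the number of parcels are assumptions made purely for illustration.

```python
# Illustrative sketch only: cluster cortical vertices by their connectivity
# profiles. The thesis proposes a dedicated group analysis; this shows just the
# generic connectivity-based parcellation idea on synthetic data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
n_vertices, n_targets = 5000, 200                      # mesh vertices, connectivity targets
connectivity = rng.random((n_vertices, n_targets))     # stand-in for tractography-derived counts

profiles = normalize(connectivity, norm="l1", axis=1)  # each row becomes a connectivity signature
parcels = KMeans(n_clusters=60, n_init=10, random_state=0).fit_predict(profiles)
print(np.bincount(parcels)[:10])                       # sizes of the first few parcels
```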
|
322 |
Structured clustering representations and methods. Heilbut, Adrian Mark, 21 June 2016.
Rather than designing focused experiments to test individual hypotheses, scientists now commonly acquire measurements using massively parallel techniques, for post hoc interrogation. The resulting data is both high-dimensional and structured, in that observed variables are grouped and ordered into related subspaces, reflecting both natural physical organization and factorial experimental designs. Such structure encodes critical constraints and clues to interpretation, but typical unsupervised learning methods assume exchangeability and fail to account adequately for the structure of data in a flexible and interpretable way. In this thesis, I develop computational methods for exploratory analysis of structured high-dimensional data, and apply them to study gene expression regulation in Parkinson’s (PD) and Huntington’s diseases (HD).
BOMBASTIC (Block-Organized, Model-Based, Tree-Indexed Clustering) is a methodology to cluster and visualize data organized in pre-specified subspaces, by combining independent clusterings of blocks into hierarchies. BOMBASTIC provides a formal specification of the block-clustering problem and a modular implementation that facilitates integration, visualization, and comparison of diverse datasets and rapid exploration of alternative analyses.
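A minimal sketch of the block-clustering idea only, not the BOMBASTIC implementation (which is model-based and tree-indexed): each pre-specified block of variables is clustered independently and the per-block labels are then combined. The data sizes, block layout and use of Ward hierarchical clustering are assumptions for illustration.

```python
# Minimal illustration of block-organized clustering: cluster each pre-specified
# block of columns independently, then combine the per-block labels per row.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
data = rng.normal(size=(100, 12))                 # e.g. 100 genes x 12 conditions (synthetic)
blocks = {"timecourse_A": slice(0, 6),            # assumed factorial structure of the columns
          "timecourse_B": slice(6, 12)}

block_labels = {}
for name, cols in blocks.items():
    Z = linkage(data[:, cols], method="ward")     # independent hierarchical clustering per block
    block_labels[name] = fcluster(Z, t=3, criterion="maxclust")

# Combine the block-wise labels into a composite cluster id for each gene.
combined = list(zip(*(block_labels[name] for name in blocks)))
print(combined[:5])
```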
These tools, along with standard methods, were applied to study gene expression in mouse models of neurodegenerative diseases, in collaboration with Dr. Myriam Heiman and Dr. Robert Fenster. In PD, I analyzed cell-type-specific expression following levodopa treatment to study mechanisms underlying levodopa-induced dyskinesia (LID). I identified likely regulators of the transcriptional changes leading to LID and implicated signaling pathways amenable to pharmacological modulation (Heiman, Heilbut et al., 2014). In HD, I analyzed multiple mouse models (Kuhn, 2007), cell-type-specific profiles of medium spiny neurons (Fenster, 2011), and an RNA-Seq dataset profiling multiple tissue types over time and across an mHTT allelic series (CHDI, 2015). I found evidence suggesting that altered activity of the PRC2 complex significantly contributes to the transcriptional dysregulation observed in striatal neurons in HD.
|
323 |
Visualização, kernels e subespaços: um estudo prático / Visualization, kernels and subspace: a practical study. Barbosa, Adriano Oliveira, 16 December 2016.
Dados de alta dimensão são tipicamente tratados como pertencentes a um único subespaço do espaço onde estão imersos. Entretanto, dados utilizados em aplicações reais estão usualmente distribuídos entre subespaços independentes e com dimensões distintas. Um objeto de estudo surge a partir dessa afirmação: como essa distribuição em subespaços independentes pode auxiliar tarefas de visualização? Por outro lado, se o dado parece estar embaralhado nesse espaço de alta dimensão, como visualizar seus padrões e realizar tarefas como classificação? Podemos, por exemplo, mapear esse dado num outro espaço utilizando uma função capaz de o desembaralhar, de modo que os padrões intrínsecos fiquem mais claros e, assim, facilitando nossa tarefa de visualização ou classificação. Essa Tese apresenta dois estudos que abordam ambos os problemas. Para o primeiro, utilizamos técnicas de subspace clustering para definir, quando existente, a estrutura de subespaços do dado e estudamos como essa informação pode auxiliar em visualizações utilizando projeções multidimensionais. Para o segundo problema, métodos de kernel, bastante conhecidos na literatura, são as ferramentas a nos auxiliar. Utilizamos a medida de similaridade do kernel para desenvolver uma nova técnica de projeção multidimensional capaz de lidar com dados imersos no espaço de características induzido implicitamente pelo kernel. / High-dimensional data are typically handled as lying in a single subspace of the original space. However, data involved in real applications are usually spread across distinct subspaces which may have different dimensions. We would like to study how the subspace structure information can be used to improve visualization tasks. On the other hand, what if the data are tangled in this high-dimensional space: how can we visualize their patterns or accomplish classification tasks? One could, for example, map the data into another high-dimensional space using a mapping capable of untangling the data and making the patterns clear, rendering the visualization or classification an easy task. This dissertation presents a study of both problems pointed out above. For the former, we use subspace clustering techniques to define, when it exists, a subspace structure, studying how this information can be used to support visualization tasks based on multidimensional projections. For the latter problem, we employ kernel methods, well known in the literature, as a tool to assist visualization tasks. We use a similarity measure given by the kernel to develop a completely new multidimensional projection technique capable of dealing with data embedded in the implicit feature space defined by the kernel.
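The projection technique developed in the dissertation is not reproduced here; as a stand-in, the sketch below uses the standard Kernel PCA to show what it means to project data that live in a kernel-induced feature space. The toy dataset and the RBF kernel parameters are assumptions.

```python
# Sketch of projecting data embedded in a kernel-induced feature space, using
# the off-the-shelf Kernel PCA rather than the dissertation's new technique.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
embedding = KernelPCA(n_components=2, kernel="rbf", gamma=5.0).fit_transform(X)
# The RBF kernel "untangles" the two rings, so a 2-D scatter plot of `embedding`
# separates classes that are not linearly separable in the original space.
print(embedding.shape)  # (400, 2)
```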
|
324 |
Propriedades assintóticas e estimadores consistentes para a probabilidade de clustering / Asymptotic properties and consistent estimators for the clustering probability. Mariana Pereira de Melo, 23 May 2014.
Considere um processo estocástico X_m em tempo discreto definido sobre o alfabeto finito A. Seja x_0^k-1 uma palavra fixa sobre A^k. No estudo das propriedades estatísticas na teoria de recorrência de Poincaré, é clássico o estudo do tempo decorrente até que a sequência fixa x_0^k-1 seja encontrada em uma realização do processo. Tipicamente, esta é uma quantidade exponencialmente grande com relação ao comprimento da palavra. Contrariamente, o primeiro tempo de retorno possível para uma sequência dada está definido como sendo o mínimo entre os tempos de entrada de todas as sequências que começam com a própria palavra e é uma quantidade tipicamente pequena, da ordem do tamanho da palavra. Neste trabalho estudamos o comportamento da probabilidade deste primeiro retorno possível de uma palavra x_0^k-1 dado que o processo começa com ela mesma. Esta quantidade mede a intensidade de que, uma vez observado um conjunto alvo, possam ser observados agrupamentos ou clusters. Provamos que, sob certas condições, a taxa de decaimento exponencial desta probabilidade converge para a entropia para quase toda a sequência quando k diverge. Apresentamos também um estimador desta probabilidade para árvores de contexto e mostramos sua consistência. / Consider a stochastic process X_m in discrete time defined over a finite alphabet A, and let x_0^k-1 be a fixed word over A^k. In the study of the statistical properties of Poincaré recurrence theory, it is classical to study the time elapsed until the fixed sequence x_0^k-1 appears in a given realization of the process. This quantity is known as the hitting time, and it is usually exponentially large with respect to the length of the word. In contrast, the first possible return time of a given word is defined as the minimum among the hitting times of all realizations that begin with the given word x_0^k-1, and it is typically small, of the order of the length of the word. In this work, we study the probability of the first possible return time given that the process begins with the target word. This quantity measures how strongly, once the target set has been observed, it tends to be observed in clusters. We show that, under certain conditions, the exponential decay rate of this probability converges to the entropy for almost every word x_0^k-1 as k diverges. We also present an estimator of this probability for context trees and show its consistency.
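A rough numerical illustration under strong simplifying assumptions (an i.i.d. Bernoulli source and a word sampled from it, rather than the general processes and context-tree estimators treated in the thesis): the first possible return time R(w) is taken as the smallest shift at which the word can overlap itself, and the exponential decay rate of the conditional return probability is compared with the source entropy.

```python
# Rough illustration for an i.i.d. Bernoulli(p) source only: compute the first
# possible return time R(w) of a sampled word w and the exact conditional
# probability of returning at that time, then compare -log(P)/k with the entropy.
import numpy as np

rng = np.random.default_rng(2)
p, k = 0.3, 40
entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))

w = (rng.random(k) < p).astype(int)                      # a "typical" word from the source
R = next(t for t in range(1, k + 1)                      # first shift at which w can overlap itself
         if t == k or np.array_equal(w[t:], w[:k - t]))
# Returning at time R only requires the R fresh symbols to match the tail of w.
prob_symbol = np.where(w[k - R:] == 1, p, 1 - p)
P_return = prob_symbol.prod()

print(f"R(w) = {R}, -log(P)/k = {-np.log(P_return) / k:.3f}, entropy = {entropy:.3f}")
```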
|
325 |
Clustering de trajetórias / Trajectory clustering. Marcio Takashi Iura Oshiro, 16 September 2015.
Esta tese teve como objetivo estudar problemas cinéticos de clustering, ou seja, problemas de clustering nos quais os objetos se movimentam. O trabalho se concentrou no caso unidimensional, em que os objetos são pontos se movendo na reta real. Diversas variantes desse caso foram abordadas. Em termos do movimento, consideramos o caso em que cada ponto se move com uma velocidade constante num dado intervalo de tempo, o caso em que os pontos se movem arbitrariamente e temos apenas as suas posições em instantes discretos de tempo, o caso em que os pontos se movem com uma velocidade aleatória em que se conhece apenas o valor esperado da velocidade, e o caso em que, dada uma partição do intervalo de tempo, os pontos se movem com velocidades constantes em cada subintervalo. Em termos do tipo de clustering buscado, nos concentramos no caso em que o número de clusters é um dado do problema e consideramos diferentes medidas de qualidade para o clustering. Duas delas são tradicionais para problemas de clustering: a soma dos diâmetros dos clusters e o diâmetro máximo de um cluster. A terceira medida considerada leva em conta a característica cinética do problema, e permite, de uma maneira controlada, que o clustering mude com o tempo. Para cada uma das variantes do problema, são apresentados algoritmos, exatos ou de aproximação, alguns resultados de complexidade obtidos, e questões que ficaram em aberto. / This work aimed to study kinetic clustering problems, i.e., clustering problems in which the objects are moving. The study focused on the one-dimensional case, in which the objects are points moving on the real line. Several variants of this case were addressed. Regarding the movement, we consider the case where each point moves at a constant velocity within a given time interval, the case where the points move arbitrarily and we only know their positions at discrete time instants, the case where the points move at a random velocity of which only the expected value is known, and the case where, given a partition of the time interval, the points move at constant velocities in each subinterval. Regarding the kind of clustering sought, we focused on the case where the number of clusters is part of the input, and we considered different measures of clustering quality. Two of them are traditional for clustering problems: the sum of the cluster diameters and the maximum diameter of a cluster. The third measure takes into account the kinetic nature of the problem and allows, in a controlled manner, the clustering to change over time. For each variant of the problem, we present exact or approximation algorithms, some complexity results, and open questions.
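A sketch of the static one-dimensional building block only, not the kinetic algorithms of the thesis: for points on a line, an optimal k-clustering under the sum-of-diameters measure can always use contiguous groups, so it suffices to cut the k-1 largest gaps between consecutive sorted points. The snapshot time and the data below are arbitrary assumptions.

```python
# Static 1-D snapshot of the sum-of-diameters problem: sort the points and cut
# the k-1 largest gaps, which minimizes the sum of cluster diameters.
import numpy as np

def cluster_line(points: np.ndarray, k: int) -> list:
    sorted_pts = np.sort(points)
    gaps = np.diff(sorted_pts)
    cuts = np.sort(np.argsort(gaps)[-(k - 1):]) + 1 if k > 1 else []
    return np.split(sorted_pts, cuts)

# Snapshot of moving points x_i(t) = x_i(0) + v_i * t at time t = 2.0.
x0 = np.array([0.0, 0.1, 5.0, 5.2, 9.9])
v = np.array([0.5, 0.4, -1.0, -1.1, 0.0])
clusters = cluster_line(x0 + 2.0 * v, k=3)
print([c.round(2).tolist() for c in clusters])
```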
|
326 |
Multi-Purpose Boundary-Based Clustering on Proximity Graphs for Geographical Data Mining. Lee, Ickjai Lee, January 2002.
With the growth of geo-referenced data and the sophistication and complexity of spatial databases, data mining and knowledge discovery techniques become essential tools for successful analysis of large spatial datasets. Spatial clustering is fundamental and central to geographical data mining. It partitions a dataset into smaller homogeneous groups according to spatial proximity. The resulting groups represent geographically interesting patterns of concentrations for which further investigations should be undertaken to find possible causal factors. In this thesis, we propose a spatial-dominant generalization approach that mines multivariate causal associations among geographical data layers using clustering analysis. First, we propose a generic framework of multi-purpose exploratory spatial clustering in the form of the Template-Method Pattern. Based on an object-oriented framework, we design and implement an automatic multi-purpose exploratory spatial clustering tool. The first instance of this framework uses the Delaunay diagram as an underlying proximity graph. Our spatial clustering incorporates the peculiar characteristics of spatial data that make space special. Thus, our method is able to identify high-quality spatial clusters, including clusters of arbitrary shapes, clusters of heterogeneous densities, clusters of different sizes, closely located high-density clusters, clusters connected by multiple chains, sparse clusters near to high-density clusters and clusters containing clusters, within O(n log n) time. It derives values for parameters from the data and thus maximizes user-friendliness. Therefore, our approach minimizes user-oriented bias and constraints that hinder exploratory data analysis and geographical data mining. The sheer volume of spatial data stored in spatial databases is not the only concern. The heterogeneity of datasets is a common issue in data-rich environments, but it is left open by exploratory tools. Our spatial clustering extends to the Minkowski metric in the absence or presence of obstacles to deal with situations where interactions between spatial objects are not adequately modeled by the Euclidean distance. The genericity is such that our clustering methodology extends to various spatial proximity graphs beyond the default Delaunay diagram. We also investigate an extension of our clustering to higher-dimensional datasets that robustly identifies higher-dimensional clusters within O(n log n) time. The versatility of our clustering is further illustrated by its deployment to multi-level clustering. We develop a multi-level clustering method that reveals hierarchical structures hidden in complex datasets within O(n log n) time. We also introduce weighted dendrograms to effectively visualize the cluster hierarchies. Interpretability and usability of clustering results are of great importance. We propose an automatic pattern spotter that reveals a high-level description of clusters. We develop an effective and efficient cluster polygonization process towards mining causal associations. It automatically approximates the shapes of clusters and robustly reveals asymmetric causal associations among data layers. Since it does not require domain-specific concept hierarchies, its applicability is enhanced. / PhD Doctorate
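The sketch below illustrates only the general shape of boundary-based clustering on a proximity graph, not the thesis's actual criteria (which are derived from the data and handle heterogeneous densities, obstacles and other metrics): build the Delaunay diagram, drop edges judged too long by a crude global threshold, and read the clusters off as connected components. The synthetic data and the threshold are assumptions.

```python
# Simplified boundary-based clustering on the Delaunay proximity graph:
# remove "long" edges, then take connected components as clusters.
import numpy as np
from scipy.spatial import Delaunay
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(3)
pts = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in ((0, 0), (4, 0), (2, 4))])

tri = Delaunay(pts)
edges = {tuple(sorted((s[i], s[j]))) for s in tri.simplices for i in range(3) for j in range(i + 1, 3)}
edges = np.array(list(edges))
lengths = np.linalg.norm(pts[edges[:, 0]] - pts[edges[:, 1]], axis=1)

keep = edges[lengths < 2.0 * np.median(lengths)]   # crude global edge-removal rule (assumption)
graph = coo_matrix((np.ones(len(keep)), (keep[:, 0], keep[:, 1])), shape=(len(pts), len(pts)))
n_clusters, labels = connected_components(graph, directed=False)
print(n_clusters, np.bincount(labels))
```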
|
327 |
Filtering Social Tags for Songs based on Lyrics using Clustering Methods. Chawla, Rahul, 21 July 2011.
In the field of Music Data Mining, mood and topic information has been considered high-level metadata. The extraction of mood and topic information is difficult but is regarded as very valuable. The immense growth of Web 2.0 resulted in social tags becoming a direct form of interaction with users (humans), and their feedback through tags can help in the classification and retrieval of music. One of the major shortcomings of the approaches that have been employed so far is the improper filtering of social tags. This thesis delves into the topic of information extraction from songs’ tags and lyrics. The main focus is on removing all erroneous and unwanted tags with the help of other features. A hierarchical clustering method is applied to create clusters of tags. The clusters are based on the semantic information any given pair of tags shares. The lyrics features are utilized by employing the CLOPE clustering method to form lyrics clusters, and the Naïve Bayes method to compute probability values that aid in the classification process. The outputs from classification are finally used to estimate the accuracy of a tag's association with the song. The results obtained from the experiments all point towards the success of the proposed method and can be utilized by other research projects in the same field.
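A sketch of the tag-grouping step only, with a made-up song-by-tag matrix; the CLOPE lyrics clustering and the Naïve Bayes scoring stages described above are not reproduced. The co-occurrence-based similarity is an assumption standing in for the semantic information shared by tag pairs.

```python
# Hierarchical clustering of tags from a co-occurrence based similarity.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

tags = ["sad", "melancholy", "happy", "upbeat", "love", "romance"]
# Assumed song-by-tag incidence matrix (1 = tag applied to the song).
X = np.array([[1, 1, 0, 0, 1, 1],
              [1, 1, 0, 0, 0, 0],
              [0, 0, 1, 1, 0, 0],
              [0, 0, 1, 1, 1, 1],
              [0, 0, 0, 0, 1, 1]])

co = X.T @ X                                                 # tag co-occurrence counts
sim = co / np.sqrt(np.outer(co.diagonal(), co.diagonal()))   # cosine-style similarity
dist = 1.0 - sim
np.fill_diagonal(dist, 0.0)

Z = linkage(squareform(dist, checks=False), method="average")
print(dict(zip(tags, fcluster(Z, t=3, criterion="maxclust"))))
```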
|
328 |
An Empirical Study On Fuzzy C-means Clustering For Turkish Banking System. Altinel, Fatih, 01 September 2012.
The banking sector is very sensitive to macroeconomic and political instabilities, and banks are prone to crises. Since banks are integrated with almost all economic agents and with other banks, these crises affect entire societies. Therefore, the classification or rating of banks with respect to their credibility becomes important. In this study we examine different models for the classification of banks. Choosing one of those models, fuzzy c-means clustering, banks are grouped into clusters using 48 different ratios which can be classified under capital, asset quality, liquidity, profitability, income-expenditure structure, share in sector, share in group and branch ratios. To determine the inter-dependency between these variables, the covariance and correlation between variables are analyzed. Principal component analysis is used to decrease the number of factors. As a result, the representation space of the data is reduced from 48 variables to a 2-dimensional space. The observation is that 94.54% of the total variance is explained by these two factors. Empirical results indicate that as the number of clusters is increased, the number of iterations required for minimizing the objective function fluctuates and is not monotonic. Also, as the number of clusters increases, the initial non-optimized maximum objective function values as well as the optimized final minimum objective function values monotonically decrease together. Another observation is that the 'difference between initial non-optimized and final optimized values of the objective function' starts to diminish as the number of clusters increases.
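A minimal re-implementation of the pipeline's shape on synthetic numbers, purely to make the steps concrete: standardize the ratios, reduce them to two principal components, then run a plain fuzzy c-means loop whose objective value is the quantity discussed above. The data, the number of clusters and the fuzzifier m = 2 are assumptions, not values from the study.

```python
# PCA to two components, then a plain fuzzy c-means loop written out in numpy.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
ratios = rng.normal(size=(40, 48))              # 40 banks x 48 financial ratios (synthetic)
X = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(ratios))

def fuzzy_cmeans(X, c=3, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))           # fuzzy memberships, rows sum to 1
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]     # weighted cluster centers
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        objective = (W * d ** 2).sum()                    # the value being minimized
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)          # standard membership update
    return centers, U, objective

centers, U, J = fuzzy_cmeans(X, c=3)
print(J, U[:3].round(2))
```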
|
329 |
Flexible Mixed-Effect Modeling of Functional Data, with Applications to Process Monitoring. Mosesova, Sofia, 29 May 2007.
High levels of automation in manufacturing industries are leading to data sets of increasing size and dimension. The challenge facing statisticians and field professionals is to develop methodology to help meet this demand.

Functional data is one example of high-dimensional data characterized by observations recorded as a function of some continuous measure, such as time. An application considered in this thesis comes from the automotive industry. It involves a production process in which valve seats are force-fitted by a ram into cylinder heads of automobile engines. For each insertion, the force exerted by the ram is automatically recorded every fraction of a second for about two and a half seconds, generating a force profile. We can think of these profiles as individual functions of time summarized into collections of curves.

The focus of this thesis is the analysis of functional process data such as the valve seat insertion example. A number of techniques are set forth. In the first part, two ways to model a single curve are considered: a b-spline fit via linear regression, and a nonlinear model based on differential equations. Each of these approaches is incorporated into a mixed effects model for multiple curves, and multivariate process monitoring techniques are applied to the predicted random effects in order to identify anomalous curves. In the second part, a Bayesian hierarchical model is used to cluster low-dimensional summaries of the curves into meaningful groups. The belief is that the clusters correspond to distinct types of processes (e.g. various types of “good” or “faulty” assembly). New observations can be assigned to one of these by calculating the probabilities of belonging to each cluster. Mahalanobis distances are used to identify new observations not belonging to any of the existing clusters. Synthetic and real data are used to validate the results.
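A small sketch of the first route described above, on simulated force profiles: each curve is summarized by least-squares B-spline coefficients (a simple stand-in for the b-spline regression and mixed-effects fit), and curves whose coefficient vectors have large Mahalanobis distances are flagged. Knot placement, curve shapes and the fault pattern are assumptions.

```python
# Per-curve B-spline summaries plus Mahalanobis-distance monitoring in coefficient space.
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

rng = np.random.default_rng(5)
t = np.linspace(0.0, 2.5, 250)                       # ~2.5 s of ram force readings per insertion
curves = [np.sin(2 * t) + 0.05 * rng.normal(size=t.size) for _ in range(60)]
curves.append(np.sin(2 * t) + 0.8 * np.exp(-((t - 1.5) ** 2) / 0.01))  # one "faulty" insertion

knots = np.linspace(0.3, 2.2, 8)                     # interior knots (arbitrary choice here)
coefs = np.array([LSQUnivariateSpline(t, y, knots, k=3).get_coeffs() for y in curves])

mean = coefs.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(coefs, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", coefs - mean, cov_inv, coefs - mean)   # squared Mahalanobis distances
print(np.argsort(d2)[-3:])                           # the faulty curve (index 60) should rank highest
```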
|
330 |
Computational Complexity Of Bi-clustering. Wulff, Sharon Jay, January 2008.
In this work we formalize a new natural objective (or cost) function for bi-clustering: Monochromatic bi-clustering. Our objective function is suitable for detecting meaningful homogeneous clusters based on categorical valued input matrices. Such problems have arisen recently in systems biology, where researchers have inferred functional classifications of biological agents based on their pairwise interactions. We analyze the computational complexity of the resulting optimization problems. We show that finding optimal solutions is NP-hard and complement this result by introducing a polynomial time approximation algorithm for this bi-clustering task. This is the first positive approximation guarantee for bi-clustering algorithms. We also show that bi-clustering with our objective function can be viewed as a generalization of correlation clustering.
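For concreteness, a small sketch of the kind of objective involved, assuming for this sketch (not quoting the thesis's exact definition) that the cost of a bi-clustering of a binary matrix is the number of entries disagreeing with the majority value of their row-cluster by column-cluster block.

```python
# Evaluate an assumed "monochromatic" cost of a given bi-clustering of a binary matrix.
import numpy as np

def monochromatic_cost(M, row_labels, col_labels):
    cost = 0
    for r in np.unique(row_labels):
        for c in np.unique(col_labels):
            block = M[np.ix_(row_labels == r, col_labels == c)]
            ones = int(block.sum())
            cost += min(ones, block.size - ones)      # minority entries in this block
    return cost

M = np.array([[1, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1]])
rows = np.array([0, 0, 1, 1])
cols = np.array([0, 0, 1, 1])
print(monochromatic_cost(M, rows, cols))   # 2: two entries disagree with their block's majority value
```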
|