81 |
Clustering redshift : une nouvelle fenêtre sur l'univers / Clustering redshifts : a new window through the Universe
Scottez, Vivien 21 September 2015 (has links)
Les principaux objectifs de cette thèse sont de valider, consolider et développer une nouvelle méthode permettant de mesurer la distribution en redshift d'un échantillon de galaxies. Là où les méthodes actuelles - redshifts spectroscopiques et photométriques - sont toutes liées à l'étude de la distribution d'énergie spectrale des sources extragalactiques, l'approche ici présentée repose sur les propriétés d'agrégation des galaxies entre elles. En effet l'agrégation (clustering en anglais) des galaxies due à la gravité leur confère une distribution spatiale - et angulaire - particulière. La méthode des clustering redshifts utilise cette propriété particulière d'agrégation entre une population de galaxies dont le redshift est inconnu et un échantillon d'objets de référence afin de déprojeter l'information et de reconstruire la distribution en redshift de la population inconnue. On peut s'attendre à ce que les systématiques de cette approche soient différentes de celles des méthodes existantes qui elles s'intéressent à la distribution spectrale d'énergie (SED) des galaxies. Ce type d'approche répond à un réel besoin de la part de la communauté scientifique dans le cadre des grands projets d'observations tels que la mission Euclid de l'Agence Spatiale Européenne (ESA). Après avoir situé le contexte scientifique général et avoir mis en évidence le rôle crucial de la mesure des distances en astronomie, je présente les outils statistiques généralement utilisés dans le cadre de l'étude de la répartition de la matière dans l'Univers ainsi que leur modification afin de pouvoir mesurer des distributions en redshift. Après avoir validé cette approche sur un type d'objets extragalactiques particuliers, j'ai ensuite étendu son application à l'ensemble des galaxies existantes. J'ai ensuite exploré la précision et les systématiques affectant ces mesures dans un cas idéal. Puis, je m'en suis éloigné de façon à me trouver en situation réelle. J'ai également poussé plus loin cette analyse et montré que les objets de référence utilisés lors de la mesure n'ont pas besoin de constituer un échantillon dont la magnitude limite est représentative de la population de redshift inconnu. Cette propriété constitue un avantage considérable pour l'utilisation de cette approche dans le cadre des futurs grands projets observationnels comme la mission spatiale Euclid. Pour finir, je résume mes principaux résultats et présente certains de mes futurs projets. / The main goals of this thesis are to validate, consolidate and develop a new method to measure the redshift distribution of a sample of galaxies. Where current methods - spectroscopic and photometric redshifts - rely on the study of the spectral energy distribution of extragalactic sources, the approach presented here is based on the clustering properties of galaxies. Indeed, the clustering of galaxies caused by gravity gives them a particular spatial - and angular - distribution. In this clustering redshift approach, we use this property between a galaxy sample of unknown redshift and a reference galaxy sample to reconstruct the redshift distribution of the unknown population. Thus, possible systematics in this approach should be independent of those existing in other methods. This new method responds to a real need from the scientific community in the context of large dark-energy imaging experiments such as the Euclid mission of the European Space Agency (ESA).
After introducing the general scientific context and highlighting the crucial role of distance measurements in astronomy, I present the statistical tools generally used to study the large-scale structure of the Universe, as well as their modification to infer redshift distributions. After validating this approach on a particular type of extragalactic object, I generalized its application to all types of galaxies. I then explored the precision and some systematic effects of the measurement in an ideal case study, before moving to a realistic case study. I also pushed this analysis further and found that the reference sample used in the measurement does not need to have the same limiting magnitude as the population of unknown redshift. This property is a great advantage for the use of this approach in the context of large dark-energy imaging experiments like the Euclid space mission. Finally, I summarize my main results and present some of my future projects.
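To make the deprojection idea concrete, here is a minimal sketch, assuming a flat-sky approximation, a crude excess-pair-count estimator and no correction for galaxy-bias evolution; the function names and the normalisation are illustrative choices, not the estimator actually developed in the thesis.

```python
import numpy as np
from scipy.spatial import cKDTree

def excess_pair_fraction(unknown_xy, ref_xy, random_xy, r_max):
    # Crude clustering amplitude: pairs between the unknown and reference
    # samples within r_max, normalised by pairs with an unclustered random
    # catalogue covering the same footprint.
    tree_u = cKDTree(unknown_xy)
    dd = cKDTree(ref_xy).count_neighbors(tree_u, r_max)
    dr = cKDTree(random_xy).count_neighbors(tree_u, r_max)
    return (dd / max(dr, 1)) * (len(random_xy) / max(len(ref_xy), 1)) - 1.0

def clustering_dndz(unknown_xy, ref_xy, ref_z, random_xy, z_edges, r_max=0.05):
    # Estimate the redshift distribution of the unknown sample from its
    # angular cross-clustering with reference objects in narrow z slices.
    amplitudes = []
    for z_lo, z_hi in zip(z_edges[:-1], z_edges[1:]):
        in_slice = (ref_z >= z_lo) & (ref_z < z_hi)
        amplitudes.append(
            excess_pair_fraction(unknown_xy, ref_xy[in_slice], random_xy, r_max))
    amplitudes = np.clip(np.array(amplitudes), 0.0, None)  # clip noisy negatives
    norm = np.sum(amplitudes * np.diff(z_edges))
    return amplitudes / norm if norm > 0 else amplitudes
```

In practice a proper pair-count estimator with angular weighting would be used, and the redshift evolution of the galaxy bias would have to be accounted for; the sketch only shows why the cross-correlation amplitude per redshift slice traces the unknown redshift distribution.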
|
82 |
Clustering Methods and Their Applications to Adolescent Healthcare Data
Mayer-Jochimsen, Morgan 01 January 2013 (has links)
Clustering is a mathematical method of data analysis that identifies trends in data by efficiently separating it into a specified number of clusters, making it useful and widely applicable to questions about the interrelatedness of data. Two methods of clustering are considered here. K-means clustering defines clusters in relation to the centroid, or center, of a cluster. Spectral clustering establishes connections between all of the data points to be clustered, then eliminates those connections that link dissimilar points. This is represented as an eigenvector problem whose solution is given by the eigenvectors of the normalized graph Laplacian. Spectral clustering establishes groups so that the similarity between points of the same cluster is stronger than the similarity between different clusters. K-means and spectral clustering are used to analyze adolescent data from the 2009 California Health Interview Survey. Differences were observed between the results of the clustering methods on 3294 individuals and 22 health-related attributes. K-means clustered the adolescents by exercise, poverty, and variables related to psychological health, while the spectral clustering groups were informed by smoking, alcohol use, low exercise, psychological distress, low parental involvement, and poverty. We posit some guesses as to this difference, observe characteristics of the clustering methods, and comment on the viability of spectral clustering on healthcare data.
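The spectral clustering pipeline described above can be written compactly. The sketch below assumes numeric features, an RBF (Gaussian) similarity, and a nearest-neighbour rule for dropping weak connections; these are choices made for illustration, not details taken from the thesis.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

def spectral_clustering(X, n_clusters, gamma=1.0, n_neighbors=10):
    # 1. Connect every pair of points with an RBF similarity, then eliminate
    #    weak connections by keeping only each point's nearest neighbours.
    W = rbf_kernel(X, gamma=gamma)
    np.fill_diagonal(W, 0.0)
    keep = np.argsort(W, axis=1)[:, -n_neighbors:]
    mask = np.zeros_like(W, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    W = np.where(mask | mask.T, W, 0.0)          # symmetric sparsified graph
    # 2. Normalised graph Laplacian: L = I - D^{-1/2} W D^{-1/2}
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # 3. Embed each point with the eigenvectors of the smallest eigenvalues,
    #    then group the embedded points with k-means.
    _, vecs = eigh(L, subset_by_index=[0, n_clusters - 1])
    rows = vecs / np.maximum(np.linalg.norm(vecs, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(rows)
```

For survey data such as the California Health Interview Survey, the attributes are largely categorical, so an appropriate similarity measure would replace the RBF kernel in step 1.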
|
83 |
Development of quantitative methods for the following of tumoral angiogenesis with dynamic contrast-enhanced ultrasound / Développement de methodes quantitatives pour le suivi de l'angiogenese tumorale par échographie de contraste ultrasonore
Barrois, Guillaume 30 September 2014 (has links)
L'objectif de ce travail était de développer des méthodes pour permettre une évaluation in vivo plus robuste du réseau vasculaire dans la tumeur par imagerie de contraste ultrasonore. Trois aspects de l'analyse de données ont été abordés : 1) la régression des modèles paramétriques de flux sur les données de puissance linéaire, 2) la compensation du mouvement, 3) l'évaluation d'une méthode de clustering pour identifier les hétérogénéités dans les tumeurs. Un modèle multiplicatif est proposé pour décrire le signal DCE-US. Une méthode de régression en est dérivée. La caractérisation du signal permet la mise au point d'une méthode de simulation de séquences 2D+T. La méthode de régression permet une diminution de la variabilité des paramètres de flux fonctionnels extraits, sur données simulées et expérimentales. La méthode de simulation est appliquée pour évaluer une méthode combinant estimation du mouvement et estimation des paramètres micro-vasculaires dans un unique problème mathématique d'optimisation. Cette nouvelle méthode présente en plus l'avantage d'être indépendante de l'opérateur. Il est montré que dans une large majorité des cas l'estimation du mouvement est meilleure avec la nouvelle méthode qu'avec une méthode de référence. Une méthode de clustering est adaptée et évaluée sur données DCE-US simulées et in vivo. Elle permet de détecter des hétérogénéités dans la structure vasculaire des tumeurs. Les méthodes développées permettent d'améliorer l'évaluation du réseau microvasculaire par DCE-US grâce à une description rigoureuse du signal, à la mise au point d'outils diminuant l'intervention de l'opérateur et à la prise en compte de l'hétérogénéité du réseau vasculaire. / This work aimed to develop methods to robustly evaluate in vivo functional flow within the tumor vascular network with dynamic contrast-enhanced ultrasound (DCE-US). Three aspects of data analysis were addressed: 1) ensuring the best fit between parametric flow models and the experimentally acquired echo-power curves, 2) compensating sequences for motion, and 3) evaluating a method to discriminate between tissues with different functional flow. A multiplicative model is proposed to describe the DCE-US signal. Based on this model, a new parametric regression method of the signal is derived. Characterization of the statistical properties of the noise and signal is also used to develop a new method simulating contrast-enhanced ultrasound 2D+t sequences. A significant decrease in the variability of the functional flow parameters extracted according to the new multiplicative-noise fitting method is demonstrated using both simulated and experimentally acquired sequences. The new sequence simulations are applied to test a method combining motion estimation and flow-parameter estimation within a single mathematical framework. Because this new method does not require the selection of a reference image, it reduces operator intervention. Tests of the method on both simulations and clinical data demonstrate, in a majority of sequences, more accurate motion estimation than the commonly used image registration method. Finally, a non-parametric method for perfusion-curve clustering is evaluated on 2D+t sequences. The aim of this method is to group similar filling patterns without a priori knowledge about the patterns. The method is tested on simulated and on pre-clinical data.
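As a rough illustration of fitting a parametric flow model under a multiplicative-noise assumption, here is a minimal sketch; the log-normal bolus model, the log-domain least-squares trick and all parameter names are generic choices made for this example, not the specific signal model or regression scheme developed in the thesis.

```python
import numpy as np
from scipy.optimize import curve_fit

def lognormal_bolus(t, auc, mu, sigma, t0):
    # Log-normal time-intensity curve, a common parametric bolus model.
    ts = np.maximum(t - t0, 1e-9)
    curve = auc / (ts * sigma * np.sqrt(2 * np.pi)) * \
        np.exp(-(np.log(ts) - mu) ** 2 / (2 * sigma ** 2))
    return np.where(t > t0, curve, 0.0)

def fit_echo_power(t, power, p0=(1.0, 2.0, 0.5, 1.0)):
    # With multiplicative noise, the fluctuations become roughly additive with
    # constant variance after a log transform, so we fit in the log domain
    # instead of doing ordinary least squares on the raw echo power.
    eps = 1e-9
    def log_model(t, auc, mu, sigma, t0):
        return np.log(lognormal_bolus(t, auc, mu, sigma, t0) + eps)
    popt, _ = curve_fit(log_model, t, np.log(power + eps), p0=p0, maxfev=20000)
    return dict(zip(("auc", "mu", "sigma", "t0"), popt))
```

The point of the sketch is only to show why characterizing the noise as multiplicative changes the regression: the fitting criterion is adapted to the noise statistics rather than assuming additive Gaussian noise on the echo-power curve.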
|
84 |
Ensembles des modeles en fMRI : l'apprentissage stable à grande échelle / Ensembles of models in fMRI : stable learning in large-scale settings
Hoyos-Idrobo, Andrés 20 January 2017 (has links)
En imagerie médicale, des collaborations internationales ont lancé l'acquisition de centaines de Terabytes de données - et en particulier de données d'Imagerie par Résonance Magnétique fonctionnelle (IRMf) - pour les mettre à disposition de la communauté scientifique. Extraire de l'information utile de ces données nécessite d'importants prétraitements et des étapes de réduction de bruit. La complexité de ces analyses rend les résultats très sensibles aux paramètres choisis. Le temps de calcul requis augmente plus vite que linéairement : les jeux de données sont si importants qu'ils ne tiennent plus dans le cache, et les architectures de calcul classiques deviennent inefficaces. Pour réduire les temps de calcul, nous avons étudié le feature-grouping comme technique de réduction de dimension. Pour ce faire, nous utilisons des méthodes de clustering. Nous proposons un algorithme de clustering agglomératif en temps linéaire : Recursive Nearest Agglomeration (ReNA). ReNA prévient la création de clusters énormes, qui constitue un défaut des méthodes agglomératives rapides existantes. Nous démontrons empiriquement que cet algorithme de clustering engendre des modèles très précis et rapides, et permet d'analyser de grands jeux de données avec des ressources limitées. En neuroimagerie, l'apprentissage statistique peut servir à étudier l'organisation cognitive du cerveau. Des modèles prédictifs permettent d'identifier les régions du cerveau impliquées dans le traitement cognitif d'un stimulus externe. L'entraînement de ces modèles est un problème de très grande dimension, et il est nécessaire d'introduire un a priori pour obtenir un modèle satisfaisant. Afin de pouvoir traiter de grands jeux de données et d'améliorer la stabilité des résultats, nous proposons de combiner le clustering et l'utilisation d'ensembles de modèles. Nous évaluons la performance empirique de ce procédé à travers de nombreux jeux de données de neuroimagerie. Cette méthode est hautement parallélisable et moins coûteuse que l'état de l'art en temps de calcul. Elle permet, avec moins de données d'entraînement, d'obtenir de meilleures prédictions. Enfin, nous montrons que l'utilisation d'ensembles de modèles améliore la stabilité des cartes de poids résultantes et réduit la variance du score de prédiction. / In medical imaging, collaborative worldwide initiatives have begun the acquisition of hundreds of Terabytes of data that are made available to the scientific community, in particular functional Magnetic Resonance Imaging (fMRI) data. However, this signal requires extensive fitting and noise-reduction steps to extract useful information. The complexity of these analysis pipelines yields results that are highly dependent on the chosen parameters. The computation cost of this data deluge is worse than linear: as datasets no longer fit in cache, standard computational architectures cannot be efficiently used. To speed up the computation time, we considered dimensionality reduction by feature grouping. We use clustering methods to perform this task. We introduce a linear-time agglomerative clustering scheme, Recursive Nearest Agglomeration (ReNA). Unlike existing fast agglomerative schemes, it avoids the creation of giant clusters. We then show empirically how this clustering algorithm yields very fast and accurate models, making it possible to process large datasets on a budget. In neuroimaging, machine learning can be used to understand the cognitive organization of the brain.
The idea is to build predictive models that are used to identify the brain regions involved in the cognitive processing of an external stimulus. However, training such estimators is a high-dimensional problem, and one needs to impose some prior to find a suitable model. To handle large datasets and increase the stability of results, we propose to use ensembles of models in combination with clustering. We study the empirical performance of this pipeline on a large number of brain-imaging datasets. This method is highly parallelizable and has a lower computation time than state-of-the-art methods, and we show that it requires fewer data samples to achieve better prediction accuracy. Finally, we show that ensembles of models improve the stability of the weight maps and reduce the variance of the prediction accuracy.
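A minimal sketch of the clustering-plus-ensembling idea is given below, using scikit-learn's Ward-based FeatureAgglomeration as a stand-in for ReNA; the bootstrap scheme, the logistic-regression decoder, the assumption of binary labels stored in numpy arrays, and all parameter values are illustrative assumptions rather than the thesis's pipeline.

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def clustered_ensemble_predict(X_train, y_train, X_test, n_clusters=200,
                               n_models=20, seed=0):
    # Train an ensemble of linear decoders, each on a feature-grouped
    # (agglomerated) version of a bootstrap resample of the training data,
    # then aggregate their predictions by majority vote.
    rng = np.random.RandomState(seed)
    votes = np.zeros((n_models, len(X_test)))
    for m in range(n_models):
        idx = rng.choice(len(X_train), size=len(X_train), replace=True)
        model = make_pipeline(
            FeatureAgglomeration(n_clusters=n_clusters),  # Ward here, not ReNA
            LogisticRegression(max_iter=1000),
        )
        model.fit(X_train[idx], y_train[idx])
        votes[m] = model.predict(X_test)
    # Majority vote across the perturbed models (binary 0/1 labels assumed)
    return (votes.mean(axis=0) >= 0.5).astype(int)
```

In the thesis, the agglomeration step is the linear-time ReNA scheme rather than Ward, which is what keeps repeating the clustering inside an ensemble affordable at fMRI scale.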
|
85 |
Distributed Hierarchical Clustering
Loganathan, Satish Kumar January 2018 (has links)
No description available.
|
86 |
Scalable Clustering for Immune Repertoire Sequence Analysis
Bhusal, Prem 24 May 2019 (has links)
No description available.
|
87 |
Clustering Multiple Contextually Related Heterogeneous Datasets
Hossain, Mahmood 09 December 2006 (has links)
Traditional clustering is typically based on a single feature set. In some domains, several feature sets may be available to represent the same objects, but it may not be easy to compute a useful and effective integrated feature set. We hypothesize that clustering individual datasets and then combining them using a suitable ensemble algorithm will yield better quality clusters compared to the individual clustering or clustering based on an integrated feature set. We present two classes of algorithms to address the problem of combining the results of clustering obtained from multiple related datasets where the datasets represent identical or overlapping sets of objects but use different feature sets. One class of algorithms was developed for combining hierarchical clusterings generated from multiple datasets and another class of algorithms was developed for combining partitional clusterings generated from multiple datasets. The first class of algorithms, called EPaCH, is based on graph-theoretic principles and uses the association strengths of objects in the individual cluster hierarchies. The second class of algorithms, called CEMENT, uses an EM (Expectation Maximization) approach to progressively refine the individual clusterings until the mutual entropy between them converges toward a maximum. We have applied our methods to the problem of clustering a document collection consisting of journal abstracts from ten different Library of Congress categories. After several natural language preprocessing steps, both syntactic and semantic feature sets were extracted. We present empirical results that include the comparison of our algorithms with several baseline clustering schemes using different cluster validation indices. We also present the results of one-tailed paired T-tests performed on cluster qualities. Our methods are shown to yield higher quality clusters than the baseline clustering schemes that include the clustering based on individual feature sets and clustering based on concatenated feature sets. When the sets of objects represented in two datasets are overlapping but not identical, our algorithms outperform all baseline methods for all indices.
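Neither EPaCH nor CEMENT is spelled out in the abstract, so the sketch below shows a generic way of combining per-feature-set partitional clusterings through a co-association (evidence-accumulation) matrix; it illustrates the general idea of ensemble combination, not the algorithms proposed in the dissertation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans

def consensus_clustering(feature_sets, n_clusters, seed=0):
    # feature_sets: list of arrays, one per feature set, with rows aligned to
    # the same objects. Cluster each feature set separately, accumulate how
    # often two objects share a cluster, and cut a hierarchical clustering of
    # the resulting consensus distances.
    n = len(feature_sets[0])
    coassoc = np.zeros((n, n))
    for k, X in enumerate(feature_sets):
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=seed + k).fit_predict(X)
        coassoc += (labels[:, None] == labels[None, :]).astype(float)
    coassoc /= len(feature_sets)
    dist = 1.0 - coassoc                      # frequently co-clustered -> close
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

A co-association ensemble like this handles identical object sets directly; handling only partially overlapping object sets, as the dissertation does, requires extra bookkeeping over the shared objects.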
|
88 |
Discovering Intrinsic Points of Interest from Spatial Trajectory Data Sources
Piekenbrock, Matthew J. 13 June 2018 (has links)
No description available.
|
89 |
Schemas of Clustering
Tadepalli, Sriram Satish 12 March 2009 (has links)
Data mining techniques, such as clustering, have become a mainstay in many applications such as bioinformatics, geographic information systems, and marketing. Over the last decade, due to new demands posed by these applications, clustering techniques have been significantly adapted and extended. One such extension is the idea of finding clusters in a dataset that preserve information about some auxiliary variable. These approaches guide the clustering algorithms, which are traditionally unsupervised learning techniques, with background knowledge of the auxiliary variable. The auxiliary information could be a prior class label attached to the data samples, or it could be the relations between data samples across different datasets. In this dissertation, we consider the latter problem of simultaneously clustering several vector-valued datasets by taking into account the relationships between the data samples.
We formulate objective functions that can be used to find clusters that are local in each individual dataset and at the same time maximally similar or dissimilar with respect to clusters across datasets. We introduce diverse applications of these clustering algorithms: (1) time series segmentation, (2) reconstructing temporal models from time series segmentations, (3) simultaneously clustering several datasets according to database schemas using a multi-criteria optimization, and (4) clustering datasets with many-to-many relationships between data samples.
For each of the above, we demonstrate applications, including modeling the yeast cell cycle and the yeast metabolic cycle, understanding the temporal relationships between yeast biological processes, and cross-genomic studies involving multiple organisms and multiple stresses. The key contribution is to structure the design of complex clustering algorithms over a database schema in terms of clustering algorithms over the underlying entity sets. / Ph. D.
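The dissertation's objective functions are not given in the abstract; as a rough illustration of what "maximally similar clusters across related datasets" can mean operationally, the sketch below clusters two related datasets independently and scores their agreement over the relation between samples. The choice of k-means, the one-to-one relation format, and the mutual-information agreement measure are all assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def cross_dataset_agreement(X_a, X_b, relation, n_clusters=5, seed=0):
    # Cluster two related datasets independently, then score how well the
    # clusterings agree across the relation between their samples; objectives
    # like the ones described above push this agreement up (or down) while
    # also keeping each clustering faithful to its own dataset.
    labels_a = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=seed).fit_predict(X_a)
    labels_b = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=seed).fit_predict(X_b)
    i, j = relation[:, 0], relation[:, 1]   # pairs (index in X_a, index in X_b)
    return normalized_mutual_info_score(labels_a[i], labels_b[j])
```

The actual algorithms optimize such cross-dataset terms jointly with the within-dataset clustering, rather than scoring agreement after the fact as this sketch does.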
|
90 |
VOLATILITY CLUSTERING USING A HETEROGENEOUS AGENT-BASED MODEL
ARREY-MBI, PASCAL EBOT January 2011 (links)
Volatility clustering is a stylized fact common in finance. Large changes in prices tend to cluster together, and small changes behave likewise. The higher the volatility of a market, the riskier it is said to be, and vice versa. Below, we study volatility clustering using an agent-based model that looks at how agents react to variations in asset prices. The irregular switching of agents between fundamentalist and chartist behaviors generates a time-varying volatility. Switching depends on the performance of the various strategies. The expectations of the excess returns of the agents (fundamentalists and chartists) are heterogeneous.
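A toy sketch of this kind of two-strategy switching model is given below; the discrete-choice switching rule, the error-memory decay, and every parameter value are illustrative assumptions rather than the model calibrated in the thesis.

```python
import numpy as np

def simulate_switching_market(T=2000, beta=3.0, p_star=100.0, g=1.1, v=0.5,
                              sigma_noise=0.5, memory=0.9, seed=0):
    # Fundamentalists expect reversion to the fundamental price p_star;
    # chartists extrapolate the last price move. Agents switch between the
    # two strategies via a discrete-choice rule on recent squared forecast
    # errors, so the strategy mix - and hence the volatility - varies in time.
    rng = np.random.default_rng(seed)
    p = np.full(T, p_star)
    n_fund = np.full(T, 0.5)            # fraction of fundamentalists
    err_f = err_c = 0.0                 # exponentially weighted forecast errors
    for t in range(1, T - 1):
        exp_f = p[t] + v * (p_star - p[t])
        exp_c = p[t] + g * (p[t] - p[t - 1])
        errs = np.array([err_f, err_c])
        w = np.exp(-beta * (errs - errs.min()))   # stabilised discrete choice
        n_fund[t + 1] = w[0] / w.sum()
        target = n_fund[t + 1] * exp_f + (1.0 - n_fund[t + 1]) * exp_c
        p[t + 1] = target + sigma_noise * rng.standard_normal()
        err_f = memory * err_f + (exp_f - p[t + 1]) ** 2
        err_c = memory * err_c + (exp_c - p[t + 1]) ** 2
    returns = np.diff(p) / p[:-1]
    return p, returns, n_fund
```

Plotting the absolute returns from such a simulation should show quiet stretches punctuated by bursts of large moves, which is the clustering behavior the thesis studies.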
|