381 |
Density Based Clustering using Mutual K-Nearest Neighbors
Dixit, Siddharth January 2015 (has links)
No description available.
|
382 |
Multiclassifier neural networks for handwritten character recognition
Chai, Sin-Kuo January 1995 (has links)
No description available.
|
383 |
Graph Coloring and Clustering Algorithms for Science and Engineering Applications
Bozdag, Doruk January 2008 (has links)
No description available.
|
384 |
Efficient fMRI Analysis and Clustering on GPUs
Talasu, Dharneesh 16 December 2011 (has links)
No description available.
|
385 |
Generating fishing boats behaviour based on historic AIS data : A method to generate maritime trajectories based on historic positional data / Generering av fiskebåtsbeteende baserat på historisk AIS data
Bergman, Oscar January 2022 (has links)
This thesis describes a method to generate new trajectories based on historic position data for a given geographical area. Using AIS data from fishing boats, the thesis first describes a method that applies the DBSCAN and OPTICS algorithms to cluster the data into clusters based on routes where the boats travel and areas where the boats fish. Bayesian optimization is used to search for parameters for the clustering algorithms. In this scenario, DBSCAN was shown to be better in all fields, although OPTICS has the potential to become better with some modification. This is followed by a method for taking the clusters and building a node network that can then be traversed using a pathfinding algorithm combined with internal rules to generate new routes, which can be used in simulations to give a realistic enough situation picture. Finally, a method to evaluate these generated routes is described and used to compare the routes to each other.
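The density-based clustering step described above can be sketched in miniature. The thesis pairs DBSCAN/OPTICS with Bayesian parameter search over real AIS positions; the toy implementation below only illustrates the DBSCAN density idea, and the 2-D coordinates, `eps`, and `min_pts` values are invented for illustration.

```python
import numpy as np

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: grow clusters from core points; -1 marks noise."""
    n = len(points)
    labels = np.full(n, -1)
    dist = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    neighbors = [np.where(dist[i] <= eps)[0] for i in range(n)]
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i] or len(neighbors[i]) < min_pts:
            continue  # skip already-expanded points and non-core points
        visited[i] = True
        labels[i] = cluster
        stack = [i]
        while stack:
            j = stack.pop()
            for k in neighbors[j]:
                if labels[k] == -1:
                    labels[k] = cluster        # border or core point joins
                if not visited[k] and len(neighbors[k]) >= min_pts:
                    visited[k] = True          # only core points keep expanding
                    stack.append(k)
        cluster += 1
    return labels

# Two dense "fishing areas" plus one isolated (noise) position.
pts = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1],
                [5, 5], [5.1, 5], [5, 5.1],
                [10, 10]], dtype=float)
labels = dbscan(pts, eps=0.5, min_pts=3)
```

On this toy data the two dense groups become clusters 0 and 1 and the lone point stays noise; on real AIS data, `eps` and `min_pts` would be the parameters the Bayesian search tunes.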
|
386 |
Learning Statistical and Geometric Models from Microarray Gene Expression Data
Zhu, Yitan 01 October 2009 (has links)
In this dissertation, we propose and develop innovative data modeling and analysis methods for extracting meaningful and specific information about disease mechanisms from microarray gene expression data.
To provide a high-level overview of gene expression data for easy and insightful understanding of data structure, we propose a novel statistical data clustering and visualization algorithm that is comprehensively effective for multiple clustering tasks and that overcomes some major limitations of existing clustering methods. The proposed clustering and visualization algorithm performs progressive, divisive hierarchical clustering and visualization, supported by hierarchical statistical modeling, supervised/unsupervised informative gene/feature selection, supervised/unsupervised data visualization, and user/prior knowledge guidance through human-data interactions, to discover cluster structure within complex, high-dimensional gene expression data.
For the purpose of selecting suitable clustering algorithm(s) for gene expression data analysis, we design an objective and reliable clustering evaluation scheme to assess the performance of clustering algorithms by comparing their sample clustering outcome to phenotype categories. Using the proposed evaluation scheme, we compared the performance of our newly developed clustering algorithm with those of several benchmark clustering methods, and demonstrated the superior and stable performance of the proposed clustering algorithm.
To identify the underlying active biological processes that jointly form the observed biological event, we propose a latent linear mixture model that quantitatively describes how the observed gene expressions are generated by a process of mixing the latent active biological processes. We prove a series of theorems to show the identifiability of the noise-free model. Based on relevant geometric concepts, convex analysis and optimization, gene clustering, and model stability analysis, we develop a robust blind source separation method that fits the model to the gene expression data and subsequently identify the underlying biological processes and their activity levels under different biological conditions.
Based on the experimental results obtained on cancer, muscle regeneration, and muscular dystrophy gene expression data, we believe that the research work presented in this dissertation not only contributes to the engineering research areas of machine learning and pattern recognition, but also provides novel and effective solutions to potentially solve many biomedical research problems, for improving the understanding about disease mechanisms. / Ph. D.
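The latent linear mixture model above posits that observed expressions are generated by mixing latent biological processes. The thesis fits it with a convex-analysis-based blind source separation method; as a stand-in illustration only, the sketch below recovers a nonnegative mixing matrix and sources with Lee-Seung multiplicative NMF updates on synthetic data (all sizes and values are invented).

```python
import numpy as np

rng = np.random.default_rng(0)

def nmf(X, k, iters=500):
    """Factor X (genes x samples) as A @ S with nonnegative A (mixing)
    and S (latent source activity), via multiplicative updates."""
    n, m = X.shape
    A = rng.random((n, k)) + 0.1
    S = rng.random((k, m)) + 0.1
    for _ in range(iters):
        S *= (A.T @ X) / (A.T @ A @ S + 1e-9)   # update source activities
        A *= (X @ S.T) / (A @ S @ S.T + 1e-9)   # update mixing weights
    return A, S

# Synthetic "expression" matrix generated from 2 latent processes.
A_true = rng.random((30, 2))
S_true = rng.random((2, 8))
X = A_true @ S_true
A, S = nmf(X, k=2)
err = float(np.linalg.norm(X - A @ S) / np.linalg.norm(X))
```

Because the synthetic matrix is exactly rank-2 and nonnegative, the factorization reconstructs it closely; the thesis's actual method additionally uses geometric identifiability results and stability analysis that this sketch omits.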
|
387 |
AN EMPIRICAL STUDY OF AN INNOVATIVE CLUSTERING APPROACH TOWARDS EFFICIENT BIG DATA ANALYSIS
Bowers, Jacob Robert 01 May 2024 (has links) (PDF)
The dramatic growth of big data presents formidable challenges for traditional clustering methodologies, which often prove unwieldy and computationally expensive when processing vast quantities of data. This study explores a novel clustering approach exemplified by Sow & Grow, a density-based clustering algorithm akin to DBSCAN developed to address the issues inherent to big data by enabling end-users to strategically allocate computational resources toward regions of noted interest. Achieved through a unique procedure of seeding points and subsequently fostering their growth into coherent clusters, this method significantly reduces computational waste by ignoring insignificant segments of the dataset and provides information relevant to the end user. The implementation of this algorithm developed as part of this research showcases promising results in various experimental settings, exhibiting notable speedup over conventional clustering methods. Additionally, the incorporation of dynamic load balancing further enhances the algorithm's performance, ensuring optimal resource utilization across parallel processing threads when handling superclusters or unbalanced data distributions. Through a detailed study of the theoretical underpinnings of this innovative clustering approach and the limitations of traditional clustering techniques, this research demonstrates the practical utility of the Sow & Grow algorithm in expediting the clustering processes while providing results pertinent to end users.
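The abstract describes seeding points of interest and growing them into clusters while ignoring the rest of the dataset. The actual Sow & Grow algorithm (including its parallel load balancing) is not detailed here, so the sketch below is only a conceptual seed-and-grow illustration on invented 2-D data, with a hypothetical `eps` radius.

```python
import numpy as np

def sow_and_grow(points, seeds, eps):
    """Grow clusters only around user-chosen seed indices; points outside
    the grown regions are never expanded (label -1), saving work on
    uninteresting parts of the dataset."""
    labels = np.full(len(points), -1)
    for cluster, seed in enumerate(seeds):
        if labels[seed] != -1:          # seed fell inside an earlier cluster
            continue
        labels[seed] = cluster
        stack = [seed]
        while stack:
            i = stack.pop()
            # claim unlabeled eps-neighbors and keep growing from them
            near = np.where(np.linalg.norm(points - points[i], axis=1) <= eps)[0]
            for j in near:
                if labels[j] == -1:
                    labels[j] = cluster
                    stack.append(j)
    return labels

# Two dense regions, but only one is "sown" with a seed.
pts = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1],
                [5, 5], [5.1, 5], [5, 5.1]], dtype=float)
labels = sow_and_grow(pts, seeds=[0], eps=0.5)
```

Only the seeded region is clustered; the second dense region stays untouched, which is the resource-allocation idea the abstract highlights.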
|
388 |
Decision Making System Algorithm On Menopause Data Set
Bacak, Hikmet Ozge 01 September 2007 (has links) (PDF)
A multiple-centered clustering method and a decision-making system algorithm for a menopause data set, based on multiple-centered clustering, are described in this study. The method consists of two stages. At the first stage, the fuzzy C-means (FCM) clustering algorithm is applied to the data set under consideration with a high number of cluster centers. As the output of FCM, cluster centers and membership function values for each data member are calculated. At the second stage, the original cluster centers obtained in the first stage are merged until the desired number of clusters is reached. The merging process relies upon a "similarity measure" between clusters defined in the thesis. During the merging process, the cluster center coordinates do not change, but the data members in these clusters are merged into a new cluster. As the output of this method, therefore, one obtains clusters which include many cluster centers.
In the final part of this study, as an application of the clustering algorithms, including the multiple-centered clustering method, a decision-making system is constructed using a special data set on menopause treatment. The decisions are based on the clusterings created by the algorithms discussed in the previous chapters of the thesis. A verification of the decision-making system / decision-aid system was done by a team of experts from the Department of Obstetrics and Gynecology of Hacettepe University under the guidance of Prof. Sinan Beksaç.
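The second-stage merging can be sketched as follows. The thesis defines its own similarity measure between clusters; the sketch below substitutes plain single-linkage distance between center groups (an assumption, not the thesis's measure) and greedily merges the closest pair until the desired cluster count is reached, leaving center coordinates unchanged as the abstract specifies.

```python
import numpy as np

def merge_centers(centers, target_k):
    """Greedily merge the two most similar groups of cluster centers
    (minimum pairwise center distance stands in for the thesis's
    similarity measure) until target_k clusters remain. Coordinates
    are never moved; only group membership changes."""
    groups = [[i] for i in range(len(centers))]
    while len(groups) > target_k:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                d = min(np.linalg.norm(centers[i] - centers[j])
                        for i in groups[a] for j in groups[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        groups[a] += groups.pop(b)      # merge group b into group a
    return groups

# Four FCM centers: two tight pairs that should collapse to 2 clusters.
centers = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
groups = merge_centers(centers, target_k=2)
```

The result is two clusters, each containing two of the original centers, mirroring the abstract's point that final clusters may include many cluster centers.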
|
389 |
Analysis of the human corneal shape with machine learning
Bouazizi, Hala 01 1900 (has links)
This thesis aims to investigate the best conditions in which the anterior corneal surface of normal corneas can be preprocessed, classified and predicted using geometric modeling (GM) and machine learning (ML) techniques. The focus is on the anterior corneal surface, which is mainly responsible for the refractive power of the cornea.
Dealing with preprocessing, the first study (Chapter 2) examines the conditions in which GM can best be applied to reduce the dimensionality of a dataset of corneal surfaces to be used in ML projects. Four types of geometric models of corneal shape were tested regarding their accuracy and processing time: two polynomial (P) models – Zernike polynomial (ZP) and spherical harmonic polynomial (SHP) models – and two corresponding rational function (R) models – Zernike rational function (ZR) and spherical harmonic rational function (SHR) models. SHP and ZR are both known to be more accurate than ZP as corneal shape models for the same number of coefficients, but which type of model is the most accurate between SHP and ZR? And is an SHR model, which is both an SH model and an R model, even more accurate? Also, does modeling accuracy come at the cost of processing time, an important issue for testing large datasets as required in ML projects? Focusing on low J values (number of model coefficients) to address these issues in consideration of the dimensionality constraints that apply in ML tasks, it was found, based on a number of evaluation tools, that SH models were more accurate than their Z counterparts, that R models were more accurate than their P counterparts, and that the SH advantage was more important than the R advantage. Processing-time curves as a function of J showed that P models were processed in quasi-linear time, R models in polynomial time, and that Z models were faster than SH models. Therefore, while SHR was the most accurate geometric model, it was the slowest (a problem that can partly be remedied by applying a preoptimization procedure). ZP was the fastest model, and with normal corneas it remains an interesting option for testing and development, especially for clustering tasks due to its transparent interpretability. The best compromise between accuracy and speed for ML preprocessing is SHP.
The classification of corneal shapes with clinical parameters has a long tradition, but the visualization of their effects on the corneal shape with group maps (average elevation maps, standard deviation maps, average difference maps, etc.) is relatively recent. In the second study (Chapter 3), we constructed an atlas of average elevation maps for different clinical variables (including geometric, refraction and demographic variables) that can be instrumental in the evaluation of ML task inputs (datasets) and outputs (predictions, clusters, etc.). A large dataset of normal adult anterior corneal surface topographies recorded in the form of 101×101 elevation matrices was first preprocessed by geometric modeling to reduce the dimensionality of the dataset to a small number of Zernike coefficients found to be optimal for ML tasks. The modeled corneal surfaces of the dataset were then grouped in accordance with the clinical variables available in the dataset, transformed into categorical variables. An average elevation map was constructed for each group of corneal surfaces of each clinical variable, in their natural (non-normalized) state and in their normalized state, by averaging their modeling coefficients to get an average surface and by representing this average surface in reference to the best-fit sphere in a topographic elevation map. To validate the atlas thus constructed in both its natural and normalized modalities, ANOVA tests were conducted for each clinical variable of the dataset to verify their statistical consistency with the literature before verifying whether the corneal shape transformations displayed in the maps were themselves visually consistent. This was the case. The possible uses of such an atlas are discussed.
The third study (Chapter 4) is concerned with the use of a dataset of geometrically modeled corneal surfaces in an ML task of clustering. The unsupervised classification of corneal surfaces is recent in ophthalmology. Most of the few existing studies on corneal clustering resort to feature extraction (as opposed to geometric modeling) to achieve the dimensionality reduction of the dataset. The goal is usually to automate the process of corneal diagnosis, for instance by distinguishing irregular corneal surfaces (keratoconus, Fuchs, etc.) from normal surfaces and, in some cases, by classifying irregular surfaces into subtypes. Complementary to these corneal clustering studies, the proposed study resorts mainly to geometric modeling to achieve dimensionality reduction and focuses on normal adult corneas in an attempt to identify their natural groupings, possibly in combination with feature extraction methods. Geometric modeling was based on Zernike polynomials, known for their interpretative transparency and sufficiently accurate for normal corneas. Different types of clustering methods were evaluated in pretests to identify the most effective at producing neatly delimited clusters that are clearly interpretable. Their evaluation was based on clustering scores (to identify the best number of clusters), polar charts and scatter plots (to visualize the modeling coefficients involved in each cluster), average elevation maps and average profile cuts (to visualize the average corneal surface of each cluster), and statistical cluster comparisons on different clinical parameters (to validate the findings in reference to the clinical literature). K-means, applied to geometrically modeled surfaces without feature extraction, produced the best clusters, both for natural and normalized surfaces. While the clusters produced with natural corneal surfaces were based on the corneal curvature, those produced with normalized surfaces were based on the corneal axis. In each case, the best number of clusters was four. The importance of curvature and axis as grouping criteria in corneal data distribution is discussed.
The fourth study presented in this thesis (Chapter 5) explores the ML paradigm to verify whether accurate predictions of normal corneal shapes can be made from clinical data, and how. The database of normal adult corneal surfaces was first preprocessed by geometric modeling to reduce its dimensionality into short vectors of 12 to 20 Zernike coefficients, found to be in the range of appropriate numbers to achieve optimal predictions. The nonlinear regression methods examined from the scikit-learn library were gradient boosting, Gaussian process, kernel ridge, random forest, k-nearest neighbors, bagging, and multilayer perceptron. The predictors were based on the clinical variables available in the database, including geometric variables (best-fit sphere radius, white-to-white diameter, anterior chamber depth, corneal side), refraction variables (sphere, cylinder, axis) and demographic variables (age, gender). Each possible combination of regression method, set of clinical variables (used as predictors) and number of Zernike coefficients (used as targets) defined a regression model in a prediction test. All the regression models were evaluated based on their mean RMSE score (establishing the distance between the predicted corneal surfaces and the raw topographic true surfaces). The best model identified was further qualitatively assessed based on an atlas of predicted and true average elevation maps by which the predicted surfaces could be visually compared to the true surfaces on each of the clinical variables used as predictors. It was found that the best regression model was gradient boosting using all available clinical variables as predictors and 16 Zernike coefficients as targets. The most explicative predictor was the best-fit sphere radius, followed by the side and refractive variables. The average elevation maps of the true anterior corneal surfaces and the predicted surfaces based on this model were remarkably similar for each clinical variable.
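The prediction setup of the fourth study (clinical predictors, Zernike-coefficient targets, gradient boosting, RMSE evaluation) can be sketched as follows. The data here are entirely synthetic stand-ins for the thesis's corneal database, and the linear ground truth is invented purely so the example runs end to end.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(1)

# Hypothetical stand-ins: 200 "corneas" with 8 clinical predictors
# (e.g. best-fit sphere radius, white-to-white diameter, chamber depth,
# side, sphere, cylinder, axis, age) and 16 "Zernike coefficient" targets.
X = rng.normal(size=(200, 8))
W = rng.normal(size=(8, 16))
Y = X @ W + 0.01 * rng.normal(size=(200, 16))

# GradientBoostingRegressor is single-output, so one booster is fitted
# per Zernike coefficient via MultiOutputRegressor.
model = MultiOutputRegressor(GradientBoostingRegressor(random_state=0))
model.fit(X[:150], Y[:150])

pred = model.predict(X[150:])
rmse = float(np.sqrt(np.mean((pred - Y[150:]) ** 2)))
```

In the thesis, the RMSE is instead computed between reconstructed corneal surfaces and raw topographies, and the model search additionally sweeps predictor subsets and target vector lengths (12 to 20 coefficients).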
|
390 |
Visual Analysis of High-Dimensional Point Clouds using Topological Abstraction
Oesterling, Patrick 17 May 2016 (has links) (PDF)
This thesis is about visualizing a kind of data that is trivial to process by computers but difficult to imagine by humans because nature does not allow for intuition with this type of information: high-dimensional data. Such data often result from representing observations of objects under various aspects or with different properties. In many applications, a typical, laborious task is to find related objects or to group those that are similar to each other. One classic solution for this task is to imagine the data as vectors in a Euclidean space with object variables as dimensions. Utilizing Euclidean distance as a measure of similarity, objects with similar properties and values accumulate to groups, so-called clusters, that are exposed by cluster analysis on the high-dimensional point cloud. Because similar vectors can be thought of as objects that are alike in terms of their attributes, the point cloud's structure and individual cluster properties, like their size or compactness, summarize data categories and their relative importance. The contribution of this thesis is a novel analysis approach for visual exploration of high-dimensional point clouds without suffering from structural occlusion. The work is based on implementing two key concepts: The first idea is to discard those geometric properties that cannot be preserved and, thus, lead to the typical artifacts. Topological concepts are used instead to shift away the focus from a point-centered view on the data to a more structure-centered perspective. The advantage is that topology-driven clustering information can be extracted in the data's original domain and be preserved without loss in low dimensions. The second idea is to split the analysis into a topology-based global overview and a subsequent geometric local refinement.
The occlusion-free overview enables the analyst to identify features and to link them to other visualizations that permit analysis of those properties not captured by the topological abstraction, e.g. cluster shape or value distributions in particular dimensions or subspaces. The advantage of separating structure from data point analysis is that restricting local analysis only to data subsets significantly reduces artifacts and the visual complexity of standard techniques. That is, the additional topological layer enables the analyst to identify structure that was hidden before and to focus on particular features by suppressing irrelevant points during local feature analysis. This thesis addresses the topology-based visual analysis of high-dimensional point clouds for both the time-invariant and the time-varying case. Time-invariant means that the points do not change in their number or positions; that is, the analyst explores the clustering of a fixed and constant set of points. The extension to the time-varying case implies the analysis of a varying clustering, where clusters appear as new, merge or split, or vanish. Especially for high-dimensional data, both tracking (relating features over time) and visualizing changing structure are difficult problems to solve.
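The topology-driven overview described above builds on structure that survives dimensionality reduction, such as the connectivity of clusters. As a deliberately simplified illustration (not the thesis's actual topological abstraction), the sketch below extracts the connected components of a point cloud at a given distance scale via a single-linkage filtration with union-find, on invented data and with a hypothetical threshold.

```python
import numpy as np

def components_at_scale(points, threshold):
    """Merge points along pairwise distances in increasing order
    (a single-linkage filtration); components still separate at
    `threshold` are reported as structural clusters."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    # Edges sorted by length play the role of the filtration.
    edges = sorted((np.linalg.norm(points[i] - points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    for d, i, j in edges:
        if d > threshold:
            break                           # surviving components = clusters
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri
    return [find(i) for i in range(n)]

# Two well-separated groups in (here) 2-D; the same code works in any dimension.
pts = np.array([[0, 0], [0.2, 0], [0.4, 0],
                [5, 0], [5.2, 0]], dtype=float)
roots = components_at_scale(pts, threshold=1.0)
```

This component structure is purely combinatorial, so, as the abstract argues for topological information generally, it can be preserved without loss when drawn in low dimensions.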
|