251 |
Nuevo Método de Clustering Basado en Programación Genética y Teoría de la Información
Boric Bargetto, Neven Tomislav January 2009
No description available.
|
252 |
Machine learning for text categorization: Experiments using clustering and classification
Bikki, Poojitha January 1900
Master of Science / Department of Computer Science / William H. Hsu / This work describes a comparative study of empirical methods for categorizing news articles within text corpora: unsupervised learning for an unlabeled corpus of text documents and supervised learning for a hand-labeled corpus. The goal of text categorization is to organize natural language (i.e., human language) documents into categories that are either predefined or inherently grouped by similar meaning. The first approach, automatic classification of texts, is useful when handling massive amounts of data and has many applications, such as automated indexing of scientific articles, spam filtering, and classification of news articles. Classification using supervised or semi-supervised inductive learning involves labeled data, which can be expensive to acquire and may require a semantically deep understanding of the texts. The second approach falls under the general rubric of document clustering, based on the statistical distribution and co-occurrence of words in a full-text document. Developing a full pipeline for document categorization draws on methods from information retrieval (IR), natural language processing (NLP), and machine learning (ML).
In this project, experiments are conducted on two text corpora: news aggregator data, which contains news headlines collected from a web aggregator, and a news data set consisting of original news articles from the British Broadcasting Corporation (BBC). First, the training data is developed from these corpora. Next, common types of supervised classifiers, such as linear, Bayesian, and ensemble models and support vector machines (SVMs), are trained on the labelled data, and the trained classification models are used to predict the category of an article given the related text. The results are analyzed and compared to determine the best-performing model. Then, two unsupervised learning techniques, k-means and Latent Dirichlet Allocation (LDA), are applied to obtain clusters of data points. k-means separates the documents into disjoint clusters of similar news, while LDA, which treats documents as mixtures of topics, is used to find latent topics in the text. Finally, visualizations of the results are produced for evaluation: to allow qualitative assessment of cluster separation in the unsupervised case, or to understand the confusion matrix for the supervised classification task through heat map visualization, alongside precision, recall, and other holistic metrics. From an application standpoint, the unsupervised techniques can be used to find news items that are similar in content and can be categorized under a specific topic.
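As a rough illustration of the supervised half of this pipeline, the sketch below trains a small multinomial naive Bayes classifier on headlines and derives precision and recall from a confusion matrix. It is not the thesis's code: the class and function names, the whitespace tokenizer, and the toy headlines are all assumptions made for this example, and a real experiment would use a proper feature extractor and a held-out test set.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a real pipeline would do more."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each known word.
            score = math.log(self.class_counts[label] / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def confusion_matrix(true_labels, predicted_labels):
    """Count (true, predicted) label pairs."""
    return Counter(zip(true_labels, predicted_labels))


def precision_recall(matrix, label):
    """Precision and recall for one class, read off the confusion matrix."""
    tp = matrix[(label, label)]
    predicted = sum(n for (_, p), n in matrix.items() if p == label)
    actual = sum(n for (t, _), n in matrix.items() if t == label)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# Toy headlines, invented for illustration only.
TRAIN = [
    ("stocks fall as markets slump", "business"),
    ("bank profits rise on strong markets", "business"),
    ("team wins cup final", "sport"),
    ("striker scores in cup match", "sport"),
]

texts, labels = zip(*TRAIN)
model = NaiveBayes().fit(texts, labels)
```

Calling `model.predict("markets rise")` classifies a new headline, and `precision_recall(confusion_matrix(truth, predictions), "sport")` gives the per-class figures that a heat map of the confusion matrix would summarize visually.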
|
253 |
Modelo de representación de la demanda por bloques para la planificación de la transmisión
Muñoz Tapia, Juan Eduardo January 2007
No description available.
|
254 |
Galaxy Cluster Detection using Nonparametric Maximum Likelihood Estimation of Features in Voronoi Tessellations
Pizarro Pizarro, Daniel Iván January 2007
No description available.
|
255 |
La visualisation d’information à l’ère du Big Data : résoudre les problèmes de scalabilité par l’abstraction multi-échelle / Information Visualization in the Big Data era: tackling scalability issues using multiscale abstractions
Perrot, Alexandre 27 November 2017
L’augmentation de la quantité de données à visualiser due au phénomène du Big Data entraîne de nouveaux défis pour le domaine de la visualisation d’information. D’une part, la quantité d’information à représenter dépasse l’espace disponible à l’écran, entraînant de l’occlusion. D’autre part, ces données ne peuvent pas être stockées et traitées sur une machine conventionnelle. Un système de visualisation de données massives doit permettre la scalabilité de perception et de performances. Dans cette thèse, nous proposons une solution à ces deux problèmes au travers de l’abstraction multi-échelle des données. Plusieurs niveaux de détail sont précalculés sur une infrastructure Big Data pour permettre de visualiser de grands jeux de données jusqu’à plusieurs milliards de points. Pour cela, nous proposons deux approches pour implémenter l’algorithme de canopy clustering sur une plateforme de calcul distribué. Nous présentons une application de notre méthode à des données géolocalisées représentées sous forme de carte de chaleur, ainsi qu’à des grands graphes. Ces deux applications sont réalisées à l’aide de la bibliothèque de visualisation dynamique Fatum, également présentée dans cette thèse. / With the advent of the Big Data era come new challenges for Information Visualization. First, the amount of data to be visualized exceeds the available screen space. Second, the data cannot be stored and processed on a conventional computer. To alleviate both of these problems, a Big Data visualization system must provide perceptual and performance scalability. In this thesis, we propose to use multi-scale abstractions as a solution to both of these issues. Several levels of detail can be precomputed using a Big Data Infrastructure in order to visualize big datasets up to several billion points. For that, we propose two approaches to implementing the canopy clustering algorithm for a distributed computation cluster. 
We present applications of our method to geolocalized data visualized through a heatmap, and to big graphs. Both of these applications use the dynamic visualization library Fatum, which is also presented in this thesis.
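For readers unfamiliar with canopy clustering, the following is a minimal single-machine sketch of the general technique, not either of the distributed implementations proposed in the thesis. The function name, the Euclidean distance, and the threshold values are assumptions chosen for illustration.

```python
import math


def canopy_clustering(points, t1, t2):
    """Group points into possibly overlapping canopies.

    t1 is the loose threshold and t2 the tight one (t1 > t2).
    Returns a list of (center, members) pairs.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining[0]
        # Every point within the loose threshold joins this canopy,
        # so a point may belong to more than one canopy.
        members = [p for p in points if math.dist(center, p) < t1]
        canopies.append((center, members))
        # Points within the tight threshold can no longer start a canopy.
        remaining = [p for p in remaining if math.dist(center, p) >= t2]
    return canopies


# Toy 2-D points, invented for illustration.
POINTS = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (5, 5)]

fine = canopy_clustering(POINTS, t1=3, t2=1.5)
coarse = canopy_clustering(POINTS, t1=20, t2=15)
```

Running the same routine with larger thresholds yields fewer, broader canopies, which is one simple way to picture how several levels of detail could be precomputed for a multi-scale view.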
|
256 |
Facing the real challenges in wireless sensor network-based applications: an adaptative cross-layer self-organization WSN protocol / Se confronter aux exigences des applications à base de réseaux de capteurs en environnement réel : une approche cross-layer adaptative et auto-configurante
Guzzo, Natale 15 December 2015
Le réseau de capteurs sans fil (WSN) est un des protagonistes contribuant à l’évolution et au développement de l’Internet des objets (IoT). Plusieurs cas d’usage peuvent être trouvés dans les différents domaines comme l’industrie du transport maritime où le fret conteneurisé compte environ pour 60% du commerce mondial. Dans ce contexte, la société TRAXENS a développé un dispositif radio alimenté par batterie appelé TRAX-BOX et conçu pour être fixé aux containeurs dans l’objectif de les traquer et les surveiller tout au long de la chaine logistique. Dans cette thèse, je vais présenter une nouvelle pile protocolaire WSN appelée TRAX-NET et conçue pour permettre aux TRAX-BOX de s’auto-organiser dans un réseau sans fil et coopérer pour délivrer les données acquises au serveur TRAXENS d’une façon énergiquement efficiente. Les résultats des simulations et des tests sur le terrain montrent que TRAX-NET est bien optimisé pour les différents scenarios pour lesquels il a été développé et satisfait les exigences de l’application concernée mieux que les autres solutions étudiées dans la littérature. TRAX-NET est une solution complète et adaptée au suivi des conteneurs de fret de par le monde. / Wireless sensor networks (WSNs) are among the main contributors to the evolution and development of the Internet of Things (IoT). Use cases can be found today in many fields of modern technology, including the container shipping industry, where containerized cargo accounts for about 60 percent of all world seaborne trade. In this context, TRAXENS developed a battery-powered device named TRAX-BOX, designed to be attached to freight containers in order to track and monitor shipped goods along the whole supply chain.
In this thesis, we present a new energy-efficient self-organizing WSN protocol stack named TRAX-NET, designed to allow the TRAX-BOX devices to cooperate to deliver the sensed data to the TRAXENS platform. The results of simulations and field tests show that TRAX-NET performs well in the different scenarios in which it is supposed to operate and fulfils the requirements of the assumed application better than existing schemes.
|
257 |
Learning and identification of fuzzy systems
Lee, Shin-Jye January 2011
This thesis concerns the learning and identification of fuzzy systems: learning fuzzy systems from data for regression and function approximation by constructing complete, compact, and consistent fuzzy systems. Fuzzy systems are widely used for pattern recognition and function approximation because they represent knowledge well. As fuzzy systems have developed, many sophisticated methods based on them have tried to solve these problems by constructing a great diversity of mathematical models. In general, however, the interpretability of a fuzzy system conflicts with the accuracy of its approximation, so finding the best compromise between the two across the entire system is an important research question.
The first part of this research concerns the clustering technique used to construct fuzzy models in fuzzy system identification, as part of clustering-based learning of fuzzy systems. Determining the proper number of clusters and their appropriate locations is a primary consideration in constructing an effective fuzzy model, so the clustering technique aims to identify both as accurately as possible, which prepares the ground for model construction. To obtain mutually exclusive performance when constructing effective fuzzy models, a modular method for fuzzy system identification based on a hybrid clustering technique is considered. For these reasons, this work introduces a hybrid clustering algorithm that takes input, output, generalization, and specialization into account. Its primary advantage is that the proposed clustering technique integrates a variety of clustering properties to identify the proper number and location of clusters by accurately recognizing the position of each dataset, which makes the resulting fuzzy systems more complete.
The second part extends the first in two ways: a pruning strategy that simplifies the structure of the fuzzy system, and an optimization scheme for its parameters. The pruning strategy refines the rule base through similarity analysis of fuzzy sets, fuzzy numbers, fuzzy membership functions, or fuzzy rules; through this analysis, the complete rules can be kept and redundant rules in the rule base can be reduced. The optimization scheme can be regarded as a two-layer parameter optimization, because the parameters of the initial fuzzy model are fine-tuned in two phases, layer by layer. The extended work thus focuses on improving the initial fuzzy models so that the final fuzzy models are more reliable. Its primary advantages are the simplification of the fuzzy rule base by the similarity-based pruning strategy and the greater accuracy obtained from the two-layer optimization scheme, which together make fuzzy systems more compact and precise. A perfect modular method for fuzzy system identification should, in addition to solving pattern recognition and function approximation problems, offer good interpretability, low dimensionality, high reliability, robustness, high approximation accuracy, low computational cost, and maximum performance. Meeting all of these conditions is extremely difficult. The research in this thesis therefore tries to present a modular method for fuzzy system identification that satisfies as many of these requirements as possible.
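The similarity-based pruning idea in this abstract can be pictured with a small sketch. The abstract does not say which similarity measure or merging rule the thesis uses, so everything here is an assumption: triangular membership functions, a Jaccard-style similarity estimated on a sampled grid, a greedy keep-or-drop rule, and the 0.8 threshold are illustrative choices only.

```python
def triangular(a, b, c):
    """Return a triangular membership function with feet a, c and peak b.

    Assumes a < b < c.
    """
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)
    return mu


def similarity(mu1, mu2, lo, hi, steps=1000):
    """Jaccard-style similarity of two fuzzy sets, sampled on a regular grid."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    inter = sum(min(mu1(x), mu2(x)) for x in xs)
    union = sum(max(mu1(x), mu2(x)) for x in xs)
    return inter / union if union else 1.0


def prune(fuzzy_sets, lo, hi, threshold=0.8):
    """Keep a set only if it is not too similar to one already kept."""
    kept = []
    for name, mu in fuzzy_sets:
        if all(similarity(mu, other, lo, hi) < threshold for _, other in kept):
            kept.append((name, mu))
    return [name for name, _ in kept]


# Hypothetical fuzzy sets over the interval [0, 10].
SETS = [
    ("low", triangular(0, 2, 4)),
    ("low_duplicate", triangular(0.1, 2.1, 4.1)),
    ("high", triangular(6, 8, 10)),
]

kept_names = prune(SETS, 0, 10)
```

Here the near-duplicate of `low` is dropped while `high` survives, which mirrors the goal of keeping the complete rules and removing redundant ones.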
|
258 |
Optimization Frameworks for Graph Clustering
Luke N Veldt (6636218) 15 May 2019
In graph theory and network analysis, communities or clusters are sets of nodes in a graph that share many internal connections with each other, but are only sparsely connected to nodes outside the set. Graph clustering, the computational task of detecting these communities, has been studied extensively due to its widespread applications and its theoretical richness as a mathematical problem. This thesis presents novel optimization tools for addressing two major challenges associated with graph clustering.
The first major challenge is that there already exists a plethora of algorithms and objective functions for graph clustering. The relationship between different methods is often unclear, and it can be very difficult to determine in practice which approach is the best to use for a specific application. To address this challenge, we introduce a generalized discrete optimization framework for graph clustering called LambdaCC, which relies on a single tunable parameter. The value of this parameter controls the balance between the internal density and external sparsity of clusters that are formed by optimizing an underlying objective function. LambdaCC unifies the landscape of graph clustering techniques, as a large number of previously developed approaches can be recovered as special cases for a fixed value of the LambdaCC input parameter.
The second major challenge of graph clustering is the computational intractability of detecting the best way to cluster a graph with respect to a given NP-hard objective function. To address this intractability, we present new optimization tools and results which apply to LambdaCC as well as a broader class of graph clustering problems. In particular, we develop polynomial-time approximation algorithms for LambdaCC and other more generalized clustering objectives. Notably, we show how to obtain a polynomial-time 2-approximation for cluster deletion, which improves upon the previous best approximation factor of 3. We also present a new optimization framework for solving convex relaxations of NP-hard graph clustering problems, which are frequently used in the design of approximation algorithms. Finally, we develop a new framework for efficiently setting tunable parameters for graph clustering objective functions, so that practitioners can work with graph clustering techniques that are especially well suited to their application.
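To make the role of the single tunable parameter concrete, the sketch below scores a clustering under one common statement of the LambdaCC objective. The abstract does not spell the objective out, so the cost model is an assumption for this example: each edge split across clusters costs 1 − λ and each non-adjacent pair placed in the same cluster costs λ. The function name and the toy graph are likewise hypothetical.

```python
from itertools import combinations


def lambdacc_cost(nodes, edges, clusters, lam):
    """Disagreement cost of a clustering under the assumed objective.

    An edge split across clusters costs (1 - lam); a non-adjacent pair
    placed in the same cluster costs lam.
    """
    edge_set = {frozenset(e) for e in edges}
    label = {v: i for i, cluster in enumerate(clusters) for v in cluster}
    cost = 0.0
    for u, v in combinations(nodes, 2):
        together = label[u] == label[v]
        if frozenset((u, v)) in edge_set:
            if not together:
                cost += 1 - lam
        elif together:
            cost += lam
    return cost


# Toy graph: two triangles joined by a single edge (c, d).
NODES = ["a", "b", "c", "d", "e", "f"]
EDGES = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]

ONE_CLUSTER = [set(NODES)]
TWO_TRIANGLES = [{"a", "b", "c"}, {"d", "e", "f"}]
```

On this toy graph the two-triangle clustering is cheaper at λ = 0.5 (cost 0.5 against 4.0), whereas at λ = 0.05 the single cluster wins (0.4 against 0.95), showing how the parameter shifts the balance between internal density and external sparsity.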
|
259 |
Enhancing preprocessing and clustering of single-cell RNA sequencing data
Wang, Zhe 04 October 2021
Single-cell RNA sequencing (scRNA-seq) is the leading technique for characterizing cellular heterogeneity in biological samples. Various scRNA-seq protocols have been developed that can measure the transcriptome from thousands of cells in a single experiment. With these methods readily available, the ability to transform raw data into biological understanding of complex systems is now a rate-limiting step. In this dissertation, I introduce novel computational software and tools which enhance preprocessing and clustering of scRNA-seq data and evaluate their performance compared to existing methods.
First, I present scruff, an R/Bioconductor package that preprocesses data generated from scRNA-seq protocols including CEL-Seq or CEL-Seq2 and reports comprehensive data quality metrics and visualizations. scruff rapidly demultiplexes, aligns, and counts the reads mapped to genomic features with deduplication of unique molecular identifier (UMI) tags and provides novel and extensive functions to visualize both pre- and post-alignment data quality metrics for cells from multiple experiments.
Second, I present Celda, a novel Bayesian hierarchical model that can perform simultaneous co-clustering of genes into transcriptional modules and cells into subpopulations for scRNA-seq data. Celda identified novel cell subpopulations in a publicly available peripheral blood mononuclear cell (PBMC) dataset and outperformed a PCA-based approach for gene clustering on simulated data.
Third, I extend the application of Celda by developing a multimodal clustering method that utilizes both mRNA and protein expression information generated from single-cell sequencing datasets with multiple modalities, and demonstrate that Celda multimodal clustering captured meaningful biological patterns which are missed by transcriptome- or protein-only clustering methods.
Collectively, this work addresses limitations present in the computational analyses of scRNA-seq data by providing novel methods and solutions that enhance scRNA-seq data preprocessing and clustering.
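The counting step with UMI deduplication mentioned for scruff can be illustrated in a few lines. This is not scruff's interface (scruff is an R/Bioconductor package); it is a hypothetical Python sketch that assumes reads have already been demultiplexed and aligned, and that two reads are duplicates only when their UMIs match exactly.

```python
from collections import defaultdict


def count_umis(reads):
    """Count distinct UMIs per (cell barcode, gene).

    `reads` is an iterable of (cell_barcode, umi, gene) tuples for reads
    that were already demultiplexed and aligned.
    """
    seen = defaultdict(set)
    for cell, umi, gene in reads:
        # Reads sharing cell, gene, and UMI collapse to a single molecule.
        seen[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}


# Invented reads: the first two are PCR duplicates of one molecule.
READS = [
    ("cell1", "AACG", "GeneA"),
    ("cell1", "AACG", "GeneA"),
    ("cell1", "TTGC", "GeneA"),
    ("cell2", "AACG", "GeneA"),
    ("cell2", "AACG", "GeneB"),
]

counts = count_umis(READS)
```

The resulting dictionary is a sparse form of the gene-by-cell count matrix that downstream clustering methods take as input.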
|
260 |
Graph clustering as a method to investigate riboswitch variation
Crum, Matthew January 2021
Thesis advisor: Michelle M. Meyer / Non-coding RNAs (ncRNAs) perform vital functions in cells, but the impact of diversity across the structure and function of homologous motifs has yet to be fully investigated. One reason is that the standard phylogenetic analysis used to address these questions in proteins cannot easily be applied to ncRNAs because of their inherent characteristics. Compared to proteins, ncRNAs have shorter sequences, lower sequence conservation, and secondary structures that need to be incorporated into the analysis. This has necessitated an effort to develop methodology for investigating the evolutionary and functional relationships between sets of ncRNAs. In this pursuit, I studied closely related riboswitches. Riboswitches are structured ncRNAs found in bacterial mRNA that regulate gene expression using their two major components: the aptamer and the expression platform. The aptamer of a riboswitch binds a specific small molecule (ligand), and the bound or unbound state of the aptamer influences conformational changes in the expression platform that can lead to increased or decreased downstream gene expression. Using sequence and structural similarity metrics combined with graph clustering and de novo community detection algorithms, I have developed a methodology for investigating the functional and evolutionary relationships between closely related riboswitches, and other ncRNAs by extension, that are found across a range of diverse phyla. / Thesis (PhD), Boston College, 2021. / Submitted to: Boston College. Graduate School of Arts and Sciences. / Discipline: Biology.
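The general shape of a similarity-graph approach can be sketched as follows. None of this is the thesis's method: the k-mer Jaccard score stands in for its sequence and structural similarity metrics (structure is ignored here), connected components stand in for a real community detection algorithm, and the sequences, names, and 0.5 threshold are invented for illustration.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k=3):
    """Fraction of k-mers shared by two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    return len(ka & kb) / len(ka | kb) if ka | kb else 0.0


def similarity_graph(seqs, threshold):
    """Adjacency sets linking sequences whose similarity meets the threshold."""
    graph = {name: set() for name in seqs}
    for u, v in combinations(seqs, 2):
        if jaccard(seqs[u], seqs[v]) >= threshold:
            graph[u].add(v)
            graph[v].add(u)
    return graph


def connected_components(graph):
    """Simplest possible grouping: nodes reachable from one another."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


# Invented RNA fragments, not real riboswitch sequences.
SEQS = {
    "r1": "GGCAUCGAUCC",
    "r2": "GGCAUCGAUCG",
    "r3": "AAUUUAAUUUA",
}

clusters = connected_components(similarity_graph(SEQS, threshold=0.5))
```

Swapping in a structure-aware similarity score and a modularity-based community detection routine would bring the sketch closer to the kind of analysis the abstract describes.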
|