31 |
Parallelisation of EST clustering
Ranchod, Pravesh 23 March 2006 (has links)
Master of Science - Science / The field of bioinformatics has been developing steadily, with computational problems related to biology taking on increased importance as further advances are sought. The large data sets involved in computational biology have driven a search for good, fast approximations to computationally complex problems. This research aims to improve a method used to discover and understand genes, which are small subsequences of DNA. A difficulty arises because genes contain parts we know to be functional and other parts we assume are non-functional, as their functions have not been determined. Isolating the functional parts requires the use of natural biological processes which perform this separation. However, these processes cannot read long sequences, forcing biologists to break a long sequence into a large number of small sequences and read those instead. This creates the computational difficulty of categorizing the short fragments according to gene membership.
Expressed Sequence Tag Clustering is a technique used to facilitate the identification of expressed genes by grouping together similar fragments with the assumption that they belong to the same gene.
The aim of this research was to investigate the usefulness of distributed memory parallelisation
for the Expressed Sequence Tag Clustering problem. This was investigated empirically,
with a distributed system tested for speed against a sequential one. It was found that distributed memory parallelisation can be very effective in this domain.
The results showed a super-linear speedup for up to 100 processors; higher numbers were not tested but would likely produce further speedups. The system was able to cluster 500,000 ESTs in 641 minutes using 101 processors.
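As a rough illustration of the kind of computation involved (not the system evaluated in the thesis, whose similarity measure and implementation details differ), EST clustering can be viewed as single-linkage grouping over pairwise sequence similarity. In the Python sketch below, a placeholder shared-k-mer score feeds a union-find structure; the quadratic pairwise-comparison loop is the part a distributed-memory implementation would split across processors before merging the partial results.

# Minimal single-linkage EST clustering sketch (illustrative only; the
# similarity measure and thresholds are placeholders, not those used in the thesis).
from itertools import combinations

def kmers(seq, k=8):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similar(a, b, k=8, min_shared=5):
    # Crude proxy for sequence overlap: count shared k-mers.
    return len(kmers(a, k) & kmers(b, k)) >= min_shared

def cluster_ests(ests, k=8, min_shared=5):
    parent = list(range(len(ests)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    def union(i, j):
        parent[find(i)] = find(j)
    # The O(n^2) comparison loop is what a distributed-memory version
    # would partition across processors before merging the results.
    for i, j in combinations(range(len(ests)), 2):
        if similar(ests[i], ests[j], k, min_shared):
            union(i, j)
    clusters = {}
    for i in range(len(ests)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# Usage: the first two fragments overlap, the third does not.
print(cluster_ests(["ACGTACGTACGTACGT", "CGTACGTACGTACGTA", "TTTTGGGGCCCCAAAA"], min_shared=3))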
|
32 |
Apprentissage non supervisé de flux de données massives : application aux Big Data d'assurance / Unsupervised learning of massive data streams: application to Big Data in insurance
Ghesmoune, Mohammed 25 November 2016 (has links)
The research outlined in this thesis concerns the development of approaches based on growing neural gas (GNG) for clustering massive data streams. We propose three extensions of the GNG approach: sequential, distributed and parallel, and hierarchical; as well as a model for scalability using the MapReduce paradigm and the application of this model to cluster a real insurance Big Data set arriving as a stream. We first propose the G-Stream method. G-Stream, as a "sequential" clustering method, is a one-pass data stream clustering algorithm that incrementally discovers clusters of arbitrary shapes without any assumptions on the number of clusters. G-Stream uses an exponential fading function to reduce the impact of old data whose relevance diminishes over time, and the links between nodes (clusters) are also weighted by an exponential function. A reservoir is used to hold temporarily the observations that lie far from the current prototypes, in order to reduce the movements of the nearest nodes towards those observations.
The batchStream algorithm processes the stream in micro-batches (windows of data) and defines a new cost function that takes into account that subsets of observations arrive in discrete batches. The minimization of this function, which leads to a topological clustering, is carried out with the dynamic clusters algorithm in two steps: an assignment step which assigns each observation to a cluster, followed by an optimization step which computes the prototype of each node, with a weighting that penalises old data. A scalable model using MapReduce is then proposed. It consists of decomposing the data stream clustering problem into elementary functions, Map and Reduce: the observations received in each sub-dataset (within a time interval) are processed through deterministic parallel operations to produce either intermediate or final clusters. The proposed model is implemented on the Spark platform. Within the Square Predict project, the batchStream algorithm is validated on the insurance Big Data, and a predictive and analysis system is built by combining the clustering results of batchStream with decision trees; the architecture and these different modules form the computational core of our Big Data project, Square Predict. GH-Stream, our third extension of GNG, addresses both visualization and clustering of massive data streams; it uses a hierarchical and topological structure, consisting of several hierarchical trees representing clusters, for both of these tasks.
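To make the micro-batch idea concrete, the sketch below uses an assumed update rule and decay factor (it is not the exact batchStream formulation): each prototype keeps a decayed weight and a decayed weighted sum, each incoming batch is assigned to the nearest prototypes, and the prototypes are then recomputed, so older batches are progressively penalised.

import numpy as np

# Illustrative micro-batch clustering with exponential fading
# (assumed update rule, not the thesis's exact batchStream algorithm).
def process_batch(batch, prototypes, weights, sums, decay=0.9):
    # Fade the influence of everything seen before this batch.
    weights *= decay
    sums *= decay
    # Assignment step: each observation goes to its nearest prototype.
    d = np.linalg.norm(batch[:, None, :] - prototypes[None, :, :], axis=2)
    assign = d.argmin(axis=1)
    # Optimization step: accumulate and recompute prototypes.
    for k in range(prototypes.shape[0]):
        members = batch[assign == k]
        weights[k] += len(members)
        sums[k] += members.sum(axis=0)
        if weights[k] > 0:
            prototypes[k] = sums[k] / weights[k]
    return prototypes, weights, sums

# Usage: a stream of micro-batches in R^2 with 3 prototypes.
rng = np.random.default_rng(0)
prototypes = rng.normal(size=(3, 2))
weights = np.zeros(3)
sums = np.zeros((3, 2))
for _ in range(10):
    batch = rng.normal(size=(50, 2))
    prototypes, weights, sums = process_batch(batch, prototypes, weights, sums)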
|
33 |
Energy-efficient routing protocols for heterogeneous wireless sensor networks with smart buildings evacuation
Al-Aboody, Nadia Ali Qassim January 2017 (has links)
The number of devices connected to the Internet is expected to increase exponentially by 2020, gradually migrating the Internet from an Internet of people towards an Internet of Things (IoT). These devices can communicate with each other and exchange information, forming wide Wireless Sensor Networks (WSNs). WSNs are composed mainly of a large number of small battery-powered devices, which makes energy a limited resource. It is therefore essential to use energy-efficient routing protocols for WSNs that are scalable and robust in terms of energy consumption and lifetime. Routing protocols based on clustering can help solve these energy problems: cluster-based routing provides an efficient approach to reducing the energy consumption of sensor nodes and maximizing the network lifetime of WSNs. In this thesis, a single-hop cluster-based network layer routing protocol, referred to as HRHP, is designed. It applies centralized and deterministic approaches to the selection of cluster heads (CHs) in order to offer an improved network lifetime for large-scale, dense WSN deployments. The deterministic approach for selecting CHs is based on the positive selection mechanism of human thymus cells (T-cells). HRHP was tested over six different scenarios with the base station (BS) positioned outside the sensing area, and achieved a maximum average improvement of 78% in terms of lifetime. To further reduce energy consumption in WSNs, a multi-hop algorithm, referred to as MLHP, is proposed for prolonging the network lifetime. In this algorithm, the sensing area is divided into three levels to reduce the communication cost by shortening the transmission distances for both inter-cluster and intra-cluster communication. MLHP was tested over fourteen cases with different heterogeneity factors and area sizes, and achieved a maximum improvement of 80% in terms of lifetime. Finally, a real-time and autonomous emergency evacuation approach, referred to as ARTC-WSN, is proposed; it integrates cloud computing with WSNs in order to improve evacuation accuracy and efficiency for smart buildings. The approach performs localized, autonomous navigation by calculating the best evacuation paths in a distributed manner using two types of sensor nodes (SNs): sensing nodes and decision nodes. ARTC-WSN was tested in five scenarios with different hazard intensities, occupation ratios and exit availability over three different evacuation areas, and achieved an average survival ratio of 98% across the different cases.
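As a rough sketch of centralized, deterministic cluster-head selection of this general kind (the scoring rule below is an illustrative placeholder, not the T-cell-inspired criterion of HRHP), the base station ranks nodes by residual energy and distance, promotes the top fraction to cluster heads, and assigns every other node to its nearest head for single-hop communication:

import math, random

# Illustrative centralized cluster-head selection for a WSN
# (placeholder scoring, not the HRHP/MLHP selection rule).
def select_cluster_heads(nodes, bs, head_fraction=0.05):
    # nodes: list of dicts with 'pos' (x, y) and 'energy'.
    def score(n):
        d_bs = math.dist(n['pos'], bs)
        return n['energy'] / (1.0 + d_bs)   # prefer high energy and short distance to the BS
    ranked = sorted(nodes, key=score, reverse=True)
    n_heads = max(1, int(len(nodes) * head_fraction))
    heads = ranked[:n_heads]
    # Single-hop clustering: each remaining node joins its nearest head.
    clusters = {id(h): [] for h in heads}
    for n in nodes:
        if n in heads:
            continue
        nearest = min(heads, key=lambda h: math.dist(n['pos'], h['pos']))
        clusters[id(nearest)].append(n)
    return heads, clusters

# Usage: 100 random nodes in a 100 m x 100 m field, base station outside the sensing area.
random.seed(1)
nodes = [{'pos': (random.uniform(0, 100), random.uniform(0, 100)),
          'energy': random.uniform(0.5, 2.0)} for _ in range(100)]
heads, clusters = select_cluster_heads(nodes, bs=(50.0, 150.0))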
|
34 |
Clustering for Classification
Evans, Reuben James Emmanuel January 2007 (has links)
Advances in technology have provided industry with an array of devices for collecting data. The frequency and scale of data collection mean that many large datasets are now being generated. To find patterns in these datasets it would be useful to apply modern methods of classification such as support vector machines. Unfortunately these methods are computationally expensive, in fact quadratic in the number of data points, so they cannot be applied directly. This thesis proposes a framework whereby a variety of clustering methods can be used to summarise datasets, that is, to reduce them to smaller but still representative datasets so that these advanced methods can be applied. It compares the results of using this framework against random selection on a large number of classification and regression problems. Results show that the clustered datasets are on average fifty percent smaller than the original datasets without loss of classification accuracy, which is significantly better than random selection. They also show that there is no free lunch: for each dataset it is important to choose a clustering method carefully.
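One possible instantiation of such a framework is sketched below (assuming scikit-learn is available; k-means within each class is only one of the clustering methods that could be plugged in, not the thesis's specific procedure): the expensive classifier is trained on cluster centroids rather than on the full training set.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Illustrative cluster-then-classify pipeline.
def summarise_by_clustering(X, y, reduction=0.5):
    Xs, ys = [], []
    for label in np.unique(y):
        Xc = X[y == label]
        k = max(1, int(len(Xc) * reduction))        # target summary size per class
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Xc)
        Xs.append(km.cluster_centers_)
        ys.append(np.full(k, label))
    return np.vstack(Xs), np.concatenate(ys)

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
X_small, y_small = summarise_by_clustering(X_tr, y_tr, reduction=0.5)
clf = SVC().fit(X_small, y_small)                   # SVM trains on roughly half the data
print("accuracy on held-out data:", clf.score(X_te, y_te))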
|
35 |
Functional Analysis of Real World Truck Fuel Consumption Data
Vogetseder, Georg January 2008 (has links)
This thesis covers the analysis of sparse and irregular fuel consumption data from long-distance haulage articulated trucks. It is shown that this kind of data is hard to analyse with multivariate as well as with functional methods. To be able to analyse the data, Principal Components Analysis through Conditional Expectation (PACE) is used, which enables observations from many trucks to be pooled to compensate for the sparsity of observations and obtain continuous results. The principal component scores generated by PACE can then be used to obtain rough estimates of the trajectories of individual trucks as well as to detect outliers. The data-centric approach of PACE is very useful for enabling functional analysis of sparse and irregular data. Functional analysis is desirable for this data because it sidesteps feature extraction and enables a more natural view of the data.
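For reference, the conditional-expectation step at the core of PACE (in the standard formulation of Yao, Müller and Wang, on which this kind of analysis rests) estimates the k-th principal component score of truck i from its sparse, noisy measurements Y_i as

\hat{\xi}_{ik} = \hat{\lambda}_k \, \hat{\phi}_{ik}^{\top} \, \hat{\Sigma}_{Y_i}^{-1} \left( Y_i - \hat{\mu}_i \right), \qquad \hat{\Sigma}_{Y_i} = \hat{\Sigma}_i + \hat{\sigma}^2 I,

where \hat{\phi}_{ik} stacks the k-th estimated eigenfunction at truck i's observation times, \hat{\mu}_i is the estimated mean at those times, and \hat{\sigma}^2 is the measurement-error variance. Because the mean, covariance and eigenfunctions are estimated by pooling observations across all trucks, the scores remain computable even when each individual truck is observed only sparsely and irregularly.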
|
36 |
Correlations in the Cosmic Far-infrared Background at 250, 350, and 500 μm Reveal Clustering of Star-forming Galaxies
Viero, Marco Paolo 23 February 2011 (has links)
We demonstrate the application of CMB techniques to measure the clustering of infrared-emitting star-forming galaxies. We detect correlations in the cosmic far-infrared background due to the clustering of star-forming galaxies in observations made with the Balloon-borne Large Aperture Submillimeter Telescope, BLAST, at 250, 350, and 500 μm. We perform jackknife and other tests to confirm the reality of the signal. The measured correlations are well fit by a power law over scales of 5–25 arcminutes, with ΔI/I = 15.1 ± 1.7%. We adopt a specific model for submillimeter sources in which the contribution to clustering comes from sources in the redshift ranges 1.3 ≤ z ≤ 2.2, 1.5 ≤ z ≤ 2.7, and 1.7 ≤ z ≤ 3.2, at 250, 350, and 500 μm, respectively. With these distributions, our measurement of the power spectrum, P(kθ), corresponds to linear bias parameters b = 3.8 ± 0.6, 3.9 ± 0.6, and 4.4 ± 0.7, respectively. We further interpret the results in terms of the halo model, and find that at the smaller scales the simplest halo model fails to fit our results. One way to improve the fit is to increase the radius at which dark matter halos are artificially truncated in the model, which is equivalent to having some star-forming galaxies at z ≥ 1 located in the outskirts of groups and clusters. In the context of this model we find that the minimum halo mass required to host a galaxy is log(Mmin/M⊙) = 11.5 (+0.4/−0.1), and we derive effective biases beff = 2.2 ± 0.2, 2.4 ± 0.2, and 2.6 ± 0.2, and effective masses log(Meff/M⊙) = 12.9 ± 0.3, 12.8 ± 0.2, and 12.7 ± 0.2, at 250, 350, and 500 μm, corresponding to spatial correlation lengths of r0 = 4.9, 5.0, and 5.2 ± 0.7 h−1 Mpc, respectively. Finally, we discuss implications for clustering measurement strategies with Herschel and Planck.
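For orientation, the halo-model interpretation used above splits the galaxy power spectrum into one-halo and two-halo terms, with the large-scale (two-halo) term tied to the linear dark matter power spectrum through the bias:

P_{\rm gal}(k,z) = P_{1\rm h}(k,z) + P_{2\rm h}(k,z), \qquad P_{2\rm h}(k,z) \approx b_{\rm eff}^{2}(z)\, P_{\rm lin}(k,z),

so the fitted values b ≈ 3.8–4.4 quoted above quantify how much more strongly these submillimeter sources cluster than the underlying dark matter, while it is the one-halo (small-scale) term that the simplest model fails to reproduce.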
|
37 |
Network Clustering in Vehicular Communication Networks
Li, Weiwei 25 August 2011 (has links)
This thesis proposes a clustering algorithm for vehicular communication networks. A novel clustering metric and an improved clustering framework are introduced. The novel clustering metric, network criticality, is a global metric on undirected graphs which quantifies the robustness of the graph against changes in environmental parameters; point-to-point network criticality is also defined to measure the resistance between different points of a graph. We localize the notion of network criticality to a node of a vehicular network which can potentially be promoted to cluster head. We use this localized notion of node criticality in conjunction with a universal link metric, the Link Expiration Time (LET), to derive a distributed multi-hop clustering algorithm for the vehicular network. Simulation results show that the proposed clustering algorithm forms a more robust cluster structure.
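Point-to-point network criticality is closely related to the effective resistance between two nodes of the graph. A compact way to compute both quantities (a sketch using the standard resistance-distance formulation, whose normalisation may differ from the exact definition used in the thesis) goes through the Moore-Penrose pseudoinverse of the graph Laplacian; lower values indicate a better-connected, more robust graph.

import numpy as np

# Effective-resistance / criticality sketch (standard resistance-distance
# formulation; normalisation may differ from the thesis's definition).
def laplacian(adj):
    return np.diag(adj.sum(axis=1)) - adj

def point_to_point_criticality(adj, i, j):
    Lp = np.linalg.pinv(laplacian(adj))
    return Lp[i, i] + Lp[j, j] - 2 * Lp[i, j]   # effective resistance between i and j

def network_criticality(adj):
    n = adj.shape[0]
    Lp = np.linalg.pinv(laplacian(adj))
    return 2 * n * np.trace(Lp)                 # sum of resistances over all ordered pairs

# Usage: a 4-node ring; removing edges would raise criticality, i.e. reduce robustness.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
print(point_to_point_criticality(A, 0, 2), network_criticality(A))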
|
40 |
Clustering in the Presence of Noise
Haghtalab, Nika 08 August 2013 (has links)
Clustering, which is partitioning data into groups of similar objects, has a wide range of applications. In many cases unstructured data makes up a significant part of the input, and attempting to cluster this part of the data, which can be referred to as noise, can disturb the clustering of the remaining domain points. Despite the practical need for a framework of clustering that allows a portion of the data to remain unclustered, little research has been done so far in that direction. In this thesis, we take a step towards addressing the issue of clustering in the presence of noise in two parts. First, we develop a platform for clustering that has a cluster devoted to the "noise" points. Second, we examine the problem of "robustness" of clustering algorithms to the addition of noise.
In the first part, we develop a formal framework for clustering that has a designated noise cluster. We formalize intuitively desirable input-output properties of clustering algorithms that have a noise cluster. We review some previously known algorithms, introduce new algorithms for this setting, and examine them with respect to the introduced properties.
In the second part, we address the problem of robustness of clustering algorithms to the addition of unstructured data. We propose a simple and efficient method to turn any centroid-based clustering algorithm into a noise-robust one that has a noise cluster. We discuss several rigorous measures of robustness and prove performance guarantees for our method with respect to these measures, under the assumption that the noise-free data satisfies some niceness properties and the noise satisfies some mildness properties. We also prove that more straightforward ways of adding robustness to clustering algorithms fail to achieve the above-mentioned guarantees.
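As a rough illustration of the flavour of such a wrapper (an assumed construction for illustration, not the method developed in the thesis or its guarantees): run any centroid-based algorithm, then peel off points that are far from every centroid into a designated noise cluster.

import numpy as np
from sklearn.cluster import KMeans

# Illustrative noise-cluster wrapper around a centroid-based algorithm
# (assumed construction; the distance cutoff is a placeholder heuristic).
def cluster_with_noise(X, k, noise_quantile=0.95):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    cutoff = np.quantile(dists, noise_quantile)
    labels = km.labels_.copy()
    labels[dists > cutoff] = -1        # -1 marks the designated noise cluster
    return labels, km.cluster_centers_

# Usage: well-clustered data plus scattered unstructured points.
rng = np.random.default_rng(0)
blobs = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in [(0, 0), (5, 5)]])
noise = rng.uniform(-3, 8, size=(20, 2))
labels, centers = cluster_with_noise(np.vstack([blobs, noise]), k=2)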
|