31 
Apprentissage non supervisé de flux de données massives : application aux Big Data d'assurance / Unsupervised learning of massive data streams: application to Big Data in insurance
Ghesmoune, Mohammed 25 November 2016
The research presented in this thesis concerns the development of approaches based on growing neural gas (GNG) for clustering massive data streams. We propose three extensions of the GNG approach: sequential, distributed and parallel, and a hierarchical method; as well as a new model for scaling up using the MapReduce paradigm, and the application of this model to on-the-fly clustering of the insurance dataset. We first proposed the GStream method. GStream, as a "sequential" clustering method, incrementally discovers clusters of arbitrary shape in a single pass over the data. GStream uses a forgetting function to reduce the impact of old data whose relevance diminishes over time. The links between nodes (clusters) are also weighted by an exponential function. A data reservoir is also used to hold, temporarily, observations that are very far from the current prototypes. The batchStream algorithm processes the data in micro-batches (data windows) for stream clustering. We defined a new cost function that takes into account the subsets of data arriving in batches. The minimization of the cost function uses the dynamic clusters algorithm (nuées dynamiques) while introducing a weighting that penalizes old data. A new model using the MapReduce paradigm is proposed. The goal of this model is scalability. It consists of decomposing the stream clustering problem into elementary functions (Map and Reduce), so that each subset of data is processed to produce either the intermediate or the final clusters. For the implementation of the proposed model, we used the Spark platform.
Within the Square Predict project, we validated the batchStream algorithm on the insurance data. A predictive model combining the clustering result with decision trees is also presented. The GHStream algorithm is our third extension of GNG, for visualization and clustering of massive data streams. The presented approach is distinctive in using a hierarchical and topological structure, consisting of several hierarchical trees representing clusters, for the clustering and visualization tasks. / The research outlined in this thesis concerns the development of approaches based on growing neural gas (GNG) for clustering of data streams. We propose three algorithmic extensions of the GNG approach: sequential, distributed and parallel, and hierarchical; as well as a model for scalability using MapReduce and its application to learn clusters from the real insurance Big Data in the form of a data stream. We first propose the GStream method. GStream, as a "sequential" clustering method, is a one-pass data stream clustering algorithm that allows us to discover clusters of arbitrary shapes without any assumptions on the number of clusters. GStream uses an exponential fading function to reduce the impact of old data whose relevance diminishes over time. The links between the nodes are also weighted. A reservoir is used to hold distant observations temporarily, in order to reduce the movements of the nodes nearest to those observations. The batchStream algorithm is a micro-batch-based method for clustering data streams which defines a new cost function taking into account that subsets of observations arrive in discrete batches. The minimization of this function, which leads to a topological clustering, is carried out using dynamic clusters in two steps: an assignment step which assigns each observation to a cluster, followed by an optimization step which computes the prototype for each node.
A scalable model using MapReduce is then proposed. It consists of decomposing the data stream clustering problem into the elementary functions, Map and Reduce. The observations received in each sub-dataset (within a time interval) are processed through deterministic parallel operations (Map and Reduce) to produce the intermediate states or the final clusters. The batchStream algorithm is validated on the insurance Big Data. A predictive and analysis system is proposed by combining the clustering results of batchStream with decision trees. The architecture and these different modules form the computational core of our Big Data project, called Square Predict. GHStream, for both visualization and clustering tasks, is our third extension. The presented approach uses a hierarchical and topological structure for both of these tasks.
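The abstract describes a two-step micro-batch update (assignment, then prototype optimization) decomposed into Map and Reduce functions, with a weighting that penalizes old data. The following Python sketch illustrates that general idea only; it is not the batchStream algorithm itself. The function names, the per-batch fading factor `exp(-decay)`, and the running-weight bookkeeping are all assumptions made for illustration, and the topological (GNG) part of the method is omitted.

```python
import math


def nearest(point, prototypes):
    """Index of the prototype closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(prototypes)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, prototypes[k])),
    )


def map_step(batch, prototypes):
    """Map (assignment step): emit (cluster index, (point, 1)) per observation."""
    return [(nearest(x, prototypes), (x, 1)) for x in batch]


def reduce_step(pairs, dim):
    """Reduce: sum the points and the counts emitted for each cluster index."""
    sums = {}
    for k, (x, n) in pairs:
        s, c = sums.get(k, ([0.0] * dim, 0))
        sums[k] = ([a + b for a, b in zip(s, x)], c + n)
    return sums


def update(prototypes, weights, sums, decay=0.5):
    """Optimization step: blend the faded history with the new batch statistics.

    `decay` is a hypothetical parameter; exp(-decay) is applied once per batch.
    """
    lam = math.exp(-decay)
    new_protos, new_weights = [], []
    for k, (proto, w) in enumerate(zip(prototypes, weights)):
        s, c = sums.get(k, ([0.0] * len(proto), 0))
        faded = lam * w          # old evidence counts for less
        total = faded + c
        if total == 0:           # no history and no new points: keep as is
            new_protos.append(list(proto))
            new_weights.append(0.0)
            continue
        new_protos.append([(faded * p + si) / total for p, si in zip(proto, s)])
        new_weights.append(total)
    return new_protos, new_weights
```

Each incoming micro-batch would be passed through `map_step`, `reduce_step`, and `update` in turn; in a Spark setting the first two would correspond to distributed operations, while here they are plain functions so the sketch stays self-contained.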

32 
Energy-efficient routing protocols for heterogeneous wireless sensor networks with smart buildings evacuation
Al-Aboody, Nadia Ali Qassim January 2017
The number of devices connected to the Internet is expected to increase exponentially by 2020, shifting the Internet from an Internet of people towards an Internet of Things (IoT). These devices can communicate with each other and exchange information, forming a wide Wireless Sensor Network (WSN). WSNs are composed mainly of a large number of small battery-powered devices, so their energy is limited. It is therefore essential to use energy-efficient routing protocols for WSNs that are scalable and robust in terms of energy consumption and lifetime. Cluster-based routing protocols provide an efficient approach to reducing the energy consumption of sensor nodes and maximizing the network lifetime of WSNs. In this thesis, a single-hop cluster-based network-layer routing protocol, referred to as HRHP, is designed. It applies centralized and deterministic approaches to the selection of cluster heads (CHs) in order to offer an improved network lifetime for large-scale and dense WSN deployments. The deterministic approach for selecting CHs is based on the positive selection mechanism in human thymus cells (T-cells). HRHP was tested over six different scenarios with the base station (BS) positioned outside the sensing area, and achieved a maximum average of 78% in terms of lifetime. To further reduce energy consumption in WSNs, a multi-hop algorithm, referred to as MLHP, is proposed for prolonging the network lifetime. In this algorithm, the sensing area is divided into three levels to reduce the communication cost by reducing the transmission distances for both inter-cluster and intra-cluster communication. MLHP was tested over fourteen cases with different heterogeneity factors and area sizes and achieved a maximum improvement of 80% in terms of lifetime.
Finally, a real-time and autonomous emergency evacuation approach, referred to as ARTCWSN, is proposed. It integrates cloud computing with WSNs in order to improve evacuation accuracy and efficiency for smart buildings. The approach is designed to perform localized, autonomous navigation by calculating the best evacuation paths in a distributed manner using two types of sensor nodes (SNs): a sensing node and a decision node. ARTCWSN was tested in five scenarios with different hazard intensities, occupation ratios and exit availability over three different evacuation areas, and achieved an average survival ratio of 98% across the different cases.

33 
Clustering for Classification
Evans, Reuben James Emmanuel January 2007
Advances in technology have provided industry with an array of devices for collecting data. The frequency and scale of data collection mean that many large datasets are now being generated. To find patterns in these datasets it would be useful to be able to apply modern methods of classification such as support vector machines. Unfortunately these methods are computationally expensive (quadratic in the number of data points, in fact), so they cannot be applied directly. This thesis proposes a framework whereby a variety of clustering methods can be used to summarise datasets, that is, reduce them to a smaller but still representative dataset so that these advanced methods can be applied. It compares the results of using this framework against using random selection on a large number of classification and regression problems. Results show that the clustered datasets are on average fifty percent smaller than the original datasets without loss of classification accuracy, which is significantly better than random selection. They also show that there is no free lunch: for each dataset it is important to choose a clustering method carefully.
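As a rough illustration of summarising a labelled dataset by clustering before handing it to an expensive classifier, the sketch below replaces each class with a handful of cluster centroids. It uses a plain k-means with deterministic initialisation as a stand-in; the thesis compares a variety of clustering methods, and nothing here should be read as its actual framework. The names `kmeans`, `summarise`, and `k_per_class` are hypothetical.

```python
def kmeans(points, k, iters=20):
    """Plain k-means; the first k points serve as initial centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            j = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            groups[j].append(p)
        for j, g in enumerate(groups):
            if g:  # leave a centroid untouched if its group is empty
                centroids[j] = [sum(col) / len(g) for col in zip(*g)]
    return centroids


def summarise(X, y, k_per_class):
    """Replace each class by at most `k_per_class` centroids carrying that label."""
    reduced_X, reduced_y = [], []
    for label in sorted(set(y)):
        members = [x for x, t in zip(X, y) if t == label]
        for c in kmeans(members, min(k_per_class, len(members))):
            reduced_X.append(c)
            reduced_y.append(label)
    return reduced_X, reduced_y
```

The reduced pair `(reduced_X, reduced_y)` could then be passed to any classifier in place of the full data; clustering within each class keeps the labels of the summary points unambiguous.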

34 
Functional Analysis of Real World Truck Fuel Consumption Data
Vogetseder, Georg January 2008
This thesis covers the analysis of sparse and irregular fuel consumption data of long-distance haulage articulated trucks. It is shown that this kind of data is hard to analyse with multivariate as well as with functional methods. To be able to analyse the data, Principal Components Analysis through Conditional Expectation (PACE) is used, which enables the use of observations from many trucks to compensate for the sparsity of observations in order to get continuous results. The principal component scores generated by PACE can then be used to get rough estimates of the trajectories for single trucks as well as to detect outliers. The data-centric approach of PACE is very useful for enabling functional analysis of sparse and irregular data. Functional analysis is desirable for this data because it sidesteps feature extraction and enables a more natural view of the data.

35 
Correlations in the Cosmic Far-infrared Background at 250, 350, and 500 μm Reveal Clustering of Star-forming Galaxies
Viero, Marco Paolo 23 February 2011
We demonstrate the application of CMB techniques to measure the clustering of infrared-emitting star-forming galaxies. We detect correlations in the cosmic far-infrared background due to the clustering of star-forming galaxies in observations made with the Balloon-borne Large Aperture Submillimeter Telescope, BLAST, at 250, 350, and 500 μm. We perform jackknife and other tests to confirm the reality of the signal. The measured correlations are well fit by a power law over scales of 5–25 arcminutes, with ∆I/I = 15.1 ± 1.7%. We adopt a specific model for submillimeter sources in which the contribution to clustering comes from sources in the redshift ranges 1.3 ≤ z ≤ 2.2, 1.5 ≤ z ≤ 2.7, and 1.7 ≤ z ≤ 3.2, at 250, 350 and 500 μm, respectively. With these distributions, our measurement of the power spectrum, P(k_θ), corresponds to linear bias parameters, b = 3.8 ± 0.6, 3.9 ± 0.6 and 4.4 ± 0.7, respectively. We further interpret the results in terms of the halo model, and find that at the smaller scales, the simplest halo model fails to fit our results. One way to improve the fit is to increase the radius at which dark matter halos are artificially truncated in the model, which is equivalent to having some star-forming galaxies at z ≥ 1 located in the outskirts of groups and clusters. In the context of this model we find a minimum halo mass required to host a galaxy is log(M_min/M_⊙) = 11.5 (+0.4, −0.1), and we derive effective biases b_eff = 2.2 ± 0.2, 2.4 ± 0.2, and 2.6 ± 0.2, and effective masses log(M_eff/M_⊙) = 12.9 ± 0.3, 12.8 ± 0.2, and 12.7 ± 0.2, at 250, 350 and 500 μm, corresponding to spatial correlation lengths of r_0 = 4.9, 5.0, and 5.2 ± 0.7 h⁻¹ Mpc, respectively. Finally, we discuss implications for clustering measurement strategies with Herschel and Planck.

36 
Network Clustering in Vehicular Communication Networks
Li, Weiwei 25 August 2011
This thesis proposes a clustering algorithm for vehicular communication networks. A novel clustering metric and an improved clustering framework are introduced. The novel clustering metric, network criticality, is a global metric on undirected graphs which quantifies the robustness of the graph against changes in environmental parameters, and point-to-point network criticality is also defined to measure the resistance between different points of a graph. We localize the notion of network criticality for a node of a vehicular network which can potentially be promoted as the cluster header. We use the localized notion of node criticality in conjunction with a universal link metric, Link Expiration Time (LET), to derive a clustering algorithm for the vehicular network. We employ a distributed multi-hop clustering algorithm based on the notion of network criticality. Simulation results show that the proposed clustering algorithm forms a more robust cluster structure.
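The abstract names Link Expiration Time (LET) without defining it. In the mobile ad hoc networking literature, LET is commonly computed from the positions, speeds, and headings of two nodes under the assumption that both keep a constant velocity and share a transmission range r. The sketch below follows that common formulation; whether the thesis uses exactly this form is not stated in the abstract, and the function and parameter names are illustrative.

```python
import math


def link_expiration_time(xi, yi, vi, theta_i, xj, yj, vj, theta_j, r):
    """Time until two constant-velocity nodes drift out of range r.

    Positions are in metres, speeds in m/s, headings in radians.
    Returns math.inf if the relative velocity is zero and the nodes are in
    range, and 0.0 if the link is already (or always) out of range.
    """
    a = vi * math.cos(theta_i) - vj * math.cos(theta_j)  # relative velocity, x
    b = xi - xj                                          # relative position, x
    c = vi * math.sin(theta_i) - vj * math.sin(theta_j)  # relative velocity, y
    d = yi - yj                                          # relative position, y

    if a == 0 and c == 0:
        # Same velocity vector: the distance between the nodes never changes.
        return math.inf if b * b + d * d <= r * r else 0.0

    disc = (a * a + c * c) * r * r - (a * d - b * c) ** 2
    if disc < 0:
        return 0.0  # the nodes never come within range of each other
    # Larger root of |relative position(t)| = r, clamped at zero.
    return max(0.0, (-(a * b + c * d) + math.sqrt(disc)) / (a * a + c * c))
```

For example, a node at the origin heading east at 10 m/s and a stationary node 50 m to its east, with a 100 m range, stay connected for 15 s: the moving node passes the stationary one and is 100 m beyond it at that time.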

39 
Clustering in the Presence of Noise
Haghtalab, Nika 08 August 2013
Clustering, which is partitioning data into groups of similar objects, has a wide range of applications. In many cases unstructured data makes up a significant part of the input. Attempting to cluster that part of the data, which can be referred to as noise, can disturb the clustering on the remaining domain points. Despite the practical need for a framework of clustering that allows a portion of the data to remain unclustered, little research has been done so far in that direction. In this thesis, we take a step towards addressing the issue of clustering in the presence of noise in two parts. First, we develop a platform for clustering that has a cluster devoted to the "noise" points. Second, we examine the problem of "robustness" of clustering algorithms to the addition of noise.
In the first part, we develop a formal framework for clustering that has a designated noise cluster. We formalize intuitively desirable input-output properties of clustering algorithms that have a noise cluster. We review some previously known algorithms, introduce new algorithms for this setting, and examine them with respect to the introduced properties.
In the second part, we address the problem of robustness of clustering algorithms to the addition of unstructured data. We propose a simple and efficient method to turn any centroid-based clustering algorithm into a noise-robust one that has a noise cluster. We discuss several rigorous measures of robustness and prove performance guarantees for our method with respect to these measures under the assumption that the noise-free data satisfies some niceness properties and the noise satisfies some mildness properties. We also prove that more straightforward ways of adding robustness to clustering algorithms fail to achieve the above-mentioned guarantees.
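To make the idea of a designated noise cluster concrete, here is a small Python sketch in which a point is left unclustered when it lies farther than a threshold from every centroid. The abstract does not spell out the thesis's actual construction, so the threshold rule, the parameter `delta`, the `NOISE` label, and the Lloyd-style loop are assumptions chosen for illustration, and the sketch carries none of the guarantees the thesis proves.

```python
import math

NOISE = -1  # label used here for the designated noise cluster (an assumption)


def assign_with_noise(points, centroids, delta):
    """Assign each point to its nearest centroid, or to NOISE if farther than delta."""
    labels = []
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        k = min(range(len(dists)), key=dists.__getitem__)
        labels.append(k if dists[k] <= delta else NOISE)
    return labels


def robust_kmeans(points, centroids, delta, iters=10):
    """Lloyd-style iterations in which noise points do not pull on any centroid."""
    centroids = [list(c) for c in centroids]
    for _ in range(iters):
        labels = assign_with_noise(points, centroids, delta)
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign_with_noise(points, centroids, delta)
```

Because outliers are excluded from the centroid update, a single distant point cannot drag a centroid away from the structured data, which is the behaviour a noise cluster is meant to provide.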

40 
An Ontology-Based Personalized Document Clustering Approach
Huang, Tsehsiu 05 August 2004
With the proliferation of electronic commerce and knowledge economy environments, both persons and organizations have increasingly generated and consumed large amounts of online information, typically available as textual documents. To manage this rapid growth in the number of textual documents, people often use categories or folders to organize their documents. These document grouping behaviors are intentional acts that reflect the persons' (or organizations') preferences with regard to semantic coherency, or relevant groupings between subjects. For this thesis, we design and implement an ontology-based personalized document clustering (OnPEC) technique by incorporating both an individual user's partial clustering and an ontology into the document clustering process. Our use of a target user's partial clustering supports the personalization of document categorization, whereas our use of the ontology turns document clustering from a feature-based to a concept-based approach. In addition, we combine two hierarchical agglomerative clustering (HAC) approaches (i.e., precluster-based and atomic-based) in our proposed OnPEC technique. Using the clustering effectiveness achieved by a traditional content-based document clustering technique and previously proposed feature-based document clustering (PEC) techniques as performance benchmarks, we find that use of partial clusters improves document clustering effectiveness, as measured by cluster precision and cluster recall. Moreover, for both OnPEC and PEC techniques, the clustering effectiveness of precluster-based HAC methods greatly outperforms that of atomic-based HAC methods.
