Global ETD Search

1	Algorithmes de classification répartis sur le cloud / Distributed clustering algorithms over a cloud computing platform Durut, Matthieu 28 September 2012 The research topics addressed in this manuscript concern the parallelization of unsupervised classification (clustering) algorithms on Cloud Computing platforms. Chapter 2 gives an overview of these technologies and presents Cloud Computing in general terms as a computing platform. Chapter 3 presents Microsoft's cloud offering: Windows Azure. The following chapter analyzes some technical challenges in designing cloud applications and proposes some software architecture elements for such applications. Chapter 5 analyzes the first clustering algorithm studied: Batch K-Means. In particular, we examine how distributed versions of this algorithm must be adapted to a cloud architecture, and we show the impact of communication costs on the efficiency of this algorithm when it is implemented on a cloud platform. Chapters 6 and 7 present work on parallelizing another clustering algorithm: Vector Quantization (VQ). In Chapter 6 we explore which parallelization schemes are likely to give satisfactory results in terms of speeding up convergence. Chapter 7 presents an implementation of these parallelization schemes. The practical details of the implementation highlight a result of prime importance: it is the online nature of VQ that makes an asynchronous implementation of the distributed algorithm possible, thereby removing part of the communication problems encountered when parallelizing Batch K-Means. / The subjects addressed in this thesis are inspired by research problems faced by the Lokad company. 
These problems are related to the challenge of designing efficient parallelization techniques for clustering algorithms on a Cloud Computing platform. Chapter 2 provides an introduction to Cloud Computing technologies, especially those devoted to intensive computations. Chapter 3 details more specifically Microsoft's Cloud Computing offering: Windows Azure. The following chapter details technical aspects of cloud application development and provides some cloud design patterns. Chapter 5 is dedicated to the parallelization of a well-known clustering algorithm: Batch K-Means. It provides insights into the challenges of a cloud implementation of distributed Batch K-Means, especially the impact of communication costs on implementation efficiency. Chapters 6 and 7 are devoted to the parallelization of another clustering algorithm, Vector Quantization (VQ). Chapter 6 analyzes different parallelization schemes for VQ and presents the speedups to convergence they provide. Chapter 7 provides a cloud implementation of these schemes. It highlights that it is the online nature of the VQ technique that enables an asynchronous cloud implementation, which drastically reduces the communication costs introduced in Chapter 5. K-means algorithm Cloud computing K-means clustering
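The contrast this abstract draws between Batch K-Means and online VQ can be illustrated with a small sketch. It is not the thesis's implementation: the function names (`batch_kmeans_step`, `online_vq_step`), the fixed `step` size, and the toy data are assumptions made for illustration. The batch step needs every point before any center moves, whereas the online step updates a single center from a single point, which is the property that makes asynchronous updates plausible.

```python
def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])),
    )


def batch_kmeans_step(points, centers):
    """One batch pass: assign every point, then recompute each center as a mean."""
    groups = [[] for _ in centers]
    for p in points:
        groups[nearest(p, centers)].append(p)
    new_centers = []
    for group, old in zip(groups, centers):
        if group:
            new_centers.append([sum(col) / len(group) for col in zip(*group)])
        else:
            new_centers.append(list(old))  # an empty cluster keeps its old center
    return new_centers


def online_vq_step(point, centers, step):
    """One online update: move only the winning center toward the point."""
    j = nearest(point, centers)
    centers[j] = [c + step * (p - c) for p, c in zip(point, centers[j])]
    return centers


# Toy data (hypothetical): two well-separated groups in the plane.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
batch = batch_kmeans_step(points, [[0.0, 0.0], [10.0, 10.0]])
online = online_vq_step([0.0, 1.0], [[0.0, 0.0], [10.0, 10.0]], step=0.5)
```

In a distributed setting, the batch step implies a synchronization point per pass, while the online step can in principle be applied by each worker as data arrives; how updates are merged across workers is left out of this sketch.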
2	A comparison of driving characteristics and environmental characteristics using factor analysis and k-means clustering algorithm Jung, Heejin 19 September 2012 The dissertation aims to classify drivers based on driving and environmental behaviors. The research determined significant factors using factor analysis, identified different driver types using k-means clustering, and studied how the same drivers map in each classification domain. The research consists of two study cases. In the first study case, a new variable is proposed and then is used for classification. The drivers were divided into three groups. Two alternatives were designed to evaluate the environmental impact of driving behavior changes. In the second study case, two types of data sets were constructed: driving data and environmental data. The driving data represents driving behavior of individual drivers. The environmental data represents emissions and fuel consumption estimated by microscopic energy and emissions models. Significant factors were explored in each data set using factor analysis. A pair of factors was defined for each data set. Each pair of factors was used for each k-means clustering: driving clustering and environmental clustering. Then the factors were used to identify groups of drivers in each clustering domain. In the driving clustering, drivers were grouped into three clusters. In the environmental clustering, drivers were clustered into two groups. The groups from the driving clustering were compared to the groups from the environmental clustering in terms of emissions and fuel consumption. The three groups of drivers from the driving clustering were also mapped in the environmental domain. The results indicate that the differences in driving patterns among the three driver groups significantly influenced the emissions of HC, CO, and NOx. 
As a result, it was determined that the average target operating acceleration and braking substantially influenced the amount of HC, CO, and NOx emissions. Therefore, if drivers were to change their driving behavior to be more defensive, emissions of HC, CO, and NOx would be expected to decrease. It was also found that spacing-based driving tended to produce fewer emissions but consume more fuel than the other groups, while speed-based driving produced relatively more emissions. On the other hand, the defensively moderate drivers consumed less fuel and produced fewer emissions. / Ph.D. NGSIM driving characteristics factor analysis k-means clustering CMEM
3	Clusters (k) Identification without Triangle Inequality : A newly modelled theory / Clustering(k) without Triangle Inequality : A newly modelled theory Narreddy, Naga Sambu Reddy, Durgun, Tuğrul January 2012 Cluster analysis organizes data that are sufficiently similar and useful into meaningful groups (clusters). For example, cluster analysis can be applied to find groups of similar genes and proteins, to retrieve information from the World Wide Web, and to identify locations that are prone to earthquakes. The study of clustering has therefore become very important in several fields, including psychology and other social sciences, biology, statistics, pattern recognition, information retrieval, machine learning and data mining [1] [2]. Cluster analysis is one of the most widely used techniques in data mining. Depending on the complexity and amount of data in a system, a variety of cluster analysis algorithms can be used. K-means clustering is one of the most popular and widely used of the top ten algorithms in data mining [3]. Like other clustering algorithms, it is not a silver bullet. K-means clustering requires prior analysis and knowledge before the number of clusters and their centroids are determined. Recent studies show a new approach to K-means clustering that does not require any prior knowledge for determining the number of clusters [4]. In this thesis, we propose a new clustering procedure to solve the central problem of identifying the number of clusters (k) by imitating the desired number of clusters with proper properties. The proposed algorithm is validated by investigating different characteristics of the analyzed data with the modified theory and by analyzing the efficiency of the parameters and their relationships. The parameters in this theory include the selection of embryo-size (m), significance level (α), distributions (d), and training set (n), in the identification of clusters (k). 
K-means clustering modifying K-means clustering nearest neighbor clustering general clustering procedure Kolmogorov-Smirnov test parameters descriptions Computer and Information Sciences
4	Supervised Learning Techniques : A comparison of the Random Forest and the Support Vector Machine Arnroth, Lukas, Fiddler Dennis, Jonni January 2016 This thesis examines the performance of the support vector machine and the random forest models in the context of binary classification. The two techniques are compared and the better-performing one is used to construct a final parsimonious model. The data set consists of 33 observations and 89 biomarkers as features with no known dependent variable. The dependent variable is generated through k-means clustering, with a predefined final solution of two clusters. The training of the algorithms is performed using five-fold cross-validation repeated twenty times. The outcome of the training process reveals that the best performing versions of the models are a linear support vector machine and a random forest with six randomly selected features at each split. The final results of the comparison on the test set of these optimally tuned algorithms show that the random forest outperforms the linear kernel support vector machine. The former classifies all observations in the test set correctly whilst the latter classifies all but one correctly. Hence, a parsimonious random forest model using the top five features is constructed, which performs as well on the test set as the original random forest model using all features. machine learning biomarkers cross-validation receiver operating characteristic k-means clustering feature selection binary classification
5	Optimal Clustering: Genetic Constrained K-Means and Linear Programming Algorithms Zhao, Jianmin 01 January 2006 Methods for determining clusters of data under specified constraints have recently gained popularity. Although general constraints may be used, we focus on clustering methods with the constraint of a minimal cluster size. In this dissertation, we propose two constrained k-means algorithms: the Linear Programming Algorithm (LPA) and the Genetic Constrained K-means Algorithm (GCKA). The Linear Programming Algorithm recasts the k-means algorithm as a linear programming problem with constraints requiring that each cluster have m or more subjects. In order to achieve an acceptable clustering solution, we run the algorithm with a large number of random sets of initial seeds, and choose the solution with minimal Root Mean Squared Error (RMSE) as our final solution for a given data set. We evaluate LPA with both generic data and simulated data, and the results indicate that LPA can obtain a reasonable clustering solution. The Genetic Constrained K-Means Algorithm (GCKA) hybridizes the Genetic Algorithm with a constrained k-means algorithm. We define a Selection Operator, a Mutation Operator and a Constrained K-means operator. Using finite Markov chain theory, we prove that the GCKA converges in probability to the global optimum. We test the algorithm with several datasets. The analysis shows that we can achieve a good clustering solution by carefully choosing parameters such as population size, mutation probability and generation. We also propose a Bi-Nelder algorithm to search for an appropriate cluster number with minimal RMSE. cluster analysis hierarchical clustering K-means clustering LPA algorithm evaluation Biostatistics Physical Sciences and Mathematics Statistics and Probability
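The restart strategy described in this abstract (many random sets of initial seeds, keeping the solution with minimal RMSE) can be sketched as below. This is only an illustration under stated assumptions: the minimum-size constraint is enforced here by simply rejecting any run that produces a cluster smaller than `min_size`, which stands in for, and is not, the dissertation's linear programming formulation. All names (`best_of_restarts`, `min_size`, `restarts`) and the toy data are hypothetical.

```python
import math
import random


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        for p in points
    ]


def kmeans(points, k, rng, iters=20):
    """Plain k-means from k randomly sampled seeds (no size constraint)."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        labels = assign(points, centers)
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assign(points, centers)


def rmse(points, centers, labels):
    """Root mean squared distance from each point to its assigned center."""
    total = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return math.sqrt(total / len(points))


def best_of_restarts(points, k, min_size, restarts=50, seed=0):
    """Keep the lowest-RMSE run whose clusters all have at least min_size points."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        centers, labels = kmeans(points, k, rng)
        if min(labels.count(j) for j in range(k)) < min_size:
            continue  # rejection is an assumption; the thesis uses LP constraints
        score = rmse(points, centers, labels)
        if best is None or score < best[0]:
            best = (score, centers, labels)
    return best


# Toy data (hypothetical): two groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
score, centers, labels = best_of_restarts(points, k=2, min_size=2)
```

Rejection is the simplest way to respect a size floor, but it can discard every run on awkward data, in which case `best_of_restarts` returns `None`; a constrained assignment step avoids that at the cost of solving an optimization problem per iteration.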
6	Privacy-Enhancing Techniques for Data Analytics Fang-Yu Rao (6565679) 10 June 2019 Organizations today collect and aggregate huge amounts of data from individuals under various scenarios and for different purposes. Such aggregation of individuals’ data, when combined with techniques of data analytics, allows organizations to make informed decisions and predictions. But in many situations, different portions of the data associated with individuals are collected and curated by different organizations. To derive more accurate conclusions and predictions, those organizations may want to conduct the analysis based on their joint data, which cannot be accomplished simply by each organization exchanging its own data with other organizations, due to the sensitive nature of the data. Developing approaches for collaborative privacy-preserving data analytics, however, is a nontrivial task. At least two major challenges have to be addressed. The first challenge is that the security of the data possessed by each organization should always be properly protected during and after the collaborative analysis process, whereas the second challenge is the high computational complexity that usually accompanies the cryptographic primitives used to build such privacy-preserving protocols.
In this dissertation, based on widely adopted primitives in cryptography, we address the aforementioned challenges by developing techniques for data analytics that not only allow multiple mutually distrustful parties to perform data analysis on their joint data in a privacy-preserving manner, but also reduce the time required to complete the analysis. More specifically, using three common data analytics tasks as concrete examples, we show how to construct the respective privacy-preserving protocols under two different scenarios: (1) the protocols are executed by a collaborative process only involving the participating parties; (2) the protocols are outsourced to some service providers in the cloud. Two types of optimization for improving the efficiency of those protocols are also investigated. The first type allows each participating party access to a statistically controlled leakage so as to reduce the amount of required computation, while the second type utilizes the parallelism that could be incorporated into the task and pushes some computation to the offline phase to reduce the time needed for each participating party without any additional leakage. Extensive experiments are also conducted on real-world datasets to demonstrate the effectiveness of our proposed techniques. Computer System Security Data Encryption Differential Privacy Secure Multiparty Computation Record Linkage K-Means Clustering
7	Evaluation of decentralized email architecture and social network analysis based on email attachment sharing Tsipenyuk, Gregory January 2018 Present-day email is provided by centralized services running in the cloud. The services transparently connect users behind middleboxes and provide backup, redundancy, and high availability at the expense of user privacy. In present-day mobile environments, users can access and modify email from multiple devices, with updates reconciled on the central server. Prioritizing updates is difficult and may be undesirable. Moreover, legacy email protocols do not provide optimal email synchronization and access. The recent rise of the Internet of Things (IoT) will see the number of interconnected devices grow to 27 billion by 2021. In the first part of my dissertation I propose a decentralized email architecture which takes advantage of a user's IoT devices to maintain a complete email history. This addresses the email reconciliation issue and places data under user control. I replace legacy email protocols with a synchronization protocol to achieve eventual consistency of email and optimize bandwidth and energy usage. The architecture is evaluated on a Raspberry Pi computer. There is an extensive body of research on Social Network Analysis (SNA) based on email archives. Typically, the analyzed network reflects either communication between users or a relationship between the email and the information found in the email's header and body. This approach discards either all or some email attachments that cannot be converted to text; for instance, images. Yet attachments may account for up to 90% of an email archive's size. In the second part of my dissertation I suggest extracting the network from email attachments shared between users. I hypothesize that the network extracted from shared email attachments might provide more insight into the social structure of the email archive. 
I evaluate the communication and shared email attachment networks by analyzing common centrality measures and classification and clustering algorithms. I further demonstrate how the analysis of the shared attachments network can be used to optimize the proposed decentralized email architecture.
8	Statistische Eigenschaften von Clusterverfahren / Statistical properties of cluster procedures Schorsch, Andrea January 2008 This diploma thesis deals with two aspects of the statistical properties of clustering procedures. First, it addresses the existence of different cluster analysis methods for finding structure and their different approaches. The method of distance between manifolds and the K-means method yield different final clusterings from the same data. The second part of the thesis looks more closely at the asymptotic properties of the K-means procedure. Here the set of optimal cluster centres is consistent: as the sample size tends to infinity, it converges in probability to the set of cluster centres that minimizes the variance criterion. Likewise, the set of optimal cluster centres converges to a normal distribution as n tends to infinity. It turned out that the individual cluster centres depend on one another. / The following thesis describes two different views of the statistical characteristics of clustering procedures. First, it addresses the question of whether different clustering methods exist to ascertain the structure of clusters and in what ways the strategies of these methods differ from each other. The method of distance between manifolds and the k-means method produce different final clusters from the same initial data. The second part of the thesis concentrates on asymptotic properties of the k-means procedure. Here the set of optimal cluster centres is consistent. As the sample size tends to infinity, it converges in probability towards the set of cluster centres that minimizes the within-cluster sum of squares. 
Likewise, the set of optimal cluster centres converges to a normal distribution as n tends to infinity. The main result shows that the individual cluster centres depend on each other. cluster analysis k-means clustering asymptotic normal distribution Mathematics
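The asymptotic statements in this abstract can be written out in standard notation. The symbols below ($X_i$, $a_j$, $A_n$, $A^\ast$, $\Sigma$) are assumptions introduced here, not notation taken from the thesis, and the $\sqrt{n}$ scaling is the usual form of such a result rather than something the abstract states.

```latex
% Empirical and population versions of the variance criterion
% (within-cluster sum of squares) for k centres a_1, ..., a_k:
W_n(a_1,\dots,a_k) = \frac{1}{n}\sum_{i=1}^{n}\min_{1\le j\le k}\lVert X_i - a_j\rVert^2,
\qquad
W(a_1,\dots,a_k) = \int \min_{1\le j\le k}\lVert x - a_j\rVert^2 \, P(dx).

% Consistency: the minimizer A_n of W_n converges in probability
% to the minimizer A^* of W as the sample size n grows:
A_n \xrightarrow{\;P\;} A^\ast \quad (n \to \infty).

% Asymptotic normality of the stacked centres; a covariance \Sigma that is
% not block-diagonal expresses that the centres depend on one another:
\sqrt{n}\,\bigl(A_n - A^\ast\bigr) \xrightarrow{\;d\;} \mathcal{N}(0, \Sigma).
```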
9	Structure Pattern Analysis Using Term Rewriting and Clustering Algorithm Fu, Xuezheng 27 June 2007 Biological data is accumulating at a fast pace. However, raw data are generally difficult to understand and not useful unless we unlock the information hidden in the data. Knowledge and information can be extracted as the patterns or features buried within the data. Thus data mining, which aims at uncovering underlying rules, relationships, and patterns in data, has emerged as one of the most exciting fields in computational science. In this dissertation, we develop efficient approaches to the structure pattern analysis of RNA and protein three-dimensional structures. The major techniques used in this work include term rewriting and clustering algorithms. Firstly, a new approach is designed to study the interaction of RNA secondary structure motifs using the concept of term rewriting. Secondly, an improved K-means clustering algorithm is proposed to estimate the number of clusters in data. A new distance descriptor is introduced for the appropriate representation of three-dimensional structure segments of RNA and protein three-dimensional structures. The experimental results show improvements in the determination of the number of clusters in data, the evaluation of RNA structure similarity, and RNA structure database search, as well as a better understanding of the protein sequence-structure correspondence. Bioinformatics K-means clustering algorithm Term rewriting Stability Knowledge discovery Data mining Validation measure Computer Sciences
10	The General Quantization Problem for Distributions with Regular Support Pötzelberger, Klaus January 1999 We study the asymptotic behavior of the quantization error for general information functions and prove results for distributions P with regular support. We characterize the information functions for which the uniform distribution on the set of prototypes converges weakly to P. (author's abstract) / Series: Forschungsberichte / Institut für Statistik MSC 60D05, 62H30, 68T10

Search results