41 |
Modélisation et exécution des applications d'analyse de données multi-dimentionnelles sur architectures distribuées / Modeling and execution of multidimensional data analysis applications on distributed architectures
Pan, Jie, 13 December 2010
Colossal quantities of data are generated every day. Processing large volumes of data has therefore become a real challenge for multidimensional data analysis software. Moreover, the response time that users of such software expect keeps getting shorter, to the point of being interactive. Parallel computing is one way to meet this demand. Traditional approaches rely on high-performance but expensive architectures such as supercomputers. Lower-cost architectures are also available, but the methods developed for them are often much less efficient. In this thesis, we use MapReduce, a parallel programming model that originated in cloud computing, to parallelize the processing of multidimensional data analysis queries and so benefit from its scalability and fault-tolerance mechanisms. We revisit existing techniques for optimizing the processing of multidimensional data analysis queries, including pre-computation, indexing, and data partitioning, and we summarize the forms of parallelism in query processing. We then study the MapReduce model in detail. We begin by presenting the principles of MapReduce and of its extended model, MapCombineReduce. In particular, we analyze the communication cost of the MapReduce procedure. After presenting the data storage that works with MapReduce, we describe the characteristics of data management applications suited to cloud computing and the use of MapReduce for data analysis applications in existing work. Next, we focus on parallelizing the Multiple Group-by query, a typical query in multidimensional data exploration.
We present an initial implementation based on MapReduce and an optimized version based on MapCombineReduce. According to the experimental results, the optimized version shows better speed-up and better scalability than the initial version. We also give a formal estimate of the execution time of both implementations. To further optimize the processing of the Multiple Group-by query, we propose a data restructuring phase that optimizes individual jobs. We redefine the organization of data storage and apply data partitioning, inverted indexing, and data compression during the restructuring phase. We redefine the computations performed in MapReduce and in task scheduling to use this new data structure. Based on measured execution times, we give a formal estimate and identify the factors that affect performance: query selectivity, the number of mappers launched on a node, the distribution of "hitting" data, the size of intermediate results, the serialization algorithms adopted, the state of the network, whether or not a combiner is used, and the data partitioning method. We give an execution-time estimation model and, in particular, estimate the values of the various parameters for executions that use horizontal partitioning. To support unique-value-wise scheduling, which is more flexible, we design a new compressed data structure that works with vertical partitioning. This approach allows the aggregation over a given value to be carried out in one continuous process.
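The Multiple Group-by pattern and the role of the combiner can be illustrated with a small in-memory simulation. The sketch below is not the thesis's implementation: the record layout, the column names (`region`, `product`, `sales`), the sample data, and all function names are assumptions made for illustration. Each mapper emits one `((dimension, value), measure)` pair per group-by dimension, an optional combiner pre-aggregates the pairs on each simulated node, and the reducer merges the partial sums.

```python
from collections import defaultdict

def map_record(record, dimensions, measure):
    """Emit one ((dimension, value), measure) pair per group-by dimension."""
    for dim in dimensions:
        yield (dim, record[dim]), record[measure]

def combine(pairs):
    """Pre-aggregate pairs locally, as a combiner would on one node."""
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())

def reduce_all(node_outputs):
    """Merge the (possibly pre-aggregated) pairs produced by every node."""
    totals = defaultdict(int)
    for output in node_outputs:
        for key, value in output:
            totals[key] += value
    return dict(totals)

def multiple_group_by(partitions, dimensions, measure, use_combiner=True):
    """Compute SUM(measure) grouped by each dimension separately.

    Returns the aggregates and the number of intermediate pairs that would
    travel from the map side to the reduce side.
    """
    node_outputs = []
    shuffled = 0
    for partition in partitions:  # one horizontal partition per simulated node
        pairs = [p for rec in partition
                 for p in map_record(rec, dimensions, measure)]
        output = combine(pairs) if use_combiner else pairs
        shuffled += len(output)
        node_outputs.append(output)
    return reduce_all(node_outputs), shuffled

# Hypothetical data: two nodes, each holding one horizontal partition.
PARTITIONS = [
    [{"region": "EU", "product": "A", "sales": 10},
     {"region": "EU", "product": "B", "sales": 5},
     {"region": "EU", "product": "A", "sales": 7}],
    [{"region": "US", "product": "A", "sales": 3},
     {"region": "US", "product": "A", "sales": 4}],
]

totals, pairs_with_combiner = multiple_group_by(
    PARTITIONS, ["region", "product"], "sales")
_, pairs_without_combiner = multiple_group_by(
    PARTITIONS, ["region", "product"], "sales", use_combiner=False)
print(totals)
print(pairs_with_combiner, pairs_without_combiner)
```

On this toy input the combiner halves the number of intermediate pairs (5 instead of 10) without changing the aggregates, which is the kind of communication saving that would motivate a MapCombineReduce variant.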
|
42 |
Applying MapReduce Island-based Genetic Algorithm-Particle Swarm Optimization to the inference of large Gene Regulatory Network in Cloud Computing environment
Huang, Wei-Jhe, 13 September 2012
The construction of Gene Regulatory Networks (GRNs) is one of the most important issues in systems biology. Inferring a large-scale GRN with a nonlinear mathematical model is time-consuming because of the large number of network parameters involved. In recent years, cloud computing has been widely used to solve large-scale problems. Among the available frameworks, Hadoop is currently the best-known and most reliable; it allows users to analyze large amounts of data in a distributed environment through MapReduce, and it also supports data backup and data recovery mechanisms.
This study proposes an island-based GAPSO algorithm, running in the Hadoop cloud computing environment, to infer large-scale GRNs. GAPSO uses the position and velocity functions of Particle Swarm Optimization (PSO) and integrates the operations of a Genetic Algorithm (GA). This kind of approach is often used to search for the optimal solution in nonlinear mathematical models. Several sets of experiments were conducted in which the number of network nodes varied from 50 to 125. The experiments were run in Hadoop distributed environments with 10, 20, and 26 computers. When inferring the network with 125 gene nodes on the largest Hadoop cluster (26 computers), the proposed framework ran up to 9.7 times faster than a stand-alone computer, which corresponds to a reduction of about 90% in the computation time of a single experimental run.
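The roughly 90% figure follows directly from the reported speed-up. As a quick check, assuming the factor 9.7 is the ratio of stand-alone running time to cluster running time (the symbols $T_1$, $T_{26}$, $S$, and $E$ are introduced here for illustration and do not appear in the abstract):

$$S = \frac{T_1}{T_{26}} = 9.7, \qquad \frac{T_1 - T_{26}}{T_1} = 1 - \frac{1}{S} = 1 - \frac{1}{9.7} \approx 0.897$$

Under the same assumption, the parallel efficiency on 26 computers would be $E = S / 26 = 9.7 / 26 \approx 0.37$, so the time saving is large even though each machine contributes well below its ideal share.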
|
43 |
Enabling Large-Scale Mining Software Repositories (MSR) Studies Using Web-Scale Platforms
Shang, Weiyi, 31 May 2010
The Mining Software Repositories (MSR) field analyzes software data to uncover knowledge and assist software development. Software projects and products continue to grow in size and complexity. In-depth analysis of these large systems and their evolution is needed to better understand the characteristics of such large-scale systems and projects. However, classical software analysis platforms (e.g., Prolog-like, SQL-like, or specialized programming scripts) face many challenges when performing large-scale MSR studies. Such platforms rarely scale easily out of the box. Instead, they often require analysis-specific, one-time, ad hoc scaling tricks and designs that are not reusable for other types of analysis and that are costly to maintain. We believe that the web community, in coping with the enormous growth of web data, has already faced many of the scaling challenges that now confront the software engineering community. In this thesis, we report on our experience in using MapReduce and Pig, two web-scale platforms, to perform large MSR studies. Through our case studies, we demonstrate the benefits and challenges of using web-scale platforms to prepare (i.e., Extract, Transform, and Load, ETL) software data for further analysis. The results of our studies show that (1) web-scale platforms provide an effective and efficient platform for large-scale MSR studies, and (2) many of the web community's guidelines for using web-scale platforms must be modified to achieve optimal performance for large-scale MSR studies. This thesis will help other software engineering researchers who want to scale their studies. / Thesis (Master, Computing) -- Queen's University, 2010-05-28 00:37:19.443
|
44 |
Scalable Embeddings for Kernel Clustering on MapReduce
Elgohary, Ahmed, 14 February 2014
There is an increasing demand from businesses and industries to make the best use of their data. Clustering is a powerful tool for discovering natural groupings in data. The k-means algorithm is the most commonly-used data clustering method, having gained popularity for its effectiveness on various data sets and ease of implementation on different computing architectures. It assumes, however, that data are available in an attribute-value format, and that each data instance can be represented as a vector in a feature space where the algorithm can be applied. These assumptions are impractical for real data, and they hinder the use of complex data structures in real-world clustering applications.
The kernel k-means is an effective method for data clustering which extends the k-means algorithm to work on a similarity matrix over complex data structures. The kernel k-means algorithm is however computationally very complex as it requires the complete data matrix to be calculated and stored. Further, the kernelized nature of the kernel k-means algorithm hinders the parallelization of its computations on modern infrastructures for distributed computing. This thesis defines a family of kernel-based low-dimensional embeddings that allows for scaling kernel k-means on MapReduce via an efficient and unified parallelization strategy. Then, three practical methods for low-dimensional embedding that adhere to our definition of the embedding family are proposed. Combining the proposed parallelization strategy with any of the three embedding methods constitutes a complete scalable and efficient MapReduce algorithm for kernel k-means. The efficiency and the scalability of the presented algorithms are demonstrated analytically and empirically.
|
45 |
Maresia: an approach to deal with the single points of failure of the MapReduce model / Maresia: uma abordagem para lidar com os pontos de falha única do modelo MapReduce
Marcos, Pedro de Botelho, January 2013
In recent years, the amount of data generated by applications has grown considerably. To become useful, however, this data must be processed. To that end, new programming models for parallel and distributed processing have been proposed. One example is the MapReduce model, proposed by Google. This model, however, has single points of failure (SPOFs) that can compromise the execution of a job. This work therefore presents a new architecture, inspired by Chord, to deal with the SPOFs of the model. The proposal was evaluated through analytical modeling and experimental tests. The results show that it is feasible to use the proposed architecture to execute MapReduce jobs.
|
46 |
Adequação da computação intensiva em dados para ambientes desktop grid com uso de MapReduce / Adequacy of intensive data computing to desktop grid environment with using of mapreduce
Anjos, Julio Cesar Santos dos, January 2012
The emergence of data volumes on the order of petabytes creates the need for new solutions that make it possible to process the data with data-intensive computing systems such as MapReduce. MapReduce is a programming framework built around two functions, a mapping function called Map and a reduction function called Reduce, which are applied to a given input data set. This programming model is generally used on large clusters, and its Map or Reduce tasks are normally independent of one another. The programmer is shielded from the details of parallelization, such as data splitting and distribution, fault tolerance, data persistence, and task distribution. The motivation of this work is to apply the MapReduce data-intensive computing model, with large volumes of data, to desktop grid environments. The goal is therefore to investigate the MapReduce algorithms in order to adapt data-intensive computing to heterogeneous environments. The work addresses the problem of resource heterogeneity and does not, at this stage, deal with machine volatility. Because of the deficiencies found in MapReduce in heterogeneous environments, MR-A++ was proposed: a MapReduce with algorithms suited to heterogeneous environments. The MR-A++ model creates a measurement task to collect information before the data is distributed, and this information is then used to manage the system. The modified algorithms were evaluated with a 2^k factorial analysis and with simulations run on the MRSG simulator, which was built to study large-scale MapReduce environments, both homogeneous and heterogeneous. The small delay introduced in the setup phase of the computation is offset by adapting the heterogeneous environment to the computational capacity of the machines, with reductions in job execution time exceeding 70% in some cases.
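The idea of gathering information about the machines before distributing the data can be sketched as follows. This is an illustrative simplification and not the MR-A++ algorithm itself: the function name, the throughput figures, and the largest-remainder rounding rule are all assumptions introduced here.

```python
def split_by_capacity(total_chunks, throughput):
    """Assign data chunks to machines in proportion to measured throughput.

    `throughput` maps a machine name to the rate observed during a short
    measurement task (any consistent unit). Largest-remainder rounding keeps
    the shares integral while making them add up to `total_chunks` exactly.
    """
    total_rate = sum(throughput.values())
    if total_rate <= 0:
        raise ValueError("at least one machine must have positive throughput")
    exact = {m: total_chunks * r / total_rate for m, r in throughput.items()}
    shares = {m: int(x) for m, x in exact.items()}
    leftover = total_chunks - sum(shares.values())
    # Give the remaining chunks to the machines with the largest fractional part.
    by_remainder = sorted(exact, key=lambda m: exact[m] - shares[m], reverse=True)
    for machine in by_remainder[:leftover]:
        shares[machine] += 1
    return shares

# Hypothetical measurements from a heterogeneous desktop grid.
MEASURED = {"fast": 40.0, "medium": 20.0, "slow": 10.0}
print(split_by_capacity(100, MEASURED))
```

With a uniform split the slowest machine would determine the job's finishing time; a capacity-proportional split aims to let all machines finish at about the same moment, at the cost of the extra measurement step.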
|
47 |
Uma abordagem distribuída para preservação de privacidade na publicação de dados de trajetória / A distributed approach for privacy preservation in the publication of trajectory data
Brito, Felipe Timbó, January 2016
BRITO, Felipe Timbó. Uma abordagem distribuída para preservação de privacidade na publicação de dados de trajetória. 2016. 66 f. Dissertação (mestrado em computação)- Universidade Federal do Ceará, Fortaleza-CE, 2016.
Advances in mobile computing, along with the pervasiveness of location-based services, have generated a great amount of trajectory data. These data can be used for various analysis purposes, such as traffic flow analysis, infrastructure planning, and understanding human behavior. However, publishing this amount of trajectory data may lead to serious risks of privacy breach. Quasi-identifiers are trajectory points that can be linked to external information and used to identify the individuals associated with trajectories. By analyzing quasi-identifiers, a malicious user may therefore be able to trace anonymous trajectories back to individuals, for example with the aid of location-aware social networking applications. Most existing trajectory data anonymization approaches were proposed for centralized computing environments, so they usually perform poorly when anonymizing large trajectory data sets. In this work we propose a distributed and efficient strategy that adopts the $k^m$-anonymity privacy model and uses the scalable MapReduce paradigm, which makes it possible to find quasi-identifiers in larger amounts of data. We also present a technique that minimizes the loss of information by selecting key locations from the quasi-identifiers to be suppressed. Experimental evaluation results demonstrate that the proposed approach to trajectory data anonymization is more scalable and efficient than existing works in the literature.
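To make the $k^m$-anonymity model concrete, the sketch below checks it on a toy data set. It relies on one common reading of the model, stated here as an assumption: a data set is $k^m$-anonymous when every subsequence of at most $m$ points that occurs in some trajectory is supported by at least $k$ trajectories. The function names and sample trajectories are hypothetical, and this brute-force, single-machine check is not the distributed strategy of the dissertation.

```python
from itertools import combinations

def is_subsequence(candidate, trajectory):
    """True if `candidate` appears in `trajectory` in order (gaps allowed)."""
    remaining = iter(trajectory)
    return all(point in remaining for point in candidate)

def violating_subsequences(trajectories, k, m):
    """Return subsequences of length <= m supported by fewer than k trajectories."""
    candidates = set()
    for trajectory in trajectories:
        for size in range(1, m + 1):
            # combinations() preserves order, so each tuple is a subsequence.
            candidates.update(combinations(trajectory, size))
    violations = {}
    for candidate in candidates:
        support = sum(is_subsequence(candidate, t) for t in trajectories)
        if support < k:
            violations[candidate] = support
    return violations

def is_km_anonymous(trajectories, k, m):
    """True if no subsequence of length <= m has support below k."""
    return not violating_subsequences(trajectories, k, m)

# Hypothetical trajectories over three locations.
TRAJECTORIES = [
    ("a", "b", "c"),
    ("a", "b"),
    ("a", "c"),
    ("b", "c"),
]

print(is_km_anonymous(TRAJECTORIES, k=2, m=2))
print(violating_subsequences(TRAJECTORIES, k=2, m=3))
```

Counting the support of each candidate is a natural fit for a map step that emits `(subsequence, 1)` pairs and a reduce step that sums them, which is one way such a check could be distributed; the candidate set grows quickly with $m$, so the brute-force enumeration here only suits small examples.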
|
48 |
Developing a music player mobile application with cloud server
Chen, Ying, January 1900
Master of Science / Department of Computing and Information Sciences / Daniel A. Andresen / A music player mobile application for Android is developed together with a cloud server built on Google's App Engine and Firebase. The application provides various ways of navigating to an audio file and several music visualizer options. It also provides three major features: (1) user sign-in and sign-out, (2) a display of the most popular songs based on input, and (3) submission of user comments and suggestions. These features are implemented with the cloud services of Google's App Engine and Firebase. Specifically, an application running on App Engine acts as the server that verifies user sign-in. It also runs App Engine MapReduce jobs that process the large data set stored in Google Cloud Storage and serve the relatively small result about popular songs to the app. In addition, users' comments and suggestions are automatically synchronized with Firebase, which makes it convenient to modify and analyze the synchronized data.
|
49 |
Semantic Keyword Search on Large-Scale Semi-Structured Data
January 2016
Keyword search provides a simple and user-friendly mechanism for information search, and has become increasingly popular for accessing structured or semi-structured data. However, two open issues of keyword search on structured and semi-structured data are not yet well addressed by existing work.
First, although this important area has received increasing attention, most existing work concentrates on efficiency rather than search quality and may fail to deliver results that are of high quality from a semantic perspective. The majority of existing work generates minimal subgraph results that are oblivious to the entity and relationship semantics embedded in the data and in the user query. Other studies define results to be subtrees or subgraphs that contain all query keywords but are not necessarily "minimal". However, this way of constructing results suffers from the same semantic misalignment between the data and the user query. This work studies how to define results that capture users' search intention, and then how to generate such search-intention-aware results.
Second, most existing research cannot handle large-scale structured data. As data volume has grown rapidly in recent years, efficiently processing keyword queries on large-scale structured data has become important. MapReduce is widely acknowledged as an effective programming model for processing big data. For keyword query processing on data graphs, graph algorithms that efficiently return query results consistent with users' search intention are first proposed, and these algorithms are then migrated to MapReduce to support big data. Keyword query processing on schema graphs first transforms a keyword query into multiple SQL queries and then runs all generated SQL queries on the structured data. It is therefore crucial to find the optimal way to execute a SQL query using MapReduce so as to minimize processing time. In this work, a system called SOSQL is developed that generates the optimal MapReduce query execution plan for a SQL query $Q$ with time complexity $O(n^2)$, where $n$ is the number of input tables of $Q$. / Dissertation/Thesis / Doctoral Dissertation Computer Science 2016
|