Global ETD Search

71	Distributed frequent subgraph mining in the cloud / Fouille de sous-graphes fréquents dans les nuages Aridhi, Sabeur 29 November 2013 (has links) Durant ces dernières années, l’utilisation de graphes a fait l’objet de nombreux travaux, notamment en bases de données, apprentissage automatique, bioinformatique et en analyse des réseaux sociaux. Particulièrement, la fouille de sous-graphes fréquents constitue un défi majeur dans le contexte de très grandes bases de graphes. De ce fait, il y a un besoin d’approches efficaces de passage à l’échelle pour la fouille de sous-graphes fréquents surtout avec la haute disponibilité des environnements de cloud computing. Cette thèse traite la fouille distribuée de sous-graphe fréquents sur cloud. Tout d’abord, nous décrivons le matériel nécessaire pour comprendre les notions de base de nos deux domaines de recherche, à savoir la fouille de sous-graphe fréquents et le cloud computing. Ensuite, nous présentons les contributions de cette thèse. Dans le premier axe, une nouvelle approche basée sur le paradigme MapReduce pour approcher la fouille de sous-graphes fréquents à grande échelle. L’approche proposée offre une nouvelle technique de partitionnement qui tient compte des caractéristiques des données et qui améliore le partitionnement par défaut de MapReduce. Une telle technique de partitionnement permet un équilibrage des charges de calcul sur une collection de machine distribuée et de remplacer la technique de partitionnement par défaut de MapReduce. Nous montrons expérimentalement que notre approche réduit considérablement le temps d’exécution et permet le passage à l’échelle du processus de fouille de sous-graphe fréquents à partir de grandes bases de graphes. Dans le deuxième axe, nous abordons le problème d’optimisation multi-critères des paramètres liés à l’extraction distribuée de sous-graphes fréquents dans un environnement de cloud tout en optimisant le coût monétaire global du stockage et l’interrogation des données dans le nuage. Nous définissons des modèles de coûts de gestion et de fouille de données avec une plateforme de fouille de sous-graphe à grande échelle sur une architecture cloud. Nous présentons une première validation expérimentale des modèles de coûts proposés. / Recently, graph mining approaches have become very popular, especially in certain domains such as bioinformatics, chemoinformatics and social networks. One of the most challenging tasks in this setting is frequent subgraph discovery. This task has been highly motivated by the tremendously increasing size of existing graph databases. Due to this fact, there is urgent need of efficient and scaling approaches for frequent subgraph discovery especially with the high availability of cloud computing environments. This thesis deals with distributed frequent subgraph mining in the cloud. First, we provide the required material to understand the basic notions of our two research fields, namely graph mining and cloud computing. Then, we present the contributions of this thesis. In the first axis, we propose a novel approach for large-scale subgraph mining, using the MapReduce framework. The proposed approach provides a data partitioning technique that consider data characteristics. It uses the densities of graphs in order to partition the input data. Such a partitioning technique allows a balanced computational loads over the distributed collection of machines and replace the default arbitrary partitioning technique of MapReduce. We experimentally show that our approach decreases significantly the execution time and scales the subgraph discovery process to large graph databases. In the second axis, we address the multi-criteria optimization problem of tuning thresholds related to distributed frequent subgraph mining in cloud computing environments while optimizing the global monetary cost of storing and querying data in the cloud. We define cost models for managing and mining data with a large scale subgraph mining framework over a cloud architecture. We present an experimental validation of the proposed cost models in the case of distributed subgraph mining in the cloud. Fouille de sous-graphes Partitionnement de graphes Densité de graphe MapReduce Informatique dans les nuages Modèles de coûts Frequent subgraph mining Graph partitioning Graph density MapReduce Cloud computing Cost models
72	Data mining in large sets of complex data / Mineração de dados em grande conjuntos de dados complexos Cordeiro, Robson Leonardo Ferreira 29 August 2011 (has links) Due to the increasing amount and complexity of the data stored in the enterprises\' databases, the task of knowledge discovery is nowadays vital to support strategic decisions. However, the mining techniques used in the process usually have high computational costs that come from the need to explore several alternative solutions, in different combinations, to obtain the desired knowledge. The most common mining tasks include data classification, labeling and clustering, outlier detection and missing data prediction. Traditionally, the data are represented by numerical or categorical attributes in a table that describes one element in each tuple. Although the same tasks applied to traditional data are also necessary for more complex data, such as images, graphs, audio and long texts, the complexity and the computational costs associated to handling large amounts of these complex data increase considerably, making most of the existing techniques impractical. Therefore, especial data mining techniques for this kind of data need to be developed. This Ph.D. work focuses on the development of new data mining techniques for large sets of complex data, especially for the task of clustering, tightly associated to other data mining tasks that are performed together. Specifically, this Doctoral dissertation presents three novel, fast and scalable data mining algorithms well-suited to analyze large sets of complex data: the method Halite for correlation clustering; the method BoW for clustering Terabyte-scale datasets; and the method QMAS for labeling and summarization. Our algorithms were evaluated on real, very large datasets with up to billions of complex elements, and they always presented highly accurate results, being at least one order of magnitude faster than the fastest related works in almost all cases. The real data used come from the following applications: automatic breast cancer diagnosis, satellite imagery analysis, and graph mining on a large web graph crawled by Yahoo! and also on the graph with all users and their connections from the Twitter social network. Such results indicate that our algorithms allow the development of real time applications that, potentially, could not be developed without this Ph.D. work, like a software to aid on the fly the diagnosis process in a worldwide Healthcare Information System, or a system to look for deforestation within the Amazon Rainforest in real time / O crescimento em quantidade e complexidade dos dados armazenados nas organizações torna a extração de conhecimento utilizando técnicas de mineração uma tarefa ao mesmo tempo fundamental para aproveitar bem esses dados na tomada de decisões estratégicas e de alto custo computacional. O custo vem da necessidade de se explorar uma grande quantidade de casos de estudo, em diferentes combinações, para se obter o conhecimento desejado. Tradicionalmente, os dados a explorar são representados como atributos numéricos ou categóricos em uma tabela, que descreve em cada tupla um caso de teste do conjunto sob análise. Embora as mesmas tarefas desenvolvidas para dados tradicionais sejam também necessárias para dados mais complexos, como imagens, grafos, áudio e textos longos, a complexidade das análises e o custo computacional envolvidos aumentam significativamente, inviabilizando a maioria das técnicas de análise atuais quando aplicadas a grandes quantidades desses dados complexos. Assim, técnicas de mineração especiais devem ser desenvolvidas. Este Trabalho de Doutorado visa a criação de novas técnicas de mineração para grandes bases de dados complexos. Especificamente, foram desenvolvidas duas novas técnicas de agrupamento e uma nova técnica de rotulação e sumarização que são rápidas, escaláveis e bem adequadas à análise de grandes bases de dados complexos. As técnicas propostas foram avaliadas para a análise de bases de dados reais, em escala de Terabytes de dados, contendo até bilhões de objetos complexos, e elas sempre apresentaram resultados de alta qualidade, sendo em quase todos os casos pelo menos uma ordem de magnitude mais rápidas do que os trabalhos relacionados mais eficientes. Os dados reais utilizados vêm das seguintes aplicações: diagnóstico automático de câncer de mama, análise de imagens de satélites, e mineração de grafos aplicada a um grande grafo da web coletado pelo Yahoo! e também a um grafo com todos os usuários da rede social Twitter e suas conexões. Tais resultados indicam que nossos algoritmos permitem a criação de aplicações em tempo real que, potencialmente, não poderiam ser desenvolvidas sem a existência deste Trabalho de Doutorado, como por exemplo, um sistema em escala global para o auxílio ao diagnóstico médico em tempo real, ou um sistema para a busca por áreas de desmatamento na Floresta Amazônica em tempo real Agrupamento de correlação Correlation clustering Dados de média à alta dimensionalidade Labeling and summarization MapReduce MapReduce Moderante-to-high dimensionality data Rotulação e sumarização Terabyte-scale data mining
73	ALGORITMO K-MEANS PARALELO BASEADO EM HADOOP-MAPREDUCE PARA MINERAÇÃO DE DADOS AGRÍCOLAS Veloso, Lays Helena Lopes 29 April 2015 (has links) Made available in DSpace on 2017-07-21T14:19:24Z (GMT). No. of bitstreams: 1 Lays Veloso.pdf: 1140015 bytes, checksum: c544c69a03612a2909b7011c936788ee (MD5) Previous issue date: 2015-04-29 / Coordenação de Aperfeiçoamento de Pessoal de Nível Superior / This study aimed to investigate the use of a parallel K-means clustering algorithm,based on parallel MapReduce model, to improve the response time of the data mining. The parallel K-Means was implemented in three phases, performed in each iteration: assignment of samples to groups with nearest centroid by Mappers, in parallel; local grouping of samples assigned to the same group from Mappers using a Combiner and update of the centroids by the Reducer. The performance of the algorithm was evaluated in respect to SpeedUp and ScaleUp. To achieve this, experiments were run in single-node mode and on a Hadoop cluster consisting of six of-the-shelf computers. The data were clustered comprise flux towers measurements from agricultural regions and belong to Ameriflux. The results showed performance gains with increasing number of machines and the best time was obtained using six machines reaching the speedup of 3,25. To support our results, ANOVA analysis was applied from repetitions using 3, 4 and 6 machines in the cluster, respectively. The ANOVA show low variance between the execution times obtained for the same number of machines and a significant difference between means of each number of machines. The ScaleUp analysis show that the application scale well with an equivalent increase in data size and the number of machines, achieving similar performance. With the results as expected, this paper presents a parallel and scalable implementation of the K-Means to run on a Hadoop cluster and improve the response time of clustering to large databases. / Este trabalho teve como objetivo investigar a utilização de um algoritmo de agrupamento K-Means paralelo, com base no modelo paralelo MapReduce, para melhorar o tempo de resposta da mineração de dados. O K-Means paralelo foi implementado em três fases, executadas em cada iteração: atribuição das amostras aos grupos com centróide mais próximo pelos Mappers, em paralelo; agrupamento local das amostras atribuídas ao mesmo grupo pelos Mappers usando um Combiner e atualização dos centróides pelo Reducer. O desempenho do algoritmo foi avaliado quanto ao SpeedUp e ScaleUp. Para isso foram executados experimentos em modo single-node e em um cluster Hadoop formado por seis computadores de hardware comum. Os dados agrupados são medições de torres de fluxo de regiões agrícolas e pertencem a Ameriflux. Os resultados mostraram que com o aumento do número de máquinas houve ganho no desempenho, sendo que o melhor tempo obtido foi usando seis máquinas chegando ao SpeedUp de 3,25. Para apoiar nossos resultados foi construída uma tabela ANOVA a partir de repetições usando 3, 4 e 6 máquinas no cluster, pespectivamente. Os resultados da análise ANOVA mostram que existe pouca variância entre os tempos de execução obtidos com o mesmo número de máquinas e existe uma diferença significativa entre as médias para cada número de máquinas. A partir dos experimentos para analisar o ScaleUp verificou-se que a aplicação escala bem com o aumento equivalente do tamanho dos dados e do número de máquinas no cluster,atingindo um desempenho próximo. Com os resultados conforme esperados, esse trabalho apresenta uma implementação paralela e escalável do K-Means para ser executada em um cluster Hadoop e melhorar o tempo de resposta do agrupamento de grandes bases de dados. K-Means Paralelo MapReduce Hadoop dados de fluxo mineração de dados Parallel K-Means MapReduce Hadoop flux data data mining
74	Effiziente MapReduce-Parallelisierung von Entity Resolution-Workflows Kolb, Lars 11 December 2014 (has links) (PDF) In den vergangenen Jahren hat das neu entstandene Paradigma Infrastructure as a Service die IT-Welt massiv verändert. Die Bereitstellung von Recheninfrastruktur durch externe Dienstleister bietet die Möglichkeit, bei Bedarf in kurzer Zeit eine große Menge von Rechenleistung, Speicherplatz und Bandbreite ohne Vorabinvestitionen zu akquirieren. Gleichzeitig steigt sowohl die Menge der frei verfügbaren als auch der in Unternehmen zu verwaltenden Daten dramatisch an. Die Notwendigkeit zur effizienten Verwaltung und Auswertung dieser Datenmengen erforderte eine Weiterentwicklung bestehender IT-Technologien und führte zur Entstehung neuer Forschungsgebiete und einer Vielzahl innovativer Systeme. Ein typisches Merkmal dieser Systeme ist die verteilte Speicherung und Datenverarbeitung in großen Rechnerclustern bestehend aus Standard-Hardware. Besonders das MapReduce-Programmiermodell hat in den vergangenen zehn Jahren zunehmend an Bedeutung gewonnen. Es ermöglicht eine verteilte Verarbeitung großer Datenmengen und abstrahiert von den Details des verteilten Rechnens sowie der Behandlung von Hardwarefehlern. Innerhalb dieser Dissertation steht die Nutzung des MapReduce-Konzeptes zur automatischen Parallelisierung rechenintensiver Entity Resolution-Aufgaben im Mittelpunkt. Entity Resolution ist ein wichtiger Teilbereich der Informationsintegration, dessen Ziel die Entdeckung von Datensätzen einer oder mehrerer Datenquellen ist, die dasselbe Realweltobjekt beschreiben. Im Rahmen der Dissertation werden schrittweise Verfahren präsentiert, welche verschiedene Teilprobleme der MapReduce-basierten Ausführung von Entity Resolution-Workflows lösen. Zur Erkennung von Duplikaten vergleichen Entity Resolution-Verfahren üblicherweise Paare von Datensätzen mithilfe mehrerer Ähnlichkeitsmaße. Die Auswertung des Kartesischen Produktes von n Datensätzen führt dabei zu einer quadratischen Komplexität von O(n²) und ist deswegen nur für kleine bis mittelgroße Datenquellen praktikabel. Für Datenquellen mit mehr als 100.000 Datensätzen entstehen selbst bei verteilter Ausführung Laufzeiten von mehreren Stunden. Deswegen kommen sogenannte Blocking-Techniken zum Einsatz, die zur Reduzierung des Suchraums dienen. Die zugrundeliegende Annahme ist, dass Datensätze, die eine gewisse Mindestähnlichkeit unterschreiten, nicht miteinander verglichen werden müssen. Die Arbeit stellt eine MapReduce-basierte Umsetzung der Auswertung des Kartesischen Produktes sowie einiger bekannter Blocking-Verfahren vor. Nach dem Vergleich der Datensätze erfolgt abschließend eine Klassifikation der verglichenen Kandidaten-Paare in Match beziehungsweise Non-Match. Mit einer steigenden Anzahl verwendeter Attributwerte und Ähnlichkeitsmaße ist eine manuelle Festlegung einer qualitativ hochwertigen Strategie zur Kombination der resultierenden Ähnlichkeitswerte kaum mehr handhabbar. Aus diesem Grund untersucht die Arbeit die Integration maschineller Lernverfahren in MapReduce-basierte Entity Resolution-Workflows. Eine Umsetzung von Blocking-Verfahren mit MapReduce bedingt eine Partitionierung der Menge der zu vergleichenden Paare sowie eine Zuweisung der Partitionen zu verfügbaren Prozessen. Die Zuweisung erfolgt auf Basis eines semantischen Schlüssels, der entsprechend der konkreten Blocking-Strategie aus den Attributwerten der Datensätze abgeleitet ist. Beispielsweise wäre es bei der Deduplizierung von Produktdatensätzen denkbar, lediglich Produkte des gleichen Herstellers miteinander zu vergleichen. Die Bearbeitung aller Datensätze desselben Schlüssels durch einen Prozess führt bei Datenungleichverteilung zu erheblichen Lastbalancierungsproblemen, die durch die inhärente quadratische Komplexität verschärft werden. Dies reduziert in drastischem Maße die Laufzeiteffizienz und Skalierbarkeit der entsprechenden MapReduce-Programme, da ein Großteil der Ressourcen eines Clusters nicht ausgelastet ist, wohingegen wenige Prozesse den Großteil der Arbeit verrichten müssen. Die Bereitstellung verschiedener Verfahren zur gleichmäßigen Ausnutzung der zur Verfügung stehenden Ressourcen stellt einen weiteren Schwerpunkt der Arbeit dar. Blocking-Strategien müssen stets zwischen Effizienz und Datenqualität abwägen. Eine große Reduktion des Suchraums verspricht zwar eine signifikante Beschleunigung, führt jedoch dazu, dass ähnliche Datensätze, z. B. aufgrund fehlerhafter Attributwerte, nicht miteinander verglichen werden. Aus diesem Grunde ist es hilfreich, für jeden Datensatz mehrere von verschiedenen Attributen abgeleitete semantische Schlüssel zu generieren. Dies führt jedoch dazu, dass ähnliche Datensätze unnötigerweise mehrfach bezüglich verschiedener Schlüssel miteinander verglichen werden. Innerhalb der Arbeit werden deswegen Algorithmen zur Vermeidung solch redundanter Ähnlichkeitsberechnungen präsentiert. Als Ergebnis dieser Arbeit wird das Entity Resolution-Framework Dedoop präsentiert, welches von den entwickelten MapReduce-Algorithmen abstrahiert und eine High-Level-Spezifikation komplexer Entity Resolution-Workflows ermöglicht. Dedoop fasst alle in dieser Arbeit vorgestellten Techniken und Optimierungen in einem nutzerfreundlichen System zusammen. Der Prototyp überführt nutzerdefinierte Workflows automatisch in eine Menge von MapReduce-Jobs und verwaltet deren parallele Ausführung in MapReduce-Clustern. Durch die vollständige Integration der Cloud-Dienste Amazon EC2 und Amazon S3 in Dedoop sowie dessen Verfügbarmachung ist es für Endnutzer ohne MapReduce-Kenntnisse möglich, komplexe Entity Resolution-Workflows in privaten oder dynamisch erstellten externen MapReduce-Clustern zu berechnen. MapReduce Hadoop Entity Resolution Load Balancing Similarity Computation Transitive Closure MapReduce Hadoop Entity Resolution Load Balancing Similarity Computation Transitive Closure ddc:500
75	Model transformation on distributed platforms : decentralized persistence and distributed processing / Transformation de modèles sur plates-formes réparties : persistance décentralisée et traitement distribué Benelallam, Amine 07 December 2016 (has links) Grâce à sa promesse de réduire les efforts de développement et maintenance du logiciel, l’Ingénierie Dirigée par les Modèles (IDM) attire de plus en plus les acteurs industriels. En effet, elle a été adoptée avec succès dans plusieurs domaines tels que le génie civil, l’industrie automobile et la modernisation de logiciels.Toutefois, la taille croissante des modèles utilisés nécessite de concevoir des solutions passant à l’échelle afin de les traiter (transformer), et stocker (persister) de manière efficace. Une façon de pallier cette problématique est d’utiliser les systèmes et les bases de données répartis. D’une part, les paradigmes de programmation distribuée tels que MapReduce et Pregel peuvent simplifier la distribution de transformations des modèles (TM). Et d’autre part, l’avènement des base de données NoSQL permet le stockage efficace des modèles d’une manière distribuée. Dans le cadre de cette thèse, nous proposons une approche pour la transformation ainsi que pour la persistance de grands modèles.Nous nous basons d’un côté, sur le haut niveau d’abstraction fourni par les langages déclaratifs (relationnels) de transformation et d’un autre côté, sur la sémantique bien définie des paradigmes existants de programmation distribués, afin de livrer un moteur distribué de TM. La distribution est implicite et la syntaxe du langage n’est pas modifiée (aucune primitive de parallélisation n’est ajoutée). Nous étendons cette solution avec un algorithme efficace de distribution de modèles qui se base sur l’analyse statique des transformations et sur résultats récents sur le partitionnement équilibré des graphes. Nous avons appliqué notre approche à ATL, un langage relationnel de TM et MapReduce, un paradigme de programmation distribué. Finalement, nous proposons une solution pour stocker des modèles à l’aide de bases de données NoSQL, en particulier au travers d’un cadre d’applications de persistance répartie. / Model-Driven Engineering (MDE) is gaining ground in industrial environments, thanks to its promise of lowering software development and maintenance effort. It has been adopted with success in producing software for several domains like civil engineering, car manufacturing and modernization of legacy software systems. As the models that need to be handled in model-driven engineering grow in scale, it became necessary to design scalable algorithms for model transformation (MT) as well as well-suitable persistence frameworks. One way to cope with these issues is to exploit the wide availability of distributed clusters in the Cloud for the distributed execution of model transformations and their persistence. On one hand, programming models such as MapReduce and Pregel may simplify the development of distributed model transformations. On the other hand, the availability of different categories of NoSQL databases may help to store efficiently the models. However, because of the dense interconnectivity of models and the complexity of transformation logics, scalability in distributed model processing is challenging. In this thesis, we propose our approach for scalable model transformation and persistence. We exploit the high-level of abstraction of relational MT languages and the well-defined semantics of existing distributed programming models to provide a relational model transformation engine with implicit distributed execution. The syntax of the MT language is not modified and no primitive for distribution is added. Hence developers are not required to have any acquaintance with distributed programming.We extend this approach with an efficient model distribution algorithm, based on the analysis of relational model transformation and recent results on balanced partitioning of streaming graphs. We applied our approach to a popular MT language, ATL, on top of a well-known distributed programming model, MapReduce. Finally, we propose a multi-persistence backend for manipulating and storing models in NoSQL databases according to the modeling scenario. Especially, we focus on decentralized model persistence for distributed model transformations. Ingénierie Dirigée par les Modèles Distribution Atl MapReduce Model Driven Engineering Model Transformation & Persistence Distribution Atl MapReduce
76	Using MapReduce to scale event correlation discovery for process mining / Utilisation de MapReduce pour le passage à l'échelle de la corrélation des événements métiers dans le contexte de fouilles de processus Reguieg, Hicham 19 February 2014 (has links) Le volume des données relatives à l'exécution des processus métiers augmente de manière significative dans l'entreprise. Beaucoup de sources de données comprennent les événements liés à l'exécution des mêmes processus dans différents systèmes ou applications. La corrélation des événements est la tâche de l'analyse d'un référentiel de journaux d'événements afin de trouver l'ensemble des événements qui appartiennent à la même trace d'exécution du processus métier. Il s'agit d'une étape clé dans la découverte des processus à partir de journaux d'événements d'exécution. La corrélation des événements est une tâche de calcul intensif dans le sens où elle nécessite une analyse approfondie des relations entre les événements dans des dépôts très grande et qui évolue de plus en plus, et l'exploration de différentes relations possibles entre ces événements. Dans cette thèse, nous présentons une technique d'analyse de données évolutives pour soutenir d'une manière efficace la corrélation des événements pour les fouilles des processus métiers. Nous proposons une approche en deux étapes pour calculer les conditions de corrélation et héritier entraîné des instances de processus de journaux d'événements en utilisant la plateforme MapReduce. Les résultats expérimentaux montrent que l'algorithme s'adapte parfaitement à de grands ensembles de données. / The volume of data related to business process execution is increasing significantly in the enterprise. Many of data sources include events related to the execution of the same processes in various systems or applications. Event correlation is the task of analyzing a repository of event logs in order to find out the set of events that belong to the same business process execution instance. This is a key step in the discovery of business processes from event execution logs. Event correlation is a computationally-intensive task in the sense that it requires a deep analysis of very large and growing repositories of event logs, and exploration of various possible relationships among the events. In this dissertation, we present a scalable data analysis technique to support efficient event correlation for mining business processes. We propose a two-stages approach to compute correlation conditions and their entailed process instances from event logs using MapReduce framework. The experimental results show that the algorithm scales well to large datasets. MapReduce Processus métiers Fouilles de processus Découverte de la logique de processus Corrélation des événements Partitionnement de données MapReduce Business Process Process Mining Process Discovery Event Correlation
77	Loop parallelization in the cloud using OpenMP and MapReduce / Paralelização de laços na nuvem usando OpenMP e MapReduce Wottrich, Rodolfo Guilherme, 1990- 04 September 2014 (has links) Orientadores: Guido Costa Souza de Araújo, Rodolfo Jardim de Azevedo / Dissertação (mestrado) - Universidade Estadual de Campinas, Instituto de Computação / Made available in DSpace on 2018-08-24T12:44:05Z (GMT). No. of bitstreams: 1 Wottrich_RodolfoGuilherme_M.pdf: 2132128 bytes, checksum: b8ac1197909b6cdaf96b95d6097649f3 (MD5) Previous issue date: 2014 / Resumo: A busca por paralelismo sempre foi um importante objetivo no projeto de sistemas computacionais, conduzida principalmente pelo constante interesse na redução de tempos de execução de aplicações. Programação paralela é uma área de pesquisa ativa, na qual o interesse tem crescido devido à emergência de arquiteturas multicore. Por outro lado, aproveitar as grandes capacidades de computação e armazenamento da nuvem e suas características desejáveis de flexibilidade e escalabilidade oferece várias oportunidades interessantes para abordar problemas de pesquisa relevantes em computação científica. Infelizmente, em muitos casos a implementação de aplicações na nuvem demanda conhecimento específico de interfaces de programação paralela e APIs, o que pode se tornar um fardo na programação de aplicações complexas. Para superar tais limitações, neste trabalho propomos OpenMR, um modelo de execução baseado na sintaxe e nos princípios da API OpenMP que facilita a tarefa de programar sistemas distribuídos (isto é, clusters locais ou a nuvem remota). Especificamente, este trabalho aborda o problema de executar a paralelização de laços, usando OpenMR, em um ambiente distribuído, através do mapeamento de iterações do laço para nós MapReduce. Assim, a interface de programação para a nuvem se torna a própria linguagem, livrando o desenvolvedor da tarefa de se preocupar com detalhes da distribuição de cargas de trabalho e dados. Para avaliar a validade da proposta, modificamos benchmarks da suite SPEC OMP2012 para se encaixarem no modelo proposto, desenvolvemos outros toy benchmarks que são I/O-bound e executamo-os em duas configurações: (a) um cluster de computadores disponível localmente através de uma LAN padrão; e (b) clusters disponíveis remotamente através dos serviços Amazon AWS. Comparamos os resultados com a execução utilizando OpenMP em uma arquitetura SMP e mostramos que a técnica de paralelização proposta é factível e demonstra boa escalabilidade / Abstract: The pursuit of parallelism has always been an important goal in the design of computer systems, driven mainly by the constant interest in reducing program execution time. Parallel programming is an active research area, which has grown in interest due to the emergence of multicore architectures. On the other hand, harnessing the large computing and storage capabilities of the cloud and its desirable flexibility and scaling features offers a number of interesting opportunities to address some relevant research problems in scientific computing. Unfortunately, in many cases the implementation of applications on the cloud demands specific knowledge of parallel programming interfaces and APIs, which may become a burden when programming complex applications. To overcome such limitations, in this work we propose OpenMR, an execution model based on the syntax and principles of the OpenMP API which eases the task of programming distributed systems (i.e. local clusters or remote cloud). Specifically, this work addresses the problem of performing loop parallelization, using OpenMR, in a distributed environment, through the mapping of loop iterations to MapReduce nodes. By doing so, the cloud programming interface becomes the programming language itself, freeing the developer from the task of worrying about the details of distributing workload and data. To assess the validity of the proposal, we modified benchmarks from the SPEC OMP2012 suite to fit the proposed model, developed other I/O-bound toy benchmarks and executed them in two settings: (a) a computer cluster locally available through a standard LAN; and (b) clusters remotely available through the Amazon AWS services. We compare the results to the execution using OpenMP in an SMP architecture and show that the proposed parallelization technique is feasible and demonstrates good scalability / Mestrado / Ciência da Computação / Mestre em Ciência da Computação Programação paralela (Computação) Computação em nuvem OpenMP (Programação paralela) Parallel programming (Computer science) Cloud computing OpenMP (Parallel programming)
78	Data mining in large sets of complex data / Mineração de dados em grande conjuntos de dados complexos Robson Leonardo Ferreira Cordeiro 29 August 2011 (has links) Due to the increasing amount and complexity of the data stored in the enterprises\' databases, the task of knowledge discovery is nowadays vital to support strategic decisions. However, the mining techniques used in the process usually have high computational costs that come from the need to explore several alternative solutions, in different combinations, to obtain the desired knowledge. The most common mining tasks include data classification, labeling and clustering, outlier detection and missing data prediction. Traditionally, the data are represented by numerical or categorical attributes in a table that describes one element in each tuple. Although the same tasks applied to traditional data are also necessary for more complex data, such as images, graphs, audio and long texts, the complexity and the computational costs associated to handling large amounts of these complex data increase considerably, making most of the existing techniques impractical. Therefore, especial data mining techniques for this kind of data need to be developed. This Ph.D. work focuses on the development of new data mining techniques for large sets of complex data, especially for the task of clustering, tightly associated to other data mining tasks that are performed together. Specifically, this Doctoral dissertation presents three novel, fast and scalable data mining algorithms well-suited to analyze large sets of complex data: the method Halite for correlation clustering; the method BoW for clustering Terabyte-scale datasets; and the method QMAS for labeling and summarization. Our algorithms were evaluated on real, very large datasets with up to billions of complex elements, and they always presented highly accurate results, being at least one order of magnitude faster than the fastest related works in almost all cases. The real data used come from the following applications: automatic breast cancer diagnosis, satellite imagery analysis, and graph mining on a large web graph crawled by Yahoo! and also on the graph with all users and their connections from the Twitter social network. Such results indicate that our algorithms allow the development of real time applications that, potentially, could not be developed without this Ph.D. work, like a software to aid on the fly the diagnosis process in a worldwide Healthcare Information System, or a system to look for deforestation within the Amazon Rainforest in real time / O crescimento em quantidade e complexidade dos dados armazenados nas organizações torna a extração de conhecimento utilizando técnicas de mineração uma tarefa ao mesmo tempo fundamental para aproveitar bem esses dados na tomada de decisões estratégicas e de alto custo computacional. O custo vem da necessidade de se explorar uma grande quantidade de casos de estudo, em diferentes combinações, para se obter o conhecimento desejado. Tradicionalmente, os dados a explorar são representados como atributos numéricos ou categóricos em uma tabela, que descreve em cada tupla um caso de teste do conjunto sob análise. Embora as mesmas tarefas desenvolvidas para dados tradicionais sejam também necessárias para dados mais complexos, como imagens, grafos, áudio e textos longos, a complexidade das análises e o custo computacional envolvidos aumentam significativamente, inviabilizando a maioria das técnicas de análise atuais quando aplicadas a grandes quantidades desses dados complexos. Assim, técnicas de mineração especiais devem ser desenvolvidas. Este Trabalho de Doutorado visa a criação de novas técnicas de mineração para grandes bases de dados complexos. Especificamente, foram desenvolvidas duas novas técnicas de agrupamento e uma nova técnica de rotulação e sumarização que são rápidas, escaláveis e bem adequadas à análise de grandes bases de dados complexos. As técnicas propostas foram avaliadas para a análise de bases de dados reais, em escala de Terabytes de dados, contendo até bilhões de objetos complexos, e elas sempre apresentaram resultados de alta qualidade, sendo em quase todos os casos pelo menos uma ordem de magnitude mais rápidas do que os trabalhos relacionados mais eficientes. Os dados reais utilizados vêm das seguintes aplicações: diagnóstico automático de câncer de mama, análise de imagens de satélites, e mineração de grafos aplicada a um grande grafo da web coletado pelo Yahoo! e também a um grafo com todos os usuários da rede social Twitter e suas conexões. Tais resultados indicam que nossos algoritmos permitem a criação de aplicações em tempo real que, potencialmente, não poderiam ser desenvolvidas sem a existência deste Trabalho de Doutorado, como por exemplo, um sistema em escala global para o auxílio ao diagnóstico médico em tempo real, ou um sistema para a busca por áreas de desmatamento na Floresta Amazônica em tempo real Agrupamento de correlação Dados de média à alta dimensionalidade MapReduce Rotulação e sumarização Correlation clustering Labeling and summarization MapReduce Moderante-to-high dimensionality data Terabyte-scale data mining
79	Effiziente MapReduce-Parallelisierung von Entity Resolution-Workflows Kolb, Lars 08 December 2014 (has links) In den vergangenen Jahren hat das neu entstandene Paradigma Infrastructure as a Service die IT-Welt massiv verändert. Die Bereitstellung von Recheninfrastruktur durch externe Dienstleister bietet die Möglichkeit, bei Bedarf in kurzer Zeit eine große Menge von Rechenleistung, Speicherplatz und Bandbreite ohne Vorabinvestitionen zu akquirieren. Gleichzeitig steigt sowohl die Menge der frei verfügbaren als auch der in Unternehmen zu verwaltenden Daten dramatisch an. Die Notwendigkeit zur effizienten Verwaltung und Auswertung dieser Datenmengen erforderte eine Weiterentwicklung bestehender IT-Technologien und führte zur Entstehung neuer Forschungsgebiete und einer Vielzahl innovativer Systeme. Ein typisches Merkmal dieser Systeme ist die verteilte Speicherung und Datenverarbeitung in großen Rechnerclustern bestehend aus Standard-Hardware. Besonders das MapReduce-Programmiermodell hat in den vergangenen zehn Jahren zunehmend an Bedeutung gewonnen. Es ermöglicht eine verteilte Verarbeitung großer Datenmengen und abstrahiert von den Details des verteilten Rechnens sowie der Behandlung von Hardwarefehlern. Innerhalb dieser Dissertation steht die Nutzung des MapReduce-Konzeptes zur automatischen Parallelisierung rechenintensiver Entity Resolution-Aufgaben im Mittelpunkt. Entity Resolution ist ein wichtiger Teilbereich der Informationsintegration, dessen Ziel die Entdeckung von Datensätzen einer oder mehrerer Datenquellen ist, die dasselbe Realweltobjekt beschreiben. Im Rahmen der Dissertation werden schrittweise Verfahren präsentiert, welche verschiedene Teilprobleme der MapReduce-basierten Ausführung von Entity Resolution-Workflows lösen. Zur Erkennung von Duplikaten vergleichen Entity Resolution-Verfahren üblicherweise Paare von Datensätzen mithilfe mehrerer Ähnlichkeitsmaße. Die Auswertung des Kartesischen Produktes von n Datensätzen führt dabei zu einer quadratischen Komplexität von O(n²) und ist deswegen nur für kleine bis mittelgroße Datenquellen praktikabel. Für Datenquellen mit mehr als 100.000 Datensätzen entstehen selbst bei verteilter Ausführung Laufzeiten von mehreren Stunden. Deswegen kommen sogenannte Blocking-Techniken zum Einsatz, die zur Reduzierung des Suchraums dienen. Die zugrundeliegende Annahme ist, dass Datensätze, die eine gewisse Mindestähnlichkeit unterschreiten, nicht miteinander verglichen werden müssen. Die Arbeit stellt eine MapReduce-basierte Umsetzung der Auswertung des Kartesischen Produktes sowie einiger bekannter Blocking-Verfahren vor. Nach dem Vergleich der Datensätze erfolgt abschließend eine Klassifikation der verglichenen Kandidaten-Paare in Match beziehungsweise Non-Match. Mit einer steigenden Anzahl verwendeter Attributwerte und Ähnlichkeitsmaße ist eine manuelle Festlegung einer qualitativ hochwertigen Strategie zur Kombination der resultierenden Ähnlichkeitswerte kaum mehr handhabbar. Aus diesem Grund untersucht die Arbeit die Integration maschineller Lernverfahren in MapReduce-basierte Entity Resolution-Workflows. Eine Umsetzung von Blocking-Verfahren mit MapReduce bedingt eine Partitionierung der Menge der zu vergleichenden Paare sowie eine Zuweisung der Partitionen zu verfügbaren Prozessen. Die Zuweisung erfolgt auf Basis eines semantischen Schlüssels, der entsprechend der konkreten Blocking-Strategie aus den Attributwerten der Datensätze abgeleitet ist. Beispielsweise wäre es bei der Deduplizierung von Produktdatensätzen denkbar, lediglich Produkte des gleichen Herstellers miteinander zu vergleichen. Die Bearbeitung aller Datensätze desselben Schlüssels durch einen Prozess führt bei Datenungleichverteilung zu erheblichen Lastbalancierungsproblemen, die durch die inhärente quadratische Komplexität verschärft werden. Dies reduziert in drastischem Maße die Laufzeiteffizienz und Skalierbarkeit der entsprechenden MapReduce-Programme, da ein Großteil der Ressourcen eines Clusters nicht ausgelastet ist, wohingegen wenige Prozesse den Großteil der Arbeit verrichten müssen. Die Bereitstellung verschiedener Verfahren zur gleichmäßigen Ausnutzung der zur Verfügung stehenden Ressourcen stellt einen weiteren Schwerpunkt der Arbeit dar. Blocking-Strategien müssen stets zwischen Effizienz und Datenqualität abwägen. Eine große Reduktion des Suchraums verspricht zwar eine signifikante Beschleunigung, führt jedoch dazu, dass ähnliche Datensätze, z. B. aufgrund fehlerhafter Attributwerte, nicht miteinander verglichen werden. Aus diesem Grunde ist es hilfreich, für jeden Datensatz mehrere von verschiedenen Attributen abgeleitete semantische Schlüssel zu generieren. Dies führt jedoch dazu, dass ähnliche Datensätze unnötigerweise mehrfach bezüglich verschiedener Schlüssel miteinander verglichen werden. Innerhalb der Arbeit werden deswegen Algorithmen zur Vermeidung solch redundanter Ähnlichkeitsberechnungen präsentiert. Als Ergebnis dieser Arbeit wird das Entity Resolution-Framework Dedoop präsentiert, welches von den entwickelten MapReduce-Algorithmen abstrahiert und eine High-Level-Spezifikation komplexer Entity Resolution-Workflows ermöglicht. Dedoop fasst alle in dieser Arbeit vorgestellten Techniken und Optimierungen in einem nutzerfreundlichen System zusammen. Der Prototyp überführt nutzerdefinierte Workflows automatisch in eine Menge von MapReduce-Jobs und verwaltet deren parallele Ausführung in MapReduce-Clustern. Durch die vollständige Integration der Cloud-Dienste Amazon EC2 und Amazon S3 in Dedoop sowie dessen Verfügbarmachung ist es für Endnutzer ohne MapReduce-Kenntnisse möglich, komplexe Entity Resolution-Workflows in privaten oder dynamisch erstellten externen MapReduce-Clustern zu berechnen. info:eu-repo/classification/ddc/500 ddc:500
80	A framework for automatic optimization of MapReduce programs based on job parameter configurations. Lakkimsetti, Praveen Kumar January 1900 (has links) Master of Science / Department of Computing and Information Sciences / Mitchell L. Neilsen / Recently, cost-effective and timely processing of large datasets has been playing an important role in the success of many enterprises and the scientific computing community. Two promising trends ensure that applications will be able to deal with ever increasing data volumes: first, the emergence of cloud computing, which provides transparent access to a large number of processing, storage and networking resources; and second, the development of the MapReduce programming model, which provides a high-level abstraction for data-intensive computing. MapReduce has been widely used for large-scale data analysis in the Cloud [5]. The system is well recognized for its elastic scalability and fine-grained fault tolerance. However, even to run a single program in a MapReduce framework, a number of tuning parameters have to be set by users or system administrators to increase the efficiency of the program. Users often run into performance problems because they are unaware of how to set these parameters, or because they don't even know that these parameters exist. With MapReduce being a relatively new technology, it is not easy to find qualified administrators [4]. The major objective of this project is to provide a framework that optimizes MapReduce programs that run on large datasets. This is done by executing the MapReduce program on a part of the dataset using stored parameter combinations and setting the program with the most efficient combination and this modified program can be executed over the different datasets. We know that many MapReduce programs are used over and over again in applications like daily weather analysis, log analysis, daily report generation etc. So, once the parameter combination is set, it can be used on a number of data sets efficiently. This feature can go a long way towards improving the productivity of users who lack the skills to optimize programs themselves due to lack of familiarity with MapReduce or with the data being processed. Hadoop mapreduce Optimization Performance Parallel processing Job configuration parameters Distributed computing Computer Science (0984)

Search results