111

Processing Big Data in Main Memory and on GPU

Fathi Salmi, Meisam 08 June 2016 (has links)
No description available.
112

Maximizing Parallelization Opportunities by Automatically Inferring Optimal Container Memory for Asymmetrical Map Tasks

Shrimal, Shubhendra 18 July 2016 (has links)
No description available.
113

Approaches to Automatically Constructing Polarity Lexicons for Sentiment Analysis on Social Networks

Khuc, Vinh Ngoc 16 August 2012 (has links)
No description available.
114

Energy savings and performance improvements with SSDs in the Hadoop Distributed File System

Polato, Ivanilton 29 August 2016 (has links)
Energy issues have gathered strong attention over the past decade, reaching IT data processing infrastructures. These infrastructures now need to cope with that responsibility, adjusting existing platforms to reach acceptable performance while reducing energy consumption. As the de facto platform for Big Data, Apache Hadoop has evolved significantly over recent years, with more than 60 releases bringing new features. By implementing the MapReduce programming paradigm and leveraging HDFS, its distributed file system, Hadoop has become a reliable and fault-tolerant middleware for parallel and distributed computing over large datasets. Nevertheless, Hadoop may struggle under certain workloads, resulting in poor performance and high energy consumption. Users increasingly demand that high-performance computing solutions address sustainability and limit energy consumption. In this thesis, we introduce HDFSH, a hybrid storage mechanism for HDFS that uses a combination of hard disks and solid-state disks to achieve higher performance while saving power in Hadoop computations. HDFSH brings to the middleware the best of HDs (affordable cost per GB and high storage capacity) and SSDs (high throughput and low energy consumption) in a configurable fashion, using dedicated storage zones for each storage device type. We implemented our mechanism as a block placement policy for HDFS and assessed it over six recent releases of Hadoop with different architectural properties. Results indicate that our approach increases overall job performance while decreasing energy consumption under most hybrid configurations evaluated. Our results also show that, in many cases, storing only part of the data on SSDs results in significant energy savings and execution speedups.
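A minimal Python sketch (not the HDFSH implementation) of the kind of zone-based block placement the abstract describes: each block's replicas are assigned to an SSD zone or an HD zone according to a configurable SSD share. The node names, zone labels, and placement rule are assumptions for illustration only; real HDFS placement also weighs racks and free space.

    # Hypothetical cluster layout: every node offers an "ssd" zone and an "hd" zone.
    NODES = ["node1", "node2", "node3", "node4"]

    def place_block(block_index, ssd_percent=25, replicas=3):
        """Assign the replicas of one block to a storage zone on distinct nodes.

        ssd_percent is an assumed tuning knob: roughly that share of a file's
        blocks goes to the SSD zone (throughput), the rest to the HD zone
        (capacity). This mirrors the idea of dedicated zones per device type.
        """
        zone = "ssd" if block_index % 100 < ssd_percent else "hd"
        # Round-robin over nodes so replicas land on distinct machines.
        chosen = [NODES[(block_index + r) % len(NODES)] for r in range(replicas)]
        return [(node, zone) for node in chosen]

    if __name__ == "__main__":
        for i in range(0, 100, 20):
            print(f"block {i:3d} ->", place_block(i))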
115

Optimizing data management for large-scale distributed data warehouses using MapReduce

Arres, Billel 08 February 2016 (has links)
In this thesis, we address the problems of data partitioning and distribution for large-scale data warehouses distributed with MapReduce. First, we address the problem of data distribution. We propose a strategy to optimize data placement on distributed systems, based on the collocation principle. The objective is to optimize query performance through the definition of an intentional data distribution schema that reduces the amount of data transferred between nodes during processing, specifically during MapReduce's shuffle phase. Second, we propose a new approach to improve data partitioning and placement in distributed file systems, especially Hadoop-based systems, Hadoop being the standard implementation of the MapReduce paradigm. The aim is to overcome the default data partitioning and placement policies, which do not take any relational data characteristics into account. Our proposal proceeds in two steps: based on the query workload, it defines an efficient partitioning schema; the system then defines a data distribution schema that best meets users' needs by collocating data blocks on the same or the closest nodes (see the sketch below). The objective here is to optimize query execution and parallel processing performance by improving data access. Our third proposal addresses the problem of workload dynamicity, since users' analytical needs evolve over time. We propose the use of multi-agent systems (MAS) as an extension of our data partitioning and placement approach. Through the autonomy and self-control that characterize MAS, we developed a platform that automatically defines new distribution schemas as new queries arrive in the system and rebalances data accordingly. This relieves the system administrator of the burden of managing load balancing, besides improving query performance through careful data partitioning and placement policies. Finally, to validate our contributions, we conducted a set of experiments to evaluate the approaches proposed in this manuscript. We study the impact of intentional data partitioning and distribution on the data warehouse loading phase, the execution of analytical queries, the construction of OLAP cubes, and load balancing. We also defined a cost model that allowed us to evaluate and validate the partitioning strategy proposed in this work.
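A simplified Python sketch of the workload-driven affinity idea (an illustration, not the thesis's actual algorithm): count how often pairs of attributes are accessed together by queries in the workload, then greedily merge high-affinity attributes into the same vertical fragment. The workload, attribute names, and threshold are assumptions.

    from collections import defaultdict
    from itertools import combinations

    # Hypothetical query workload: each query lists the warehouse attributes it touches.
    WORKLOAD = [
        {"date", "store", "sales"},
        {"date", "sales"},
        {"store", "region"},
        {"date", "store", "sales"},
    ]

    def affinity_matrix(workload):
        """Count how often each pair of attributes appears in the same query."""
        aff = defaultdict(int)
        for query in workload:
            for a, b in combinations(sorted(query), 2):
                aff[(a, b)] += 1
        return aff

    def group_attributes(workload, threshold=2):
        """Greedily merge attributes whose pairwise affinity meets the threshold."""
        groups = {a: {a} for q in workload for a in q}
        for (a, b), count in sorted(affinity_matrix(workload).items(),
                                    key=lambda kv: -kv[1]):
            if count >= threshold and groups[a] is not groups[b]:
                merged = groups[a] | groups[b]
                for attr in merged:
                    groups[attr] = merged
        # Deduplicate the shared group objects into distinct fragments.
        return {frozenset(g) for g in groups.values()}

    if __name__ == "__main__":
        for fragment in group_attributes(WORKLOAD):
            print("vertical fragment:", sorted(fragment))

Attributes that end up in the same fragment would then be stored in the same blocks and collocated on the same or nearby nodes, which is the intuition behind the placement schema described above.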
117

Supervised learning of Symbolic Data and adaptation to Big Data

Haddad, Raja 23 November 2016 (has links)
This thesis proposes new supervised methods for Symbolic Data Analysis (SDA) and extends this domain to Big Data. We start by creating a supervised method called HistSyr, which automatically converts continuous variables into the histograms that are most discriminant for the classes of individuals. We also propose a new symbolic decision tree method that we call SyrTree. SyrTree accepts many types of explanatory and target variables and can use all symbolic variables describing the target to construct the decision tree. Finally, we extend HistSyr to Big Data by creating a distributed method called CloudHistSyr. Using the Map/Reduce framework, CloudHistSyr creates the most discriminant histograms for data too large for HistSyr. We tested CloudHistSyr on Amazon Web Services (AWS). We show the efficiency of our method on simulated data and on actual car traffic data in Nantes. We conclude on the overall utility of CloudHistSyr, which, through its results, allows the study of massive data using existing symbolic analysis methods.
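The Map/Reduce pattern behind distributed histogram construction can be illustrated with a small, self-contained Python sketch: the map step emits ((class, bin), 1) pairs for one continuous variable and the reduce step sums them into per-class histograms. The records, class labels, and bin edges are invented, and the criterion HistSyr actually uses to pick the most discriminant binning is not shown.

    from collections import Counter
    from bisect import bisect_right

    # Hypothetical records: (class_label, value of one continuous variable).
    RECORDS = [("A", 1.2), ("A", 3.7), ("B", 3.9), ("B", 8.1), ("A", 2.5), ("B", 7.3)]
    BIN_EDGES = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]  # one assumed candidate binning

    def map_phase(records):
        """Emit ((class, bin_index), 1) for every record, MapReduce-style."""
        for label, value in records:
            bin_index = bisect_right(BIN_EDGES, value) - 1
            yield (label, bin_index), 1

    def reduce_phase(pairs):
        """Sum the counts per (class, bin) key."""
        counts = Counter()
        for key, one in pairs:
            counts[key] += one
        return counts

    if __name__ == "__main__":
        histograms = reduce_phase(map_phase(RECORDS))
        for (label, bin_index), count in sorted(histograms.items()):
            lo, hi = BIN_EDGES[bin_index], BIN_EDGES[bin_index + 1]
            print(f"class {label}, bin [{lo}, {hi}): {count}")

In a distributed setting each worker would run the map step over its partition of the data, and the reduce step would aggregate the partial counts; choosing among candidate binnings by their discriminative power would follow as a separate comparison step.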
118

EXPLOITING THE SPATIAL DIMENSION OF BIG DATA JOBS FOR EFFICIENT CLUSTER JOB SCHEDULING

Akshay Jajoo (9530630) 16 December 2020 (has links)
With the growing business impact of distributed big data analytics jobs, it has become crucial to optimize their execution and resource consumption. In most cases, such jobs consist of multiple sub-entities called tasks and are executed online in a large shared distributed computing system. The ability to accurately estimate runtime properties and coordinate execution of the sub-entities of a job allows a scheduler to schedule jobs efficiently. This thesis presents the first study that highlights the spatial dimension, an inherent property of distributed jobs, and underscores its importance in efficient cluster job scheduling. We develop two new classes of spatial-dimension-based algorithms to address the two primary challenges of cluster scheduling. First, we propose, validate, and design two complete systems that employ learning algorithms exploiting the spatial dimension. We demonstrate high similarity in runtime properties between sub-entities of the same job through detailed trace analysis of four different industrial cluster traces. We identify design challenges and propose principles for a sampling-based learning system for two examples: first a coflow scheduler, and second a cluster job scheduler. We also propose, design, and demonstrate the effectiveness of new multi-task scheduling algorithms based on effective synchronization across the spatial dimension. We underline, and validate through experimental analysis, the importance of synchronization between the sub-entities (flows, tasks) of a distributed entity (coflow, data analytics job) for its efficient execution. We also highlight that ignoring sibling sub-entities when scheduling a sub-entity can lead to sub-optimal overall cluster performance. We propose, design, and implement a full coflow scheduler based on these assertions.
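As a loose illustration of the sampling idea only (not the systems built in the thesis), the sketch below runs a small random sample of each job's tasks, uses the sample mean as the runtime estimate for the job's remaining tasks (relying on the observed similarity between sibling tasks), and dispatches jobs shortest-estimated-work-first. All job names and task durations are invented.

    import random

    def estimate_job_runtime(task_durations, sample_size=3, rng=random):
        """Estimate a job's per-task runtime from a small sample of its tasks.

        Exploits the observation that sibling tasks of the same job tend to
        have similar runtimes (the 'spatial dimension').
        """
        sample = rng.sample(task_durations, min(sample_size, len(task_durations)))
        return sum(sample) / len(sample)

    def schedule_shortest_first(jobs, sample_size=3):
        """Order jobs by their estimated total remaining work."""
        estimates = {
            name: estimate_job_runtime(tasks, sample_size) * len(tasks)
            for name, tasks in jobs.items()
        }
        return sorted(jobs, key=lambda name: estimates[name])

    if __name__ == "__main__":
        random.seed(0)
        # Hypothetical jobs with per-task durations (seconds).
        jobs = {
            "job-small": [4, 5, 4, 6, 5],
            "job-large": [40, 42, 39, 41, 45, 38],
            "job-medium": [12, 11, 13, 12],
        }
        print("dispatch order:", schedule_shortest_first(jobs))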
119

A Research of Multi-dimensional and Multi-granular Data Mining with In-memory Computing with Yahoo User Profile

林洸儂, Lin, Guang-Nung Unknown Date (has links)
In recent years, advances in cloud computing and improvements in computer hardware have made it feasible to build cluster computing systems by horizontally scaling large numbers of machines. Apache Hadoop is an open-source software framework of the Apache Foundation, a distributed system based on Google's MapReduce and the Google File System that can manage clusters of thousands of machines. Its distributed file system, HDFS, provides petabyte-scale storage, and the MapReduce framework splits an application into small tasks that are distributed to the cluster's compute nodes for execution. Meanwhile, enterprises have accumulated huge amounts of data, and how to process and analyze such structured and unstructured data has become a popular research topic, so traditional data mining methods and algorithms must be adjusted and improved in line with cloud computing techniques and distributed frameworks. Association rules reveal the relations hidden among items in a large database; a common application is market basket analysis. Rules are usually mined within a fixed dimension and granularity, which can miss rules that only hold in a narrower range: mining a whole year of transactions will not reveal rules among items bought to celebrate Christmas, whereas restricting the time window to December will. Apriori is a well-known algorithm for mining association rules: it generates candidate itemsets, filters them against a user-defined minimum support to obtain frequent itemsets, and then applies a minimum confidence threshold to derive the rules. With k distinct items there can be up to 2^k - 1 candidate itemsets, and counting frequent itemsets requires repeated scans of the entire database, so these two main steps consume a large amount of computing power. This study therefore partitions the database into multiple segments, mines association rules within each segment, and incrementally merges the results, solving the problem of losing rules when mining only over a large range. The method is implemented on the Spark distributed computing architecture, and parallel execution on a computer cluster reduces the time needed to mine the association rules.
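A toy Python version of the partition-then-merge approach described above (close in spirit to the classic SON algorithm, and not the thesis's Spark implementation): each database segment is mined locally with a scaled-down support threshold, the union of locally frequent itemsets forms a candidate set, and a second pass counts the candidates' global support. The transactions and thresholds are invented.

    from itertools import combinations
    from collections import Counter

    TRANSACTIONS = [
        {"milk", "bread"}, {"milk", "eggs"}, {"bread", "eggs"},
        {"milk", "bread", "eggs"}, {"bread"}, {"milk", "bread"},
    ]
    MIN_SUPPORT = 3     # absolute global support threshold (assumed)
    NUM_SEGMENTS = 2    # the database is split into this many segments

    def local_frequent(segment, local_support):
        """Find itemsets frequent within one segment (brute force for brevity)."""
        counts = Counter()
        for basket in segment:
            for size in range(1, len(basket) + 1):
                for itemset in combinations(sorted(basket), size):
                    counts[itemset] += 1
        return {s for s, c in counts.items() if c >= local_support}

    def partitioned_frequent_itemsets(transactions, min_support, num_segments):
        size = -(-len(transactions) // num_segments)   # ceiling division
        segments = [transactions[i:i + size]
                    for i in range(0, len(transactions), size)]

        # Pass 1: the union of locally frequent itemsets becomes the candidate set.
        candidates = set()
        for segment in segments:
            local_support = max(1, (min_support * len(segment)) // len(transactions))
            candidates |= local_frequent(segment, local_support)

        # Pass 2: count the global support of the candidates only.
        global_counts = Counter()
        for basket in transactions:
            for candidate in candidates:
                if set(candidate) <= basket:
                    global_counts[candidate] += 1
        return {c: n for c, n in global_counts.items() if n >= min_support}

    if __name__ == "__main__":
        result = partitioned_frequent_itemsets(TRANSACTIONS, MIN_SUPPORT, NUM_SEGMENTS)
        for itemset, support in sorted(result.items()):
            print(itemset, support)

In a Spark implementation, pass 1 would typically run per partition on the executors and pass 2 would aggregate candidate counts across the cluster, which is what makes the partitioned approach parallelizable.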
120

Optimisation of a Hadoop cluster based on SDN in cloud computing for big data applications

Khaleel, Ali January 2018 (has links)
Big data has received a great deal of attention from many sectors, including academia, industry and government. The Hadoop framework has emerged to support its storage and analysis using the MapReduce programming model. However, this framework is a complex system with more than 150 parameters, some of which can exert a considerable effect on the performance of a Hadoop job. Optimally tuning the Hadoop parameters is a difficult as well as time-consuming task. In this thesis, an optimisation approach is presented to improve the performance of the Hadoop framework by setting the values of the Hadoop parameters automatically. Specifically, genetic programming is used to construct a fitness function that represents the interrelations among the Hadoop parameters. Then, a genetic algorithm is employed to search for the optimum or near-optimum values of the Hadoop parameters. A Hadoop cluster is configured on two servers at Brunel University London to evaluate the performance of the proposed optimisation approach. The experimental results show that the performance of a Hadoop MapReduce job for 20 GB on the WordCount application is improved by 69.63% and 30.31% when compared to the default settings and the state of the art, respectively, whilst on the TeraSort application it is improved by 73.39% and 55.93%. For further optimisation, SDN is also employed to improve the performance of a Hadoop job. The experimental results show that the performance of a Hadoop job in an SDN network for 50 GB is improved by 32.8% when compared to a traditional network, whilst on the TeraSort application the improvement for 50 GB is on average 38.7%. An effective computing platform is also presented in this thesis to support solar irradiation data analytics. It is built on RHIPE to provide fast analysis and calculation for solar irradiation datasets. The performance of RHIPE is compared with the R language in terms of accuracy, scalability and speedup. The speedup of RHIPE is evaluated by Gustafson's Law, which is revised to enhance the performance of parallel computation on intensive irradiation data sets in a cluster computing environment such as Hadoop. The performance of the proposed work is evaluated using a Hadoop cluster based on the Microsoft Azure cloud, and the experimental results show that RHIPE provides considerable improvements over the R language. Finally, an effective routing algorithm based on SDN to improve the performance of a Hadoop job in a large-scale cluster in a data centre network is presented. The proposed algorithm improves the performance of a Hadoop job during the shuffle phase by allocating efficient paths for each shuffling flow according to the network resource demand of each flow as well as their size and number. Furthermore, it also allocates alternative paths for each shuffling flow in the case of any link crash or failure. This algorithm is evaluated on two network topologies, namely fat-tree and leaf-spine, built with the EstiNet emulator software. The experimental results show that the proposed approach improves the performance of a Hadoop job in a data centre network.
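To make the tuning loop concrete, a compact genetic-algorithm sketch over a few Hadoop-style configuration keys is shown below. The keys are commonly used Hadoop 2.x parameter names, but the value ranges and the synthetic fitness function merely stand in for the genetic-programming model fitted from measured runs in the thesis, so the sketch is illustrative only.

    import random

    # Search space: Hadoop 2.x-style configuration keys with assumed value ranges.
    PARAM_RANGES = {
        "mapreduce.task.io.sort.mb": (50, 800),
        "mapreduce.task.io.sort.factor": (10, 100),
        "mapreduce.job.reduces": (1, 64),
    }

    def synthetic_fitness(config):
        """Stand-in for the fitted model of job runtime (lower is better).

        The thesis builds this function with genetic programming from measured
        Hadoop runs; here it is just an arbitrary smooth surrogate.
        """
        return sum((value - (low + high) / 2) ** 2 / (high - low) ** 2
                   for (low, high), value in
                   zip(PARAM_RANGES.values(), config.values()))

    def random_config(rng):
        return {k: rng.randint(lo, hi) for k, (lo, hi) in PARAM_RANGES.items()}

    def crossover(a, b, rng):
        return {k: rng.choice((a[k], b[k])) for k in PARAM_RANGES}

    def mutate(config, rng, rate=0.2):
        return {k: (rng.randint(*PARAM_RANGES[k]) if rng.random() < rate else v)
                for k, v in config.items()}

    def genetic_search(generations=30, pop_size=20, seed=1):
        rng = random.Random(seed)
        population = [random_config(rng) for _ in range(pop_size)]
        for _ in range(generations):
            population.sort(key=synthetic_fitness)
            parents = population[: pop_size // 2]            # elitist selection
            children = [mutate(crossover(rng.choice(parents),
                                         rng.choice(parents), rng), rng)
                        for _ in range(pop_size - len(parents))]
            population = parents + children
        return min(population, key=synthetic_fitness)

    if __name__ == "__main__":
        print("best configuration found:", genetic_search())

In the actual approach, evaluating fitness would mean either querying the fitted model or running a benchmark job with the candidate configuration, which is why keeping the population and generation counts small matters in practice.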
