81 |
Big Data, capacitações dinâmicas e valor para o negócio. / Big data, dynamic capabilities and business value. Seller, Michel Lens 17 May 2018
A conjunção das recentes tecnologias de mídias sociais, mobilidade e computação em nuvem coloca à disposição das empresas um grande volume de dados variados e recebidos em grande velocidade. Muitas empresas começam a perceber neste fenômeno, conhecido como Big Data, oportunidades de extração de valor para seus negócios. A literatura aponta diversos mecanismos pelos quais Big Data se transforma em valor para a empresa. O primeiro deles é pela geração de agilidade, aqui entendida como a capacidade de perceber e rapidamente reagir a mudanças e oportunidades em seu ambiente competitivo. Outro mecanismo é a utilização de Big Data como facilitador de capacitações dinâmicas que resultam em melhorias operacionais, por meio do aprofundamento (exploit) de alguma capacitação específica. Por fim, Big Data pode ser facilitador de capacitações dinâmicas que resultem em inovação (explore de novas capacitações) e no lançamento de novos produtos e serviços no mercado. Dentro deste contexto, o presente estudo se propõe a investigar a abordagem da utilização de Big Data por empresas inseridas em diferentes contextos competitivos e com diferentes níveis de capacitação de TI. Faz parte também do objetivo da pesquisa entender como as empresas adequaram seus processos de negócio para incorporar o grande volume de dados que têm à disposição. Por meio de estudos de caso realizados em empresas de grande porte de diferentes segmentos e com grande variabilidade na utilização de Big Data, o estudo verifica utilização de Big Data como viabilizador de capacitações dinâmicas atuando no aperfeiçoamento de capacitações operacionais, na diversificação de negócios e na inovação. Além disso, verifica-se a tendência de acoplamento de machine learning às soluções de Big Data, quando o objetivo é a obtenção de agilidade operacional. A capacitação de TI também se mostra determinante da quantidade e complexidade das ações competitivas lançadas pelas empresas com o uso de Big Data. 
Por fim, é possível antever que, graças às facilidades trazidas pela tecnologia de cloud, recursos de TI serão crescentemente liberados para atuação junto ao negócio - como, por exemplo, em iniciativas de Big Data - fortalecendo as capacitações dinâmicas da empresa e gerando vantagem competitiva. / The combination of social media, mobility and cloud computing technologies has dramatically increased the volume, variety and velocity of data available to firms. Many companies see this phenomenon, known as Big Data, as a source of business value. The literature describes different mechanisms for transforming Big Data into business value. The first is agility, understood here as the ability to sense and rapidly respond to changes and opportunities in the competitive environment. Another mechanism is the use of Big Data as an enabler of dynamic capabilities that result in operational improvements, through the deepening (exploit) of a specific operational capability. Finally, Big Data can be the facilitator of dynamic capabilities that result in innovation (explore of new capabilities) and in the launch of new products and services in the market. Within this context, the goal of this study is to investigate the approach to Big Data usage in companies from different competitive scenarios and with different levels of IT capability. It is also part of the objectives to investigate how companies changed their business processes to accommodate the huge volume of data available to them. Through case studies in large companies from different industries and with different Big Data approaches, the study shows Big Data as an enabler of dynamic capabilities that result in the improvement of operational capabilities, in the diversification of business and in innovation. The study also identifies a trend of coupling machine learning with Big Data solutions when the objective is operational agility.
IT capability proves to be a determinant of the quantity and complexity of the competitive actions launched with Big Data. To conclude, it can be anticipated that, thanks to the simplification brought by cloud technologies, IT resources will increasingly be freed to work closely with the business - for example, in Big Data initiatives - strengthening dynamic capabilities and creating value for the business.
|
82 |
Griddler : uma estratégia configurável para armazenamento distribuído de objetos peer-to-peer que combina replicação e erasure coding com sistema de cache / Caetano, André Francisco Morielo. January 2017
Orientador: Carlos Roberto Valêncio / Banca: Geraldo Francisco Donega Zafalon / Banca: Pedro Luiz Pizzigatti Correa / Resumo: Sistemas de gerenciamento de banco de dados, na sua essência, almejam garantir o armazenamento confiável da informação. Também é tarefa de um sistema de gerenciamento de banco de dados oferecer agilidade no acesso às informações. Nesse contexto, é de grande interesse considerar alguns fenômenos recentes: a progressiva geração de conteúdo não-estruturado, como imagens e vídeo, o decorrente aumento do volume de dados em formato digital nas mais diversas mídias e o grande número de requisições por parte de usuários cada vez mais exigentes. Esses fenômenos fazem parte de uma nova realidade, denominada Big Data, que impõe aos projetistas de bancos de dados um aumento nos requisitos de flexibilidade, escalabilidade, resiliência e velocidade dos seus sistemas. Para suportar dados não-estruturados foi preciso se desprender de algumas limitações dos bancos de dados convencionais e definir novas arquiteturas de armazenamento. Essas arquiteturas definem padrões para gerenciamento dos dados, mas um sistema de armazenamento deve ter suas especificidades ajustadas em cada nível de implementação. Em termos de escalabilidade, por exemplo, cabe a escolha entre sistemas com algum tipo de centralização ou totalmente descentralizados. Por outro lado, em termos de resiliência, algumas soluções utilizam um esquema de replicação para preservar a integridade dos dados por meio de cópias, enquanto outras técnicas visam a otimização do volume de dados armazenados. Por fim, ao mesmo tempo que são... / Abstract: Database management systems, in essence, aim to ensure the reliable storage of information. It is also the task of a database management system to provide agility in accessing information. 
In this context, it is of great interest to consider some recent phenomena: the progressive generation of unstructured content such as images and video, the consequent increase in the volume of data in digital format across the most diverse media, and the large number of requests from increasingly demanding users. These phenomena are part of a new reality, named Big Data, that imposes on database designers increased requirements for the flexibility, scalability, resiliency, and speed of their systems. To support unstructured data, it was necessary to drop some limitations of conventional databases and define new storage architectures. These architectures define standards for data management, but a storage system must have its specifics adjusted at each level of implementation. In terms of scalability, for example, one must choose between systems with some type of centralization and totally decentralized ones. In terms of resiliency, on the other hand, some solutions use a replication scheme to preserve the integrity of the data through copies, while other techniques aim to optimize the volume of stored data. Finally, while new network and disk technologies are being developed, one might think of using caching to optimize access to what is stored. This work explores and analyzes the different levels in the development of distributed storage systems. The objective of this work is to present an architecture that combines different resilience techniques. The scientific contribution of this work is, in addition to a totally decentralized proposal for data allocation, the use of an access cache structure with adaptive algorithms in this environment. / Mestre
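The trade-off between replication and erasure coding mentioned in this abstract can be illustrated with a minimal sketch. It is not the Griddler scheme, whose details the abstract does not give; it assumes the simplest possible erasure code, a single XOR parity fragment, and the function names `encode` and `recover` are hypothetical.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length fragments."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(fragments):
    """Return the data fragments followed by one XOR parity fragment.

    Assumption: all fragments have the same length.
    """
    parity = reduce(xor_bytes, fragments)
    return list(fragments) + [parity]


def recover(stored, lost_index):
    """Rebuild the fragment at lost_index by XOR-ing all surviving fragments."""
    survivors = [f for i, f in enumerate(stored) if i != lost_index]
    return reduce(xor_bytes, survivors)


data = [b"abcd", b"efgh", b"ijkl"]
stored = encode(data)          # 4 fragments stored for 3 fragments of data
rebuilt = recover(stored, 1)   # pretend fragment 1 was lost
```

In this sketch, surviving the loss of any one fragment costs 4/3 of the original volume, whereas keeping one full extra copy (replication) would cost 2 times the original volume. The price is that recovery needs to read every surviving fragment, which is one reason a design may combine both techniques.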
|
83 |
algorithmes de big data adaptés aux réseaux véhiculaires pour modélisation de comportement de conducteur / big data algorithms adapted to vehicular networks for driver's behavior modeling Bourdy, Emilien 03 December 2018
Les technologies Big Data gagnent de plus en plus d’attentions de communautés de recherches variées, surtout depuis que les données deviennent si volumineuses, qu’elles posent de réels problèmes, et que leurs traitements ne sont maintenant possibles que grâce aux grandes capacités de calculs des équipements actuels. De plus, les réseaux véhiculaires, aussi appelés VANET pour Vehicular Ad-hoc Networks, se développent considérablement et ils constituent une part de plus en plus importante du marché du véhicule. La topologie de ces réseaux en constante évolution est accompagnée par des données massives venant d’un volume croissant de véhicules connectés. Dans cette thèse, nous discutons dans notre première contribution des problèmes engendrés par la croissance rapide des VANET, et nous étudions l’adaptation des technologies liées aux Big Data pour les VANET. Ainsi, pour chaque étape clé du Big Data, nous posons le problème des VANET. Notre seconde contribution est l’extraction des caractéristiques liées aux VANET afin d’obtenir des données provenant de ceux-ci. Pour ce faire, nous discutons de comment établir des scénarios de tests, et comment émuler un environnement afin, dans un premier temps, de tester une implémentation dans un environnement contrôlé, avant de pouvoir effectuer des tests dans un environnement réel, afin d’obtenir de vraies données provenant des VANET. Pour notre troisième contribution, nous proposons une approche originale de la modélisation du comportement de conducteur. Cette approche est basée sur un algorithme permettant d’extraire des représentants d’une population, appelés exemplaires, en utilisant un concept de densité locale dans un voisinage. / Big Data is gaining a lot of attention from various research communities, as massive data are becoming a real issue and processing such data is now possible thanks to the high computation capacity of today’s equipment.
Meanwhile, the era of Vehicular Ad-hoc Networks (VANET) is beginning. Connected vehicles are being manufactured and will become an important part of the vehicle market. The topology of this type of network is in constant evolution, accompanied by massive data coming from an increasing volume of connected vehicles in the network. In this thesis, our first contribution discusses different aspects of Big Data in VANET. Thus, for each key step of Big Data, we raise VANET issues. The second contribution is the extraction of VANET characteristics in order to collect data. To do that, we discuss how to establish test scenarios and how to emulate an environment for these tests. We first test an implementation in a controlled environment, before performing tests in a real environment in order to obtain real VANET data. For the third contribution, we propose an original approach for driver's behavior modeling. This approach is based on an algorithm that extracts representatives of a population, called samples, using a concept of local density in a neighborhood.
|
84 |
Ferramenta de programação e processamento para execução de aplicações com grandes quantidades de dados em ambientes distribuídos. / Programming and processing tool for execution of applications with large amounts of data in distributed environments. Darlon Vasata 03 September 2018
A temática envolvendo o processamento de grandes quantidades de dados é um tema amplamente discutido nos tempos atuais, envolvendo seus desafios e aplicabilidade. Neste trabalho é proposta uma ferramenta de programação para desenvolvimento e um ambiente de execução para aplicações com grandes quantidades de dados. O uso da ferramenta visa obter melhor desempenho de aplicações neste cenário, explorando o uso de recursos físicos como múltiplas linhas de execução em processadores com diversos núcleos e a programação distribuída, que utiliza múltiplos computadores interligados por uma rede de comunicação, de forma que estes operam conjuntamente em uma mesma aplicação, dividindo entre tais máquinas sua carga de processamento. A ferramenta proposta consiste na utilização de blocos de programação, de forma que tais blocos sejam compostos por tarefas, e sejam executados utilizando o modelo produtor consumidor, seguindo um fluxo de execução definido. A utilização da ferramenta permite que a divisão das tarefas entre as máquinas seja transparente ao usuário. Com a ferramenta, diversas funcionalidades podem ser utilizadas, como o uso de ciclos no fluxo de execução ou no adiantamento de tarefas, utilizando a estratégia de processamento especulativo. Os resultados do trabalho foram comparados a duas outras ferramentas de processamento de grandes quantidades de dados, Hadoop e Spark. Esses resultados indicam que o uso da ferramenta proporciona aumento no desempenho das aplicações, principalmente quando executado em clusters homogêneos. / The processing of large amounts of data, with its challenges and applicability, is a widely discussed subject today. This work proposes a programming tool for development and an execution environment for applications with large amounts of data.
The tool aims to achieve better application performance in this scenario by exploiting physical resources such as multiple threads of execution on multi-core processors, and distributed programming, which uses multiple computers interconnected by a communication network so that they operate jointly on the same application, dividing its processing load among the machines. The proposed tool is based on programming blocks: the blocks are composed of tasks and are executed using the producer-consumer model, following a defined execution flow. The tool makes the division of tasks among the machines transparent to the user. With the tool, several functionalities can be used, such as cycles in the execution flow or running tasks ahead of time through speculative processing. The results were compared with two other frameworks, Hadoop and Spark. These results indicate that the use of the tool provides an increase in the performance of the applications, mostly when executed in homogeneous clusters.
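The idea of blocks executed under the producer-consumer model along an execution flow can be sketched as follows. This is only an illustration on a single machine with threads, not the tool described in the thesis; the names `run_block` and `run_pipeline` are hypothetical, and each block is assumed to hold a single task function.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the stream


def run_block(task, inbox, outbox):
    """Consume items from inbox, apply the block's task, produce to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # propagate shutdown to the next block
            break
        outbox.put(task(item))


def run_pipeline(tasks, items):
    """Chain one block per task; each block consumes what the previous one produces."""
    queues = [queue.Queue() for _ in range(len(tasks) + 1)]
    threads = [
        threading.Thread(target=run_block, args=(task, queues[i], queues[i + 1]))
        for i, task in enumerate(tasks)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(_DONE)

    results = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


outputs = run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3])
```

Because each block here runs in one thread over FIFO queues, results keep the input order. A distributed version would replace the in-memory queues with network channels, and error handling (a task that raises would stall this sketch) is left out for brevity.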
|
85 |
Business Intelligence - det stora kartläggningspusslet : En studie om insamling och analys av konsumentinformation i livsmedelsbranschen Dousa, Robin, Pers, Alexander January 2014
Syfte: Syftet med uppsatsen är att undersöka hur företag, med hjälp av den moderna alltmer avancerade och utvecklade teknologin, systematiskt kartlägger konsumenternas köpbeteenden, genom s.k. business intelligence. Uppsatsen ämnar ta reda på hur teknologin appliceras hos företag samt hur och i vilken mån den data som samlas in används för att få konsumenter till önskade köpbeslut. Teori: Arbetets teoretiska kärna utgörs dels av ett teoretiskt ramverk, i vilket redogörs för business intelligence, samt ett avsnitt där teorier om konsumenternas köpbeteende presenteras. Metod: Arbetet har sin metodologiska utgångspunkt i en kvalitativ forskningsansats där fokus ligger på semistrukturerade intervjuer. För att besvara forskningsfrågorna har ett djupgående angreppssätt används. Empiri: Intervjuer med respondenter, i form av ämnesspecifik expertis, från IBM Sverige, HUI Research och Coop Sverige AB har genomförts. Resultat: Resultatet som presenteras i undersökningen visar på att konsumentförståelse i huvudsak genereras ur kundkortsdatan vilken fungerar som såväl informationsinsamlare som relationsskapare. Utöver det finns det ett växande intresse för kartläggning av rörelsemönster i butik. Vidare påvisar resultatet att det föreligger en problematik vad gäller resursutnyttjandet av data vilket främst grundar sig på att företag inte förmår att utnyttja den insamlade datan på effektivt sätt, något som bl.a. förklaras av såväl en resurs- som integritetsmässig problematik. Slutsats: Studien visar att den teknologiska utvecklingen har medgett en datainsamling större än vad som i många fall kan hanteras och utnyttjas på ett för verksamheter och organisationer maximalt sätt. Företag utnyttjar inte den data de har tillgång till i proportion till den kapacitet som de teknologiska verktyg man har för datainsamling besitter. 
/ Purpose: The purpose of this thesis is to examine how companies, with the help of advanced and developed technology, are able to understand consumer buying behavior through so-called business intelligence. The purpose is also to find out how this technology is applied within companies, and how and to what extent the collected data is used to lead consumers to desired purchasing decisions. Theory: The theoretical core consists of a theoretical framework that describes business intelligence, as well as a section in which theories of consumer buying behavior are presented. Method: The thesis takes its methodological basis in a qualitative research approach focused on semi-structured interviews. An in-depth approach has been used to answer the research questions. Data: Interviews have been conducted with respondents with topic-specific expertise from the following companies: IBM Sverige, HUI Research and Coop Sverige AB. Result: The results presented in the thesis show that companies' understanding of consumer behavior is mainly extracted from loyalty card data, which serves both to collect information and to build relationships with customers. Beyond that, there is a growing interest in mapping patterns of customer movement in stores. Furthermore, the results indicate a problem concerning the utilization of data, mainly because companies are not able to use the collected data efficiently, which can partly be explained by resource constraints as well as privacy concerns. Conclusion: The study shows that technological improvements have made it possible to obtain a larger amount of data than most companies are able to utilize in an efficient way.
Companies do not use the data they have access to in proportion to the capacity of the technological tools they possess for collecting data.
|
86 |
Analyse des différences dans le Big Data : Exploration, Explication, Évolution / Difference Analysis in Big Data : Exploration, Explanation, Evolution Kleisarchaki, Sofia 28 November 2016
La Variabilité dans le Big Data se réfère aux données dont la signification change de manière continue. Par exemple, les données des plateformes sociales et les données des applications de surveillance, présentent une grande variabilité. Cette variabilité est due aux différences dans la distribution de données sous-jacente comme l’opinion de populations d’utilisateurs ou les mesures des réseaux d’ordinateurs, etc. L’Analyse de Différences a comme objectif l’étude de la variabilité des Données Massives. Afin de réaliser cet objectif, les data scientists ont besoin (a) de mesures de comparaison de données pour différentes dimensions telles que l’âge pour les utilisateurs et le sujet pour le trafic réseau, et (b) d’algorithmes efficaces pour la détection de différences à grande échelle. Dans cette thèse, nous identifions et étudions trois nouvelles tâches analytiques : L’Exploration des Différences, l’Explication des Différences et l’Evolution des Différences. L’Exploration des Différences s’attaque à l’extraction de l’opinion de différents segments d’utilisateurs (ex., sur un site de films). Nous proposons des mesures adaptées à la comparaison de distributions de notes attribuées par les utilisateurs, et des algorithmes efficaces qui permettent, à partir d’une opinion donnée, de trouver les segments qui sont d’accord ou pas avec cette opinion. L’Explication des Différences s’intéresse à fournir une explication succincte de la différence entre deux ensembles de données (ex., les habitudes d’achat de deux ensembles de clients). Nous proposons des fonctions de scoring permettant d’ordonner les explications, et des algorithmes qui garantissent de fournir des explications à la fois concises et informatives. Enfin, l’Evolution des Différences suit l’évolution d’un ensemble de données dans le temps et résume cette évolution à différentes granularités de temps.
Nous proposons une approche basée sur le requêtage qui utilise des mesures de similarité pour comparer des clusters consécutifs dans le temps. Nos index et algorithmes pour l’Evolution des Différences sont capables de traiter des données qui arrivent à différentes vitesses et des types de changements différents (ex., soudains, incrémentaux). L’utilité et le passage à l’échelle de tous nos algorithmes reposent sur l’exploitation de la hiérarchie dans les données (ex., temporelle, démographique). Afin de valider l’utilité de nos tâches analytiques et le passage à l’échelle de nos algorithmes, nous réalisons un grand nombre d’expériences aussi bien sur des données synthétiques que réelles. Nous montrons que l’Exploration des Différences guide les data scientists ainsi que les novices à découvrir l’opinion de plusieurs segments d’internautes à grande échelle. L’Explication des Différences révèle la nécessité de résumer les différences entre deux ensembles de données, de manière parcimonieuse et montre que la parcimonie peut être atteinte en exploitant les relations hiérarchiques dans les données. Enfin, notre étude sur l’Evolution des Différences fournit des preuves solides qu’une approche basée sur les requêtes est très adaptée à capturer des taux d’arrivée des données variés à plusieurs granularités de temps. De même, nous montrons que les approches de clustering sont adaptées à différents types de changement. / Variability in Big Data refers to data whose meaning changes continuously. For instance, data derived from social platforms and from monitoring applications exhibit great variability. This variability is essentially the result of changes in the underlying data distributions of attributes of interest, such as user opinions/ratings, computer network measurements, etc. Difference Analysis aims to study variability in Big Data.
To achieve that goal, data scientists need: (a) measures to compare data in various dimensions such as age for users or topic for network traffic, and (b) efficient algorithms to detect changes in massive data. In this thesis, we identify and study three novel analytical tasks to capture data variability: Difference Exploration, Difference Explanation and Difference Evolution. Difference Exploration is concerned with extracting the opinion of different user segments (e.g., on a movie rating website). We propose appropriate measures for comparing user opinions in the form of rating distributions, and efficient algorithms that, given an opinion of interest in the form of a rating histogram, discover agreeing and disagreeing populations. Difference Explanation tackles the question of providing a succinct explanation of differences between two datasets of interest (e.g., buying habits of two sets of customers). We propose scoring functions designed to rank explanations, and algorithms that guarantee explanation conciseness and informativeness. Finally, Difference Evolution tracks change in an input dataset over time and summarizes change at multiple time granularities. We propose a query-based approach that uses similarity measures to compare consecutive clusters over time. Our indexes and algorithms for Difference Evolution are designed to capture different data arrival rates (e.g., low, high) and different types of change (e.g., sudden, incremental). The utility and scalability of all our algorithms rely on hierarchies inherent in data (e.g., time, demographic). We run extensive experiments on real and synthetic datasets to validate the usefulness of the three analytical tasks and the scalability of our algorithms. We show that Difference Exploration guides end-users and data scientists in uncovering the opinion of different user segments in a scalable way.
Difference Explanation reveals the need to parsimoniously summarize differences between two datasets and shows that parsimony can be achieved by exploiting hierarchy in data. Finally, our study on Difference Evolution provides strong evidence that a query-based approach is well-suited to tracking change in datasets with varying arrival rates and at multiple time granularities. Similarly, we show that different clustering approaches can be used to capture different types of change.
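To make the Difference Exploration task more concrete, the sketch below compares rating histograms and keeps the segments that agree with an opinion of interest. The abstract does not name the measures used in the thesis; the one-dimensional Earth Mover's Distance is an illustrative assumption, and `emd_1d`, `agreeing_segments` and the segment names are hypothetical.

```python
def normalize(hist):
    """Turn raw rating counts into a probability distribution."""
    total = sum(hist)
    if total == 0:
        raise ValueError("empty histogram")
    return [count / total for count in hist]


def emd_1d(hist_a, hist_b):
    """Earth Mover's Distance between two histograms over the same ordered rating scale.

    For ordered bins of unit width this equals the sum of absolute
    differences between the two cumulative distributions.
    """
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must use the same rating scale")
    pa, pb = normalize(hist_a), normalize(hist_b)
    distance, running = 0.0, 0.0
    for a, b in zip(pa, pb):
        running += a - b
        distance += abs(running)
    return distance


def agreeing_segments(target, segments, threshold):
    """Names of the segments whose rating distribution is within threshold of target."""
    return sorted(
        name for name, hist in segments.items()
        if emd_1d(target, hist) <= threshold
    )


# Counts of 1-star to 5-star ratings per (hypothetical) user segment.
segments = {"teens": [0, 0, 0, 2, 8], "seniors": [5, 3, 1, 0, 0]}
agree = agreeing_segments([0, 0, 0, 1, 4], segments, threshold=0.5)
```

A distance that respects the order of the rating scale treats a shift from 4 to 5 stars as a smaller disagreement than a shift from 1 to 5 stars, which a bin-by-bin comparison would not capture. This brute-force scan over segments does not reflect the efficient, hierarchy-aware algorithms the thesis describes.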
|
87 |
Security of Big Data: Focus on Data Leakage Prevention (DLP) Nyarko, Richard January 2018
Data has become an indispensable part of our daily lives in the information age. The amount of data generated is growing exponentially due to technological advances. This large volume of data generated daily has given rise to a new term: big data. Security is therefore of great concern in big data processes. The survival of many organizations depends on preventing these data from falling into the wrong hands, because leaked sensitive data can have serious consequences. For instance, the credibility of businesses or organizations is compromised when sensitive data such as trade secrets, project documents, and customer profiles are leaked to their competitors (Alneyadi et al, 2016). In addition, traditional security mechanisms such as firewalls, virtual private networks (VPNs), and intrusion detection systems/intrusion prevention systems (IDSs/IPSs) are not enough to prevent the leakage of such sensitive data. To overcome this deficiency in protecting sensitive data, a new paradigm called data leakage prevention systems (DLPSs) has been introduced. Over the past years, many research contributions have been made to address data leakage. However, most past research focused on detecting data leakage rather than preventing it. This thesis contributes to research by using the preventive approach of DLPS to propose hybrid symmetric-asymmetric encryption to prevent data leakage. This thesis also followed the Design Science Research Methodology (DSRM), with CRISP-DM (CRoss Industry Standard Process for Data Mining) as the kernel theory or framework for designing the IT artifact (method). The proposed encryption method ensures that all confidential or sensitive documents of an organization are encrypted so that only users with access to the decryption keys can access them.
This is achieved after the documents have been classified into confidential and non-confidential ones with a Naïve Bayes Classifier (NBC). Therefore, any organization that needs to prevent data leakage before it occurs can make use of the proposed hybrid encryption method.
|
88 |
Energy-efficient Straggler Mitigation for Big Data Applications on the Clouds / Amélioration de l'efficacité énergétique de la prévention des stragglers pour les applications Big Data sur les Clouds Phan, Tien-Dat 30 November 2017
La consommation d’énergie est une préoccupation importante pour les systèmes de traitement Big Data à grande échelle, ce qui entraîne un coût monétaire énorme. En raison de l’hétérogénéité du matériel et des conflits entre les charges de travail simultanées, les stragglers (i.e., les tâches qui sont relativement plus lentes que les autres tâches) peuvent augmenter considérablement le temps d’exécution et la consommation d’énergie du travail. Par conséquent, l’atténuation des stragglers devient une technique importante pour améliorer les performances des systèmes de traitement Big Data à grande échelle. Typiquement, il se compose de deux phases: la détection de stragglers et la manipulation de stragglers. Dans la phase de détection, les tâches lentes (par exemple, les tâches avec une vitesse ou une progression inférieure à la moyenne) sont marquées en tant que stragglers. Ensuite, les stragglers sont traités en utilisant la technique d’exécution spéculative. Avec cette technique, une copie du straggler détecté est lancée en parallèle avec le straggler dans l’espoir qu’il puisse finir plus tôt, réduisant ainsi le temps d’exécution du straggler. Bien qu’un grand nombre d’études aient été proposées pour améliorer les performances des applications Big Data en utilisant la technique d’exécution spéculative, peu d’entre elles ont étudié l’efficacité énergétique de leurs solutions. Dans le cadre de cette thèse, nous commençons par caractériser l’impact de l’atténuation des stragglers sur la performance et la consommation d’énergie des systèmes de traitement de Big Data. Nous observons que l’efficacité énergétique des techniques actuelles d’atténuation des stragglers pourrait être considérablement améliorée. Cela motive une étude détaillée de ses deux phases: détection de straggler et traitement de straggler.
En ce qui concerne la détection de straggler, nous introduisons un cadre novateur pour caractériser et évaluer de manière exhaustive les mécanismes de détection de straggler. En conséquence, nous proposons un nouveau mécanisme énergétique de détection de straggler. Ce mécanisme de détection est implémenté dans Hadoop et se révèle avoir une efficacité énergétique plus élevée par rapport aux mécanismes les plus récents. En ce qui concerne le traitement de straggler, nous présentons une nouvelle méthode pour répartir des copies spéculatives, qui prend en compte l’impact de l’hétérogénéité des ressources sur la performance et la consommation d’énergie. Enfin, nous introduisons un nouveau mécanisme éconergétique pour gérer les stragglers. Ce mécanisme fournit plus de ressources disponibles pour lancer des copies spéculatives, en utilisant une approche de réservation dynamique de ressources. Il est démontré qu’elle améliore considérablement l’efficacité énergétique en utilisant une simulation. / Energy consumption is an important concern for large-scale Big Data processing systems, which results in huge monetary cost. Due to the hardware heterogeneity and contentions between concurrent workloads, stragglers (i.e., tasks performing relatively slower than other tasks) can severely increase the job’s execution time and energy consumption. Consequently, straggler mitigation becomes an important technique to improve the performance of large-scale Big Data processing systems. Typically, it consists of two phases: straggler detection and straggler handling. In the detection phase, slow tasks (e.g., tasks with speed or progress below the average) are marked as stragglers. Then, stragglers are handled using the speculative execution technique. With this technique, a copy of the detected straggler is launched in parallel with the straggler with the expectation that it can finish earlier, thus reducing the straggler’s execution time.
Although a large number of studies have sought to improve the performance of Big Data applications using the speculative execution technique, few of them have studied the energy efficiency of their solutions. To address this gap, we conduct an experimental study to fully understand the impact of straggler mitigation techniques on the performance and the energy consumption of Big Data processing systems. We observe that current straggler mitigation techniques are not energy efficient. This motivates further study aiming at higher energy efficiency for straggler mitigation. In terms of straggler detection, we introduce a novel framework for comprehensively characterizing and evaluating straggler detection mechanisms. Accordingly, we propose a new energy-driven straggler detection mechanism. This straggler detection mechanism is implemented in Hadoop and is shown to have higher energy efficiency than state-of-the-art mechanisms. In terms of straggler handling, we present a new speculative copy allocation method, which takes into consideration the impact of resource heterogeneity on performance and energy consumption. Finally, an energy-efficient straggler handling mechanism is introduced. This mechanism provides more resource availability for launching speculative copies, by adopting a dynamic resource reservation approach. It is demonstrated, via a trace-driven simulation, to bring a substantial improvement in energy efficiency.
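The two phases described in this abstract, detection and handling by speculative execution, can be sketched as below. This illustrates only the baseline idea (tasks whose progress is below the average are marked as stragglers), not the energy-driven mechanism proposed in the thesis; the function names, the `slack` parameter and the slot model are hypothetical.

```python
from statistics import mean


def detect_stragglers(progress, slack=0.0):
    """Return the ids of tasks whose progress is below (average - slack).

    progress maps a task id to a progress score in [0, 1].
    slack is a hypothetical tolerance that avoids flagging tasks
    that are only marginally slower than average.
    """
    if not progress:
        return []
    threshold = mean(list(progress.values())) - slack
    return sorted(task for task, score in progress.items() if score < threshold)


def plan_speculative_copies(progress, free_slots, slack=0.0):
    """Pick which stragglers get a speculative copy, slowest first, limited by free slots."""
    stragglers = detect_stragglers(progress, slack)
    slowest_first = sorted(stragglers, key=lambda task: progress[task])
    return slowest_first[:free_slots]


progress = {"t1": 0.9, "t2": 0.8, "t3": 0.2, "t4": 0.5}
flagged = detect_stragglers(progress)                     # average is 0.6
copies = plan_speculative_copies(progress, free_slots=1)  # only one spare slot
```

Every speculative copy occupies resources and consumes energy even when the original task finishes first, so the detection threshold and the number of slots reserved for copies directly trade execution time against energy, which is the tension the thesis studies.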
|
89 |
Efficient Big Data Processing on Large-Scale Shared Platforms: managing I/Os and Failure / Sur l'efficacité des traitements Big Data sur les plateformes partagées à grandes échelle : gestion des entrées-sorties et des pannes
Yildiz, Orcun 08 December 2017
En 2017, nous vivons dans un monde régi par les données. Les applications d’analyse de données apportent des améliorations fondamentales dans de nombreux domaines tels que les sciences, la santé et la sécurité. Cela a stimulé la croissance des volumes de données (le déluge du Big Data). Pour extraire des informations utiles à partir de cette quantité énorme d’informations, différents modèles de traitement des données ont émergé tels que MapReduce, Hadoop, et Spark. Les traitements Big Data sont traditionnellement exécutés à grande échelle (les systèmes HPC et les Clouds) pour tirer parti de leur puissance de calcul et de stockage. Habituellement, ces plateformes à grande échelle sont utilisées simultanément par plusieurs utilisateurs et de multiples applications afin d’optimiser l’utilisation des ressources. Bien qu’il y ait beaucoup d’avantages à partager ces plateformes, plusieurs problèmes sont soulevés dès lors qu’un nombre important d’utilisateurs et d’applications les utilisent en même temps, parmi lesquels la gestion des E/S et des défaillances sont les principaux qui peuvent avoir un impact sur le traitement efficace des données. Nous nous concentrons tout d’abord sur les goulots d’étranglement liés aux performances des E/S pour les applications Big Data sur les systèmes HPC. Nous commençons par caractériser les performances des applications Big Data sur ces systèmes. Nous identifions les interférences et la latence des E/S comme les principaux facteurs limitant les performances. Ensuite, nous nous intéressons de manière plus détaillée aux interférences des E/S afin de mieux comprendre les causes principales de ce phénomène. De plus, nous proposons un système de gestion des E/S pour réduire les dégradations de performance que les applications Big Data peuvent subir sur les systèmes HPC. 
Par ailleurs, nous introduisons des modèles d’interférence pour les applications Big Data et HPC en fonction des résultats que nous obtenons dans notre étude expérimentale concernant les causes des interférences d’E/S. Enfin, nous exploitons ces modèles afin de minimiser l’impact des interférences sur les performances des applications Big Data et HPC. Deuxièmement, nous nous concentrons sur l’impact des défaillances sur la performance des applications Big Data en étudiant la gestion des pannes dans les clusters MapReduce partagés. Nous présentons un ordonnanceur qui permet un recouvrement rapide des pannes, améliorant ainsi les performances des applications Big Data. / As of 2017, we live in a data-driven world where data-intensive applications are bringing fundamental improvements to our lives in many different areas such as business, science, health care and security. This has boosted the growth of data volumes (i.e., the deluge of Big Data). To extract useful information from this huge amount of data, different data processing frameworks have emerged, such as MapReduce, Hadoop, and Spark. Traditionally, these frameworks run on large-scale platforms (i.e., HPC systems and clouds) to leverage their computation and storage power. Usually, these large-scale platforms are used concurrently by multiple users and multiple applications with the goal of better utilization of resources. Although sharing these platforms has benefits, it also raises several challenges, among which I/O and failure management are the major ones that can impact efficient data processing. To this end, we first focus on I/O-related performance bottlenecks for Big Data applications on HPC systems. We start by characterizing the performance of Big Data applications on these systems. We identify I/O interference and latency as the major performance bottlenecks. 
Next, we zoom in on the I/O interference problem to further understand the root causes of this phenomenon. Then, we propose an I/O management scheme to mitigate the high latencies that Big Data applications may encounter on HPC systems. Moreover, we introduce interference models for Big Data and HPC applications based on the findings we obtain in our experimental study regarding the root causes of I/O interference. Finally, we leverage these models to minimize the impact of interference on the performance of Big Data and HPC applications. Second, we focus on the impact of failures on the performance of Big Data applications by studying failure handling in shared MapReduce clusters. We introduce a failure-aware scheduler which enables fast failure recovery while optimizing data locality, thus improving application performance.
|
90 |
Dessin de graphe distribué par modèle de force : application au Big Data / Distributed force directed graph drawing : a Big Data case study
Hinge, Antoine 28 June 2018
Les graphes, outil mathématique pour modéliser les relations entre des entités, sont en augmentation constante du fait d'internet (par exemple les réseaux sociaux). La visualisation de graphe (aussi appelée dessin) permet d'obtenir immédiatement des informations sur le graphe. Les graphes issus d'internet sont généralement stockés de manière morcelée sur plusieurs machines connectées par un réseau. Cette thèse a pour but de développer des algorithmes de dessin de très grands graphes dans le paradigme MapReduce, utilisé pour le calcul sur cluster. Parmi les algorithmes de dessin, les algorithmes reposant sur un modèle physique sous-jacent pour réaliser le dessin permettent d'obtenir un bon dessin indépendamment de la nature du graphe. Nous proposons deux algorithmes par modèle de forces conçus dans le paradigme MapReduce. GDAD, le premier algorithme par modèle de force dans le paradigme MapReduce, utilise des pivots pour simplifier le calcul des interactions entre les nœuds du graphe. MuGDAD, le prolongement de GDAD, utilise une simplification récursive du graphe pour effectuer le dessin, toujours à l'aide de pivots. Nous comparons ces deux algorithmes avec les algorithmes de l'état de l'art pour évaluer leurs performances. / Graphs, usually used to model relations between entities, are continually growing, mainly because of the internet (social networks, for example). Graph visualization (also called drawing) is a fast way of collecting data about a graph. Internet graphs are often stored in a distributed manner, split across several interconnected machines. This thesis aims to develop algorithms for drawing very large graphs using the MapReduce paradigm, used for cluster computing. Among graph drawing algorithms, those which rely on a physical model to compute the node placement are generally considered to draw graphs well regardless of the type of graph. We developed two force-directed graph drawing algorithms in the MapReduce paradigm. 
GDAD, the first distributed force-directed graph drawing algorithm ever, uses pivots to simplify computations of node interactions. MuGDAD, following GDAD, uses a recursive simplification to draw the original graph, keeping the pivots. We compare these two algorithms with the state of the art to assess their performance.
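To make the notion of a force-directed (physical-model) drawing concrete, here is a minimal single-machine sketch of one layout iteration. It assumes the classic Fruchterman-Reingold force formulas (repulsion k²/d between every pair of nodes, attraction d²/k along edges) and is not GDAD or MuGDAD: the pivots and the MapReduce distribution described in the abstract are not modeled. The function name `force_step` and the parameters `k` and `step` are hypothetical.

```python
import math


def force_step(pos, edges, k=1.0, step=0.1):
    """Apply one force-directed iteration and return the new positions.

    pos:   dict mapping a node to its (x, y) coordinates
    edges: iterable of (u, v) node pairs
    k:     ideal edge length (assumed parameter)
    step:  fraction of the computed displacement applied per iteration
    """
    disp = {v: [0.0, 0.0] for v in pos}
    nodes = list(pos)

    # Repulsive force between every pair of nodes: magnitude k^2 / d.
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            dx = pos[u][0] - pos[v][0]
            dy = pos[u][1] - pos[v][1]
            d = math.hypot(dx, dy) or 1e-9  # avoid division by zero
            f = k * k / d
            disp[u][0] += dx / d * f
            disp[u][1] += dy / d * f
            disp[v][0] -= dx / d * f
            disp[v][1] -= dy / d * f

    # Attractive force along each edge: magnitude d^2 / k.
    for u, v in edges:
        dx = pos[u][0] - pos[v][0]
        dy = pos[u][1] - pos[v][1]
        d = math.hypot(dx, dy) or 1e-9
        f = d * d / k
        disp[u][0] -= dx / d * f
        disp[u][1] -= dy / d * f
        disp[v][0] += dx / d * f
        disp[v][1] += dy / d * f

    return {
        v: (pos[v][0] + step * disp[v][0], pos[v][1] + step * disp[v][1])
        for v in pos
    }
```

The all-pairs repulsion loop is quadratic in the number of nodes, which is the cost that approximations such as the pivots mentioned in the abstract aim to reduce.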
|