Global ETD Search

1	Big Data Workflows: DSL-based Specification and Software Containers for Scalable Execution Dejene Dessalk, Yared January 2020 (has links) Big Data workflows are composed of multiple orchestration steps that perform different data analytics tasks. These tasks process heterogeneous data using various computing and storage resources. Due to the diversity of application domains, involved technologies, and complexity of data sets, the design and implementation of Big Data workflows require the collaboration of domain experts and technical experts. However, existing tools are too technical and cannot easily allow domain experts to participate in the process of defining and executing Big Data workflows. Moreover, the majority of existing tools are designed for specific applications such as bioinformatics, computational chemistry, and genomics. They are also based on specific technology stacks that do not provide flexible means of code reuse and maintenance. This thesis presents the design and implementation of a Big Data workflow solution based on the use of a domain-specific language (DSL) for hiding complex technical details, enabling domain experts to participate in the process definition of workflows. The workflow solution uses a combination of software container technologies and message-oriented middleware (MOM) to enable highly scalable workflow execution. The applicability of the solution is demonstrated by implementing a prototype based on a real-world data workflow. As per performed evaluations, the proposed workflow solution was evaluated to provide efficient workflow definition and scalable execution. Furthermore, the results of a set of experiments were presented, comparing the performance of the proposed approach with Argo Workflows, one of the most promising tools in the area of Big Data workflows. / Big Data-arbetsflöden består av flera orkestreringssteg som utför olika dataanalysuppgifter. Dessa uppgifter bearbetar heterogena data med hjälp av olika databehandlings- och lagringsresurser. På grund av stora variationen av tillämpningsområden, den involverade tekniken, och komplexiteten hos datamängderna, kräver utformning och implementering av Big Data-arbetsflöden samarbete mellan domänexperter och tekniska experter. Befintliga verktyg är dock för tekniska och vilket försvårar för domänexperter att delta i processen att definiera och genomföra Big Data-arbetsflöden. Dessutom är majoriteten av befintliga verktyg utformade för specifika tillämpningar, som bioinformatik, beräkningskemi och genomik. Verktygen är också baserade på specifika teknikstackar som inte erbjuder flexibla metoder för att kunna underhålla och återanvända kod. Denna avhandling ämnar att presentera design och implementering av en Big Data-arbetsflödeslösning som utnyttjar ett domänspecifikt språk (DSL) för att dölja komplexa tekniska detaljer, vilket gör det möjligt för domänexperter att delta i processdefinitionen av arbetsflöden. Arbetsflödeslösningen använder en kombination av mjukvaruutrustningsteknik och meddelande-orienterad mellanvara (MOM) för att möjliggöra en mer skalbar körning av arbetsflöden. Tillämpningslösningen demonstreras genom att implementera en prototyp baserad på ett verkligt dataflöde. Efter en granskning av de genomförda testerna modifierades den föreslagna arbetsflödeslösningen för att uppnå en effektiv arbetsflödesdefinition och skalbar körning. Dessutom presenteras resultaten av en uppsättning experiment där man jämför skalbarheten för det föreslagna tillvägagångssättet med Argo Workflows, ett av de mest lovande verktygen inom Big Data-arbetsflöden Big Data workflow Domain-specific language Software container Message oriented middleware Scalable execution Big Data-arbetsflode Doman-specifikt sprak Programvarubehallare Meddelande-orienterad mellanprogramvara Skalbar korning Computer and Information Sciences Data- och informationsvetenskap
2	Efficient placement design and storage cost saving for big data workflow in cloud datacenters / Conception d'algorithmes de placement efficaces et économie des coûts de stockage pour les workflows du big data dans les centres de calcul de type cloud Ikken, Sonia 14 December 2017 (has links) Les workflows sont des systèmes typiques traitant le big data. Ces systèmes sont déployés sur des sites géo-distribués pour exploiter des infrastructures cloud existantes et réaliser des expériences à grande échelle. Les données générées par de telles expériences sont considérables et stockées à plusieurs endroits pour être réutilisées. En effet, les systèmes workflow sont composés de tâches collaboratives, présentant de nouveaux besoins en terme de dépendance et d'échange de données intermédiaires pour leur traitement. Cela entraîne de nouveaux problèmes lors de la sélection de données distribuées et de ressources de stockage, de sorte que l'exécution des tâches ou du job s'effectue à temps et que l'utilisation des ressources soit rentable. Par conséquent, cette thèse aborde le problème de gestion des données hébergées dans des centres de données cloud en considérant les exigences des systèmes workflow qui les génèrent. Pour ce faire, le premier problème abordé dans cette thèse traite le comportement d'accès aux données intermédiaires des tâches qui sont exécutées dans un cluster MapReduce-Hadoop. Cette approche développe et explore le modèle de Markov qui utilise la localisation spatiale des blocs et analyse la séquentialité des fichiers spill à travers un modèle de prédiction. Deuxièmement, cette thèse traite le problème de placement de données intermédiaire dans un stockage cloud fédéré en minimisant le coût de stockage. A travers les mécanismes de fédération, nous proposons un algorithme exacte ILP afin d’assister plusieurs centres de données cloud hébergeant les données de dépendances en considérant chaque paire de fichiers. Enfin, un problème plus générique est abordé impliquant deux variantes du problème de placement lié aux dépendances divisibles et entières. L'objectif principal est de minimiser le coût opérationnel en fonction des besoins de dépendances inter et intra-job / The typical cloud big data systems are the workflow-based including MapReduce which has emerged as the paradigm of choice for developing large scale data intensive applications. Data generated by such systems are huge, valuable and stored at multiple geographical locations for reuse. Indeed, workflow systems, composed of jobs using collaborative task-based models, present new dependency and intermediate data exchange needs. This gives rise to new issues when selecting distributed data and storage resources so that the execution of tasks or job is on time, and resource usage-cost-efficient. Furthermore, the performance of the tasks processing is governed by the efficiency of the intermediate data management. In this thesis we tackle the problem of intermediate data management in cloud multi-datacenters by considering the requirements of the workflow applications generating them. For this aim, we design and develop models and algorithms for big data placement problem in the underlying geo-distributed cloud infrastructure so that the data management cost of these applications is minimized. The first addressed problem is the study of the intermediate data access behavior of tasks running in MapReduce-Hadoop cluster. Our approach develops and explores Markov model that uses spatial locality of intermediate data blocks and analyzes spill file sequentiality through a prediction algorithm. Secondly, this thesis deals with storage cost minimization of intermediate data placement in federated cloud storage. Through a federation mechanism, we propose an exact ILP algorithm to assist multiple cloud datacenters hosting the generated intermediate data dependencies of pair of files. The proposed algorithm takes into account scientific user requirements, data dependency and data size. Finally, a more generic problem is addressed in this thesis that involve two variants of the placement problem: splittable and unsplittable intermediate data dependencies. The main goal is to minimize the operational data cost according to inter and intra-job dependencies Workflow du big data Accès et placement des données Minimisation des coûts de stockage Centres de données cloud Hadoop MapReduce Application dirigée par les données Données de dépendances Optimisation Big data workflow Data access and placement Storage cost minimization Cloud datacenters Hadoop MapReduce Data-driven application Data dependency Optimization

Search results

Big Data Workflows: DSL-based Specification and Software Containers for Scalable Execution

Efficient placement design and storage cost saving for big data workflow in cloud datacenters / Conception d'algorithmes de placement efficaces et économie des coûts de stockage pour les workflows du big data dans les centres de calcul de type cloud