About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.

Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
101

Reuse of Computed Match Results for MapReduce-based Object Matching

Sintschilin, Sergej 19 February 2018 (has links)
This bachelor's thesis extends the Dedoop project. Dedoop provides a set of tools that automate the detection of duplicates in a data set using object matching approaches; the object matching runs on the MapReduce platform Hadoop. The extension developed here makes it possible to avoid fully recomputing the match results when the data changes. The procedure works in two phases. In the first phase, the changes between the old data set and the new data set are detected. The resulting information is divided into three categories: records that appear unchanged in both the old and the new data set, records from the new source that require recomputation, and records from the old source that are to be excluded from recomputation. In the second phase, the old object matching is repeated, applied to the subsets obtained in the first phase. The records requiring recomputation are those that were updated or newly inserted, so no results from the old object matching exist for them yet; in the second phase they are matched against each other and against the records that remained unchanged. The records excluded from recomputation are those that were updated or deleted; match results already exist for these, so those results must be purged of these records. The advantage of this procedure is that the unchanged records do not have to be matched against each other again.
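As a rough sketch of the change-detection phase (phase one above): records keyed by ID are compared across the old and new data sets to produce the three categories. This is plain Java for illustration only; the record model and all names are hypothetical, not Dedoop's actual API.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal sketch of the change-detection phase: classify records into
// unchanged / needs-rematch / purge-old-results, keyed by record ID.
// Record contents are plain strings here; real data models differ.
public class DeltaClassifier {
    public static void main(String[] args) {
        Map<String, String> oldSet = Map.of("r1", "Alice", "r2", "Bob", "r3", "Carol");
        Map<String, String> newSet = Map.of("r1", "Alice", "r2", "Robert", "r4", "Dave");

        Set<String> unchanged = new HashSet<>(); // keep old match results
        Set<String> toMatch   = new HashSet<>(); // new or updated: needs matching
        Set<String> toPurge   = new HashSet<>(); // deleted or stale: purge old results

        for (Map.Entry<String, String> e : newSet.entrySet()) {
            if (e.getValue().equals(oldSet.get(e.getKey()))) {
                unchanged.add(e.getKey());      // identical in both data sets
            } else {
                toMatch.add(e.getKey());        // inserted or updated record
            }
        }
        for (String id : oldSet.keySet()) {
            if (!oldSet.get(id).equals(newSet.get(id))) {
                toPurge.add(id);                // deleted, or old version of an update
            }
        }
        System.out.println("unchanged=" + unchanged + " toMatch=" + toMatch + " toPurge=" + toPurge);
    }
}
```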
102

Evaluation and Extension of MapReduce Algorithms for Computing the Transitive Closure of Undirected Graphs for Entity Resolution Workflows

Ziad, Sehili 16 April 2018 (has links)
In entity resolution (deduplication), the absence of globally unique identifiers means that match techniques must be used to determine whether different records represent the same real-world object. The inherent quadratic complexity leads to very long runtimes on large data sets, which requires parallelizing this process. MapReduce, owing to its scalability and its availability in cloud infrastructures, is a good way to improve runtime. Moreover, under certain conditions the quality of the match result can be improved by computing the transitive closure.
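For illustration, the transitive closure over pairwise match results can be computed with union-find: if a matches b and b matches c, all three end up in one cluster. The thesis evaluates distributed MapReduce variants of this step; the single-machine sketch below, with made-up match pairs, only shows what is being computed.

```java
import java.util.*;

// Single-machine illustration of the transitive closure over pairwise
// match results: union-find groups records into clusters so that
// a~b and b~c implies a~c. The match pairs here are invented.
public class TransitiveClosure {
    private final Map<String, String> parent = new HashMap<>();

    String find(String x) {
        parent.putIfAbsent(x, x);
        String root = parent.get(x);
        if (!root.equals(x)) {
            root = find(root);
            parent.put(x, root); // path compression
        }
        return root;
    }

    void union(String a, String b) { parent.put(find(a), find(b)); }

    public static void main(String[] args) {
        TransitiveClosure tc = new TransitiveClosure();
        String[][] matches = {{"a", "b"}, {"b", "c"}, {"d", "e"}};
        for (String[] m : matches) tc.union(m[0], m[1]);

        // a, b, c land in one cluster; d, e in another.
        Map<String, List<String>> clusters = new HashMap<>();
        for (String node : new ArrayList<>(tc.parent.keySet()))
            clusters.computeIfAbsent(tc.find(node), k -> new ArrayList<>()).add(node);
        System.out.println(clusters);
    }
}
```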
103

Fast Data Analysis Methods For Social Media Data

Nhlabano, Valentine Velaphi 07 August 2018 (has links)
The advent of Web 2.0 technologies, which support the creation and publishing of various social media content in a collaborative and participatory way by all users in the form of user-generated content and social networks, has led to the creation of vast amounts of structured, semi-structured and unstructured data. The sudden rise of social media has led to its wide adoption by organisations of various sizes worldwide, in order to take advantage of this new way of communicating and engaging with their stakeholders in ways that were unimaginable before. Data generated from social media is highly unstructured, which makes it challenging for most organisations, which are normally used to handling and analysing structured data from business transactions. The research reported in this dissertation was carried out to investigate fast and efficient methods available for retrieving, storing and analysing unstructured data from social media in order to make crucial and informed business decisions on time. Sentiment analysis was conducted on Twitter data, called tweets. Twitter, one of the most widely adopted social network services, provides an API (Application Programming Interface) for researchers and software developers to connect to and collect public data sets of Twitter data from the Twitter database. A Twitter application was created and used to collect streams of real-time public data via a Twitter source provided by Apache Flume, and to store this data efficiently in the Hadoop Distributed File System (HDFS). Apache Flume is a distributed, reliable, and available system used to efficiently collect, aggregate and move large amounts of log data from many different sources to a centralized data store such as HDFS. Apache Hadoop is an open-source software library that runs on low-cost commodity hardware and has the ability to store, manage and analyse large amounts of both structured and unstructured data quickly, reliably, and flexibly at low cost. A lexicon-based sentiment analysis approach was taken, and the AFINN-111 lexicon was used for scoring. The Twitter data was analysed from HDFS using a Java MapReduce implementation. MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. The results demonstrate that it is fast, efficient and economical to use this approach to analyse unstructured data from social media in real time. / Dissertation (MSc)--University of Pretoria, 2019. / National Research Foundation (NRF) - Scarce skills / Computer Science / MSc / Unrestricted
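As a rough illustration of the lexicon-scoring step, a Hadoop mapper can sum AFINN word weights per tweet. The sketch below stubs the lexicon with a four-word map (the real AFINN-111 file has roughly 2,477 entries, typically shipped to tasks via the distributed cache) and treats each input line as one tweet's text; this is not the dissertation's actual code.

```java
import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a lexicon-based scoring mapper in the spirit described
// above. A real job would load the full AFINN-111 file and parse
// tweet JSON; both are stubbed here for brevity.
public class SentimentMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Tiny stand-in for the AFINN-111 word list (word -> valence score).
    private static final Map<String, Integer> AFINN =
            Map.of("good", 3, "great", 3, "bad", -3, "terrible", -3);

    @Override
    protected void map(LongWritable offset, Text tweet, Context context)
            throws IOException, InterruptedException {
        int score = 0;
        for (String token : tweet.toString().toLowerCase().split("\\W+")) {
            score += AFINN.getOrDefault(token, 0); // unknown words score 0
        }
        // Emit one (tweet, score) pair per input line.
        context.write(tweet, new IntWritable(score));
    }
}
```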
104

Scalable Map-Reduce Algorithms for Mining Formal Concepts and Graph Substructures

Kumar, Lalit January 2018 (has links)
No description available.
105

Optimization of the Photovoltaic Time-series Analysis Process Through Hybrid Distributed Computing

Hwang, Suk Hyun 01 June 2020 (has links)
No description available.
106

Towards an S3-based, DataNode-less implementation of HDFS

Caceres Gutierrez, Franco Jesus January 2020 (has links)
The relevance of data processing and analysis today cannot be overstated. The convergence of several technological advancements has fostered the proliferation of systems and infrastructure that together support the generation, transmission, and storage of nearly 15,000 exabytes of digital, analyzable data. The Hadoop Distributed File System (HDFS) is an open-source system designed to leverage the storage capacity of thousands of servers, and is the file system component of an entire ecosystem of tools to transform and analyze massive data sets. While HDFS is used by organizations of all sizes, smaller ones are not as well suited to organically grow their clusters to accommodate their ever-expanding data sets and processing needs. This is because larger clusters come with higher investment in servers, greater rates of failures to recover from, and the need to allocate more resources to maintenance and administration tasks. This poses a potential limitation down the road for organizations, and it might even deter some from venturing into the data world altogether. This thesis addresses this matter by presenting a novel implementation of HopsFS, an already improved version of HDFS, that requires no user-managed data servers. Instead, it relies on S3, a leading object storage service, for all its user-data storage needs. We compared the performance of both S3-based and regular clusters and found that such an architecture is not only feasible, but also perfectly viable in terms of read and write throughputs, in some cases even outperforming its original counterpart. Furthermore, our solution provides first-class elasticity, reliability, and availability, all while being remarkably more affordable.
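To make the storage substitution concrete: in such a design, the block reads and writes that HDFS DataNodes would serve become object GETs and PUTs against S3. The fragment below (AWS SDK for Java v2) shows only those primitives; the bucket name, key scheme, and everything else about HopsFS's actual block layer are assumptions for illustration.

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

// Sketch of the object-store primitives such a design builds on:
// file blocks become S3 objects keyed by block ID. Bucket and key
// names are hypothetical; HopsFS's real block layer is far more
// involved (caching, consistency, metadata in a separate database).
public class S3BlockStore {
    public static void main(String[] args) {
        try (S3Client s3 = S3Client.builder().region(Region.US_EAST_1).build()) {
            String bucket = "example-hopsfs-blocks";   // hypothetical bucket
            String key = "blk_1073741825";             // HDFS-style block name

            // A "DataNode write" becomes an S3 PUT of the block's bytes.
            s3.putObject(PutObjectRequest.builder().bucket(bucket).key(key).build(),
                    RequestBody.fromBytes("block contents".getBytes()));

            // A "DataNode read" becomes an S3 GET by block key.
            byte[] data = s3.getObjectAsBytes(
                    GetObjectRequest.builder().bucket(bucket).key(key).build()).asByteArray();
            System.out.println(new String(data));
        }
    }
}
```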
107

Hive, Spark, Presto for Interactive Queries on Big Data

Gureev, Nikita January 2018 (has links)
Traditional relational database systems cannot be efficiently used to analyze data with large volume and different formats, i.e. big data. Apache Hadoop is one of the first open-source tools that provides a distributed data storage system and resource manager. The space of big data processing has been growing fast over the past years, and many technologies have been introduced in the big data ecosystem to address the problem of processing large volumes of data; some of the early tools have become widely adopted, with Apache Hive being one of them. However, with the recent advances in technology, there are other tools better suited for interactive analytics of big data, such as Apache Spark and Presto. In this thesis these technologies are examined and benchmarked in order to determine their performance for the task of interactive business intelligence queries. The benchmark is representative of interactive business intelligence queries and uses a star-shaped schema. The performance of Hive on Tez, Hive LLAP, Spark SQL, and Presto is examined with text, ORC, and Parquet data at different volumes and concurrency levels. A short analysis and conclusions are presented, with reasoning about the choice of framework and data format for a system that would run interactive queries on big data.
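For context, an interactive BI query over a star schema joins a large fact table to small dimension tables and aggregates. The Spark SQL sketch below shows the shape of such a query in Java; the table names, columns, and Parquet paths are invented for illustration and are not the thesis's benchmark queries.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Shape of a star-schema BI query run through Spark SQL on Parquet.
// Paths and schema (a sales fact table plus a date dimension) are
// hypothetical stand-ins for a benchmark's star schema.
public class StarSchemaQuery {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("interactive-bi-query").master("local[*]").getOrCreate();

        spark.read().parquet("/data/sales_fact.parquet").createOrReplaceTempView("sales");
        spark.read().parquet("/data/date_dim.parquet").createOrReplaceTempView("dates");

        // Join the fact table to a dimension and aggregate by its attribute.
        Dataset<Row> result = spark.sql(
                "SELECT d.year, SUM(s.revenue) AS total " +
                "FROM sales s JOIN dates d ON s.date_key = d.date_key " +
                "GROUP BY d.year ORDER BY d.year");
        result.show();
        spark.stop();
    }
}
```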
108

Research on High-performance and Scalable Data Access in Parallel Big Data Computing

Yin, Jiangling 01 January 2015 (has links)
To facilitate big data processing, many dedicated data-intensive storage systems such as the Google File System (GFS), the Hadoop Distributed File System (HDFS) and the Quantcast File System (QFS) have been developed. Currently, HDFS [20] is the state-of-the-art and most popular open-source distributed file system for big data processing. It is widely deployed as the bedrock for many big data processing systems/frameworks, such as the script-based Pig system, MPI-based parallel programs, graph processing systems and Scala/Java-based Spark frameworks. These systems/applications employ parallel processes/executors to speed up data processing within scale-out clusters. Job or task schedulers in parallel big data applications such as mpiBLAST and ParaView can maximize the usage of computing resources such as memory and CPU by tracking resource consumption/availability for task assignment. However, since these schedulers do not take the distributed I/O resources and global data distribution into consideration, the data requests from parallel processes/executors in big data processing will unfortunately be served in an imbalanced fashion on the distributed storage servers. These imbalanced access patterns among storage nodes arise because (a) unlike conventional parallel file systems, which use striping policies to evenly distribute data among storage nodes, data-intensive file systems such as HDFS store each data unit, referred to as a chunk or block file, with several copies based on a relatively random policy, which can result in an uneven data distribution among storage nodes; and (b) under the data retrieval policy in HDFS, the more data a storage node contains, the higher the probability that the storage node will be selected to serve the data. Therefore, on the nodes serving multiple chunk files, the data requests from different processes/executors will compete for shared resources such as the hard disk head and network bandwidth. Because of this, the makespan of the entire program can be significantly prolonged and the overall I/O performance will degrade.

The first part of my dissertation seeks to address aspects of these problems by creating an I/O middleware system and designing matching-based algorithms to optimize data access in parallel big data processing. To address the problem of remote data movement, we develop an I/O middleware system, called SLAM, which allows MPI-based analysis and visualization programs to benefit from locality reads, i.e., each MPI process can access its required data from a local or nearby storage node. This can greatly improve execution performance by reducing the amount of data movement over the network. Furthermore, to address the problem of imbalanced data access, we propose a method called Opass, which models the data read requests issued by parallel applications to cluster nodes as a graph data structure in which edge weights encode the demands of load capacity. We then employ matching-based algorithms to map processes to data so as to achieve data access in a balanced fashion.

The final part of my dissertation focuses on optimizing sub-dataset analyses in parallel big data processing. Our proposed methods can benefit different analysis applications with various computational requirements, and the experiments on different cluster testbeds show their applicability and scalability.
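The balancing objective behind Opass can be illustrated in a few lines: each process may read its chunk from any node holding a replica, and the assignments should spread the load evenly. Opass formulates this as a matching problem on a weighted graph; the greedy least-loaded heuristic below, with made-up replica placements, is only a simplified stand-in for that idea.

```java
import java.util.*;

// Greedy stand-in for the balancing goal behind Opass: assign each
// process to the least-loaded node that holds a replica of its chunk.
// Opass itself uses matching algorithms; placements here are invented.
public class BalancedAssignment {
    public static void main(String[] args) {
        // process -> nodes holding a replica of its chunk
        Map<String, List<String>> replicas = Map.of(
                "p1", List.of("n1", "n2"),
                "p2", List.of("n1", "n3"),
                "p3", List.of("n1", "n2"),
                "p4", List.of("n2", "n3"));

        Map<String, Integer> load = new HashMap<>();
        Map<String, String> assignment = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : replicas.entrySet()) {
            // pick the replica holder with the smallest load so far
            String best = e.getValue().stream()
                    .min(Comparator.comparingInt(n -> load.getOrDefault(n, 0)))
                    .orElseThrow();
            load.merge(best, 1, Integer::sum);
            assignment.put(e.getKey(), best);
        }
        System.out.println(assignment + " load=" + load);
    }
}
```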
109

Reducing Cluster Power Consumption by Dynamically Suspending Idle Nodes

Oppenheim, Brian Michael 01 June 2010 (has links) (PDF)
Close to 1% of the world's electricity is consumed by computer servers. Given that this electricity use raises costs and damages the environment, optimizing the world's computing infrastructure for power consumption is worthwhile. This thesis is one attempt at such an optimization. In particular, I began by building a cluster of 6 Intel Atom-based low-power nodes to perform work analogous to that of data center clusters. Then, I installed on the cluster a version of Hadoop modified with a novel power management system. The power management system uses different algorithms to determine when to turn off idle nodes in the cluster. Using the experimental cluster running the modified Hadoop installation, I performed a series of experiments. These tests assessed various strategies for choosing nodes to suspend across a variety of workloads. The experiments validated that turning off idle nodes can yield power savings. While my experimental procedure caused the apparent throughput to decrease significantly, I argue that using more realistic workloads would have yielded much better throughput with slightly reduced power consumption. Additionally, my analysis of the results shows that the percentage power savings in a larger, more realistically sized cluster would be higher than shown in my experiments.
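A minimal sketch of the kind of suspension policy such a system might use: suspend any node whose last task finished more than a grace period ago. The threshold, cluster model, and all names below are assumptions for illustration, not the thesis's actual algorithms.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Illustrative idle-node policy: a node with no tasks for longer than
// a fixed grace period is a suspension candidate. Threshold and the
// per-node bookkeeping are hypothetical.
public class IdleNodePolicy {
    static final Duration GRACE = Duration.ofMinutes(5); // hypothetical threshold

    public static void main(String[] args) {
        Map<String, Instant> lastTaskFinished = new HashMap<>();
        lastTaskFinished.put("node1", Instant.now().minus(Duration.ofMinutes(10)));
        lastTaskFinished.put("node2", Instant.now().minus(Duration.ofMinutes(1)));

        Instant now = Instant.now();
        for (Map.Entry<String, Instant> e : lastTaskFinished.entrySet()) {
            boolean suspend = Duration.between(e.getValue(), now).compareTo(GRACE) > 0;
            System.out.println(e.getKey() + (suspend ? " -> suspend" : " -> keep awake"));
        }
    }
}
```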
110

E-CRADLE v1.1 - An improved distributed system for Photovoltaic Informatics

Zhao, Pei 27 January 2016 (has links)
No description available.
