91. Large-Scale Matrix Completion Using Orthogonal Rank-One Matrix Pursuit, Divide-Factor-Combine, and Apache Spark (January 2014)
Abstract: As the size and scope of valuable datasets have exploded across many industries and fields of research in recent years, an increasingly diverse audience has sought out effective tools for their large-scale data analytics needs. Over this period, machine learning researchers have also been prolific in designing improved algorithms capable of finding the hidden structure within these datasets. As consumers of popular Big Data frameworks have sought to apply and benefit from these improved learning algorithms, the problems encountered with the frameworks have motivated a new generation of Big Data tools that address the shortcomings of the previous generation. One important example is the newer tools' improved performance on the large class of machine learning algorithms that are highly iterative in nature. In this thesis project, I set out to implement a low-rank matrix completion algorithm (as an example of a highly iterative algorithm) within a popular Big Data framework, and to evaluate its performance processing the Netflix Prize dataset. I begin by describing several approaches that I attempted but that did not perform adequately. These include an implementation of the Singular Value Thresholding (SVT) algorithm within the Apache Mahout framework, which runs on top of the Apache Hadoop MapReduce engine. I then describe an approach that uses the Divide-Factor-Combine (DFC) algorithmic framework to parallelize the state-of-the-art low-rank completion algorithm Orthogonal Rank-One Matrix Pursuit (OR1MP) within the Apache Spark engine. I describe the results of a series of tests running this implementation with the Netflix dataset on clusters of various sizes and with various degrees of parallelism, using the Amazon Elastic Compute Cloud (EC2) web service. In the final analysis, I conclude that the Spark DFC + OR1MP implementation does indeed produce competitive results, in both accuracy and performance. In particular, the Spark implementation performs nearly as well as the MATLAB implementation of OR1MP without any parallelism, and improves performance to a significant degree as the parallelism increases. In addition, the experience demonstrates how Spark's flexible programming model makes it straightforward to implement this parallel and iterative machine learning algorithm. (M.S. Computer Science, 2014)
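To make the core iteration concrete, below is a minimal single-machine sketch of the OR1MP pursuit loop (names and dense NumPy arrays are illustrative assumptions, not the thesis's code). Each pass extracts the top singular pair of the masked residual and refits a small least-squares problem over all bases found so far; the thesis's Spark implementation instead distributes this work across DFC submatrices.

```python
import numpy as np

def or1mp(M_obs, mask, rank, power_iters=50):
    """Sketch of Orthogonal Rank-One Matrix Pursuit (OR1MP).

    M_obs : (m, n) array of observed ratings, zeros elsewhere
    mask  : (m, n) boolean array, True where an entry is observed
    rank  : number of rank-one basis matrices to pursue
    """
    observed = (M_obs * mask).ravel()
    residual = M_obs * mask
    masked_bases, full_bases = [], []
    for _ in range(rank):
        # Top singular vector pair of the residual via power iteration.
        v = np.random.randn(M_obs.shape[1])
        for _ in range(power_iters):
            u = residual @ v
            u /= np.linalg.norm(u) + 1e-12
            v = residual.T @ u
            v /= np.linalg.norm(v) + 1e-12
        full = np.outer(u, v)
        masked_bases.append((full * mask).ravel())
        full_bases.append(full.ravel())
        # "Orthogonal" step: refit least-squares weights over all bases,
        # using only the observed entries.
        A = np.stack(masked_bases, axis=1)
        theta, *_ = np.linalg.lstsq(A, observed, rcond=None)
        residual = (observed - A @ theta).reshape(M_obs.shape)
    # Completed matrix: weighted sum of the unmasked rank-one bases.
    return (np.stack(full_bases, axis=1) @ theta).reshape(M_obs.shape)
```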
92. Social Network Analysis Utilizing Big Data Technology. Magnusson, Jonathan (January 2012)
Of late there has been an immense increase in data within modern society. This is evident within the field of telecommunications: the amount of mobile data is growing fast. For a telecommunication operator, this provides a means of getting more information about specific subscribers. The applications are many, such as segmentation for marketing purposes or detection of churners, people about to switch operators. The analysis and information extraction are thus of great value. One approach to this analysis is social network analysis. Utilizing such methods yields ways of finding the importance of each individual subscriber in the network. This thesis aims at investigating the usefulness of social network analysis in telecommunication networks. As these networks can be very large, the methods used to study them must scale linearly when the network size increases. Thus, an integral part of the study is to determine which social network analysis algorithms have this scalability. Moreover, comparisons of software solutions are performed to find products suitable for these specific tasks. Another important part of using social network analysis is being able to interpret the results, which can be cumbersome without expert knowledge. For that reason, a complete process flow for finding influential subscribers in a telecommunication network has been developed. The flow uses input easily available to the telecommunication operator. In addition to social network analysis, machine learning is employed to uncover what behavior is associated with influence and to pinpoint subscribers behaving accordingly.
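As an illustration of a centrality measure with the linear scalability the thesis calls for, here is a hedged sketch (the record format is an assumption) that scores subscribers by their number of distinct contacts in a list of call-detail records:

```python
from collections import defaultdict

def degree_centrality(call_records):
    """Score each subscriber by number of distinct contacts.

    call_records: iterable of (caller, callee) pairs, e.g. parsed CDRs.
    One pass over the edges, so runtime is linear in the number of calls.
    """
    contacts = defaultdict(set)
    for caller, callee in call_records:
        contacts[caller].add(callee)
        contacts[callee].add(caller)
    return {subscriber: len(peers) for subscriber, peers in contacts.items()}
```

For example, `degree_centrality([("A", "B"), ("A", "C"), ("B", "C")])` returns `{"A": 2, "B": 2, "C": 2}`; richer influence measures such as PageRank can replace the scoring step while keeping the same single-pass edge input.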
93. Integrace Big Data a datového skladu / Integration of Big Data and a Data Warehouse. Kiška, Vladislav (January 2017)
This master thesis deals with the problem of data integration between a Big Data platform and an enterprise data warehouse. The main goal is to create a complete transfer system to move data from a data warehouse to this platform using a tool suitable for the task. The system should also store and manage all metadata about previous transfers. The theoretical part focuses on describing the concepts of Big Data, gives a brief introduction to their history, and presents the factors that led to the need for this new approach. The next chapters describe the main principles and attributes of these technologies and discuss the benefits of implementing them within an enterprise. The thesis also describes the technologies known as Business Intelligence, their typical use cases, and their relation to Big Data. A minor chapter presents the main components of the Hadoop system and the most popular related applications. The practical part of this work consists of the implementation of a system to execute and manage transfers from a traditional relational database, in this case representing a data warehouse, to a cluster of a few computers running Hadoop. This part also includes a summary of the applications most used to move data into Hadoop and the design of a database metadata schema, which is used to manage these transfers and to store transfer metadata.
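As a sketch of the metadata-management idea (not the thesis's actual code: the table layout is invented, and Sqoop is assumed as one suitable transfer tool), each executed transfer could be logged to a small catalog like this:

```python
import sqlite3
import subprocess
from datetime import datetime

def record_transfer(db, source_table, target_dir, status):
    """Log one transfer run into the metadata catalog (schema is illustrative)."""
    db.execute(
        """CREATE TABLE IF NOT EXISTS transfers (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               source_table TEXT, target_dir TEXT,
               status TEXT, finished_at TEXT)""")
    db.execute(
        "INSERT INTO transfers (source_table, target_dir, status, finished_at) "
        "VALUES (?, ?, ?, ?)",
        (source_table, target_dir, status, datetime.now().isoformat()))
    db.commit()

def run_import(db, jdbc_url, table, target_dir):
    # Sqoop is one commonly used tool for this job; only its basic
    # import options (--connect/--table/--target-dir) are shown here.
    result = subprocess.run(
        ["sqoop", "import", "--connect", jdbc_url,
         "--table", table, "--target-dir", target_dir])
    record_transfer(db, table, target_dir,
                    "ok" if result.returncode == 0 else "failed")
```

A scheduler can then consult the `transfers` table to skip tables already moved or to retry failed runs, which is exactly the role the thesis assigns to the metadata store.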
94. An Ensemble Method for Large Scale Machine Learning with Hadoop MapReduce. Liu, Xuan (January 2014)
We propose a new ensemble algorithm: the meta-boosting algorithm. This algorithm enables the original AdaBoost algorithm to improve the decisions made by different weak learners using a meta-learning approach. Better accuracy is achieved because the algorithm reduces both bias and variance. However, higher accuracy also brings higher computational complexity, especially on big data. We therefore propose a parallelized meta-boosting algorithm, Parallelized-Meta-Learning (PML), using the MapReduce programming paradigm on Hadoop. Experimental results on the Amazon EC2 cloud computing infrastructure show that PML reduces the computational complexity enormously while retaining error rates as low as those obtained on a single computer. Since MapReduce has the inherent weakness that it cannot directly support iterations within an algorithm, our approach is a win-win: it not only overcomes this weakness but also secures good accuracy. A comparison between this approach and the contemporary algorithm AdaBoost.PL is also performed.
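The partition-train-combine pattern behind PML can be sketched on a single machine as follows (a rough analogue, not the thesis's code: scikit-learn's AdaBoost stands in for the custom meta-boosting learner, and binary 0/1 labels are assumed). Each map task boosts an ensemble on its own data partition; the reduce step combines the per-partition ensembles by majority vote.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def map_train(partition):
    """Map step: boost an ensemble on one data partition."""
    X, y = partition
    return AdaBoostClassifier(n_estimators=50).fit(X, y)

def reduce_combine(models, X):
    """Reduce step: majority vote across the per-partition ensembles."""
    votes = np.stack([m.predict(X) for m in models])   # (n_models, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)     # per-sample majority

# Usage sketch: train one model per partition, then vote on test data.
# models = [map_train(p) for p in partitions]
# y_pred = reduce_combine(models, X_test)
```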
95. Big Data v technológiách IBM / Big Data in Technologies from IBM. Šoltýs, Matej (January 2014)
This diploma thesis presents Big Data technologies and their possible use cases and applications. The theoretical part initially focuses on the definition of the term Big Data and afterwards on Big Data technology, particularly the Hadoop framework. The principles of Hadoop, such as distributed storage and data processing, and its individual components are described. Furthermore, the largest vendors of Big Data technologies are presented. At the end of this part of the thesis, possible use cases of Big Data technologies and some case studies are described. The practical part describes the implementation of a demo example of Big Data technologies and is divided into two chapters. The first chapter deals with the conceptual design of the demo example, the products used, and the architecture of the solution. The second chapter then describes the implementation of the demo example, from the preparation of the demo environment to the creation of applications. The goals of this thesis are the description and characterization of Big Data, the presentation of the largest vendors and their Big Data products, the description of possible use cases of Big Data technologies, and especially the implementation of a demo example in Big Data tools from IBM.
96. Nástroje pro Big Data Analytics / Big Data Analytics Tools. Miloš, Marek (January 2013)
The thesis covers the specific form of data analysis known as Big Data. It first defines the term Big Data and the reasons for its emergence out of the rising need for deeper data processing and analysis tools and methods. The thesis also covers some technical aspects of Big Data tools, focusing on Apache Hadoop in detail. The later chapters contain a Big Data market analysis and describe the biggest Big Data competitors and tools. The practical part of the thesis presents a way of using Apache Hadoop to analyze data from Twitter, with the results then visualized in Tableau.
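A minimal Hadoop Streaming job in this spirit (a sketch, not the thesis's actual scripts, assuming one JSON tweet per line with a "text" field) could count hashtags with a mapper and reducer like these:

```python
# mapper.py -- emit (hashtag, 1) for every hashtag in a tweet's text
import sys
import json

for line in sys.stdin:
    try:
        tweet = json.loads(line)
    except ValueError:
        continue                      # skip malformed lines
    for token in tweet.get("text", "").split():
        if token.startswith("#"):
            print(f"{token.lower()}\t1")
```

```python
# reducer.py -- sum counts per hashtag; Hadoop delivers input sorted by key
import sys

current, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current and current is not None:
        print(f"{current}\t{total}")
        total = 0
    current = key
    total += int(value)
if current is not None:
    print(f"{current}\t{total}")
```

The summed counts can then be exported as CSV for visualization in a tool such as Tableau.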
97. Využití Big Data v bankovním prostředí / Application of Big Data in the Banking Environment. Dvorský, Bohuslav (January 2016)
This thesis addresses the principles and technologies of Big Data and their usage in the banking environment. Its objective is to find business application scenarios for Big Data that deliver added value for a bank. The scenarios were found by studying the literature and consulting with experts, and were subsequently modeled by the author. The possibilities of applying these scenarios in the banking business environment were then verified by a survey that asked professionals about issues relating to the identified business scenarios. The thesis first explains the basic concepts and approaches of Big Data, the status of this technology compared to traditional technologies, and issues of integration into the banking environment. After this theoretical beginning, the business scenarios are found and modeled, followed by their exploration and evaluation. Selected business scenarios are further verified to determine their suitability or unsuitability for implementation using the technologies and principles of Big Data. The contribution of this work is to find real uses of Big Data in banking, where most of the material on this topic is very general and vague. The thesis verifies two business scenarios that can bring a banking institution high added value if they are implemented on a Big Data platform.
98. Big Data a jejích potenciál pro bankovní sektor / Big Data and Its Perspective for the Banking Sector. Firsov, Vitaly (January 2013)
In this thesis, I explore current (2012/2013) trends in Business Intelligence and focus specifically on the rapidly evolving and, in my (and not only my) opinion, very promising area of analysis and use of Big Data in large enterprises. The first, introductory part contains general information and the formal framing: the aims of the work, whom it is oriented toward, and where it could be used. It also describes the inputs and outputs, the structure, the methods for achieving the objectives, and the potential benefits and limitations. Because I also work as a data analyst in the largest bank in the Czech Republic, Czech Savings Bank, I focused on the use of Big Data in banking, since I believe great benefits can be achieved from collecting and analyzing Big Data in this area. The thesis itself is divided into three parts (chapters 2, 3-4, and 5). The second chapter covers how the area of BI developed historically, what BI is today, and what future is predicted for BI by experts such as the world-famous and respected analyst firm Gartner. The third chapter focuses on Big Data itself: what the term means, and how Big Data differs from traditional business information available from ERP, ECM, DMS, and other enterprise systems. It also covers ways to store and process this type of data, as well as the existing and applicable technologies focused on Big Data analysis. In the fourth chapter I focus on the use of Big Data in business; the information in this chapter reflects my personal views on the potential of Big Data, based on my experience in practice at Czech Savings Bank. The final part summarizes the thesis, assesses how I fulfilled the objectives defined at the beginning, and expresses my opinion on the prospects of the Big Data analytics trend, based on the information and knowledge analyzed during the writing of this thesis.
99. Distribuované zpracování dat o IP tocích / Distributed Processing of IP Flow Data. Krobot, Pavel (January 2015)
This thesis deals with the distributed processing of IP flow data. The main goal is to provide an implementation of a software collector that allows storing and processing huge amounts of network data. An open-source framework for the distributed processing of large data sets, Hadoop, which is based on the MapReduce paradigm, was studied. Experiments with this system provided a comparison with current systems and revealed weaknesses of the framework. Based on this knowledge, a specification and scheme for an extension of the current software collector were created within this work. In terms of the created scheme, a query framework for the extended collector was implemented, as querying is considered the most critical aspect of distributed processing of IP flow data. The results of experiments with this implementation show significant performance growth and the ability to scale linearly with some types of queries.
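As an example of the kind of query such a framework must answer, here is a sketch (the record format is an assumption) of a "top talkers" aggregation over flow records; the per-key sum is exactly what a MapReduce reduce step computes in the distributed setting.

```python
from collections import Counter

def top_talkers(flow_records, n=10):
    """Total bytes sent per source IP, highest first.

    flow_records: iterable of dicts with at least 'src_ip' and 'bytes'
    keys, e.g. parsed NetFlow/IPFIX records.
    """
    totals = Counter()
    for record in flow_records:
        totals[record["src_ip"]] += record["bytes"]
    return totals.most_common(n)
```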
100. Wiederverwendung berechneter Matchergebnisse für MapReduce-basiertes Object Matching / Reuse of Computed Match Results for MapReduce-Based Object Matching. Sintschilin, Sergej (19 February 2018)
This bachelor thesis comprises an extension of the Dedoop project. Dedoop provides a set of tools that automate the detection of duplicates in a dataset through object matching approaches. The object matching runs on the MapReduce platform Hadoop. The developed extension makes it possible to avoid a complete recomputation over the data whenever they change. The procedure runs in two phases. In the first phase, the changes between the old dataset and the new dataset are determined. The resulting information is divided into three categories: records that appear unchanged in both the old and the new dataset, records from the new source that require recomputation, and records from the old source that are to be excluded from the recomputation. In the second phase, the old object matching is repeated, applied to the subsets obtained in the first phase. The records requiring recomputation are those that were updated or newly inserted, so no results from the old object matching exist for them yet; in the second phase these records are matched against each other and against the unchanged records. The records excluded from the recomputation are those that were updated or deleted; match results already exist for them, and those results must therefore be purged of these records. The advantage of this procedure is that the unchanged records need not be matched against each other again.
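The two-phase procedure maps directly onto a small diff step; below is a hedged sketch (record and result formats are assumptions, not Dedoop's API) of phase one's three-way classification and the purging of stale match results:

```python
def split_changes(old_records, new_records):
    """Phase one: classify records by comparing the old and new datasets.

    Both arguments are dicts mapping a stable record id to the record;
    full-record equality decides whether a record counts as unchanged.
    """
    unchanged, needs_matching, to_purge = [], [], []
    for rid, rec in new_records.items():
        if rid in old_records and old_records[rid] == rec:
            unchanged.append(rec)
        else:
            needs_matching.append(rec)   # new or updated: no usable old result
    for rid, rec in old_records.items():
        if rid not in new_records or new_records[rid] != rec:
            to_purge.append(rid)         # deleted or updated: results are stale
    return unchanged, needs_matching, to_purge

def purge_results(match_pairs, to_purge):
    """Cleanup before phase two: drop match pairs touching purged records."""
    gone = set(to_purge)
    return [(a, b) for a, b in match_pairs if a not in gone and b not in gone]
```

Phase two then matches `needs_matching` against itself and against `unchanged`; the retained old pairs plus the newly computed matches form the updated result.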