581
JOB SCHEDULING FOR STREAMING APPLICATIONS IN HETEROGENEOUS DISTRIBUTED PROCESSING SYSTEMS
Al-Sinayyid, Ali 01 December 2020 (has links)
The colossal amounts of data generated daily are increasing exponentially, at a never-before-seen pace. A variety of applications—including stock trading, banking systems, healthcare, the Internet of Things (IoT), and social media networks—have created an unprecedented volume of real-time stream data, estimated to reach billions of terabytes in the near future. As a result, we are living in the so-called Big Data era and witnessing a transition to the so-called IoT era. Enterprises and organizations are tackling the challenge of interpreting enormous amounts of raw data streams to achieve an improved understanding of their data and thus make efficient, well-informed (i.e., data-driven) decisions. Researchers have designed distributed data stream processing systems that can process data in near real time. To extract valuable information from raw data streams, analysts create and implement data stream processing applications structured as directed acyclic graphs (DAGs). The infrastructure of distributed data stream processing systems, as well as the varied requirements of stream applications, imposes new challenges. Cluster heterogeneity in a distributed environment yields different resources for task execution and data transmission, which makes optimal scheduling an NP-complete problem. Scheduling streaming applications plays a key role in optimizing system performance, particularly in maximizing the frame rate, i.e., how many instances of data sets can be processed per unit of time. The scheduling algorithm must consider data locality, resource heterogeneity, and communication and computation latencies; the latency at the computation or transmission bottleneck must be minimized when the application is mapped to the heterogeneous, distributed cluster resources. Recent work on task scheduling for distributed data stream processing systems has a number of limitations.
Most current schedulers are not designed to manage heterogeneous clusters. They also lack the ability to consider both task and machine characteristics in scheduling decisions. Furthermore, current default schedulers do not allow the user to control data-locality aspects of application deployment. In this thesis, we investigate the problem of scheduling streaming applications on a heterogeneous cluster environment and develop the maximum throughput scheduler algorithm (MT-Scheduler) for streaming applications. The proposed algorithm uses a dynamic programming technique to efficiently map the application topology onto a heterogeneous distributed system based on computing and data-transfer requirements, while also taking into account the capacity of the underlying cluster resources. The proposed approach maximizes system throughput by identifying and minimizing the time incurred at the computing/transfer bottleneck. The MT-Scheduler supports scheduling applications structured as a DAG, such as Amazon Timestream, Google MillWheel, and Twitter Heron. We conducted experiments using three Storm microbenchmark topologies in both simulated and real Apache Storm environments. To evaluate performance, we compared the proposed MT-Scheduler with a simulated round-robin scheduler and the default Storm scheduler. The results indicate that the MT-Scheduler outperforms the round-robin approach in terms of both average system latency and throughput.
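To make the bottleneck principle behind such throughput-maximizing schedulers concrete, the sketch below assigns pipeline stages to heterogeneous machines so that the slowest stage (which bounds throughput) is as fast as possible. It is a brute-force toy, not the MT-Scheduler's dynamic program, and the work units and machine speeds are invented for the demo:

```rust
// Toy illustration of bottleneck-minimizing scheduling on a heterogeneous
// cluster: in a streaming pipeline, throughput is the reciprocal of the
// slowest stage, so the scheduler searches for the one-to-one assignment of
// tasks to machines that minimizes the maximum per-stage time.

fn permutations(items: &mut Vec<usize>, k: usize, out: &mut Vec<Vec<usize>>) {
    // Generate all orderings of `items` recursively (fine for tiny inputs).
    if k == items.len() {
        out.push(items.clone());
        return;
    }
    for i in k..items.len() {
        items.swap(k, i);
        permutations(items, k + 1, out);
        items.swap(k, i);
    }
}

/// Returns (assignment, bottleneck time): the machine index for each task,
/// minimizing max over tasks of work[task] / speed[machine].
fn min_bottleneck(work: &[f64], speed: &[f64]) -> (Vec<usize>, f64) {
    let mut idx: Vec<usize> = (0..speed.len()).collect();
    let mut perms = Vec::new();
    permutations(&mut idx, 0, &mut perms);
    perms
        .into_iter()
        .map(|p| {
            let t = work
                .iter()
                .zip(&p)
                .map(|(w, &m)| w / speed[m])
                .fold(f64::MIN, f64::max);
            (p, t)
        })
        .min_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .unwrap()
}

fn main() {
    // Three pipeline stages with different work, three machines with
    // different speeds (all numbers hypothetical).
    let work = [4.0, 2.0, 6.0];
    let speed = [1.0, 2.0, 3.0];
    let (assignment, bottleneck) = min_bottleneck(&work, &speed);
    println!("assignment: {:?}, bottleneck time: {}", assignment, bottleneck);
    println!("max throughput: {} tuples/unit time", 1.0 / bottleneck);
}
```

Here the heaviest task lands on the fastest machine, equalizing all stage times at 2.0 and doubling throughput over the naive in-order mapping.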
582
Business Intelligence in the Hotel Industry
Shahini, Rei January 2020 (has links)
Applications of artificial intelligence (AI) in hospitality and accommodation now support a large share of service provision, automating most of the processes involved, such as booking and purchasing, improving the guest experience, and tracking guest preferences and interests. The aim of the study is to understand the roles, benefits, and issues involved in improving business intelligence (BI) in hospitality. The research sets out to discover the applications of BI in hotel booking and accommodation, focusing on hotel guest experience, business operations, and guest satisfaction. It also shows how acquiring proper BI is supported by a dynamic technology framework that integrates AI with a big data resource; in such a system, intensive collection of customer data combined with an improved technology standard becomes achievable through AI. The research employs a qualitative approach to data discovery and collection, and a thematic analysis generates findings that indicate improvements in the entire hospitality service delivery system as well as in customer satisfaction. The thesis examines various subsets of BI in tourism and analyzes the competition arising from the application of these technologies. The study also shows the importance of harnessing data to gather insights about guest interests and preferences through well-developed BI; such insights enable the customization of hotel services and products for individual guests. A considerable improvement in guest services and guest information collection is achieved through the creation of guest profiles. Finally, the thesis discusses how AI and big data, among other sub-components, combine to create diversified BI, and identifies the need for current BI applications in the hotel industry.
583
An experimental study of memory management in Rust programming for big data processing
Okazaki, Shinsaku 10 December 2020 (has links)
Planning optimized memory management is critical for Big Data analysis tools to achieve faster runtimes and efficient use of computational resources. Modern Big Data analysis tools use application languages that abstract away memory management so that developers do not have to pay close attention to memory management strategies.
Many existing cloud-based data processing systems, such as Hadoop, Spark, and Flink, run on the Java Virtual Machine (JVM) and take full advantage of its features, including automated memory management with garbage collection (GC), which may introduce significant overhead. Dataflow-based systems like Spark allow programmers to define complex objects in a host language like Java to manipulate and transfer tremendous amounts of data.
System languages like C++ or Rust seem to be a better choice for developing Big Data processing systems because they do not rely on the JVM: with a system language, the developer has full control over memory management. We found the Rust programming language to be a good candidate due to its ability to express memory-safe, fearlessly concurrent code through its concepts of ownership and borrowing. Rust offers many possible strategies for optimizing memory management for Big Data processing, including the selection of different variable types, the use of reference counting (Rc), and multithreading with atomic reference counting (Arc).
In this thesis, we conducted an experimental study to assess how much these different memory management strategies differ in overall runtime performance. Our experiments focus on complex object manipulation and common Big Data processing patterns under various memory management strategies. Our experimental results indicate a significant difference among these strategies in data processing performance.
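As a minimal sketch of two of the strategies compared above (the thesis's actual benchmarks are far more involved), the snippet below shares one value within a thread using `Rc` and across threads using `Arc`; the sample data and thread count are invented for the demo:

```rust
use std::rc::Rc;
use std::sync::Arc;
use std::thread;

fn main() {
    // Single-threaded sharing: Rc keeps a plain (non-atomic) reference
    // count, so cloning is cheap, but the value cannot cross thread
    // boundaries (Rc is !Send).
    let local = Rc::new(vec![1, 2, 3]);
    let local2 = Rc::clone(&local);
    assert_eq!(Rc::strong_count(&local), 2);
    println!("rc sum: {}", local2.iter().sum::<i32>());

    // Multi-threaded sharing: Arc maintains an atomic reference count,
    // which costs more per clone/drop but lets ownership be shared safely
    // across threads.
    let shared = Arc::new(vec![1, 2, 3]);
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let data = Arc::clone(&shared);
            thread::spawn(move || data.iter().sum::<i32>())
        })
        .collect();
    for h in handles {
        assert_eq!(h.join().unwrap(), 6);
    }
    println!("all threads saw the same data");
}
```

The atomic increments behind `Arc` are exactly the kind of per-operation cost such an experimental study can measure against plain `Rc` or by-value moves.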
584
ZipThru: A software architecture that exploits Zipfian skew in datasets for accelerating Big Data analysis
Ejebagom J Ojogbo (9529172) 16 December 2020 (has links)
In the past decade, Big Data analysis has become a central part of many industries, including entertainment, social networking, and online commerce. MapReduce, pioneered by Google, is a popular programming model for Big Data analysis, famous for its easy programmability thanks to automatic data partitioning, its fault tolerance, and its high performance. The majority of MapReduce workloads are summarizations, where the final output is a per-key "reduced" version of the input, highlighting a shared property of each key in the input dataset.

While MapReduce was originally proposed for massive data analyses on networked clusters, the model is also applicable to datasets small enough to be analyzed on a single server. In this single-server context, the intermediate tuple state generated by mappers is saved to memory, and only after all Map tasks have finished are reducers allowed to process it. This Map-then-Reduce sequential mode of execution leads to distant reuse of the intermediate state, resulting in poor locality for memory accesses. In addition, the intermediate state is often too large to fit in the on-chip caches, leading to numerous cache misses as the state grows during execution, further degrading performance. It is well known, however, that many large datasets used in these workloads possess a Zipfian/power-law skew, where a minority of keys (e.g., 10%) appear in a majority of tuples/records (e.g., 70%).

I propose ZipThru, a novel MapReduce software architecture that exploits this skew to keep the tuples for the popular keys on-chip, processing them on the fly and thus improving reuse of their intermediate state and curtailing off-chip misses. ZipThru achieves this using four key mechanisms: 1) concurrent execution of both Map and Reduce phases; 2) holding only the small, reduced state of the minority of popular keys on-chip during execution; 3) using a lookup table built by pre-processing a subset of the input to distinguish between popular and unpopular keys; and 4) load balancing the concurrently executing Map and Reduce phases to efficiently share on-chip resources.

Evaluations using Phoenix, a shared-memory MapReduce implementation, on 16- and 32-core servers reveal that ZipThru incurs 72% fewer cache misses on average over traditional MapReduce while achieving average speedups of 2.75x and 1.73x on the two machines respectively.
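The interplay of mechanisms 2) and 3) can be sketched in a few lines: sample a prefix of the input to build the popular-key lookup table, reduce popular tuples on the fly in a small table, and defer the rest to a conventional reduce pass. The word stream, sample size, and popularity threshold below are invented for the demo; this is not ZipThru's or Phoenix's actual implementation:

```rust
use std::collections::{HashMap, HashSet};

/// Word count over a skewed tuple stream. Keys found popular in a sampled
/// prefix are reduced on the fly in a small table (which, being small, can
/// stay cache-resident); everything else is buffered and reduced afterwards.
fn zipf_aware_count<'a>(
    stream: &[&'a str],
    sample: usize,
    min_hits: usize,
) -> HashMap<&'a str, usize> {
    // Pre-processing pass over a subset of the input builds the lookup table.
    let mut freq: HashMap<&str, usize> = HashMap::new();
    for &w in &stream[..sample.min(stream.len())] {
        *freq.entry(w).or_insert(0) += 1;
    }
    let popular: HashSet<&str> = freq
        .into_iter()
        .filter(|&(_, c)| c >= min_hits)
        .map(|(w, _)| w)
        .collect();

    let mut hot: HashMap<&str, usize> = HashMap::new(); // small, reduced state
    let mut cold: Vec<&str> = Vec::new();               // deferred tuples
    for &w in stream {
        if popular.contains(w) {
            *hot.entry(w).or_insert(0) += 1; // reduce on the fly
        } else {
            cold.push(w);
        }
    }
    // Conventional Map-then-Reduce pass for the unpopular minority of tuples.
    for w in cold {
        *hot.entry(w).or_insert(0) += 1;
    }
    hot
}

fn main() {
    // Hypothetical skewed stream: "the" dominates, mimicking Zipfian data.
    let stream = ["the", "the", "a", "the", "the", "b", "the", "the", "a", "the"];
    let counts = zipf_aware_count(&stream, 4, 2);
    println!("{:?}", counts);
}
```

Because the popular minority of keys accounts for the majority of tuples, most of the stream is absorbed by the small on-the-fly table instead of growing an ever-larger intermediate state.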
585
Privacy-awareness in the era of Big Data and machine learning
Vu, Xuan-Son January 2019 (has links)
Social network sites (SNS) such as Facebook and Twitter play a great role in our lives. On the one hand, they connect people who would not otherwise be connected; many recent breakthroughs in AI, such as facial recognition [49], were achieved thanks to the amount of data available on the Internet via SNS (hereafter Big Data). On the other hand, due to privacy concerns, many people try to avoid SNS to protect their privacy. Similar to the security issues of the Internet protocol, Machine Learning (ML), the core of AI, was not designed with privacy in mind. For instance, Support Vector Machines (SVMs) solve a quadratic optimization problem by deciding which instances of the training dataset are support vectors; this means that the data of the people involved in training is effectively published within the SVM model. Thus, privacy guarantees must be applied to worst-case outliers, while data utility must still be preserved. For these reasons, this thesis studies: (1) how to construct a data federation infrastructure with privacy guarantees in the Big Data era; and (2) how to protect privacy while learning ML models with a good trade-off between data utility and privacy. For (1), we proposed different frameworks empowered by privacy-aware algorithms that satisfy the definition of differential privacy, the state-of-the-art privacy guarantee. For (2), we proposed different neural network architectures to capture the sensitivity of user data, from which the algorithm itself decides how much to learn from user data so as to protect privacy while achieving good performance on a downstream task. The current outcomes of the thesis are: (1) a privacy-guaranteed data federation infrastructure for analysis of sensitive data; (2) privacy-guaranteed algorithms for data sharing; and (3) privacy-aware data analysis of social network data.
The research methods used in this thesis include experiments on real-life social network datasets to evaluate aspects of the proposed approaches. Insights and outcomes from this thesis can be used by both academia and industry to guarantee privacy in data analysis and data sharing over personal data. They also have the potential to facilitate further research in privacy-aware representation learning and related evaluation methods.
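For reference, the differential-privacy guarantee that the proposed frameworks satisfy can be stated in its standard form (given here for context, not as the thesis's exact notation): a randomized mechanism $M$ is $\varepsilon$-differentially private if, for all datasets $D$ and $D'$ differing in a single record and all measurable output sets $S$,

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\, \Pr[M(D') \in S]
```

Intuitively, no single individual's record can change the distribution of the mechanism's output by more than a factor of $e^{\varepsilon}$, which bounds what any observer can infer about that individual.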
586
Fast Data Analysis Methods For Social Media Data
Nhlabano, Valentine Velaphi 07 August 2018 (has links)
The advent of Web 2.0 technologies, which support the creation and publishing of various social media content in a collaborative and participatory way in the form of user-generated content and social networks, has led to vast amounts of structured, semi-structured, and unstructured data. The sudden rise of social media has led to its wide adoption by organisations of all sizes worldwide, seeking to take advantage of this new way of communicating and engaging with their stakeholders in ways that were unimaginable before. Data generated from social media is highly unstructured, which makes it challenging for most organisations, whose tools are normally built for handling and analysing structured data from business transactions. The research reported in this dissertation investigates fast and efficient methods for retrieving, storing, and analysing unstructured data from social media in order to make crucial and informed business decisions on time. Sentiment analysis was conducted on Twitter messages, called tweets. Twitter, one of the most widely adopted social network services, provides an API (Application Programming Interface) for researchers and software developers to connect to and collect public data sets from the Twitter database.
A Twitter application was created and used to collect streams of real-time public data via the Twitter source provided by Apache Flume, efficiently storing this data in the Hadoop Distributed File System (HDFS). Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store such as HDFS. Apache Hadoop is an open-source software library that runs on low-cost commodity hardware and can store, manage, and analyse large amounts of both structured and unstructured data quickly, reliably, and flexibly at low cost. A lexicon-based sentiment analysis approach was taken, and the AFINN-111 lexicon was used for scoring. The Twitter data was analysed from HDFS using a Java MapReduce implementation. MapReduce is a programming model, with an associated implementation, for processing and generating big data sets with a parallel, distributed algorithm on a cluster. The results demonstrate that it is fast, efficient, and economical to use this approach to analyse unstructured data from social media in real time. / Dissertation (MSc)--University of Pretoria, 2019. / National Research Foundation (NRF) - Scarce skills / Computer Science / MSc / Unrestricted
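The lexicon-based scoring step can be sketched as follows. The handful of word scores below are invented for illustration and are not the real AFINN-111 entries (the actual file maps each of its words to an integer valence in [-5, 5]), and the dissertation's implementation used Java MapReduce rather than this single-machine sketch:

```rust
use std::collections::HashMap;

/// Sum per-word valence scores over a tweet's tokens; words missing from
/// the lexicon contribute 0, mirroring the usual lexicon-based approach.
fn sentiment_score(lexicon: &HashMap<&str, i32>, tweet: &str) -> i32 {
    tweet
        .to_lowercase()
        .split(|c: char| !c.is_alphanumeric()) // crude tokenizer for the demo
        .filter(|t| !t.is_empty())
        .map(|t| lexicon.get(t).copied().unwrap_or(0))
        .sum()
}

fn main() {
    // Toy lexicon in the style of AFINN-111 (these scores are made up).
    let lexicon: HashMap<&str, i32> =
        [("good", 3), ("great", 3), ("bad", -3), ("awful", -4)]
            .into_iter()
            .collect();

    let score = sentiment_score(&lexicon, "Great service, but awful queues!");
    println!("score = {score}"); // negative overall: -4 outweighs +3
}
```

In the MapReduce version, the map phase would emit a score per tweet and the reduce phase would aggregate scores per topic or time window.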
587
P-SGLD: Stochastic Gradient Langevin Dynamics with control variates
Bruzzone, Andrea January 2017 (has links)
Year after year, the amount of data that we continuously generate increases. When this situation first arose, the main challenge was finding a way to store the huge quantity of information; nowadays, with the increasing availability of storage facilities, that problem is solved, but it leaves us a new issue to deal with: finding tools that allow us to learn from these large data sets. In this thesis, a framework for Bayesian learning with the ability to scale to large data sets is studied. We present the Stochastic Gradient Langevin Dynamics (SGLD) framework and show that in some cases its approximation of the posterior distribution is quite poor. One reason for this is that SGLD estimates the gradient of the log-likelihood with high variability due to naive subsampling. Our approach combines accurate proxies for the gradient of the log-likelihood with SGLD. We show that it produces better results, in terms of convergence to the correct posterior distribution, than standard SGLD, since the accurate proxies dramatically reduce the variance of the gradient estimator. Moreover, we demonstrate that this approach is more efficient than the standard Markov chain Monte Carlo (MCMC) method and that it exceeds other variance-reduction techniques proposed in the literature, such as the SAGA-LD algorithm; SAGA-LD also uses control variates to improve SGLD, which makes the comparison with our approach straightforward. We apply the method to the logistic regression model.
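The control-variate idea can be summarized in one update (a standard formulation given here for context, not necessarily the thesis's exact notation): with a dataset of size $N$, a minibatch $S$ of size $n$, and a fixed reference point $\hat{\theta}$ (e.g., a posterior mode), the gradient proxy and Langevin step are

```latex
\hat{g}(\theta_t) = \nabla \log p(\theta_t)
  + \sum_{j=1}^{N} \nabla \log p(x_j \mid \hat{\theta})
  + \frac{N}{n} \sum_{i \in S} \Big( \nabla \log p(x_i \mid \theta_t)
      - \nabla \log p(x_i \mid \hat{\theta}) \Big),
\qquad
\theta_{t+1} = \theta_t + \frac{\epsilon_t}{2}\, \hat{g}(\theta_t) + \eta_t,
\quad \eta_t \sim \mathcal{N}(0, \epsilon_t I)
```

Since the bracketed per-datum differences shrink as $\theta_t$ approaches $\hat{\theta}$, the estimator's variance collapses near the mode, which is what drives the improved convergence over naive SGLD.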
588
A picture is worth a thousand words, or?: Individuals' use of visual dashboards
Nilsson, Elin, Nyborg, Mikael January 2020 (has links)
Purpose – Increasing amounts of data have become an important factor for organizations. A visual dashboard is a BI tool that can be used to communicate insights from big data. One way for individuals in organizations to gain insights from timely and large data sets is through visualizations displayed in visual dashboards, but studies show that most dashboards fall short of their potential. Therefore, the aim of this study is to examine how individuals make use of visual dashboards. Design/Methodology – To obtain this understanding, a literature review was performed, followed by a study conducted in two phases. First, a multiple-case study of four organizations was performed, which included interviews and the think-aloud technique. Second, the findings from the multiple-case study were tested through interviews with experts in the BI area. Findings – The findings indicate that low democratization, scarce effects, and simplicity are reasons why the use of visual dashboards is not fully exploited. Low attention and understanding, combined with a lack of timely information, mean that data-driven actions are not taken. The phase of predictive analysis has not yet been reached; rather, organizations still use the visual dashboard for descriptive analysis, which in turn hinders the possibility of effects. For these reasons, the use of visual dashboards does not meet the often-described purpose of making better and faster decisions, and organizations have yet to take steps in that direction. Research limitations – The sampling of industries in the multiple-case study could affect variables such as the number of KPIs.
589
A qualitative review of the role of Data Scientists and their work tasks in relation to the characteristics of Big Data
Otterheim, Oskar January 2020 (has links)
This thesis was inspired by the inconsistent application of the various characteristics used to define Big Data, and by the question of how the wide-ranging work of the concept's expert role relates to precisely these characteristics. The purpose of the study is thus to give a clearer picture of what the diffuse role of Data Scientist entails and to highlight what the role's fundamental tasks involve, using these characteristics as guidelines. Information about the role was collected through semi-structured interviews with Data Scientists in organizations of varying types and sizes. The study's analysis gives vivid descriptions of what the work of the participating respondents looks like and establishes how their tasks relate to the different characteristics of Big Data. The results depict how the work of the various respondents relates to the definition of Big Data, and how the work differs depending on the type and size of the organization in which a Data Scientist operates. The results also show that the work of Data Scientists can collectively be related to the characteristics Value, Visualization, and Validity, which answers the study's main research question. The results and the investigation itself are reflected upon in the thesis's discussion section, which describes discoveries made during the work, both about Big Data as a concept and about the Data Scientist role, and among other things suggests further studies that could lead to categorizations of the role.
590
“We Traded Our Privacy for Comfortability”: A Study About How Big Data is Used and Abused by Major International Companies
Hansson, Madelene, Manfredsson, Adam January 2020 (has links)
Due to digitalization, e-commerce and online presence are something most of us take for granted. Companies are moving toward an internet-based arena of sales rather than traditional commerce in physical stores. This development has led firms to market themselves through various online channels such as social media. Big data is the digital DNA we leave behind on every part of the internet we use, and it has become an international commodity that can be sold, stored, and used. The authors of this thesis have investigated the way international firms extract and use big data to construct customized marketing for their customers. The thesis also examines the ethical perspective on how this commodity is handled and used, and people's perceptions of the matter. This is interesting to investigate since very little research has previously combined big data usage with ethics. To accomplish the aim of this thesis, relevant theory has been reviewed and accounted for, and qualitative research has been conducted: two large companies that work closely with big data were investigated through a case study, and six semi-structured interviews were held with people between 20 and 30 years old. The outcome of this thesis shows the importance of implementing ethics in the concept and usage of big data, and it provides insight into the mind of the consumer that has been lacking in previous research on this subject.