31 |
Randomized coordinate descent methods for big data optimization / Takac, Martin. January 2014.
This thesis consists of 5 chapters. We develop new serial (Chapter 2), parallel (Chapter 3), distributed (Chapter 4) and primal-dual (Chapter 5) stochastic (randomized) coordinate descent methods, analyze their complexity and conduct numerical experiments on synthetic and real data of huge sizes (GBs/TBs of data, millions/billions of variables). In Chapter 2 we develop a randomized coordinate descent method for minimizing the sum of a smooth and a simple nonsmooth separable convex function and prove that it obtains an ε-accurate solution with probability at least 1 - p in at most O((n/ε) log(1/p)) iterations, where n is the number of blocks. This extends recent results of Nesterov [43], which cover the smooth case, to composite minimization, while at the same time improving the complexity by a factor of 4 and removing ε from the logarithmic term. More importantly, in contrast with the aforementioned work, in which the author achieves the results by applying the method to a regularized version of the objective function with an unknown scaling factor, we show that this is not necessary, thus obtaining the first true iteration complexity bounds. For strongly convex functions the method converges linearly. In the smooth case we also allow for arbitrary probability vectors and non-Euclidean norms. Our analysis is also much simpler. In Chapter 3 we show that the randomized coordinate descent method developed in Chapter 2 can be accelerated by parallelization. The speedup, as compared to the serial method, and referring to the number of iterations needed to approximately solve the problem with high probability, is equal to the product of the number of processors and a natural and easily computable measure of separability of the smooth component of the objective function. In the worst case, when no degree of separability is present, there is no speedup; in the best case, when the problem is separable, the speedup is equal to the number of processors. Our analysis also works in the regime where the number of coordinates updated at each iteration is random, which allows for modeling situations with a variable number of (busy or unreliable) processors. We demonstrate numerically that the algorithm is able to solve huge-scale l1-regularized least squares problems with a billion variables. In Chapter 4 we extend coordinate descent to a distributed environment. We initially partition the coordinates (features or examples, depending on the problem formulation) and assign each partition to a different node of a cluster. At every iteration, each node picks a random subset of the coordinates from those it owns, independently of the other nodes, and in parallel computes and applies updates to the selected coordinates based on a simple closed-form formula. We give bounds on the number of iterations sufficient to approximately solve the problem with high probability, and show how they depend on the data and on the partitioning. We perform numerical experiments with a LASSO instance described by a 3TB matrix. Finally, in Chapter 5, we address the issue of using mini-batches in the stochastic optimization of Support Vector Machines (SVMs). We show that the same quantity, the spectral norm of the data, controls the parallelization speedup obtained for both primal stochastic subgradient descent (SGD) and stochastic dual coordinate ascent (SDCA) methods, and use it to derive novel variants of mini-batched (parallel) SDCA.
Our guarantees for both methods are expressed in terms of the original nonsmooth primal problem based on the hinge loss. Our results in Chapters 2 and 3 are cast for blocks (groups of coordinates) instead of individual coordinates, and hence the methods are better described as block coordinate descent methods. While the results in Chapters 4 and 5 are not formulated for blocks, they can be extended to this setting.
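As a concrete, hedged illustration of the kind of method studied in Chapters 2 and 3 (not the thesis's exact algorithm), the sketch below applies serial randomized coordinate descent to an l1-regularized least squares problem; the uniform coordinate sampling, per-coordinate Lipschitz constants and fixed iteration budget are simplifying assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*|.|, the simple separable nonsmooth term."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def randomized_cd_lasso(A, b, lam, n_iters=10_000, seed=0):
    """Serial randomized coordinate descent for 0.5*||Ax - b||^2 + lam*||x||_1.
    Assumes every column of A is nonzero; coordinates are sampled uniformly."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    residual = A @ x - b                      # maintained incrementally
    L = (A ** 2).sum(axis=0)                  # per-coordinate Lipschitz constants
    for _ in range(n_iters):
        i = rng.integers(n)                   # uniform sampling over coordinates
        g_i = A[:, i] @ residual              # partial derivative of the smooth part
        x_new = soft_threshold(x[i] - g_i / L[i], lam / L[i])
        residual += A[:, i] * (x_new - x[i])  # cheap O(m) residual update
        x[i] = x_new
    return x
```

The same per-coordinate update, applied to many randomly chosen coordinates at once, is the basis of the parallel variant described for Chapter 3.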
|
32 |
Från data till kunskap : En kvalitativ studie om interaktiv visualisering av big data genom dashboards [From data to knowledge: a qualitative study of interactive visualization of big data through dashboards] / Agerberg, David; Eriksson, Linus. January 2016.
Rapidly growing volumes of data demand new solutions for analysis and visualization. The growing amount of data contains valuable information which organizations in an increasingly digitized society need to manage. Visualizing data, both statically and interactively, is a major challenge. Visualization of big data opens up several opportunities, including risk assessment and decision support. Previous research indicates a lack of standards and guidelines for the development of interactive dashboards. To study success factors from a user-centered perspective, we adopted a qualitative approach using semi-structured interviews, complemented by a thorough examination of existing literature in this particular field of research. A total of eight interviews were held; all eight respondents had experience of using or developing dashboards. The results indicate that user experience is an important yet insufficiently applied principle. They also point to challenges in managing big data and, in particular, in visualizing it. The results were developed into a model which illustrates guidelines and vital components to orchestrate when developing a dashboard. A user-centered approach should pervade the entire development process. Interactive functionality is a necessity rather than a recommendation; with interactivity come drill-down capabilities, which lead to more intuitive use. User experience is an essential component of the model, highlighting individual customisation while accommodating a large target group. The last component emphasizes the importance of early prototyping and an iterative approach to software development. The conclusion of the study is our complete model, which offers opportunities to transform big data into knowledge.
|
33 |
Compaction Strategies in Apache Cassandra : Analysis of Default Cassandra Stress Model / Ravu, Venkata Sathya Sita J S. January 2016.
Context. The present trend in a large variety of applications, ranging from the web and social networking to telecommunications, is to gather and process very large and fast-growing amounts of information, leading to a common set of problems known collectively as "Big Data". Over the last decade, the ability to run large-scale data analytics over large numbers of data sets has proved to be a competitive advantage in a wide range of industries such as retail, telecom and defense. In response to this trend, the research community and the IT industry have proposed a number of platforms to facilitate large-scale data analytics. Such platforms include a new class of databases, often referred to as NoSQL data stores. Apache Cassandra is one such NoSQL data store. This research is focused on analyzing the performance of different compaction strategies in different use cases for the default Cassandra stress model. Objectives. The performance of compaction strategies is observed in various scenarios on the basis of three use cases, write-heavy (90/10), read-heavy (10/90) and balanced (50/50), for the default Cassandra stress model, so as to finally provide the necessary events and specifications that suggest when to switch from one compaction strategy to another. Methods. A single-node Cassandra deployment is set up on a web server and its read and write performance under different compaction strategies is studied with read-heavy, write-heavy and balanced workloads. Its performance metrics are collected and analyzed. Results. Performance metrics of the different compaction strategies are evaluated and analyzed. Conclusions. Based on a detailed analysis and logical comparison, we conclude that the leveled compaction strategy performs better for a read-heavy (10/90) workload under the default Cassandra stress model, as compared to the size-tiered and date-tiered compaction strategies, while for the balanced workload the date-tiered compaction strategy performs better than the other two strategies.
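As a hedged illustration of what such a strategy switch looks like in practice (not part of the thesis), the snippet below uses the DataStax Python driver to change a table's compaction strategy on a single-node cluster; the keyspace and table names follow the usual cassandra-stress defaults and are assumptions.

```python
# Hedged sketch: switch the compaction strategy for a table written by the
# default cassandra-stress model. keyspace1.standard1 is the usual
# cassandra-stress default, assumed here rather than taken from the thesis.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # single-node deployment, as in the study
session = cluster.connect()

# Leveled compaction, suggested above for the read-heavy (10/90) workload;
# alternatives are 'SizeTieredCompactionStrategy' and 'DateTieredCompactionStrategy'.
session.execute("""
    ALTER TABLE keyspace1.standard1
    WITH compaction = {'class': 'LeveledCompactionStrategy'}
""")

cluster.shutdown()
```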
|
34 |
Exploiting Application Characteristics for Efficient System Support of Data-Parallel Machine Learning / Cui, Henggang. 01 May 2017.
Large-scale machine learning has many characteristics that can be exploited in system design to improve efficiency. This dissertation demonstrates that the characteristics of ML computations can be exploited in the design and implementation of parameter server systems, greatly improving efficiency, by an order of magnitude or more. We support this thesis statement with three case-study systems: IterStore, GeePS, and MLtuner. IterStore is an optimized parameter server system design that exploits the repeated data access pattern characteristic of ML computations. The designed optimizations allow IterStore to reduce the total run time of our ML benchmarks by up to 50×. GeePS is a parameter server that is specialized for deep learning on distributed GPUs. By exploiting the layer-by-layer data access and computation pattern of deep learning, GeePS provides almost linear scalability from single-machine baselines (13× more training throughput with 16 machines), and also supports neural networks that do not fit in GPU memory. MLtuner is a system for automatically tuning the training tunables of ML tasks. It exploits the characteristic that the best tunable settings can often be decided quickly with just a short trial time. By making use of optimization-guided online trial-and-error, MLtuner can robustly find and re-tune tunable settings for a variety of machine learning applications, including image classification, video classification, and matrix factorization, and is over an order of magnitude faster than traditional hyperparameter tuning approaches.
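To make the parameter-server interface these systems build on more concrete, here is a deliberately simplified, single-process sketch; the names and the additive-update API are assumptions, and none of IterStore's or GeePS's actual optimizations are shown.

```python
import numpy as np

class ParameterServerShard:
    """Toy in-process stand-in for one parameter-server shard: workers read
    parameter rows and push additive updates each iteration."""
    def __init__(self):
        self.table = {}                                   # key -> parameter row

    def read_row(self, key, dim):
        return self.table.setdefault(key, np.zeros(dim)).copy()

    def update_row(self, key, delta):
        self.table[key] = self.table.get(key, np.zeros_like(delta)) + delta

def worker_iteration(ps, keys, grad_fn, lr=0.1, dim=8):
    # The same keys are read every iteration: the repeated-access pattern
    # that a system like IterStore can exploit for prefetching and caching.
    params = {k: ps.read_row(k, dim) for k in keys}
    grads = grad_fn(params)                               # user-supplied gradients
    for k, g in grads.items():
        ps.update_row(k, -lr * g)                         # push additive update
```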
|
35 |
Performance Optimization Techniques and Tools for Distributed Graph Processing / Kalavri, Vasiliki. January 2016.
In this thesis, we propose optimization techniques for distributed graph processing. First, we describe a data processing pipeline that leverages an iterative graph algorithm for automatic classification of web trackers. Using this application as a motivating example, we examine how asymmetrical convergence of iterative graph algorithms can be used to reduce the amount of computation and communication in large-scale graph analysis. We propose an optimization framework for fixpoint algorithms and a declarative API for writing fixpoint applications. Our framework uses a cost model to automatically exploit asymmetrical convergence and evaluate execution strategies during runtime. We show that our cost model achieves speedups of up to 1.7x and communication savings of up to 54%. Next, we propose to use the concepts of semi-metricity and the metric backbone to reduce the amount of data that needs to be processed in large-scale graph analysis. We provide a distributed algorithm for computing the metric backbone using the vertex-centric programming model. Using the backbone, we can reduce graph sizes by up to 88% and achieve speedups of up to 6.7x.
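For intuition, the sketch below computes the metric backbone of a small weighted graph in the straightforward centralized way: an edge is semi-metric, and is dropped, when some indirect path is strictly shorter than the edge itself. This is only an illustration, not the distributed vertex-centric algorithm proposed in the thesis.

```python
import networkx as nx

def metric_backbone(G):
    """Return a copy of G without semi-metric edges. An edge (u, v) is
    semi-metric if the shortest-path distance between u and v in G is
    strictly smaller than the edge's own weight. Assumes a weighted graph."""
    backbone = G.copy()
    for u, v, w in G.edges(data="weight"):
        if nx.shortest_path_length(G, u, v, weight="weight") < w:
            backbone.remove_edge(u, v)           # a shorter detour exists
    return backbone

# Tiny example: the direct edge a-c (weight 10) is redundant because the
# detour a-b-c has total weight 3, so it is removed from the backbone.
G = nx.Graph()
G.add_weighted_edges_from([("a", "b", 1), ("b", "c", 2), ("a", "c", 10)])
print(metric_backbone(G).edges())                # [('a', 'b'), ('b', 'c')]
```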
|
36 |
Machine Learning Algorithms with Big Medicare Fraud Data. Unknown Date.
Healthcare is an integral component of people's lives, especially for the rising elderly population, and must be affordable. The United States Medicare program is vital in serving the needs of the elderly. The growing number of people enrolled in the Medicare program, along with the enormous volume of money involved, increases the appeal for, and risk of, fraudulent activities. For many real-world applications, including Medicare fraud, the interesting observations tend to be less frequent than the normative observations. This difference between the normal observations and those of interest can create highly imbalanced datasets. The problem of class imbalance, including the classification of rare cases under extreme class imbalance, is an important and well-studied area in machine learning. Work on the effects of class imbalance with big data in the real-world Medicare fraud application domain, however, is limited. In particular, detecting fraud in Medicare claims is critical to lessening the financial and personal impact of these transgressions. Fortunately, the healthcare domain is one area where the successful detection of fraud can garner meaningful positive results. The application of machine learning techniques, together with methods to mitigate the adverse effects of class imbalance and rarity, can be used to detect fraud and lessen its impact for all Medicare beneficiaries. This dissertation presents the application of machine learning approaches to detect Medicare provider claims fraud in the United States. We discuss novel techniques to process three big Medicare datasets and create a new, combined dataset, which includes mapping fraud labels associated with known excluded providers. We investigate the ability of machine learning techniques, unsupervised and supervised, to detect Medicare claims fraud and leverage data sampling methods to lessen the impact of class imbalance and increase fraud detection performance. Additionally, we extend the study of class imbalance to assess the impact of rare cases in big data for Medicare fraud detection. / Includes bibliography. / Dissertation (Ph.D.)--Florida Atlantic University, 2018. / FAU Electronic Theses and Dissertations Collection
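As a hedged sketch of two steps described above, labeling providers from a public exclusion list and then sampling to reduce class imbalance, the snippet below uses pandas; the file and column names (e.g. an `npi` provider identifier) are illustrative assumptions, not the dissertation's actual pipeline.

```python
import pandas as pd

# Assumed inputs: aggregated provider-level claims and a list of excluded providers.
claims = pd.read_csv("medicare_claims.csv")
excluded = pd.read_csv("exclusions.csv")

# Map fraud labels: a provider is labeled fraudulent if it appears on the exclusion list.
claims["fraud"] = claims["npi"].isin(excluded["npi"]).astype(int)

# Random under-sampling of the majority (non-fraud) class to a 10:1 ratio,
# one simple way to lessen the impact of class imbalance before training.
fraud = claims[claims["fraud"] == 1]
nonfraud = claims[claims["fraud"] == 0].sample(n=len(fraud) * 10, random_state=0)
balanced = pd.concat([fraud, nonfraud]).sample(frac=1, random_state=0)
```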
|
37 |
Operating system support for warehouse-scale computing / Schwarzkopf, Malte. January 2018.
Modern applications are increasingly backed by large-scale data centres. Systems software in these data centre environments, however, faces substantial challenges: the lack of uniform resource abstractions makes sharing and resource management inefficient, infrastructure software lacks end-to-end access control mechanisms, and work placement ignores the effects of hardware heterogeneity and workload interference. In this dissertation, I argue that uniform, clean-slate operating system (OS) abstractions designed to support distributed systems can make data centres more efficient and secure. I present a novel distributed operating system for data centres, focusing on two OS components: the abstractions for resource naming, management and protection, and the scheduling of work to compute resources. First, I introduce a reference model for a decentralised, distributed data centre OS, based on pervasive distributed objects and inspired by concepts in classic 1980s distributed OSes. Translucent abstractions free users from having to understand implementation details, but enable introspection for performance optimisation. Fine-grained access control is supported by combining storable, communicable identifier capabilities, and context-dependent, ephemeral handle capabilities. Finally, multi-phase I/O requests implement optimistically concurrent access to objects while supporting diverse application-level consistency policies. Second, I present the DIOS operating system, an implementation of my model as an extension to Linux. The DIOS system call API is centred around distributed objects, globally resolvable names, and translucent references that carry context-sensitive object meta-data. I illustrate how these concepts support distributed applications, and evaluate the performance of DIOS in microbenchmarks and a data-intensive MapReduce application. I find that it offers improved, fine-grained isolation of resources, while permitting flexible sharing. Third, I present the Firmament cluster scheduler, which generalises prior work on scheduling via minimum-cost flow optimisation. Firmament can flexibly express many scheduling policies using pluggable cost models; it makes high-quality placement decisions based on fine-grained information about tasks and resources; and it scales the flow-based scheduling approach to very large clusters. In two case studies, I show that Firmament supports policies that reduce colocation interference between tasks and that it successfully exploits flexibility in the workload to improve the energy efficiency of a heterogeneous cluster. Moreover, my evaluation shows that Firmament scales the minimum-cost flow optimisation to clusters of tens of thousands of machines while still making sub-second placement decisions.
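To illustrate the flow-based scheduling idea that Firmament generalises, here is a minimal min-cost flow formulation of task placement; the toy network and its costs are invented for illustration and are not Firmament's pluggable cost models.

```python
import networkx as nx

G = nx.DiGraph()
tasks = ["t0", "t1", "t2"]
machines = {"m0": 2, "m1": 1}              # machine -> slot capacity

for t in tasks:
    G.add_node(t, demand=-1)               # each task supplies one unit (must be placed)
G.add_node("sink", demand=len(tasks))      # all tasks must drain into placed slots

# Assumed placement costs; lower cost = better fit (e.g. less interference).
costs = {("t0", "m0"): 1, ("t0", "m1"): 5,
         ("t1", "m0"): 2, ("t1", "m1"): 1,
         ("t2", "m0"): 4, ("t2", "m1"): 1}
for (t, m), c in costs.items():
    G.add_edge(t, m, weight=c, capacity=1)
for m, slots in machines.items():
    G.add_edge(m, "sink", weight=0, capacity=slots)

flow = nx.min_cost_flow(G)                 # solve the placement as a flow problem
placements = {t: m for t in tasks for m in machines if flow[t].get(m, 0) == 1}
print(placements)
```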
|
38 |
An Evaluation of Deep Learning with Class Imbalanced Big Data. Unknown Date.
Effective classification with imbalanced data is an important area of research, as high class imbalance is naturally inherent in many real-world applications, e.g. anomaly detection. Modeling such skewed data distributions is often very difficult, and non-standard methods are sometimes required to combat these negative effects. These challenges have been studied thoroughly using traditional machine learning algorithms, but very little empirical work exists in the area of deep learning with class imbalanced big data. Following an in-depth survey of deep learning methods for addressing class imbalance, we evaluate various methods for addressing imbalance on the task of detecting Medicare fraud, a big data problem characterized by extreme class imbalance. Case studies herein demonstrate the impact of class imbalance on neural networks, evaluate the efficacy of data-level and algorithm-level methods, and achieve state-of-the-art results on the given Medicare data set. Results indicate that combining under-sampling and over-sampling maximizes both performance and efficiency. / Includes bibliography. / Thesis (M.S.)--Florida Atlantic University, 2019. / FAU Electronic Theses and Dissertations Collection
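A minimal sketch of the hybrid sampling the results favor, combining random under-sampling of the majority class with random over-sampling of the minority class, is shown below; the keep fraction and target ratio are assumed values, not those used in the thesis.

```python
import numpy as np

def hybrid_resample(X, y, majority_keep=0.1, minority_ratio=0.5, seed=0):
    """Under-sample the majority class (label 0) to a fraction of its size,
    then over-sample the minority class (label 1) with replacement until it
    makes up `minority_ratio` of the resampled training set."""
    rng = np.random.default_rng(seed)
    maj_idx = np.flatnonzero(y == 0)
    min_idx = np.flatnonzero(y == 1)
    maj_keep = rng.choice(maj_idx, size=int(len(maj_idx) * majority_keep), replace=False)
    n_min = int(len(maj_keep) * minority_ratio / (1 - minority_ratio))
    min_keep = rng.choice(min_idx, size=n_min, replace=True)
    idx = rng.permutation(np.concatenate([maj_keep, min_keep]))
    return X[idx], y[idx]
```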
|
39 |
Development of a Readiness Assessment Model for Evaluating Big Data Projects: Case Study of Smart City in Oregon, USA / Barham, Husam Ahmad. 29 May 2019.
The primary goal of this research is to help any organization that is planning to transition into the big data analytics era by providing a systematic and comprehensive model that the organization can use to better understand what factors influence big data projects, to assess its current status against those factors, and to determine what enhancements to its current capabilities are needed for optimal management of the factors influencing an upcoming big data project. However, big data applications are vast and cover many sectors, and while most of the factors influencing big data projects are common across sectors, some factors relate to the specific circumstances of each sector. Therefore, this research focuses on one sector only, the smart city sector; its generalizability to other sectors is discussed at the end of the research.
In this research, a literature review and expert feedback were used to identify the most critical factors influencing big data projects, with a focus on smart cities. Then, the HDM methodology was used to elicit experts' judgment and identify the relative importance of those factors. In addition, experts' feedback was used to identify the possible statuses an organization might have regarding each factor. Finally, a case study of four projects related to the City of Portland, Oregon, was conducted to demonstrate the practicality and value of the research model.
The research findings indicated that there are complicated internal and external, sometimes competing, factors affecting big data projects. The research identified 18 factors as being among the most important factors affecting smart-city-related big data projects. Those factors are grouped into four perspectives: people, technology, legal, and organization. Furthermore, the case study demonstrated how the model can pinpoint shortcomings in a city's capabilities before a project starts, and how addressing those shortcomings increases the chances of a successful big data project.
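To show how such a model might be applied in practice, the sketch below combines hypothetical factor weights (as an HDM-style elicitation could produce) with an organization's assessed status per factor to yield a readiness score and ranked capability gaps; every factor name and number here is invented for illustration and does not come from the dissertation.

```python
# Illustrative use of a readiness assessment model: a few assumed factors,
# grouped under the four perspectives, each with an elicited weight and an
# assessed status in [0, 1]. All values are made up for this example.
factors = {
    # factor: (perspective, weight, assessed status)
    "data_science_skills": ("people",       0.12, 0.40),
    "data_infrastructure": ("technology",   0.10, 0.70),
    "privacy_compliance":  ("legal",        0.08, 0.55),
    "executive_support":   ("organization", 0.09, 0.80),
}

# Readiness is the weighted sum of statuses; gaps rank where capability
# enhancements would matter most before the project starts.
readiness = sum(w * s for _, w, s in factors.values())
gaps = sorted(((w * (1 - s), name) for name, (_, w, s) in factors.items()),
              reverse=True)

print(f"readiness score (partial): {readiness:.3f}")
for gap, name in gaps:
    print(f"gap {gap:.3f}  {name}")
```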
|
40 |
Det binära guldet : en uppsats om big data och analytics [The binary gold: an essay on big data and analytics] / Hellström, Elin; Hemlin, My. January 2013.
The purpose of this study is to investigate the concepts of big data and analytics. Based on scientific theories about the concepts and on interviews with consulting firms, we examine how consulting firms perceive and use big data and analytics. To create a more nuanced picture, a healthcare organization has also been interviewed, giving insight into how it can benefit from big data and analytics. A number of important difficulties and success factors connected to both concepts are presented, and each difficulty is then linked to a success factor considered able to help solve that problem. The most relevant success factors identified are the availability of high-quality data together with knowledge and expertise in how to handle the data. Finally, the meaning of the concepts is clarified: big data is usually described in terms of the dimensions volume, variety and velocity, and analytics in most cases refers to descriptive and preventive analysis being carried out.
|