61 |
Screening and Engineering Phenotypes using Big Data Systems Biology
Huttanus, Herbert M. 20 September 2019
Biological systems display remarkable complexity that is not properly accounted for in small, reductionistic models. Increasingly, big data approaches using genomics, proteomics, metabolomics, etc. are being applied to predicting and modifying the emergent phenotypes produced by complex biological systems. In this research, several novel tools were developed to assist in the acquisition and analysis of biological big data for a variety of applications. In total, two entirely new tools were created and a third, relatively new method, was evaluated by applying it to questions of clinical importance. 1) To assist in the quantification of metabolites at the subcellular level, a strategy for localized in vivo enzymatic assays was proposed. A proof of concept for this strategy was conducted in which the local availability of acetyl-CoA in the peroxisomes of yeast was quantified by the production of polyhydroxybutyrate (PHB) using three heterologous enzymes. The resulting assay demonstrated the differences in acetyl-CoA availability in the peroxisomes under various culture conditions and genetic alterations. 2) To assist in the design of genetically modified microbe strains that are stable over many generations, software was developed to automate the selection of gene knockouts that would result in coupling cellular growth with production of a desired chemical. This software, called OptQuick, provides advantages over contemporary software for the same purpose. OptQuick can run considerably faster and uses a free optimization solver, GLPK. Knockout strategies generated by OptQuick were compared to case studies of similar strategies produced by contemporary programs. In these comparisons, OptQuick found many of the same gene targets for knockout. 3) To provide an inexpensive and non-invasive alternative for bladder cancer screening, Raman-based urinalysis was performed on clinical urine samples using Rametrix™ software.
Rametrix™ was previously developed and applied to other urinalysis applications, but this study was the first instance of applying this new technology to bladder cancer screening. Using a pool of 17 bladder cancer positive urine samples and 39 clinical samples exhibiting a range of healthy or other genitourinary disease phenotypes, Rametrix™ was able to detect bladder cancer with a sensitivity of 94% and a specificity of 54%. 4) Methods for urine sample preservation were tested with regard to their effect on subsequent analysis with Rametrix™. Specifically, sterile filtration was tested as a potential method for extending the duration for which samples may be kept at room temperature prior to Raman analysis. Sterile filtration was shown to alter the chemical profile initially, but did not prevent further shifts in chemical profile over time. In spite of this, both unfiltered and filtered urine samples alike could be used for screening for chronic kidney disease or bladder cancer even after being stored for 2 weeks at room temperature, making sterile filtration largely unnecessary. / Doctor of Philosophy / Biological systems display remarkable complexity that is not properly accounted for in conventional, reductionistic models. Thus, there is a growing trend in biological studies to use computational analysis on large databases of information such as genomes containing thousands of genes or chemical profiles containing thousands of metabolites in a single cell. In this research, several new tools were developed to assist with gathering and processing large biological datasets. In total, two entirely new tools were created and a third, relatively new method, was evaluated by applying it to questions of medical importance. The first two tools are for bioengineering applications.
Bioengineers often want to understand the complex chemical network of a cell’s metabolism and, ultimately, alter that network so as to force the cell to make more of a desired chemical like a biofuel or medicine. The first tool discussed in this dissertation offers a way to measure the concentration of key chemicals within a cell. Unlike previous methods for measuring these concentrations, however, this method limits its search to a specific compartment within the cell, which is important to many bioengineering strategies. The second technology discussed in this dissertation uses computer simulations of the cell’s entire metabolism to determine which genetic alterations might lead to better production of a chemical of interest. The third tool involves analyzing the chemical makeup of urine samples to screen for diseases such as bladder cancer. Two studies were conducted with this third tool. The first study shows that Raman spectroscopy can distinguish between bladder cancer and related diseases. The second study addresses whether sterilizing the urine samples through filtration is necessary to preserve the samples for analysis. It was found that filtration was neither beneficial nor necessary.
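As a worked illustration of the screening figures in this abstract: sensitivity is TP / (TP + FN) and specificity is TN / (TN + FP). The counts in the sketch below (16 of 17 cancer-positive samples flagged, 21 of 39 other samples cleared) are assumptions chosen only because they round to the reported 94% and 54%; the abstract does not state the underlying confusion matrix.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of disease-positive samples that the screen flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of disease-negative samples that the screen clears."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts, inferred from the rounded percentages only.
sens = sensitivity(true_pos=16, false_neg=1)    # 17 cancer-positive samples
spec = specificity(true_neg=21, false_pos=18)   # 39 other clinical samples

print(f"sensitivity = {sens:.0%}, specificity = {spec:.0%}")
```

Under these assumed counts, a high sensitivity paired with a modest specificity means few cancers are missed, while roughly half of the non-cancer samples would be sent for follow-up.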
|
62 |
On Grouped Observation Level Interaction and a Big Data Monte Carlo Sampling Algorithm
Hu, Xinran 26 January 2015
Big Data is transforming the way we live. From medical care to social networks, data is playing a central role in various applications. As the volume and dimensionality of datasets keep growing, designing effective data analytics algorithms emerges as an important research topic in statistics. In this dissertation, I will summarize our research on two data analytics algorithms: a visual analytics algorithm named Grouped Observation Level Interaction with Multidimensional Scaling and a big data Monte Carlo sampling algorithm named Batched Permutation Sampler. These two algorithms are designed to enhance the capability of generating meaningful insights and utilizing massive datasets, respectively. / Ph. D.
|
63 |
A Workload-aware Resource Management and Scheduling System for Big Data Analysis
Xu, Luna 05 February 2019
The big data era has driven the need for data analysis in every aspect of our daily lives. With the rapid growth of data size and complexity of data analysis models, modern big data analytic applications face the challenge of providing timely results, often with limited resources. Such demand drives the growth of new hardware resources including GPUs and FPGAs, as well as storage devices such as SSDs and NVMs. It is challenging to manage the resources available in a cost-restricted environment to best serve the applications with different characteristics. Extant approaches are agnostic to such heterogeneity in both underlying resources and workloads and require user knowledge and manual configuration for best performance. In this dissertation, we design and implement a series of novel techniques, algorithms, and frameworks to realize workload-aware resource management and scheduling. We demonstrate our techniques for efficient resource management across memory resources for in-memory data analytic platforms and processing resources for compute-intensive machine learning applications, and finally we design and develop a workload- and heterogeneity-aware scheduler for general big data platforms.
The dissertation demonstrates that designing an effective resource manager requires efforts from both the application and the system side. The presented approach makes and joins the efforts on both sides to provide a holistic heterogeneity-aware resource management and scheduling system. We are able to avoid task failure due to resource unavailability by workload-aware resource management, and improve the performance of data processing frameworks by carefully scheduling tasks according to the task characteristics and the utilization and availability of the resources. / Ph. D. / Clusters of multiple computers connected through the internet are often deployed in industry for larger-scale data processing or computation that cannot be handled by standalone computers. In such a cluster, resources such as CPU, memory, and disks are integrated to work together. It is important to manage a pool of such resources in a cluster so that they work together efficiently and provide better performance for the workloads running on top. This role is taken by a software component in the middle layer called the resource manager. The resource manager coordinates the resources in the computers and schedules tasks to them for computation. This dissertation reveals that current resource managers often partition resources statically and hence cannot capture the dynamic resource needs of workloads or the heterogeneous configurations of the underlying resources. For example, some computers in a cluster might be older than the others, with a slower CPU, less memory, etc. Workloads can show different resource needs. Watching YouTube requires a lot of network resources, while playing games demands powerful GPUs. To this end, the dissertation proposes novel approaches to managing resources that capture the heterogeneity of resources and the dynamic needs of workloads, and on that basis achieve efficient resource management and schedule the right task to the right resource.
|
64 |
Modeling and Analysis of Non-Linear Dependencies using Copulas, with Applications to Machine Learning
Karra, Kiran 21 September 2018
Many machine learning (ML) techniques rely on probability, random variables, and stochastic modeling. Although statistics pervades this field, there is a large disconnect between the copula modeling and the machine learning communities. Copulas are stochastic models that capture the full dependence structure between random variables and allow flexible modeling of multivariate joint distributions. Elidan was the first to recognize this disconnect, and introduced copula-based models to the ML community that demonstrated orders of magnitude better performance than non-copula-based models [Elidan, 2013]. However, the limitation of these models is that they are only applicable to continuous random variables, whereas real-world data is often naturally modeled jointly as continuous and discrete. This report details our work in bridging this gap of modeling and analyzing data that is jointly continuous and discrete using copulas.
Our first research contribution details modeling of jointly continuous and discrete random variables using the copula framework with Bayesian networks, termed Hybrid Copula Bayesian Networks (HCBN) [Karra and Mili, 2016], a continuation of Elidan’s work on Copula Bayesian Networks [Elidan, 2010]. In this work, we extend the theorems proved by Nešlehová [2007] from bivariate to multivariate copulas with discrete and continuous marginal distributions. Using the multivariate copula with discrete and continuous marginal distributions as a theoretical basis, we construct an HCBN that can model all possible permutations of discrete and continuous random variables for parent and child nodes, unlike the popular conditional linear Gaussian network model. Finally, we demonstrate on numerous synthetic datasets and a real life dataset that our HCBN compares favorably, from a modeling and flexibility viewpoint, to other hybrid models including the conditional linear Gaussian and the mixture of truncated exponentials models.
Our second research contribution then deals with the analysis side, and discusses how one may use copulas for exploratory data analysis. To this end, we introduce a nonparametric copula-based index for detecting the strength and monotonicity structure of linear and nonlinear statistical dependence between pairs of random variables or stochastic signals. Our index, termed Copula Index for Detecting Dependence and Monotonicity (CIM), satisfies several desirable properties of measures of association, including Rényi’s properties, the data processing inequality (DPI), and consequently self-equitability. Synthetic data simulations reveal that the statistical power of CIM compares favorably to other state-of-the-art measures of association that are proven to satisfy the DPI. Simulation results with real-world data reveal CIM’s unique ability to detect the monotonicity structure among stochastic signals to find interesting dependencies in large datasets. Additionally, simulations show that CIM performs favorably compared to estimators of mutual information when discovering Markov network structure.
Our third research contribution deals with how to assess an estimator’s performance, in the scenario where multiple estimates of the strength of association between random variables need to be rank ordered. More specifically, we introduce a new property of estimators of the strength of statistical association, which helps characterize how well an estimator will perform in scenarios where dependencies between continuous and discrete random variables need to be rank ordered. The new property, termed the estimator response curve, is easily computable and provides a marginal distribution agnostic way to assess an estimator’s performance. It overcomes notable drawbacks of current metrics of assessment, including statistical power, bias, and consistency. We utilize the estimator response curve to test various measures of the strength of association that satisfy the data processing inequality (DPI), and show that the CIM estimator’s performance compares favorably to kNN, vME, AP, and HMI estimators of mutual information. The estimators which were identified to be suboptimal, according to the estimator response curve, perform worse than the more optimal estimators when tested with real-world data from four different areas of science, all with varying dimensionalities and sizes. / Ph. D. / Many machine learning (ML) techniques rely on probability, random variables, and stochastic modeling. Although statistics pervades this field, many of the traditional machine learning techniques rely on linear statistical techniques and models. For example, the correlation coefficient, a widely used construct in modern data analysis, is only a measure of linear dependence and cannot fully capture non-linear interactions. In this dissertation, we aim to address some of these gaps, and how they affect machine learning performance, using the mathematical construct of copulas.
Our first contribution deals with accurate probabilistic modeling of real-world data, where the underlying data is both continuous and discrete. We show that even though the copula construct has some limitations with respect to discrete data, it is still amenable to modeling large real-world datasets probabilistically. Our second contribution deals with analysis of non-linear datasets. Here, we develop a new measure of statistical association that can handle discrete, continuous, or combinations of such random variables that are related by any general association pattern. We show that our new metric satisfies several desirable properties and compare its performance to other measures of statistical association. Our final contribution attempts to provide a framework for understanding how an estimator of statistical association will affect end-to-end machine learning performance. Here, we develop a new way to characterize the performance of an estimator of statistical association, termed the estimator response curve. We then show that the estimator response curve can help predict how well an estimator performs in algorithms which require statistical associations to be rank ordered.
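The abstract's remark that the correlation coefficient only measures linear dependence can be illustrated with a small sketch. It is not the CIM index from the dissertation; it only shows (a) a case where Pearson correlation is essentially zero even though one variable is a deterministic function of the other, and (b) the rank transform to "pseudo-observations" on which copula-based methods generally operate. The function names and the choice of y = x² are assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)


def pseudo_observations(values):
    """Rank-transform values into (0, 1); assumes there are no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return [r / (len(values) + 1) for r in ranks]


# y is fully determined by x, yet the linear correlation vanishes
# because the relationship is symmetric around zero.
x = [i / 100 for i in range(-100, 101)]
y = [v * v for v in x]
r_full = pearson(x, y)

# On the monotone branch x >= 0 the ranks of x and y coincide, so the
# correlation of the pseudo-observations (Spearman's rho) is 1.
x_pos = [v for v in x if v >= 0]
y_pos = [v * v for v in x_pos]
r_rank = pearson(pseudo_observations(x_pos), pseudo_observations(y_pos))

print(f"Pearson on full range: {r_full:.3f}")
print(f"Rank correlation on monotone branch: {r_rank:.3f}")
```

The contrast between the two numbers is the motivation for indices that track both the strength and the monotonicity structure of a dependence, rather than a single linear summary.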
|
65 |
Large Web Archive Collection Infrastructure and Services
Wang, Xinyue 20 January 2023
The web has evolved to be the primary carrier of human knowledge during the information age. The ephemeral nature of much web content makes web knowledge preservation vital in preserving human knowledge and memories. Web archives are created to preserve the current web and make it available for future reuse. A growing number of web archive initiatives are actively engaging in web archiving activities. Web archiving standards like WARC, for formatted storage, have been established to standardize the preservation of web archive data. In addition to its preservation purpose, web archive data is also used as a source for research and for lost information recovery. However, the reuse of web archive data is inherently challenging because of the scale of data size and requirements of big data tools to serve and analyze web archive data efficiently.
In this research, we propose to build web archive infrastructure that can support efficient and scalable web archive reuse with big data formats like Parquet, enabling more efficient quantitative data analysis and browsing services. Upon the Hadoop big data processing platform with components like Apache Spark and HBase, we propose to replace the WARC (web archive) data format with a columnar data format Parquet to facilitate more efficient reuse. Such a columnar data format can provide the same features as WARC for long-term preservation. In addition, the columnar data format introduces the potential for better computational efficiency and data reuse flexibility. The experiments show that this proposed design can significantly improve quantitative data analysis tasks for common web archive data usage. This design can also serve web archive data for a web browsing service. Unlike the conventional web hosting design for large data, this design primarily works on top of the raw large data in file systems to provide a hybrid environment around web archive reuse. In addition to the standard web archive data, we also integrate Twitter data into our design as part of web archive resources. Twitter is a prominent source of data for researchers in a variety of fields and an integral element of the web's history. However, Twitter data is typically collected through non-standardized tools for different collections. We aggregate the Twitter data from different sources and integrate it into the suggested design for reuse. We are able to greatly increase the processing performance of workloads around social media data by overcoming the data loading bottleneck with a web-archive-like Parquet data format. / Doctor of Philosophy / The web has evolved to be the primary carrier of human knowledge during the information age. The ephemeral nature of much web content makes web knowledge preservation vital in preserving human knowledge and memories.
Web archives are created to preserve the current web and make it available for future reuse. In addition to its preservation purpose, web archive data is also used as a source for research and for lost information discovery. However, the reuse of web archive data is inherently challenging because of the scale of data size and requirements of big data tools to serve and analyze web archive data efficiently.
In this research, we propose to build a web archive big data processing infrastructure that can support efficient and scalable web archive reuse like quantitative data analysis and browsing services. We adopt industry frameworks and tools to establish a platform that can provide high-performance computation for web archive initiatives and users. We propose to convert the standard web archive data file format to a columnar data format for efficient future reuse. Our experiments show that our proposed design can significantly improve quantitative data analysis tasks for common web archive data usage. Our design can also serve an efficient web browsing service without adopting a sophisticated web hosting architecture. In addition to the standard web archive data, we also integrate Twitter data into our design as a unique web archive resource. Twitter is a prominent source of data for researchers in a variety of fields and an integral element of the web's history. We aggregate the Twitter data from different sources and integrate it into the suggested design for reuse. We are able to greatly increase the processing performance of workloads around social media data by overcoming the data loading bottleneck with a web-archive-like Parquet data format.
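The benefit of a columnar layout for quantitative analysis can be sketched in a few lines of plain Python. This is only an illustration of the layout idea, not the dissertation's pipeline and not the Parquet format itself (which adds encoding, compression, and row groups on top); the record fields `url`, `status`, and `payload` are hypothetical.

```python
# Row-oriented view: each archived capture is one record, as in a
# record-by-record archive file (field names are hypothetical).
records = [
    {"url": "http://example.org/a", "status": 200, "payload": "<html>a</html>"},
    {"url": "http://example.org/b", "status": 404, "payload": ""},
    {"url": "http://example.org/c", "status": 200, "payload": "<html>c</html>"},
]


def to_columns(rows):
    """Pivot a list of uniform records into one list per field."""
    columns = {key: [] for key in rows[0]}
    for row in rows:
        for key, value in row.items():
            columns[key].append(value)
    return columns


columns = to_columns(records)

# An analytical query such as "how many captures returned HTTP 200?"
# now touches only the small `status` column and never reads payloads.
ok_count = sum(1 for status in columns["status"] if status == 200)
print(ok_count)
```

In a row layout the same query has to walk every record, including the large payload field, which is one plausible reading of the data loading bottleneck the abstract mentions.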
|
66 |
Remote High Performance Visualization of Big Data for Immersive Science
Abidi, Faiz Abbas 15 June 2017
Remote visualization has emerged as a necessary tool in the analysis of big data. High-performance computing clusters can provide several benefits in scaling to larger data sizes, from parallel file systems to larger RAM profiles to parallel computation among many CPUs and GPUs. For scalable data visualization, remote visualization tools and infrastructure are critical, where only pixels and interaction events are sent over the network instead of the data. In this paper, we present our pipeline using VirtualGL, TurboVNC, and ParaView to render over 40 million points using remote HPC clusters and project over 26 million pixels in a CAVE-style system. We benchmark the system by varying the video stream compression parameters supported by TurboVNC and establish some best practices for typical usage scenarios. This work will help research scientists and academicians in scaling their big data visualizations for real-time interaction. / Master of Science / With advancements made in the technology sector, there are now improved and more scientific ways to see data. Ten years ago, few people knew what a 3D movie was or how it would feel to watch one. Some may even have questioned whether it was possible. But watching 3D cinema is typical now, and we do not care much about what goes on behind the scenes to make this experience possible. Similarly, is it possible to see and interact with 3D data in the same way Tony Stark does in the movie Iron Man? The answer is yes: it is possible with several tools available now. One of these tools is called ParaView, which is mostly used for scientific visualization of data from fields such as climate research, computational fluid dynamics, and astronomy, among others. You can either visualize this data on a 2D screen or in a 3D environment where a user will feel a sense of immersion, as if they are within the scene looking at and interacting with the data. But where is this data actually drawn?
And how much time does it take to draw if we are dealing with large datasets? Do we want to draw all this 3D data on a local machine, or can we make use of powerful remote machines that do the drawing part and send the final image through a network to the client? In most cases, drawing on a remote machine is a better solution when dealing with big data, and the biggest bottleneck is how fast data can be sent to and received from the remote machines. In this work, we seek to understand the best practices of drawing big data on remote machines using ParaView and visualizing it in a 3D projection room like a CAVE (see section 2.2 for details on what a CAVE is).
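A back-of-the-envelope calculation suggests why the video stream compression parameters matter for a display of this size. Only the 26 million pixel figure comes from the abstract; the 24-bit colour depth and the 30 frames per second target are assumptions made for this sketch.

```python
PIXELS = 26_000_000        # from the abstract: "over 26 million pixels"
BYTES_PER_PIXEL = 3        # assumption: uncompressed 24-bit RGB
FRAMES_PER_SECOND = 30     # assumption: an interactive frame rate target

# Size of one uncompressed frame and the resulting network bit rate.
bytes_per_frame = PIXELS * BYTES_PER_PIXEL
gbit_per_second = bytes_per_frame * FRAMES_PER_SECOND * 8 / 1e9

print(f"{bytes_per_frame / 1e6:.0f} MB per frame")
print(f"{gbit_per_second:.2f} Gbit/s uncompressed")
```

Under these assumptions the raw stream would need well over 10 Gbit/s, so sending pixels rather than data is only practical once the frames are compressed, which is consistent with the abstract's focus on benchmarking compression settings.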
|
67 |
A Framework for Hadoop Based Digital Libraries of Tweets
Bock, Matthew 17 July 2017
The Digital Library Research Laboratory (DLRL) has collected over 1.5 billion tweets for the Integrated Digital Event Archiving and Library (IDEAL) and Global Event Trend Archive Research (GETAR) projects. Researchers across varying disciplines have an interest in leveraging DLRL's collections of tweets for their own analyses. However, due to the steep learning curve involved with the required tools (Spark, Scala, HBase, etc.), simply converting the Twitter data into a workable format can be a cumbersome task in itself. This prompted the effort to build a framework that will help in developing code to analyze the Twitter data, run on arbitrary tweet collections, and enable developers to leverage projects designed with this general use in mind. The intent of this thesis work is to create an extensible framework of tools and data structures to represent Twitter data at a higher level and eliminate the need to work with raw text, so as to make the development of new analytics tools faster, easier, and more efficient.
To represent this data, several data structures were designed to operate on top of the Hadoop and Spark libraries of tools. The first set of data structures is an abstract representation of a tweet at a basic level, as well as several concrete implementations which represent varying levels of detail to correspond with common sources of tweet data. The second major data structure is a collection structure designed to represent collections of tweet data structures and provide ways to filter, clean, and process the collections. All of these data structures went through an iterative design process based on the needs of the developers.
The effectiveness of this effort was demonstrated in four distinct case studies. In the first case study, the framework was used to build a new tool that selects Twitter data from DLRL's archive of tweets, cleans those tweets, and performs sentiment analysis within the topics of a collection's topic model. The second case study applies the provided tools for the purpose of sociolinguistic studies. The third case study explores large datasets to accumulate all possible analyses on the datasets. The fourth case study builds metadata by expanding the shortened URLs contained in the tweets and storing them as metadata about the collections. The framework proved to be useful and cut development time for all four of the case studies. / Master of Science / The Digital Library Research Laboratory (DLRL) has collected over 1.5 billion tweets for the Integrated Digital Event Archiving and Library (IDEAL) and Global Event Trend Archive Research (GETAR) projects. Researchers across varying disciplines have an interest in leveraging DLRL’s collections of tweets for their own analyses. However, due to the steep learning curve involved with the required tools, simply converting the Twitter data into a workable format can be a cumbersome task in itself. This prompted the effort to build a programming framework that will help in developing code to analyze the Twitter data, run on arbitrary tweet collections, and enable developers to leverage projects designed with this general use in mind. The intent of this thesis work is to create an extensible framework of tools and data structures to represent Twitter data at a higher level and eliminate the need to work with raw text, so as to make the development of new analytics tools faster, easier, and more efficient.
The effectiveness of this effort was demonstrated in four distinct case studies. In the first case study, the framework was used to build a new tool that selects Twitter data from DLRL’s archive of tweets, cleans those tweets, and performs sentiment analysis within the topics of a collection’s topic model. The second case study applies the provided tools for the purpose of sociolinguistic studies. The third case study explores large datasets to accumulate all possible analyses on the datasets. The fourth case study builds metadata by expanding the shortened URLs contained in the tweets and storing them as metadata about the collections. The framework proved to be useful and cut development time for all four of the case studies.
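The two kinds of data structures described in this abstract (an abstract tweet with concrete implementations, and a collection that can be filtered and cleaned) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the thesis framework, which targets the Hadoop and Spark tools; every class, method, and field name here is hypothetical.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Iterable, List


class Tweet(ABC):
    """Abstract representation of a tweet at a basic level."""

    @abstractmethod
    def tweet_id(self) -> str: ...

    @abstractmethod
    def text(self) -> str: ...


@dataclass(frozen=True)
class BasicTweet(Tweet):
    """Concrete tweet carrying only an identifier and its text."""

    id_str: str
    body: str

    def tweet_id(self) -> str:
        return self.id_str

    def text(self) -> str:
        return self.body


_URL = re.compile(r"https?://\S+")


def _clean_text(text: str) -> str:
    """Remove URLs and collapse runs of whitespace."""
    return " ".join(_URL.sub("", text).split())


class TweetCollection:
    """Collection of tweets with chainable filter and clean steps."""

    def __init__(self, tweets: Iterable[Tweet]):
        self._tweets: List[Tweet] = list(tweets)

    def filter(self, predicate: Callable[[Tweet], bool]) -> "TweetCollection":
        return TweetCollection(t for t in self._tweets if predicate(t))

    def clean(self) -> "TweetCollection":
        return TweetCollection(
            BasicTweet(t.tweet_id(), _clean_text(t.text())) for t in self._tweets
        )

    def texts(self) -> List[str]:
        return [t.text() for t in self._tweets]


collection = TweetCollection([
    BasicTweet("1", "Flood warning  http://t.co/abc now"),
    BasicTweet("2", "RT hello"),
])

# Drop retweets, then strip URLs, without touching raw text by hand.
cleaned = collection.filter(lambda t: not t.text().startswith("RT ")).clean()
print(cleaned.texts())
```

Each step returns a new collection, so analyses can be chained without mutating the source data; that choice is made here for readability and is not claimed to match the original design.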
|
68 |
Big data, data mining, and machine learning: value creation for business leaders and practitioners
Dean, J. January 2014
No / Big data is big business. But having the data and the computational power to process it isn't nearly enough to produce meaningful results. Big Data, Data Mining, and Machine Learning: Value Creation for Business Leaders and Practitioners is a complete resource for technology and marketing executives looking to cut through the hype and produce real results that hit the bottom line. Providing an engaging, thorough overview of the current state of big data analytics and the growing trend toward high performance computing architectures, the book is a detail-driven look into how big data analytics can be leveraged to foster positive change and drive efficiency.
With continued exponential growth in data and ever more competitive markets, businesses must adapt quickly to gain every competitive advantage available. Big data analytics can serve as the linchpin for initiatives that drive business, but only if the underlying technology and analysis is fully understood and appreciated by engaged stakeholders.
|
69 |
A look at the potential of big data in nurturing intuition in organisational decision makers
Hussain, Zahid I., Asad, M. January 2017
Yes / As big data (BD) and data analytics gain significance, the industry expects that executives will eventually move towards evidence-based decision making. The hope is to achieve more sustainable competitive advantage for their organisations. A key question is whether executives make decisions by intuition. This leads to another question: whether big data would ever substitute for human intuition. In this research, the ‘mind-set’ of executives about the application and limitations of big data is investigated by taking into account their decision-making behaviour. The aim is to look deeply into how BD technologies facilitate greater intuitiveness in executives, and consequently lead to faster and sustainable business growth.
|
70 |
What does Big Data has in-store for organisations: An Executive Management Perspective
Hussain, Zahid I., Asad, M., Alketbi, R. January 2017
No / With a cornucopia of literature on Big Data and Data Analytics, it has become a recent buzzword. The literature is full of hymns of praise for big data and its potential applications. However, some of the latest published material exposes the challenges involved in implementing a Big Data (BD) approach, where the uncertainty surrounding its applications is rendering it ineffective. The paper looks at the mind-sets and perspectives of executives and their plans for using Big Data for decision making. Our data collection involved interviewing senior executives from a number of world-class organisations in order to determine their understanding of big data, its limitations and applications. The information gathered is used to analyse how well executives understand big data and how ready organisations are to use it effectively for decision making. The aim is to provide a realistic outlook on the usefulness of this technology and help organisations to make suitable and realistic decisions on investing in it.
Professionals and academics are becoming increasingly interested in the field of big data (BD) and data analytics. Companies invest heavily into acquiring data, and analysing it. More recently the focus has switched towards data available through the internet which appears to have brought about new data collection opportunities. As the smartphone market developed further, data sources extended to include those from mobile and sensor networks. Consequently, organisations started using the data and analysing it. Thus, the field of business intelligence emerged, which deals with gathering data, and analysing it to gain insights and use them to make decisions (Chen, et al., 2012).
BD is seen to have immense potential to provide powerful information to businesses. Accenture (2015) claims that organisations are extremely satisfied with their BD projects concerned with enhancing their customer reach. Davenport (2006) has presented applications in which companies are using the power of data analytics to consistently predict behaviours and develop applications that enable them to unearth important yet difficult-to-see customer preferences, and evolve rapidly to generate revenues.
|