461

Middleware for online scientific data analytics at extreme scale

Zheng, Fang 22 May 2014 (has links)
Scientific simulations running on High End Computing machines in domains like Fusion, Astrophysics, and Combustion now routinely generate terabytes of data in a single run, and these data volumes are only expected to increase. Since such massive simulation outputs are key to scientific discovery, the ability to rapidly store, move, analyze, and visualize data is critical to scientists' productivity. Yet there are already serious I/O bottlenecks on current supercomputers, and the movement toward the Exascale is further accelerating this trend. This dissertation is concerned with the design, implementation, and evaluation of middleware-level solutions that enable high-performance, resource-efficient online data analytics to process massive simulation output data at large scales. Online data analytics can effectively overcome the I/O bottleneck for scientific applications at large scales by processing data as it moves through the I/O path. Online analytics can extract valuable insights from live simulation output in a timely manner, better prepare data for subsequent deep analysis and visualization, and achieve improved performance and reduced data movement cost (both in time and in power) compared to the conventional post-processing paradigm. The thesis identifies the key challenges for online data analytics based on the needs of a variety of large-scale scientific applications, and proposes a set of novel and effective approaches to efficiently program, distribute, and schedule online data analytics along the critical I/O path. In particular, the solution approach i) provides a high-performance data movement substrate to support parallel and complex data exchanges between simulations and online data analytics, ii) enables placement flexibility of analytics to exploit distributed resources, iii) for co-placement of analytics with simulation codes on the same nodes, uses fine-grained scheduling to harvest idle resources for running online analytics with minimal interference to the simulation, and iv) supports scalable, efficient online spatial indices to accelerate data analytics and visualization on the deep memory hierarchies of high-end machines. Our middleware approach is evaluated with leadership scientific applications in domains like Fusion, Combustion, and Molecular Dynamics, and on different High End Computing platforms. Substantial improvements are demonstrated in end-to-end application performance and in resource efficiency at scales of up to 16,384 cores, for a broad range of analytics and visualization codes. The outcome is a useful and effective software platform for online scientific data analytics that facilitates large-scale scientific data exploration.
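As a rough illustration of the co-placement idea in iii), the sketch below overlaps a simulation with an in-transit analytics consumer through a bounded staging buffer; the simulate and analyze functions, the queue-based hand-off, and the mean reduction are illustrative assumptions, not the dissertation's actual middleware API.

```python
# Minimal sketch (not the dissertation's middleware): a simulation thread
# streams output chunks into a bounded in-memory staging buffer while an
# analytics worker consumes them concurrently, instead of post-processing
# files after the run.
import queue
import threading

staging = queue.Queue(maxsize=8)   # bounded buffer standing in for the I/O path

def simulate(steps):
    for step in range(steps):
        chunk = [step * i for i in range(1000)]   # stand-in for simulation output
        staging.put((step, chunk))                # hand off to online analytics
    staging.put(None)                             # signal end of run

def analyze():
    while True:
        item = staging.get()
        if item is None:
            break
        step, chunk = item
        mean = sum(chunk) / len(chunk)            # cheap in-transit reduction
        print(f"step {step}: mean = {mean:.1f}")

producer = threading.Thread(target=simulate, args=(4,))
consumer = threading.Thread(target=analyze)
producer.start(); consumer.start()
producer.join(); consumer.join()
```

In the middleware itself, the hand-off would go through the high-performance data movement substrate described above rather than an in-process queue.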
462

Cost-effective and privacy-conscious cloud service provisioning: architectures and algorithms

Palanisamy, Balaji 27 August 2014 (has links)
Cloud computing represents a recent paradigm shift that enables users to share and remotely access high-powered computing resources (both infrastructure and software/services) housed in off-site data centers, thereby allowing a more efficient use of hardware and software infrastructures. This growing trend in cloud computing, combined with the demands for Big Data and Big Data analytics, is driving the rapid evolution of datacenter technologies toward more cost-effective, consumer-driven, privacy-conscious, and technology-agnostic solutions. This dissertation takes a systematic approach to developing system-level techniques and algorithms that tackle the challenges of large-scale data processing in the Cloud and of scaling and delivering privacy-aware services with anytime-anywhere availability. We analyze the key challenges in effective provisioning of Cloud services in the context of MapReduce-based parallel data processing, considering the concerns of cost-effectiveness, performance guarantees, and user privacy, and we develop a suite of solution techniques, architectures, and models to support cost-optimized and privacy-preserving service provisioning in the Cloud. At the cloud resource provisioning tier, we develop a utility-driven MapReduce Cloud resource planning and management system called Cura for cost-optimally allocating resources to jobs. While existing services require users to select a number of complex cluster and job parameters and use those potentially sub-optimal per-job configurations, Cura's resource management achieves global resource optimization in the cloud by minimizing cost and maximizing resource utilization. We also address the challenges of resource management and job scheduling for large-scale parallel data processing in the Cloud in the presence of the networking and storage bottlenecks commonly experienced in Cloud data centers. We develop Purlieus, a self-configurable, locality-based data and virtual machine management framework that enables MapReduce jobs to access their data, including all input, output, and intermediate data, either locally or from close-by nodes, achieving significant improvements in job response time. We then extend our cloud resource management framework to support privacy-preserving data access and efficient privacy-conscious query processing. Concretely, we propose and implement VNCache, an efficient solution for MapReduce analysis of cloud-archived log data for privacy-conscious enterprises. Through a seamless data streaming and prefetching model in VNCache, Hadoop jobs begin execution as soon as they are launched, without requiring any a priori downloading. At the cloud consumer tier, we develop mix-zone based techniques for delivering anonymous cloud services to mobile users on the move through Mobimix, a novel road-network mix-zone based framework that enables real-time, location-based service delivery without compromising the content or location privacy of consumers.
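The locality goal behind Purlieus can be illustrated with a minimal placement heuristic; the place_vm function, the host dictionary, and the locality-then-load scoring below are assumptions made for illustration, not the framework's actual algorithm.

```python
# Illustrative sketch (not Purlieus itself): prefer hosts that already hold a
# job's input blocks, breaking ties by choosing the least-loaded host.
def place_vm(job_blocks, hosts):
    """hosts: dict mapping host name -> {"blocks": set of block ids, "load": float}."""
    def score(host):
        info = hosts[host]
        locality = len(job_blocks & info["blocks"])   # co-located input blocks
        return (locality, -info["load"])              # locality first, then low load
    return max(hosts, key=score)

hosts = {
    "h1": {"blocks": {"b1", "b2"}, "load": 0.7},
    "h2": {"blocks": {"b3"},       "load": 0.2},
}
print(place_vm({"b1", "b2", "b4"}, hosts))   # -> "h1" (holds most of the job's data)
```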
463

BIG DATA: From hype to reality

Danesh, Sabri January 2014 (has links)
Big data is all of a sudden everywhere. It is too big to ignore! It has been six decades since the computer revolution, four decades since the development of the microchip, and two decades since the advent of the modern Internet. More than a decade after the “.com” fizz of the 1990s, can Big Data be the next Big Bang? Big data reveals part of our daily lives. It has the potential to help solve virtually any problem for a better urbanized globe. Big Data sources are also very interesting from an official statistics point of view. The purpose of this paper is to explore the conceptions of big data and the opportunities and challenges associated with using big data, especially in official statistics. “A petabyte is the equivalent of 1,000 terabytes, or a quadrillion bytes. One terabyte is a thousand gigabytes. One gigabyte is made up of a thousand megabytes. There are a thousand thousand—i.e., a million—petabytes in a zettabyte” (Shaw 2014). And this is to be continued…
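The quoted ladder uses decimal (SI) prefixes; a quick sanity check of the arithmetic, as a minimal illustrative snippet:

```python
# Check the quoted unit ladder using decimal (SI) prefixes.
KB, MB, GB = 1e3, 1e6, 1e9
TB, PB, ZB = 1e12, 1e15, 1e21

assert TB == 1_000 * GB          # a terabyte is a thousand gigabytes
assert PB == 1_000 * TB          # a petabyte is 1,000 terabytes
assert ZB == 1_000_000 * PB      # a million petabytes in a zettabyte
print(f"{ZB / PB:,.0f} petabytes per zettabyte")   # 1,000,000
```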
464

An artefact to analyse unstructured document data stores / by André Romeo Botes

Botes, André Romeo January 2014 (has links)
Structured data stores have been the dominant technology for the past few decades. Although dominant, structured data stores lack the functionality to handle the ‘Big Data’ phenomenon. A new class of technology has recently emerged that stores unstructured data and can handle the ‘Big Data’ phenomenon. This study describes the development of an artefact to aid in the analysis of NoSQL document data stores in terms of relational database model constructs. Design science research (DSR) is the methodology implemented in the study; it is used to assist in the understanding, design and development of the problem, artefact and solution. The study explores the existing literature on DSR, in addition to structured and unstructured data stores. The literature review formulates the descriptive and prescriptive knowledge used in the development of the artefact. The artefact is developed using a series of six activities derived from two DSR approaches. The problem domain is derived from the existing literature and a real application environment (RAE). The reviewed literature provides a general problem statement, and a representative from NFM (the RAE) is interviewed for a situation analysis that provides a specific problem statement. An objective is formulated for the development of the artefact, and suggestions are made to address the problem domain in support of that objective. The artefact is designed and developed using the descriptive knowledge of structured and unstructured data stores, combined with prescriptive knowledge of algorithms, pseudo-code, continuous design and object-oriented design. The artefact evolves through multiple design cycles into a final product that analyses document data stores in terms of relational database model constructs. The artefact is evaluated for acceptability and utility, which provides credibility and rigour to the research in the DSR paradigm. Acceptability is demonstrated through simulation, and utility is evaluated in the RAE, with a representative from NFM interviewed for the evaluation of the artefact. Finally, the study is communicated by describing its findings, summarising the artefact and looking into future possibilities for research and application. / MSc (Computer Science), North-West University, Vaal Triangle Campus, 2014
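A minimal sketch of the kind of mapping such an artefact performs, expressing a document collection in relational-style constructs (column names, observed types, nullability); the sample documents and the inference rules are illustrative assumptions, not the study's actual design.

```python
# Illustrative sketch only: derive relational-style column names, observed
# types, and nullability from a small NoSQL-like document collection.
from collections import defaultdict

documents = [
    {"_id": 1, "name": "Ada",  "age": 36},
    {"_id": 2, "name": "Alan", "skills": ["logic", "crypto"]},
]

columns = defaultdict(set)
for doc in documents:
    for field, value in doc.items():
        columns[field].add(type(value).__name__)   # observed type(s) per field

for field, types in columns.items():
    nullable = sum(field in d for d in documents) < len(documents)
    print(f"{field}: {'/'.join(sorted(types))}{' NULL' if nullable else ''}")
```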
465

The fast multipole method at exascale

Chandramowlishwaran, Aparna 13 January 2014 (has links)
This thesis presents a top-to-bottom analysis of designing and implementing fast algorithms for current and future systems. We present new analysis, algorithmic techniques, and implementations of the Fast Multipole Method (FMM) for solving N-body problems. We target the FMM because it is broadly applicable to a variety of scientific particle simulations used to study electromagnetic, fluid, and gravitational phenomena, among others. Importantly, the FMM has asymptotically optimal time complexity with guaranteed approximation accuracy. As such, it is among the most attractive solutions for scalable particle simulation on future extreme-scale systems. We specifically address two key challenges. The first challenge is how to engineer fast code for today's platforms. We present the first in-depth study of multicore optimizations and tuning for the FMM, along with a systematic approach for transforming a conventionally-parallelized FMM into a highly-tuned one. We introduce novel optimizations that significantly improve the within-node scalability of the FMM, thereby enabling high performance on multicore and manycore systems. The second challenge is how to understand scalability on future systems. We present a new algorithmic complexity analysis of the FMM that considers both intra- and inter-node communication costs. Using these models, we present results for choosing the optimal algorithmic tuning parameter. This analysis also yields the surprising prediction that although the FMM is largely compute-bound today, and therefore highly scalable on current systems, the trajectory of processor architecture designs, absent significant changes, could cause it to become communication-bound as early as the year 2015. This prediction suggests the utility of our analysis approach, which directly relates algorithmic and architectural characteristics, for enabling a new kind of high-level algorithm-architecture co-design. To demonstrate the scientific significance of the FMM, we present two applications: direct simulation of blood, a multi-scale, multi-physics problem, and large-scale biomolecular electrostatics. MoBo (Moving Boundaries) is the infrastructure for the direct numerical simulation of blood; it comprises two key algorithmic components, of which the FMM is one. We were able to simulate blood flow using Stokesian dynamics on 200,000 cores of Jaguar, a petaflop system, and achieve a sustained performance of 0.7 Petaflop/s. The second application, which we propose as future work in this thesis, is biomolecular electrostatics, where we solve for the electrical potential using the boundary-integral formulation discretized with boundary element methods (BEM). The computational kernel in solving the large linear system is a dense matrix-vector multiply, which we propose can be calculated using our scalable FMM. We propose to begin with the two-dielectric problem, in which the electrostatic field is calculated using two continuum dielectric media, the solvent and the molecule. This is only a first step toward solving biologically challenging problems that have more than two dielectric media, ion-exclusion layers, and solvent-filled cavities. Finally, given the difficulty of producing high-performance scalable code, productivity is a key concern. Recently, numerical algorithms have been redesigned to take advantage of the architectural features of emerging multicore processors. These new classes of algorithms express fine-grained asynchronous parallelism and hence reduce the cost of synchronization. We performed the first extensive performance study of a recently proposed parallel programming model called Concurrent Collections (CnC). In CnC, the programmer expresses her computation in terms of application-specific operations, partially ordered by semantic scheduling constraints. The CnC model is well-suited to expressing asynchronous-parallel algorithms, so we evaluate CnC using two dense linear algebra algorithms in this style for execution on state-of-the-art multicore systems. Our implementations in CnC were able to match, and in some cases even exceed, competing vendor-tuned and domain-specific library codes. We combine these two distinct research efforts by expressing the FMM in CnC; our approach tries to marry performance with productivity, which will be critical on future systems. Looking forward, we would like to extend this work to distributed-memory machines, specifically by implementing the FMM in the new distributed CnC (distCnC) to express fine-grained parallelism, which would require significant effort in alternative models.
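For context, the computation the FMM accelerates is the all-pairs potential evaluation sketched below, which costs O(N²) when done directly; the FMM reduces this to roughly O(N) with controllable approximation error. This is a generic textbook baseline, not code from the thesis.

```python
# Direct O(N^2) pairwise summation of 1/r potentials (gravitational or
# electrostatic kernel, physical constants omitted) as the FMM baseline.
import math
import random

def direct_potentials(points, charges):
    n = len(points)
    phi = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            r = math.dist(points[i], points[j])   # Euclidean distance in 3D
            phi[i] += charges[j] / r
    return phi

pts = [(random.random(), random.random(), random.random()) for _ in range(200)]
q = [1.0] * len(pts)
print(f"phi[0] = {direct_potentials(pts, q)[0]:.3f}")
```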
466

An Ensemble Method for Large Scale Machine Learning with Hadoop MapReduce

Liu, Xuan 25 March 2014 (has links)
We propose a new ensemble algorithm, the meta-boosting algorithm, which enables the original AdaBoost algorithm to improve the decisions made by different weak learners by utilizing the meta-learning approach. Better accuracy is achieved because this algorithm reduces both bias and variance. However, higher accuracy also brings higher computational complexity, especially on big data. We therefore propose a parallelized meta-boosting algorithm, Parallelized-Meta-Learning (PML), using the MapReduce programming paradigm on Hadoop. Experimental results on the Amazon EC2 cloud computing infrastructure show that PML reduces the computational complexity enormously while retaining lower error rates than the results obtained on a single computer. Because MapReduce has the inherent weakness that it cannot directly support iterations in an algorithm, our approach is a win-win method: it not only overcomes this weakness, but also secures good accuracy performance. We also compare this approach with a contemporary algorithm, AdaBoost.PL.
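A hedged sketch of the textbook AdaBoost reweighting step that meta-boosting builds on; the toy labels, stump predictions, and function signature are placeholders, not the PML implementation.

```python
# Textbook AdaBoost round: compute the weak learner's weighted error, its
# vote (alpha), and the reweighted, renormalised example distribution.
import math

def adaboost_round(weights, y_true, y_pred):
    """weights, y_true, y_pred: equal-length lists; labels in {-1, +1}."""
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / sum(weights)
    alpha = 0.5 * math.log((1 - err) / max(err, 1e-12))   # weak learner's vote
    new_w = [w * math.exp(-alpha * t * p)                  # up-weight mistakes
             for w, t, p in zip(weights, y_true, y_pred)]
    z = sum(new_w)
    return [w / z for w in new_w], alpha

w = [0.25, 0.25, 0.25, 0.25]
w, alpha = adaboost_round(w, [+1, -1, +1, -1], [+1, -1, -1, -1])
print(alpha, w)   # the misclassified third example gains weight
```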
467

Nonparametric Inference for High Dimensional Data

Mukhopadhyay, Subhadeep 03 October 2013 (has links)
Learning from data, especially ‘Big Data’, is becoming increasingly popular under names such as Data Mining, Data Science, Machine Learning, Statistical Learning and High Dimensional Data Analysis. In this dissertation we propose a new related field, which we call ‘United Nonparametric Data Science’: applied statistics with “just in time” theory. It integrates the practice of traditional and novel statistical methods for nonparametric exploratory data modeling, and it is applicable to teaching introductory statistics courses that are closer to the modern frontiers of scientific research. Our framework covers both small data analysis (combining traditional and modern nonparametric statistical inference) and big, high dimensional data analysis (via statistical modeling methods that extend our unified framework for small data analysis). The first part of the dissertation (Chapters 2 and 3) is oriented by the goal of developing a new theoretical foundation to unify many cultures of statistical science and statistical learning methods using the mid-distribution function, custom-made orthonormal score functions, comparison density, copula density, and LP moments and comoments. We also examine how this theory yields solutions to many important applied problems. In the second part (Chapter 4) we extend the traditional empirical likelihood (EL), a versatile tool for nonparametric inference, to the high dimensional context. We introduce a modified version of the EL method that is computationally simpler and applicable to a large class of “large p, small n” problems, allowing p to grow faster than n. This is an important step in generalizing the EL in high dimensions beyond the p ≤ n threshold where the standard EL and its existing variants fail. We also present a detailed theoretical study of the proposed method.
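For reference, a sketch of the classical (fixed-dimension) empirical likelihood for a mean vector, which is the standard formulation that the high-dimensional modification builds on; the modified version itself is not reproduced here.

```latex
% Classical empirical likelihood ratio for a mean vector \mu, with the
% standard fixed-p Wilks-type limit; assumptions as in the textbook setting.
R(\mu) \;=\; \max\Big\{ \prod_{i=1}^{n} n p_i \;:\; p_i \ge 0,\;
\sum_{i=1}^{n} p_i = 1,\; \sum_{i=1}^{n} p_i X_i = \mu \Big\},
\qquad
-2\log R(\mu_0) \;\xrightarrow{\ d\ }\; \chi^2_p \quad (p \text{ fixed},\ n \to \infty).
```

This fixed-p asymptotic is exactly what breaks down when p grows comparably to, or faster than, n, which is the regime the modified method targets.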
468

Utilizing Crowd Sourced Analytics for Building Smarter Mobile Infrastructure and Achieving Better Quality of Experience

Yarish, David 04 January 2016 (has links)
There is great power in knowledge. Having insight into and being able to predict network events can be both informative and profitable. This thesis aims to assess how crowd-sourced network data collected on smartphones can be used to improve the quality of experience for users of the network and give network operators insight into how the network's infrastructure can be improved. Over the course of a year, data was collected and processed to show where networks have been performing well and where they are under-performing. The results of this collection aim to show that there is value in collecting this data, and that this data cannot be adequately obtained without a device-side presence. The various graphs and histograms demonstrate that the quantities of measurements and the speeds recorded vary by both location and time of day. It is these variations that cannot be determined via traditional network-side measurements. During the course of this experiment, it was observed that certain times of day have much greater numbers of people using the network, and it is likely that the number of users on the network is correlated with the speeds observed at those times. Places of gathering such as malls and public areas had a higher user density, especially around noon, a normal time when people take a break from the work day. Knowing exactly where and when an Access Point (AP) is utilized is important information when trying to identify how users are utilizing the network. / Graduate / davidyarish@gmail.com
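A minimal sketch of the kind of aggregation that exposes the location and time-of-day effects described above; the DataFrame columns and values are invented for illustration and are not the thesis dataset.

```python
# Summarise crowd-sourced speed tests by location and hour of day to surface
# measurement density and typical throughput per place and time.
import pandas as pd

df = pd.DataFrame({
    "location": ["mall", "mall", "park", "mall"],
    "timestamp": pd.to_datetime(
        ["2015-06-01 12:05", "2015-06-01 12:40", "2015-06-01 09:10", "2015-06-01 18:30"]),
    "down_mbps": [4.2, 3.1, 11.8, 7.5],
})

summary = (df.assign(hour=df["timestamp"].dt.hour)
             .groupby(["location", "hour"])["down_mbps"]
             .agg(["count", "median"]))
print(summary)   # number of measurements and median speed per location and hour
```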
469

Inteligência competitiva e modelos de séries temporais para previsão de consumo: o estudo de uma empresa do setor metalúrgico (Competitive intelligence and time series models for consumption forecasting: a study of a company in the metallurgical sector)

Espíndola, André Mauro Santos de 30 August 2013 (has links)
The world is undergoing a continuous and accelerating process of transformation that involves every area of knowledge. It is fair to say that the speed of this process is directly related to the pace of change in technology. These changes have made relationships increasingly globalized, modified commercial transactions, and forced companies to rethink how they compete. In this context, knowledge, built from the volume of data and information, takes on the role of a new input, often more important than labor, capital, and land. These changes and the importance of information lead companies to seek a new position, trying to identify in the external environment signals that may indicate future events. The great challenge for companies lies in obtaining data, extracting information, and transforming it into knowledge that is useful for decision making. Against this background, the objective of this study was to identify a consumption forecasting model for analyzing information in the Competitive Intelligence process of a company in the metallurgical sector located in the state of Rio Grande do Sul. The study draws on the themes of Big Data, Data Mining, Demand Forecasting, and Competitive Intelligence in order to answer the following question: which steel consumption forecasting model can be used to analyze information in the Competitive Intelligence process? Internal and external data were analyzed in search of correlations between the company's steel consumption and economic variables, which were subsequently used to identify the forecasting model. Two models were identified: a univariate model without intervention, built with the Box-Jenkins methodology, and a forecasting model with a transfer function. Both models described the historical series of steel consumption well, but the univariate model showed better forecasting performance.
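An illustrative univariate Box-Jenkins fit on synthetic data, assuming the statsmodels ARIMA interface; the order (1, 1, 1), the synthetic series, and the forecast horizon are arbitrary choices for the sketch, not the dissertation's identified model.

```python
# Fit a univariate ARIMA model to a fake monthly consumption series and
# produce a six-step-ahead forecast.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
consumption = np.cumsum(rng.normal(loc=5.0, scale=2.0, size=60))  # synthetic series

results = ARIMA(consumption, order=(1, 1, 1)).fit()
print(results.summary())
print(results.forecast(steps=6))   # six-period-ahead consumption forecast
```

A transfer-function model, as in the second approach identified by the study, would additionally include exogenous economic variables as inputs.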
470

Genomic data analyses for population history and population health

Bycroft, Clare January 2017 (has links)
Many of the patterns of genetic variation we observe today have arisen via the complex dynamics of interactions and isolation of historic human populations. In this thesis, we focus on two important features of the genetics of populations that can be used to learn about human history: population structure and admixture. The Iberian peninsula has a complex demographic history, as well as rich linguistic and cultural diversity. However, previous studies using small genomic regions (such as the Y-chromosome and mtDNA) as well as genome-wide data have so far detected limited genetic structure in Iberia. Larger datasets and powerful new statistical methods that exploit information in the correlation structure of nearby genetic markers have made it possible to detect and characterise genetic differentiation at fine geographic scales. We performed the largest and most comprehensive study of Spanish population structure to date by analysing genotyping array data for ~1,400 Spanish individuals genotyped at ~700,000 polymorphic loci. We show that at broad scales, the major axis of genetic differentiation in Spain runs from west to east, while there is remarkable genetic similarity in the north-south direction. Our analysis also reveals striking patterns of geographically localised and subtle population structure within Spain at scales down to tens of kilometres. We developed and applied new approaches to show how this structure has arisen from a complex and regionally varying mix of genetic isolation and recent gene flow within and from outside of Iberia. To further explore the genetic impact of historical migrations and invasions of Iberia, we assembled a data set of 2,920 individuals (~300,000 markers) from Iberia and the surrounding regions of north Africa, Europe, and sub-Saharan Africa. Our admixture analysis implies that north African-like DNA in Iberia was mainly introduced in the earlier half (860-1120 CE) of the period of Muslim rule in Iberia, and we estimate that the closest modern-day equivalents to the initial migrants are located in Western Sahara. We also find that north African-like DNA in Iberia shows striking regional variation, with near-zero contributions in the Basque regions, low amounts (~3%) in the north east of Iberia, and contributions as high as ~11% in Galicia and Portugal. The UK Biobank project is a large prospective cohort study of ~500,000 individuals from across the United Kingdom, aged between 40 and 69 at recruitment. A rich variety of phenotypic and health-related information is available on each participant, making the resource unprecedented in its size and scope. Understanding the role that genetics plays in phenotypic variation, and its potential interactions with other factors, provides a critical route to a better understanding of human biology and population health. As such, a key component of the UK Biobank resource has been the collection of genome-wide genetic data (~805,000 markers) on every participant using purpose-designed genotyping arrays. These data are the focus of the second part of this thesis. In particular, we designed and implemented a quality control (QC) pipeline in support of the current and future use of this multi-purpose resource. Genotype data on this scale offers novel opportunities for assessing quality issues, although the wide range of ancestral backgrounds in the cohort also creates particular challenges.
We also conducted a set of analyses that reveal properties of the genetic data, including population structure and familial relatedness, that can be important for downstream analyses. We find that cryptic relatedness is common among UK Biobank participants (~30% have at least one relative in the cohort who is a first cousin or closer), and that a full range of human population structure is present in this cohort: from world-wide ancestral diversity to subtle population structure at sub-national geographic scales. Finally, we performed a genome-wide association scan on a well-studied and highly polygenic phenotype: standing height. This provided a further test of the effectiveness of our QC, as well as highlighting the potential of the resource to uncover novel regions of association.
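A toy version of the single-variant association test run genome-wide for a quantitative trait like height; the genotype coding, effect size, and sample size are invented for illustration, and a real analysis would additionally adjust for covariates such as genetic principal components to account for the population structure described above.

```python
# Regress a simulated height phenotype on genotype dosage (0/1/2 copies of
# the effect allele) at a single SNP and report the effect size and p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
genotypes = rng.integers(0, 3, size=5000)                  # 0, 1 or 2 alleles
height = 170 + 0.4 * genotypes + rng.normal(0, 6, 5000)    # small true effect

result = stats.linregress(genotypes, height)
print(f"beta = {result.slope:.3f}, p = {result.pvalue:.2e}")
```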
