1. Optimisation of the enactment of fine-grained distributed data-intensive workflows. Liew, Chee Sun, January 2012.
The emergence of data-intensive science as the fourth science paradigm has posed a data deluge challenge for enacting scientific workflows. The scientific community is facing an imminent flood of data from the next generation of experiments and simulations, besides dealing with the heterogeneity and complexity of data, applications and execution environments. New scientific workflows involve execution on distributed and heterogeneous computing resources across organisational and geographical boundaries, processing gigabytes of live data streams and petabytes of archived and simulation data, in various formats and from multiple sources. Managing the enactment of such workflows requires not only larger storage space and faster machines, but also the capability to support the scalability and diversity of the users, applications, data, computing resources and enactment technologies. We argue that the enactment process can be made efficient using optimisation techniques in an appropriate architecture. This architecture should support the creation of diversified applications and their enactment on diversified execution environments, with a standard interface, i.e. a workflow language. The workflow language should be both human readable and suitable for communication between the enactment environments. The data-streaming model central to this architecture provides a scalable approach to large-scale data exploitation: data flow between computational elements in the scientific workflow is implemented as streams. To cope with the exploratory nature of scientific workflows, the architecture should support fast workflow prototyping and the re-use of workflows and workflow components. Above all, the enactment process should be easily repeated and automated. In this thesis, we present a candidate data-intensive architecture that includes an intermediate workflow language, named DISPEL. We create a new fine-grained measurement framework to capture performance-related data during enactments, and design a performance database to organise these data systematically. We propose a new enactment strategy to demonstrate that optimisation of data-streaming workflows can be automated by exploiting performance data gathered during previous enactments.
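The data-streaming model described in this abstract can be made concrete with a minimal sketch: processing elements connected so that each consumes records as its upstream neighbour produces them, rather than waiting for complete files. This is not DISPEL code; the element names and the generator-based composition below are illustrative assumptions.

```python
# Illustrative sketch of a streaming workflow: processing elements are
# generators connected by iterators, so data flows record by record.
# The element names (read_source, filter_valid, aggregate) are hypothetical.

def read_source(records):
    """Source element: emits one record at a time."""
    for record in records:
        yield record

def filter_valid(stream):
    """Filter element: drops records that fail a simple validity check."""
    for record in stream:
        if record.get("value") is not None:
            yield record

def aggregate(stream):
    """Sink element: consumes the stream and reduces it to a summary."""
    total, count = 0.0, 0
    for record in stream:
        total += record["value"]
        count += 1
    return {"mean": total / count if count else None, "count": count}

if __name__ == "__main__":
    data = [{"value": 1.5}, {"value": None}, {"value": 2.5}]
    # Compose the workflow: data streams through each element in turn.
    print(aggregate(filter_valid(read_source(data))))
```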
2. Enabling Approximate Storage through Lossy Media Data Compression. Worek, Brian David, 8 February 2019.
Memory capacity, bandwidth, and energy all continue to present hurdles in the quest for efficient, high-speed computing. Recognition, mining, and synthesis (RMS) applications in particular are limited by the efficiency of the memory subsystem due to their large datasets and need to frequently access memory. RMS applications, such as those in machine learning, deliver intelligent analysis and decision making through their ability to learn, identify, and create complex data models. To meet growing demand for RMS application deployment in battery-constrained devices, such as mobile and Internet-of-Things hardware, designers will need novel techniques to improve system energy consumption and performance. Fortunately, many RMS applications demonstrate inherent error resilience, a property that allows them to produce acceptable outputs even when the data used in computation contain errors. Approximate storage techniques across circuits, architectures, and algorithms exploit this property to improve the energy consumption and performance of the memory subsystem through quality-energy scaling. This thesis reviews state-of-the-art techniques in approximate storage and presents our own contribution, which uses lossy compression to reduce the storage cost of media data.

General audience abstract (MS): Computer memory systems present challenges in the quest for more powerful overall computing systems. Computer applications with the ability to learn from large sets of data are particularly limited because they need to frequently access the memory system. These applications are capable of intelligent analysis and decision making due to their ability to learn, identify, and create complex data models. To meet growing demand for intelligent applications in smartphones and other Internet-connected devices, designers will need novel techniques to improve energy consumption and performance. Fortunately, many intelligent applications are naturally resistant to errors, which means they can produce acceptable outputs even when there are errors in inputs or computation. Approximate storage techniques across computer hardware and software exploit this error resistance to improve the energy consumption and performance of computer memory by purposefully reducing data precision. This thesis reviews state-of-the-art techniques in approximate storage and presents our own contribution, which uses lossy compression to reduce the storage cost of media data.
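The quality-for-storage trade-off that approximate storage exploits can be sketched quickly: quantize 8-bit media samples to fewer bits before storing them, with the bit width acting as the knob an error-resilient application could turn. The uniform quantizer below is a generic illustration, not the compression scheme developed in this thesis.

```python
# Sketch of trading output quality for storage: quantize 8-bit media samples
# to a smaller bit width before storing them. Uniform quantization is an
# illustrative assumption, not the thesis's actual compression method.

def quantize(samples, bits):
    """Map 8-bit samples (0-255) onto 2**bits levels."""
    step = 256 / (2 ** bits)
    return [int(s // step) for s in samples]

def dequantize(codes, bits):
    """Reconstruct approximate 8-bit samples from quantized codes."""
    step = 256 / (2 ** bits)
    return [int(c * step + step / 2) for c in codes]

if __name__ == "__main__":
    samples = [0, 37, 128, 200, 255]
    for bits in (8, 4, 2):
        approx = dequantize(quantize(samples, bits), bits)
        max_err = max(abs(a - b) for a, b in zip(samples, approx))
        print(f"{bits}-bit storage: {approx}, max error {max_err}")
```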
3. An evaluation of Galaxy and Ruffus-scripting workflows system for DNA-seq analysis. Oluwaseun, Ajayi Olabode, January 2018.
Magister Scientiae - MSc

Functional genomics determines the biological functions of genes on a global scale by using large volumes of data obtained through techniques including next-generation sequencing (NGS). The application of NGS in biomedical research is gaining momentum, and with its adoption becoming more widespread, there is an increasing need for access to customizable computational workflows that can simplify, and offer access to, compute-intensive analyses of genomic data. In this study, analysis workflows were designed and implemented in the Galaxy and Ruffus frameworks with a view to addressing the challenges faced in biomedical research. Galaxy, a graphical web-based framework, allows researchers to build a graphical NGS data analysis pipeline for accessible, reproducible, and collaborative data sharing. Ruffus, a UNIX command-line framework used by bioinformaticians as a Python library for writing scripts in an object-oriented style, allows a workflow to be built in terms of task dependencies and execution logic. A dual data analysis technique was explored, focusing on a comparative evaluation of the Galaxy and Ruffus frameworks as used for composing analysis pipelines. To this end, we developed an analysis pipeline in both Galaxy and Ruffus for the analysis of Mycobacterium tuberculosis sequence data. Preliminary analysis revealed that the pipeline in Galaxy exhibited a higher percentage of load and store instructions, whereas the pipelines in Ruffus tended to be CPU bound and memory intensive. CPU usage, memory utilization, and execution runtime are presented graphically in this study. Our evaluation suggests that the workflow frameworks differ distinctly in features, from ease of use, flexibility, and portability to architectural design.
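For readers unfamiliar with Ruffus, a minimal pipeline sketch follows (assuming the ruffus package is installed): tasks are ordinary Python functions, decorators declare how each task's output filenames derive from its inputs, and pipeline_run resolves the dependency graph and runs whatever is out of date. The filenames and trivial task bodies are placeholders, not the M. tuberculosis pipeline built in this study.

```python
# Minimal Ruffus-style pipeline sketch: two chained tasks whose dependencies
# are declared through decorators. Filenames and task bodies are placeholders.
from ruffus import originate, transform, suffix, pipeline_run

@originate(["sample1.fastq", "sample2.fastq"])
def make_reads(output_file):
    # Stand-in for raw sequencing data arriving from the sequencer.
    with open(output_file, "w") as out:
        out.write("@read1\nACGT\n+\nIIII\n")

@transform(make_reads, suffix(".fastq"), ".bam")
def align_reads(input_file, output_file):
    # Stand-in for an alignment step (normally a call to an external aligner).
    with open(input_file) as src, open(output_file, "w") as out:
        out.write("aligned:" + src.readline())

if __name__ == "__main__":
    # Ruffus infers that align_reads depends on make_reads and runs only
    # the tasks whose outputs are missing or out of date.
    pipeline_run([align_reads])
```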
4. Microservices in data intensive applications. Remeika, Mantas; Urbanavicius, Jovydas, January 2018.
The volumes of data that Big Data applications have to process are constantly increasing, which requires the development of highly scalable systems. The microservices architecture is considered one solution to this scalability problem; however, the literature on practices for building scalable data-intensive systems is still lacking. This thesis aims to investigate and present the benefits and drawbacks of using a microservices architecture in big data systems. Moreover, it presents other practices used to increase scalability, including containerization, shared-nothing architecture, data sharding, load balancing, clustering, and stateless design. Finally, an experiment comparing the performance of a monolithic application and a microservices-based application was performed. The results show that as load increases, the microservices-based application performs better than the monolith. However, to cope with the constantly increasing amount of data, additional techniques should be used together with microservices.
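Of the scalability practices mentioned, data sharding is the easiest to illustrate in a few lines: hash a record's key to decide which service instance owns it, so data and load spread across instances. The instance URLs and the simple modulo scheme below are illustrative assumptions, not the experimental setup used in this thesis; a production system would more likely use consistent hashing so that adding a shard does not remap most keys.

```python
# Sketch of hash-based data sharding across service instances: each record
# key is hashed to pick the instance that owns it. Instance URLs and the
# modulo scheme are illustrative assumptions.
import hashlib

SHARDS = [
    "http://orders-service-0:8080",
    "http://orders-service-1:8080",
    "http://orders-service-2:8080",
]

def shard_for(key: str) -> str:
    """Deterministically map a record key to one service instance."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

if __name__ == "__main__":
    for key in ("customer-17", "customer-42", "customer-99"):
        print(key, "->", shard_for(key))
```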
5. Seafarers, silk, and science: oceanographic data in the making. Halfmann, Gregor, January 2018.
This thesis comprises an empirical case study of scientific data production in oceanography and a philosophical analysis of the relations between newly created scientific data and the natural world. Based on qualitative interviews with researchers, I reconstruct research practices that lead to the ongoing production of digital data related to long-term developments of plankton biodiversity in the oceans. My analysis is centred on four themes: materiality, scientific representing with data, methodological continuity, and the contribution of non-scientists to epistemic processes. These are critically assessed against the background of today’s data-intensive sciences and increased automation and remoteness in oceanographic practices. Sciences of the world’s oceans have by and large been disregarded in philosophical scholarship thus far. My thesis opens this field for philosophical analysis and reveals various conditions and constraints of data practices that are largely uncontrollable by ocean scientists. I argue that the creation of useful scientific data depends on the implementation and preservation of material, methodological, and social continuities. These allow scientists to repeatedly transform visually perceived characteristics of research samples into meaningful scientific data stored in a digital database. In my case study, data are not collected but result from active intervention and subsequent manipulation and processing of newly created material objects. My discussion of scientific representing with data suggests that scientists do not extract or read any intrinsic representational relation between data and a target, but make data gradually more computable and compatible with already existing representations of natural systems. My arguments shed light on the epistemological significance of materiality, on limiting factors of scientific agency, and on an inevitable balance between changing conditions of concrete research settings and long-term consistency of data practices.
6. Workload Management for Data-Intensive Services. Lim, Harold Vinson Chao, January 2013.
Data-intensive web services are typically composed of three tiers: i) a display tier that interacts with users and serves rich content to them, ii) a storage tier that stores the user-generated or machine-generated data used to create this content, and iii) an analytics tier that runs data analysis tasks in order to create and optimize new content. Each tier has different workloads and requirements that result in a diverse set of systems being used in modern data-intensive web services.

Servers are provisioned dynamically in the display tier to ensure that interactive client requests are served as per the latency and throughput requirements. The challenge is not only deciding automatically how many servers to provision but also when to provision them, while ensuring stable system performance and high resource utilization. To address these challenges, we have developed a new control policy for provisioning resources dynamically in coarse-grained units (e.g., adding or removing servers or virtual machines in cloud platforms). Our new policy, called proportional thresholding, converts a user-specified performance target value into a target range in order to account for the relative effect of provisioning a server on the overall workload performance.

The storage tier is similar to the display tier in some respects, but poses the additional challenge of needing redistribution of stored data when storage nodes are added or removed. Thus, there will be some delay before the effects of changing a resource allocation appear. Moreover, redistributing data can interfere with the current workload because it uses resources that could otherwise be used for processing requests. We have developed a system, called Elastore, that addresses the new challenges found in the storage tier. Elastore not only coordinates resource allocation and data redistribution to preserve stability during dynamic resource provisioning, but also finds the best tradeoff between workload interference and data redistribution time.

The workload in the analytics tier consists of data-parallel workflows that can either be run in a batch fashion or continuously as new data becomes available. Each workflow is composed of smaller units that have producer-consumer relationships based on data. These workflows are often generated from declarative specifications in languages like SQL, so there is a need for a cost-based optimizer that can generate an efficient execution plan for a given workflow. Building such an optimizer for data-parallel workflows raises a number of challenges, including characterizing the large execution plan space, developing cost models to estimate execution costs, and efficiently searching for the best execution plan. We have built two cost-based optimizers: Stubby for batch data-parallel workflows running on MapReduce systems, and Cyclops for continuous data-parallel workflows, where the choice of execution system is made part of the execution plan space.

We have conducted a comprehensive evaluation that shows the effectiveness of each tier's automated workload management solution.

Dissertation
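To give a feel for proportional thresholding, here is a minimal sketch: the user's single target value is widened into a range whose width depends on the current cluster size, because adding one server to a large cluster shifts aggregate performance less than adding one to a small cluster, and the controller acts only when measurements leave that range. The specific range formula and step size are assumptions for illustration, not the policy's published tuning.

```python
# Sketch of a proportional-thresholding style controller: the user-specified
# target is widened into a range that depends on cluster size, and servers
# are added or removed only when measured performance leaves that range.
# The range formula and single-server step are illustrative assumptions.

def target_range(target_latency_ms, num_servers):
    """Widen the target into [low, high]; the band narrows as the cluster
    grows, mirroring the shrinking effect of provisioning one more server."""
    slack = target_latency_ms / (num_servers + 1)
    return target_latency_ms - slack, target_latency_ms

def control_step(measured_ms, target_ms, num_servers):
    low, high = target_range(target_ms, num_servers)
    if measured_ms > high:
        return num_servers + 1      # too slow: provision one more server
    if measured_ms < low and num_servers > 1:
        return num_servers - 1      # comfortably fast: release a server
    return num_servers              # within range: hold steady

if __name__ == "__main__":
    n = 4
    for measured in (120, 95, 60, 82):
        n = control_step(measured, target_ms=100, num_servers=n)
        print(f"measured {measured} ms -> {n} servers")
```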
7. Scheduling distributed data-intensive applications on global grids. Venugopal, Srikumar, date unknown.
The next generation of scientific experiments and studies is being carried out by large collaborations of researchers, distributed around the world, engaged in the analysis of huge collections of data generated by scientific instruments. Grid computing has emerged as an enabler for such collaborations, as it aids communities in sharing resources to achieve common objectives. Data Grids provide services for accessing, replicating and managing data collections in these collaborations. Applications used in such Grids are distributed and data-intensive; that is, they access and process distributed datasets to generate results. These applications need to access distributed data and computational resources transparently and efficiently. This thesis investigates the properties of data-intensive computing environments and presents a software framework and algorithms for mapping distributed data-oriented applications onto Grid resources. (For the complete abstract, open the document.)
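A heavily simplified version of the mapping problem studied here can be sketched as follows: each job names the dataset it needs, and a greedy scheduler prefers compute sites that already hold that dataset, otherwise charging an estimated transfer penalty on top of a site's queue length. The site names, costs, and greedy rule are illustrative assumptions, not the algorithms developed in the thesis.

```python
# Greedy sketch of data-aware scheduling: prefer a site that already hosts
# the job's dataset; otherwise pick the least-loaded site and pay an
# estimated transfer cost. Site names and costs are illustrative assumptions.

SITES = {
    "melbourne": {"datasets": {"cms_run7"}, "queued_jobs": 3},
    "cern":      {"datasets": {"atlas_calib"}, "queued_jobs": 5},
    "chicago":   {"datasets": set(), "queued_jobs": 1},
}
TRANSFER_PENALTY = 4  # rough cost, in queued-job equivalents, of staging data

def schedule(job_dataset):
    def cost(site):
        info = SITES[site]
        penalty = 0 if job_dataset in info["datasets"] else TRANSFER_PENALTY
        return info["queued_jobs"] + penalty
    best = min(SITES, key=cost)
    SITES[best]["queued_jobs"] += 1   # account for the newly placed job
    return best

if __name__ == "__main__":
    for dataset in ("cms_run7", "atlas_calib", "cms_run7", "lhcb_stream"):
        print(dataset, "->", schedule(dataset))
```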
8. Zoolander: Modeling and managing replication for predictability. Yang, Daiyi, 19 December 2011.
No description available.
9. A Map-Reduce-Like System for Programming and Optimizing Data-Intensive Computations on Emerging Parallel Architectures. Jiang, Wei, 27 August 2012.
No description available.
10. Cost-Effective Resource Configurations for Executing Data-Intensive Workloads in Public Clouds. Mian, Rizwan, 4 December 2013.
The rate of data growth in many domains is straining our ability to manage and analyze it. Consequently, we see the emergence of computing systems that attempt to efficiently process data-intensive applications, i.e., I/O-bound applications with large data. Cloud computing offers “infinite” resources on demand, on a pay-as-you-go basis, and has therefore gained interest for large-scale data processing. Given this supposedly infinite resource set, we need a provisioning process to determine appropriate resources for data processing or workload execution. We observe that prevalent data processing architectures do not usually employ the provisioning techniques available in a public cloud, and that existing provisioning techniques have largely ignored data-intensive applications in public clouds.
In this thesis, we take a step towards bridging the gap between existing data processing approaches and the provisioning techniques available in a public cloud, such that the monetary cost of executing data-intensive workloads is minimized. We formulate the provisioning problem with constructs that exploit a cloud's elasticity to enlist any number of resources to host a multi-tenant database system prior to execution. Provisioning is modeled as a search problem, and we use standard search heuristics to solve it.

We propose a novel framework for resource provisioning in a cloud environment. The framework allows pluggable cost and performance models, and we instantiate it by developing various search algorithms together with cost and performance models to support the search for an effective resource configuration.
For evaluation, we consider data-intensive workloads that are transactional, analytical, or mixed, access multiple database tenants, and are based on standard TPC benchmarks. In addition, user preferences on response time or throughput are expressed as constraints. Our propositions and their results are validated in a real public cloud, namely the Amazon cloud. The evaluation supports our claim that the framework is an effective tool for provisioning database workloads in a public cloud with minimal dollar cost. Thesis (Ph.D., Computing), Queen's University, 2013.
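The search formulation admits a very small sketch: enumerate candidate configurations, score each with pluggable cost and performance models, discard those that violate the response-time constraint, and keep the cheapest survivor. The instance types, prices, and linear performance model below are illustrative placeholders, not the thesis's models or Amazon's actual pricing.

```python
# Sketch of provisioning as search: enumerate candidate configurations,
# reject those that violate the response-time constraint, and keep the
# cheapest survivor. Instance types, prices, and the performance model
# are illustrative placeholders.
from itertools import product

INSTANCE_TYPES = {          # hypothetical hourly prices and relative speeds
    "small": {"price": 0.10, "speed": 1.0},
    "large": {"price": 0.32, "speed": 4.0},
}

def predicted_response_ms(instance, count, workload_ops):
    """Toy performance model: response time falls with aggregate speed."""
    aggregate_speed = INSTANCE_TYPES[instance]["speed"] * count
    return workload_ops / aggregate_speed

def hourly_cost(instance, count):
    return INSTANCE_TYPES[instance]["price"] * count

def best_configuration(workload_ops, max_response_ms):
    best, best_cost = None, float("inf")
    for instance, count in product(INSTANCE_TYPES, range(1, 21)):
        if predicted_response_ms(instance, count, workload_ops) > max_response_ms:
            continue
        cost = hourly_cost(instance, count)
        if cost < best_cost:
            best, best_cost = (instance, count), cost
    return best, best_cost

if __name__ == "__main__":
    config, cost = best_configuration(workload_ops=2000, max_response_ms=150)
    print("cheapest feasible configuration:", config, "at $%.2f/hour" % cost)
```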