1

Využití metadat při řešení business intelligence aplikací / Application of metadata in business intelligence systems

Andrle, Ondřej January 2010 (has links)
The diploma thesis focuses on metadata as an important means of stopping the trend of rising costs of developing, operating, and maintaining decision support systems -- Business Intelligence. The theoretical part of the thesis builds on this assumption. Its goal is an extensive analysis of the term metadata -- starting with a general definition, then covering its categorization, and analyzing the related issues and benefits. Metadata management is then discussed, as well as the metadata repository, which is the key element of metadata solutions. The aim of the practical part of the thesis is to analyze selected commercial metadata management solutions and answer the question of whether a suitable comprehensive solution currently exists that would meet the needs of a chosen financial institution, Komerční banka. The thesis also addresses whether it is preferable to purchase a commercial solution or to develop one in-house. The analysis in both the theoretical and practical parts is based mainly on foreign sources, above all articles by specialists in data warehousing, and on numerous consultations with a metadata expert from Komerční banka, Mr. Jiří Omacht.
2

Distributed Metadata Management for Parallel File Systems

Meshram, Vilobh Mahadeo 19 October 2011 (has links)
No description available.
3

Model Management

Rahm, Erhard 23 October 2018 (has links)
No description available.
4

Supporting Metadata Management for Data Curation: Problem and Promise

Westbrooks, Elaine L. 02 May 2008 (has links)
Breakout session from the Living the Future 7 Conference, April 30-May 3, 2008, University of Arizona Libraries, Tucson, AZ. / Research communities and libraries are on the verge of reaching a saturation point with regard to the number of published reports documenting, planning, and defining e-science, e-research, cyberscholarship, and data curation. Despite the volumes of literature, little research is devoted to metadata maintenance and infrastructure. Libraries are poised to contribute metadata expertise to campus-wide data curation efforts; however, traditional and costly library methods of metadata creation and management must be replaced with cost-effective models that focus on the researcher’s data collection/analysis process. In such a model, library experts collaborate with researchers in building tools for metadata creation and maintenance, which in turn contribute to the long-term sustainability, organization, and preservation of data. This presentation will introduce one of Cornell University Library’s collaborative efforts curating the 2003 Northeast Blackout data. The goal of the project is to make Blackout data accessible so that it can serve as a catalyst for innovative cross-disciplinary research that will produce better scientific understanding of the technology and communications that failed during the Blackout. Library staff collaborated with three groups: engineering faculty at Cornell, government power experts, and power experts in the private sector. Finally, the core components of the metadata management methodology will be outlined and defined. Rights management emerged as the biggest challenge for the Blackout project.
5

Enhancing clinical data management and streamlining organic phase DNA isolation protocol in the Pre-Cancer Genomic Atlas cohort

Potter, Austin 23 November 2020 (has links)
In the age of big data, thoughtful management and harmonization of clinical metadata and sample processing in translational research are critical for effective data generation, integration, and analysis. These steps enable cutting-edge discoveries and strengthen the overall conclusions that can be drawn from complex multi-omic translational research studies. The focus of my thesis has been on harmonizing the clinical metadata collected as part of the lung Pre-Cancer Genome Atlas (PCGA) and on expanding the use of banked samples. The lung PCGA study included longitudinally collected samples and data from participants in a high-risk lung cancer screening program at Roswell Park Comprehensive Cancer Center (Roswell) in Buffalo, NY. Clinical metadata for this study was collected over many years at Roswell, and subsets of this data were shared with Boston University Medical Campus (BUMC) for the lung PCGA study. During the study, additional clinical metadata was acquired and shared with BUMC to complement the genomic profiling of DNA and RNA, as well as protein staining of tissue. With regard to the PCGA study, my thesis has two aims: 1) curate the clinical metadata received from Roswell during the PCGA study to enhance both its accessibility to current investigators and collaborators and the reproducibility of results, and 2) test methods to isolate DNA from remnant samples to expand the use of banked samples for genomic profiling. We hypothesized that accomplishing these goals would allow for increased use of the clinical metadata, enhanced reproducibility of the results, and an expanded set of samples available for DNA sequencing. The clinical metadata received from Roswell was consolidated into a single source that is continually updated and available for export for future research use. These metadata management efforts led to increased use among the members of our laboratory and collaborators working with the lung PCGA cohort. Additionally, the curation of metadata has allowed for improved analysis, reproducibility, and increased awareness of the current inventory of remaining samples. During the process of lung PCGA clinical metadata curation, a physical inventory of the remaining samples revealed remnant organic-phase samples. Therefore, in addition to my work on clinical metadata, the second goal of my thesis focuses on DNA isolation from remnant banked biological samples from the lung PCGA cohort. In the first phase of the lung PCGA, RNA was to be isolated exclusively from fresh frozen endobronchial biopsy samples, and formalin-fixed, paraffin-embedded (FFPE) biopsy samples were to be used for DNA isolation. DNA isolation from the FFPE samples was unsuccessful. However, the organic phase remaining from the RNA isolation was banked and could potentially serve as a source of DNA. The organic phase of this isolation contained cell debris, proteins, and, as previously mentioned, DNA. We hypothesized that current protocols for organic-phase DNA isolation might yield adequate quantities of DNA for genomic profiling. Using immortalized cell culture lines to establish the methodology, numerous organic-phase DNA isolation protocols were tested. During subsequent validation using the remaining organic-phase samples from the lung PCGA cohort, the protocol yielded varied results, suggesting that further optimization to increase DNA purity is required.
The ability to isolate DNA from these valuable samples will enhance progress in the lung PCGA study. The aims of this thesis, involving the curation of clinical metadata and the generation of additional DNA samples for genomic profiling, have had a significant impact on the PCGA study and on future expansions of this work.
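The consolidation of clinical metadata exports into a single, continually updated source could be approached roughly as sketched below; the file names, column names, and use of pandas are hypothetical illustrations, not the tooling described in the thesis.

import pandas as pd

def consolidate_exports(paths, key="participant_id"):
    """Merge several clinical metadata exports, keeping the newest record per participant."""
    frames = []
    for path in paths:
        df = pd.read_csv(path)
        # Harmonize header spelling so the same field lines up across exports.
        df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
        frames.append(df)
    combined = pd.concat(frames, ignore_index=True, sort=False)
    # Assumes paths are ordered oldest to newest, so "last" wins for each participant.
    return combined.drop_duplicates(subset=key, keep="last").sort_values(key).reset_index(drop=True)

# Hypothetical file names; the consolidated table becomes the single exportable source.
master = consolidate_exports(["export_2015.csv", "export_2018.csv", "export_2020.csv"])
master.to_csv("pcga_clinical_metadata_master.csv", index=False)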
6

Ranked Similarity Search of Scientific Datasets: An Information Retrieval Approach

Megler, Veronika Margaret 04 June 2014 (has links)
In the past decade, the amount of scientific data collected and generated by scientists has grown dramatically. This growth has intensified an existing problem: in large archives consisting of datasets stored in many files, formats and locations, how can scientists find data relevant to their research interests? We approach this problem in a new way: by adapting Information Retrieval techniques, developed for searching text documents, into the world of (primarily numeric) scientific data. We propose an approach that uses a blend of automated and curated methods to extract metadata from large repositories of scientific data. We then perform searches over this metadata, returning results ranked by similarity to the search criteria. We present a model of this approach, and describe a specific implementation thereof performed at an ocean-observatory data archive and now running in production. Our prototype implements scanners that extract metadata from datasets that contain different kinds of environmental observations, and a search engine with a candidate similarity measure for comparing a set of search terms to the extracted metadata. We evaluate the utility of the prototype by performing two user studies; these studies show that the approach resonates with users, and that our proposed similarity measure performs well when analyzed using standard Information Retrieval evaluation methods. We performed performance tests to explore how continued archive growth will affect our goal of interactive response, developed and applied techniques that mitigate the effects of that growth, and showed that the techniques are effective. Lastly, we describe some of the research needed to extend this initial work into a true "Google for data".
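The core pipeline described above -- extract compact metadata from (primarily numeric) datasets, then rank datasets by similarity between the search terms and that metadata -- might look roughly like the following sketch. The metadata fields and the overlap-based similarity score are simplified stand-ins, not the dissertation's actual scanners or measure.

def extract_metadata(name, records):
    """Summarize a 'dataset' (list of dicts of numeric observations) into searchable metadata."""
    variables = sorted({k for r in records for k in r})
    ranges = {v: (min(r[v] for r in records if v in r),
                  max(r[v] for r in records if v in r)) for v in variables}
    return {"name": name, "variables": variables, "ranges": ranges}

def similarity(query_terms, meta):
    """Fraction of query terms that match the dataset name or one of its variable names."""
    haystack = {meta["name"].lower(), *[v.lower() for v in meta["variables"]]}
    hits = sum(any(term.lower() in h for h in haystack) for term in query_terms)
    return hits / len(query_terms) if query_terms else 0.0

def ranked_search(query_terms, catalog):
    # Return (score, dataset name) pairs, best match first.
    return sorted(((similarity(query_terms, m), m["name"]) for m in catalog), reverse=True)

catalog = [
    extract_metadata("estuary_ctd_2012", [{"salinity": 28.1, "temperature": 11.2}]),
    extract_metadata("river_flow_2012", [{"discharge": 5300.0, "temperature": 9.8}]),
]
print(ranked_search(["temperature", "salinity"], catalog))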
7

BlobSeer: Towards efficient data storage management for large-scale, distributed systems

Nicolae, Bogdan 30 November 2010 (has links) (PDF)
With data volumes increasing at a high rate and the emergence of highly scalable infrastructures (cloud computing, petascale computing), distributed management of data becomes a crucial issue that faces many challenges. This thesis brings several contributions to address such challenges. First, it proposes a set of principles for designing highly scalable distributed storage systems that are optimized for heavy data access concurrency. In particular, it highlights the potentially large benefits of using versioning in this context. Second, based on these principles, it introduces a series of distributed data and metadata management algorithms that enable high throughput under concurrency. Third, it shows how to efficiently implement these algorithms in practice, dealing with key issues such as high-performance parallel transfers, efficient maintenance of distributed data structures, fault tolerance, etc. These results are used to build BlobSeer, an experimental prototype that demonstrates both the theoretical benefits of the approach in synthetic benchmarks and its practical benefits in real-life application scenarios: as a storage backend for MapReduce applications, as a storage backend for deployment and snapshotting of virtual machine images in clouds, and as a quality-of-service-enabled data storage service for cloud applications. Extensive experiments on the Grid'5000 testbed show that BlobSeer remains scalable and sustains a high throughput even under heavy access concurrency, outperforming several state-of-the-art approaches by a large margin.
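A toy sketch of the versioning principle highlighted above -- each write publishes a new immutable snapshot, so readers of earlier versions never block on or conflict with writers -- is given below. It is a conceptual illustration only, not BlobSeer's distributed chunk and metadata design.

class VersionedBlob:
    def __init__(self):
        self._versions = [b""]          # version 0 is the empty blob

    def write(self, offset, data):
        """Publish a new version with `data` written at `offset`; return its version number."""
        current = bytearray(self._versions[-1])
        if offset > len(current):
            current.extend(b"\x00" * (offset - len(current)))
        current[offset:offset + len(data)] = data
        self._versions.append(bytes(current))   # earlier versions stay immutable
        return len(self._versions) - 1

    def read(self, version, offset, size):
        """Read from a specific snapshot, isolated from any later writes."""
        return self._versions[version][offset:offset + size]

blob = VersionedBlob()
v1 = blob.write(0, b"hello world")
v2 = blob.write(6, b"cloud")
print(blob.read(v1, 0, 11))   # b'hello world'  (unchanged by the later write)
print(blob.read(v2, 0, 11))   # b'hello cloud'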
8

Efficient support for data-intensive scientific workflows on geo-distributed clouds / Support pour l'exécution efficace des workflows scientifiques à traitement intensif de données sur les cloud géo-distribués

Pineda Morales, Luis Eduardo 24 May 2017 (has links)
By 2020, the digital universe is expected to reach 44 zettabytes, as it is doubling every two years. Data come in the most diverse shapes and from the most geographically dispersed sources ever. The data explosion calls for applications capable of highly scalable, distributed computation, and for infrastructures with massive storage and processing power to support them. These large-scale applications are often expressed as workflows that help define data dependencies between their different components. More and more scientific workflows are executed on clouds, for they are a cost-effective alternative for intensive computing. Sometimes, workflows must be executed across multiple geo-distributed cloud datacenters, either because they exceed a single site's capacity due to their huge storage and computation requirements, or because the data they process is scattered across different locations. Multisite workflow execution brings about several issues for which little support has been developed: there is no common file system for data transfer, inter-site latencies are high, and centralized management becomes a bottleneck. This thesis consists of three contributions towards bridging the gap between single-site and multisite workflow execution. First, we present several design strategies to efficiently support the execution of workflow engines across multisite clouds by reducing the cost of metadata operations. Then, we take one step further and explain how selective handling of metadata, classified by frequency of access, improves workflow performance in a multisite environment. Finally, we look into a different approach to optimize cloud workflow execution by studying execution parameters to model and steer elastic scaling.
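The "selective handling of metadata, classified by frequency of access" idea could be sketched as follows: metadata entries that turn out to be hot are replicated to every site, while cold entries stay at their home site and pay the inter-site latency on access. The threshold, site names, and data structures are illustrative assumptions, not the thesis design.

from collections import Counter

class MultisiteMetadata:
    def __init__(self, sites, hot_threshold=3):
        self.sites = {s: {} for s in sites}     # per-site metadata store
        self.home = {}                          # key -> home site
        self.hits = Counter()
        self.hot_threshold = hot_threshold

    def put(self, key, value, home_site):
        self.home[key] = home_site
        self.sites[home_site][key] = value

    def get(self, key, from_site):
        self.hits[key] += 1
        if self.hits[key] >= self.hot_threshold:        # promote hot metadata
            value = self.sites[self.home[key]][key]
            for store in self.sites.values():
                store[key] = value                      # replicate to every site
        local = self.sites[from_site]
        if key in local:
            return local[key], "local"
        return self.sites[self.home[key]][key], "remote"    # pays inter-site latency

md = MultisiteMetadata(sites=["us-east", "eu-west"])
md.put("task42.out", {"size": 1024}, home_site="us-east")
for _ in range(4):
    print(md.get("task42.out", from_site="eu-west"))   # switches from "remote" to "local"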
9

An Application-Attuned Framework for Optimizing HPC Storage Systems

Paul, Arnab Kumar 19 August 2020 (has links)
High performance computing (HPC) is routinely employed in diverse domains, such as the life sciences and geology, to simulate and understand the behavior of complex phenomena. Big-data-driven scientific simulations are resource intensive and require both computing and I/O capabilities at scale. There is a crucial need to revisit the HPC I/O subsystem to better optimize for, and manage, the increased pressure on the underlying storage systems from big data processing. Extant HPC storage systems are designed and tuned for a specific set of applications targeting a range of workload characteristics, but they lack the flexibility to adapt to ever-changing application behaviors. The complex nature of modern HPC storage systems, along with ever-changing application behaviors, presents unique opportunities and engineering challenges. In this dissertation, we design and develop a framework for optimizing HPC storage systems by making them application-attuned. We select three different kinds of HPC storage systems: in-memory data analytics frameworks, parallel file systems, and object storage. We first analyze HPC application I/O behavior by studying real-world I/O traces. Next, we optimize parallelism for applications running in memory, then we design data management techniques for HPC storage systems, and finally we focus on low-level I/O load balance to improve the efficiency of modern HPC storage systems. / Doctor of Philosophy / Clusters of multiple computers connected through the internet are often deployed in industry and laboratories for large-scale data processing or computation that cannot be handled by standalone computers. In such a cluster, resources such as CPU, memory, and disks are integrated to work together. With the increase in popularity of applications that read and write tremendous amounts of data, we need a large number of disks that can interact effectively in such clusters. This forms part of high performance computing (HPC) storage systems. Such HPC storage systems are used by a diverse set of applications coming from organizations in a vast range of domains, from earth sciences, financial services, and telecommunications to the life sciences. Therefore, the HPC storage system should perform well for the different read and write (I/O) requirements of all these different applications. But current HPC storage systems do not cater to the varied I/O requirements. To this end, this dissertation designs and develops a framework for HPC storage systems that is application-attuned and thus provides much better performance than other state-of-the-art HPC storage systems without such optimizations.
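The low-level I/O load-balancing component mentioned above can be pictured with a toy placement policy that routes each new write to the least-loaded storage target. The load metric (pending bytes) and target names are hypothetical, not the dissertation's actual mechanism.

import heapq

class LoadBalancer:
    def __init__(self, targets):
        # Min-heap of (pending_bytes, target_name), so the least-loaded target pops first.
        self._heap = [(0, t) for t in targets]
        heapq.heapify(self._heap)

    def place(self, size_bytes):
        """Assign a write of `size_bytes` to the currently least-loaded target."""
        load, target = heapq.heappop(self._heap)
        heapq.heappush(self._heap, (load + size_bytes, target))
        return target

lb = LoadBalancer(["ost0", "ost1", "ost2"])
for size in [4 << 20, 1 << 20, 2 << 20, 1 << 20]:
    print(size, "->", lb.place(size))   # writes spread across targets by outstanding load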
10

Partial persistent sequences and their applications to collaborative text document editing and processing

Wu, Qinyi 08 July 2011 (has links)
In a variety of text document editing and processing applications, it is necessary to keep track of the revision history of text documents by recording changes and the metadata of those changes (e.g., user names and modification timestamps). The recent Web 2.0 document editing and processing applications, such as real-time collaborative note taking and wikis, require fine-grained shared access to collaborative text documents as well as efficient retrieval of metadata associated with different parts of collaborative text documents. Current revision control techniques only support coarse-grained shared access and are inefficient at retrieving metadata of changes at the sub-document granularity. In this dissertation, we design and implement partial persistent sequences (PPSs) to support real-time collaborations and manage metadata of changes at fine granularities for collaborative text document editing and processing applications. As a persistent data structure, PPSs have two important features. First, items in the data structure are never removed. We maintain necessary timestamp information to keep track of both inserted and deleted items and use the timestamp information to reconstruct the state of a document at any point in time. Second, PPSs create unique, persistent, and ordered identifiers for items of a document at fine granularities (e.g., a word or a sentence). As a result, we are able to support consistent and fine-grained shared access to collaborative text documents by detecting and resolving editing conflicts based on the revision history as well as to efficiently index and retrieve metadata associated with different parts of collaborative text documents. We demonstrate the capabilities of PPSs through two important problems in collaborative text document editing and processing applications: data consistency control and fine-grained document provenance management. The first problem studies how to detect and resolve editing conflicts in collaborative text document editing systems. We approach this problem in two steps. In the first step, we use PPSs to capture data dependencies between different editing operations and define a consistency model more suitable for real-time collaborative editing systems. In the second step, we extend our work to the entire spectrum of collaborations and adapt transactional techniques to build a flexible framework for the development of various collaborative editing systems. The generality of this framework is demonstrated by its capabilities to specify three different types of collaborations as exemplified in the systems of RCS, MediaWiki, and Google Docs, respectively. We precisely specify the programming interfaces of this framework and describe a prototype implementation over Oracle Berkeley DB High Availability, a replicated database management engine. The second problem of fine-grained document provenance management studies how to efficiently index and retrieve fine-grained metadata for different parts of collaborative text documents. We use PPSs to design both disk-economic and computation-efficient techniques to index provenance data for millions of Wikipedia articles. Our approach is disk-economic because we only save a few full versions of a document and only keep delta changes between those full versions. Our approach is also computation-efficient because we avoid the necessity of parsing the revision history of collaborative documents to retrieve fine-grained metadata.
Compared to MediaWiki, the revision control system for Wikipedia, our system uses less than 10% of disk space and achieves at least an order of magnitude speed-up to retrieve fine-grained metadata for documents with thousands of revisions.
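A rough sketch of the partial persistent sequence idea -- items carry insertion and deletion timestamps and are never physically removed, so the visible state of a document can be reconstructed at any point in time, while fractional position identifiers stay stable across edits -- might look like the following. The identifier scheme and API are simplified illustrations, not the dissertation's actual design.

import bisect
import itertools

class PartialPersistentSequence:
    def __init__(self):
        self._items = []                 # (position_id, text, t_insert, t_delete)
        self._clock = itertools.count(1)

    def insert(self, position_id, text):
        t = next(self._clock)
        bisect.insort(self._items, (position_id, text, t, None))
        return t

    def delete(self, position_id):
        t = next(self._clock)
        for i, (pid, text, t_ins, t_del) in enumerate(self._items):
            if pid == position_id and t_del is None:
                self._items[i] = (pid, text, t_ins, t)   # mark deleted, keep the item
        return t

    def state_at(self, t):
        """Reconstruct the visible sequence as of timestamp t."""
        return [text for pid, text, t_ins, t_del in self._items
                if t_ins <= t and (t_del is None or t_del > t)]

doc = PartialPersistentSequence()
doc.insert(1.0, "Partial")
doc.insert(2.0, "persistent")
doc.insert(3.0, "sequences")
t = doc.delete(2.0)
print(doc.state_at(t - 1))   # ['Partial', 'persistent', 'sequences']
print(doc.state_at(t))       # ['Partial', 'sequences']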
