1 |
Automatic data integration with generalized mapping definitions (Tian, Aibo, 18 September 2014)
Data integration systems provide uniform access to a set of heterogeneous structured data sources. An essential component of a data integration system is the mapping between the federated data model and each data source. The scale of interconnection among data sources in the big data era is a new impetus for automating the mapping process. Despite decades of research on data integration, generating mappings still requires extensive labor. The thesis of this research is that progress on automatic data integration has been limited by a narrow definition of mapping. The common mapping process is to find correspondences between pairs of entities in the data models and to create logic expressions over the correspondences as executable mappings. This does not cover all issues in real-world applications. This research aims to overcome the problem in two ways: (1) generalize the common mapping definition for relational databases; (2) address the problem in a more general framework, the Semantic Web. The Semantic Web provides flexible graph-based data models and reasoning capabilities, as in knowledge representation systems. The new graph data model introduces opportunities for new mapping definitions. The comparison of mapping definitions and solutions for both relational databases and the Semantic Web is discussed. In this dissertation, I propose two generalizations of mapping problems. First, the common schema matching definition for relational databases is generalized from finding correspondences between pairs of attributes to finding correspondences consisting of relations, attributes, and data values. This generalization solves real-world issues that were not previously covered. The same generalization can be applied to ontology matching in the Semantic Web. The second piece of work generalizes the ontology mapping definition from finding correspondences between pairs of entities to pairs of graph paths (sequences of entities). As a path provides more context than a single entity, mapping between paths can address two challenges in data integration: the missing mapping challenge and the ambiguous mapping challenge. Combining the two proposed generalizations, I demonstrate a complete data integration system using Semantic Web techniques. The complete system includes components for automatic ontology mapping and query reformulation, and semi-automatically federates the query results from multiple data sources.
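To make the generalized correspondence concrete, here is a minimal sketch, assuming a simple record-style representation (the class, relation names, values, and scores below are illustrative, not the dissertation's actual formalism), of how a match can involve relations and data values rather than only attribute pairs:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Correspondence:
    """A generalized correspondence: a relation, an attribute, or a specific
    data value on each side may participate in the match."""
    src_relation: str
    src_attribute: Optional[str]   # None when the relation itself is matched
    src_value: Optional[str]       # None unless a data value is matched
    tgt_relation: str
    tgt_attribute: Optional[str]
    tgt_value: Optional[str]
    score: float                   # similarity produced by the matcher

def attribute_only(corr: Correspondence) -> bool:
    """True for the classical attribute-to-attribute case."""
    return corr.src_value is None and corr.tgt_value is None

# Classical match: Employee.salary ~ Staff.pay
c1 = Correspondence("Employee", "salary", None, "Staff", "pay", None, 0.91)

# Generalized match: the source value gender='M' corresponds to the target
# relation MaleEmployee, a value-to-relation correspondence that the
# attribute-pair definition cannot express.
c2 = Correspondence("Employee", "gender", "M", "MaleEmployee", None, None, 0.87)

print(attribute_only(c1), attribute_only(c2))   # True False
```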
|
2 |
Automated annotation of protein families / Automatiserad annotering av proteinfamiljer (Elfving, Eric, January 2011)
Introduction: The great challenge in bioinformatics is data integration. The amount of available data is always increasing and there are no common unified standards for where, or how, the data should be stored. The aim of this work is to build an automated tool to annotate the different member families within the protein superfamily of medium-chain dehydrogenases/reductases (MDR), by finding common properties among the member proteins. The goal is to increase the understanding of the MDR superfamily as well as of the different member families. This will add to the amount of knowledge gained for free when a new, unannotated protein is matched as a member to a specific MDR member family.

Method: The different types of data available all needed different handling. Textual data was mainly compared as strings, while numeric data needed special handling such as statistical calculations. Ontological data was handled as tree nodes, where ancestry between terms had to be considered. This was implemented as a plugin-based system to make the tool easy to extend with additional data sources of different types.

Results: The biggest challenge was data incompleteness, yielding little (or no) results for some families and thus decreasing the statistical significance of the results. Results show that all the human and mouse MDR members have a Pfam ADH domain (ADH_N and/or ADH_zinc_N) and take part in an oxidation-reduction process, often with NAD or NADP as cofactor. Many of the proteins contain zinc and are expressed in liver tissue.

Conclusions: A Python-based tool for automatic annotation has been created to annotate the different MDR member families. The tool is easily extensible with new databases, and much of its output agrees with information found in the literature. The utility and necessity of this system, as well as the quality of its produced results, are expected to only increase over time, even if no additional extensions are produced, as the system itself is able to make further and more detailed inferences as more and more data become available.
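As a rough illustration of the plugin idea described above, the following sketch shows one possible shape for such a system; the class names, method names, and toy data are assumptions for this example, not the thesis's actual code:

```python
from abc import ABC, abstractmethod
from collections import Counter

class AnnotationSource(ABC):
    """One plugin per data source; the driver relies only on this interface."""

    @abstractmethod
    def annotations(self, protein_id: str) -> set[str]:
        """Return the set of annotation terms known for one protein."""

class PfamSource(AnnotationSource):
    def __init__(self, domain_table: dict[str, set[str]]):
        self._table = domain_table          # protein id -> Pfam domains

    def annotations(self, protein_id: str) -> set[str]:
        return self._table.get(protein_id, set())

def summarize_family(members: list[str], source: AnnotationSource,
                     min_fraction: float = 0.8) -> set[str]:
    """Keep terms shared by at least `min_fraction` of the family members."""
    counts = Counter(t for p in members for t in source.annotations(p))
    cutoff = min_fraction * len(members)
    return {term for term, n in counts.items() if n >= cutoff}

# Toy data: both ADH family members carry the ADH_N and ADH_zinc_N domains.
pfam = PfamSource({"ADH1B_HUMAN": {"ADH_N", "ADH_zinc_N"},
                   "ADH1C_HUMAN": {"ADH_N", "ADH_zinc_N"}})
print(summarize_family(["ADH1B_HUMAN", "ADH1C_HUMAN"], pfam))
```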
|
3 |
Fast history matching of time-lapse seismic and production data for high resolution models (Jimenez, Eduardo Antonio, 10 October 2008)
Integrated reservoir modeling has become an important part of day-to-day decision analysis in oil and gas management practices. A very attractive and promising technology is the use of time-lapse, or 4D, seismic as an essential component in subsurface modeling. Today, 4D seismic is enabling oil companies to optimize production and increase recovery through monitoring fluid movements throughout the reservoir. 4D seismic advances are also being driven by an increased need by the petroleum engineering community to become more quantitative and accurate in its ability to monitor reservoir processes. Qualitative interpretations of time-lapse anomalies are being replaced by quantitative inversions of 4D seismic data to produce accurate maps of fluid saturations, pore pressure, and temperature, among other properties.

Within all the steps involved in this subsurface modeling process, the most demanding one is integrating the geologic model with dynamic field data, including 4D seismic when available. The validation of the geologic model with observed dynamic data is accomplished through a "history matching" (HM) process, typically carried out with well-based measurements. Due to the low resolution of production data, the validation process is severely limited in its areal coverage of the reservoir, compromising the quality of the model and any subsequent predictive exercise. This research will aim to provide a novel history matching approach that can use information from high-resolution seismic data to supplement the areally sparse production data. The proposed approach will utilize streamline-derived sensitivities as a means of relating the forward model performance with the prior geologic model. The essential ideas underlying this approach are similar to those used for high-frequency approximations in seismic wave propagation. In both cases, this leads to solutions that are defined along "streamlines" (fluid flow) or "rays" (seismic wave propagation). Synthetic and field data examples will be used extensively to demonstrate the value and contribution of this work.

Our results show that the problem of non-uniqueness in this complex history matching problem is greatly reduced when constraints in the form of saturation maps from spatially closely sampled seismic data are included. Furthermore, our methodology can be used to quickly identify discrepancies between static and dynamic modeling. Reducing this gap will ensure robust and reliable models, leading to accurate predictions and ultimately optimum hydrocarbon extraction.
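The role of the seismic constraint can be illustrated with a toy combined misfit; this is only a schematic sketch (the least-squares form, the weights, and the toy data are assumptions, not the formulation used in this work):

```python
import numpy as np

def history_match_misfit(sim_rates, obs_rates,
                         sim_saturation_map, seismic_saturation_map,
                         w_prod=1.0, w_seis=1.0):
    """Combined data misfit: sparse well production data plus a dense
    saturation map derived from 4D seismic. Lower is better."""
    prod_term = np.sum((np.asarray(sim_rates) - np.asarray(obs_rates)) ** 2)
    seis_term = np.mean((np.asarray(sim_saturation_map)
                         - np.asarray(seismic_saturation_map)) ** 2)
    return w_prod * prod_term + w_seis * seis_term

# Toy example: 3 wells and a 10x10 seismic-derived saturation map.
rng = np.random.default_rng(0)
obs = rng.uniform(50, 150, size=3)               # observed well rates
sim = obs + rng.normal(0, 5, size=3)             # simulated rates
seis_map = rng.uniform(0.2, 0.8, size=(10, 10))  # seismic-derived saturations
sim_map = seis_map + rng.normal(0, 0.05, size=(10, 10))
print(round(history_match_misfit(sim, obs, sim_map, seis_map), 3))
```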
|
4 |
Selection of maintenance policies for a data warehousing environment: a cost based approach to meeting quality of service requirements (Engström, Henrik, January 2002)
No description available.
|
5 |
Integration of building design and construction information: a neutral object-oriented model (Kiwan, Mohd S. A. A., January 1994)
No description available.
|
6 |
Data Sharing and Exchange: Semantics and Query Answering (Awada, Rana, January 2015)
Exchanging and integrating data that belong to worlds of different vocabularies are two prominent problems in the database literature. While data coordination deals with managing and integrating data between autonomous yet related sources with possibly distinct vocabularies, data exchange is defined as the problem of extracting data from a source and materializing it in an independent target so that it conforms to the target schema. These two problems, however, have never been studied in a unified setting that allows both the exchange of data and the coordination of different vocabularies between different sources. Our thesis shows that such a unified setting exhibits data integration capabilities beyond the ones provided by data exchange and data coordination separately. In this thesis, we propose a new setting, called DSE (Data Sharing and Exchange), which allows the exchange of data between independent source and target applications that possess independent schemas as well as independent yet related domains of constants. To facilitate this type of exchange, we extend the source-to-target dependencies used in the ordinary data exchange setting, which capture the association between the source and the target at the schema level, with the mapping table construct introduced in the classical data coordination setting, which defines the association between the source and the target at the instance level. A mapping table defines, for each source element, the set of associated (or corresponding) elements in the domain of the target. The semantics of this association between source and target elements varies with the requirements of different applications. Ordinary DE settings can represent DSE settings; however, we show that there exist DSE settings, with particular semantics of related values in mapping tables, where DE is not the best exchange solution to adopt. The thesis introduces two DSE settings with such a property. We call the first DSE with unique-identity semantics. The semantics of a mapping table in this DSE setting specifies that each source element should be uniquely mapped to at least one target element that is associated with it in the mapping table.
In this setting, classical DE is one method to perform a data exchange; however, it is not the best method to adopt, since it cannot represent exchange applications that require, as DC applications do, computing both portions and complete sets of certain answers for conjunctive queries. In addition, we show that adopting known DE universal solutions as semantics for such DSE settings is not the best choice in terms of efficiency when computing certain answers for conjunctive queries. The second DSE setting that the thesis introduces with the same property is called DSE with equality semantics. This setting captures an interesting meaning of related data in a mapping table. Such semantics impose that each source element in a mapping table is related to a target element only if both elements are equivalent (i.e., they have the same meaning). We show in our thesis that this DSE setting differs from ordinary DE settings in the sense that additional information can be entailed under such an interpretation of related data. Also, this added information needs to be added to both the source instance and the mapping table in order to generate target instances that correctly reflect both in a DSE scenario. In other words, in such a DSE setting, a source instance and a mapping table can be incomplete with respect to the semantics of the mapping table. We formally define the two aforementioned semantics of a DSE setting and we distinguish between two types of solutions for this setting, named universal DSE solutions, which contain the complete set of exchanged information, and universal DSE KB-Solutions, which store a portion of the exchanged information together with implicit information in the form of a set of rules over the target. DSE KB-Solutions allow applications to compute on demand both a portion and the complete set of certain answers for conjunctive queries. In addition, we define the semantics of conjunctive query answering, distinguish between sound and complete certain answers for conjunctive queries, and define algorithms to compute these efficiently. Finally, we provide experimental results which compare the run times to generate DSE solutions versus DSE KB-Solutions, and compare the performance of computing sound and complete certain answers for conjunctive queries using both types of solutions.
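As a rough illustration of the mapping-table construct under unique-identity semantics, the following sketch enumerates the candidate target tuples for a tiny source instance; the relation, the constants, and the naive expansion are assumptions for this example, not the thesis's algorithms:

```python
from itertools import product

# Source tuples use the source vocabulary; the mapping table relates each
# source constant to the target constants it may correspond to.
source_instance = [("paper42", "DB")]                 # Publication(id, area)
mapping_table = {"DB": {"Databases", "DataMgmt"},     # source value -> targets
                 "paper42": {"paper42"}}              # identifiers map to themselves

def exchange_unique_identity(tuples, table):
    """Unique-identity flavour: every source constant must be rewritten to an
    associated target constant; here we simply enumerate all candidate target
    tuples (a certain answer would have to hold under every such choice)."""
    results = set()
    for t in tuples:
        choices = [sorted(table.get(v, {v})) for v in t]
        results.update(product(*choices))
    return results

print(exchange_unique_identity(source_instance, mapping_table))
# e.g. {('paper42', 'Databases'), ('paper42', 'DataMgmt')}
```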
|
7 |
Metadata-Driven Data Integration (Nadal Francesch, Sergi, 16 May 2019)
Data has an undoubtable impact on society. Storing and processing large amounts of available data is currently one of the key success factors for an organization. Nonetheless, we have recently been witnessing a change, represented by huge and heterogeneous amounts of data. Indeed, 90% of the data in the world has been generated in the last two years. Thus, in order to carry out these data exploitation tasks, organizations must first perform data integration, combining data from multiple sources to yield a unified view over them. Yet, the integration of massive and heterogeneous amounts of data requires revisiting the traditional integration assumptions to cope with the new requirements posed by such data-intensive settings.

This PhD thesis aims to provide a novel framework for data integration in the context of data-intensive ecosystems, which entails dealing with vast amounts of heterogeneous data, from multiple sources and in their original format. To this end, we advocate for an integration process consisting of sequential activities governed by a semantic layer, implemented via a shared repository of metadata. From a stewardship perspective, these activities are the deployment of a data integration architecture, followed by the population of such shared metadata. From a data consumption perspective, the activities are virtual and materialized data integration, the former an exploratory task and the latter a consolidation one. Following the proposed framework, we focus on providing contributions to each of the four activities.

We begin by proposing a software reference architecture for semantic-aware data-intensive systems. Such an architecture serves as a blueprint to deploy a stack of systems, its core being the metadata repository. Next, we propose a graph-based metadata model as a formalism for metadata management. We focus on supporting schema and data source evolution, a predominant factor in the heterogeneous sources at hand. For virtual integration, we propose query rewriting algorithms that rely on the previously proposed metadata model. We additionally consider semantic heterogeneities in the data sources, which the proposed algorithms are capable of automatically resolving. Finally, the thesis focuses on the materialized integration activity and, to this end, proposes a method to select intermediate results to materialize in data-intensive flows. Overall, the results of this thesis serve as a contribution to the field of data integration in contemporary data-intensive ecosystems.
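As an illustration of a shared metadata repository governing the process, here is a minimal sketch of a graph-like metadata store that registers sources and maps their attributes to shared concepts; the class, predicates, and file names are assumptions for this example, not the thesis's metadata model:

```python
class MetadataRepository:
    """Shared metadata as a labeled graph: nodes are sources, attributes and
    global concepts; edges record which attribute belongs to which source and
    which concept it maps to."""

    def __init__(self):
        self.edges = []                      # (subject, predicate, object)

    def register_source(self, source, attributes):
        for attr in attributes:
            self.edges.append((source, "hasAttribute", attr))

    def map_to_concept(self, attribute, concept):
        self.edges.append((attribute, "mapsTo", concept))

    def sources_for_concept(self, concept):
        """Naive aid for query rewriting: which sources can answer a concept?"""
        attrs = {s for s, p, o in self.edges if p == "mapsTo" and o == concept}
        return {s for s, p, o in self.edges
                if p == "hasAttribute" and o in attrs}

repo = MetadataRepository()
repo.register_source("crm.csv", ["crm.csv/email", "crm.csv/full_name"])
repo.register_source("events.json", ["events.json/user_mail"])
repo.map_to_concept("crm.csv/email", "Email")
repo.map_to_concept("events.json/user_mail", "Email")
print(repo.sources_for_concept("Email"))     # {'crm.csv', 'events.json'}
```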
|
8 |
Statistical methods for robust analysis of transcriptome data by integration of biological prior knowledge / Méthodes statistiques pour une analyse robuste du transcriptome à travers l'intégration d'a priori biologique (Jeanmougin, Marine, 16 November 2012)
Recent advances in Molecular Biology have led biologists toward high-throughput genomic studies. In particular, the investigation of the human transcriptome offers unprecedented opportunities for understanding cellular and disease mechanisms. In this PhD, we focus on providing robust statistical methods dedicated to the treatment and analysis of high-throughput transcriptome data. We discuss the differential analysis approaches available in the literature for identifying genes associated with a phenotype of interest and propose a comparison study. We provide practical recommendations on the appropriate method to be used, based on various simulation models and real datasets. With the eventual goal of overcoming the inherent instability of differential analysis strategies, we have developed an innovative approach called DiAMS, for DIsease Associated Modules Selection. This method is applied to select significant modules of genes rather than individual genes and involves the integration of both transcriptome and protein-interaction data in a local-score strategy. We then focus on the development of a framework to infer gene regulatory networks by integration of a biologically informative prior over network structures using Gaussian graphical models. This approach offers the possibility of exploring the molecular relationships between genes, leading to the identification of altered regulations potentially involved in disease processes. Finally, we apply our statistical developments to study the metastatic relapse of breast cancer.
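The module-selection idea can be illustrated with a toy local-score computation over a protein-interaction graph; this greedy sketch is only schematic (the scores, threshold, and growth rule are assumptions and do not reproduce the DiAMS algorithm):

```python
# Toy protein-interaction network and per-gene scores, e.g. -log10(p-value)
# from a differential expression analysis.
ppi = {"TP53": {"MDM2", "BRCA1"}, "MDM2": {"TP53"},
       "BRCA1": {"TP53", "BARD1"}, "BARD1": {"BRCA1"}, "GAPDH": set()}
gene_score = {"TP53": 3.1, "MDM2": 2.4, "BRCA1": 2.8, "BARD1": 0.3, "GAPDH": 0.1}

def grow_module(seed, threshold=1.0):
    """Greedily grow a connected module around `seed`, adding the neighbour
    whose (score - threshold) contribution increases the module score."""
    module, score = {seed}, gene_score[seed] - threshold
    frontier = set(ppi[seed])
    while frontier:
        best = max(frontier, key=lambda g: gene_score[g])
        gain = gene_score[best] - threshold
        if gain <= 0:
            break
        module.add(best)
        score += gain
        frontier = (frontier | ppi[best]) - module
    return module, round(score, 2)

print(grow_module("TP53"))   # e.g. ({'TP53', 'BRCA1', 'MDM2'}, 5.3)
```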
|
9 |
Context Interchange as a Scalable Solution to Interoperating Amongst Heterogeneous Dynamic Services (Zhu, Hongwei; Madnick, Stuart E., 01 1900)
Many online services access a large number of autonomous data sources and at the same time need to meet different user requirements. It is essential for these services to achieve semantic interoperability among these information exchange entities. In the presence of an increasing number of proprietary business processes, heterogeneous data standards, and diverse user requirements, it is critical that the services are implemented using adaptable, extensible, and scalable technology. The COntext INterchange (COIN) approach, inspired by similar goals of the Semantic Web, provides a robust solution. In this paper, we describe how COIN can be used to implement dynamic online services where semantic differences are reconciled on the fly. We show that COIN is flexible and scalable by comparing it with several conventional approaches. For a given ontology, the number of conversions in COIN is quadratic in the number of distinctions of the semantic aspect that has the largest number of distinctions. These semantic aspects are modeled as modifiers in a conceptual ontology; in most cases the number of conversions is linear in the number of modifiers, which is significantly smaller than in the traditional hard-wired middleware approach, where the number of conversion programs is quadratic in the number of sources and data receivers. In the example scenario in the paper, the COIN approach needs only 5 conversions to be defined, while traditional approaches require 20,000 to 100 million. COIN achieves this scalability by automatically composing all the comprehensive conversions from a small number of declaratively defined sub-conversions.
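The scalability claim is essentially a counting argument; the sketch below reproduces the arithmetic under assumed figures (the source, receiver, and modifier counts are hypothetical and chosen only to illustrate the gap, not taken from the paper's scenario):

```python
def hardwired_conversions(n_sources: int, n_receivers: int) -> int:
    """Point-to-point middleware: one conversion program per (source, receiver)
    pair, so the count grows with the product of sources and receivers."""
    return n_sources * n_receivers

def coin_conversions(distinctions_per_modifier: list[int]) -> int:
    """COIN-style: declare sub-conversions per modifier; a modifier with k
    distinctions needs at most k * (k - 1) directed sub-conversions, and the
    full conversions are composed automatically from these."""
    return sum(k * (k - 1) for k in distinctions_per_modifier)

# Hypothetical scenario: 200 sources, 100 receivers, and three semantic
# modifiers (e.g. currency, scale factor, date format) with a few
# distinctions each.
print(hardwired_conversions(200, 100))   # 20000 pairwise conversion programs
print(coin_conversions([2, 2, 3]))       # 2 + 2 + 6 = 10 sub-conversions
```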
|
10 |
Distributed Data Integration using Web Services and XML (Mukker, Alka, 20 December 2004)
Data integration has been an active topic of research in the past. With the advances in web technology, data integration faces new challenges imposed by the heterogeneity of sources and by their autonomy and independence. Web services, which are universally accessible software components deployed on the web, are becoming the focus of recent research due to their ability to interconnect systems and their potential for cost optimization. At the same time, XML has become one of the core technologies for business applications. By offering a standard, flexible and inherently extensible data format, XML significantly reduces the burden of deploying the many technologies needed to ensure the success of web services. This thesis examines the opportunities for data integration in the context of the web services development paradigm. It examines the existing technologies and standards of web services and XML and provides an example of how web services can be used to unlock heterogeneous systems to extract and integrate data. The approach followed to illustrate this uses embedded web service calls inside XML documents. The main contributions of this work are: 1) a comprehensive review of existing technologies; 2) an architecture to support the invocation of embedded web services; 3) the implementation of an application to show the results; and 4) the use of existing technologies to implement the proposed system.
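A minimal sketch of the embedded-call idea: an XML document carries a placeholder element naming a service operation, and a small resolver replaces it with the service's result. The element names, the endpoint URL, and the stubbed invocation are hypothetical illustrations, not the thesis's implementation:

```python
import xml.etree.ElementTree as ET

DOC = """<customer id="42">
  <name>Ada Lovelace</name>
  <creditRating>
    <wscall service="http://example.org/rating" operation="getRating" arg="42"/>
  </creditRating>
</customer>"""

def invoke(service_url: str, operation: str, arg: str) -> str:
    """Stub for a real SOAP/REST invocation; a deployment would issue an
    HTTP request here instead of returning canned data."""
    return {"getRating": "AAA"}.get(operation, "unknown")

def resolve_embedded_calls(xml_text: str) -> str:
    root = ET.fromstring(xml_text)
    for parent in list(root.iter()):            # snapshot before mutating
        for child in list(parent):
            if child.tag == "wscall":
                parent.remove(child)            # drop the placeholder element
                parent.text = invoke(child.get("service"),
                                     child.get("operation"),
                                     child.get("arg"))   # inline the result
    return ET.tostring(root, encoding="unicode")

print(resolve_embedded_calls(DOC))   # creditRating now contains "AAA"
```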
|