Query Processing for Peer Mediator Databases. Katchaounov, Timour. January 2003.
The ability to physically interconnect many distributed, autonomous and heterogeneous software systems on a large scale presents new opportunities for sharing and reusing existing information, and for creating new information and new computational services. However, finding and combining information in many such systems is a challenge even for the most advanced computer users. To address this challenge, mediator systems logically integrate many sources to hide their heterogeneity and distribution and give users the illusion of a single coherent system. Many new areas, such as scientific collaboration, require cooperation between many autonomous groups willing to share their knowledge. These areas require that the data integration process can be distributed among many autonomous parties, so that large integration solutions can be constructed from smaller ones. For this we propose a decentralized mediation architecture, peer mediator systems (PMS), based on the peer-to-peer (P2P) paradigm. In a PMS, reuse of human effort is achieved through logical composability of the mediators in terms of other mediators and sources, by defining mediator views in terms of views in other mediators and sources. Our thesis is that logical composability in a P2P mediation architecture is an important requirement and that composable mediators can be implemented efficiently through query processing techniques. In order to compute answers to queries in a PMS, logical mediator compositions must be translated into query execution plans, where mediators and sources cooperate to compute query answers. The focus of this dissertation is on query processing methods that realize composability in a PMS architecture in an efficient way that scales with the number of mediators.
Our contributions consist of an investigation of the interfaces and capabilities for peer mediators, and the design, implementation and experimental study of several query processing techniques that realize composability in an efficient and scalable way.
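The core idea of logical composability — mediator views defined in terms of views in other mediators and sources — can be sketched as a recursive expansion of a query over the top mediator into a plan that touches only base sources. The catalog, peer names, and data structures below are invented for illustration; they are not the actual PMS interfaces.

```python
# Hypothetical sketch of mediator composability: each mediator view is
# defined over views in other mediators or over base sources, and a
# query against the top mediator is expanded recursively into a plan
# referencing only base sources.

# A view maps (peer, name) to the list of (peer, view_or_source) terms
# it is defined over; base sources have no definition in the catalog.
catalog = {
    ("m1", "proteins"): [("m2", "sequences"), ("s1", "annotations")],
    ("m2", "sequences"): [("s2", "genbank")],
}

def expand(peer, name):
    """Recursively expand a mediator view into base-source terms."""
    definition = catalog.get((peer, name))
    if definition is None:          # a base source: leaf of the plan
        return [(peer, name)]
    plan = []
    for p, n in definition:
        plan.extend(expand(p, n))
    return plan

plan = expand("m1", "proteins")
# The resulting plan references only the base sources s2 and s1.
```

A real PMS must additionally decide where each subplan executes and how mediators cooperate, which is where the dissertation's query processing techniques come in.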
Application of Semantic Web Technology to Establish Knowledge Management and Discovery in the Life Sciences. Venkatesan, Aravind. January 2014.
The last three decades have seen the successful development of many high-throughput technologies that have revolutionised and transformed biological research. The application of these technologies has generated large quantities of data, enabling new approaches to analyze and integrate these data, which now constitute the field of Systems Biology. Systems Biology aims to enable a holistic understanding of a biological system by mapping interactions between all the biochemical components within the system. This requires integration of interdisciplinary data and knowledge to comprehensively explore the various biological processes of a system. Ontologies in biology (bio-ontologies) and the Semantic Web are playing an increasingly important role in the integration of data and knowledge by offering an explicit, unambiguous and rich representation mechanism. This increased influence led to the proposal of the Semantic Systems Biology paradigm to complement the techniques currently used in Systems Biology. Semantic Systems Biology provides a semantic description of the knowledge about biological systems as a whole, facilitating data integration, knowledge management, reasoning and querying. However, this approach is still a typical product of technology push, offering potential users access to the new technology. This doctoral thesis presents the work performed to bring Semantic Systems Biology closer to biological domain experts. The work covers a variety of aspects of Semantic Systems Biology. The Gene eXpression Knowledge Base is a resource that captures knowledge on gene expression. The knowledge base exploits the power of seamless data integration offered by Semantic Web technologies to build large networks of varied datasets, capable of answering complex biological questions. The knowledge base is the result of active collaboration with the Gastrin Systems Biology group at the Norwegian University of Science and Technology.
This resource was customised by integrating additional data sets at users' request. Additionally, the utility of the knowledge base is demonstrated by the conversion of biological questions into computable queries. The joint analysis of the query results has helped to fill knowledge gaps in the biological system under study. Biologists often use different bioinformatics tools to conduct complex biological analyses. However, using these tools frequently poses a steep learning curve for life science researchers. Therefore, the thesis describes ONTO-ToolKit, a plug-in that allows biologists to exploit bio-ontology-based analysis as part of biological workflows in Galaxy. ONTO-ToolKit allows users to perform ontology-based analysis to improve the depth of their overall analysis. Visualisation plays a key role in helping users understand and grasp the knowledge represented in bio-ontologies. To this end, OLSVis, a web application, was developed to make ontology browsing intuitive and flexible. Finally, the steps needed to further advance the Semantic Systems Biology approach are discussed.
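The conversion of biological questions into computable queries rests on representing knowledge as subject-predicate-object triples and expressing questions as graph patterns. The toy matcher below illustrates the idea in plain Python; the gene names and predicates are made up, and a real resource such as the Gene eXpression Knowledge Base would use RDF and SPARQL instead.

```python
# A minimal triple store and pattern matcher, illustrating how a
# biological question becomes a declarative graph-pattern query.
# All facts below are invented for the example.

triples = {
    ("geneA", "expressed_in", "gastric_mucosa"),
    ("geneA", "regulated_by", "gastrin"),
    ("geneB", "expressed_in", "gastric_mucosa"),
}

def match(pattern, store):
    """Return variable bindings ('?x' terms) for one triple pattern."""
    results = []
    for s, p, o in store:
        binding, ok = {}, True
        for term, value in zip(pattern, (s, p, o)):
            if term.startswith("?"):
                binding[term] = value
            elif term != value:
                ok = False
                break
        if ok:
            results.append(binding)
    return results

# "Which genes expressed in gastric mucosa are regulated by gastrin?"
expressed = {b["?g"] for b in match(("?g", "expressed_in", "gastric_mucosa"), triples)}
regulated = {b["?g"] for b in match(("?g", "regulated_by", "gastrin"), triples)}
answer = expressed & regulated
```

The intersection of two patterns here plays the role that a conjunctive SPARQL basic graph pattern plays in the real knowledge base.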
Multivariate Analysis of Diverse Data for Improved Geostatistical Reservoir Modeling. Hong, Sahyun. Date unknown.
No description available.
Tabular Representation of Schema Mappings: Semantics and Algorithms. Rahman, Md. Anisur. 27 May 2011.
Our thesis investigates a mechanism for representing schema mappings in tabular form and evaluates the utility of the new representation.
Schema mapping is a high-level specification that describes the relationship between two database schemas. Schema mappings constitute essential building blocks of data integration, data exchange and peer-to-peer data sharing systems. Global-and-local-as-view (GLAV) is one of the approaches for specifying schema mappings. Tableaux are used for expressing queries and functional dependencies on a single database in tabular form. In our thesis, we first introduce a tabular representation of GLAV mappings. We find that this tabular representation helps to solve many mapping-related algorithmic and semantic problems. For example, a well-known problem is to find the minimal instance of the target schema for a given instance of the source schema and a set of mappings between the source and the target schema. Second, we show that our proposed tabular mapping can be used as an operator on an instance of the source schema to produce an instance of the target schema which is `minimal' and `most general' in nature. There exists a tableaux-based mechanism for finding the equivalence of two queries. Third, we extend that mechanism for deducing equivalence between two schema mappings using their corresponding tabular representations. Sometimes there exist redundant conjuncts in a schema mapping, which make data exchange, data integration and data sharing operations more time consuming. Fourth, we present an algorithm that utilizes the tabular representations for reducing the number of constraints in the schema mappings. At present, either schema-level mappings or data-level mappings are used for data sharing purposes. Fifth, we introduce and give the semantics of bi-level mappings that combine the schema-level and data-level mappings. We also show that bi-level mappings are more effective for data sharing systems. Finally, we implemented our algorithms and developed a software prototype to evaluate our proposed strategies.
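The idea of a mapping acting as an operator that produces a "minimal" and "most general" target instance can be illustrated with a toy chase-like step: source tuples matching a pattern produce target tuples, and target attributes with no source counterpart receive fresh labeled nulls. This sketch invents its own representation and is not the thesis's formalism.

```python
# Toy illustration: a tabular mapping applied to a source instance.
# Attributes present in the source pattern are copied; existential
# target attributes get fresh labeled nulls, yielding a most-general
# target instance. All names here are invented for the example.
import itertools

null_ids = itertools.count(1)

def apply_mapping(source, mapping):
    """source: list of row dicts; mapping: (source_attrs, target_attrs)."""
    src_attrs, tgt_attrs = mapping
    target = []
    for row in source:
        out = {}
        for attr in tgt_attrs:
            if attr in src_attrs:
                out[attr] = row[attr]        # copied from the source
            else:
                out[attr] = f"N{next(null_ids)}"  # fresh labeled null
        target.append(out)
    return target

source = [{"name": "Alice", "dept": "CS"}]
mapping = (("name", "dept"), ("name", "dept", "office"))
target = apply_mapping(source, mapping)
# The unknown office becomes the labeled null "N1" rather than a guess.
```

Labeled nulls are what make the result "most general": any concrete target instance satisfying the mapping can be obtained by substituting values for the nulls.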
A Practical Approach to Merging Multidimensional Data Models. Mireku Kwakye, Michael. 30 November 2011.
Schema merging is the process of incorporating data models into an integrated, consistent schema from which query solutions satisfying all incorporated models can be derived. The efficiency of such a process is reliant on the effective semantic representation of the chosen data models, as well as the mapping relationships between the elements of the source data models.
Consider a scenario where, as a result of company mergers or acquisitions, a number of related, but possibly disparate, data marts need to be integrated into a global data warehouse. The ability to retrieve data across these disparate, but related, data marts poses an important challenge. Intuitively, forming an all-inclusive data warehouse involves the tedious tasks of identifying related fact and dimension table attributes, as well as the design of a schema merge algorithm for the integration. Additionally, the evaluation of the combined set of correct answers to queries, likely to be independently posed to such data marts, becomes difficult to achieve.
Model management refers to a high-level, abstract programming language designed to efficiently manipulate schemas and mappings. In particular, model management operations such as match, compose mappings, apply functions and merge offer a way to handle the above-mentioned data integration problem within the domain of data warehousing.
In this research, we introduce a methodology for the integration of star schema source data marts into a single consolidated data warehouse based on model management. In our methodology, we develop three main streamlined steps to facilitate the generation of a global data warehouse. That is, we adopt techniques for deriving attribute correspondences and for schema mapping discovery. Finally, we formulate and design a merge algorithm based on multidimensional star schemas, which is the core contribution of this research. Our approach focuses on delivering a polynomial-time solution needed for the expected volume of data and its associated large-scale query processing.
The experimental evaluation shows that an integrated schema, alongside instance data, can be derived based on the type of mappings adopted in the mapping discovery step. The adoption of Global-And-Local-As-View (GLAV) mapping models delivered a maximally-contained or exact representation of all fact and dimensional instance data tuples needed in query processing on the integrated data warehouse. Additionally, different forms of conflicts, such as semantic conflicts for related or unrelated dimension entities and descriptive conflicts for differing attribute data types, were encountered and resolved in the developed solution. Finally, this research has highlighted some critical and inherent issues regarding functional dependencies in mapping models, integrity constraints at the source data marts, and multi-valued dimension attributes. These issues were encountered during the integration of the source data marts, when evaluating the queries processed on the merged data warehouse against those processed on the independent data marts.
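The attribute-correspondence step of such a merge can be sketched with a simple name-similarity matcher: attributes from two fact tables are paired when their normalised names are similar enough. Real matchers also use data types, instance values and structure; the table names and the 0.8 threshold below are illustrative assumptions, not the thesis's actual technique.

```python
# Sketch of deriving attribute correspondences between two star-schema
# fact tables by name similarity. Names and threshold are invented.
from difflib import SequenceMatcher

def correspondences(attrs_a, attrs_b, threshold=0.8):
    """Pair attributes whose normalised names are similar enough."""
    norm = lambda s: s.replace("_", "").lower()
    pairs = []
    for a in attrs_a:
        for b in attrs_b:
            score = SequenceMatcher(None, norm(a), norm(b)).ratio()
            if score >= threshold:
                pairs.append((a, b, round(score, 2)))
    return pairs

sales = ["ProductKey", "StoreKey", "SalesAmount"]
orders = ["product_key", "store_key", "order_total"]
matches = correspondences(sales, orders)
# ProductKey/product_key and StoreKey/store_key are paired; the
# semantically related SalesAmount/order_total pair is missed, which is
# why real matchers cannot rely on names alone.
```

The missed pair shows why the mapping-discovery step that follows correspondence derivation still matters.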
Modeling gene regulatory networks through data integration. Azizi, Elham. 12 March 2016.
Modeling gene regulatory networks has become a problem of great interest in biology and medical research. Most common methods for learning regulatory dependencies rely on observations in the form of gene expression data.
In this dissertation, computational models for gene regulation have been developed based on constrained regression by integrating comprehensive gene expression data for M. tuberculosis with genome-scale ChIP-Seq interaction data. The resulting models confirmed predictive power for expression in independent stress conditions and identified mechanisms driving hypoxic adaptation and lipid metabolism in M. tuberculosis.
I then used the regulatory network model for M. tuberculosis to identify factors responding to stress conditions and drug treatments, revealing drug synergies and conditions that potentiate drug treatments. These results can guide and optimize the design of drug treatments for this pathogen.
I took the next step in this direction by proposing a new probabilistic framework for learning modular structures in gene regulatory networks from gene expression and protein-DNA interaction data, combining the ideas of module networks and stochastic blockmodels. These models also capture combinatorial interactions between regulators. Comparisons with other network modeling methods that rely solely on expression data showed the essentiality of integrating ChIP-Seq data in identifying direct regulatory links in M. tuberculosis. Moreover, this work demonstrates the theoretical advantages of integrating ChIP-Seq data for the class of widely used module network models.
The systems approach and statistical modeling presented in this dissertation can also be applied to problems in other organisms. A similar approach was taken to model the regulatory network controlling genes with circadian gene expression in Neurospora crassa, through integrating time-course expression data with ChIP-Seq data. The models explained combinatorial regulations leading to different phase differences in circadian rhythms. The Neurospora crassa network model also works as a tool to manipulate the phases of target genes.
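The basic integration idea — constraining a regression of target-gene expression on regulator activities so that only ChIP-Seq-supported interactions receive nonzero coefficients — can be sketched as a masked least-squares fit. The data, learning rate, and mask below are synthetic; the dissertation's actual models are considerably more sophisticated.

```python
# Minimal sketch of ChIP-Seq-constrained regression: coefficient w[j]
# stays zero unless ChIP-Seq reports binding by regulator j. All data
# here is synthetic and purely illustrative.

def constrained_fit(X, y, allowed, steps=2000, lr=0.05):
    """Batch gradient descent on squared error; masked weights frozen."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        residuals = [sum(w[k] * X[i][k] for k in range(d)) - y[i]
                     for i in range(n)]
        for j in range(d):
            if not allowed[j]:
                continue                  # no ChIP-Seq peak: keep w[j] = 0
            grad = sum(residuals[i] * X[i][j] for i in range(n)) / n
            w[j] -= lr * grad
    return w

# Two candidate regulators; only the first has ChIP-Seq binding support.
X = [[1.0, 0.5], [2.0, 1.0], [3.0, 1.5], [4.0, 2.0]]
y = [2.0, 4.0, 6.0, 8.0]   # target tracks regulator 0 with slope 2
w = constrained_fit(X, y, allowed=[True, False])
# w[0] converges to about 2.0 while w[1] remains exactly 0.
```

Note that the two regulator columns are collinear here, so without the ChIP-Seq mask the fit would be underdetermined — one motivation for integrating binding data.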
Integrative analysis of complex genomic and epigenomic maps. Sharma, Supriya. 20 February 2018.
Modern healthcare research demands collaboration across disciplines to build preventive measures and innovate predictive capabilities for curing diseases. Along with the emergence of cutting-edge computational and statistical methodologies, data generation and analysis have become cheaper in the last ten years. However, the complexity of big data due to its variety, volume, and velocity creates new challenges for biologists, physicians, bioinformaticians, statisticians, and computer scientists. Combining data from multiple complex profiles is useful to better understand cellular functions and the pathways that regulate cell function, providing insights that could not have been obtained using the individual profiles alone. However, current normalization and artifact correction methods are platform- and data-type-specific, and may require both the training and test sets for any application (e.g. biomarker development). This often leads to over-fitting and reduces the reproducibility of genomic findings across studies. In addition, many bias correction and integration approaches require renormalization or reanalysis if additional samples are later introduced. The motivation behind this research was to develop and evaluate strategies for addressing data integration issues across data types and profiling platforms, which should improve healthcare-informatics research and its application in personalized medicine. We have demonstrated a comprehensive and coordinated framework for data standardization across tissue types and profiling platforms. This allows easy integration of data from multiple data-generating consortiums. The main goal of this research was to identify regions of genetic-epigenetic coordination that are independent of tissue type and consistent across epigenomic profiling platforms.
We developed multi-'omic' therapeutic biomarkers for epigenetic drug efficacy by combining our biomarker regions with drug perturbation data generated in our previous studies. We used an adaptive Bayesian factor analysis approach to develop biomarkers for multiple HDACs simultaneously, allowing for predictions of comparative efficacy between the drugs. We showed that this approach leads to different predictions across breast cancer subtypes compared to profiling the drugs separately. We extended this approach to patient samples from multiple public data resources containing epigenetic profiling data from cancer and normal tissues (The Cancer Genome Atlas, TCGA; NIH Roadmap epigenomics data).
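The abstract does not specify the standardization method, but quantile normalization is one widely used way to make profiles from different platforms comparable, shown here as an illustrative sketch: each sample's values are replaced by the mean of the values holding the same rank across all samples.

```python
# Illustrative cross-platform standardization via quantile
# normalization (one common technique; not necessarily the framework
# used in this dissertation). Data values are synthetic.

def quantile_normalize(samples):
    """samples: list of equal-length lists of measurements."""
    n = len(samples[0])
    ranked = [sorted(s) for s in samples]
    # Mean value at each rank across all samples.
    rank_means = [sum(r[i] for r in ranked) / len(samples) for i in range(n)]
    normalized = []
    for s in samples:
        order = sorted(range(n), key=lambda i: s[i])   # indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized

a = [5.0, 2.0, 3.0]   # e.g. one profiling platform
b = [4.0, 1.0, 6.0]   # e.g. another platform
na, nb = quantile_normalize([a, b])
# Both samples now share the same value distribution, while each
# sample's internal ranking of features is preserved.
```

Making distributions identical across platforms is what allows downstream biomarker models to be trained without platform-specific renormalization.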
Formalização do processo de tradução de consultas em ambientes de integração de dados XML / Formalization of a query translation process in XML data integration. Alves, Willian Bruno Gomes. January 2008.
A fim de consultar uma mesma informação em fontes XML heterogêneas seria desejável poder formular uma única consulta em relação a um esquema global conceitual e então traduzi-la automaticamente para consultas XML para cada uma das fontes. CXPath (Conceptual XPath) é uma proposta de linguagem para consultar fontes XML em um nível conceitual. Essa linguagem foi desenvolvida para simplificar o processo de tradução de consultas em nível conceitual para consultas em nível XML. Ao mesmo tempo, a linguagem tem como objetivo a facilidade de aprendizado de sua sintaxe. Por essa razão, sua sintaxe é bastante semelhante à da linguagem XPath utilizada para consultar documentos XML. Nesta dissertação é definido formalmente o mecanismo de tradução de consultas em nível conceitual, escritas em CXPath, para consultas em nível XML, escritas em XPath. É mostrado o tratamento do relacionamento de herança no mecanismo de tradução, e é feita uma discussão sobre a relação entre a expressividade do modelo conceitual e o mecanismo de tradução. Existem situações em que a simples tradução de uma consulta CXPath não contempla alguns resultados, pois as fontes de dados podem ser incompletas. Neste trabalho, o modelo conceitual que constitui o esquema global do sistema de integração de dados é estendido com dependências de inclusão e o mecanismo de resolução de consultas é modificado para lidar com esse tipo de dependência. Mais especificamente, são apresentados mecanismos de reescrita e eliminação de redundâncias de consultas a fim de lidar com essas dependências. Com o aumento de expressividade do esquema global é possível inferir resultados, a partir dos dados disponíveis no sistema de integração, que antes não seriam contemplados com a simples tradução de uma consulta. Também é apresentada a abordagem para integração de dados utilizada nesta dissertação de acordo com o arcabouço formal para integração de dados proposto por (LENZERINI, 2002). De acordo com o autor, tal arcabouço é geral o bastante para capturar todas as abordagens para integração de dados da literatura, o que inclui a abordagem aqui mostrada.

In order to search for the same information in heterogeneous XML data sources, it would be desirable to state a single query against a global conceptual schema and then translate it automatically into an XML query for each specific data source. CXPath (for Conceptual XPath) has been proposed as a language for querying XML sources at the conceptual level. This language was developed to simplify the process of translating queries at the conceptual level into queries at the XML level. At the same time, one of the goals of the language design is to facilitate the learning of its syntax. For this reason its syntax is similar to that of the XPath language used for querying XML documents. In this dissertation, a translation mechanism from queries at the conceptual level, written in CXPath, to queries at the XML level, written in XPath, is formally defined. The treatment of the inheritance relationship in the translation mechanism is shown, and the relation between the expressivity of the conceptual model and the translation mechanism is discussed. In some cases, the translation of a CXPath query does not return some of the answers because the data sources may be incomplete. In this work, the conceptual model that constitutes the data integration system's global schema is extended with inclusion dependencies, and the query answering mechanism is modified to deal with this kind of dependency. More specifically, mechanisms for query rewriting and redundancy elimination are presented to handle these dependencies. This increase in the global schema's expressivity makes it possible to infer results, from the data available in the integration system, that would not be obtained by simple query translation. The data integration approach used in this dissertation is also presented within the formal framework for data integration proposed by (LENZERINI, 2002). According to the author, that framework is general enough to capture all data integration approaches in the literature, including, in particular, the approach considered in this dissertation.
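The core of conceptual-to-XML query translation can be sketched as a per-source lookup: each conceptual step is mapped, for each source, to the XML path fragment that implements it. The mapping table, source names, and concept names below are invented for illustration; CXPath's actual translation mechanism is formally defined in the dissertation.

```python
# Illustrative sketch of translating a conceptual-level path query into
# source-specific XPath. The mapping table is hypothetical.

mappings = {
    "source1": {"Author": "/library/authors/author", "name": "name"},
    "source2": {"Author": "/catalog/writer", "name": "@fullName"},
}

def translate(conceptual_query, source):
    """Translate a conceptual path like 'Author/name' for one source."""
    steps = [mappings[source][step] for step in conceptual_query.split("/")]
    return "/".join(steps)

q = "Author/name"
# The same conceptual query yields a different XPath for each source.
x1 = translate(q, "source1")   # "/library/authors/author/name"
x2 = translate(q, "source2")   # "/catalog/writer/@fullName"
```

Inheritance and inclusion dependencies, which the dissertation treats formally, would require expanding one conceptual step into a union of several source paths rather than a single lookup.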
Redução do esforço do usuário na configuração da deduplicação de grandes bases de dados / Reducing the user effort to tune large scale deduplication. Dal Bianco, Guilherme. January 2014.
A deduplicação consiste na tarefa de identificar quais objetos (registros, documentos, textos, etc.) são potencialmente os mesmos em uma base de dados (ou em um conjunto de bases de dados). A identificação de dados duplicados depende da intervenção do usuário, principalmente para a criação de um conjunto contendo pares duplicados e não duplicados. Tais informações são usadas para ajudar na identificação de outros possíveis pares duplicados presentes na base de dados. Em geral, quando a deduplicação é estendida para grandes conjuntos de dados, a eficiência e a qualidade das duplicatas dependem diretamente do "ajuste" de um usuário especialista. Nesse cenário, a configuração das principais etapas da deduplicação (etapas de blocagem e classificação) demanda que o usuário seja responsável pela tarefa pouco intuitiva de definir valores de limiares e, em alguns casos, fornecer pares manualmente rotulados. Desse modo, o processo de calibração exige que o usuário detenha um conhecimento prévio sobre as características específicas da base de dados e os detalhes do funcionamento do método de deduplicação. O objetivo principal desta tese é tratar do problema da configuração da deduplicação de grandes bases de dados, de modo a reduzir o esforço do usuário. O usuário deve ser somente requisitado para rotular um conjunto reduzido de pares automaticamente selecionados. Para isso, é proposta uma metodologia, chamada FS-Dedup, que incorpora algoritmos do estado da arte da deduplicação para permitir o processamento de grandes volumes de dados e adiciona um conjunto de estratégias com intuito de possibilitar a definição dos parâmetros do deduplicador, removendo os detalhes de configuração da responsabilidade do usuário. A metodologia pode ser vista como uma camada capaz de identificar as informações requisitadas pelo deduplicador (principalmente valores de limiares) a partir de um conjunto de pares rotulados pelo usuário. A tese propõe também uma abordagem que trata do problema da seleção dos pares informativos para a criação de um conjunto de treinamento reduzido. O desafio maior é selecionar um conjunto reduzido de pares suficientemente informativo para possibilitar a configuração da deduplicação com uma alta eficácia. Para isso, são incorporadas estratégias para reduzir o volume de pares candidatos a um algoritmo de aprendizagem ativa. Tal abordagem é integrada à metodologia FS-Dedup para possibilitar a remoção da intervenção especialista nas principais etapas da deduplicação. Por fim, um conjunto exaustivo de experimentos é executado com objetivo de validar as ideias propostas. Especificamente, são demonstrados os promissores resultados alcançados nos experimentos em bases de dados reais e sintéticas, com intuito de reduzir o número de pares manualmente rotulados, sem causar perdas na qualidade da deduplicação.

Deduplication is the task of identifying which objects (e.g., records, texts, documents, etc.) are potentially the same in a given dataset (or datasets). It usually requires user intervention in several stages of the process, mainly to ensure that pairs representing matches and non-matches can be determined. This information can be used to help detect other potential duplicate records. When deduplication is applied to very large datasets, the matching quality depends on expert users, who are requested to define threshold values and produce a training set. This intervention requires user knowledge of the noise level of the data and of the particular deduplication approach, so that the most important stages of the process (e.g. blocking and classification) can be configured. The main aim of this thesis is to provide solutions that help tune the deduplication process on large datasets with reduced effort from the user, who is only required to label an automatically selected subset of pairs. To achieve this, we propose a methodology, called FS-Dedup, which incorporates state-of-the-art algorithms in its deduplication core to address high-performance issues. Following this, a set of strategies is proposed to assist in setting its parameters, removing most of the detailed configuration concerns from the user. The proposed methodology can be regarded as a layer that is able to identify the specific information required by the deduplication approach (mainly, threshold values) from pairs that are manually labeled by the user. Moreover, this thesis proposes an approach for selecting an informative set of pairs to produce a reduced training set. The main challenge here is how to select a "representative" set of pairs to configure the deduplication with high matching quality. In this context, the proposed approach incorporates an active learning method with strategies that allow the deduplication to be carried out on large datasets. This approach is integrated with the FS-Dedup methodology to avoid the need for manually defined threshold values in the most important deduplication stages. Finally, exhaustive experiments using both synthetic and real datasets have been conducted to validate the ideas outlined in this thesis. In particular, we demonstrate the ability of our approach to reduce the user effort without degrading the matching quality.
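The two stages the thesis tunes — blocking and threshold-based classification — can be sketched compactly: a cheap blocking key limits which record pairs are compared, and a similarity threshold then classifies the surviving pairs as matches. The blocking key and the 0.8 threshold below are illustrative assumptions; FS-Dedup's contribution is precisely to infer such settings from a small set of user-labeled pairs instead of requiring the user to pick them.

```python
# Sketch of blocking + threshold classification for deduplication.
# Records, blocking key, and threshold are invented for the example.
from collections import defaultdict
from difflib import SequenceMatcher

records = [
    (1, "John Smith"), (2, "Jon Smith"),
    (3, "Mary Jones"), (4, "Maria Jones"),
]

# Blocking: group records by the first letter of the surname, so only
# records sharing a block are ever compared.
blocks = defaultdict(list)
for rid, name in records:
    blocks[name.split()[-1][0]].append((rid, name))

# Classification: compare pairs within each block; pairs whose name
# similarity clears the threshold are declared matches.
matches = []
for group in blocks.values():
    for i in range(len(group)):
        for j in range(i + 1, len(group)):
            (id1, n1), (id2, n2) = group[i], group[j]
            if SequenceMatcher(None, n1, n2).ratio() >= 0.8:
                matches.append((id1, id2))
# matches pairs the two Smiths and the two Joneses.
```

With 4 records, blocking cuts the comparisons from 6 candidate pairs to 2 — the quadratic savings that make deduplication feasible on large datasets.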