11

Keyword Join: Realizing Keyword Search for Information Integration

Yu, Bei, Liu, Ling, Ooi, Beng Chin, Tan, Kian Lee 01 1900 (has links)
Information integration has been widely addressed over the last several decades. However, it is far from solved due to the complexity of resolving schema and data heterogeneities. In this paper, we present our attempt to alleviate this difficulty by realizing keyword search functionality for integrating information from heterogeneous databases. Our solution does not require a predefined global schema or any mappings between databases. Rather, it relies on an operator called keyword join, which takes a set of lists of partial answers from different data sources as input and outputs a list of integrated results assembled by joining tuples from the input lists according to predefined similarity measures. Our system allows source databases to remain autonomous and keeps the system dynamic and extensible. We have tested our system with a real dataset and a benchmark, which shows that the proposed method is practical and effective. / Singapore-MIT Alliance (SMA)
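As a rough illustration of the idea described above (not the thesis's actual operator or similarity measures, which are defined in the paper), the following Python sketch joins partial-answer tuples from two hypothetical sources whenever a pair of their values is sufficiently similar:

```python
# Hypothetical keyword-join-style operator: combine partial answers (tuples) from
# different sources into integrated results when values across tuples are similar.
from difflib import SequenceMatcher
from itertools import product

def similar(a: str, b: str, threshold: float = 0.75) -> bool:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def keyword_join(partial_answer_lists, threshold=0.75):
    """Join tuples across input lists when values from different tuples are similar."""
    results = []
    for combo in product(*partial_answer_lists):            # one tuple from each source
        joined = any(
            similar(x, y, threshold)
            for i, t1 in enumerate(combo) for t2 in combo[i + 1:]
            for x in t1 for y in t2
        )
        if joined:
            results.append(tuple(v for t in combo for v in t))
    return results

# Hypothetical partial answers from two autonomous sources
source_a = [("J. Smith", "Data Integration Survey")]
source_b = [("John Smith", "NUS"), ("Jane Doe", "MIT")]
print(keyword_join([source_a, source_b]))
# -> [('J. Smith', 'Data Integration Survey', 'John Smith', 'NUS')]
```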
12

Schema quality analysis in a data integration system

BATISTA, Maria da Conceição Moraes 31 January 2008 (has links)
Information Quality (IQ) has become a critical concern in organizations and in information systems research. Poor-quality information can have negative impacts on an organization's effectiveness. The growing use of data warehouses and the direct access by managers and users to information drawn from multiple sources have increased the need for quality in corporate information. The notion of IQ in information systems has emerged in recent years and attracts ever-growing interest. There is still no common agreement on a definition of IQ, only a consensus that it is a concept of "fitness for use": information is considered fit for use from the perspective of a user's requirements and needs, that is, information quality depends on its usefulness. Integrated access to information spread across multiple heterogeneous, distributed, and autonomous data sources is an important problem to be solved in many application domains. Typically there are several ways to obtain answers to global queries over data in different sources, with different combinations; however, it is quite costly to obtain all possible answers. While much research has been done on query processing and plan selection based on cost criteria, little is known about the problem of incorporating IQ aspects into the global schemas of data integration systems. In this work, we propose the analysis of IQ in a data integration system, more specifically the quality of the system's schemas. Our main objective is to improve the quality of query execution. Our proposal is based on the hypothesis that one way to optimize query processing is to build schemas with high IQ scores. Thus, the focus of this work is the development of IQ analysis mechanisms aimed at data integration schemas, especially the global schema. Initially, we build a list of IQ criteria and relate these criteria to the elements present in data integration systems. We then turn to the integrated schema and formally specify schema quality criteria: minimality, schema completeness, and type consistency. We also specify an algorithm that applies adjustments to improve minimality, and algorithms to measure type consistency in schemas. With our experiments we show that the execution time of a query in a data integration system can decrease when the query is submitted against a schema with high minimality and type consistency scores. / Conselho Nacional de Desenvolvimento Científico e Tecnológico
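To make the notion of a schema quality score more concrete, here is a small, hypothetical Python sketch that scores minimality as the fraction of non-redundant attributes in an integrated schema; the function and the scoring rule are illustrative assumptions, not the formal criteria specified in the thesis:

```python
# Hypothetical minimality score for an integrated schema: redundant (duplicated)
# attributes lower the score; a fully non-redundant schema scores 1.0.
from collections import Counter

def minimality(integrated_schema: list[str]) -> float:
    """1.0 when no attribute is modeled more than once; lower when redundancy exists."""
    counts = Counter(attr.lower() for attr in integrated_schema)
    redundant = sum(c - 1 for c in counts.values())        # extra copies are redundant
    return 1.0 - redundant / len(integrated_schema)

# Hypothetical global schema in which 'name' is modeled twice
global_schema = ["id", "name", "name", "address", "phone"]
print(f"minimality = {minimality(global_schema):.2f}")     # minimality = 0.80
```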
13

Open City Data Pipeline

Bischof, Stefan, Kämpgen, Benedikt, Harth, Andreas, Polleres, Axel, Schneider, Patrik 02 1900 (has links) (PDF)
Statistical data about cities, regions, and countries is collected for various purposes and from various institutions. Yet, while access to high-quality and recent data of this kind is crucial both for decision makers and for the public, all too often such collections of data remain isolated and not re-usable, let alone properly integrated. In this paper we present the Open City Data Pipeline, a focused attempt to collect, integrate, and enrich statistical data collected at city level worldwide, and republish this data in a reusable manner as Linked Data. The main features of the Open City Data Pipeline are: (i) we integrate and cleanse data from several sources in a modular and extensible, always up-to-date fashion; (ii) we use both Machine Learning techniques and ontological reasoning over equational background knowledge to enrich the data by imputing missing values; (iii) we assess the estimated accuracy of such imputations per indicator. Additionally, (iv) we make the integrated and enriched data available both in a web browser interface and as machine-readable Linked Data, using standard vocabularies such as QB and PROV, and linking to e.g. DBpedia. Lastly, in an exhaustive evaluation of our approach, we compare our enrichment and cleansing techniques to a preliminary version of the Open City Data Pipeline presented at ISWC2015: firstly, we demonstrate that the combination of equational knowledge and standard machine learning techniques significantly helps to improve the quality of our missing value imputations; secondly, we arguably show that the more data we integrate, the more reliable our predictions become. Hence, over time, the Open City Data Pipeline shall provide a sustainable effort to serve Linked Data about cities in increasing quality. / Series: Working Papers on Information Systems, Information Business and Operations
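As a toy illustration of point (ii), combining equational background knowledge with a learned fallback model, the sketch below imputes a city's population density from an assumed equation when its inputs are present and otherwise falls back to a regression; the indicators, equation, and features are hypothetical, not the pipeline's actual models:

```python
# Hypothetical imputation: use equational knowledge (density = population / area)
# when possible, otherwise fall back to a regression model trained on known cities.
from sklearn.linear_model import LinearRegression
import numpy as np

def impute_density(population, area_km2, model, features):
    if population is not None and area_km2:            # equational knowledge first
        return population / area_km2
    return float(model.predict([features])[0])         # fall back to regression

# Toy fallback model trained on cities where density is known
X_known = np.array([[1.0, 500.0], [3.5, 1200.0], [0.2, 80.0]])  # e.g. GDP, dwellings
y_known = np.array([4500.0, 5200.0, 1800.0])                    # persons per km^2
model = LinearRegression().fit(X_known, y_known)

print(impute_density(1_900_000, 415.0, model, [2.1, 900.0]))  # uses the equation
print(impute_density(None, None, model, [2.1, 900.0]))        # uses the learned model
```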
14

On techniques for pay-as-you-go data integration of linked data

Christodoulou, Klitos January 2015 (has links)
It is recognised that, nowadays, users interact with large amounts of data that exist in disparate forms and are stored under different settings. Moreover, the amount of structured and unstructured data outside a single well-organised data management system is expanding rapidly. To address the recent challenges of managing large amounts of potentially distributed data, the vision of a dataspace was introduced. This data management paradigm aims at reducing the complexity behind the challenges of integrating heterogeneous data sources. Recently, efforts by the Linked Data (LD) community gave rise to a Web of Data (WoD) that interweaves with the current Web of documents in a way that is useful for data consumption by both humans and computational agents. On the WoD, datasets are structured under a common data model and published as Web resources following a simple set of guidelines that enables them to be linked with other pieces of data, as well as to be annotated with useful metadata that helps determine their semantics. The WoD is an evolving open ecosystem including specialist publishers as well as community efforts aiming at re-publishing isolated databases as LD on the WoD and annotating them with metadata. The WoD raises new opportunities and challenges; however, it currently relies mostly on manual effort for integrating the large number of heterogeneous data sources on the WoD. This dissertation makes the case that several techniques from the dataspaces research area (aiming at on-demand integration of data sources in a pay-as-you-go fashion) can support the integration of heterogeneous WoD sources. In so doing, this dissertation explores the opportunities and identifies the challenges of adapting existing pay-as-you-go data integration techniques in the context of LD. More specifically, this dissertation makes the following contributions: (1) a case study for identifying the challenges when existing pay-as-you-go data integration techniques are applied in a setting where data sources are LD; (2) a methodology that deals with the 'schema-less' nature of LD sources by automatically inferring a conceptual structure from a given RDF graph, thus enabling downstream tasks, such as the identification of matches and the derivation of mappings, which are both essential for the automatic bootstrapping of a dataspace; and (3) a well-defined, principled methodology that builds on a Bayesian inference technique for reasoning under uncertainty to improve pay-as-you-go integration. Although the developed methodology is generic in being able to reason with different hypotheses, its effectiveness has only been explored for reducing the uncertain decisions made by string-based matchers during the matching stage of a dataspace system.
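As a small illustration of contribution (2), the sketch below infers a crude conceptual structure from an RDF graph by grouping the properties observed for each rdf:type; this is a deliberate simplification over assumed example data, not the inference methodology developed in the dissertation:

```python
# Hypothetical structural inference: group subjects by rdf:type and collect the
# properties used with each type, yielding a rough class -> properties summary.
from collections import defaultdict
from rdflib import Graph, RDF

data = """
@prefix ex: <http://example.org/> .
ex:alice a ex:Person ; ex:name "Alice" ; ex:worksFor ex:acme .
ex:acme  a ex:Company ; ex:name "ACME" .
"""

g = Graph()
g.parse(data=data, format="turtle")

structure = defaultdict(set)                      # class IRI -> property IRIs
for subj, cls in g.subject_objects(RDF.type):
    for pred, _obj in g.predicate_objects(subj):
        if pred != RDF.type:
            structure[cls].add(pred)

for cls, props in structure.items():
    print(cls, "->", sorted(str(p) for p in props))
```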
15

Leveraging big data resources and data integration in biology: applying computational systems analyses and machine learning to gain insights into the biology of cancers

Sinkala, Musalula 24 February 2021 (has links)
Recently, many "molecular profiling" projects have yielded vast amounts of genetic, epigenetic, transcription, protein expression, metabolic and drug response data for cancerous tumours, healthy tissues, and cell lines. We aim to facilitate a multi-scale understanding of these high-dimensional biological data and the complexity of the relationships between the different data types taken from human tumours. Further, we intend to identify molecular disease subtypes of various cancers, uncover the subtype-specific drug targets and identify sets of therapeutic molecules that could potentially be used to inhibit these targets. We collected data from over 20 publicly available resources. We then leverage integrative computational systems analyses, network analyses and machine learning to gain insights into the pathophysiology of pancreatic cancer and 32 other human cancer types. Here, we uncover aberrations in multiple cell signalling and metabolic pathways that implicate regulatory kinases and the Warburg effect as the likely drivers of the distinct molecular signatures of three established pancreatic cancer subtypes. Then, we apply an integrative clustering method to four different types of molecular data to reveal that pancreatic tumours can be segregated into two distinct subtypes. We define sets of proteins, mRNAs, miRNAs and DNA methylation patterns that could serve as biomarkers to accurately differentiate between the two pancreatic cancer subtypes. Then we confirm the biological relevance of the identified biomarkers by showing that these can be used together with pattern-recognition algorithms to infer the drug sensitivity of pancreatic cancer cell lines accurately. Further, we evaluate the alterations of metabolic pathway genes across 32 human cancers. We find that while alterations of metabolic genes are pervasive across all human cancers, the extent of these gene alterations varies between them. Based on these gene alterations, we define two distinct cancer supertypes that tend to be associated with different clinical outcomes and show that these supertypes are likely to respond differently to anticancer drugs. Overall, we show that the time has arrived when we can leverage available data resources to potentially elicit more precise and personalised cancer therapies that would yield better clinical outcomes at a much lower cost than is currently being achieved.
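The abstract does not name the integrative clustering method used, so the following sketch is only a generic, hypothetical illustration of clustering samples across several omics layers by standardising each layer, concatenating features, and running k-means with two clusters:

```python
# Naive multi-omics clustering illustration: standardise each layer, concatenate
# features, and cluster samples into two putative subtypes. Toy random data only.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_samples = 100
omics_layers = {                                  # toy stand-ins for real matrices
    "mrna":        rng.normal(size=(n_samples, 50)),
    "protein":     rng.normal(size=(n_samples, 30)),
    "mirna":       rng.normal(size=(n_samples, 20)),
    "methylation": rng.normal(size=(n_samples, 40)),
}

scaled = [StandardScaler().fit_transform(X) for X in omics_layers.values()]
combined = np.hstack(scaled)                      # samples x all features

subtypes = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(combined)
print(np.bincount(subtypes))                      # sizes of the two putative subtypes
```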
16

Mashup-Werkzeuge zur Ad-hoc-Datenintegration im Web

Aumüller, David, Thor, Andreas 05 November 2018 (has links)
No description available.
17

Ontology-driven Data Integration for Clinical Sleep Research

Mueller, Remo Sebastian 07 July 2011 (has links)
No description available.
18

Ontology Development and Utilization in Product Design

Chang, Xiaomeng 01 May 2008 (has links)
Currently, computer-based support tools are widely used to facilitate the design process and have the potential to reduce design time, decrease product cost and enhance product quality. PDM (Product Data Management) and PLM (Product Lifecycle Management) are two types of computer-based information systems that have been developed to manage the product lifecycle and product-related data. While promising, these systems still have significant limitations: information required to make decisions may not be available, may lack consistency, and may not be expressed in a general way for sharing among systems. Moreover, it is difficult for designers to consider multiple complex technical and economical criteria, relations, and objectives in product design simultaneously. In recent years, the ontology-based method has emerged as a new and promising approach to managing knowledge in engineering, integrating multiple data resources, and facilitating the consideration of the complex relations among concepts and slots in decision making. The purpose of this research is to explore an ontology-based method to address the limitations of present computer-based information systems for product design. The field of Design for Manufacturing (DFM) is selected for this study, and three primary aspects are investigated. First, a generalized DFM ontology is proposed and developed. The ontology fulfills the mathematical and logical constraints needed in DFM, as well as ontology editor capabilities to support the continuous improvement of the ontology. Second, the means to guide users to the proper information and to integrate heterogeneous data resources are investigated. Third, based on the ontology and information integration, a decision support tool is developed to help designers consider the design problem in a systematic way and make design decisions efficiently based on accurate and comprehensive data. The methods and tools developed in this research are refined using example cases provided by the CFSP (the NSF Center for Friction Stir Processing). This includes cost models and a decision support environment. Errors that may occur in the research are categorized, along with methods for managing them. An error ontology is built to help root-cause analysis of errors and further reduce possible errors in the ontology and decision support tool. An evaluation methodology for the research is also investigated. / Ph. D.
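For readers unfamiliar with what such an ontology might look like in practice, here is a small, hypothetical RDF/RDFS fragment in the spirit of a DFM ontology; the class and property names are illustrative assumptions, not the ontology developed in this work:

```python
# Hypothetical DFM-style ontology fragment expressed with RDF/RDFS via rdflib.
from rdflib import Graph, Namespace, RDF, RDFS, Literal

DFM = Namespace("http://example.org/dfm#")       # assumed namespace for illustration
g = Graph()
g.bind("dfm", DFM)

# Concepts (classes)
for cls in ("Part", "Material", "ManufacturingProcess"):
    g.add((DFM[cls], RDF.type, RDFS.Class))

# Relations (properties) with domain/range constraints
g.add((DFM.madeOf, RDF.type, RDF.Property))
g.add((DFM.madeOf, RDFS.domain, DFM.Part))
g.add((DFM.madeOf, RDFS.range, DFM.Material))

# An instance linking design data to manufacturing knowledge
g.add((DFM.bracket01, RDF.type, DFM.Part))
g.add((DFM.bracket01, DFM.madeOf, DFM.AluminiumAlloy))
g.add((DFM.bracket01, RDFS.label, Literal("Mounting bracket")))

print(g.serialize(format="turtle"))
```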
19

Differential Dependency Network and Data Integration for Detecting Network Rewiring and Biomarkers

Fu, Yi 30 January 2020 (has links)
Rapid advances in high-throughput molecular profiling techniques have enabled large-scale genomics, transcriptomics, and proteomics-based biomedical studies, generating an enormous amount of multi-omics data. Processing and summarizing multi-omics data, modeling interactions among biomolecules, and detecting condition-specific dysregulation using multi-omics data are some of the most important yet challenging analytics tasks. In the case of detecting somatic DNA copy number aberrations using bulk tumor samples in cancer research, normal cell contamination becomes one significant confounding factor that weakens the detection power regardless of which method is used. To address this problem, we propose a computational approach, BACOM 2.0, to more accurately estimate the normal cell fraction and accordingly reconstruct DNA copy number signals in cancer cells. Specifically, by introducing allele-specific absolute normalization, BACOM 2.0 can accurately detect deletion types and aneuploidy in cancer cells directly from DNA copy number data. Genes work through complex networks to support cellular processes. Dysregulated genes can cause structural changes in biological networks, also known as network rewiring. Genes with a large number of rewired edges are more likely to be associated with functional alterations leading to phenotype transitions, and hence are potential biomarkers in diseases such as cancers. The differential dependency network (DDN) method was proposed to detect such network rewiring and biomarkers. However, the existing DDN method and software tool have two major drawbacks. Firstly, with imbalanced sample groups, DDN suffers from systematic bias and produces false positive differential dependencies. Secondly, the computational time of the block coordinate descent algorithm in DDN increases rapidly with the number of involved samples and molecular entities. To address the imbalanced sample group problem, we propose a sample-scale-wide normalized formulation to correct the systematic bias and design a simulation study to test its performance. To address the high computational complexity, we propose several strategies to accelerate DDN learning: two reformulated algorithms for block-wise coefficient updating in the DDN optimization problem, one strategy for discarding predictors, and one strategy for accelerating parallel computing. Importantly, experimental results show that DDN learning with the combined accelerating strategies is hundreds of times faster than the original method on medium-sized data. We applied the DDN method to several biomedical omics datasets and detected significant phenotype-specific network rewiring. With a random-graph-based detection strategy, we discovered hub-node-defined biomarkers that helped to generate or validate several novel scientific hypotheses in collaborative research projects. For example, the hub genes detected by the DDN method in proteomics data from artery samples are significantly enriched in the citric acid cycle pathway, which plays a critical role in the development of atherosclerosis. To detect intra-omics and inter-omics network rewiring, we propose a method called multiDDN that uses a multi-layer signaling model to integrate multi-omics data. We adapt the block coordinate descent algorithm, together with the accelerating strategies, to solve the multiDDN optimization problem. The simulation study shows that, compared with the DDN method on single omics, the multiDDN method achieves considerably higher accuracy in detecting network rewiring. We applied the multiDDN method to real multi-omics data from the CPTAC ovarian cancer dataset and detected multiple hub genes that are associated with histone protein deacetylation and were previously reported in independent ovarian cancer data analyses. / Doctor of Philosophy / We witnessed the start of the Human Genome Project decades ago and have since stepped into the era of omics. Omics are comprehensive approaches for analyzing genome-wide biomolecular profiles. The rapid development of high-throughput technologies enables us to produce an enormous amount of omics data such as genomics, transcriptomics, and proteomics data, leaving researchers swimming in a sea of omics information that was once unimaginable. Yet the era of omics brings new challenges: to process the huge volumes of data, to summarize the data, to reveal the interactions between entities, to link various types of omics data, and to discover the mechanisms hidden behind omics data. In processing omics data, one factor that weakens follow-up data analysis is sample impurity. We call impure tumor samples contaminated by normal cells heterogeneous samples. The genomic signals measured from heterogeneous samples are a mixture of signals from both tumor cells and normal cells. To correct the mixed signals and recover the true signals of pure tumor cells, we propose a computational approach called BACOM 2.0 to estimate the normal cell fraction and correct the genomic signals accordingly. By introducing a novel normalization method that identifies the neutral component in the mixed signals of genomic copy number data, BACOM 2.0 can accurately detect genes' deletion types and abnormal chromosome numbers in tumor cells. In cells, genes connect to other genes and form complex biological networks to perform their functions. Dysregulated genes can cause structural changes in biological networks, also known as network rewiring. In a biological network with rewiring events, a large number of rewired edges linked to a single hub gene suggests concentrated gene dysregulation. Such a hub gene has more impact on the network and hence is more likely to be associated with a functional change of the network, which ultimately leads to abnormal phenotypes such as cancer. Therefore, the hub genes linked with network rewiring are potential indicators of disease status, also known as biomarkers. The differential dependency network (DDN) method was proposed to detect network rewiring events and biomarkers from omics data. However, the DDN method still has a few drawbacks. Firstly, for two groups of data with unequal sample sizes, DDN consistently detects false targets of network rewiring. The permutation test, which applies the same method to randomly shuffled samples and is supposed to distinguish true targets from random effects, suffers from the same problem and can let those false targets pass. We propose a new formulation that corrects the errors caused by unequal group sizes and design a simulation study to test its correctness. Secondly, the computing time for solving DDN problems becomes unbearably long when processing omics data with a large number of samples or genes. We propose several strategies to increase DDN's computation speed, including three redesigned formulas for efficiently updating the results, one rule to preselect predictor variables, and one acceleration technique that utilizes multiple CPU cores simultaneously. In our timing tests, the accelerated DDN method is much faster than the original method. To detect network rewiring within the same omics type or between different omics types, we propose a method called multiDDN that uses an integrated model to process multiple types of omics data. We solve the new problem by adapting the block coordinate descent algorithm. Tests on simulated data show that multiDDN outperforms single-omics DDN. We applied the DDN and multiDDN methods to several omics datasets, detected significant network rewiring associated with diseases, and identified hub nodes from the network rewiring events. These hub genes, as potential biomarkers, help us ask new, meaningful questions in related research.
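The DDN and multiDDN formulations are specific to the dissertation, but the general idea of detecting rewired edges and ranking hub genes can be illustrated with a simplified stand-in: estimate a sparse dependency network per condition (here with the graphical lasso) and count, per gene, the edges that differ between conditions. The data and parameters below are synthetic assumptions, not the dissertation's method:

```python
# Simplified stand-in for differential dependency analysis: one sparse network per
# condition, then rank genes by how many of their edges differ ("rewiring").
import numpy as np
from sklearn.covariance import GraphicalLasso

def adjacency(X, alpha=0.05):
    """Binary dependency network estimated from a samples x genes matrix."""
    prec = GraphicalLasso(alpha=alpha).fit(X).precision_
    A = (np.abs(prec) > 1e-6).astype(int)
    np.fill_diagonal(A, 0)
    return A

rng = np.random.default_rng(1)
n, p = 200, 10                                  # samples per condition, genes
X_control = rng.normal(size=(n, p))
X_disease = rng.normal(size=(n, p))
X_disease[:, 1] += 0.8 * X_disease[:, 2]        # inject a condition-specific dependency

rewired = adjacency(X_control) ^ adjacency(X_disease)    # edges present in only one condition
hub_scores = rewired.sum(axis=0)                          # rewired-edge count per gene
print("top rewired genes:", np.argsort(hub_scores)[::-1][:3])
```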
20

Análise gênica de comorbidades a partir da integração de dados epidemiológicos / Comorbidities genetic analysis from epidemiological data integration

Ferraz Néto, Karla 01 December 2014 (has links)
The identification of genes responsible for human diseases can provide knowledge about pathological and physiological mechanisms that is essential for the development of new diagnostics and therapies. It is known that a disease is rarely the consequence of an abnormality in a single gene, but rather reflects disorders of a complex intra- and intercellular network. Many methodologies known in Bioinformatics are able to prioritize genes related to a particular disease, and some approaches can also validate how relevant these genes are to the disease under study. One approach to gene prioritization is to investigate diseases that affect patients at the same time, i.e. comorbidities. There are many sources of biomedical data that can be used to collect comorbidities; in this way, we can collect pairs of diseases that form epidemiological comorbidities and then analyse the genes of each disease. This analysis allows us to expand the list of candidate genes for each of these diseases and to justify the genetic relationship between the comorbidities. The main objective of this project is the integration of epidemiological and genetic data to perform the prediction of disease-causing genes through the study of the comorbidity of these diseases.
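A minimal sketch of the comorbidity-based prioritization idea follows: genes already associated with one disease become candidates for a disease it frequently co-occurs with. The gene sets and the comorbid pair below are invented solely for illustration:

```python
# Hypothetical comorbidity-driven expansion of candidate gene lists.
disease_genes = {
    "type_2_diabetes": {"TCF7L2", "PPARG", "KCNJ11"},
    "hypertension":    {"AGT", "ACE", "PPARG"},
}

# Epidemiological data (not shown) would yield comorbid disease pairs like this.
comorbid_pairs = [("type_2_diabetes", "hypertension")]

def expand_candidates(disease: str) -> set[str]:
    """Add genes from every disease comorbid with `disease` to its candidate list."""
    candidates = set(disease_genes.get(disease, set()))
    for a, b in comorbid_pairs:
        if disease in (a, b):
            other = b if disease == a else a
            candidates |= disease_genes.get(other, set())
    return candidates

print(sorted(expand_candidates("type_2_diabetes")))
# shared genes (e.g. PPARG here) hint at a genetic link underlying the comorbidity
```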
