21

Differential Dependency Network and Data Integration for Detecting Network Rewiring and Biomarkers

Fu, Yi 30 January 2020 (has links)
Rapid advances in high-throughput molecular profiling techniques have enabled large-scale genomics, transcriptomics, and proteomics-based biomedical studies, generating an enormous amount of multi-omics data. Processing and summarizing multi-omics data, modeling interactions among biomolecules, and detecting condition-specific dysregulation using multi-omics data are some of the most important yet challenging analytics tasks. When detecting somatic DNA copy number aberrations from bulk tumor samples in cancer research, normal cell contamination is a significant confounding factor that weakens detection power regardless of the method used. To address this problem, we propose a computational approach, BACOM 2.0, to more accurately estimate the normal cell fraction and accordingly reconstruct DNA copy number signals in cancer cells. Specifically, by introducing allele-specific absolute normalization, BACOM 2.0 can accurately detect deletion types and aneuploidy in cancer cells directly from DNA copy number data. Genes work through complex networks to support cellular processes. Dysregulated genes can cause structural changes in biological networks, also known as network rewiring. Genes with a large number of rewired edges are more likely to be associated with functional alterations leading to phenotype transitions, and hence are potential biomarkers in diseases such as cancer. The differential dependency network (DDN) method was proposed to detect such network rewiring and biomarkers. However, the existing DDN method and software tool have two major drawbacks. First, with imbalanced sample groups, DDN suffers from systematic bias and produces false-positive differential dependencies. Second, the computational time of the block coordinate descent algorithm in DDN increases rapidly with the number of samples and molecular entities involved. To address the imbalanced-sample-group problem, we propose a sample-scale-wide normalized formulation to correct the systematic bias and design a simulation study to test its performance. To address the high computational complexity, we propose several strategies to accelerate DDN learning, including two reformulated algorithms for block-wise coefficient updating in the DDN optimization problem, a strategy for discarding uninformative predictors, and a strategy for accelerating parallel computing. More importantly, experimental results show that, with the combined acceleration strategies, DDN learning is hundreds of times faster than the original method on medium-sized data. We applied the DDN method to several biomedical omics datasets and detected significant phenotype-specific network rewiring. With a random-graph-based detection strategy, we discovered hub-node-defined biomarkers that helped generate or validate several novel scientific hypotheses in collaborative research projects. For example, the hub genes detected by the DDN method in proteomics data from artery samples are significantly enriched in the citric acid cycle pathway, which plays a critical role in the development of atherosclerosis. To detect intra-omics and inter-omics network rewiring, we propose a method called multiDDN that uses a multi-layer signaling model to integrate multi-omics data. We adapt the block coordinate descent algorithm, together with the acceleration strategies, to solve the multiDDN optimization problem.
The simulation study shows that, compared with the DDN method on single omics data, the multiDDN method has a considerable advantage in the accuracy of detecting network rewiring. We applied the multiDDN method to real multi-omics data from the CPTAC ovarian cancer dataset and detected multiple hub genes that are associated with histone protein deacetylation and were previously reported in independent ovarian cancer data analyses. / Doctor of Philosophy / We witnessed the start of the Human Genome Project decades ago and have since stepped into the era of omics. Omics are comprehensive approaches for analyzing genome-wide biomolecular profiles. The rapid development of high-throughput technologies enables us to produce an enormous amount of omics data, such as genomics, transcriptomics, and proteomics data, leaving researchers swimming in a sea of omics information once unimaginable. Yet the era of omics brings new challenges: processing huge volumes of data, summarizing the data, revealing the interactions between entities, linking various types of omics data, and discovering the mechanisms hidden behind the data. In processing omics data, one factor that weakens follow-up data analysis is sample impurity. We call impure tumor samples contaminated by normal cells heterogeneous samples. The genomic signals measured from heterogeneous samples are a mixture of signals from both tumor cells and normal cells. To correct the mixed signals and recover the true signals of pure tumor cells, we propose a computational approach called BACOM 2.0 to estimate the normal cell fraction and correct the genomic signals accordingly. By introducing a novel normalization method that identifies the neutral component in the mixed signals of genomic copy number data, BACOM 2.0 can accurately detect genes' deletion types and abnormal chromosome numbers in tumor cells. In cells, genes connect to other genes and form complex biological networks to perform their functions. Dysregulated genes can cause structural changes in biological networks, also known as network rewiring. In a biological network with rewiring events, a large amount of rewiring linked to a single hub gene suggests concentrated gene dysregulation. Such a hub gene has more impact on the network and hence is more likely to be associated with functional change of the network, which ultimately leads to abnormal phenotypes such as cancer. Therefore, the hub genes linked with network rewiring are potential indicators of disease status, also known as biomarkers. The differential dependency network (DDN) method was proposed to detect network rewiring events and biomarkers from omics data. However, the DDN method still has a few drawbacks. First, for two groups of data with unequal sample sizes, DDN consistently detects false targets of network rewiring. The permutation test, which applies the same method to randomly shuffled samples and is supposed to distinguish true targets from random effects, suffers from the same bias and can let these false targets pass. We propose a new formulation that corrects the errors caused by unequal group sizes and design a simulation study to test the new formulation's correctness. Second, the computing time for solving DDN problems becomes unbearably long when processing omics data with a large number of samples or a large number of genes.
We propose several strategies to increase DDN's computation speed, including three redesigned formulas for efficiently updating the results, one rule to preselect predictor variables, and one acceleration technique that utilizes multiple CPU cores simultaneously. In timing tests, the accelerated DDN method is much faster than the original method. To detect network rewiring within the same omics type or between different omics types, we propose a method called multiDDN that uses an integrated model to process multiple types of omics data. We solve the new problem by adapting the block coordinate descent algorithm. Tests on simulated data show that multiDDN outperforms single-omics DDN. We applied the DDN or multiDDN method to several omics datasets and detected significant network rewiring associated with diseases. We detected hub nodes from the network rewiring events. These hub genes, as potential biomarkers, help us ask new, meaningful questions in related research.
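As a simplified illustration of the idea behind differential dependency detection, the following sketch learns a sparse neighborhood for each gene in two conditions separately and reports edges whose presence differs. The actual DDN method solves a joint lasso formulation with block coordinate descent and adds the bias correction and acceleration strategies described above; the per-node lasso, parameter values, and random data below are illustrative assumptions only.

```python
# Toy differential dependency sketch: per-gene sparse neighborhoods in two
# conditions, then the symmetric difference of the edge sets as candidate
# "rewired" edges. Not the thesis's joint formulation; illustration only.
import numpy as np
from sklearn.linear_model import Lasso

def sparse_neighbors(X, lam=0.1, tol=1e-6):
    """Estimate a sparse dependency network: regress each gene on the rest."""
    n_samples, n_genes = X.shape
    edges = set()
    for j in range(n_genes):
        y = X[:, j]
        others = np.delete(np.arange(n_genes), j)
        model = Lasso(alpha=lam).fit(X[:, others], y)
        for k, coef in zip(others, model.coef_):
            if abs(coef) > tol:
                edges.add(tuple(sorted((int(j), int(k)))))
    return edges

def differential_edges(X_case, X_control, lam=0.1):
    """Edges present in one condition but not the other (candidate rewiring)."""
    return sparse_neighbors(X_case, lam) ^ sparse_neighbors(X_control, lam)

rng = np.random.default_rng(0)
X_control = rng.normal(size=(100, 20))
X_case = rng.normal(size=(60, 20))   # imbalanced group sizes on purpose
rewired = differential_edges(X_case, X_control)
print(f"{len(rewired)} candidate rewired edges")
```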
22

Ontology-driven Data Integration for Clinical Sleep Research

Mueller, Remo Sebastian 07 July 2011 (has links)
No description available.
23

Ontology Development and Utilization in Product Design

Chang, Xiaomeng 01 May 2008 (has links)
Currently, computer-based support tools are widely used to facilitate the design process and have the potential to reduce design time, decrease product cost, and enhance product quality. PDM (Product Data Management) and PLM (Product Lifecycle Management) are two types of computer-based information systems that have been developed to manage the product lifecycle and product-related data. While promising, significant limitations still exist: information required to make decisions may not be available, may lack consistency, and may not be expressed in a general way for sharing among systems. Moreover, it is difficult for designers to simultaneously consider multiple complex technical and economic criteria, relations, and objectives in product design. In recent years, the ontology-based method has emerged as a promising approach to manage knowledge in engineering, integrate multiple data resources, and facilitate the consideration of complex relations among concepts and slots in decision making. The purpose of this research is to explore an ontology-based method to overcome the limitations of present computer-based information systems for product design. The field of Design for Manufacturing (DFM) is selected for this study, and three primary aspects are investigated. First, a generalized DFM ontology is proposed and developed. The ontology fulfills the mathematical and logical constraints needed in DFM and provides ontology-editor capabilities to support its continuous improvement. Second, the means to guide users to the proper information and to integrate heterogeneous data resources are investigated. Third, based on the ontology and information integration, a decision support tool is developed to help designers consider the design problem in a systematic way and make design decisions efficiently based on accurate and comprehensive data. The methods and tools developed in this research are refined using example cases provided by the CFSP (the NSF Center for Friction Stir Processing), including cost models and a decision support environment. Errors that may occur in the research are categorized together with management methods. An error ontology is built to support root cause analysis of errors and to further reduce possible errors in the ontology and decision support tool. An evaluation methodology for the research is also investigated. / Ph. D.
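As a loose illustration of how "concepts and slots" plus a numeric DFM constraint could be encoded, here is a minimal sketch using rdflib; the class, property, individuals, and threshold are invented for the example and are not taken from the dissertation's ontology.

```python
# Tiny made-up DFM ontology: a process class, a slot (datatype property), two
# individuals, and a feasibility check against a design requirement.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS, XSD

DFM = Namespace("http://example.org/dfm#")
g = Graph()
g.bind("dfm", DFM)

# Concepts (classes) and a slot (datatype property)
g.add((DFM.ManufacturingProcess, RDF.type, RDFS.Class))
g.add((DFM.minWallThicknessMm, RDFS.domain, DFM.ManufacturingProcess))

# Individuals: candidate processes with their minimum producible wall thickness
for name, thickness in [("Milling", 0.8), ("DieCasting", 2.0)]:
    proc = DFM[name]
    g.add((proc, RDF.type, DFM.ManufacturingProcess))
    g.add((proc, DFM.minWallThicknessMm, Literal(thickness, datatype=XSD.double)))

# A simple DFM check: which processes can produce a 1.0 mm wall?
design_wall_mm = 1.0
feasible = [
    str(proc).split("#")[-1]
    for proc, _, thickness in g.triples((None, DFM.minWallThicknessMm, None))
    if float(thickness) <= design_wall_mm
]
print("Feasible processes:", feasible)  # expected: ['Milling']
```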
24

Análise gênica de comorbidades a partir da integração de dados epidemiológicos / Genetic analysis of comorbidities from epidemiological data integration

Ferraz Néto, Karla 01 December 2014 (has links)
The identification of genes responsible for human diseases can provide knowledge about pathological and physiological mechanisms that is essential for the development of new diagnostics and therapies. A disease is rarely the consequence of an abnormality in a single gene; rather, it reflects disorders of a complex intra- and intercellular network. Many methodologies known in bioinformatics are able to prioritize genes related to a particular disease, and some approaches can also validate whether these genes are relevant to the disease under study. One gene-prioritization approach is to investigate diseases that affect the same patients at the same time, i.e., comorbidities. There are many sources of biomedical data that can be used to collect comorbidities. In this way, we can collect pairs of diseases that form epidemiological comorbidities and then analyze the genes of each disease. This analysis serves to expand the list of candidate genes for each of these diseases and to justify the genetic relationship between the comorbidities. The main objective of this project is to integrate epidemiological and genetic data to predict disease-causing genes through the study of the comorbidities of these diseases.
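To make the approach concrete, below is a small hedged sketch of comorbidity-driven gene prioritization in which genes already linked to one disease are promoted as candidates for its comorbid partner. The disease names, gene symbols, co-occurrence strengths, and scoring rule are invented for illustration and are not the project's actual data or method.

```python
# Toy comorbidity-based gene prioritization over made-up disease-gene sets.
from collections import defaultdict

disease_genes = {
    "type_2_diabetes": {"TCF7L2", "PPARG", "KCNJ11"},
    "hypertension": {"AGT", "ACE", "PPARG"},
    "obesity": {"FTO", "MC4R", "PPARG"},
}

# Epidemiological comorbidity pairs with a co-occurrence strength (e.g. relative risk)
comorbidities = [
    ("type_2_diabetes", "hypertension", 2.1),
    ("type_2_diabetes", "obesity", 3.4),
]

def prioritize(target):
    """Score candidate genes for `target` using genes of its comorbid diseases."""
    scores = defaultdict(float)
    for d1, d2, strength in comorbidities:
        if target not in (d1, d2):
            continue
        partner = d2 if target == d1 else d1
        for gene in disease_genes[partner] - disease_genes[target]:
            scores[gene] += strength          # new candidate borrowed from partner
        for gene in disease_genes[partner] & disease_genes[target]:
            scores[gene] += 2 * strength      # shared genes support the comorbidity link
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))

print(prioritize("type_2_diabetes"))
```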
25

IntegraWeb: uma proposta de arquitetura baseada em mapeamentos semânticos e técnicas de mineração de dados / IntegraWeb: an architectural proposal based on semantic mappings and data mining techniques

Pierin, Felipe Lombardi 05 December 2017 (has links)
A large amount of content is produced and published on the Internet every day. These documents are published by different people and organizations, in many formats, without any established standard. For this reason, relevant information about a single domain of interest ends up spread across the Web in various portals, which hinders a broad, centralized, and objective view of this information. In this context, the integration of the data scattered across the network becomes a relevant research problem, enabling smarter queries whose results are richer in meaning and closer to the user's interest. However, such integration is not trivial and is often costly because of the dependence on specialized systems and labor, since there are few reusable and easily integrable models. Thus, a standardized model for integrating data and accessing the information produced by these different entities reduces the effort of building specific systems. In this work we propose an ontology-based architecture for the integration of data published on the Internet. Its use is illustrated through real use cases of information integration on the Internet, showing how the use of ontologies can yield more relevant results.
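A minimal sketch of the kind of ontology-based integration described above is shown below: two portals describe the same domain with different vocabularies, a small set of semantic mappings aligns them, and a single SPARQL query runs over the merged graph. The namespaces, predicates, and data are invented for illustration and do not reflect the IntegraWeb architecture in detail.

```python
# Toy semantic integration: two vocabularies, OWL equivalence mappings, one query.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF

A = Namespace("http://portal-a.example/schema#")
B = Namespace("http://portal-b.example/schema#")

g = Graph()
# Portal A publishes products with "hasPrice"; portal B uses "price"
g.add((A.item42, RDF.type, A.Product))
g.add((A.item42, A.hasPrice, Literal(10.5)))
g.add((B.prod7, RDF.type, B.Item))
g.add((B.prod7, B.price, Literal(9.9)))

# Semantic mappings: align classes and properties across the two vocabularies
g.add((B.Item, OWL.equivalentClass, A.Product))
g.add((B.price, OWL.equivalentProperty, A.hasPrice))

# A single query over the integrated view, following the mapping triples
query = """
PREFIX pa:  <http://portal-a.example/schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?thing ?price WHERE {
  { ?thing a pa:Product }
  UNION
  { ?cls owl:equivalentClass pa:Product . ?thing a ?cls }
  { ?thing pa:hasPrice ?price }
  UNION
  { ?p owl:equivalentProperty pa:hasPrice . ?thing ?p ?price }
}
"""
for row in g.query(query):
    print(row.thing, float(row.price))
```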
26

MPPI: um modelo de procedência para subsidiar processos de integração / MPPI: a provenance model to support data integration processes

Tomazela, Bruno 05 February 2010 (has links)
Data provenance is the set of metadata that allows the identification of the sources and transformations applied to data, from its creation to its current state. There are several advantages to incorporating data provenance into data integration processes, such as estimating data quality and reliability, performing data audits, establishing the copyright and ownership of data, and reproducing data integration decisions. In this master's thesis, we propose the MPPI, a novel data provenance model that supports data integration processes.
The model focuses on systems in which only owners can update their data sources, i.e., the integration process cannot correct the sources according to integration decisions. The main goal of the MPPI model is to handle decisions taken by the user in previous integration processes, so they can be automatically reapplied in subsequent integration processes. The MPPI model introduces the following properties. It is based on mapping provenance data into operations of copy, edit, insert and remove, which are stored in an operation repository. It also provides four techniques to handle overlapping operations: blind, restrict, undo and redo. Furthermore, it identifies anomalies generated by sources that are updated between two data integration processes and proposes four validation approaches to avoid these anomalies: full validation, source validation, target validation and no validation. Moreover, it introduces two methods that perform the reapplication of operations according to decisions taken by the user, called the VRS (Validate and Reapply in Separate) and the VRT (Validate and Reapply in Tandem) methods, in addition to extending the VRT method with the safe reordering optimization. The MPPI model was validated through performance tests that investigated overlapping operations, the VRT method and the safe reordering optimization. The tests showed that the techniques proposed to handle overlapping operations are feasible to apply in real integration systems. The results also demonstrated that the VRT method provided significant performance gains over data gathering when the goal is to reestablish previous integration results. The performance gains were at least 93%. Furthermore, the performance results also showed that reordering the operations before the reapplication process can further improve the performance of the VRT method.
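The sketch below illustrates the core mechanism described in the abstract: integration decisions logged as copy, edit, insert, and remove operations in a repository and reapplied automatically in a later run, with a crude validation step that skips operations whose source record has changed. The record layout, skip rule, and data are invented assumptions, not the MPPI model's actual definitions of validation or overlap handling.

```python
# Toy operation repository with reapplication and a simplistic validation check.
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Operation:
    kind: str                            # "copy" | "edit" | "insert" | "remove"
    key: str                             # identifier of the affected record
    value: dict | None = None            # payload for copy/edit/insert
    source_snapshot: dict | None = None  # source record at decision time

@dataclass
class OperationRepository:
    ops: list[Operation] = field(default_factory=list)

    def log(self, op: Operation) -> None:
        self.ops.append(op)

    def reapply(self, source: dict, target: dict) -> list[str]:
        """Reapply logged decisions onto `target`; skip ops whose source record
        changed since the decision (a crude stand-in for MPPI validation)."""
        skipped = []
        for op in self.ops:
            if op.source_snapshot is not None and source.get(op.key) != op.source_snapshot:
                skipped.append(op.key)
                continue
            if op.kind in ("copy", "insert", "edit"):
                target[op.key] = dict(op.value or {})
            elif op.kind == "remove":
                target.pop(op.key, None)
        return skipped

source = {"p1": {"name": "Ann", "age": 30}, "p2": {"name": "Bob"}}
repo = OperationRepository()
repo.log(Operation("edit", "p1", {"name": "Ann", "age": 31}, source_snapshot=dict(source["p1"])))
repo.log(Operation("remove", "p2", source_snapshot=dict(source["p2"])))

target = {"p1": {"name": "Ann", "age": 30}, "p2": {"name": "Bob"}}
print("skipped:", repo.reapply(source, target))
print(target)  # p1 edited, p2 removed, nothing skipped
```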
27

Exploring Strategies to Integrate Disparate Bioinformatics Datasets

Fakhry, Charbel Bader 01 January 2019 (has links)
Distinct bioinformatics datasets make it challenging for bioinformatics specialists to locate the required datasets and unify their format for result extraction. The purpose of this single case study was to explore strategies to integrate distinct bioinformatics datasets. The technology acceptance model was used as the conceptual framework to understand the perceived usefulness and ease of use of integrating bioinformatics datasets. The population of this study included bioinformatics specialists of a research institution in Lebanon that has strategies to integrate distinct bioinformatics datasets. The data collection process included interviews with 6 bioinformatics specialists and a review of 27 organizational documents related to integrating bioinformatics datasets. Thematic analysis was used to identify codes and themes related to integrating distinct bioinformatics datasets. Key themes resulting from the data analysis included a focus on integrating bioinformatics datasets, adding metadata to the submitted bioinformatics datasets, a centralized bioinformatics database, resources, and bioinformatics tools. In analyzing the findings of this study, I showed that specialists who promote standardizing techniques, adding metadata, and centralization may increase efficiency in integrating distinct bioinformatics datasets. Bioinformaticians, bioinformatics providers, the health-care field, and society might benefit from this research. Improvements in bioinformatics positively affect the health-care field, which constitutes positive social change. The results of this study might also lead to positive social change in research institutions, such as reduced workload, less frustration, reduced costs, and increased efficiency when integrating distinct bioinformatics datasets.
28

Using web services for customised data entry

Deng, Yanbo January 2007 (has links)
Scientific databases often need to be accessed from a variety of different applications. There are usually many ways to retrieve and analyse data already in a database. However, it can be more difficult to enter data which has originally been stored in different sources and formats (e.g. spreadsheets, other databases, statistical packages). This project focuses on investigating a generic, platform-independent way to simplify the loading of databases. The proposed solution uses Web services as middleware to supply essential data management functionality such as inserting, updating, deleting and retrieving data. These functions allow application developers to easily customise their own data entry applications according to local data sources, formats and user requirements. We implemented a Web service to support loading data to the Germinate database at the New Zealand Institute of Crop & Food Research (CFR). We also provided language-specific client toolkits to help developers invoke the Web service. The toolkits allow applications to be easily customised for different platforms. In addition, we developed sample applications to help end users load data from their project data sources via the Web service. The Web service approach was evaluated through user and developer trials. The feedback from the developer trial showed that using Web services as middleware is a useful approach to allow developers and competent end users to customise data entry with minimal effort. More importantly, the customised client applications enabled end users to load data directly from their project spreadsheets and databases. It significantly reduced the effort required for exporting or transforming the source data.
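Below is a loose, modernized sketch of the middleware pattern described in the abstract: a generic service exposes insert and retrieve operations, and customised clients load rows from local spreadsheets through it. The thesis used SOAP services for the Germinate database; the Flask endpoint, route, and field names here are purely illustrative assumptions.

```python
# Toy data-entry middleware: a generic insert/retrieve service plus a client-side
# helper that turns spreadsheet (CSV) rows into payloads for the service.
import csv
import io

from flask import Flask, jsonify, request

app = Flask(__name__)
RECORDS = []  # stand-in for the target research database


@app.post("/records")
def insert_record():
    """Insert one record supplied as JSON by a customised client application."""
    payload = request.get_json(force=True)
    RECORDS.append(payload)
    return jsonify({"status": "ok", "count": len(RECORDS)}), 201


@app.get("/records")
def list_records():
    return jsonify(RECORDS)


def load_spreadsheet(csv_text):
    """Client-side helper: turn spreadsheet rows into payloads for the service."""
    return list(csv.DictReader(io.StringIO(csv_text)))


if __name__ == "__main__":
    # A client would POST each row to http://localhost:5000/records
    for row in load_spreadsheet("accession,species\nA1,Hordeum vulgare\n"):
        print("would POST:", row)
    app.run(port=5000)
```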
29

A Property Valuation Model for Rural Victoria

Hayles, Kelly, kellyhayles@iinet.net.au January 2006 (has links)
Licensed valuers in the State of Victoria, Australia currently appraise rural land using manual techniques. Manual techniques typically involve site visits to the property and liaison with property owners through interviews, and require a valuer experienced in agricultural properties to determine a value. Manual techniques typically take longer to determine a property value than automated techniques, provided appropriate data are available. Manual methods of valuation can be subjective and lead to bias in valuation estimates, especially where valuers have varying levels of experience within a specific regional area. Automation may lend itself to more accurate valuation estimates by providing greater consistency between valuations. Automated techniques presently in use for valuation include artificial neural networks, expert systems, case-based reasoning and multiple regression analysis; the latter technique appears the most widely used for valuation. The research aimed to develop a conceptual rural property valuation model, and to develop and evaluate quantitative models for rural property valuation based on the variables identified in the conceptual model. The conceptual model was developed by examining peer research, Valuation Best Practice Standards (a standard in use throughout Victoria for rating valuations), and rural property valuation texts. Using only digitally and publicly available data, the research assessed this conceptualisation with properties from four LGAs in the Wellington and Wimmera Catchment Management Authority (CMA) areas in Victoria. Cluster analysis was undertaken to assess whether statistically determined sub-markets can lead to more accurate models than sub-markets determined using geographically defined areas. The research is divided into two phases: the 'available data phase' and the 'restricted data phase'. The 'available data phase' used publicly available digital data to build quantitative models to estimate the value of rural properties. The 'restricted data phase' used data that became available near the completion of the research. The research examined the effect of using statistically derived sub-markets as opposed to geographically derived ones for property valuation. Cluster analysis was used during both phases of model development and showed that one of the clusters developed in the available data phase was superior in its model prediction compared to the models produced using geographically derived regions. A number of limitations with the digital property data available for Victoria were found. Although GIS analysis can enable more property characteristics to be derived and measured from existing data, it is reliant on having access to suitable digital data. The research also identified limitations with the metadata elements in use in Victoria (ANZMETA DTD version 1). It is hypothesised that to further refine the models and achieve greater levels of price estimation, additional properties would need to be sourced and added to the current property database. It is suggested that additional research needs to address issues associated with sub-market identification. If the results of additional modelling indicated significantly different levels of price estimation, then these models could be used alongside manual techniques to evaluate manually derived valuation estimates.
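A minimal sketch of the two-step modelling approach described above, statistically derived sub-markets followed by a multiple-regression model per sub-market, is given below; the features, cluster count, and synthetic prices are invented and are not the thesis's Victorian property data.

```python
# Toy hedonic valuation: k-means sub-markets, then one regression per cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 300
# toy property features: area (ha), distance to town (km), annual rainfall (mm)
X = np.column_stack([
    rng.uniform(10, 500, n),
    rng.uniform(1, 80, n),
    rng.uniform(300, 900, n),
])
price = 2000 * X[:, 0] - 5000 * X[:, 1] + 300 * X[:, 2] + rng.normal(0, 50_000, n)

# Step 1: statistically derived sub-markets instead of geographic regions
submarkets = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Step 2: one multiple-regression valuation model per sub-market
for k in range(3):
    mask = submarkets == k
    model = LinearRegression().fit(X[mask], price[mask])
    r2 = model.score(X[mask], price[mask])
    print(f"sub-market {k}: {mask.sum()} properties, R^2 = {r2:.3f}")
```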
30

Query Processing for Peer Mediator Databases

Katchaounov, Timour January 2003 (has links)
The ability to physically interconnect many distributed, autonomous and heterogeneous software systems on a large scale presents new opportunities for the sharing and reuse of existing, and the creation of new, information and computational services. However, finding and combining information in many such systems is a challenge even for the most advanced computer users. To address this challenge, mediator systems logically integrate many sources to hide their heterogeneity and distribution and give the users the illusion of a single coherent system.

Many new areas, such as scientific collaboration, require cooperation between many autonomous groups willing to share their knowledge. These areas require that the data integration process can be distributed among many autonomous parties, so that large integration solutions can be constructed from smaller ones. For this we propose a decentralized mediation architecture, peer mediator systems (PMS), based on the peer-to-peer (P2P) paradigm. In a PMS, reuse of human effort is achieved through logical composability of the mediators in terms of other mediators and sources, by defining mediator views in terms of views in other mediators and sources.

Our thesis is that logical composability in a P2P mediation architecture is an important requirement and that composable mediators can be implemented efficiently through query processing techniques.

In order to compute answers to queries in a PMS, logical mediator compositions must be translated into query execution plans, where mediators and sources cooperate to compute query answers. The focus of this dissertation is on query processing methods to realize composability in a PMS architecture in an efficient way that scales with the number of mediators.

Our contributions consist of an investigation of the interfaces and capabilities of peer mediators, and the design, implementation and experimental study of several query processing techniques that realize composability in an efficient and scalable way.
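To give a flavour of logical composability, where mediator views are defined over other mediators and sources, here is a toy sketch; real PMS query processing compiles such compositions into distributed execution plans, and the sources, rows, and view operators below are invented for illustration.

```python
# Toy composable mediators: views over sources, and views over other views.
from typing import Callable, Iterable

Row = dict
View = Callable[[], Iterable[Row]]

# Autonomous sources published by two peers
def source_lab_a() -> Iterable[Row]:
    return [{"sample": "s1", "gene": "TP53"}, {"sample": "s2", "gene": "BRCA1"}]

def source_lab_b() -> Iterable[Row]:
    return [{"sample": "s3", "gene": "TP53"}]

def union_mediator(*views: View) -> View:
    """Mediator view defined as the union of other views (sources or mediators)."""
    def view() -> Iterable[Row]:
        for v in views:
            yield from v()
    return view

def select_mediator(base: View, predicate: Callable[[Row], bool]) -> View:
    """Mediator view defined as a selection over another mediator's view."""
    return lambda: (row for row in base() if predicate(row))

# Compose mediators in terms of other mediators, not just sources
all_samples = union_mediator(source_lab_a, source_lab_b)
tp53_samples = select_mediator(all_samples, lambda r: r["gene"] == "TP53")

print(list(tp53_samples()))  # rows from both peers, filtered by the top mediator
```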
