
Semantic Enrichment of Ontology Mappings

Arnold, Patrick 04 January 2016
Schema and ontology matching play an important part in the field of data integration and the Semantic Web. Given two heterogeneous data sources, metadata matching usually constitutes the first step in the data integration workflow, which refers to the analysis and comparison of two input resources like schemas or ontologies. The result is a list of correspondences between the two schemas or ontologies, often called a mapping or alignment. Many tools and research approaches have been proposed to automatically determine those correspondences. However, most match tools do not provide any information about the relation type that holds between matching concepts, for the simple but important reason that most common match strategies are too simple and heuristic to allow any sophisticated determination of the relation type. Knowing the specific type holding between two concepts, e.g., whether they are in an equality, subsumption (is-a) or part-of relation, is very important for advanced data integration tasks, such as ontology merging or ontology evolution. It is also very important for mappings in the biological or biomedical domain, where is-a and part-of relations may exceed the number of equality correspondences by far. Such more expressive mappings allow much better integration results, yet they have scarcely been the focus of research so far. This doctoral thesis focuses on determining the correspondence types in a given mapping, a task referred to as semantic mapping enrichment. We introduce and present the mapping enrichment tool STROMA, which takes a pre-calculated schema or ontology mapping and determines a semantic relation type for each correspondence. In contrast to previous approaches, we strongly focus on linguistic laws and linguistic insights; by and large, linguistics is the key to precise matching and to the determination of relation types. We introduce various strategies that exploit these linguistic laws and are able to calculate the semantic type between two matching concepts. The observations and insights gained from this research go far beyond the field of mapping enrichment and can also be applied to schema and ontology matching in general. Since generic strategies have certain limits and may not be able to determine the relation type between more complex concepts, such as a laptop and a personal computer, background knowledge also plays an important role in this research. For example, a thesaurus can help to recognize that these two concepts are in an is-a relation. We show how background knowledge can be used effectively in this setting, how it is possible to draw conclusions even if a concept is not contained in it, how the relation types in complex paths can be resolved, and how time complexity can be reduced by a so-called bidirectional search. The developed techniques go far beyond the background knowledge exploitation of previous approaches and are now part of the semantic repository SemRep, a flexible and extendable system that combines different lexicographic resources. Furthermore, we show how additional lexicographic resources can be built automatically by parsing Wikipedia articles. The proposed Wikipedia relation extraction approach yields several million additional relations, which constitute significant additional knowledge for mapping enrichment. The extracted relations were also added to SemRep, which thus became a comprehensive background knowledge resource. To improve the quality of the repository, different techniques were used to discover and remove irrelevant semantic relations. In several experiments, we show that STROMA achieves very good results with respect to relation type detection. In a comparative evaluation, it achieved considerably better results than related applications. This corroborates the overall usefulness and strength of the implemented strategies, which were developed with particular emphasis on the principles and laws of linguistics.
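To make the relation-type idea concrete, here is a minimal, illustrative sketch (not the STROMA implementation) of how a compound-head heuristic plus a tiny hand-made thesaurus can suggest whether two concept labels are in an equal, is-a or part-of relation; the heuristic rule and the thesaurus entries are assumptions for illustration only.

```python
# Toy relation-type enrichment: assign equal / is-a / part-of to a concept pair
# using (1) a compound-head heuristic and (2) a tiny hand-made thesaurus.
# This is an illustrative sketch, not the STROMA algorithm itself.

THESAURUS = {                      # assumed background knowledge
    ("laptop", "personal computer"): "is-a",
    ("roof", "building"): "part-of",
}

def head_noun(label: str) -> str:
    """In English compounds the last token is usually the semantic head."""
    return label.lower().split()[-1]

def relation_type(a: str, b: str) -> str:
    a_l, b_l = a.lower(), b.lower()
    if a_l == b_l:
        return "equal"
    # Compound heuristic: "notebook computer" is-a "computer".
    if head_noun(a_l) == b_l:
        return "is-a"
    if head_noun(b_l) == a_l:
        return "inverse is-a"
    # Fall back to background knowledge (checked in both directions).
    if (a_l, b_l) in THESAURUS:
        return THESAURUS[(a_l, b_l)]
    if (b_l, a_l) in THESAURUS:
        return "inverse " + THESAURUS[(b_l, a_l)]
    return "related (undecided)"

if __name__ == "__main__":
    for pair in [("Notebook Computer", "Computer"),
                 ("laptop", "personal computer"),
                 ("roof", "building")]:
        print(pair, "->", relation_type(*pair))
```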

Integrative Analyses of Diverse Biological Data Sources

January 2011
abstract: The expansion of technology for genomics research seen in the last decade has permitted the generation of large-scale data sources pertaining to molecular biological assays, genomics, proteomics, transcriptomics and other modern omics catalogs. New methods to analyze, integrate and visualize these data types are essential to unveil relevant disease mechanisms. Towards these objectives, this research focuses on data integration within two scenarios: (1) transcriptomic, proteomic and functional information and (2) real-time sensor-based measurements motivated by single-cell technology. To assess relationships between protein abundance, transcriptomic and functional data, a nonlinear model was explored at static and temporal levels. The successful integration of these heterogeneous data sources through the stochastic gradient boosted tree approach and its improved predictability are some highlights of this work. Through the development of an innovative validation subroutine based on a permutation approach and the use of external information (i.e., operons), the lack of a priori knowledge for undetected proteins was overcome. The integrative methodologies allowed for the identification of undetected proteins for Desulfovibrio vulgaris and Shewanella oneidensis for further biological exploration in laboratories towards finding functional relationships. In an effort to better understand diseases such as cancer at different developmental stages, the Microscale Life Science Center headquartered at Arizona State University is pursuing single-cell studies by developing novel technologies. This research assembled and applied a statistical framework that tackled the following challenges: random noise, heterogeneous dynamic systems with multiple states, and understanding cell behavior within and across different Barrett's esophageal epithelial cell lines using oxygen consumption curves. These curves were characterized with good empirical fit using nonlinear models with simple structures, which allowed the extraction of a large number of features. Application of a supervised classification model to these features and the integration of experimental factors allowed for identification of subtle patterns among different cell types visualized through multidimensional scaling. Motivated by the challenges of analyzing real-time measurements, we further explored a unique two-dimensional representation of multiple time series using a wavelet approach, which showed promising results towards less complex approximations. Also, the benefits of external information were explored to improve the image representation. / Dissertation/Thesis / Ph.D. Industrial Engineering 2011
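As a rough illustration of the integration pattern described above, the following sketch fits a gradient-boosted tree model to predict a protein-abundance-like target from transcript-level features and compares it against a label-permutation baseline; the synthetic data, scikit-learn estimator and hyperparameters are placeholders, not the settings used in the dissertation.

```python
# Sketch: predict protein abundance from transcriptomic features with
# gradient-boosted trees, and compare against a label-permutation baseline.
# Data are synthetic; this only illustrates the modelling pattern.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))            # e.g. mRNA levels plus functional features
y = 2.0 * X[:, 0] - X[:, 1] ** 2 + rng.normal(scale=0.5, size=200)  # protein abundance

model = GradientBoostingRegressor(n_estimators=200, max_depth=3, learning_rate=0.05)
real_r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()

# Permutation baseline: shuffling y should destroy any real signal.
perm_r2 = np.mean([
    cross_val_score(model, X, rng.permutation(y), cv=5, scoring="r2").mean()
    for _ in range(5)
])
print(f"cross-validated R^2: real={real_r2:.2f}  permuted={perm_r2:.2f}")
```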

Régulation de la formation du bois chez l'eucalyptus lors du développement et en réponse à des contraintes environnementales / Regulation of wood formation in eucalyptus during development and in response to environmental constraints

Ployet, Raphaël 30 June 2017
Due to its outstanding growth combined with superior wood properties, the genus Eucalyptus has become the most planted hardwood in the world and has emerged as one of the most appealing sources of renewable biomass feedstock for paper and second-generation biofuels. Wood is composed of lignified secondary cell walls (SCWs), and its formation is tightly regulated by a complex, partially unknown network of transcription factors (TFs). SCWs are composed of about 80% polysaccharides, which are targeted for most value-added bioproducts, whereas lignin (about 20%) is responsible for the recalcitrance of biomass to enzymatic degradation but increases the energetic potential of wood for combustion. Despite its remarkable adaptability to various soils and climates, Eucalyptus growth varies strongly with these factors. Eucalyptus is extensively grown on highly weathered soils in tropical and subtropical regions, where plantations are facing increasingly frequent drought episodes combined with nutrient starvation, requiring large amounts of expensive fertilizers. In temperate regions such as northern Europe, the main limitation to the expansion of this non-dormant tree is cold exposure, which dramatically reduces its growth. These abiotic constraints are aggravated by climate change, and their impact on wood formation and quality remains poorly documented. Scarce data from the literature suggest that these stresses affect secondary cell wall (SCW) deposition as well as xylem cell patterning; however, these results are highly heterogeneous among species and mainly focused on non-woody tissues. The selection of adapted clones and the development of more sustainable cultural practices are crucial to improve wood productivity and quality, which requires a better understanding of tree responses to cold and water stress in interaction with nutrition. In order to unravel the regulation of xylem differentiation by low temperature, we performed a targeted approach on cold-acclimated Eucalyptus trees. Biochemical, histochemical and transcriptomic analyses revealed that low temperature triggers precocious SCW deposition in developing xylem cells, characterized by strong lignin deposition. In parallel, to characterize the effect of water stress combined with different mineral nutrition regimes on wood formation and quality, we took advantage of a field experiment in which a highly productive commercial Eucalyptus clone was submitted to rainfall exclusion combined with potassium fertilization. We combined large-scale transcriptome and metabolome analyses with analyses of the structural and biochemical properties of wood. The integrative analysis of these datasets revealed that potassium fertilization represses SCW biosynthesis, together with a regulation of cambial activity and modifications of wood properties, with a strong interaction with rainfall exclusion. Both approaches pointed out several as-yet-uncharacterized TFs that are highly promising candidates for the control of cambial activity and SCW deposition in a woody perennial. Their functional characterization in poplar and Eucalyptus revealed a new key regulator of SCW biosynthesis in wood, and several MYB TFs potentially involved in the trade-off between SCW biosynthesis and growth.

Apprentissage statistique pour l'intégration de données omiques / Statistical learning for omics data integration

Mariette, Jérôme 15 December 2017
Advances in new sequencing technologies have led to the production of heterogeneous, voluminous, high-dimensional data at different scales of living systems. Integrating these different data is a challenge in systems biology, and one that must be addressed to make the most of the accumulated biological information for its interpretation and downstream use. This thesis gathers several methodological contributions useful for the simultaneous exploration of multiple heterogeneous omics datasets. To tackle this problem, kernels and kernel methods offer a natural framework, because they can account for the specific nature of each data table while allowing their combination. However, when the number of observations to process is large, kernel methods suffer from a lack of interpretability and from high computational complexity. A first part of this work concerns the adaptation of two exploratory kernel methods: kernel principal component analysis (K-PCA) and kernel self-organizing maps (K-SOM). The proposed adaptations address, on the one hand, the scaling of K-SOM and K-PCA to omics data and, on the other hand, the improvement of the interpretability of the results. In a second part, I focus on multiple kernel learning to combine several omics datasets. The efficiency of the proposed methods is illustrated in the context of microbial ecology: eight datasets from the TARA Oceans project were integrated and analysed using a K-PCA.
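A minimal sketch of the kernel-combination idea discussed above: one kernel is computed per (synthetic) omics table, the kernels are combined, and the combined kernel is fed to kernel PCA. The equal-weight average used here is a stand-in for the learned multiple-kernel weighting studied in the thesis.

```python
# Sketch: combine one kernel per omics dataset and run kernel PCA on the result.
# The equal-weight average is a placeholder for a learned multiple-kernel weighting.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(1)
n_samples = 60
omics_tables = [rng.normal(size=(n_samples, p)) for p in (500, 120, 30)]  # e.g. RNA, proteins, taxa

# One RBF kernel per table, each computed on its own feature space.
kernels = [rbf_kernel(X, gamma=1.0 / X.shape[1]) for X in omics_tables]
K_combined = np.mean(kernels, axis=0)          # naive equal-weight combination

embedding = KernelPCA(n_components=2, kernel="precomputed").fit_transform(K_combined)
print("2-D kernel PCA embedding:", embedding.shape)
```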

Pareamento privado de atributos no contexto da resolução de entidades com preservação de privacidade / Private attribute pairing in the context of privacy-preserving entity resolution

NÓBREGA, Thiago Pereira da. 10 September 2018
Privacy-Preserving Record Linkage (PPRL) aims to identify entities (e.g., patients), stored in distinct databases, that correspond to the same real-world object. Since the entities in question hold private data (i.e., data that cannot be disclosed), it is essential that the PPRL task is executed without revealing any information about the entities to the participants (the database owners), so that data privacy is preserved. At the end of a PPRL task, each participant identifies which entities in its database are present in the databases of the other participants. Before starting the PPRL task, the participants must agree on the (common) entity to be considered and on the attributes to be used to compare the entities. In general, this agreement requires the participants to expose their database schemas, sharing (meta-)information that can be used to break data privacy. This work proposes a semi-automatic approach for identifying similar attributes (attribute pairing) to be used to compare entities during PPRL. The approach is inserted as a preliminary step of PPRL (the Handshake step), and its result (similar attributes) can be used by the subsequent steps (Blocking and Comparison). In the proposed approach, each participant generates privacy-preserving representations (Data Signatures) of its attribute values, which are sent to a trusted third party that identifies similar attributes from the different data sources, eliminating the need to share information about the schemas and thus improving the security and privacy of the PPRL task. The evaluation of the approach indicates that the quality of the attribute pairing is equivalent to that of a solution that does not consider data privacy, and that the approach is able to preserve data privacy.
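The following is a rough sketch in the spirit of the attribute-pairing step described above: each party reduces an attribute column to a set of hashed character bigrams, and only these signatures are compared (here with Jaccard similarity), so no raw values or schema details need to be exchanged. The hashing scheme, salt and toy data are assumptions for illustration, not the thesis's Data Signature construction.

```python
# Sketch: pair similar attributes across two parties without exchanging raw values.
# Each attribute is summarised as a set of hashed bigrams ("signature"); a third
# party could then compare signatures by Jaccard similarity. Illustrative only.
import hashlib

def signature(values, salt="shared-secret"):
    """Hashed character-bigram set over all values of one attribute column."""
    grams = set()
    for v in values:
        v = v.lower()
        for i in range(len(v) - 1):
            grams.add(hashlib.sha256((salt + v[i:i+2]).encode()).hexdigest()[:8])
    return grams

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

party_a = {"patient_name": ["Maria Silva", "Joao Souza"], "birth": ["1980-01-02"]}
party_b = {"nome": ["maria silva", "jose santos"], "data_nasc": ["1975-05-20"]}

sigs_a = {attr: signature(vals) for attr, vals in party_a.items()}
sigs_b = {attr: signature(vals) for attr, vals in party_b.items()}

for a_attr, sa in sigs_a.items():
    best = max(sigs_b, key=lambda b_attr: jaccard(sa, sigs_b[b_attr]))
    print(f"{a_attr!r} pairs best with {best!r} "
          f"(similarity {jaccard(sa, sigs_b[best]):.2f})")
```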

Ontology-based approach for standard formats integration in reservoir modeling / Abordagem baseada em ontologias para integração de formatos padrões em modelagem de reservatórios

Werlang, Ricardo January 2015
The integration of data issued from autonomous and heterogeneous sources is still a significant problem for a large number of applications. In the oil and gas industry, a large amount of data is generated every day from multiple sources, such as seismic data, well data, drilling data, transportation data, and marketing data. However, these data are acquired through different techniques and represented in different standards and formats. Thus, these data exist in structured form in databases and in semi-structured form in spreadsheets and documents such as reports and multimedia collections. To deal with the heterogeneous formats of the data, the information needs to be standardized and integrated across systems, disciplines and organizational boundaries. As a result, this integration will enable better decision making within collaborations, since high-quality data will be accessible in a timely manner. The petroleum industry depends on the efficient use of these data to build computer models that simplify the geological reality and help to understand it. Such a model, which contains geological objects analyzed by different professionals – geologists, geophysicists and engineers – does not represent the reality itself, but the expert's conceptualization. As a result, the modeled geological objects assume distinct and complementary semantic representations in support of decision making. To preserve the originally intended meanings, ontologies are used to make the semantics of the models explicit and to integrate the data and files generated in the various stages of the exploration chain. The major claim of this work is that interoperability among earth models built and manipulated by different professionals and systems can be achieved by making apparent the meaning of the geological objects represented in the models. We show that domain ontologies developed with the support of the theoretical background of foundational ontologies are an adequate tool to clarify the semantics of geological concepts. We exemplify this capability by analyzing the standard communication formats most used in the modeling chain (LAS, WITSML, and RESQML), searching for entities semantically related to the geological concepts described in ontologies for the Geosciences. We show how the notions of identity, rigidity, essentiality and unity, applied to ontological concepts, lead the modeler to define the geological objects in the model more precisely. By making explicit the identity properties of the modeled objects, the modeler who applies data standards can overcome the ambiguities of the geological terminology. In doing so, we make explicit the relevant objects and properties that can be mapped from one model to another, even when they are represented with different names and formats.
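As a toy illustration of the mapping idea sketched above, the snippet below routes a field from one standard format to another through a shared ontology concept; the field names and the miniature concept table are invented for illustration and do not reproduce the ontologies analyzed in the thesis.

```python
# Toy sketch: route a field from one standard format to another through a shared
# ontology concept. Field names and the mini "ontology" are illustrative only.
ONTOLOGY_ANCHOR = {
    # (format, field)        -> shared geological concept
    ("LAS", "DEPT"):         "MeasuredDepth",
    ("WITSML", "md"):        "MeasuredDepth",
    ("RESQML", "MdDatum"):   "MeasuredDepth",
    ("LAS", "GR"):           "GammaRayReading",
    ("WITSML", "gammaRay"):  "GammaRayReading",
}

def translate(field, source_fmt, target_fmt):
    """Find the target-format field anchored to the same concept, if any."""
    concept = ONTOLOGY_ANCHOR.get((source_fmt, field))
    if concept is None:
        return None
    for (fmt, name), c in ONTOLOGY_ANCHOR.items():
        if fmt == target_fmt and c == concept:
            return name
    return None

print(translate("DEPT", "LAS", "WITSML"))   # -> 'md'
print(translate("GR", "LAS", "RESQML"))     # -> None (no anchored counterpart)
```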

Avaliação experimental de uma técnica de padronização de escores de similaridade / Experimental evaluation of a similarity score standardization technique

Nunes, Marcos Freitas January 2009
With the growth of the Web, the volume of information has grown considerably over the past years and, consequently, access to remote databases has become much easier, allowing the integration of physically distant data. Usually, instances of the same real-world object originating from distinct databases present differences in the representation of their values, i.e., the same real-world information can be represented in different ways. In this context, research on approximate matching using similarity functions has arisen, and with it the difficulty of understanding the results of these functions and of selecting ideal thresholds. When matching aggregates (records), there is also the problem of combining the similarity scores, since distinct functions have different distributions. To overcome this problem, a previous work developed a score standardization technique, which replaces the score computed by the similarity function with an adjusted score (computed through training) that is more intuitive for the user and can be combined in the record matching process. This technique was developed by a PhD student from the UFRGS database research group and is referred to here as MeaningScore (DORNELES et al., 2007). The present work studies and performs a detailed experimental evaluation of the MeaningScore technique. The evaluation carried out here shows that the MeaningScore approach is valid and returns better results: in the record matching process, where distinct similarity scores must be combined, using the standardized score instead of the original score returned by the similarity function produces results of higher quality.
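A small sketch of the score-standardization idea that the evaluation targets: raw similarity scores are replaced by an adjusted score estimated from labelled training pairs (here, the empirical match rate within score bins), which makes scores from different functions comparable before they are combined. The binning scheme and toy data are assumptions, not the MeaningScore definition.

```python
# Sketch: map raw similarity scores to an "adjusted" score learned from training
# pairs (empirical probability of a true match per score bin). Illustrative only.
import numpy as np

def fit_adjusted_score(raw_scores, labels, n_bins=5):
    """Return a function raw score -> fraction of true matches in its bin."""
    raw_scores, labels = np.asarray(raw_scores), np.asarray(labels)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rates = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (raw_scores >= lo) & (raw_scores <= hi)
        rates.append(labels[mask].mean() if mask.any() else 0.0)
    def adjust(score):
        idx = min(int(score * n_bins), n_bins - 1)
        return rates[idx]
    return adjust

# Toy training data for one similarity function (e.g. edit-distance based).
train_scores = [0.95, 0.90, 0.82, 0.70, 0.40, 0.30, 0.15]
train_labels = [1,    1,    1,    0,    0,    0,    0]     # 1 = true match

adjust = fit_adjusted_score(train_scores, train_labels)
for s in (0.92, 0.55, 0.20):
    print(f"raw={s:.2f} -> adjusted={adjust(s):.2f}")
```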

Integrative Analysis of Genomic Aberrations in Cancer and Xenograft Models

January 2015
abstract: No two cancers are alike. Cancer is a dynamic and heterogeneous disease; such heterogeneity arises among patients with the same cancer type, among cancer cells within the same individual's tumor, and even among cells within the same sub-clone over time. The recent application of next-generation sequencing and precision medicine techniques is the driving force behind uncovering the complexity of cancer and identifying the best clinical practice. The core concept of precision medicine is to move away from crowd-based, best-for-most treatment and to take individual variability into account when optimizing prevention and treatment strategies. Next-generation sequencing is the method of sifting through the entire 3 billion letters of each patient's DNA genetic code in a massively parallel fashion. The deluge of next-generation sequencing data has shifted the bottleneck of cancer research from the collection of multiple "-omics" data to integrative analysis and data interpretation. In this dissertation, I attempt to address two distinct but dependent challenges. The first is to design specific computational algorithms and tools that can process and extract useful information from the raw data in an efficient, robust, and reproducible manner. The second is to develop high-level computational methods and data frameworks for integrating and interpreting these data. Specifically, Chapter 2 presents a tool called Snipea (SNV Integration, Prioritization, Ensemble, and Annotation) that identifies, prioritizes and annotates somatic SNVs (single nucleotide variants) called by multiple variant callers. Chapter 3 describes a novel alignment-based algorithm to accurately and losslessly classify sequencing reads from xenograft models. Chapter 4 describes a direct and biologically motivated framework and associated methods for identifying putative aberrations causing survival differences in GBM patients by integrating whole-genome sequencing, exome sequencing, RNA sequencing, methylation array and clinical data. Lastly, Chapter 5 explores longitudinal and intratumor heterogeneity studies to reveal the temporal and spatial context of tumor evolution. The long-term goal is to help patients with cancer, particularly those who are in front of us today. Genome-based analysis of a patient's tumor can identify genomic alterations unique to that tumor that are candidate therapeutic targets to decrease therapy resistance and improve clinical outcome. / Dissertation/Thesis / Doctoral Dissertation Biomedical Informatics 2015
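As a minimal illustration of one integration step mentioned above (combining calls from multiple variant callers), the sketch below merges SNV call sets and ranks variants by caller concordance; the caller names and variants are invented, and the real Snipea tool applies far richer prioritization and annotation.

```python
# Sketch: merge SNV calls from several callers and rank by how many callers agree.
# Variant positions and caller names are invented; real pipelines add annotation,
# filtering and quality scores on top of this kind of concordance count.
from collections import Counter

calls = {
    "callerA": {("chr1", 12345, "A>T"), ("chr2", 888, "G>C"), ("chr7", 551, "C>T")},
    "callerB": {("chr1", 12345, "A>T"), ("chr7", 551, "C>T")},
    "callerC": {("chr1", 12345, "A>T"), ("chr3", 42, "T>G")},
}

support = Counter(v for call_set in calls.values() for v in call_set)
for variant, n in support.most_common():
    print(f"{variant}  supported by {n}/{len(calls)} callers")
```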

Query authentication in data outsourcing and integration services

Chen, Qian 27 August 2015
Owing to the explosive growth of data driven by e-commerce, social media, and mobile apps, data outsourcing and integration have become two popular Internet services. These services involve one or more data owners (DOs), many requesting clients, and a service provider (SP). The DOs outsource/synchronize their data to the SP, and the SP provides query services to the requesting clients on behalf of the DOs. However, as a third-party server, the SP might alter (leave out or forge) the outsourced/integrated data and query results, intentionally or not. To address this trust issue, the SP is expected to deliver its services in an authenticatable manner, so that the correctness of the service results can be verified by the clients. Unfortunately, existing work on query authentication cannot preserve the privacy of the data being queried. Furthermore, almost all previous studies assume only a single data source/owner, while data integration services usually combine data from multiple sources. In this dissertation, we take the first step to study the authentication of location-based queries with confidentiality and to investigate authenticated online data integration services. Cost models, security analysis, and experimental results consistently show the effectiveness and robustness of our proposed schemes under various system settings and query workloads.
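A compact sketch of a standard building block behind this kind of result verification: a Merkle hash tree whose root, signed by the data owner, lets a client check that a returned record belongs to the outsourced dataset. This is a generic illustration and not the specific schemes proposed in the dissertation.

```python
# Sketch: Merkle-tree membership proof, a common primitive for query authentication.
# The service provider returns a record plus sibling hashes; the client recomputes
# the root and compares it with the owner-signed root digest. Generic illustration.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(leaves):
    """Return list of levels, level[0] = leaf hashes, last level = [root]."""
    level = [h(x) for x in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                      # duplicate last node if odd
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def proof(levels, index):
    """Sibling hashes from leaf to root for the leaf at `index`."""
    path = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1
        path.append((level[sibling], sibling < index))  # (hash, sibling-is-left)
        index //= 2
    return path

def verify(leaf, path, root):
    node = h(leaf)
    for sibling, sibling_is_left in path:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root

records = [b"alice,loc=(3,4)", b"bob,loc=(7,1)", b"carol,loc=(2,9)"]
levels = build_tree(records)
root = levels[-1][0]                             # digest the data owner would sign
print(verify(records[1], proof(levels, 1), root))             # True
print(verify(b"mallory,loc=(0,0)", proof(levels, 1), root))   # False
```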
