71 |
Interactive mapping specification and repairing in the presence of policy views / Spécification et réparation interactive de mappings en présence de polices de sécurité. Comignani, Ugo, 19 September 2019.
La migration de données entre des sources aux schémas hétérogènes est un domaine en pleine croissance avec l'augmentation de la quantité de données en accès libre et le regroupement des données à des fins d'apprentissage automatisé et de fouille. Cependant, la description du processus de transformation des données d'une instance source vers une instance définie sur un schéma différent est un processus complexe, même pour un utilisateur expert du domaine. Cette thèse aborde le problème de la définition de mappings par un utilisateur non expert en migration de données, ainsi que la vérification du respect, par ces mappings, des contraintes d'accès définies sur les données sources. Pour cela, nous proposons dans un premier temps un système dans lequel l'utilisateur fournit un ensemble de petits exemples de ses données et répond à des questions booléennes simples afin de générer un mapping correspondant à ses besoins. Dans un second temps, nous proposons un système permettant de réécrire le mapping produit de manière à assurer qu'il respecte un ensemble de vues de contrôle d'accès définies sur le schéma source du mapping. Plus précisément, le premier grand axe de cette thèse est la formalisation du problème de la définition interactive de mappings, ainsi que la description d'un cadre formel pour sa résolution. Cette approche formelle est accompagnée de preuves de bonnes propriétés. Sur la base de ce cadre formel, nous proposons ensuite des algorithmes permettant de résoudre efficacement ce problème en pratique. Ces algorithmes visent à réduire le nombre de questions auxquelles l'utilisateur doit répondre afin d'obtenir un mapping correspondant à ses besoins. Pour cela, les mappings possibles sont ordonnés dans des structures de treillis imbriqués, afin de permettre un élagage efficace de l'espace des mappings à explorer. Nous proposons également une extension de cette approche à l'utilisation de contraintes d'intégrité afin d'améliorer l'efficacité de l'élagage. Le second axe majeur vise à proposer un processus de réécriture de mapping qui, étant donné un ensemble de vues de contrôle d'accès de référence, assure que le mapping réécrit ne donne accès à aucune information qui ne soit pas accessible via ces vues. Pour cela, nous définissons un protocole de contrôle d'accès permettant de visualiser les informations accessibles ou non à travers un ensemble de vues de contrôle d'accès. Ensuite, nous décrivons un ensemble d'algorithmes permettant la réécriture d'un mapping en un mapping sûr vis-à-vis d'un ensemble de vues de contrôle d'accès. Comme précédemment, cette approche est complétée par des preuves de bonnes propriétés. Afin de réduire le nombre d'interactions nécessaires avec l'utilisateur lors de la réécriture d'un mapping, une approche permettant l'apprentissage des préférences de l'utilisateur est proposée, afin de permettre le choix entre un processus interactif ou automatique. L'ensemble des algorithmes décrits dans cette thèse a fait l'objet d'un prototypage, et les expériences réalisées sur ces prototypes sont présentées dans cette thèse.

Data exchange between sources over heterogeneous schemas is an ever-growing field of study with the increased availability of data, often in open access, and the pooling of such data for data mining or learning purposes. However, the description of the data exchange process from a source to a target instance defined over a different schema is a cumbersome task, even for users acquainted with data exchange. In this thesis, we address the problem of allowing a non-expert user to specify a source-to-target mapping, and the problem of ensuring that the specified mapping does not leak information forbidden by the security policies defined over the source. To do so, we first provide an interactive process in which users provide small examples of their data and answer simple boolean questions in order to specify their intended mapping. Then, we provide another process to rewrite this mapping in order to ensure its safety with respect to the source policy views. The first main contribution of this thesis is thus a formal definition of the problem of interactive mapping specification, together with a formal resolution process for which desirable properties are proved. Based on this formal resolution process, practical algorithms are then provided. The approach behind these algorithms aims at reducing the number of boolean questions users have to answer by using quasi-lattice structures to order the set of possible mappings, allowing an efficient pruning of the space of explored mappings. In order to improve this pruning, an extension of this approach to the use of integrity constraints is also provided. The second main contribution is a repairing process ensuring that a mapping is "safe" with respect to a set of policy views defined on its source schema, i.e., that it does not leak sensitive information. A privacy-preservation protocol is provided to visualize the information leaks of a mapping, as well as a process to rewrite an input mapping into a safe one with respect to a set of policy views. As in the first contribution, this process comes with proofs of desirable properties. To reduce the number of interactions needed with the user, the interactive part of the repairing process is also enriched with the ability to learn which rewritings users prefer, making a completely automatic process possible. Last but not least, we present extensive experiments over the open-source prototypes built from the two contributions of this thesis.
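To make the specification-by-example idea more concrete, the sketch below shows a toy interactive loop in Python: candidate mappings are ordered by containment, so a single negative answer prunes every more general candidate at once. The candidate set, the question wording, and the specify_mapping helper are invented for illustration; this is not the thesis's quasi-lattice algorithm.

```python
# Toy specification-by-example loop (illustrative sketch only).
# Candidates are ordered by containment so one "no" prunes all supersets.

# Hypothetical candidates: which source attributes the mapping exposes in the target.
CANDIDATES = [
    frozenset({"name"}),
    frozenset({"name", "email"}),
    frozenset({"name", "email", "phone"}),
]

def ask_user(attributes):
    """Stand-in for the boolean question shown to the user."""
    answer = input(f"Should the target expose {sorted(attributes)}? [y/n] ")
    return answer.strip().lower().startswith("y")

def specify_mapping(candidates):
    chosen = None
    for cand in sorted(candidates, key=len):   # walk the containment chain bottom-up
        if ask_user(cand):
            chosen = cand                      # keep the largest accepted candidate
        else:
            break                              # prune every superset of a rejected one
    return chosen

if __name__ == "__main__":
    print("Accepted mapping attributes:", specify_mapping(CANDIDATES))
```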
|
72 |
Privacy-Preserving Data Integration in Public Health Surveillance. Hu, Jun, 16 May 2011.
With the widespread use of the Internet, data is often shared between organizations in B2B health care networks. Integrating data across all sources in a health care network would be useful to public health surveillance and would provide a complete view of how the overall network is performing. Because of the lack of standardization around a common data model across organizations, matching identities between different locations in order to link and aggregate records is difficult. Moreover, privacy legislation controls the use of personal information, and health care data is very sensitive in nature, so the protection of data privacy and the prevention of personal health information leaks are more important than ever. Throughout the process of integrating data sets from different organizations, consent (explicit or implicit) and/or permission to use must be in place, data sets must be de-identified, and identity must be protected. Furthermore, one must ensure that combining data sets from different data sources into a single consolidated data set does not create data that could be re-identified, even when only summary data records are created.
In this thesis, we propose new privacy preserving data integration protocols for public health surveillance, identify a set of privacy preserving data integration patterns, and propose a supporting framework that combines a methodology and architecture with which to implement these protocols in practice. Our work is validated with two real world case studies that were developed in partnership with two different public health surveillance organizations.
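As one deliberately simplified illustration of how records can be linked across organizations without exchanging raw identifiers, the sketch below matches records on keyed hashes (HMAC) of quasi-identifiers. It shows a common building block of privacy-preserving record linkage, not the specific protocols proposed in the thesis; the shared key, field names, and data are hypothetical.

```python
# Minimal sketch: each site derives a linkage pseudonym from normalized
# quasi-identifiers with a shared secret key, so only hashes are compared
# and raw identities never leave a site.  Illustration only.

import hmac
import hashlib

SHARED_KEY = b"agreed-out-of-band"   # hypothetical key exchanged between the two sites

def pseudonym(record):
    """Keyed hash over normalized quasi-identifiers (name + date of birth)."""
    quasi_id = "|".join([record["name"].lower(), record["dob"]])
    return hmac.new(SHARED_KEY, quasi_id.encode(), hashlib.sha256).hexdigest()

site_a = [{"name": "Ann Lee", "dob": "1980-02-01", "diagnosis": "flu"}]
site_b = [{"name": "Ann Lee", "dob": "1980-02-01", "lab_result": "positive"}]

index_b = {pseudonym(r): r for r in site_b}
linked = [(a, index_b[pseudonym(a)]) for a in site_a if pseudonym(a) in index_b]
print(len(linked), "record pair(s) linked without exchanging raw identifiers")
```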
|
73 |
Data Integration of High-Throughput Proteomic and Transcriptomic Data based on Public Database Knowledge. Wachter, Astrid, 22 March 2017.
No description available.
|
74 |
[en] RXQEE - RELATIONAL-XML QUERY EXECUTION ENGINE / [pt] RXQEE: UMA MÁQUINA DE EXECUÇÃO DE CONSULTAS DE INTEGRAÇÃO DE DADOS RELACIONAIS E XML. AMANDA VIEIRA LOPES, 10 February 2005.
[pt] Na abordagem tradicional para execução de consultas em um ambiente de integração de dados, os dados provenientes de fontes heterogêneas são convertidos para o modelo de dados global do sistema integrador, através do uso de adaptadores (wrappers), antes de serem submetidos aos operadores algébricos de uma consulta. Como consequência, planos de execução de consultas (PECs) contêm operadores que processam dados representados apenas no modelo de dados global. Esta dissertação apresenta uma nova abordagem para a execução de consultas de integração, denominada Moving Wrappers, na qual a conversão entre os modelos de dados acontece durante o processamento, em qualquer ponto do PEC, permitindo que os operadores processem dados representados no modelo de dados original de suas fontes. Baseada nesta abordagem, foi desenvolvida uma máquina de execução de consultas (MEC) que executa PECs de integração de dados de fontes Relacionais e XML, combinando, em um mesmo PEC, operadores em ambos os modelos. Esta MEC, denominada RXQEE (Relational-XML Query Execution Engine), foi instanciada a partir do framework QEEF (Query Execution Engine Framework), desenvolvido em um projeto de pesquisa do laboratório TecBD da PUC-Rio. De modo a permitir a execução de PECs de integração, a MEC RXQEE implementa operadores algébricos, nos modelos XML e Relacional, e operadores interalgébricos, desenvolvidos para realizar a conversão entre esses modelos de dados.

[en] In the traditional approach for the evaluation of data integration queries, heterogeneous data in data sources are converted into the global data model by wrappers before being delivered to algebraic operators. Consequently, query execution plans (QEPs) are composed exclusively of operations over the global data model. This work proposes a new data integration query evaluation strategy, named Moving Wrappers, in which data conversion is considered as an operation that can be placed in any part of the QEP, based on a query optimization process. This permits the use of algebraic operators in the data source's own data model, so a QEP may include fragments with operations in different data models, converted to the global data model by inter-algebraic operators. Based on this strategy, a query execution engine (QEE), named RXQEE (Relational-XML Query Execution Engine), was developed as an instance of QEEF (Query Execution Engine Framework). In particular, RXQEE supports integration queries over Relational and XML data, and therefore implements algebraic operators, in the XML and Relational models, and inter-algebraic operators, permitting the execution of integration QEPs.
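The sketch below illustrates the general idea behind Moving Wrappers with a three-operator toy plan in Python: an XML-model scan, an inter-algebraic conversion operator, and a relational selection applied downstream. Operator names, signatures, and the sample document are assumptions made for illustration; they do not reflect RXQEE's or QEEF's actual API.

```python
# Toy query execution plan in the spirit of Moving Wrappers: operators work in
# their source's native model and an inter-algebraic operator converts between
# models inside the plan.  Illustrative sketch only.

import xml.etree.ElementTree as ET

XML_SOURCE = "<wines><wine><name>Barolo</name><year>2015</year></wine></wines>"

def xml_scan(doc):                      # XML-model operator: yields element nodes
    for node in ET.fromstring(doc):
        yield node

def xml_to_relational(nodes):           # inter-algebraic operator: nodes -> tuples
    for node in nodes:
        yield {child.tag: child.text for child in node}

def relational_select(tuples, pred):    # relational-model operator
    return (t for t in tuples if pred(t))

plan = relational_select(xml_to_relational(xml_scan(XML_SOURCE)),
                         lambda t: int(t["year"]) >= 2010)
print(list(plan))
```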
|
75 |
[en] A FRAMEWORK FOR THE CONSTRUCTION OF MEDIATORS OFFERING DEDUPLICATION / [pt] UM FRAMEWORK PARA A CONSTRUÇÃO DE MEDIADORES OFERECENDO ELIMINAÇÃO DE DUPLICATAS. GUSTAVO LOPES MOURAD, 24 January 2011.
[pt] À medida que aplicações web que combinam dados de diferentes fontes ganham importância, soluções para a detecção online de dados duplicados tornam-se centrais. A maioria das técnicas existentes é baseada em algoritmos de aprendizado de máquina, que dependem do uso de bases de treino criadas manualmente. Estas soluções não são adequadas no caso da Deep Web onde, de modo geral, existe pouca informação acerca do tamanho e da volatilidade das fontes de dados, e a obtenção de um conjunto de dados relevante para o treinamento é uma tarefa difícil. Nesta dissertação propomos uma estratégia para extração (scraping), detecção de duplicatas e incorporação de dados resultantes de consultas realizadas em bancos de dados na Deep Web. Nossa abordagem não requer o uso de conjuntos de treino previamente definidos, mas utiliza uma combinação de um classificador baseado no Vector Space Model com funções de cálculo de similaridade para prover uma solução viável. Para ilustrar nossa proposta, apresentamos um estudo de caso onde o framework é instanciado para uma aplicação do domínio dos vinhos.

[en] As Web applications that obtain data from different sources (Mashups) grow in importance, timely solutions to the duplicate detection problem become central. Most existing techniques, however, are based on machine learning algorithms that heavily rely on relevant, manually labeled training datasets. Such solutions are not adequate for data sources on the Deep Web, as there is often little information regarding their size or volatility, and hardly any access to relevant samples to be used for training. In this thesis we propose a strategy to aid in the extraction (scraping), duplicate detection and integration of data resulting from querying Deep Web resources. Our approach does not require the use of pre-defined training sets, but rather uses a combination of a Vector Space Model classifier with similarity functions in order to provide a viable solution. To illustrate our approach, we present a case study where the proposed framework was instantiated for an application in the wine industry domain.
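As a rough illustration of combining a vector-space representation with a similarity function for duplicate detection, which is the general idea described above, the Python sketch below scores record pairs with cosine similarity over token counts. The tokenization, threshold, and sample catalogue are illustrative assumptions, not the framework's actual classifier.

```python
# Minimal vector-space duplicate-detection sketch: records are compared by
# cosine similarity of their token-count vectors and pairs above a threshold
# are flagged as likely duplicates.  Threshold and data are made up.

import math
from collections import Counter

def cosine(a, b):
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def likely_duplicates(records, threshold=0.8):
    return [(r1, r2) for i, r1 in enumerate(records)
            for r2 in records[i + 1:]
            if cosine(r1, r2) >= threshold]

catalog = ["Chateau Margaux 2010 red wine",
           "Chateau Margaux red wine 2010",
           "Vinho Verde 2019 white"]
print(likely_duplicates(catalog))
```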
|
76 |
Uma abordagem de integração de dados públicos sobre comorbidade para a predição de associação de doenças complexas / An approach of integrating public data on comorbidity for the prediction of association of complex diseases. Silva, Carla Fernandes da, 02 May 2019.
Comorbidade é a coocorrência de dois ou mais distúrbios em uma pessoa. Identificar quais fatores genéticos ou quais mecanismos estão subjacentes à comorbidade é um grande desafio da ciência. Outra constatação relevante é que muitos pares de doenças que compartilham genes comuns não mostram comorbidade significativa nos registros clínicos. Vários estudos clínicos e epidemiológicos têm demonstrado que a comorbidade é uma situação médica universal, porque pacientes com vários transtornos médicos são a regra e não a exceção. Neste trabalho, é proposta uma metodologia de predição de associação doença-doença por meio da integração de dados públicos sobre genes e sobre doenças e suas comorbidades. As redes formadas pelos genes e pelas doenças são analisadas a partir de cinco métodos de predição de links (Vizinhos Comuns, Adamic-Adar, Índice de Conexão Preferencial, Índice de Alocação de Recursos e Katz), a fim de encontrar novas relações de comorbidade. Como resultado, foram criadas duas redes: uma rede epidemiológica, chamada rede_DATASUS, com 1.941 nós e 248.508 arestas, e uma rede gênica, rede_KEGG, com 288 nós e 1.983 arestas. A predição foi realizada sobre a rede_KEGG e, dentre as associações de doenças preditas e analisadas, encontramos 6 associações que estão presentes na rede_DATASUS e relatadas na literatura. Acreditamos que as associações entre genes podem elucidar as causas de algumas comorbidades.

Comorbidity is the co-occurrence of two or more health disorders in a person. Identifying which genetic factors or which biological mechanisms underlie comorbidity is a major scientific challenge. Another relevant finding is that many pairs of diseases that share common genes do not show significant comorbidity in clinical records. Several clinical and epidemiological studies have shown that comorbidity is a universal medical situation, because patients with various medical disorders are the rule and not the exception. In this work, a methodology for predicting disease-disease associations is proposed through the integration of public data on genes and on diseases and their comorbidities. The networks formed by genes and by diseases are analyzed with five link prediction methods (Common Neighbours, Adamic-Adar, Preferential Attachment Index, Resource Allocation Index and Katz) in order to find new comorbidity relations. As a result, two networks were created: an epidemiological network, rede_DATASUS, with 1,941 nodes and 248,508 edges, and a gene network, rede_KEGG, with 288 nodes and 1,983 edges. The prediction was performed over rede_KEGG, and among the predicted and analyzed disease associations we found 6 that are present in rede_DATASUS and reported in the literature. We believe that the associations between genes can elucidate the causes of some comorbidities.
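For readers unfamiliar with the link-prediction scores mentioned above, the short Python sketch below computes three of them (Common Neighbours, Adamic-Adar and Preferential Attachment) on a toy graph. The graph and node names are invented for illustration and have no relation to the DATASUS or KEGG networks used in the work.

```python
# Toy link-prediction scores on a made-up disease co-occurrence graph.
import math

# Adjacency as sets of neighbours (hypothetical example data).
neighbours = {
    "diabetes":     {"hypertension", "obesity"},
    "hypertension": {"diabetes", "obesity"},
    "obesity":      {"diabetes", "hypertension", "asthma"},
    "asthma":       {"obesity"},
}

def common_neighbours(u, v):
    return neighbours[u] & neighbours[v]

def adamic_adar(u, v):
    # Down-weights shared neighbours that are themselves highly connected.
    return sum(1.0 / math.log(len(neighbours[w])) for w in common_neighbours(u, v))

def preferential_attachment(u, v):
    return len(neighbours[u]) * len(neighbours[v])

pair = ("diabetes", "asthma")            # not directly linked in the toy graph
print("common neighbours:", common_neighbours(*pair))
print("Adamic-Adar score:", round(adamic_adar(*pair), 3))
print("preferential attachment:", preferential_attachment(*pair))
```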
|
77 |
Analyse intégrée de données de génomique et d'imagerie pour le diagnostic et le suivi du gliome malin chez l'enfant / Integrated analysis of genomic and imaging data dedicated to the diagnosis and follow-up of pediatric high grade glioma. Philippe, Cathy, 08 December 2014.
Les tumeurs cérébrales malignes sont la première cause de mortalité par cancer chez l'enfant, avec une survie médiane de 12 à 14 mois et une survie globale à 5 ans de 20 % pour les gliomes de haut grade. Ce travail de thèse propose des méthodes innovantes pour l'analyse de blocs de données génomiques, dans le but d'accroître les connaissances biologiques sur ces tumeurs. Les méthodes proposées étendent les travaux de Tenenhaus et al. (2011), qui introduisent le cadre statistique général Regularized Generalized Canonical Correlation Analysis (RGCCA). Dans un premier temps, nous étendons RGCCA à la gestion de données en grande dimension via une écriture duale de l'algorithme initial (KGCCA). Dans un deuxième temps, la problématique de la sélection de variables dans un contexte multi-blocs est étudiée ; nous en proposons une solution avec la méthode SGCCA, qui pénalise la norme L1 des poids des composantes. Dans un troisième temps, nous nous intéressons à la nature des liens entre blocs avec deux autres adaptations. D'une part, la régression logistique multi-blocs (multiblog) permet de prédire une variable binaire, comme la réponse à un traitement. D'autre part, le modèle de Cox multi-blocs (multiblox) permet d'évaluer, par exemple, le risque instantané de rechute. Enfin, nous appliquons ces méthodes à l'analyse conjointe de données de transcriptome et d'aberrations du nombre de copies, acquises sur une cohorte de 53 jeunes patients avec un gliome de haut grade primaire. Les résultats sont décrits dans le dernier chapitre du manuscrit.

Cerebral malignant tumors are the leading cause of death among pediatric cancers, with a median survival of 12 to 14 months and an overall survival of 20% at 5 years for high-grade gliomas. This work proposes innovative methods for the analysis of heterogeneous genomic multi-block data, with the main objective of increasing biological knowledge about such tumors. These methods extend the work of Tenenhaus and Tenenhaus (2011), who introduced Regularized Generalized Canonical Correlation Analysis (RGCCA) as a general statistical framework for multi-block data analysis. As a first step, we extended RGCCA to handle large-scale data with kernel methods (KGCCA). As a second step, variable selection within the RGCCA context is studied with SGCCA, which adds a constraint on the L1-norm of the weight vectors. As a third step, we focused on the nature of the links between blocks, with two further developments. On the one hand, multi-block logistic regression (multiblog) makes it possible to predict a binary variable, such as response to treatment. On the other hand, the Cox model for multi-block data (multiblox) enables the assessment of the instantaneous risk of, for instance, relapse. We applied these methods to the joint analysis of Gene Expression and Copy Number Aberrations acquired on a cohort of 53 young patients with a primary High Grade Glioma. Results are detailed in the last chapter of this work.
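To give a flavour of what L1-penalised multi-block weight estimation can look like, the sketch below runs a toy two-block alternating iteration with soft-thresholding of the weight vectors, in the spirit of SGCCA's sparsity constraint. The data are random, and the update scheme and thresholding rule are didactic assumptions, not the RGCCA/KGCCA/SGCCA algorithms developed in the thesis.

```python
# Toy two-block sparse-weight iteration (didactic sketch, not SGCCA itself):
# alternate between blocks, soft-threshold each weight vector, renormalise.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))     # e.g. an expression block (random, made-up data)
Y = rng.normal(size=(50, 8))      # e.g. a copy-number block (random, made-up data)
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

def soft_threshold(w, lam):
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def sparse_block_weights(X, Y, frac=0.5, n_iter=100):
    """Alternating updates with an L1-style soft-threshold on each weight vector."""
    v = rng.normal(size=Y.shape[1])
    v /= np.linalg.norm(v)
    u = np.zeros(X.shape[1])
    for _ in range(n_iter):
        u = X.T @ (Y @ v)
        u = soft_threshold(u, frac * np.max(np.abs(u)))   # sparsify block-1 weights
        u /= np.linalg.norm(u)
        v = Y.T @ (X @ u)
        v = soft_threshold(v, frac * np.max(np.abs(v)))   # sparsify block-2 weights
        v /= np.linalg.norm(v)
    return u, v

u, v = sparse_block_weights(X, Y)
print("non-zero weights per block:", np.count_nonzero(u), np.count_nonzero(v))
print("covariance of the two block components:", float((X @ u) @ (Y @ v) / len(X)))
```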
|
78 |
Migration et enrichissement sémantique d'entités culturelles / Migration and Semantic Enrichment of Cultural Entities. Decourselle, Joffrey, 28 September 2018.
De nombreux efforts ont été faits ces dernières années pour faciliter la gestion et la représentation des entités culturelles. Toutefois, il existe encore un grand nombre de systèmes, souvent isolés et toujours utilisés dans les institutions culturelles, reposant sur des modèles non sémantiques qui rendent difficile la validation et l'enrichissement des données. Cette thèse a pour but de proposer de nouvelles solutions pour améliorer la représentation et l'enrichissement sémantique de données culturelles en utilisant les principes du Web Sémantique. Pour ce faire, la recherche se concentre d'une part sur l'adoption de modèles plus sémantiques, comme les principes de FRBR, qui permettent de représenter des familles bibliographiques complexes en utilisant un modèle entités-associations avec différents niveaux d'abstraction. Toutefois, la qualité d'une telle transformation est cruciale, et c'est pourquoi des améliorations doivent être apportées à la configuration et à l'évaluation d'un tel processus. En parallèle, la thèse cherche à profiter de ces nouveaux modèles sémantiques pour faciliter l'interconnexion des données avec des sources externes comme celles du Linked Open Data ou des sources moins structurées (sites Web, flux). Cela doit permettre de générer des bases de connaissances thématiques plus en accord avec les besoins des utilisateurs. Cependant, l'agrégation d'informations depuis des sources hétérogènes implique des étapes d'alignement à la fois au niveau du schéma et au niveau des entités.

Many efforts have been made over the last two decades to facilitate the management and representation of cultural heritage data. However, many systems used in cultural institutions are still based on flat models and are generally isolated, which prevents any reuse or validation of the information. This Ph.D. aims at proposing new solutions for enhancing the representation and enrichment of cultural entities using Semantic Web technologies. The work consists of two major steps towards this objective. On the one hand, the research focuses on the metadata migration process that transforms the schema of existing knowledge catalogs into new semantic models. This study is based on a real-world case study using the concepts of the Functional Requirements for Bibliographic Records (FRBR), which make it possible to generate graph-based knowledge bases. Yet the quality of such a migration is the cornerstone of a successful adoption, so several challenges related to the tuning and the evaluation of such a process must be faced. On the other hand, the research aims at taking advantage of these semantic models to facilitate the linkage of information with external, structured sources (e.g., Linked Open Data) and at extracting additional information from other sources (e.g., microblogging) to build a new generation of thematic knowledge bases tailored to user needs. In this case, however, the aggregation of information from heterogeneous sources requires additional matching and merging steps at both schema and instance level.
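As a minimal illustration of what migrating a flat catalogue record to an FRBR-style entity model can look like, the sketch below splits one record into Work, Expression, and Manifestation entities linked by explicit relations. The sample record, the mapping rules, and the frbrise helper are invented for this example and do not correspond to the migration process developed in the thesis.

```python
# Toy "FRBRisation" of a flat catalogue record into linked entities.
# Entity and relation names follow FRBR terminology; data and rules are invented.

flat_record = {"title": "Le Petit Prince", "language": "fre",
               "year": "1946", "publisher": "Gallimard"}

def frbrise(record):
    work = {"type": "Work", "title": record["title"]}
    expression = {"type": "Expression", "language": record["language"],
                  "realises": work}
    manifestation = {"type": "Manifestation", "publisher": record["publisher"],
                     "date": record["year"], "embodies": expression}
    return [work, expression, manifestation]

for entity in frbrise(flat_record):
    attributes = {k: v for k, v in entity.items()
                  if k not in ("type", "realises", "embodies")}
    print(entity["type"], attributes)
```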
|
79 |
[en] IT STRATEGIES FOR THE ELECTRONIC INTEGRATION OF INFORMATION: A STUDY OF THE STATE OF THE ART AND THE PRACTICE / [pt] ESTRATÉGIAS DE TI PARA A INTEGRAÇÃO ELETRÔNICA DA INFORMAÇÃO: UM ESTUDO SOBRE O ESTADO DA ARTE E DA PRÁTICA. DANIEL VALENTE SERMAN, 03 March 2008.
[pt] A informação passou a ser vista, ao longo do tempo, como um insumo importante para a tomada de decisão e para a obtenção de vantagens competitivas pelas empresas. A tecnologia passou a fazer parte do cotidiano das empresas para melhor administrá-la e disseminá-la. Entretanto, nem sempre as organizações adotaram esse caminho de forma planejada. Percebe-se uma confusão no uso de conceitos e de soluções em TI, que se estende para o tema da integração eletrônica da informação. O trabalho consistiu em uma revisão da literatura sobre a integração de sistemas e de dados, verificando-se os conceitos mais comuns, as soluções mais utilizadas e as promessas encontradas. Além disso, realizou-se uma pesquisa de campo, na qual gestores expuseram em entrevistas qualitativas o que acontece na prática sobre o assunto, aludindo a benefícios, problemas e requisitos para o desenvolvimento e adoção de soluções de integração.

[en] Over time, organizations have come to see information as an important input for decision making and, when well used, for obtaining competitive advantages. Computational tools and communication technologies have become part of these organizations' daily routine in order to better manage and disseminate that information. However, those tools and technologies were not always adopted in a planned way. There is noticeable confusion in the use of IT concepts and in the adoption of IT solutions, and that problem extends to the electronic integration of information. This work consisted of a review of the literature on systems and data integration, surveying the most common concepts, the most widely used solutions and the promises made about them. In addition, a field study was carried out in which managers described, in qualitative interviews, what actually happens in practice, pointing to benefits, problems and requirements for the development and adoption of integration solutions.
|
80 |
The integration of different functional and structural plant models. Long, Qinqin, 20 May 2019.
No description available.
|