31

Optimization of the Mainzelliste software for fast privacy-preserving record linkage

Rohde, Florens; Franke, Martin; Sehili, Ziad; Lablans, Martin; Rahm, Erhard 11 February 2022
Background: Data analysis for biomedical research often requires a record linkage step to identify records from multiple data sources that refer to the same person. Because these sources lack a shared unique personal identifier, record linkage relies on the similarity of personal data such as first and last names or birth dates. However, the exchange of such identifying data with a third party, as is the case in record linkage, is generally subject to strict privacy requirements. This problem is addressed by privacy-preserving record linkage (PPRL) and pseudonymization services. Mainzelliste is an open-source record linkage and pseudonymization service used to carry out PPRL processes in real-world use cases.
Methods: We evaluate the linkage quality and performance of the linkage process using several real and near-real datasets that differ in size and in the error rate of matching records. We compare plaintext record linkage with PPRL based on encoded records (Bloom filters). Furthermore, since the Mainzelliste software offers no blocking mechanism, we extend it with phonetic blocking as well as novel blocking schemes based on locality-sensitive hashing (LSH) to improve runtime for both standard and privacy-preserving record linkage.
Results: Mainzelliste achieves high linkage quality for PPRL using field-level Bloom filters because it uses an error-tolerant matching algorithm that can handle variations in names, in particular missing or transposed name components. However, because blocking is absent, the runtimes are unacceptable for real use cases with larger datasets. The newly implemented blocking approaches improve runtimes by orders of magnitude while retaining high linkage quality.
Conclusion: We conduct the first comprehensive evaluation of the record linkage facilities of the Mainzelliste software and extend it with blocking methods to improve its runtime. We observed very high linkage quality for both plaintext and encoded data, even in the presence of errors. The added blocking methods improve runtime by orders of magnitude, which facilitates use in research projects with large datasets and many participants.
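The ideas named in this abstract (field-level Bloom filter encoding, similarity on the encoded values, and LSH-based blocking) can be illustrated with a minimal Python sketch. It is not Mainzelliste's implementation: the function names, the filter size of 256 bits, the four hash functions, the unkeyed SHA-256 hashing and the bit-sampling (Hamming) LSH variant are all assumptions chosen for illustration. A real PPRL deployment would typically use keyed hashing with a shared secret.

```python
import hashlib
import random
from collections import defaultdict


def bigrams(value):
    """Split a padded, lower-cased field value into character bigrams."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value, size=256, num_hashes=4):
    """Encode one field as the set of bit positions set in a Bloom filter.

    Assumption: unkeyed SHA-256 with a seed prefix stands in for the
    keyed hash functions a real deployment would use.
    """
    bits = set()
    for gram in bigrams(value):
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{gram}".encode()).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return frozenset(bits)


def dice(a, b):
    """Dice coefficient of two encoded fields (1.0 means identical bit sets)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def lsh_keys(bits, size=256, num_keys=8, key_len=12):
    """Hamming LSH: each blocking key is the filter's value at a fixed
    pseudo-random sample of bit positions (fixed seed, so every record
    is sampled at the same positions)."""
    rng = random.Random(42)
    keys = []
    for k in range(num_keys):
        positions = rng.sample(range(size), key_len)
        keys.append((k, tuple(p in bits for p in positions)))
    return keys


def candidate_pairs(encoded):
    """Return pairs of record ids that share at least one LSH key."""
    blocks = defaultdict(set)
    for record_id, bits in encoded.items():
        for key in lsh_keys(bits):
            blocks[key].add(record_id)
    pairs = set()
    for ids in blocks.values():
        ordered = sorted(ids)
        for i, left in enumerate(ordered):
            for right in ordered[i + 1:]:
                pairs.add((left, right))
    return pairs
```

Only the pairs returned by `candidate_pairs` would then be compared with `dice`, which is where the runtime saving over an all-pairs comparison comes from. Similar names such as "meyer" and "meier" share most bigrams and therefore most set bits, so they score far higher than unrelated names.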
32

Record Linkage

Larsen, Stasha Ann Bown 11 December 2013
This document explains the use of different metrics involved with record linkage. There are two forms of record linkage: deterministic and probabilistic. We will focus on probabilistic record linkage used in merging and updating two databases. Record pairs will be compared using character-based and phonetic-based similarity metrics to determine at what level they match. Performance measures are then calculated and Receiver Operating Characteristic (ROC) curves are formed. Finally, an economic model is applied that returns the optimal tolerance level two databases should use to determine a record pair match in order to maximize profit.
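As an illustration of the workflow this abstract outlines, the sketch below scores record pairs with a character-based metric and derives ROC points for a range of tolerance levels. It is a minimal example under stated assumptions: normalized Levenshtein similarity stands in for whichever metrics the thesis uses, the function names are hypothetical, and the economic model is not reproduced because the abstract does not specify it.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Normalize the edit distance to a score in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def roc_points(scored_pairs, thresholds):
    """Compute (threshold, false positive rate, true positive rate).

    scored_pairs is a list of (score, is_true_match) tuples; it must
    contain at least one true match and one true non-match.
    """
    positives = sum(1 for _, match in scored_pairs if match)
    negatives = len(scored_pairs) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for s, match in scored_pairs if s >= t and match)
        fp = sum(1 for s, match in scored_pairs if s >= t and not match)
        points.append((t, fp / negatives, tp / positives))
    return points
```

Plotting the false positive rate against the true positive rate across thresholds gives the ROC curve; a tolerance level is then one point on that curve.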
33

Uso da técnica de linkage nos sistemas de informação em saúde: aplicação na base de dados do Registro de Câncer de base populacional do município de São Paulo / The use of the linkage technique in health information systems: application in the database of the São Paulo Population-based Cancer Registry

Peres, Stela Verzinhasse 07 December 2011
The availability of large computerized health databases has made record linkage, the technique of relating data sources, an alternative for different types of studies. The technique yields a more complete database at low operational cost.
Objective: To investigate the possibility of completing and improving the information in the database of the São Paulo Population-based Cancer Registry (RCBP-SP) for the period 1997 to 2005, using record linkage with three other databases: the Mortality Improvement Program (PRO-AIM), the Authorization of Highly Complex Procedures (APAC-SIA/SUS) and the State System of Data Analysis Foundation (FSeade), comparing different strategies.
Methods: The study used the RCBP-SP database, composed of 343,306 incident cancer cases in the municipality of São Paulo registered from 1997 to 2005, in patients of both sexes aged from under one to 106 years. The PRO-AIM, APAC-SIA/SUS and FSeade databases were used to complete the RCBP-SP information. Both probabilistic and deterministic record linkage were used. Probabilistic linkage was performed with the Reclink III software, version 3.1.6. For deterministic linkage, the routines were written in Visual Basic, with the databases hosted on SQL Server. Crude incidence rates (CIR) and crude mortality rates (CMR) were calculated before and after linkage. Overall survival was analyzed with the Kaplan-Meier technique, and the curves were compared with the log-rank test. The area under the curve, sensitivity and specificity were calculated to determine the score cut-off that most precisely identifies true matches.
Results: After linkage, there was a gain of 101.5 per cent for the address variable, 31.5 per cent for the date of death and 80.0 per cent for the date of latest information. The mother's name variable was present in only 0.5 per cent of the RCBP-SP records before linkage and was completed in 76,332 records overall. The overall survival analysis showed that, before linkage, the probability of being alive was underestimated in all periods assessed. For survival truncated at seven years, the probability of being alive in the first year of follow-up was lower before linkage than after linkage (48.8 per cent versus 61.1 per cent; p < 0.001).
Conclusion: Both probabilistic and deterministic record linkage were effective in completing and improving the information in the RCBP-SP database. In addition, the CIR showed a gain of 3.4 per cent and the CMR a gain of 25.8 per cent. After record linkage, overall survival was found to have been underestimated for both sexes, all age groups and all cancer sites.
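The overall survival figures in this abstract come from the Kaplan-Meier technique. The sketch below shows the product-limit calculation on toy data. It is a minimal illustration only: the function name and inputs are assumptions, the numbers are not from the study, and the log-rank comparison between curves is not included.

```python
def kaplan_meier(durations, events):
    """Product-limit estimate of the survival function.

    durations: follow-up time for each patient.
    events: True if death was observed at that time, False if the
            patient was censored (still alive at last information).
    Returns a list of (time, survival probability) at each death time.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve
```

Because the estimate depends on both death dates and dates of last information, completing those fields through linkage changes which patients count as deaths and how long censored patients remain at risk.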
34

Identificação única de pacientes em fontes de dados distribuídas e heterogêneas / Unique patient identification in distributed and heterogeneous data sources

Soares, Vinícius de Freitas 25 August 2009
Over the course of a lifetime, a patient is seen by several healthcare institutions and undergoes a series of procedures. The amount of information stored about that patient grows in both volume and diversity. In addition, the same patient may have different identifiers, which leads to high costs from duplicated procedures and contributes to imprecise diagnoses and treatments. This work therefore uses record linkage techniques and the generation of an MPI (Master Patient Index), combined with the specifications of the PIX (Patient Identifier Cross-Referencing) integration profile, to establish a unique identification of patients across different health information systems that contain heterogeneous and distributed data sources. Using these concepts and technologies, a design was specified and a prototype of an IHE (Integrating the Healthcare Enterprise)/PIX was developed. Experiments were carried out in three scenarios with real data.
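The core idea of a master patient index, grouping the identifiers that different systems assign to the same person, can be sketched with a small union-find structure. This is only a data-structure illustration under stated assumptions: the class and method names are hypothetical, it does not implement the IHE PIX transactions, and it is not the prototype described in the dissertation.

```python
class MasterPatientIndex:
    """Groups (domain, local_id) identifiers that refer to one patient."""

    def __init__(self):
        self._parent = {}

    def register(self, domain, local_id):
        """Add an identifier issued by one source system."""
        key = (domain, local_id)
        self._parent.setdefault(key, key)
        return key

    def _find(self, key):
        # Follow parent links to the group representative,
        # shortening the path along the way.
        while self._parent[key] != key:
            self._parent[key] = self._parent[self._parent[key]]
            key = self._parent[key]
        return key

    def link(self, a, b):
        """Record that two identifiers belong to the same patient,
        for example after a record linkage step matched them."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def cross_reference(self, domain, local_id):
        """Return the other identifiers known for the same patient."""
        key = (domain, local_id)
        root = self._find(key)
        return sorted(k for k in list(self._parent)
                      if k != key and self._find(k) == root)
```

A source system would call `register` for each local identifier, the linkage process would call `link` for accepted matches, and a consumer would call `cross_reference` to translate an identifier from one domain into the others.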
36

Semi-automated co-reference identification in digital humanities collections

Croft, David January 2014
Locating specific information within museum collections represents a significant challenge for collection users. Even when the collections and catalogues exist in a searchable digital format, formatting differences and the imprecise nature of the information to be searched mean that information can be recorded in a large number of different ways. This variation exists not just between different collections, but also within individual ones. As a result, traditional information retrieval techniques are badly suited to locating particular information in digital humanities collections, and searching takes an excessive amount of time and resources.
This thesis focuses on a particular search problem, that of co-reference identification: the process of identifying when the same real-world item is recorded in multiple digital locations. A real-world example of a co-reference identification problem for digital humanities collections is identified and explored, in particular the time-consuming nature of identifying co-referent records. To address this problem, the thesis presents a novel method for co-reference identification between digitised records in humanities collections. Whilst the specific focus is co-reference identification, elements of the method also have applications for general information retrieval. The new method uses elements from a broad range of areas, including query expansion, co-reference identification, short text semantic similarity and fuzzy logic.
The new method was tested against real-world collections information. The results suggest that, in terms of the quality of the co-referent matches found, the new method is at least as effective as a manual search, while the number of co-referent matches found is higher. The approach is capable of searching collections stored using differing metadata schemas. More significantly, it is capable of identifying potential co-reference matches despite the highly heterogeneous and syntax-independent nature of the Gallery, Library, Archive and Museum (GLAM) search space, and the photo-history domain in particular. The most significant benefit of the new method is, however, that it requires comparatively little manual intervention. A co-reference search using it therefore requires significantly fewer person-hours than a manually conducted search.
In addition to the overall co-reference identification method, this thesis also presents:
• A novel and computationally lightweight short text semantic similarity metric. This new metric has a significantly higher throughput than the current prominent techniques, with a negligible drop in accuracy.
• A novel method for comparing photographic processes in the presence of variable terminology and inaccurate field information. This is the first computational approach to do so.
37

Monitoramento de doadores de sangue através de integração de bases de texto heterogêneas / Blood donor monitoring through the integration of heterogeneous text databases

Pinha, André Teixeira January 2016
Advisor: Prof. Dr. Márcio Katsumi Oikawa / Dissertation (master's) - Universidade Federal do ABC, Programa de Pós-Graduação em Ciência da Computação, 2016.
Probabilistic record linkage between databases makes it possible to obtain information that individual or manual analysis of the databases would not provide. This work aims to find, through probabilistic record linkage, blood donors from the database of the Fundação Pró-Sangue (FPS) in the Brazilian Sistema de Informações sobre Mortalidade (SIM, the mortality information system) for the years 2001 to 2006. Inferring whether a given donor has died supports the institution's management of blood products. To this end, we evaluated the effectiveness of different blocking keys, applied in a set of free record linkage software packages and in software implemented specifically for this study, named SortedLink. The records were standardized, and only those with registered information about the mother were used. To assess the effectiveness of the blocking keys, 100,000 records were selected at random from each of the SIM and FPS databases, and 30 validation records were added to each set. SortedLink showed the best results and was therefore used to obtain the candidate record pairs for the full databases: 1,709,819 records from SIM and 334,077 from FPS. In addition, the study evaluates the efficiency of two phonetic encoding algorithms: SOUNDEX, typically used in record linkage, and BRSOUND, developed for encoding given names and surnames from Brazilian Portuguese.
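Blocking with a phonetic key, as evaluated in this study, can be sketched as follows. The example implements standard American Soundex and uses it to restrict comparisons between two databases to records that share a blocking key. It is a minimal sketch under stated assumptions: the record field names and the choice of key (Soundex of the mother's name plus birth year) are hypothetical, BRSOUND is not implemented, and this is not the SortedLink software.

```python
from collections import defaultdict

# Soundex digit for each consonant; vowels, h, w and y have no digit.
_CODES = {letter: digit
          for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                                 "4": "l", "5": "mn", "6": "r"}.items()
          for letter in letters}


def soundex(name):
    """Four-character American Soundex code of a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "hw":  # h and w do not separate equal codes
            prev = digit
    return (code + "000")[:4]


def blocking_key(record):
    """Hypothetical key: phonetic code of the mother's name plus birth year."""
    return (soundex(record["mother_name"]), record["birth_year"])


def candidate_pairs(left, right, key_fn=blocking_key):
    """Pair each left record only with right records in the same block."""
    blocks = defaultdict(list)
    for record in right:
        blocks[key_fn(record)].append(record)
    return [(l, r) for l in left for r in blocks.get(key_fn(l), [])]
```

With two databases of the sizes reported above, comparing only records within the same block avoids the full cross product, at the cost of missing true matches whose keys differ. That trade-off is what an evaluation of blocking keys measures.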
38

Condições de risco ao nascer relacionadas aos critérios de near miss neonatal : estudo de linkage entre o SINASC e o SIM no estado de Sergipe / Conditions of risk at birth related to near miss neonatal criteria : linkage study between SINASC and SIM in the state of Sergipe

Silva, Márcia Estela Lopes da 29 August 2018
Technological advances have contributed to the survival of newborns considered to be at risk of infant mortality. The concept of neonatal near miss, defined as a newborn who presented a severe complication at birth but survived the neonatal period, was introduced to focus further studies on infants who overcame probable causes of early neonatal death and to evaluate the conditions of perinatal care.
Objective: To identify the risk conditions at birth related to the neonatal near miss criteria through secondary database analysis based on a linkage between SINASC and SIM, for the period 2011 to 2016 in the State of Sergipe.
Methodology: An analytical retrospective cohort study with analysis of secondary data in a historical series from the official databases, through the linkage between SINASC and SIM. Data were collected on live births to residents of the State of Sergipe, and the sample comprised all newborns meeting the early neonatal near miss criteria: gestational age less than 31 weeks, birth weight less than 1,500 g and APGAR score at the fifth minute below 7. Variables present in SINASC (sociodemographic, obstetric and newborn) were selected, categorized and analyzed within the sample. The final statistical analysis determined the relative risk and its confidence interval for each variable, identifying the probability of the outcome selected by the study.
Results: The variables most strongly related to risk conditions at birth, with a relative risk greater than 1, were: birth in the capital, home birth, maternal illiteracy or low schooling, maternal age under 20 or over 35 years, previous fetal losses or abortions, prenatal care with 6 or fewer visits, a number of prenatal visits inadequate for the gestational age at the start of follow-up, multiple gestation, non-cephalic presentation and congenital anomaly. The variables identified as protective factors, with a relative risk less than 1, were: presence of a partner, maternal age between 20 and 35 years, maternal schooling of more than 4 years, cesarean delivery and induced labor. The variables unrelated to the outcome, with a confidence interval for the relative risk that includes 1, were maternal color/race and the sex of the newborn.
Conclusion: The variables identified as risk factors are consistent with what the literature has described. The study is therefore intended to support other studies on the subject and to contribute evidence for the design of programs and public policies aimed at reducing infant morbidity and mortality. / Aracaju
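The classification of variables into risk factors, protective factors and unrelated variables rests on the relative risk and its confidence interval. The sketch below shows that calculation for a single two-by-two table. It is a minimal illustration under stated assumptions: the function names are hypothetical, the 95 per cent interval uses the common log-transform (Katz) approximation, which the abstract does not confirm as the study's method, and the counts in the usage note are invented.

```python
import math


def relative_risk(exposed_cases, exposed_total,
                  unexposed_cases, unexposed_total, z=1.96):
    """Relative risk with a log-based confidence interval.

    Assumes all four counts are positive. Returns (rr, lower, upper).
    """
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    rr = risk_exposed / risk_unexposed
    # Standard error of ln(RR) under the log-transform approximation.
    se = math.sqrt(1 / exposed_cases - 1 / exposed_total
                   + 1 / unexposed_cases - 1 / unexposed_total)
    lower = math.exp(math.log(rr) - z * se)
    upper = math.exp(math.log(rr) + z * se)
    return rr, lower, upper


def classify(lower, upper):
    """Interpret the interval the way the abstract groups its variables."""
    if lower > 1:
        return "risk factor"
    if upper < 1:
        return "protective factor"
    return "no clear association"
```

For example, `relative_risk(30, 100, 10, 100)` gives a relative risk of 3.0 with an interval lying entirely above 1, so `classify` would label that hypothetical exposure a risk factor.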
39

Création d'un environnement de gestion de base de données "en grille" : application à l'échange de données médicales / Creating a "grid" database management environment : application to medical data exchange

De Vlieger, Paul 12 July 2011
La problématique du transport de la donnée médicale, de surcroît nominative, comporte de nombreuses contraintes, qu’elles soient d’ordre technique, légale ou encore relationnelle. Les nouvelles technologies, issues particulièrement des grilles informatiques, permettent d’offrir une nouvelle approche au partage de l’information. En effet, le développement des intergiciels de grilles, notamment ceux issus du projet européen EGEE, ont permis d’ouvrir de nouvelles perspectives pour l’accès distribué aux données. Les principales contraintes d’un système de partage de données médicales, outre les besoins en termes de sécurité, proviennent de la façon de recueillir et d’accéder à l’information. En effet, la collecte, le déplacement, la concentration et la gestion de la donnée, se fait habituellement sur le modèle client-serveur traditionnel et se heurte à de nombreuses problématiques de propriété, de contrôle, de mise à jour, de disponibilité ou encore de dimensionnement des systèmes. La méthodologie proposée dans cette thèse utilise une autre philosophie dans la façon d’accéder à l’information. En utilisant toute la couche de contrôle d’accès et de sécurité des grilles informatiques, couplée aux méthodes d’authentification robuste des utilisateurs, un accès décentralisé aux données médicales est proposé. Ainsi, le principal avantage est de permettre aux fournisseurs de données de garder le contrôle sur leurs informations et ainsi de s’affranchir de la gestion des données médicales, le système étant capable d’aller directement chercher la donnée à la source.L’utilisation de cette approche n’est cependant pas complètement transparente et tous les mécanismes d’identification des patients et de rapprochement d’identités (data linkage) doivent être complètement repensés et réécris afin d’être compatibles avec un système distribué de gestion de bases de données. 
Le projet RSCA (Réseau Sentinelle Cancer Auvergne – www.e-sentinelle.org) constitue le cadre d’application de ce travail. Il a pour objectif de mutualiser les sources de données auvergnates sur le dépistage organisé des cancers du sein et du côlon. Les objectifs sont multiples : permettre, tout en respectant les lois en vigueur, d’échanger des données cancer entre acteurs médicaux et, dans un second temps, offrir un support à l’analyse statistique et épidémiologique. / The exchange of nominative medical data is a growing challenge that faces numerous technical, legal, and relational barriers. New technologies, particularly in the field of grid computing, offer a new approach to medical data exchange. The development of the gLite grid middleware within the EGEE project has opened new perspectives for distributed data access and database federation. Apart from the high level of security needed, the main requirements of a medical data exchange system stem from the way data are collected and provided. The traditional client-server model has many drawbacks regarding data ownership, updates, control, availability, and scalability. The method described in this dissertation takes a different approach to accessing medical data. Using the grid security layer and a robust user authentication and access control system, we build a dedicated grid network able to federate distributed medical databases. In this way, data owners keep control over the data they produce. This approach is, however, not entirely straightforward, especially for patient identification and medical data linkage, which remains an open problem even in centralized medical systems. A new method is therefore proposed to handle these specific issues in a highly distributed environment. The Sentinelle project (RSCA) provides the application framework for this work, in the field of cancer screening in the Auvergne region of France.
The first objective is to allow the exchange of anatomic pathology reports between laboratories and screening structures, in compliance with pathologists’ requirements and legal constraints. The second is to provide a framework that gives epidemiologists access to high-quality medical data for statistical and epidemiological studies.
40

Evaluation et amélioration des méthodes de chaînage de données / Evaluation and improvement of data chaining methods

Li, Xinran 29 January 2015 (has links)
Le chaînage d’enregistrements est la tâche qui consiste à identifier parmi différentes sources de données les enregistrements qui concernent les mêmes entités. En l'absence de clé d’identification commune, cette tâche peut être réalisée à l’aide d’autres champs contenant des informations d’identifications, mais dont malheureusement la qualité n’est pas parfaite. Pour ce faire, de nombreuses méthodes dites « de chaînage de données » ont été proposées au cours des dernières décennies. Afin d’assurer le chaînage valide et rapide des enregistrements des mêmes patients dans le cadre de GINSENG, projet qui visait à mettre en place une infrastructure de grille informatique pour le partage de données médicales distribuées, il a été nécessaire d’inventorier, d’étudier et parfois d’adapter certaines des diverses méthodes couramment utilisées pour le chaînage d’enregistrements. Citons notamment les méthodes de comparaison approximative des champs d’enregistrement selon leurs épellations et leurs prononciations, les chaînages déterministe et probabiliste d’enregistrements, ainsi que leurs extensions. Ces méthodes comptent des avantages et des inconvénients qui sont ici clairement exposés. Dans la pratique, les champs à comparer étant souvent imparfaits du fait d’erreurs typographiques, notre intérêt porte particulièrement sur les méthodes probabilistes de chaînage d’enregistrements. L’implémentation de ces méthodes probabilistes proposées par Fellegi et Sunter (PRL-FS) et par Winkler (PRL-W) est précisément décrite, ainsi que leur évaluation et comparaison.
La vérité des correspondances des enregistrements étant indispensable à l’évaluation de la validité des résultats de chaînages, des jeux de données synthétiques sont générés dans ce travail et des algorithmes paramétrables proposés et détaillés. Bien qu’à notre connaissance, le PRL-W soit une des méthodes les plus performantes en termes de validité de chaînages d’enregistrements en présence d’erreurs typographiques dans les champs contenant les traits d’identification, il présente cependant quelques caractéristiques perfectibles. Le PRL-W ne permet par exemple pas de traiter de façon satisfaisante le problème de données manquantes. Notons également qu’il s’agit d’une méthode dont l’implémentation n’est pas simple et dont les temps de réponse sont difficilement compatibles avec certains usages de routine. Certaines solutions ont été proposées et évaluées pour pallier ces difficultés, notamment plusieurs approches permettant d’améliorer l’efficacité du PRL-W en présence de données manquantes et d’autres destinées à optimiser les temps de calculs de cette méthode en veillant à ce que cette réduction du temps de traitement n’entache pas la validité des décisions de chaînage issues de cette méthode. / Record linkage is the task of identifying which records from different data sources refer to the same entities. In the absence of a common identification key across databases, this task can be performed by comparing the corresponding fields (containing identifying information) of the records to be linked. To this end, many record linkage methods have been proposed over the last decades. In order to ensure valid and fast linkage of records belonging to the same patients in GINSENG, a research project that aimed to implement a grid computing infrastructure for sharing medical data, we first studied various commonly used record linkage methods.
These include methods for the approximate comparison of record fields according to their spelling and pronunciation, as well as deterministic and probabilistic record linkage and their extensions. The advantages and disadvantages of these methods are clearly set out. In practice, as the fields to compare are often subject to typographical errors, we focused on probabilistic record linkage. The implementation of the probabilistic methods proposed by Fellegi and Sunter (PRL-FS) and by Winkler (PRL-W) is described in detail, along with their evaluation and comparison. Because the true match status of records must be known to evaluate linkage results, synthetic data sets were used in this work, and a configurable algorithm for generating them is proposed. To our knowledge, PRL-W is one of the most effective methods in terms of linkage validity in the presence of typographical errors in the fields. However, PRL-W does not handle missing data in the fields satisfactorily, its implementation is complex, and its computation time limits its suitability for routine use. Solutions are proposed here to improve the effectiveness of PRL-W in the presence of missing data. Other solutions are tested to simplify the PRL-W algorithm and reduce computation time while preserving optimal linkage accuracy.
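
For readers unfamiliar with probabilistic record linkage, the following is a minimal sketch of the basic Fellegi-Sunter scoring idea mentioned in the abstract above. It is not the thesis's implementation and does not cover Winkler's extension (PRL-W): fields are compared by exact agreement only, and the field names, the m/u probabilities, and the two thresholds are hypothetical values chosen purely for illustration.

```python
import math

def field_weight(agrees, m, u):
    """Log2 likelihood-ratio weight for one compared field.

    m: assumed probability that the field agrees when two records truly match.
    u: assumed probability that the field agrees when they do not match.
    """
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))

def match_score(record_a, record_b, params):
    """Sum the field weights over all fields listed in params."""
    return sum(
        field_weight(record_a.get(field) == record_b.get(field), m, u)
        for field, (m, u) in params.items()
    )

def classify(score, lower, upper):
    """Three-way decision using two thresholds (hypothetical values)."""
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "possible match"

# Hypothetical m/u probabilities per field (illustrative, not estimated from data).
PARAMS = {
    "last_name": (0.95, 0.01),
    "first_name": (0.90, 0.02),
    "birth_year": (0.90, 0.05),
}

a = {"last_name": "martin", "first_name": "claire", "birth_year": 1970}
b = {"last_name": "martin", "first_name": "claire", "birth_year": 1970}
c = {"last_name": "durand", "first_name": "paul", "birth_year": 1985}

print(classify(match_score(a, b, PARAMS), lower=0.0, upper=8.0))
print(classify(match_score(a, c, PARAMS), lower=0.0, upper=8.0))
```

In practice the m and u probabilities are estimated from the data rather than fixed by hand, and approximate string comparators replace the exact equality test used here.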
