111

A Model for Managing Data Integrity

Mallur, Vikram 22 September 2011 (has links)
Consistent, accurate and timely data are essential to the functioning of a modern organization. Managing the integrity of an organization’s data assets in a systematic manner is a challenging task in the face of continuous update, transformation and processing to support business operations. Classic approaches to constraint-based integrity focus on logical consistency within a database and reject any transaction that violates consistency, but leave unresolved how to fix or manage violations. More ad hoc approaches focus on the accuracy of the data and attempt to clean data assets after the fact, using queries to flag records with potential violations and manual effort to repair them. Neither approach satisfactorily addresses the problem from an organizational point of view. In this thesis, we provide a conceptual model of constraint-based integrity management (CBIM) that flexibly combines both approaches in a systematic manner to provide improved integrity management. We perform a gap analysis that examines the criteria that are desirable for efficient management of data integrity. Our approach involves creating a Data Integrity Zone and an On Deck Zone in the database to separate clean data from data that violates integrity constraints. We provide tool support for specifying constraints in a tabular form and generating triggers that flag violations of dependencies. We validate this approach through case studies of two systems used to manage healthcare data: PAL-IS and iMED-Learn. Our case studies show that using views to implement the zones does not cause any significant increase in the running time of a process.
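To make the two-zone idea concrete, here is a minimal Python sketch (with invented record fields and constraints; the thesis itself realizes the zones with database views and generated triggers) of routing records into a Data Integrity Zone or an On Deck Zone:

```python
# Minimal sketch (assumed names and constraints, not the thesis's implementation):
# records that satisfy every declared constraint go to the Data Integrity Zone,
# violators go to the On Deck Zone for later repair instead of being rejected.
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Constraint = Callable[[Record], bool]

def partition_records(records: List[Record],
                      constraints: List[Constraint]) -> Tuple[List[Record], List[Record]]:
    """Split records into (data_integrity_zone, on_deck_zone) by constraint checks."""
    integrity_zone: List[Record] = []
    on_deck_zone: List[Record] = []
    for rec in records:
        if all(check(rec) for check in constraints):
            integrity_zone.append(rec)
        else:
            on_deck_zone.append(rec)  # retained for managed repair, not discarded
    return integrity_zone, on_deck_zone

# Illustrative constraints: a domain constraint and a not-null dependency.
constraints: List[Constraint] = [
    lambda r: r.get("age") is None or 0 <= r["age"] <= 120,
    lambda r: r.get("patient_id") is not None,
]

clean, pending = partition_records(
    [{"patient_id": 1, "age": 42}, {"patient_id": None, "age": 42}],
    constraints,
)
print(len(clean), len(pending))  # 1 1
```

The key difference from classic constraint enforcement is that violating records are kept and managed rather than rejected outright.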
112

Integration of vector datasets

Hope, Susannah Jayne January 2008 (has links)
As the spatial information industry moves from an era of data collection to one of data maintenance, new integration methods to consolidate or to update datasets are required. These must reduce the discrepancies that are becoming increasingly apparent when spatial datasets are overlaid. It is essential that any such methods consider the quality characteristics of, firstly, the data being integrated and, secondly, the resultant data. This thesis develops techniques that give due consideration to data quality during the integration process.
113

Critical success factors for accounting information systems data quality

Xu, Hongjiang January 2003 (has links)
Quality information is critical to organisations’ success in today’s highly competitive environment. Accounting information systems (AIS), as a discipline within information systems, require high quality data. However, empirical evidence suggests that data quality is problematic in AIS. Therefore, knowledge of the critical factors that are important in ensuring data quality in accounting information systems is desirable. A literature review evaluates previous research work in quality management, data quality, and accounting information systems. It found a gap in the literature concerning critical success factors for data quality in accounting information systems. Based on this gap and the findings of the exploratory stage of the research, a preliminary research model of the factors influencing data quality in AIS was developed. A framework for understanding relationships between stakeholder groups and data quality in accounting information systems was also developed. The major stakeholders are information producers, information custodians, information managers, information users, and internal auditors. Case study and survey methodologies were adopted for this research. Case studies were carried out in seven Australian organisations, four of them large organisations and the other three small to medium organisations (SMEs). Each case was examined as a whole to obtain an understanding of the opinions and perspectives of the respondents from each individual organisation as to which factors were considered important in that case. Cross-case analysis was then used to analyse the similarities and differences across the seven cases, including the variations between large organisations and SMEs. The variations between the five stakeholder groups were also examined. The results of the seven main case studies suggested 26 factors that may have an impact on data quality in AIS. A survey instrument was developed based on the findings from the case studies. Two large-scale surveys were sent to selected members of the Australian CPA and the Australian Computer Society to further develop and test the research framework. The major findings from the survey are: (1) respondents rated the importance of the factors consistently higher than the actual performance of those factors; (2) only one factor, ‘audit and reviews’, was found to differ between different sized organisations; (3) four factors were found to differ significantly between stakeholder groups: user focus, measurement and reporting, data supplier quality management, and audit and reviews; and (4) the top three critical factors for ensuring data quality in AIS were top management commitment, education and training, and the nature of the accounting information systems. The key contribution of this thesis is the theoretical framework developed from the analysis of the findings of this research, which is the first such framework built upon an empirical study exploring the factors influencing data quality in AIS and their interrelationships with stakeholder groups and data quality outcomes. That is, it is now clear which factors impact data quality in AIS, and which of those are critical success factors for ensuring high quality information outcomes. In addition, the performance level of the factors was also incorporated into the research framework. Since the actual performance of factors has not been highlighted in other studies, this research adds new theoretical insights to the extant literature. In turn, this research confirms some of the factors mentioned in the literature and adds a few new ones. Moreover, stakeholder groups are important considerations in AIS data quality and need more attention. The research framework shows the relationship between stakeholder groups, important factors and data quality outcomes by highlighting the stakeholder groups’ influence on identifying the important factors, as well as on evaluating the importance and performance of those factors.
114

Genomic data analyses for population history and population health

Bycroft, Clare January 2017 (has links)
Many of the patterns of genetic variation we observe today have arisen via the complex dynamics of interactions and isolation of historic human populations. In this thesis, we focus on two important features of the genetics of populations that can be used to learn about human history: population structure and admixture. The Iberian peninsula has a complex demographic history, as well as rich linguistic and cultural diversity. However, previous studies using small genomic regions (such as the Y-chromosome and mtDNA) as well as genome-wide data have so far detected limited genetic structure in Iberia. Larger datasets and powerful new statistical methods that exploit information in the correlation structure of nearby genetic markers have made it possible to detect and characterise genetic differentiation at fine geographic scales. We performed the largest and most comprehensive study of Spanish population structure to date by analysing genotyping array data for ~1,400 Spanish individuals genotyped at ~700,000 polymorphic loci. We show that at broad scales, the major axis of genetic differentiation in Spain runs from west to east, while there is remarkable genetic similarity in the north-south direction. Our analysis also reveals striking patterns of geographically-localised and subtle population structure within Spain at scales down to tens of kilometres. We developed and applied new approaches to show how this structure has arisen from a complex and regionally-varying mix of genetic isolation and recent gene-flow within and from outside of Iberia. To further explore the genetic impact of historical migrations and invasions of Iberia, we assembled a data set of 2,920 individuals (~300,000 markers) from Iberia and the surrounding regions of north Africa, Europe, and sub-Saharan Africa. Our admixture analysis implies that north African-like DNA in Iberia was mainly introduced in the earlier half (860 - 1120 CE) of the period of Muslim rule in Iberia, and we estimate that the closest modern-day equivalents to the initial migrants are located in Western Sahara. We also find that north African-like DNA in Iberia shows striking regional variation, with near-zero contributions in the Basque regions, low levels (~3%) in the north east of Iberia, and levels as high as ~11% in Galicia and Portugal. The UK Biobank project is a large prospective cohort study of ~500,000 individuals from across the United Kingdom, aged between 40 and 69 at recruitment. A rich variety of phenotypic and health-related information is available on each participant, making the resource unprecedented in its size and scope. Understanding the role that genetics plays in phenotypic variation, and its potential interactions with other factors, provides a critical route to a better understanding of human biology and population health. As such, a key component of the UK Biobank resource has been the collection of genome-wide genetic data (~805,000 markers) on every participant using purpose-designed genotyping arrays. These data are the focus of the second part of this thesis. In particular, we designed and implemented a quality control (QC) pipeline in support of the current and future use of this multi-purpose resource. Genotype data on this scale offer novel opportunities for assessing quality issues, although the wide range of ancestral backgrounds in the cohort also creates particular challenges.
We also conducted a set of analyses that reveal properties of the genetic data, including population structure and familial relatedness, that can be important for downstream analyses. We find that cryptic relatedness is common among UK Biobank participants (~30% have at least one first cousin relative or closer), and a full range of human population structure is present in this cohort: from world-wide ancestral diversity to subtle population structure at sub-national geographic scales. Finally, we performed a genome-wide association scan on a well-studied and highly polygenic phenotype: standing height. This provided a further test of the effectiveness of our QC, as well as highlighting the potential of the resource to uncover novel regions of association.
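As a rough illustration of how broad population structure can be read from genotype data, the sketch below computes principal components of a standardised genotype matrix on toy data; this is a generic first-pass method, not the haplotype-based analyses or the UK Biobank QC pipeline described in the thesis.

```python
# A generic first-pass sketch (toy data, not the haplotype-based methods or the
# UK Biobank QC pipeline from the thesis): principal components of a standardised
# genotype matrix expose broad axes of population structure.
import numpy as np

def genotype_pcs(genotypes: np.ndarray, n_pcs: int = 10) -> np.ndarray:
    """genotypes: (n_individuals, n_snps) array of 0/1/2 allele counts."""
    freqs = genotypes.mean(axis=0) / 2.0                   # per-SNP allele frequency
    centred = genotypes - 2.0 * freqs
    sd = np.sqrt(np.maximum(2.0 * freqs * (1.0 - freqs), 1e-8))  # guard monomorphic SNPs
    u, s, _ = np.linalg.svd(centred / sd, full_matrices=False)
    return u[:, :n_pcs] * s[:n_pcs]                        # PC scores per individual

rng = np.random.default_rng(0)
toy = rng.integers(0, 3, size=(50, 200)).astype(float)    # 50 individuals, 200 SNPs
pcs = genotype_pcs(toy, n_pcs=2)
print(pcs.shape)  # (50, 2)
```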
115

An investigation into improving the repeatability of steady-state measurements from nonlinear systems : methods for measuring repeatable data from steady-state engine tests were evaluated : a comprehensive and novel approach to acquiring high quality steady-state emissions data was developed

Dwyer, Thomas Patrick January 2014 (has links)
The calibration of modern internal combustion engines requires measurement data of ever-improving quality so that engines comply with increasingly stringent emissions legislation. This study establishes a methodology and a software tool to improve the quality of steady-state emissions measurements from engine dynamometer tests. The literature shows that state-of-the-art instrumentation is necessary to monitor the cycle-by-cycle variations that significantly alter emissions measurements. Test methodologies that consider emissions formation mechanisms invariably focus on thermal transients and preconditioning of internal surfaces. This work sought data quality improvements using three principal approaches: an adapted steady-state identifier to more reliably indicate when the test conditions had reached steady-state; engine preconditioning to reduce the influence of the prior day’s operating conditions on the measurements; and test point ordering to reduce measurement deviation. An improved steady-state indicator was selected using correlations in test data. It was shown, by repeating forty steady-state test points, that a more robust steady-state indicator has the potential to reduce the measurement deviation of particulate number by 6%, unburned hydrocarbons by 24%, carbon monoxide by 10% and oxides of nitrogen by 29%. The variation of emissions measurements from those normally observed at a repeat baseline test point was significantly influenced by varying the preconditioning power. Preconditioning at the baseline operating condition brought emissions measurements into line with the mean of those typically observed. Changing the sequence of steady-state test points caused significant differences in the measured engine performance. Examining the causes of measurement deviation allowed an optimised test point sequencing method to be developed. A 30% reduction in measurement deviation of a targeted engine response (particulate number emissions) was obtained using the developed test methodology. This was achieved by selecting an appropriate steady-state indicator and sequencing test points. The benefits of preconditioning were deemed short-lived and impractical to apply in everyday engine testing, although its principles were considered when developing the sequencing methodology.
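A minimal sketch of what a steady-state identifier can look like is given below; the rolling window and tolerance are arbitrary illustrative values, not the indicator or thresholds selected in the study.

```python
# Illustrative steady-state identifier (window length and tolerance are arbitrary,
# not the values selected in the study): a signal is flagged as steady once the
# standard deviation over the trailing window stays below a tolerance.
import numpy as np

def is_steady(signal: np.ndarray, window: int = 30, tol: float = 0.5) -> np.ndarray:
    """Return a boolean array marking samples whose trailing window is within tolerance."""
    steady = np.zeros(len(signal), dtype=bool)
    for i in range(window, len(signal) + 1):
        steady[i - 1] = signal[i - window:i].std() < tol
    return steady

# Toy measurement trace: a transient ramp settling onto a noisy plateau.
t = np.arange(200)
rng = np.random.default_rng(1)
trace = np.where(t < 80, 300.0 + 2.0 * t, 460.0) + rng.normal(0.0, 0.3, size=200)
print(int(np.argmax(is_steady(trace))))  # index of the first sample judged steady
```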
116

Experience in Data Quality Assessment on Archived Historical Freeway Traffic Data

January 2011 (has links)
Concern regarding the quality of traffic data exists among engineers and planners tasked with obtaining and using the data for various transportation applications. While data quality issues are often understood by analysts doing the hands-on work, rarely are the quality characteristics of the data effectively communicated beyond the analyst. This research is an exercise in measuring and reporting data quality. The assessment was conducted to support the performance measurement program at the Maricopa Association of Governments in Phoenix, Arizona, and investigates the traffic data from 228 continuously monitoring freeway sensors in the metropolitan region. Results of the assessment provide an example of describing the quality of the traffic data with each of six data quality measures suggested in the literature: accuracy, completeness, validity, timeliness, coverage and accessibility. An important contribution is made in the use of data quality visualization tools. These visualization tools are used in evaluating the validity of the traffic data beyond the pass/fail criteria commonly used. More significantly, they serve to build an intuitive understanding of the underlying characteristics of the data considered valid. Recommendations from the experience gained in this assessment include that data quality visualization tools be developed and used in the processing and quality control of traffic data, and that these visualization tools, along with other information on the quality control effort, be stored as metadata with the processed data. / Dissertation/Thesis / M.S. Civil and Environmental Engineering 2011
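As an illustration of two of the six measures, the sketch below computes completeness and validity for a toy set of sensor records; the field names, plausibility ranges and five-minute reporting interval are assumptions, not the criteria used in the assessment.

```python
# Sketch of two of the six measures for one sensor (field names, plausibility
# ranges and the 288 five-minute intervals per day are assumptions, not the
# criteria used in the assessment).
from typing import Dict, List

def completeness(records: List[Dict], expected: int) -> float:
    """Share of expected reporting intervals for which a record exists."""
    return len(records) / expected if expected else 0.0

def validity(records: List[Dict]) -> float:
    """Share of records whose speed and volume fall within plausible ranges."""
    def plausible(r: Dict) -> bool:
        return 0 <= r.get("speed_mph", -1) <= 100 and 0 <= r.get("volume", -1) <= 3000
    return sum(plausible(r) for r in records) / len(records) if records else 0.0

day = [{"speed_mph": 62, "volume": 340}, {"speed_mph": 180, "volume": 120}]  # toy records
print(completeness(day, expected=288), validity(day))
```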
117

évaluation de la véracité des données : améliorer la découverte de la vérité en utilisant des connaissances a priori / data veracity assessment : enhancing truth discovery using a priori knowledge

Beretta, Valentina 30 October 2018 (has links)
Face au danger de la désinformation et de la prolifération de fake news (fausses nouvelles) sur le Web, la notion de véracité des données constitue un enjeu crucial. Dans ce contexte, il devient essentiel de développer des modèles qui évaluent de manière automatique la véracité des informations. De fait, cette évaluation est déjà très difficile pour un humain, en raison notamment du biais de confirmation qui empêche d’évaluer objectivement la fiabilité des informations. De plus, la quantité d'informations disponibles sur le Web rend cette tâche quasiment impossible. Il est donc nécessaire de disposer d'une grande puissance de calcul et de développer des méthodes capables d'automatiser cette tâche. Dans cette thèse, nous nous concentrons sur les modèles de découverte de la vérité. Ces approches analysent les assertions émises par différentes sources afin de déterminer celle qui est la plus fiable et digne de confiance. Cette étape est cruciale dans un processus d'extraction de connaissances, par exemple, pour constituer des bases de qualité, sur lesquelles pourront s'appuyer différents traitements ultérieurs (aide à la décision, recommandation, raisonnement…). Plus précisément, les modèles de la littérature sont des modèles non supervisés qui reposent sur un postulat : les informations exactes sont principalement fournies par des sources fiables et des sources fiables fournissent des informations exactes. Les approches existantes faisaient jusqu'ici abstraction de la connaissance a priori d'un domaine. Dans cette contribution, nous montrons comment les modèles de connaissance (ontologies de domaine) peuvent avantageusement être exploités pour améliorer les processus de recherche de vérité. Nous insistons principalement sur deux approches : la prise en compte de la hiérarchisation des concepts de l'ontologie et l'identification de motifs dans les connaissances qui permet, en exploitant certaines règles d'association, de renforcer la confiance dans certaines assertions. Dans le premier cas, deux valeurs différentes ne seront plus nécessairement considérées comme contradictoires ; elles peuvent, en effet, représenter le même concept mais avec des niveaux de détail différents. Pour intégrer cette composante dans les approches existantes, nous nous basons sur les modèles mathématiques associés aux ordres partiels. Dans le second cas, nous considérons des modèles récurrents (modélisés en utilisant des règles d'association) qui peuvent être dérivés à partir des ontologies et de bases de connaissances existantes. Ces informations supplémentaires peuvent renforcer la confiance dans certaines valeurs lorsque certains schémas récurrents sont observés. Chaque approche est validée sur différents jeux de données qui sont rendus disponibles à la communauté, tout comme le code de calcul correspondant aux deux approches. / The notion of data veracity is increasingly getting attention due to the problem of misinformation and fake news. With more and more information published online, it is becoming essential to develop models that automatically evaluate information veracity. Indeed, the task of evaluating data veracity is very difficult for humans. They are affected by a confirmation bias that prevents them from objectively evaluating information reliability. Moreover, the amount of information that is available nowadays makes this task time-consuming. The computational power of computers is therefore required.
It is critical to develop methods that are able to automate this task. In this thesis we focus on Truth Discovery models. These approaches address the data veracity problem when conflicting values about the same properties of real-world entities are provided by multiple sources. They aim to identify the true claims among the set of conflicting ones. More precisely, they are unsupervised models based on the rationale that true information is provided by reliable sources and reliable sources provide true information. The main contribution of this thesis consists in improving Truth Discovery models by considering a priori knowledge expressed in ontologies. This knowledge may facilitate the identification of true claims. Two particular aspects of ontologies are considered. First, we explore the semantic dependencies that may exist among different values, i.e. the ordering of values through certain conceptual relationships. Indeed, two different values are not necessarily conflicting: they may represent the same concept, but with different levels of detail. In order to integrate this kind of knowledge into existing approaches, we use the mathematical models associated with partial orders. Then, we consider recurrent patterns that can be derived from ontologies. This additional information reinforces the confidence in certain values when certain recurrent patterns are observed. In this case, we model recurrent patterns using rules. Experiments conducted on both synthetic and real-world datasets show that a priori knowledge enhances existing models and paves the way towards more reliable information. Source code as well as the synthetic and real-world datasets are freely available.
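The sketch below illustrates the baseline truth-discovery iteration described above, alternating between claim confidence and source trustworthiness; it is a generic fixed-point scheme with toy claims, not the ontology-aware models contributed in the thesis.

```python
# Generic baseline sketch of the stated rationale (toy claims, not the
# ontology-aware models contributed in the thesis): iterate between claim
# confidence and source trustworthiness until they stabilise.
from collections import defaultdict
from typing import Dict, Set, Tuple

Claim = Tuple[str, str]                       # (property, value)
Claims = Dict[str, Set[Claim]]                # source -> asserted claims

def truth_discovery(claims: Claims, iters: int = 20) -> Dict[Claim, float]:
    trust = {source: 0.5 for source in claims}            # initial source reliability
    confidence: Dict[Claim, float] = {}
    for _ in range(iters):
        votes: Dict[Claim, float] = defaultdict(float)
        for source, asserted in claims.items():           # claim confidence accumulates
            for claim in asserted:                        # the trust of its supporters
                votes[claim] += trust[source]
        top = max(votes.values(), default=1.0)
        confidence = {c: v / top for c, v in votes.items()}
        for source, asserted in claims.items():           # source trust = mean confidence
            trust[source] = sum(confidence[c] for c in asserted) / len(asserted)
    return confidence

claims: Claims = {
    "src_a": {("capital_of_France", "Paris")},
    "src_b": {("capital_of_France", "Paris")},
    "src_c": {("capital_of_France", "Lyon")},
}
print(truth_discovery(claims))  # the Paris claim ends up with the higher confidence
```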
118

Citizen science data quality: Harnessing the power of recreational SCUBA divers for rockfish (Sebastes spp.) conservation

Gorgopa, Stefania M. 30 August 2018 (has links)
Monitoring rare or elusive species can be especially difficult in marine environments, resulting in poor data density. SCUBA-derived citizen science data has the potential to improve data density for conservation. However, citizen science data may be perceived to be of low quality relative to professional data due to a lack of ‘expertise’ and increased observer variability. We evaluated the quality of data collected by citizen science SCUBA divers for rockfish (Sebastes spp.) conservation around Southern Vancouver Island, Canada. An information-theoretic approach was taken in two separate analyses to address the overarching question: ‘what factors are important for SCUBA-derived citizen science data quality?’. The first analysis identified predictors of variability in precision between paired divers. We found that professional scientific divers did not exhibit greater data precision than recreational divers. Instead, precision variation was best explained by study site and divers’ species identification or recreational training. A second analysis identified which observer and environmental factors correlated with higher resolution identifications (i.e. identified to the species level rather than family or genus). We found divers provided higher resolution identifications on surveys when they had high species ID competency and diving experience. Favorable conditions (high visibility and earlier in the day) also increased taxonomic resolution on dive surveys. With our findings, we are closer to realizing the full potential of citizen science to increase our capacity to monitor rare and elusive species. / Graduate
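For readers unfamiliar with the information-theoretic approach mentioned above, the sketch below computes AIC values and Akaike weights for a set of candidate models; the candidate names and log-likelihoods are invented for illustration and are not results from the study.

```python
# Sketch of the information-theoretic comparison (the candidate models and
# log-likelihoods below are invented for illustration): AIC and Akaike weights
# for a set of fitted models.
import math
from typing import Dict, Tuple

def akaike_weights(models: Dict[str, Tuple[float, int]]) -> Dict[str, float]:
    """models: name -> (log_likelihood, n_parameters); returns Akaike weights."""
    aic = {name: 2 * k - 2 * ll for name, (ll, k) in models.items()}
    best = min(aic.values())
    rel_likelihood = {name: math.exp(-0.5 * (a - best)) for name, a in aic.items()}
    total = sum(rel_likelihood.values())
    return {name: r / total for name, r in rel_likelihood.items()}

candidates = {
    "site_only": (-112.4, 3),
    "site_plus_training": (-108.1, 5),
    "site_plus_experience": (-110.9, 5),
}
print(akaike_weights(candidates))  # higher weight = more support, given the candidate set
```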
119

Um modelo de qualidade para caracterização e seleção de bancos de dados de biologia molecular / A quality model for characterizing and selecting molecular biology databases

Lichtnow, Daniel January 2012 (has links)
O número de banco de dados de biologia molecular presentes na Web vem aumentando significativamente nos últimos anos. A dificuldade de localizar estes bancos de dados na Web incentivou a criação de uma série de catálogos. Mesmo com estes catálogos, persiste o desafio de selecionar aqueles bancos de dados que possuem maior qualidade. Normalmente, a seleção é feita por usuários, que nem sempre possuem o conhecimento necessário e enfrentam problemas pela ausência de uma descrição mais rica dos bancos de dados nestes catálogos. Esta ausência de uma descrição mais rica dos bancos de dados gerou iniciativas recentes que visam identificar metadados relevantes para descrição dos bancos de dados de biologia molecular. No entanto, até o momento, como utilizar estes metadados na seleção dos bancos de dados presentes em um catálogo, relacionando estes às dimensões de qualidade de dados, é um tema pouco explorado. Da mesma forma, o uso de Web metrics, utilizadas na seleção de páginas Web, vem sendo quase ignorado na determinação da qualidade de bancos de dados de biologia molecular. Tendo em vista este cenário, nesta tese foi desenvolvido um modelo de qualidade que visa auxiliar na seleção de bancos de dados de biologia molecular presentes em catálogos na Web a partir da avaliação global de um banco de dados por meio de metadados e Web metrics. A definição deste modelo envolve adoção de metadados propostos em outros trabalhos, a proposição de novos metadados e a análise das dimensões de qualidade de dados. Experimentos são realizados de forma a avaliar a utilidade de alguns dos metadados e Web metrics na determinação da qualidade global de um banco de dados. A representação dos metadados, dimensões de qualidade, indicadores de qualidade e métricas usando recursos de Web Semântica é também discutida. O principal cenário de aplicação da abordagem é relacionado à necessidade que um usuário tem de escolher o melhor banco de dados para buscar informações relevantes para o seu trabalho dentre os existentes em um catálogo. Outro cenário está relacionado a sistemas que integram dados de fontes distintas e que necessitam, em muitos casos, reduzir o número de bancos de dados candidatos a um processo de integração. / The number of molecular biology databases on the Web has increased significantly in recent years. The difficulty of locating these databases on the Web motivated the creation of a number of catalogs. Even with these catalogs, the challenge remains of identifying the best databases among those listed. In general, the selection is done by users, who sometimes have little knowledge about the databases of a specific domain and have difficulty selecting the best ones, a difficulty compounded by the absence of information about the databases in these catalogs. This absence of information has generated recent initiatives aiming to identify relevant metadata for describing molecular biology databases. However, how to use these metadata for selecting databases from a catalog, taking data quality dimensions into account, remains underexplored. Similarly, the Web metrics used for ranking Web pages are almost ignored in the evaluation of molecular biology databases. In this scenario, this thesis defines a quality model, based on a set of identified data quality dimensions, that aims to help select databases from molecular biology database catalogs. The selection is based on an overall assessment of each database using metadata and Web metrics.
The definition of this model involves the adoption of metadata from related works, the definition of new metadata and the analysis of data quality dimensions. A set of experiments evaluates the usefulness of metadata and Web metrics for evaluating the overall quality of databases. How to represent database metadata, quality dimensions, quality indicators and quality metrics using Semantic Web resources is also discussed. One application scenario relates to users who need to choose the best databases available in a catalog. Another application scenario is related to database integration systems in which it is necessary to determine the overall quality of a database in order to reduce the number of databases to be integrated.
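A minimal sketch of how metadata indicators and Web metrics might be combined into a single quality score is shown below; the indicator names and weights are assumptions for illustration, not the quality model defined in the thesis.

```python
# Illustrative combination of normalised indicators into one score (indicator
# names and weights are assumptions, not the quality model defined in the thesis).
from typing import Dict

def overall_quality(indicators: Dict[str, float], weights: Dict[str, float]) -> float:
    """indicators: values normalised to [0, 1]; returns their weighted average."""
    total_weight = sum(weights.get(name, 0.0) for name in indicators)
    if total_weight == 0.0:
        return 0.0
    return sum(value * weights.get(name, 0.0)
               for name, value in indicators.items()) / total_weight

database = {
    "currency": 0.9,        # metadata indicator, e.g. recency of the last release
    "documentation": 0.6,   # completeness of the published description
    "inlinks": 0.4,         # a Web metric such as a normalised inbound-link count
}
print(overall_quality(database, {"currency": 0.5, "documentation": 0.3, "inlinks": 0.2}))
```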
120

Traitement de l'information issue d'un réseau de surveillance de la paralysie cérébrale : qualité et analyse des données / Information processing in a network of cerebral palsy : data quality and analysis

Sellier, Elodie 18 June 2012 (has links)
Le réseau européen de paralysie cérébrale nommé Surveillance of Cerebral Palsy in Europe (SCPE) est né de la volonté de différents registres européens de s’associer afin d’harmoniser leurs données et de créer une base de données commune. Aujourd’hui il compte 24 registres dont 16 actifs. La base contient plus de 14000 cas d’enfants avec paralysie cérébrale (PC) nés entre 1976 et 2002. Elle permet de fournir des estimations précises sur les taux de prévalence de la PC, notamment dans les différents sous-groupes d’enfants (sous groupes d’âge gestationnel ou de poids de naissance, type neurologique de PC). La thèse s’est articulée autour de la base de données commune du réseau SCPE. Dans un premier temps, nous avons réalisé un état des lieux de la qualité des données de la base commune, puis développé de nouveaux outils pour l’amélioration de la qualité des données. Nous avons notamment mis en place un retour d’informations personnalisé aux registres suite à chaque soumission de données et écrit un guide d’aide à l’analyse des données. Nous avons également mené deux études de reproductibilité de la classification des enfants. La première étude incluait des médecins visualisant des séquences vidéos d’enfants avec ou sans PC. La deuxième étude incluait différents professionnels travaillant dans les registres qui avaient à leur disposition une description écrite de l’examen clinique des enfants. L’objectif de ces études originales était d’évaluer si face à un même enfant, les différents professionnels le classaient de la même manière pour le diagnostic de PC, le type neurologique et la sévérité de l’atteinte motrice. Les résultats ont montré une reproductibilité excellente pour les pédiatres ayant visualisé les vidéos et bonne pour les professionnels ayant classé les enfants à partir de la description écrite. Dans un second temps, nous avons réalisé des travaux sur l’analyse des données à partir de deux études : l’analyse de la tendance du taux de prévalence de la PC chez les enfants nés avec un poids >2499g entre 1980 et 1998 et l’analyse du taux de prévalence de la PC associée à l’épilepsie chez les enfants nés entre 1976 et 1998. Ces travaux ont porté principalement sur les méthodes d’analyse des tendances dans le temps du taux de prévalence, et sur la prise en compte des interactions tendance-registre. / Several European Cerebral Palsy (CP) registers formed a collaborative network, Surveillance of Cerebral Palsy in Europe (SCPE), in order to harmonize their data and to establish a common database. At the present time, the network gathers 24 CP registers, with 16 being active. The common database includes more than 14000 cases of children with CP, born between 1976 and 2002. Thanks to this large database, the network can provide reliable estimates of prevalence rates of children with CP, especially in the different CP subgroups (according to gestational age or birthweight, and neurological subtype). Our work was based on the SCPE common database. Firstly, we performed a survey of the data quality of the common database. Then we developed new tools to improve the quality of the data: we now provide the registers with feedback after each submission of their data, and we wrote a data use guideline. We also conducted two studies to evaluate the reliability of the classification of children with CP. The first study included pediatricians viewing video sequences of children with or without CP. The second study included different professionals working in registers who were given the written clinical description of the same children. The aim of these original studies was to evaluate whether the professionals classified the same child in the same way concerning the diagnosis of CP, the neurological subtype and the severity of gross and fine motor function. Results showed that inter-rater reliability was excellent for the pediatricians viewing video sequences and substantial for the professionals reading the clinical descriptions. Secondly, we worked on the analysis of the data through two studies: the analysis of the trend in the prevalence rate of children with CP with a birthweight >2499g born between 1980 and 1998, and the analysis of the trend in the prevalence rate of children with CP and epilepsy born between 1976 and 1998. This work focused on the methods of trend analysis and on taking into account the interaction between trend and register.
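A hedged sketch of this kind of trend analysis is shown below: a Poisson regression of CP case counts on birth year and register, with a year-by-register interaction and live births as an offset; the data and column names are invented for illustration.

```python
# Hedged sketch of a prevalence trend analysis with a trend-by-register interaction
# (invented toy data and column names): Poisson regression of CP case counts on birth
# year and register, with the number of live births as an offset.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "year":     [1980, 1990, 1998] * 2,
    "register": ["A"] * 3 + ["B"] * 3,
    "cases":    [42, 39, 35, 58, 52, 47],
    "births":   [30000, 31000, 29500, 41000, 40500, 39800],
})

model = smf.glm(
    "cases ~ year * register",            # interaction lets the trend differ by register
    data=df,
    family=sm.families.Poisson(),
    offset=np.log(df["births"]),          # models prevalence rather than raw counts
).fit()
print(model.summary())
# A near-zero interaction coefficient would support a common trend across registers.
```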
