41

Privacy preservation in data mining through noise addition

Islam, Md Zahidul January 2008 (has links)
Research Doctorate - Doctor of Philosophy (PhD) / Due to advances in information processing technology and storage capacity, huge amounts of data are now being collected for various data analyses. Data mining techniques, such as classification, are often applied to these data to extract hidden information. During the whole process of data mining, the data are exposed to several parties, and such exposure potentially leads to breaches of individual privacy. This thesis presents a comprehensive noise addition technique for protecting individual privacy in a data set used for classification, while maintaining the data quality. We add noise to all attributes, both numerical and categorical, and to both class and non-class attributes, in such a way that the original patterns are preserved in the perturbed data set. Our technique is also capable of incorporating previously proposed noise addition techniques that maintain the statistical parameters of the data set, including correlations among attributes. Thus, the perturbed data set may be used not only for classification but also for statistical analysis. Our proposal has two main advantages. Firstly, as our experimental results also suggest, the perturbed data set maintains the same or very similar patterns as the original data set, as well as the correlations among attributes. While there are some noise addition techniques that maintain the statistical parameters of the data set, to the best of our knowledge this is the first comprehensive technique that preserves the patterns and thus removes the so-called Data Mining Bias from the perturbed data set. Secondly, re-identification of the original records depends directly on the amount of noise added and, in general, can be made arbitrarily hard, while still preserving the original patterns in the data set. The only exception is the case in which an intruder knows enough about a record to learn the confidential class value by applying the classifier. However, this is always possible, even when the original record has not been used in the training data set. In other words, provided that enough noise is added, our technique makes the records from the training set as safe as any other previously unseen records of the same kind. In addition to the above contribution, this thesis also explores the suitability of prediction accuracy as a sole indicator of data quality, and proposes a technique for clustering both categorical values and records containing such values.
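The pattern-preserving algorithm itself is specific to the thesis, but the basic mechanics of perturbing both numerical and categorical attributes can be sketched as follows. This is a minimal illustration under assumed parameters (Gaussian noise on numeric fields, random value flips on categorical fields); the attribute names, noise scale, and flip probability are hypothetical, not taken from the thesis.

```python
import random

def perturb_record(record, numeric_attrs, categorical_domains,
                   sigma=0.1, flip_prob=0.05, rng=None):
    """Return a noisy copy of one record (illustrative sketch only).

    numeric_attrs       : keys holding numeric values
    categorical_domains : dict mapping categorical keys to their possible values
    sigma               : relative std. dev. of the Gaussian noise
    flip_prob           : probability of replacing a categorical value at random
    """
    rng = rng or random.Random()
    noisy = dict(record)
    for attr in numeric_attrs:
        scale = sigma * abs(record[attr]) if record[attr] != 0 else sigma
        noisy[attr] = record[attr] + rng.gauss(0.0, scale)
    for attr, domain in categorical_domains.items():
        if rng.random() < flip_prob:
            alternatives = [v for v in domain if v != record[attr]]
            if alternatives:
                noisy[attr] = rng.choice(alternatives)
    return noisy

# Hypothetical record; note that the class attribute is perturbed like any other attribute.
original = {"age": 37, "income": 52000.0, "occupation": "teacher", "class": "low_risk"}
perturbed = perturb_record(
    original,
    numeric_attrs=["age", "income"],
    categorical_domains={"occupation": {"teacher", "nurse", "clerk"},
                         "class": {"low_risk", "high_risk"}},
)
```

In the thesis the amount of noise is chosen so that the original patterns survive perturbation; in this sketch the noise parameters are simply fixed constants.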
42

China's far below replacement level fertility: a reality or illusion arising from underreporting of births?

Zhang, Guangyu, Zhang.Guangyu@anu.edu.au January 2004 (has links)
How fast and how far China’s fertility declined in the 1990s has long been a matter of considerable debate, despite very low fertility being consistently reported in a number of statistical investigations over time. Most demographers interpreted this as a result of serious underreporting of births in population statistics caused by the family planning program, especially its strengthening after 1991. Consequently, they suggested that fertility fell only moderately below replacement level, to around 1.8 children per woman from the early 1990s. Some demographers argued, however, that surveys and censuses may have reflected a real decline of fertility even after allowing for some underreporting of births, given the consistency between data sources and over time. They believed that fertility declined substantially in the 1990s, very likely to between 1.5 and 1.6 by the year 2000.

The controversy over fertility is primarily related to the problem of underreporting of births, in particular the differing estimates of its extent. However, a correct interpretation of fertility data goes far beyond the raw numbers; it calls for a thorough understanding of the different data sources, the programmatic and societal changes that occurred in the 1990s, and their effects on both fertility changes and data collection efforts. This thesis addresses the question of whether the reported far-below-replacement fertility reflected a real, substantial fertility decline or merely an illusion arising from underreporting of births. Given the nature of the controversy, it devotes most of its effort to assessing data quality: examining the patterns, causes and extent of underreporting of births in each data source; reconstructing the decline of fertility in the 1990s; and searching for corroborating evidence of the decline.

After reviewing programmatic changes in the 1990s, this thesis suggests that program efforts were greatly strengthened, which would help to bring fertility down, but that the birth control policy and program targets were not tightened as generally believed. The program does affect individual reporting of births, but the completeness of birth reporting in each data source depends greatly on who collects the fertility data and how the data are collected. The thesis then carefully examines the data collection operations and underreporting of births in five sets of fertility data: the hukou statistics, the family planning statistics, the population census, the annual survey and the retrospective survey. The analysis does not find convincing evidence that fertility data deteriorated more seriously in the 1990s than in the preceding decade. Rather, it finds that surveys and censuses have far more complete reporting of births than the registration-based statistics, because they obtain information directly from respondents, largely avoiding intermediate interference from local program workers. In addition, the detailed examination suggests that less than 10 percent of births may have been unreported in surveys and censuses. The annual surveys, in which many higher-order out-of-plan births were misreported as first-order births, have more complete reporting of births than censuses, which were affected by increasing population mobility and field enumeration difficulties, and than retrospective surveys, which suffered from underreporting of higher-order births.

Using the unadjusted data of the annual surveys from 1991 to 1999, the 1995 sample census and the 2000 census, this research shows that fertility first dropped from 2.3 to 1.7 in the first half of the 1990s and then declined further to around 1.5-1.6 in the second half of the decade. Comparison with other independent sources corroborates the reliability of this estimate. Putting China’s fertility decline in international perspective, comparison with the experiences of Thailand and Korea also supports such a rapid decline. Subsequently, the thesis reveals an increasingly narrow gap between state demands and popular fertility preferences, and large contributions from delayed marriage and nearly universal contraception. It concludes that fertility declined substantially over the course of the 1990s and had dropped to a very low level by the end of the last century. It is very likely that the combination of a government-enforced birth control program and rapid societal change moved China into the group of very low-fertility countries earlier than might have been anticipated, given that almost all the other countries in that group are developed countries.
43

Improving Data Quality Through Effective Use of Data Semantics

Madnick, Stuart E. 01 1900 (has links)
Data quality issues have taken on increasing importance in recent years. In our research, we have discovered that many “data quality” problems are actually “data misinterpretation” problems – that is, problems with data semantics. In this paper, we first illustrate some examples of these problems and then introduce a particular semantic problem that we call “corporate householding.” We stress the importance of “context” to get the appropriate answer for each task. Then we propose an approach to handle these tasks using extensions to the COntext INterchange (COIN) technology for knowledge storage and knowledge processing. / Singapore-MIT Alliance (SMA)
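As a toy illustration of the kind of misinterpretation the paper targets (not the COIN technology itself): the "same" revenue figure means different things under different reporting contexts, and a small mediation step converts each value into the receiver's context before comparison. The contexts, scale factors, and exchange rate below are made up for illustration.

```python
# Each source reports revenue under its own context: a currency and a scale factor.
# A mediation step normalizes values into the receiver's context before comparison.
CONTEXTS = {
    "source_a": {"currency": "USD", "scale": 1_000},  # figures reported in thousands of USD
    "source_b": {"currency": "SGD", "scale": 1},      # figures reported in Singapore dollars
}
USD_PER = {"USD": 1.0, "SGD": 0.74}  # hypothetical exchange rate

def to_receiver_context(value, source, receiver_currency="USD"):
    """Convert a raw reported value into the receiver's currency at unit scale."""
    ctx = CONTEXTS[source]
    in_usd = value * ctx["scale"] * USD_PER[ctx["currency"]]
    return in_usd / USD_PER[receiver_currency]

# Without mediation, 5,300 and 7,100,000 look incomparable; with it, both are plain USD.
print(to_receiver_context(5_300, "source_a"))      # 5300000.0
print(to_receiver_context(7_100_000, "source_b"))  # 5254000.0
```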
44

Data Quality By Design: A Goal-oriented Approach

Jiang, Lei 13 August 2010 (has links)
A successful information system is one that meets its design goals. Expressing these goals and subsequently translating them into a working solution is a major challenge for information systems engineering. This thesis adopts concepts and techniques from goal-oriented (software) requirements engineering research for conceptual database design, with a focus on data quality issues. Based on a real-world case study, a goal-oriented process is proposed for database requirements analysis and modeling. It spans from the analysis of high-level stakeholder goals to the detailed design of a conceptual database schema. This process is then extended specifically for dealing with data quality issues: data of low quality may be detected and corrected by performing various quality assurance activities; to support these activities, the schema needs to be revised to accommodate additional data requirements. The extended process therefore focuses on analyzing and modeling quality assurance data requirements. A quality assurance activity supported by a revised schema may involve manual work and/or rely on automatic techniques, which often depend on the specification and enforcement of data quality rules. To address the constraint aspect of conceptual database design, data quality rules are classified according to a number of domain- and application-independent properties. This classification can be used to guide rule designers and to facilitate the building of a rule repository. A quantitative framework is then proposed for measuring and comparing DQ rules according to one of these properties, effectiveness; this framework relies on the derivation of formulas that represent the effectiveness of DQ rules under different probabilistic assumptions. A semi-automatic approach is also presented to derive these effectiveness formulas.
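The thesis derives analytic effectiveness formulas; as a rough intuition only, the sketch below estimates the effectiveness of a single hypothetical rule (a range check on age) by simulation, taking effectiveness to be the share of corrupted records that the rule flags. The error model, the rule, and the rates are assumptions for illustration, not the formulas derived in the thesis.

```python
import random

def estimate_effectiveness(n=100_000, error_rate=0.05, seed=1):
    """Monte Carlo estimate of a DQ rule's effectiveness (illustrative sketch).

    Rule (hypothetical): 0 <= age <= 120.
    Error model (hypothetical): with probability `error_rate` the true age is
    replaced by a random integer in [-50, 500], which may or may not fall
    outside the valid range, so the rule misses some errors by design.
    """
    rng = random.Random(seed)
    corrupted = caught = 0
    for _ in range(n):
        age = rng.randint(0, 100)            # clean value
        is_error = rng.random() < error_rate
        if is_error:
            corrupted += 1
            age = rng.randint(-50, 500)      # corrupted value
        if is_error and not (0 <= age <= 120):
            caught += 1                      # the rule fires on a genuine error
    return caught / max(corrupted, 1)

print(f"estimated effectiveness: {estimate_effectiveness():.2f}")
```

An analytic counterpart would compute the same quantity directly from the assumed error distribution, which is closer in spirit to the formula-based framework the thesis proposes.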
45

GIS-Based Probabilistic Approach for Assessing and Enhancing Infrastructure Data Quality

Saliminejad, Siamak 1983- 14 March 2013 (has links)
The task of preserving and improving infrastructure systems is becoming extremely challenging because these systems are decaying due to aging and overutilization, operate under limited funding, and are complex in nature (geographically dispersed, and affecting and affected by technological, environmental, social, security, political, and economic factors). The infrastructure management paradigm has emerged to assist in the challenging task of managing infrastructure systems in a systematic and cost-effective manner. Infrastructure management is a data-driven process. It relies on large databases that contain information on the system's inventory, condition, maintenance and rehabilitation (M&R) history, utilization, and cost. These data feed into analytical models that assess current infrastructure conditions, predict future conditions, and develop optimal M&R strategies. Thus, complete and accurate data are essential to a reliable infrastructure management system. This study contributes to advancing the infrastructure management paradigm (with a focus on pavement management) in two primary ways: (a) it provides an in-depth understanding of the impact of errors in condition data on the outputs of infrastructure management systems, and (b) it provides efficient computational methods for improving infrastructure data quality. First, this research provides a quantitative assessment of the effects of error magnitude and type (both systematic and random) in pavement condition data on the accuracy of pavement management system (PMS) outputs (i.e., the forecasted needed budget and M&R activities over a multi-year planning period). Second, a new technique for detecting gross outliers and pseudo outliers in pavement condition data was developed and tested. Gross outliers are data values that are likely to be erroneous, whereas pseudo outliers are pavement sections performing exceptionally well or poorly due to isolated local conditions. Third, a new technique for estimating construction and M&R history data from pavement condition data was developed and tested. This technique is especially beneficial when M&R data and condition data are stored in disparate heterogeneous databases that are difficult to integrate (i.e., legacy databases). The main merit of the developed techniques is their ability to integrate methods and principles from Bayesian and spatial statistics, GIS, and operations research in an efficient manner. The application of these techniques to a real-world case study (the pavement network in the Bryan district) demonstrated their potential benefits to infrastructure managers and engineers.
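As a rough illustration of the distinction between gross and pseudo outliers (not the Bayesian/spatial model developed in the dissertation), the sketch below flags scores outside the physically possible range as gross outliers, and flags sections whose score deviates strongly from the median of their spatial neighbours as candidate pseudo outliers; deciding whether such a section is genuinely erroneous or just locally exceptional would still require the kind of follow-up analysis the study describes. The score range, threshold, and field names are assumptions.

```python
from statistics import median

def screen_sections(sections, neighbours, valid_range=(0, 100), deviation_threshold=30):
    """Split pavement sections into gross outliers and candidate pseudo outliers.

    sections   : dict mapping section id -> condition score
    neighbours : dict mapping section id -> list of adjacent section ids
    """
    gross, candidates = [], []
    lo, hi = valid_range
    for sid, score in sections.items():
        if not (lo <= score <= hi):
            gross.append(sid)                # physically impossible value
            continue
        near = [sections[n] for n in neighbours.get(sid, []) if n in sections]
        if near and abs(score - median(near)) > deviation_threshold:
            candidates.append(sid)           # locally inconsistent value
    return gross, candidates

# Hypothetical toy network: "s3" has an impossible score, "s4" disagrees with its neighbours.
scores = {"s1": 82, "s2": 78, "s3": 140, "s4": 35, "s5": 80}
adjacency = {"s3": ["s1", "s2"], "s4": ["s1", "s2", "s5"]}
print(screen_sections(scores, adjacency))    # (['s3'], ['s4'])
```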
46

Query Optimization for On-Demand Information Extraction Tasks over Text Databases

Farid, Mina H. 12 March 2012 (has links)
Many modern applications involve analyzing large amounts of data that come from unstructured text documents. In its original format, this data contains information that, if extracted, can give more insight and help in the decision-making process. The ability to answer structured SQL queries over unstructured data allows for more complex data analysis. Querying unstructured data can be accomplished with the help of information extraction (IE) techniques. The traditional way is the Extract-Transform-Load (ETL) approach, which performs all possible extractions over the document corpus and stores the extracted relational results in a data warehouse; the extracted data is then queried. The ETL approach produces results that are out of date and causes an explosion in the number of possible relations and attributes to extract. Therefore, new approaches to perform extraction on the fly were developed; however, previous efforts relied on specialized extraction operators or particular IE algorithms, which limited the optimization opportunities for such queries. In this work, we propose an online approach that integrates the engine of the database management system with IE systems using a new type of view called extraction views. Queries on text documents are evaluated using these extraction views, which are populated at query time with newly extracted data. Our approach enables the optimizer to apply all well-defined optimization techniques. The optimizer selects the best execution plan using a defined cost model that considers a user-defined balance between the cost and quality of extraction, and we explain the trade-off between the two factors. The main contribution is the ability to run on-demand information extraction that takes the latest changes in the data into account, while avoiding unnecessary extraction from irrelevant text documents.
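One way to picture an extraction view (a simplified sketch, not the system built in the thesis): the view holds no tuples until it is queried, at which point it runs the expensive extractor only over documents that pass a cheap relevance filter and caches the extracted tuples so repeated queries do not re-extract. The corpus, filter, extractor, and schema below are placeholders.

```python
import re

class ExtractionView:
    """A view over a text corpus that is populated lazily at query time."""

    def __init__(self, corpus, relevance_filter, extractor):
        self.corpus = corpus                      # iterable of (doc_id, text)
        self.relevance_filter = relevance_filter  # cheap predicate on raw text
        self.extractor = extractor                # expensive IE step: text -> list of tuples
        self._cache = {}                          # doc_id -> extracted tuples

    def query(self, predicate):
        """Return extracted tuples satisfying `predicate`, extracting on demand."""
        results = []
        for doc_id, text in self.corpus:
            if doc_id not in self._cache:
                if not self.relevance_filter(text):
                    continue                      # skip irrelevant documents entirely
                self._cache[doc_id] = self.extractor(text)
            results.extend(t for t in self._cache[doc_id] if predicate(t))
        return results

# Hypothetical usage: extract (company, city) pairs only from documents that look relevant.
view = ExtractionView(
    corpus=[("d1", "Acme Corp is headquartered in Toronto."),
            ("d2", "The weather was mild today.")],
    relevance_filter=lambda text: "headquartered" in text,
    extractor=lambda text: re.findall(r"(\w+ Corp) is headquartered in (\w+)", text),
)
print(view.query(lambda t: t[1] == "Toronto"))    # [('Acme Corp', 'Toronto')]
```

A real optimizer would additionally weigh extraction cost against result quality when choosing the plan, as the thesis describes; this sketch only captures the lazy, cached population of the view.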
48

An Automated Quality Assurance Procedure for Archived Transit Data from APC and AVL Systems

Saavedra, Marian Ruth January 2010 (has links)
Automatic Vehicle Location (AVL) and Automatic Passenger Counting (APC) systems can be powerful tools for transit agencies to archive large, detailed quantities of transit operations data. Managing data quality is an important first step in exploiting these rich datasets. This thesis presents an automated quality assurance (QA) methodology that identifies unreliable archived AVL/APC data. The approach is based on expected travel and passenger activity patterns derived from the data. It is assumed that standard passenger balancing and schedule matching algorithms are applied to the raw AVL/APC data along with any existing automatic validation programs. The proposed QA methodology is intended to provide transit agencies with a supplementary tool to manage data quality that complements, but does not replace, conventional processing routines (which can be vendor-specific and less transparent). The proposed QA methodology endeavours to flag invalid data as “suspect” and valid data as “non-suspect”. There are three stages: i) the first stage screens data that violate physical constraints; ii) the second stage looks for data that represent outliers; and iii) the third stage evaluates whether the outlier data can be accounted for by a valid or an invalid pattern. Stop-level tests are mathematically defined for each stage; however, data are filtered at the trip level. Data that do not violate any physical constraints and do not represent outliers are considered valid trip data. Outlier trips that can be accounted for by a valid outlier pattern are also considered valid. The remaining trip data are considered suspect. The methodology is applied to a sample set of AVL/APC data from Grand River Transit in the Region of Waterloo, Ontario, Canada. The sample consists of four months of data, from September to December 2008, comprising 612,000 stop-level records representing 25,012 trips. The results show that 14% of the trip-level data is flagged as suspect for the sample dataset. The output is further dissected by reviewing which tests contribute most to the set of suspect trips, confirming the pattern assumptions for the valid outlier cases, and comparing the sample data by various traits before and after the QA methodology is applied. The latter task is meant to identify characteristics that may contribute to higher or lower quality data. Analysis shows that the largest portion of suspect trips, for this sample set, suggests the need for improved passenger balancing algorithms or greater accuracy of the APC equipment. The assumptions for the valid outlier case patterns were confirmed to be reasonable. It was found that poor schedule data contributes to poorer quality in AVL/APC data. An examination of data distribution by vehicle showed that usage and the portion of suspect data varied substantially between vehicles. This information can be useful in the development of maintenance plans and sampling plans (when combined with information on data distribution by route). A sensitivity analysis was conducted along with an analysis of the impact on downstream data uses. The model was found to be sensitive to three of the ten user-defined parameters. The impact of the QA procedure on network-level measures of performance (MOPs) was not found to be significant; however, the impact was shown to be more substantial for route-specific MOPs.
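A minimal sketch of the flavour of the first two stages (the thesis defines its stop-level tests mathematically and they are more extensive; the constraints, field names, and threshold below are assumptions): stage one rejects stop records with physically impossible values, stage two flags boardings that are extreme relative to the rest of the trip, and filtering happens at the trip level, so one bad stop record marks the whole trip as suspect.

```python
def stop_violates_constraints(stop):
    """Stage 1 (hypothetical checks): physically impossible stop-level values."""
    return (
        stop["boardings"] < 0
        or stop["alightings"] < 0
        or stop["load_leaving"] < 0
        or stop["arrival_time"] > stop["departure_time"]
    )

def stop_is_outlier(stop, trip_stops, max_ratio=5.0):
    """Stage 2 (hypothetical check): boardings far above the trip's mean."""
    mean_boardings = sum(s["boardings"] for s in trip_stops) / len(trip_stops)
    return mean_boardings > 0 and stop["boardings"] > max_ratio * mean_boardings

def classify_trip(trip_stops):
    """Filter at the trip level: any failing stop record makes the trip suspect."""
    if any(stop_violates_constraints(s) for s in trip_stops):
        return "suspect"
    if any(stop_is_outlier(s, trip_stops) for s in trip_stops):
        return "suspect"        # stage 3 would decide whether this outlier pattern is valid
    return "non-suspect"

# Hypothetical two-stop trip with clean records.
trip = [
    {"boardings": 3, "alightings": 0, "load_leaving": 3, "arrival_time": 100, "departure_time": 120},
    {"boardings": 2, "alightings": 1, "load_leaving": 4, "arrival_time": 300, "departure_time": 310},
]
print(classify_trip(trip))      # non-suspect
```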
49

Data Consistency Checks on Flight Test Data

Mueller, G. 10 1900 (has links)
ITC/USA 2014 Conference Proceedings / The Fiftieth Annual International Telemetering Conference and Technical Exhibition / October 20-23, 2014 / Town and Country Resort & Convention Center, San Diego, CA / This paper reflects the principal results of a study performed internally by Airbus's flight test centers. The purpose of this study was to share the body of knowledge concerning data consistency checks between all Airbus business units. An analysis of the test process is followed by the identification of the process stakeholders involved in ensuring data consistency. In the main part of the paper, several different possibilities for improving data consistency are listed; it is left to the discretion of the reader to determine the appropriateness of these methods.
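The paper itself surveys possibilities rather than prescribing one; as a generic illustration of what a data consistency check on telemetry can look like (an assumption for illustration, not taken from the paper), the sketch below compares two redundant measurements of the same parameter and flags samples where they diverge beyond a tolerance.

```python
def flag_inconsistent_samples(primary, secondary, tolerance):
    """Return indices where two redundant measurement channels disagree.

    primary, secondary : equal-length sequences of samples of the same parameter
    tolerance          : maximum acceptable absolute difference
    """
    return [i for i, (a, b) in enumerate(zip(primary, secondary))
            if abs(a - b) > tolerance]

# Hypothetical altitude readings (ft) from two independent sensors; sample 1 is inconsistent.
print(flag_inconsistent_samples([10000, 10010, 10020], [10002, 10380, 10019], tolerance=50))  # [1]
```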
