41 |
China's far below replacement level fertility: a reality or illusion arising from underreporting of births? / Zhang, Guangyu, Zhang.Guangyu@anu.edu.au, January 2004 (has links)
How fast and how far China's fertility declined in the 1990s has long been a matter
of considerable debate, despite very low fertility consistently being reported in a
number of statistical investigations over time. Most demographers interpreted this as
a result of serious underreporting of births in population statistics, due to the family
planning program, especially the strengthening of the program after 1991. Consequently, they suggested that fertility fell only moderately below replacement level, to around 1.8
children per woman from the early 1990s. But some demographers argued that
surveys and censuses may have reflected a real decline in fertility, even allowing for
some underreporting of births, given the consistency between data sources and over
time. They believed that fertility declined substantially in the 1990s, very likely in
the range between 1.5 and 1.6 by the year 2000.¶
The controversy over fertility is primarily related to the problem of underreporting of
births, in particular the different estimations of the extent of underreporting.
However, a correct interpretation of fertility data goes far beyond the pure numbers; it calls for a thorough understanding of the different data sources, the programmatic
and societal changes that occurred in the 1990s, and their effects on both fertility
changes and data collection efforts. This thesis aims to address the question of whether the reported far-below-replacement fertility reflected a real and substantial fertility decline or was merely an illusion arising from underreporting of births. Given the nature of the controversy, it devotes most of its effort to assessing data quality: examining the patterns, causes and extent of underreporting of births in each data source; reconstructing the decline of fertility in the 1990s; and searching for corroborating evidence of the decline.¶
After reviewing programmatic changes in the 1990s, this thesis suggests that the
program efforts were greatly strengthened, which would help to bring fertility down,
but the birth control policy and program targets were not tightened as generally believed. The program does affect individuals' reporting of births, but the completeness of birth reporting in each data source depends greatly on who collects the fertility data and how the data are collected. The thesis then carefully examines the
data collection operations and underreporting of births in five sets of fertility data:
the hukou statistics, the family planning statistics, the population censuses, the annual surveys and the retrospective surveys. The analysis does not find convincing evidence that fertility data deteriorated more seriously in the 1990s than in the preceding decade.
Rather, it finds that surveys and censuses have a far more complete reporting of
births than the registration-based statistics, because they directly obtain information
from respondents, largely avoiding intermediate interference from local program
workers. In addition, the detailed examination suggests that less than 10 percent of births may have gone unreported in surveys and censuses. The annual surveys, in which many higher-order out-of-plan births were misreported as first-order births, have more complete reporting of births than the censuses, which were affected by increasing population mobility and field enumeration difficulties, and the retrospective surveys, which suffered from underreporting of higher-order births.¶
Using the unadjusted data from the annual surveys of 1991 to 1999, the 1995 sample census and the 2000 census, this research shows that fertility first dropped from 2.3 to 1.7 in the first half of the 1990s, and then declined further to around 1.5-1.6 in the
second half of the decade. The comparison with other independent sources
corroborates the reliability of this estimation. Putting China's fertility decline in
international perspective, comparison with the experiences of Thailand and Korea
also supports such a rapid decline. Subsequently, the thesis reveals an increasingly
narrow gap between state demands and popular fertility preferences, and great
contributions from delayed marriage and nearly universal contraception. It is
concluded that fertility declined substantially over the course of the 1990s and dropped to a very low level by the end of the last century. It is very likely that the
combination of a government-enforced birth control program and rapid societal
changes quickly moved China into the group of very low-fertility countries earlier
than might have been anticipated, as almost all the others are developed
countries.
|
42 |
Improving Data Quality Through Effective Use of Data Semantics / Madnick, Stuart E., 01 1900 (has links)
Data quality issues have taken on increasing importance in recent years. In our research, we have discovered that many “data quality” problems are actually “data misinterpretation” problems – that is, problems with data semantics. In this paper, we first illustrate some examples of these problems and then introduce a particular semantic problem that we call “corporate householding.” We stress the importance of “context” to get the appropriate answer for each task. Then we propose an approach to handle these tasks using extensions to the COntext INterchange (COIN) technology for knowledge storage and knowledge processing. / Singapore-MIT Alliance (SMA)
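As a hedged illustration of why context matters for data semantics (a toy sketch, not the COIN technology itself; the contexts, exchange rates, and names below are invented for illustration), the same number can mean very different things depending on the reporting source, and a mediator can convert values from the source context to the receiver context before they are interpreted:

```python
# Hypothetical sketch of context mediation in the spirit of context interchange:
# each source and receiver declares its context (here, currency and scale factor),
# and a mediator converts values instead of letting the receiver guess.

CONTEXTS = {
    "tokyo_branch": {"currency": "JPY", "scale": 1_000_000},
    "boston_hq":    {"currency": "USD", "scale": 1_000},
}

# Assumed exchange rates to USD (illustrative values only).
TO_USD = {"JPY": 0.007, "USD": 1.0}


def mediate(value, source_ctx, receiver_ctx):
    """Convert a raw value from the source context into the receiver context."""
    src, rcv = CONTEXTS[source_ctx], CONTEXTS[receiver_ctx]
    in_usd = value * src["scale"] * TO_USD[src["currency"]]
    return in_usd / (TO_USD[rcv["currency"]] * rcv["scale"])


# "5000" reported by the Tokyo branch means 5000 million JPY, not 5000 thousand USD.
print(mediate(5000, "tokyo_branch", "boston_hq"))  # -> 35000.0 (thousand USD)
```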
|
43 |
Data Quality By Design: A Goal-oriented Approach / Jiang, Lei, 13 August 2010 (has links)
A successful information system is one that meets its design goals. Expressing these goals and subsequently translating them into a working solution is a major challenge for information systems engineering. This thesis adopts concepts and techniques from goal-oriented (software) requirements engineering research for conceptual database design, with a focus on data quality issues. Based on a real-world case study, a goal-oriented process is proposed for database requirements analysis and modeling. It spans from analysis of high-level stakeholder goals to detailed design of a conceptual database schema. This process is then extended specifically for dealing with data quality issues: data of low quality may be detected and corrected by performing various quality assurance activities; to support these activities, the schema needs to be revised to accommodate additional data requirements. The extended process therefore focuses on analyzing and modeling quality assurance data requirements.
A quality assurance activity supported by a revised schema may involve manual work,
and/or rely on automatic techniques, which often depend on the specification and enforcement of data quality rules. To address the constraint aspect of conceptual database design, data quality rules are classified according to a number of domain- and application-independent properties. This classification can be used to guide rule designers and to facilitate the building of a rule repository. A quantitative framework is then proposed for measuring and comparing DQ rules according to one of these properties, effectiveness; this framework relies on the derivation of formulas that represent the effectiveness of DQ rules under different probabilistic assumptions.
A semi-automatic approach is also presented to derive these effectiveness formulas.
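As a hedged illustration of the effectiveness idea (a toy sketch under one assumed error model, not the formulas derived in the thesis; the parameter names and values are hypothetical): if each record is erroneous with probability p, an erroneous record violates a given rule with probability q, and correct records never violate it, then the share of errors the rule detects is simply q, which a small simulation can confirm:

```python
import random

def rule_effectiveness(p_error=0.05, q_violation=0.6, n=100_000, seed=42):
    """Monte Carlo check of a single DQ rule under an assumed error model:
    each record is erroneous with probability p_error, an erroneous record
    violates the rule (and is flagged) with probability q_violation, and
    correct records are assumed never to violate it."""
    rng = random.Random(seed)
    errors, flagged_errors = 0, 0
    for _ in range(n):
        if rng.random() < p_error:          # record is erroneous
            errors += 1
            if rng.random() < q_violation:  # the rule catches it
                flagged_errors += 1
    simulated = flagged_errors / errors if errors else 0.0
    analytical = q_violation                # effectiveness under this model
    return simulated, analytical

sim, ana = rule_effectiveness()
print(f"simulated effectiveness: {sim:.3f}, analytical: {ana:.3f}")
```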
|
44 |
GIS-Based Probabilistic Approach for Assessing and Enhancing Infrastructure Data Quality / Saliminejad, Siamak, 1983-, 14 March 2013 (has links)
The task of preserving and improving infrastructure systems is becoming extremely challenging because these systems are decaying due to aging and over-utilization, operate under limited funding, and are complex in nature (geographically dispersed, and both affecting and affected by technological, environmental, social, security, political, and economic factors). The infrastructure management paradigm has emerged to assist in the challenging task of managing infrastructure systems in a systematic and cost-effective manner. Infrastructure management is a data-driven process. It relies on large databases that contain information on the system’s inventory, condition, maintenance and rehabilitation (M&R) history, utilization, and cost. This data feeds into analytical models that assess current infrastructure conditions, predict future conditions, and develop optimal M&R strategies. Thus, complete and accurate data is essential to a reliable infrastructure management system.
This study contributes to advancing the infrastructure management paradigm (with a focus on pavement management) in two primary ways: (a) it provides an in-depth understanding of the impact of errors in condition data on the outputs of infrastructure management systems, and (b) it provides efficient computational methods for improving infrastructure data quality. First, this research provides a quantitative assessment of the effects of error magnitude and type (both systematic and random) in pavement condition data on the accuracy of pavement management system (PMS) outputs (i.e., the forecasted needed budget and M&R activities in a multi-year planning period). Second, a new technique for detecting gross outliers and pseudo outliers in pavement condition data was developed and tested. Gross outliers are data values that are likely to be erroneous, whereas pseudo outliers are pavement sections performing exceptionally well or poorly due to isolated local conditions. Third, a new technique for estimating construction and M&R history data from pavement condition data was developed and tested. This technique is especially beneficial when M&R data and condition data are stored in disparate, heterogeneous databases that are difficult to integrate (i.e., legacy databases).
The main merit of the developed techniques is their ability to integrate methods and principles from Bayesian and spatial statistics, GIS, and operations research in an efficient manner. The application of these techniques to a real-world case study (the pavement network in the Bryan district) demonstrated their potential benefits to infrastructure managers and engineers.
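As a hedged illustration of the third technique (a toy sketch only; the real method integrates Bayesian and spatial statistics, and the section ids, scores, and threshold below are invented): pavement condition normally deteriorates between inspections, so a large upward jump in a section's score suggests either an unrecorded M&R treatment or a data error.

```python
# Hypothetical condition histories (section id -> year -> condition score, 0-100).
condition_history = {
    "SEC-014": {2005: 82, 2006: 78, 2007: 74, 2008: 93, 2009: 90},  # jump in 2008
    "SEC-027": {2005: 71, 2006: 69, 2007: 66, 2008: 64, 2009: 61},  # steady decay
}

def infer_treatments(history, jump_threshold=10):
    """Return, per section, the years in which an M&R treatment (or a data
    error) is suspected because the condition score jumped upward."""
    suspected = {}
    for section, scores in history.items():
        years = sorted(scores)
        suspected[section] = [y for prev, y in zip(years, years[1:])
                              if scores[y] - scores[prev] >= jump_threshold]
    return suspected

print(infer_treatments(condition_history))
# -> {'SEC-014': [2008], 'SEC-027': []}
```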
|
45 |
Query Optimization for On-Demand Information Extraction Tasks over Text Databases / Farid, Mina H., 12 March 2012 (has links)
Many modern applications involve analyzing large amounts of data that comes from unstructured text documents. In its original format, this data contains information that, if extracted, can give more insight and help in the decision-making process. The ability to answer structured SQL queries over unstructured data allows for more complex data analysis. Querying unstructured data can be accomplished with the help of information extraction (IE) techniques. The traditional way is the Extract-Transform-Load (ETL) approach, which performs all possible extractions over the document corpus and stores the extracted relational results in a data warehouse; the extracted data is then queried. The ETL approach produces results that are out of date and causes an explosion in the number of possible relations and attributes to extract. Therefore, new approaches that perform extraction on the fly were developed; however, previous efforts relied on specialized extraction operators or particular IE algorithms, which limited the optimization opportunities for such queries.
In this work, we propose an online approach that integrates the engine of the database management system with IE systems using a new type of view called an extraction view. Queries on text documents are evaluated using these extraction views, which are populated at query time with newly extracted data. Our approach enables the optimizer to apply all well-defined optimization techniques. The optimizer selects the best execution plan using a defined cost model that considers a user-defined balance between the cost and quality of extraction, and we explain the trade-off between the two factors. The main contribution is the ability to run on-demand information extraction that takes the latest changes in the data into account, while avoiding unnecessary extraction from irrelevant text documents.
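As a hedged illustration of the extraction-view idea (a toy sketch in application code; the thesis integrates this inside the DBMS optimizer, and the corpus, extractor, and class names below are invented): the view is populated lazily at query time, extraction results are cached, and irrelevant documents are skipped to avoid unnecessary extraction cost.

```python
import re

# Hypothetical document corpus: id -> raw text.
DOCUMENTS = {
    1: "Acme Corp reported revenue of $12M in 2021.",
    2: "Weather today: sunny with light winds.",
    3: "Globex Inc reported revenue of $7M in 2021.",
}

class ExtractionView:
    """Lazily populated view over text documents, loosely in the spirit of the
    extraction views described above: tuples are extracted at query time and
    cached, and extraction is skipped for documents a cheap filter deems irrelevant."""

    def __init__(self, documents, relevance_filter, extractor):
        self.documents = documents
        self.relevance_filter = relevance_filter
        self.extractor = extractor
        self._cache = {}          # doc_id -> list of extracted tuples

    def query(self, predicate):
        results = []
        for doc_id, text in self.documents.items():
            if not self.relevance_filter(text):
                continue          # avoid extraction cost on irrelevant documents
            if doc_id not in self._cache:
                self._cache[doc_id] = self.extractor(text)
            results.extend(t for t in self._cache[doc_id] if predicate(t))
        return results

# Assumed toy extractor producing (company, revenue_in_millions) pairs.
def revenue_extractor(text):
    return [(m.group(1), int(m.group(2)))
            for m in re.finditer(r"(\w+ (?:Corp|Inc)) reported revenue of \$(\d+)M", text)]

view = ExtractionView(DOCUMENTS, lambda t: "revenue" in t, revenue_extractor)
print(view.query(lambda row: row[1] > 10))   # -> [('Acme Corp', 12)]
```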
|
47 |
An Automated Quality Assurance Procedure for Archived Transit Data from APC and AVL Systems / Saavedra, Marian Ruth, January 2010 (has links)
Automatic Vehicle Location (AVL) and Automatic Passenger Counting (APC) systems can be powerful tools for transit agencies, allowing them to archive large quantities of detailed transit operations data. Managing data quality is an important first step in exploiting these rich datasets.
This thesis presents an automated quality assurance (QA) methodology that identifies unreliable archived AVL/APC data. The approach is based on expected travel and passenger activity patterns derived from the data. It is assumed that standard passenger balancing and schedule matching algorithms are applied to the raw AVL/APC data along with any existing automatic validation programs. The proposed QA methodology is intended to provide transit agencies with a supplementary tool to manage data quality that complements, but does not replace, conventional processing routines (that can be vendor-specific and less transparent).
The proposed QA methodology endeavours to flag invalid data as “suspect” and valid data as “non-suspect”. There are three stages: i) the first stage screens data that violate physical constraints; ii) the second stage looks for data that represent outliers; and iii) the third stage evaluates whether the outlier data can be accounted for by a valid or an invalid pattern. Stop-level tests are mathematically defined for each stage; however, data is filtered at the trip level. Data that do not violate any physical constraints and do not represent any outliers are considered valid trip data. Outlier trips that can be accounted for by a valid outlier pattern are also considered valid. The remaining trip data is considered suspect.
The methodology is applied to a sample set of AVL/APC data from Grand River Transit in the Region of Waterloo, Ontario, Canada. The sample consists of four months of data, from September to December 2008, comprising 612,000 stop-level records representing 25,012 trips. The results show that 14% of the trip-level data is flagged as suspect for the sample dataset. The output is further dissected by: reviewing which tests contribute most to the set of suspect trips; confirming the pattern assumptions for the valid outlier cases; and comparing the sample data by various traits before and after the QA methodology is applied. The latter task is meant to recognize characteristics that may contribute to higher or lower quality data. Analysis shows that the largest portion of suspect trips, for this sample set, suggests the need for improved passenger balancing algorithms or greater accuracy of the APC equipment. The assumptions for valid outlier case patterns were confirmed to be reasonable. It was found that poor schedule data contributes to poorer quality in AVL/APC data. An examination of data distribution by vehicle showed that usage and the portion of suspect data varied substantially between vehicles. This information can be useful in the development of maintenance plans and sampling plans (when combined with information on data distribution by route).
A sensitivity analysis was conducted along with an impact analysis on downstream data uses. The model was found to be sensitive to three of the ten user-defined parameters. The impact of the QA procedure on network-level measures of performance (MOPs) was not found to be significant; however, the impact was shown to be more substantial for route-specific MOPs.
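As a hedged illustration of the first two stages (a toy sketch; the field names, thresholds, and records below are invented and are not the tests defined in the thesis): stage one rejects stop-level records that violate physical constraints, stage two flags statistical outliers, and flags are rolled up to the trip level.

```python
from statistics import mean, stdev

# Hypothetical stop-level AVL/APC records.
records = [
    {"trip": "T1", "stop": 5, "ons": 3,  "offs": 2, "load": 12, "dwell_s": 20},
    {"trip": "T1", "stop": 6, "ons": 0,  "offs": 4, "load": 8,  "dwell_s": 15},
    {"trip": "T2", "stop": 5, "ons": 2,  "offs": 1, "load": -3, "dwell_s": 18},   # impossible load
    {"trip": "T3", "stop": 5, "ons": 55, "offs": 0, "load": 60, "dwell_s": 300},
]

def stage1_physical(rec, capacity=80):
    """Stage 1: physical constraints - counts cannot be negative and the
    on-board load cannot exceed an assumed vehicle capacity."""
    return rec["ons"] >= 0 and rec["offs"] >= 0 and 0 <= rec["load"] <= capacity

def stage2_outlier(rec, all_records, field="ons", z=3.0):
    """Stage 2: flag records whose boarding count is a statistical outlier."""
    values = [r[field] for r in all_records]
    return abs(rec[field] - mean(values)) > z * stdev(values)

suspect_trips = set()
for rec in records:
    if not stage1_physical(rec) or stage2_outlier(rec, records):
        suspect_trips.add(rec["trip"])     # data is filtered at the trip level

print(sorted(suspect_trips))   # -> ['T2'] with these illustrative values
```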
|
48 |
Data Consistency Checks on Flight Test Data / Mueller, G., 10 1900 (has links)
ITC/USA 2014 Conference Proceedings / The Fiftieth Annual International Telemetering Conference and Technical Exhibition / October 20-23, 2014 / Town and Country Resort & Convention Center, San Diego, CA / This paper reflects the principal results of a study performed internally by Airbus's flight test centers. The purpose of this study was to share the body of knowledge concerning data consistency checks between all Airbus business units. An analysis of the test process is followed by the identification of the process stakeholders involved in ensuring data consistency. The main part of the paper lists several possibilities for improving data consistency; it is left to the discretion of the reader to determine the appropriateness of these methods.
|
50 |
The automatic classification of building maintenance / Hague, Douglas James, January 1997 (has links)
No description available.
|