51. Query Optimization for On-Demand Information Extraction Tasks over Text Databases. Farid, Mina H. (12 March 2012)
Many modern applications involve analyzing large amounts of data that come from unstructured text documents. In its original format, this data contains information that, if extracted, can give more insight and help in the decision-making process. The ability to answer structured SQL queries over unstructured data allows for more complex data analysis, and querying unstructured data can be accomplished with the help of information extraction (IE) techniques. The traditional way is the Extract-Transform-Load (ETL) approach, which performs all possible extractions over the document corpus, stores the extracted relational results in a data warehouse, and then queries the extracted data. The ETL approach produces results that become out of date and causes an explosion in the number of possible relations and attributes to extract. Therefore, new approaches were developed to perform extraction on the fly; however, previous efforts relied on specialized extraction operators or particular IE algorithms, which limited the optimization opportunities for such queries.
In this work, we propose an online approach that integrates the engine of the database management system with IE systems using a new type of view called extraction views. Queries on text documents are evaluated using these extraction views, which are populated at query time with newly extracted data. Our approach enables the optimizer to apply all well-established optimization techniques. The optimizer selects the best execution plan using a cost model that considers a user-defined balance between the cost and quality of extraction, and we explain the trade-off between the two factors. The main contribution is the ability to run on-demand information extraction that reflects the latest changes in the data while avoiding unnecessary extraction from irrelevant text documents.
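A minimal sketch of the kind of cost/quality plan selection described above, assuming a simple weighted score; the class, field names, and formula are illustrative, not the thesis's actual cost model:

```python
# Illustrative plan selection balancing extraction cost against quality.
# ExtractionPlan, alpha, and the scoring formula are assumptions for this
# sketch, not the thesis's actual cost model.
from dataclasses import dataclass

@dataclass
class ExtractionPlan:
    name: str
    est_cost: float     # estimated extraction cost, e.g. seconds of IE work
    est_quality: float  # estimated extraction quality in [0, 1]

def pick_plan(plans, alpha=0.5):
    """Lower score is better; alpha=1 optimizes purely for cost,
    alpha=0 purely for quality. Costs are normalized to [0, 1]."""
    max_cost = max(p.est_cost for p in plans)
    def score(p):
        return alpha * (p.est_cost / max_cost) + (1 - alpha) * (1 - p.est_quality)
    return min(plans, key=score)

plans = [
    ExtractionPlan("shallow-regex-extractor", est_cost=2.0, est_quality=0.70),
    ExtractionPlan("full-NLP-extractor", est_cost=40.0, est_quality=0.95),
]
print(pick_plan(plans, alpha=0.8).name)  # cost-heavy weighting picks the fast plan
```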
52. Privacy preservation in data mining through noise addition. Islam, Md Zahidul (January 2008)
Research Doctorate - Doctor of Philosophy (PhD). Due to advances in information processing technology and storage capacity, huge amounts of data are now being collected for various data analyses. Data mining techniques, such as classification, are often applied to these data to extract hidden information. During the whole process of data mining, the data are exposed to several parties, and such exposure potentially leads to breaches of individual privacy. This thesis presents a comprehensive noise addition technique for protecting individual privacy in a data set used for classification, while maintaining the data quality. We add noise to all attributes, both numerical and categorical, and to both class and non-class attributes, in such a way that the original patterns are preserved in the perturbed data set. Our technique is also capable of incorporating previously proposed noise addition techniques that maintain the statistical parameters of the data set, including correlations among attributes; thus the perturbed data set may be used not only for classification but also for statistical analysis. Our proposal has two main advantages. Firstly, as our experimental results also suggest, the perturbed data set maintains the same or very similar patterns as the original data set, as well as the correlations among attributes. While some noise addition techniques maintain the statistical parameters of the data set, to the best of our knowledge this is the first comprehensive technique that preserves the patterns and thus removes the so-called Data Mining Bias from the perturbed data set. Secondly, re-identification of the original records depends directly on the amount of noise added, and in general can be made arbitrarily hard while still preserving the original patterns in the data set. The only exception is the case where an intruder knows enough about a record to learn the confidential class value by applying the classifier; however, this is always possible, even when the original record has not been used in the training data set. In other words, provided that enough noise is added, our technique makes the records from the training set as safe as any other previously unseen records of the same kind. In addition to the above contribution, this thesis also explores the suitability of prediction accuracy as a sole indicator of data quality, and proposes a technique for clustering both categorical values and records containing such values.
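As a rough illustration of noise addition over mixed attribute types, a small Python sketch; it adds bounded Gaussian noise to numerical attributes and randomly replaces categorical values, whereas the thesis's technique additionally constrains the noise so that the original classification patterns are preserved:

```python
# Illustrative mixed-type noise addition: Gaussian noise for numerical
# attributes, random replacement for categorical ones. The thesis's actual
# technique additionally bounds the noise so the original patterns survive.
import random

def perturb_record(record, schema, noise_sd=0.05, cat_flip_prob=0.1):
    """schema maps attribute -> ('num', (lo, hi)) or ('cat', [values])."""
    noisy = {}
    for attr, value in record.items():
        kind, info = schema[attr]
        if kind == "num":
            lo, hi = info
            value += random.gauss(0, noise_sd * (hi - lo))
            value = min(max(value, lo), hi)  # clamp to the attribute's domain
        elif random.random() < cat_flip_prob:
            value = random.choice([v for v in info if v != value])
        noisy[attr] = value
    return noisy

schema = {"age": ("num", (18, 90)),
          "city": ("cat", ["Sydney", "Newcastle", "Perth"])}
print(perturb_record({"age": 42, "city": "Newcastle"}, schema))
```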
53. Impact of data quality on photovoltaic (PV) performance assessment. Koubli, Eleni (January 2017)
In this work, data quality control and mitigation tools have been developed to improve the accuracy of photovoltaic (PV) system performance assessment. These tools make it possible to demonstrate the impact of ignoring erroneous or lost data on performance evaluation and fault detection. The work focuses mainly on residential PV systems, where monitoring is limited to recording total generation and the lack of meteorological data makes quality control truly challenging. The main quality issues addressed in this work concern wrong system descriptions and missing electrical and/or meteorological data in monitoring. An automatic detection of wrong input information, such as system nominal capacity and azimuth, is developed based on statistical distributions of annual figures of PV system performance ratio (PR) and final yield. This approach is specifically useful in PV fleet analyses where only monthly or annual energy outputs are available. The evaluation is carried out using synthetic weather data obtained by interpolating from a network of about 80 meteorological monitoring stations operated by the UK Meteorological Office. The procedures are applied to a large domestic PV dataset, obtained from a social housing organisation, in which a significant number of cases with wrong input information are found. Data interruption is identified as another challenge in PV monitoring data, although its effect is particularly under-researched in the area of PV. Disregarding missing energy generation data leads to falsely estimated performance figures, which may in turn lead to false alarms on performance and/or failure to meet the requirements for the financial revenue of a domestic system through the feed-in-tariff scheme. In this work, the effect of missing data is mitigated by applying novel data inference methods based on empirical and artificial neural network approaches, training algorithms, and remotely inferred weather data. Various cases of data loss are considered, and case studies from the CREST monitoring system and the domestic dataset are used as test cases. When back-filled energy output is used, monthly PR estimation yields more accurate results than when prolonged data gaps are included in the analysis. Finally, to further discriminate more obscure data issues from system faults when higher-temporal-resolution data is available, a remote modelling and failure detection framework is developed based on a physical electrical model, remote input weather data, and a system description extracted from PV module and inverter manufacturer datasheets. The failure detection is based on the analysis of daily profiles and long-term PR comparison of neighbouring PV systems. Employing this tool on various case studies shows that undetected wrong data may severely obscure fault detection, affecting the PV system's lifetime performance. Based on the results and conclusions of this work on the employed residential dataset, essential data requirements for domestic PV monitoring are introduced as a potential contribution to existing lessons learnt in PV monitoring.
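A hedged sketch of the performance-ratio check underlying the wrong-metadata detection; the PR definition (final yield over reference yield) is standard, while the thresholds and names below are illustrative assumptions:

```python
# Minimal PR computation and metadata-plausibility flag. The PR formula is
# the standard one; the cutoffs and messages are assumptions for this sketch.

def performance_ratio(e_out_kwh, p_nominal_kwp, irradiation_kwh_m2):
    """PR = final yield (kWh/kWp) / reference yield (irradiation at 1 kW/m2 STC)."""
    final_yield = e_out_kwh / p_nominal_kwp
    reference_yield = irradiation_kwh_m2 / 1.0  # G_STC = 1 kW/m2
    return final_yield / reference_yield

def flag_metadata(annual_pr, lo=0.5, hi=0.95):
    """An annual PR far outside the expected distribution suggests a wrong
    declared capacity or azimuth rather than a genuinely broken system."""
    if annual_pr > hi:
        return "suspect: declared kWp likely too low"
    if annual_pr < lo:
        return "suspect: wrong kWp/azimuth, heavy shading, or faults"
    return "plausible"

pr = performance_ratio(e_out_kwh=3400, p_nominal_kwp=4.0, irradiation_kwh_m2=1050)
print(round(pr, 2), flag_metadata(pr))  # 0.81 plausible
```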
54. Quality data extraction methodology based on the labeling of coffee leaves with nutritional deficiencies. Jungbluth, Adolfo; Yeng, Jon Li (04 1900)
The full text of this work is not available in the UPC Academic Repository due to restrictions imposed by the publisher. / Nutritional deficiency detection for coffee leaves is a task often undertaken manually by experts in the field known as agronomists. The process they follow to carry out this task is based on observation of the different characteristics of the coffee leaves while relying on their own experience. Visual fatigue and human error in this empiric approach cause leaves to be incorrectly labeled, affecting the quality of the data obtained. In this context, different crowdsourcing approaches can be applied to enhance the quality of the extracted data. These approaches separately propose the use of voting systems, association rule filters, and evolutive learning. In this paper, we extend the use of association rule filters and the evolutive approach by combining them in a methodology that enhances data quality while guiding users during the main stages of data extraction tasks. Moreover, our methodology proposes a reward component to engage users and keep them motivated during the crowdsourcing tasks. Applying our proposed methodology in a case study on Peruvian coffee leaves yielded a dataset with 93.33% accuracy, with 30 instances collected by 8 experts and evaluated by 2 agronomic engineers with a background in coffee leaves. This accuracy was higher than that of independently implementing the evolutive feedback strategy (86.67%) or an empiric approach (70%) under the same conditions. / Peer reviewed.
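A small sketch of the voting-system component that such crowdsourcing approaches use; the agreement threshold and data layout are assumptions for illustration, not the paper's exact methodology:

```python
# Illustrative consensus labeling: a leaf label is accepted only when enough
# annotators agree; otherwise the instance is routed back for expert review.
from collections import Counter

def consensus_label(votes, min_agreement=0.6):
    """Return the majority label if its vote share meets the threshold,
    otherwise None (i.e., send the leaf image back for expert review)."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None

leaf_votes = ["N-deficiency", "N-deficiency", "K-deficiency", "N-deficiency"]
print(consensus_label(leaf_votes))       # 'N-deficiency' (75% agreement)
print(consensus_label(["A", "B"], 0.6))  # None -> needs expert review
```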
55. The complexities and possibilities of health data utilization in the West Coast District. Zimri, Irma Selina (January 2018)
Magister Commercii - MCom (Information Management). In an ideal public health arena, scientific evidence should be incorporated in the health information practices of making management decisions, developing policies, and implementing programs. However, much effort has been spent on developing health information practices that focus mainly on data collection, data quality, and processing, with relatively little development on the utilization side of the information spectrum. Although the South African Health National Indicator Dataset of 2013 routinely collects and reports on more than two hundred elements, the degree to which this information is being used is not empirically known. The overall aim of the study was to explore the dynamics of routine primary healthcare information utilization in the West Coast district, while identifying specific interventions that could ultimately lead to the improved use of data to better inform decision-making, the ultimate goal being to enable managers to better utilize their routine health information for effective decision-making.
56. Converting HTML product data to Linked Data (Transformace HTML dat o produktech do Linked Data formátu). Kadleček, Rastislav (January 2018)
In order to make a step towards the idea of the Semantic Web, it is necessary to research ways to retrieve semantic information from documents published on the current Web 2.0. In response to the growing amount of data published in the form of relational tables, the Odalic system, based on the extended TableMiner+ semantic table interpretation algorithm, was introduced to provide a convenient way to semantize tabular data using a knowledge base disambiguation process. The goal of this thesis is to propose an extended algorithm for the Odalic system that allows it to gather semantic information for tabular data describing products from e-shops, which have a very limited presence in the knowledge bases. This is achieved using a machine learning technique called classification. The thesis consists of several parts: obtaining and preprocessing the product data from e-shops, evaluating several classification algorithms in order to select the best-performing one, describing the design and implementation of the extended Odalic algorithm, describing its integration into the Odalic system, evaluating the improved algorithm using the obtained product data, and semantizing the product data using the new Odalic algorithm. In the end, the results are concluded and possible...
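A hedged sketch of the classification step for short product strings, here using TF-IDF character n-grams with logistic regression via scikit-learn; the thesis evaluates several algorithms before choosing one, so this particular pipeline is only an assumed example:

```python
# Illustrative product-string classification: TF-IDF character n-grams plus
# logistic regression. The pipeline and toy data are assumptions for this
# sketch, not the algorithm the thesis ultimately selected.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["Samsung Galaxy S9 64GB", "Nikon D3500 DSLR body",
               "Canon EOS 2000D kit", "iPhone 8 Plus 128GB"]
train_labels = ["Mobile Phone", "Camera", "Camera", "Mobile Phone"]

clf = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                    LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)
print(clf.predict(["Sony Alpha A6000 mirrorless"]))  # likely ['Camera']
```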
57. MOOCLink: Linking and Maintaining Quality of Data Provided by Various MOOC Providers (January 2016)
The concept of Linked Data, the method of publishing and linking structured data on the web, is gaining widespread popularity and importance. The emergence of Linked Data has made it possible to make sense of the huge amount of data scattered all over the web and to link multiple heterogeneous sources. This leads to the challenge of maintaining the quality of Linked Data, i.e., ensuring that outdated data is removed and new data is included. The focus of this thesis is devising strategies to effectively integrate data from multiple sources, publish it as Linked Data, and maintain its quality. The domain used in the study is online education. With so many online courses offered as Massive Open Online Courses (MOOCs), it is becoming increasingly difficult for an end user to gauge which course best fits his/her needs.
Users are spoilt for choice, and it would be very helpful for them if there were a single place where they could visually compare the offerings of various MOOC providers for the course they are interested in. Previous work in this area was done through the MOOCLink project, which involved integrating data from Coursera, EdX, and Udacity and generating linked data, i.e., Resource Description Framework (RDF) triples.
The research objective of this thesis is to determine a methodology by which the quality of data available through the MOOCLink application is maintained, as many new courses are constantly added and old courses removed by data providers. This thesis presents the integration of data from various MOOC providers, together with algorithms for incrementally updating the linked data to maintain its quality and keep users engaged with up-to-date data, compared against a naive approach. A master threshold value, determined through experiments and analysis, quantifies when one algorithm is better than the other in terms of time efficiency. An evaluation of the tool shows the effectiveness of the algorithms presented in this thesis. Masters Thesis, Computer Science, 2016.
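A minimal sketch of the incremental-update idea using rdflib: instead of rebuilding the whole graph as in the naive approach, diff a fresh provider crawl against the stored triples and touch only the changed courses. The namespace, predicate, and function here are illustrative assumptions:

```python
# Illustrative incremental update of course triples with rdflib. Only added,
# changed, or withdrawn courses are touched; everything else is left alone.
from rdflib import Graph, Literal, Namespace

MOOC = Namespace("http://example.org/mooclink/")  # assumed namespace

def sync_courses(graph, fresh):
    """fresh: dict course_id -> course title from a provider crawl."""
    stored = {s: str(o) for s, _, o in graph.triples((None, MOOC.title, None))}
    for cid, title in fresh.items():
        uri = MOOC[cid]
        if stored.get(uri) != title:         # new or changed course
            graph.set((uri, MOOC.title, Literal(title)))
    for uri in set(stored) - {MOOC[cid] for cid in fresh}:
        graph.remove((uri, None, None))      # course withdrawn by provider
    return graph

g = sync_courses(Graph(), {"cs101": "Intro to Computer Science"})
print(len(g))  # 1 triple stored
```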
58. Monitoring and Improving User Compliance and Data Quality for Long and Repetitive Self-Reporting mHealth Surveys (January 2017)
For the past decade, mobile health (mHealth) applications have seen growing acceptance due to their potential to remotely monitor and increase patient engagement, particularly for chronic disease. Sickle Cell Disease (SCD) is an inherited chronic disorder of red blood cells requiring careful pain management. A significant number of mHealth applications have been developed to help clinicians collect and monitor information on SCD patients. Surveys are the most common way to self-report patient conditions, but they are non-engaging and suffer from poor compliance. The quality of data gathered from survey instruments while using technology can be questioned, as patients may be motivated to complete a task but not motivated to do it well. A compromise in the quality and quantity of the collected patient data hinders clinicians' efforts to monitor patients' health on a regular basis and derive effective treatment measures. This research study has two goals. The first is to monitor user compliance and data quality in mHealth apps that deliver long and repetitive surveys. The second is to identify possible motivational interventions to help improve compliance and data quality. As a form of intervention, I introduce intrinsic and extrinsic motivational factors within the application and test them on a small target population. I validate the impact of these motivational factors by performing a comparative analysis of the test results to determine improvements in user performance. This study is relevant as it helps analyze user behavior in long and repetitive self-reporting tasks and derive measures to improve user performance. The results will assist software engineers working with doctors in designing and developing improved self-reporting mHealth applications that collect better-quality data and enhance user compliance. Masters Thesis, Computer Science, 2017.
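As a rough illustration, per-participant compliance and quality indicators might be computed from survey logs along these lines; the field names and cutoffs are assumptions, not the study's actual instruments:

```python
# Illustrative per-participant survey quality report: completion rate,
# straightlining (all identical answers), and rushed completion. All
# thresholds and the log layout are assumptions for this sketch.

def survey_quality(responses, expected_items, min_seconds=60):
    """responses: list of dicts like {'answers': [...], 'seconds': float}."""
    report = []
    for r in responses:
        answers = r["answers"]
        completion = len([a for a in answers if a is not None]) / expected_items
        straightline = len(set(a for a in answers if a is not None)) <= 1
        rushed = r["seconds"] < min_seconds
        report.append({"completion": round(completion, 2),
                       "flag": straightline or rushed or completion < 0.8})
    return report

logs = [{"answers": [3, 3, 3, 3, 3], "seconds": 35},   # straightlined and rushed
        {"answers": [2, 4, 1, None, 5], "seconds": 180}]
print(survey_quality(logs, expected_items=5))
```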
59. Variants of data quality management within the regulation Solvency II (Varianty řízení datové kvality v rámci regulace Solvency II). Pastrňáková, Alena (January 2014)
The diploma thesis deals with data quality in connection with the legal requirements of the Solvency II regulation, which insurance companies must meet in order to keep their licences. The aim of this thesis is to assess the opportunities and impacts of implementing data quality for Solvency II. All data quality requirements of the regulation are specified and supplemented with possible ways to meet them, and related data quality areas are also described. Sample variants of manual, partially automated, and fully automated solutions are compared with regard to cost and time expenditure, based on knowledge and acquired information. The benefit of this thesis is an evaluation of the possible positive and negative impacts of implementing data quality for Solvency II, taking into account the possibility of introducing data quality across the entire company. The general solution variants can also be used for decision-making on implementing data quality in most companies outside the insurance industry.
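A hedged sketch of what a fully automated, rule-based check against Solvency II-style data quality criteria (completeness, accuracy) could look like; the specific rules and field names are illustrative assumptions:

```python
# Illustrative rule-based data quality checks on policy records. The mandatory
# fields and plausibility rules are assumptions, not the regulation's text.

def check_policy_record(rec):
    issues = []
    # completeness: mandatory fields must be present and non-empty
    for field in ("policy_id", "premium", "inception_date"):
        if rec.get(field) in (None, ""):
            issues.append(f"missing {field}")
    # accuracy: values must lie in plausible domains
    if rec.get("premium") is not None and rec["premium"] < 0:
        issues.append("negative premium")
    return issues

records = [{"policy_id": "P-1", "premium": 1200.0, "inception_date": "2014-01-01"},
           {"policy_id": "P-2", "premium": -50.0, "inception_date": ""}]
for rec in records:
    print(rec["policy_id"], check_policy_record(rec) or "OK")
```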
60. Data Quality: Does Time of Semester Matter? Hillhouse, Linden; Blackhart, Ginette (12 April 2019)
When conducting scientific research, obtaining high-quality data is important. When collecting data from a college student participant pool, however, factors such as the time of the semester in which data are collected could cause validity issues, especially if the survey is completed in an online, non-laboratory setting. Near the end of the semester, students may experience more time pressures and constraints than at other times in the semester. These additional pressures may encourage participants to multi-task while completing the study, or to rush through the survey in order to receive credits as quickly as possible. The hypothesis of this study was that responses collected at the end of the semester would exhibit lower data quality than responses collected at the beginning of the semester. Data were collected online during the last two weeks of the fall 2018 semester (n = 312) and the first two weeks of the spring 2019 semester (n = 55). Participants were asked to write about an embarrassing situation and then completed a number of questionnaires assessing their thoughts and feelings about the event, personality traits, and participant engagement. Data quality was assessed using several previously validated methods, including time spent on the survey; the number of missed items; the number of incorrect embedded attention-check items (out of 12); the length of responses to two open-ended questions; self-reported diligence, interest, effort, and attention, and whether participants said their data should be used; and Cronbach's alphas on the scales. Results showed significant differences between the two groups in the length of open-ended responses and in self-reported diligence, interest, effort, attention, neuroticism, and conscientiousness: participants completing the study in the first two weeks of the spring 2019 semester had significantly longer open-ended responses and significantly higher levels on all of these measures. Although there was no significant difference in the number of incorrect attention-check items between the two groups, it should be noted that only 46% of all participants missed no check items. These results support the hypothesis that data collected at the end of the semester may be of lower quality than data collected at the beginning of the semester. However, because the groups differed significantly on neuroticism and conscientiousness, we cannot determine whether the time-of-semester effect is a product of internal participant characteristics or external pressures. Nevertheless, researchers should take this end-of-semester data quality difference into account when deciding the time-frame of their data collection.
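One of the quality checks named above, Cronbach's alpha, can be computed directly from item responses; the formula below is the standard one, while the toy data is invented for illustration:

```python
# Standard Cronbach's alpha from item responses; toy data for illustration.
import numpy as np

def cronbach_alpha(items):
    """items: 2-D array, rows = participants, columns = scale items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of per-item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

consistent = [[4, 4, 5, 4], [2, 3, 2, 2], [5, 4, 5, 5]]    # careful responding
inconsistent = [[1, 5, 2, 4], [5, 1, 4, 3], [2, 4, 1, 5]]  # haphazard responding
print(round(cronbach_alpha(consistent), 2))    # high alpha (near 1)
print(round(cronbach_alpha(inconsistent), 2))  # low or negative alpha
```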