Global ETD Search

1	Quality data extraction methodology based on the labeling of coffee leaves with nutritional deficiencies Jungbluth, Adolfo, Yeng, Jon Li 04 1900 (has links) The full text of this work is not available in the Repositorio Académico UPC because of restrictions imposed by the publisher where it appeared. / Detecting nutritional deficiencies in coffee leaves is a task often undertaken manually by experts in the field, known as agronomists. The process they follow to carry out this task is based on observing the different characteristics of the coffee leaves while relying on their own experience. Visual fatigue and human error in this empirical approach cause leaves to be labeled incorrectly, which affects the quality of the data obtained. In this context, different crowdsourcing approaches can be applied to enhance the quality of the data extracted. These approaches separately propose the use of voting systems, association rule filters and evolutive learning. In this paper, we extend the use of association rule filters and the evolutive approach by combining them in a methodology to enhance the quality of the data while guiding the users during the main stages of data extraction tasks. Moreover, our methodology proposes a reward component to engage users and keep them motivated during the crowdsourcing tasks. The dataset extracted by applying our proposed methodology in a case study on Peruvian coffee leaves reached 93.33% accuracy with 30 instances collected by 8 experts and evaluated by 2 agronomic engineers with a background in coffee leaves. The accuracy of the dataset was higher than that obtained by independently implementing the evolutive feedback strategy and an empirical approach, which resulted in 86.67% and 70% accuracy respectively under the same conditions. / Peer reviewed Data extraction Data quality assessment Quality data extraction methodology
2	Reducing human effort in web data extraction Guo, Jinsong January 2017 (has links) The human effort in large-scale web data extraction significantly affects both the extraction flexibility and the economic cost. Our work aims to reduce the human effort required by web data extraction tasks in three specific scenarios. (I) Data demand is unclear, and the user has to guide the wrapper induction by annotations. To save as much human effort as possible in the annotation process, wrappers should be robust, i.e., immune to changes in the webpage, to avoid wrapper re-generation, which requires a re-annotation process. Existing approaches primarily aim at generating accurate wrappers but rarely generate robust wrappers. We prove that the XPATH wrapper induction problem is NP-hard, and propose an approximate solution estimating a set of top-k robust wrappers in polynomial time. Our method also meets one additional requirement: the induction process should be noise resistant, i.e., tolerate slightly erroneous examples. (II) Data demand is clear, and user guidance should be avoided, i.e., the wrapper generation should be fully unsupervised. Existing unsupervised methods that rely purely on the repeated patterns of HTML structures or visual information are far from being practical. Partially supervised methods, such as the state-of-the-art system DIADEM, can work well for tasks involving only a small number of domains. However, the human effort in the annotator preparation process becomes a heavier burden as the number of domains increases. We propose a new approach, called RED (abbreviation for 'redundancy'), an automatic approach exploiting content redundancy between the result page and its corresponding detail pages. RED requires no annotation (and thus no human effort), and its wrapper accuracy is significantly higher than that of previous unsupervised methods. (III) Data quality is unknown, and the user's related decisions are blind. Without knowing the error types and the number of errors of each type in the extracted data, the extraction effort could be wasted on useless websites and, even worse, the human effort could be wasted on unnecessary or wrongly targeted data cleaning processes. Despite the importance of error estimation, no methods have addressed it sufficiently. We focus on two types of common errors in web data, namely duplicates and violations of integrity constraints. We propose a series of error estimation approaches by adapting, extending, and synthesizing some recent innovations in diverse areas such as active learning, classifier calibration, F-measure estimation, and interactive training. Computer science ; Web data extraction
3	Automation of Generalized Measurement Extraction from Telemetric Network Systems Seegmiller, Ray D., Willden, Greg C., Araujo, Maria S., Newton, Todd A., Abbott, Ben A., Malatesta, William A. 10 1900 (has links) ITC/USA 2012 Conference Proceedings / The Forty-Eighth Annual International Telemetering Conference and Technical Exhibition / October 22-25, 2012 / Town and Country Resort & Convention Center, San Diego, California / In telemetric network systems, data extraction is often an afterthought. The data description frequently changes throughout the program, so last-minute modifications of the data extraction approach are often required. This paper presents an alternative approach in which automation of measurement extraction is supported. The key element is a formal declarative language that can be used to configure instrumentation devices as well as measurement extraction devices. The Metadata Description Language (MDL) defined by the integrated Network Enhanced Telemetry (iNET) program, augmented with a generalized measurement extraction approach, addresses this issue. This paper describes the TmNS Data Extractor Tool, as well as lessons learned from commercial systems, the iNET program and TMATS. Data Format Data Extraction iNET Metadata MDL
4	Information Aggregation using the Cameleon# Web Wrapper Firat, Aykut, Madnick, Stuart, Yahaya, Nor Adnan, Kuan, Choo Wai, Bressan, Stéphane 29 July 2005 (has links) Cameleon# is a web data extraction and management tool that provides information aggregation with advanced capabilities that are useful for developing value-added applications and services for electronic business and electronic commerce. To illustrate its features, we use an airfare aggregation example that collects data from eight online sites, including Travelocity, Orbitz, and Expedia. This paper covers the integration of Cameleon# with commercial database management systems, such as MS SQL Server, and XML query languages, such as XQuery. Cameleon# web data extraction web data management
5	Generating Data-Extraction Ontologies By Example Zhou, Yuanqiu 22 November 2005 (has links) (PDF) Ontology-based data extraction is a resilient web data-extraction approach. A major limitation of this approach is that ontology experts must manually develop and maintain data-extraction ontologies. This limitation prevents ordinary users who have little knowledge of conceptual models from making use of this resilient approach. In this thesis we have designed and implemented a general framework, OntoByE, to generate data-extraction ontologies semi-automatically through a small set of examples collected by users. Experimental evidence shows that, with the assistance of a limited amount of prior knowledge, OntoByE is capable of interacting with users to generate data-extraction ontologies for domains of interest to them. Ontology Web data data extraction Computer Sciences
6	Knowledge-based Data Extraction Workbench for Eclipse Rangaraj, Jithendra Kumar 18 December 2012 (has links) No description available. Computer Engineering Computer Science data extraction Knowledge-based data extraction KDE eclipse plugins ontology mapping plugins for eclipse
7	RAG-based data extraction : Mining information from second-life battery documents Edström, Jesper January 2024 (has links) With the constant evolution of Large Language Models (LLMs), methods for minimizing hallucinations are being developed to provide more truthful answers. By using Retrieval-Augmented Generation (RAG), external data can be provided to the model on which its answers should be based. This project aims to use RAG in a data extraction pipeline tailored to second-life batteries. Because the prompts are pre-defined, the user only provides the documents to be analyzed; this ensures that the answers are in the correct format for further data processing. To process different document types, initial labeling takes place before more specific extraction suited to the document is applied. The best performance is achieved by grouping questions, which allows the model to reason about which questions are relevant so that no hallucinations occur. The model performs equally well whether there are two or three document types, and it is clear that a pipeline of this type is well suited to today's models. Further improvements can be achieved by using models with a larger context window and by first using Optical Character Recognition (OCR) to read text from the documents. RAG Retrieval-Augmented Generation LLM AI Data extraction second-life battery data extraction pipeline data extraction
8	Multi-Agent Architecture for Internet Information Extraction and Visualization Gollapally, Devender R. 08 1900 (has links) The World Wide Web is one of the largest sources of information; more and more applications are being developed daily to make use of this information. This thesis presents a multi-agent architecture that deals with some of the issues related to Internet data extraction. The primary issue is the reliable, efficient and quick extraction of data through the use of HTTP performance monitoring agents. A second issue is how to use the available data to make decisions and alert the user when the data change; this is done with the help of user agents that are equipped with a defeasible reasoning interpreter. An additional issue is the visualization of extracted data; this is done with the aid of VRML visualization agents. These issues are discussed using stock portfolio management as an example application. Information retrieval. Internet. World Wide Web Internet data extraction
9	Query Rewriting for Extracting Data behind HTML Forms Chen, Xueqi 02 April 2004 (has links) (PDF) Much of the information on the Web is stored in specialized searchable databases and can only be accessed by interacting with a form or a series of forms. As a result, enabling automated agents and Web crawlers to interact with form-based interfaces designed primarily for humans is of great value. This thesis describes a system that can fill out Web forms automatically according to a given user query against a global schema for an application domain and, to the extent possible, extract just the relevant data behind these Web forms. Experimental results on two application domains show that the approach is reasonable for HTML forms. computer science data extraction HTML forms Computer Sciences
10	IMPLEMENTING EFORM-BASED BASELINE RISK DATA EXTRACTION FROM HIGH QUALITY PAPERS FOR THE BRISKET DATABASE AND TOOL Jacob, Anand 06 1900 (has links) This thesis was undertaken to investigate whether an eForm-based extractor interface would improve the efficiency of the baseline risk extraction process for BRiskeT (Baseline Risk e-Tool). The BRiskeT database will contain the extracted baseline risk data from top prognostic research articles. BRiskeT utilizes McMaster University’s PLUS (Premium Literature Service) database to thoroughly vet articles prior to their inclusion in BRiskeT. The articles that have met the inclusion criteria are then passed into the extractor interface developed for this thesis, called MacPrognosis. MacPrognosis displays these articles to a data extractor, who fills out an electronic form that gives an overview of the baseline risk information in an article. The baseline risk information is subsequently saved to the BRiskeT database, which can then be queried according to the end user’s needs. One of the goals in switching from a paper-based extraction system to an eForm-based system was to save time in the extraction process. Another goal for MacPrognosis was to create an eForm that allowed baseline risk information to be extracted from as many disciplines as possible. To test whether MacPrognosis succeeded in saving extraction time and improving the proportion of articles from which baseline risk data could be extracted, it was used to extract data from a large test set of articles. The results of the extraction process were then compared with results from a previously conducted data extraction pilot that used a paper-based system created during the feasibility analysis for BRiskeT in 2012. The new eForm-based extractor interface not only sped up data extraction but, with minor future alterations, may also increase the proportion of articles from which data can be successfully extracted compared with a paper-based model of extraction. / Thesis / Master of Science (MSc) Baseline Risk Data Extraction Database Primary Literature McMaster PLUS Database
