1 |
Automated retrieval and extraction of training course information from unstructured web pages
Xhemali, Daniela January 2010
Web Information Extraction (WIE) is the discipline dealing with the discovery, processing and extraction of specific pieces of information from semi-structured or unstructured web pages. The World Wide Web comprises billions of web pages, and there is much need for systems that will locate, extract and integrate the acquired knowledge into organisations' practices. There are some commercial, automated web extraction software packages; however, their success comes from heavily involving their users in the process of finding the relevant web pages, preparing the system to recognise items of interest on these pages and manually dealing with the evaluation and storage of the extracted results. This research has explored WIE, specifically with regard to the automation of the extraction and validation of online training information. The work also includes research and development in the area of automated Web Information Retrieval (WIR), more specifically in Web Searching (or Crawling) and Web Classification. Different technologies were considered; after much consideration, Naïve Bayes Networks were chosen as the most suitable for the development of the classification system. The extraction part of the system used Genetic Programming (GP) for the generation of web extraction solutions. Specifically, GP was used to evolve Regular Expressions, which were then used to extract specific training course information from the web, such as course names, prices, dates and locations. The experimental results indicate that all three aspects of this research perform very well, with the Web Crawler outperforming existing crawling systems, the Web Classifier performing with an accuracy of over 95% and a precision of over 98%, and the Web Extractor achieving an accuracy of over 94% for the extraction of course titles and an accuracy of just under 67% for the extraction of other course attributes such as dates, prices and locations.
Furthermore, the overall work is of great significance to the sponsoring company, as it simplifies and improves the existing time-consuming, labour-intensive and error-prone manual techniques, as will be discussed in this thesis. The prototype developed in this research works in the background and requires very little, often no, human assistance.
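As a rough illustration of the regular-expression extraction step described in this abstract, the following Python sketch pulls prices and dates out of a course listing and scores a pattern against labelled examples. The patterns are hand-written assumptions for illustration only; the thesis evolved its expressions with GP, and neither those expressions nor its fitness measure is given here. The names `PRICE_RE`, `DATE_RE`, `extract_course_fields` and `fitness` are hypothetical.

```python
import re

# Hypothetical hand-written patterns (NOT the GP-evolved expressions from the thesis).
PRICE_RE = re.compile(r"(?:£|\$|€)\s?\d{1,3}(?:,\d{3})*(?:\.\d{2})?")
MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
DATE_RE = re.compile(r"\b\d{1,2}\s+(?:" + MONTHS + r")\s+\d{4}\b")


def extract_course_fields(text):
    """Return every price and date string found in a piece of page text."""
    return {
        "prices": PRICE_RE.findall(text),
        "dates": DATE_RE.findall(text),
    }


def fitness(pattern, labelled):
    """Fraction of (text, expected) pairs whose first match equals expected.

    A GP system could use a score of this general kind to compare candidate
    expressions; the scoring actually used in the thesis is not stated here.
    """
    hits = 0
    for text, expected in labelled:
        match = pattern.search(text)
        if match and match.group(0) == expected:
            hits += 1
    return hits / len(labelled)


if __name__ == "__main__":
    listing = "Project Management Basics, London, 14 March 2010, £1,250.00 per delegate"
    print(extract_course_fields(listing))
```

A candidate expression with a higher `fitness` value on the labelled examples would be preferred over one with a lower value, which is the general idea behind evolving extraction patterns rather than writing them by hand.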
|
2 |
Biomedical Literature Mining and Knowledge Discovery of Phenotyping Definitions
Binkheder, Samar Hussein 07 1900
Indiana University-Purdue University Indianapolis (IUPUI) / Phenotyping definitions are essential in cohort identification when conducting
clinical research, but they become an obstacle when they are not readily available.
Developing new definitions manually requires expert involvement that is labor-intensive,
time-consuming, and unscalable. Moreover, automated approaches rely mostly on
electronic health records’ data that suffer from bias, confounding, and incompleteness.
Limited efforts have been made to utilize text-mining and data-driven approaches to automate
extraction and literature-based knowledge discovery of phenotyping definitions and to
support their scalability. In this dissertation, we proposed a text-mining pipeline combining
rule-based and machine-learning methods to automate retrieval, classification, and
extraction of phenotyping definitions’ information from literature. To achieve this, we first
developed an annotation guideline with ten dimensions to annotate sentences with evidence
of phenotyping definitions' modalities, such as phenotypes and laboratories. Two
annotators manually annotated a corpus of sentences (n=3,971) extracted from full-text
observational studies’ methods sections (n=86). Percent and Kappa statistics showed high
inter-annotator agreement on sentence-level annotations. Second, we constructed two
validated text classifiers using our annotated corpora: abstract-level and full-text sentence-level.
We applied the abstract-level classifier to a large-scale biomedical literature corpus of over
20 million abstracts published between 1975 and 2018 to classify positive abstracts
(n=459,406). After retrieving their full-texts (n=120,868), we extracted sentences from
their methods sections and used the full-text sentence-level classifier to extract positive
sentences (n=2,745,416). Third, we performed a literature-based discovery utilizing the
positively classified sentences. Lexica-based methods were used to recognize medical
concepts in these sentences (n=19,423). Co-occurrence and association methods were used
to identify and rank phenotype candidates that are associated with a phenotype of interest.
We derived 12,616,465 associations from our large-scale corpus. Our literature-based
associations and large-scale corpus contribute to building new data-driven phenotyping
definitions and expanding existing definitions with minimal expert involvement.
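The co-occurrence step mentioned in this abstract can be sketched in Python as follows. This is a minimal illustration that assumes sentence-level co-occurrence and ranks candidates by raw counts; the dissertation's actual association measures are not specified in the abstract, and the function names and toy concepts below are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences_concepts):
    """Count how often each unordered pair of concepts shares a sentence.

    `sentences_concepts` is a list of lists: the concepts recognized in each
    positively classified sentence (an assumed input format).
    """
    counts = Counter()
    for concepts in sentences_concepts:
        # De-duplicate and sort so each pair is counted once per sentence.
        for a, b in combinations(sorted(set(concepts)), 2):
            counts[(a, b)] += 1
    return counts


def rank_candidates(counts, target):
    """Rank concepts by how often they co-occur with the target phenotype."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == target:
            scores[b] += n
        elif b == target:
            scores[a] += n
    return scores.most_common()


if __name__ == "__main__":
    sentences = [
        ["diabetes", "hba1c"],
        ["diabetes", "hba1c", "metformin"],
        ["hypertension", "metformin"],
    ]
    print(rank_candidates(cooccurrence_counts(sentences), "diabetes"))
```

Raw counts favour frequent concepts, so a real pipeline would likely normalize them with an association measure; that choice is left open here.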
|
3 |
Processing Turkish Radiology Reports
Hadimli, Kerem 01 May 2011
Radiology departments utilize various techniques for visualizing patients' bodies, and narrative free-text reports describing the findings in these visualizations are written by medical doctors. The information within these narrative reports needs to be extracted for medical information systems. Turkish is a highly agglutinative language, and this poses problems in information retrieval and extraction from Turkish free texts.
In this thesis, two alternative methods, one rule-based and one data-driven, are presented for information retrieval and structured information extraction from Turkish radiology reports. Contrary to previous studies in medical NLP systems, neither of these methods utilizes any medical lexicon or ontology.
Information extraction is performed at the level of extracting medically related phrases from the sentence. The aim is to measure the baseline performance that the Turkish language can provide for medical information extraction and retrieval, in isolation from other factors.
|