111

Toponym resolution in text

Leidner, Jochen Lothar January 2007 (has links)
Background. In the area of Geographic Information Systems (GIS), a shared discipline between informatics and geography, the term geo-parsing is used to describe the process of identifying names in text, which in computational linguistics is known as named entity recognition and classification (NERC). The term geo-coding is used for the task of mapping from implicitly geo-referenced datasets (such as structured address records) to explicitly geo-referenced representations (e.g., using latitude and longitude). However, present-day GIS systems provide no automatic geo-coding functionality for unstructured text. In Information Extraction (IE), processing of named entities in text has traditionally been seen as a two-step process comprising a flat text span recognition sub-task and an atomic classification sub-task; relating the text span to a model of the world has been ignored by evaluations such as MUC or ACE (Chinchor (1998); U.S. NIST (2003)). However, spatial and temporal expressions refer to events in space-time, and the grounding of events is a precondition for accurate reasoning. Thus, automatic grounding can improve many applications such as automatic map drawing (e.g. for choosing a focus) and question answering (e.g. for questions like How far is London from Edinburgh?, given a story in which both occur and can be resolved). Whereas temporal grounding has received considerable attention in the recent past (Mani and Wilson (2000); Setzer (2001)), robust spatial grounding has long been neglected. Concentrating on geographic names for populated places, I define the task of automatic Toponym Resolution (TR) as computing the mapping from occurrences of names for places as found in a text to a representation of the extensional semantics of the location referred to (its referent), such as a geographic latitude/longitude footprint. The task of mapping from names to locations is hard due to insufficient and noisy databases and a large degree of ambiguity: common words need to be distinguished from proper names (geo/non-geo ambiguity), and the mapping between names and locations is ambiguous (London can refer to the capital of the UK, to London, Ontario, Canada, or to about forty other Londons on earth). In addition, names of places and the boundaries referred to change over time, and databases are incomplete.
Objective. I investigate how referentially ambiguous spatial named entities can be grounded, or resolved, with respect to an extensional coordinate model robustly on open-domain news text. I begin by comparing the few algorithms proposed in the literature and, by comparing semi-formal, reconstructed descriptions of them, I factor out a shared repertoire of linguistic heuristics (e.g. rules, patterns) and extra-linguistic knowledge sources (e.g. population sizes). I then investigate how to combine these sources of evidence to obtain a superior method. I also investigate the noise effect introduced by the named entity tagging step that toponym resolution relies on in a sequential system pipeline architecture.
Scope. In this thesis, I investigate a present-day snapshot of terrestrial geography as represented in the gazetteer defined and, accordingly, a collection of present-day news text. I limit the investigation to populated places; geo-coding of artifact names (e.g. airports or bridges) and compositional geographic descriptions (e.g. 40 miles SW of London, near Berlin), for instance, is not attempted. Historic change is a major factor affecting gazetteer construction and ultimately toponym resolution; however, this is beyond the scope of this thesis.
Method. While a small number of previous attempts have been made to solve the toponym resolution problem, these were either not evaluated, or evaluation was done by manual inspection of system output instead of curating a reusable reference corpus. Since the relevant literature is scattered across several disciplines (GIS, digital libraries, information retrieval, natural language processing) and descriptions of algorithms are mostly given in informal prose, I attempt to describe them systematically and aim at a reconstruction in a uniform, semi-formal pseudo-code notation for easier re-implementation. A systematic comparison leads to an inventory of heuristics and other sources of evidence. In order to carry out a comparative evaluation procedure, an evaluation resource is required. Unfortunately, to date no gold standard has been curated in the research community. To this end, a reference gazetteer and an associated novel reference corpus with human-labeled referent annotation are created. These are subsequently used to benchmark a selection of the reconstructed algorithms and a novel re-combination of the heuristics catalogued in the inventory. I then compare the performance of the same TR algorithms under three different conditions, namely applying them to (i) the output of human named entity annotation, (ii) automatic annotation using an existing Maximum Entropy sequence tagging model, and (iii) a naïve toponym lookup procedure in a gazetteer.
Evaluation. The algorithms implemented in this thesis are evaluated in an intrinsic or component evaluation. To this end, we define a task-specific matching criterion to be used with traditional Precision (P) and Recall (R) evaluation metrics. This matching criterion is lenient with respect to numerical gazetteer imprecision in situations where one toponym instance is marked up with different gazetteer entries in the gold standard and the test set, respectively, but where these refer to the same candidate referent, a situation caused by multiple near-duplicate entries in the reference gazetteer.
Main Contributions. The major contributions of this thesis are as follows:
• A new reference corpus in which instances of location named entities have been manually annotated with spatial grounding information for populated places, and an associated reference gazetteer, from which the assigned candidate referents are chosen. This reference gazetteer provides numerical latitude/longitude coordinates (such as 51°32' North, 0°5' West) as well as hierarchical path descriptions (such as London > UK) with respect to a world-wide coverage geographic taxonomy constructed by combining several large but noisy gazetteers. The corpus contains news stories and comprises two sub-corpora: a subset of the REUTERS RCV1 news corpus used for the CoNLL shared task (Tjong Kim Sang and De Meulder (2003)) and a subset of the Fourth Message Understanding Conference (MUC-4; Chinchor (1995)) corpus, both available pre-annotated with gold-standard named entity annotation. This corpus will be made available as a reference evaluation resource;
• a new method and implemented system to resolve toponyms that is capable of robustly processing unseen text (open-domain online newswire text) and grounding toponym instances in an extensional model using longitude and latitude coordinates and hierarchical path descriptions, using internal (textual) and external (gazetteer) evidence;
• an empirical analysis of the relative utility of various heuristic biases and other sources of evidence with respect to the toponym resolution task when analysing free news genre text;
• a comparison between a replicated method as described in the literature, which functions as a baseline, and a novel algorithm based on minimality heuristics; and
• several exemplary prototypical applications to show how the resulting toponym resolution methods can be used to create visual surrogates for news stories, a geographic exploration tool for news browsing, geographically aware document retrieval, and to answer spatial questions (How far...?) in an open-domain question answering system. These applications have only demonstrative character, as a thorough quantitative, task-based (extrinsic) evaluation of the utility of automatic toponym resolution is beyond the scope of this thesis and left for future work.
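The resolution strategy sketched in the abstract, combining a gazetteer lookup with heuristic biases such as population size and spatial minimality, can be illustrated with a toy example. The sketch below is not Leidner's algorithm or data: the gazetteer entries, weights and scoring are invented for illustration, and a real system would draw candidates from a large gazetteer and weigh many more sources of evidence.

```python
# Illustrative sketch only (not Leidner's algorithm): resolving toponyms by
# combining a gazetteer lookup with two heuristics common in the literature,
# a "largest population" bias and a "spatial minimality" bias that prefers
# candidate referents lying close to the other candidates mentioned in the text.
import math

# Hypothetical toy gazetteer: name -> list of (lat, lon, population) candidates.
GAZETTEER = {
    "London": [(51.51, -0.13, 8_900_000), (42.98, -81.25, 404_000)],   # UK; Ontario
    "Cambridge": [(52.21, 0.12, 146_000), (42.37, -71.11, 118_000)],   # UK; Massachusetts
}

def distance(a, b):
    """Rough great-circle distance in kilometres between (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * math.asin(math.sqrt(h))

def resolve(toponyms, population_weight=0.5):
    """Pick one referent per toponym, trading population size against distance
    to the centroid of every candidate referent for the toponyms in the text."""
    all_coords = [(lat, lon) for t in toponyms for lat, lon, _ in GAZETTEER.get(t, [])]
    centroid = (sum(c[0] for c in all_coords) / len(all_coords),
                sum(c[1] for c in all_coords) / len(all_coords))
    result = {}
    for t in toponyms:
        candidates = GAZETTEER.get(t, [])
        if not candidates:
            continue
        max_pop = max(p for _, _, p in candidates)
        def score(cand):
            lat, lon, pop = cand
            return (population_weight * pop / max_pop
                    - (1 - population_weight) * distance((lat, lon), centroid) / 20000)
        result[t] = max(candidates, key=score)
    return result

print(resolve(["London", "Cambridge"]))  # both resolve to the UK referents
```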
112

Automated retrieval and extraction of training course information from unstructured web pages

Xhemali, Daniela January 2010 (has links)
Web Information Extraction (WIE) is the discipline dealing with the discovery, processing and extraction of specific pieces of information from semi-structured or unstructured web pages. The World Wide Web comprises billions of web pages, and there is much need for systems that will locate, extract and integrate the acquired knowledge into organisations' practices. Some commercial, automated web extraction software packages exist; however, their success comes from heavily involving their users in the process of finding the relevant web pages, preparing the system to recognise items of interest on these pages and manually dealing with the evaluation and storage of the extracted results. This research has explored WIE, specifically with regard to the automation of the extraction and validation of online training information. The work also includes research and development in the area of automated Web Information Retrieval (WIR), more specifically in Web Searching (or Crawling) and Web Classification. Different technologies were considered; ultimately, Naïve Bayes Networks were chosen as the most suitable for the development of the classification system. The extraction part of the system used Genetic Programming (GP) for the generation of web extraction solutions. Specifically, GP was used to evolve Regular Expressions, which were then used to extract specific training course information from the web, such as course names, prices, dates and locations. The experimental results indicate that all three aspects of this research perform very well, with the Web Crawler outperforming existing crawling systems, the Web Classifier performing with an accuracy of over 95% and a precision of over 98%, and the Web Extractor achieving an accuracy of over 94% for the extraction of course titles and an accuracy of just under 67% for the extraction of other course attributes such as dates, prices and locations. Furthermore, the overall work is of great significance to the sponsoring company, as it simplifies and improves the existing time-consuming, labour-intensive and error-prone manual techniques, as will be discussed in this thesis. The prototype developed in this research works in the background and requires very little, often no, human assistance.
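To make the Genetic Programming component more concrete, the sketch below shows the kind of fitness function such a system could use to score candidate regular expressions against hand-labelled course-price examples. The patterns, snippets and scoring details are illustrative assumptions rather than the thesis implementation, and the surrounding GP loop (selection, crossover and mutation over a population of patterns) is only indicated in a comment.

```python
# Illustrative sketch only (not the thesis system): scoring candidate regular
# expressions against hand-labelled examples, the kind of fitness function a
# Genetic Programming loop could use when evolving extractors for course prices.
import re

# Hypothetical labelled snippets: (page text, price string that should be extracted).
TRAINING = [
    ("Course fee: £350 + VAT, starts 12 March", "£350"),
    ("Price £1,200 per delegate", "£1,200"),
    ("Contact us for dates and prices", None),          # nothing to extract
]

def fitness(pattern):
    """F1-style fitness: reward patterns whose first match equals the labelled price."""
    try:
        regex = re.compile(pattern)
    except re.error:
        return 0.0                                       # unparseable individuals die out
    tp = fp = fn = 0
    for text, gold in TRAINING:
        m = regex.search(text)
        if m and gold and m.group(0) == gold:
            tp += 1
        elif m and (gold is None or m.group(0) != gold):
            fp += 1
        elif gold:
            fn += 1
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# A GP loop would mutate/crossover a population of patterns and keep the fittest.
print(fitness(r"£\d+"))      # ~0.67: mis-extracts "£1" from "£1,200"
print(fitness(r"£[\d,]+"))   # 1.0: matches both labelled prices
```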
113

Modeling words for online sexual behavior surveillance and clinical text information extraction

Fries, Jason Alan 01 July 2015 (has links)
How do we model the meaning of words? In domains like information retrieval, words have classically been modeled as discrete entities using 1-of-n encoding, a representation that elides most of a word's syntactic and semantic structure. Recent research, however, has begun exploring more robust representations called word embeddings. Embeddings model words as a parameterized function mapping into an n-dimensional continuous space and implicitly encode a number of interesting semantic and syntactic properties. This dissertation examines two application areas where existing, state-of-the-art terminology modeling improves the task of information extraction (IE) -- the process of transforming unstructured data into structured form. We show that a large amount of word meaning can be learned directly from very large document collections. First, we explore the feasibility of mining sexual health behavior data directly from the unstructured text of online "hookup" requests. The Internet has fundamentally changed how individuals locate sexual partners. The rise of dating websites, location-aware smartphone apps like Grindr and Tinder that facilitate casual sexual encounters ("hookups"), as well as changing trends in sexual health practices all speak to the shifting cultural dynamics surrounding sex in the digital age. These shifts also coincide with an increase in the incidence rate of sexually transmitted infections (STIs) in subpopulations such as young adults, racial and ethnic minorities, and men who have sex with men (MSM). The reasons for these increases and their possible connections to Internet cultural dynamics are not completely understood. What is apparent, however, is that sexual encounters negotiated online complicate many traditional public health intervention strategies such as contact tracing and partner notification. These circumstances underline the need to examine online sexual communities using computational tools and techniques -- as is done with other social networks -- to provide new insight and direction for public health surveillance and intervention programs. One of the central challenges in this task is constructing lexical resources that reflect how people actually discuss and negotiate sex online. Using a 2.5-year collection of over 130 million Craigslist ads (a large venue for MSM casual sexual encounters), we discuss computational methods for automatically learning terminology characterizing risk behaviors in the MSM community. These approaches range from keyword-based dictionaries and topic modeling to semi-supervised methods using word embeddings for query expansion and sequence labeling. These methods allow us to gather information similar (in part) to the types of questions asked in public health risk assessment surveys, but automatically aggregated directly from communities of interest, in near real-time, and at high geographic resolution. We then address the methodological limitations of this work, as well as the fundamental validation challenges posed by the lack of large-scale sexual behavior survey data and the limited availability of STI surveillance data. Finally, leveraging work on terminology modeling in Craigslist, we present new research exploring representation learning using 7 years of University of Iowa Hospitals and Clinics (UIHC) clinical notes.
Using medication names as an example, we show that modeling a low-dimensional representation of a medication's neighboring words, i.e., a word embedding, encodes a large amount of non-obvious semantic information. Embeddings, for example, implicitly capture a large degree of the hierarchical structure of drug families as well as encode relational attributes of words, such as generic and brand names of medications. These representations -- learned in a completely unsupervised fashion -- can then be used as features in other machine learning tasks. We show that incorporating clinical word embeddings in a benchmark classification task of medication labeling leads to a 5.4% increase in F1-score over a baseline of random initialization and a 1.9% increase over just using non-UIHC training data. This research suggests clinical word embeddings could be shared for use in other institutions and other IE tasks.
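As a rough illustration of the embedding approach described above, the sketch below trains skip-gram word embeddings with gensim and queries the neighbourhood of a medication name. The toy sentences stand in for the (non-public) UIHC notes, and the hyperparameters are arbitrary; the dissertation's actual preprocessing, corpora and models are not reproduced here.

```python
# Illustrative sketch only, assuming access to a de-identified corpus of clinical
# notes (the UIHC data is not public): training skip-gram word embeddings and
# inspecting the neighbourhood of a medication name, which with enough data tends
# to surface brand/generic pairs and drugs from the same family.
from gensim.models import Word2Vec   # assumes gensim (4.x) is installed

def tokenize(note):
    return note.lower().split()      # a real pipeline would use a clinical tokenizer

notes = ["patient started on warfarin 5 mg daily for atrial fibrillation",
         "coumadin held due to elevated inr",
         "continue metoprolol and lisinopril at current doses"]   # toy stand-in corpus

model = Word2Vec(sentences=[tokenize(n) for n in notes],
                 vector_size=100, window=5, min_count=1, sg=1, epochs=50)

# With millions of real notes, neighbours of "warfarin" would include terms such
# as "coumadin"; the vectors can then feed downstream classifiers as features.
print(model.wv.most_similar("warfarin", topn=5))
```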
114

Semantic frame based automatic extraction of typological information from descriptive grammars

Aslam, Irfan January 2019 (has links)
This thesis project addresses the machine learning (ML) modelling aspects of the problem of automatically extracting typological linguistic information about natural languages spoken in South Asia from annotated descriptive grammars. Rather than delving into the theory and methods of Natural Language Processing (NLP), the focus has been to develop and test an ML model dedicated to the information extraction part. Starting with the existing state-of-the-art frameworks for obtaining labelled training data through the structured representation of the descriptive grammars, the problem has been modelled as a supervised ML classification task where the annotated text is provided as input and the objective is to classify the input into one of the pre-learned labels. The approach has been to systematically explore the data to develop an understanding of the problem domain and then evaluate a set of four potential ML algorithms using predetermined performance metrics, namely accuracy, recall, precision and F-score. It turned out that the problem splits into two independent classification tasks: a binary classification task and a multiclass classification task. The four selected algorithms (Decision Trees, Naïve Bayes, Support Vector Machines and Logistic Regression), belonging to both the linear and non-linear families of ML models, were independently trained and compared on both classification tasks. Performance metrics were measured using stratified 10-fold cross-validation and the candidate algorithms were compared. Logistic Regression provided the best overall results, with Decision Trees as a close follow-up. Finally, the Logistic Regression model was selected for further fine-tuning and used in a web demo of the typological information extraction tool, developed to show the usability of the ML model in the field.
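A minimal sketch of the model-comparison setup described above is given below, using scikit-learn to run the four algorithm families under stratified 10-fold cross-validation with the same four metrics. The dataset and features are stand-ins (a public text corpus with TF-IDF features), not the annotated descriptive-grammar data used in the thesis.

```python
# Illustrative sketch only (hypothetical features/labels): comparing the four
# classifier families named in the abstract with stratified 10-fold cross-validation
# and reporting accuracy, precision, recall and F-score.
from sklearn.datasets import fetch_20newsgroups          # stand-in for the grammar data
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

data = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"])
X = TfidfVectorizer(max_features=5000).fit_transform(data.data)
y = data.target

models = {
    "Decision Tree": DecisionTreeClassifier(),
    "Naive Bayes": MultinomialNB(),
    "Linear SVM": LinearSVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, clf in models.items():
    scores = cross_validate(clf, X, y, cv=cv,
                            scoring=["accuracy", "precision_macro", "recall_macro", "f1_macro"])
    print(name, {k: round(v.mean(), 3) for k, v in scores.items() if k.startswith("test_")})
```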
115

Improving the performance of Hierarchical Hidden Markov Models on Information Extraction tasks

Chou, Lin-Yi January 2006 (has links)
This thesis presents novel methods for creating and improving hierarchical hidden Markov models. The work centers around transforming a traditional tree-structured hierarchical hidden Markov model (HHMM) into an equivalent model that reuses repeated sub-trees. This process temporarily breaks the tree structure constraint in order to leverage the benefits of combining repeated sub-trees. These benefits include a lowered cost of testing and an increased accuracy of the final model, thus providing the model with greater performance. The result is called a merged and simplified hierarchical hidden Markov model (MSHHMM). The thesis goes on to detail four techniques for improving the performance of MSHHMMs when applied to information extraction tasks, in terms of accuracy and computational cost. Briefly, these techniques are: a new formula for calculating the approximate probability of previously unseen events; pattern generalisation to transform observations, thus increasing testing speed and prediction accuracy; restructuring states to focus on state transitions; and an automated flattening technique for reducing the complexity of HHMMs. The basic model and the four improvements are evaluated by applying them to the well-known information extraction tasks of Reference Tagging and Text Chunking. In both tasks, MSHHMMs show consistently good performance across varying sizes of training data. In the case of Reference Tagging, the accuracy of the MSHHMM is comparable to other methods. However, when the volume of training data is limited, MSHHMMs maintain high accuracy whereas other methods show a significant decrease. These accuracy gains were achieved without any significant increase in processing time. For the Text Chunking task the accuracy of the MSHHMM was again comparable to other methods; however, the other methods incurred much higher processing delays. The results of these practical experiments demonstrate the benefits of the new method: increased accuracy, lower computation costs, and better performance.
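The first of the four techniques concerns estimating the probability of previously unseen events. The thesis proposes its own formula, which is not reproduced here; the sketch below shows only the standard add-k (Laplace) smoothing baseline that such a formula would typically be compared against when estimating HMM emission probabilities.

```python
# Illustrative sketch only: add-k smoothing for HMM emission probabilities, the
# conventional baseline for handling previously unseen events. This is NOT the
# thesis' new formula, which is not reproduced here.
from collections import Counter

def addk_emission_prob(counts: Counter, word: str, vocab_size: int, k: float = 0.5) -> float:
    """P(word | state) with add-k smoothing, so unseen words get a small non-zero mass."""
    total = sum(counts.values())
    return (counts[word] + k) / (total + k * vocab_size)

state_counts = Counter({"Smith": 7, "Jones": 4, "Brown": 2})   # toy emission counts for one state
print(addk_emission_prob(state_counts, "Smith", vocab_size=10_000))
print(addk_emission_prob(state_counts, "Nguyen", vocab_size=10_000))  # unseen, but > 0
```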
116

A Multidisciplinary Approach to the Reuse of Open Learning Resources

FRESCHI, Sergio January 2008 (has links)
Master of Engineering (Research) / Educational standards are having a significant impact on e-Learning. They allow for better exchange of information among different organizations and institutions. They simplify reusing and repurposing learning materials. They give teachers the possibility of personalizing them according to the student's background and learning speed. Thanks to these standards, off-the-shelf content can be adapted to a particular student cohort's context and learning needs. The same course content can be presented in different languages. Overall, all the parties involved in the learning-teaching process (students, teachers and institutions) can benefit from these standards, and so online education can be improved. To materialize the benefits of standards, learning resources should be structured according to these standards. Unfortunately, a large number of existing e-Learning materials lack the intrinsic logical structure required, and further, when they do have the structure, they are not encoded as required. These problems make it virtually impossible to share these materials. This thesis addresses the following research question: How can we make the best use of existing open learning resources available on the Internet by taking advantage of educational standards and specifications, and thus improve content reusability? In order to answer this question, I combine different technologies, techniques and standards that make the sharing of publicly available learning resources possible in innovative ways. I developed and implemented a three-stage tool to tackle the above problem. By applying information extraction techniques and open e-Learning standards to legacy learning resources, the tool has proven to improve content reusability. In so doing, it contributes to the understanding of how these technologies can be used in real scenarios and shows how online education can benefit from them. In particular, three main components were created which enable the conversion of unstructured educational content into a standard-compliant form in a systematic and automatic way. An increasing number of repositories with educational resources are available, including Wikiversity and the Massachusetts Institute of Technology OpenCourseWare. Wikiversity is an open repository containing over 6,000 learning resources in several disciplines and for all age groups [1]. I used the OpenCourseWare repository to evaluate the effectiveness of my software components and ideas. The results show that it is possible to create standard-compliant learning objects from publicly available web pages, improving their searchability, interoperability and reusability.
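As a rough illustration of the final conversion stage, the sketch below wraps extracted sections of a page into a simplified, IMS/SCORM-style content-package manifest. The element names and attributes are a schema-free approximation for illustration only; a real package must conform to the official IMS Content Packaging specification, and the course title, section titles and file names are invented.

```python
# Illustrative sketch only: a simplified, schema-free stand-in for an IMS/SCORM-style
# content-package manifest built from extracted sections. Real packages must follow
# the official IMS Content Packaging schema; this is not the thesis implementation.
import xml.etree.ElementTree as ET

def build_manifest(course_title, sections):
    manifest = ET.Element("manifest", identifier="MANIFEST-1")
    org = ET.SubElement(ET.SubElement(manifest, "organizations"), "organization")
    ET.SubElement(org, "title").text = course_title
    resources = ET.SubElement(manifest, "resources")
    for i, (title, html_file) in enumerate(sections, start=1):
        item = ET.SubElement(org, "item", identifierref=f"RES-{i}")
        ET.SubElement(item, "title").text = title
        ET.SubElement(resources, "resource", identifier=f"RES-{i}",
                      type="webcontent", href=html_file)
    return ET.tostring(manifest, encoding="unicode")

sections = [("Introduction", "intro.html"), ("Exercises", "exercises.html")]
print(build_manifest("Linear Algebra (OpenCourseWare excerpt)", sections))
```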
117

Adaptive Semi-structured Information Extraction

Arpteg, Anders January 2003 (has links)
The number of domains and tasks where information extraction tools can be used needs to be increased. One way to reach this goal is to construct user-driven information extraction systems where novice users are able to adapt them to new domains and tasks. To accomplish this goal, the systems need to become more intelligent and able to learn to extract information without the need for expert skills or time-consuming work from the user.

The type of information extraction system that is in focus for this thesis is semi-structural information extraction. The term semi-structural refers to documents that not only contain natural language text but also additional structural information. The typical application is information extraction from World Wide Web hypertext documents. By making effective use of not only the link structure but also the structural information within each such document, user-driven extraction systems with high performance can be built.

The extraction process contains several steps where different types of techniques are used. Examples of such types of techniques are those that take advantage of structural, pure syntactic, linguistic, and semantic information. The first step that is in focus for this thesis is the navigation step, which takes advantage of the structural information. It is only one part of a complete extraction system, but it is an important part. The use of reinforcement learning algorithms for the navigation step can make the adaptation of the system to new tasks and domains more user-driven. The advantage of using reinforcement learning techniques is that the extraction agent can efficiently learn from its own experience without the need for intensive user interaction.

An agent-oriented system was designed to evaluate the approach suggested in this thesis. Initial experiments showed that the training of the navigation step and the approach of the system were promising. However, additional components need to be included in the system before it becomes a fully fledged user-driven system. / Report code: LiU-Tek-Lic-2002:73.
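The navigation step described above can be illustrated with a small tabular Q-learning example over a toy link graph, where the agent receives a reward only when it reaches the page holding the sought information. The site structure, reward and hyperparameters below are invented for illustration and do not reflect the thesis' agent-oriented system.

```python
# Illustrative sketch only (not the thesis agent): tabular Q-learning on a toy link
# graph, showing how a navigation step could learn which links lead towards the page
# containing the target information, using only a reward signal.
import random

LINKS = {"home": ["staff", "news"], "staff": ["profile", "home"],
         "news": ["home"], "profile": []}            # hypothetical site structure
REWARD = {"profile": 1.0}                            # page holding the sought data

Q = {(page, nxt): 0.0 for page, outs in LINKS.items() for nxt in outs}
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for episode in range(500):
    page = "home"
    while LINKS[page]:
        # epsilon-greedy choice among outgoing links
        if random.random() < epsilon:
            nxt = random.choice(LINKS[page])
        else:
            nxt = max(LINKS[page], key=lambda n: Q[(page, n)])
        best_next = max((Q[(nxt, n)] for n in LINKS[nxt]), default=0.0)
        Q[(page, nxt)] += alpha * (REWARD.get(nxt, 0.0) + gamma * best_next - Q[(page, nxt)])
        page = nxt
        if page in REWARD:
            break

print(max(LINKS["home"], key=lambda n: Q[("home", n)]))  # learns to head for "staff"
```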
118

Dominant vectors of nonnegative matrices: application to information extraction in large graphs

Ninove, Laure 21 February 2008 (has links)
Objects such as documents, people, words or utilities that are related in some way, for instance by citations, friendship, appearance in definitions or physical connections, may be conveniently represented using graphs or networks. An increasing number of such relational databases, such as the World Wide Web, digital libraries, social networking web sites or phone call logs, are available. Relevant information may be hidden in these networks. A user may for instance need to get authority web pages on a particular topic or a list of similar documents from a digital library, or to determine communities of friends from a social networking site or a phone call log. Unfortunately, extracting this information may not be easy. This thesis is devoted to the study of problems related to information extraction in large graphs with the help of dominant vectors of nonnegative matrices. The graph structure is indeed very useful for retrieving information from a relational database. The correspondence between nonnegative matrices and graphs makes Perron–Frobenius methods a powerful tool for the analysis of networks. In a first part, we analyze the fixed points of a normalized affine iteration used by a database matching algorithm. Then, we consider questions related to PageRank, a ranking method for web pages based on a random surfer model and used by the well-known web search engine Google. In a second part, we study optimal linkage strategies for a web master who wants to maximize the average PageRank score of a web site. Finally, the third part is devoted to the study of a nonlinear variant of PageRank. The simple model that we propose takes into account the mutual influence between web ranking and web surfing.
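The PageRank questions studied in the thesis revolve around the dominant (left) eigenvector of the Google matrix built from a nonnegative adjacency matrix. The sketch below computes it by plain power iteration on a tiny invented link graph; the damping factor 0.85 is the conventional choice, and the code is a generic illustration rather than anything specific to the thesis.

```python
# Illustrative sketch only: PageRank as the dominant left eigenvector of the Google
# matrix derived from a nonnegative adjacency matrix, computed by power iteration.
import numpy as np

def pagerank(adjacency, damping=0.85, tol=1e-10):
    n = adjacency.shape[0]
    out_degree = adjacency.sum(axis=1, keepdims=True)
    # Row-stochastic transition matrix; dangling nodes jump uniformly.
    P = np.where(out_degree > 0, adjacency / np.maximum(out_degree, 1), 1.0 / n)
    G = damping * P + (1 - damping) / n          # Google matrix (uniform teleportation)
    x = np.full(n, 1.0 / n)
    while True:
        x_new = x @ G                            # one power-iteration step: x <- xG
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new

A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)        # tiny hypothetical link graph
print(pagerank(A).round(3))
```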
119

Effects of Developmental Heuristics for Natural Language Learning

Engels, Steve January 2003 (has links)
Machine learning in natural language has been a widely pursued area of research. However, few learning techniques model themselves after human learning, despite the nature of the task being closely connected to human cognition. In particular, the idea of learning language in stages is a common approach for human learning, as can be seen in practice in the education system and in research on language acquisition. However, staged learning for natural language is an area largely overlooked by machine learning researchers. This thesis proposes a developmental learning heuristic for natural language models, to evaluate its performance on natural language tasks. The heuristic simulates human learning stages by training on child, teenage and adult text, provided by the British National Corpus. The three staged learning techniques that are proposed take advantage of these stages to create a single developed Hidden Markov Model. This model is then applied to the task of part-of-speech tagging to observe the effects of development on language learning.
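One simple way to realise the staged-training idea described above is to accumulate the counts of a count-based HMM tagger stage by stage, child text first. The sketch below is an illustrative assumption about how such staging could look, with toy sentences standing in for the British National Corpus splits; it is not a reconstruction of the thesis' three techniques.

```python
# Illustrative sketch only (the corpus splits and tagger details are assumptions):
# "staged" training of a count-based HMM tagger, where emission and transition
# counts are accumulated developmental stage by developmental stage.
from collections import defaultdict

emission = defaultdict(lambda: defaultdict(int))     # tag -> word -> count
transition = defaultdict(lambda: defaultdict(int))   # tag -> next tag -> count

def train_stage(tagged_sentences):
    """Add counts from one developmental stage on top of the existing model."""
    for sentence in tagged_sentences:
        prev = "<s>"
        for word, tag in sentence:
            emission[tag][word.lower()] += 1
            transition[prev][tag] += 1
            prev = tag

# Toy stand-ins for the child / teenage / adult portions of the training data.
child_text = [[("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")]]
teen_text  = [[("the", "DET"), ("experiment", "NOUN"), ("failed", "VERB")]]
adult_text = [[("the", "DET"), ("committee", "NOUN"), ("adjourned", "VERB")]]

for stage in (child_text, teen_text, adult_text):    # developmental order
    train_stage(stage)

print(dict(transition["DET"]))    # {'NOUN': 3} -- counts pooled across all stages
```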