1

Anomaly Classification Through Automated Shape Grammar Representation

Whiting, Mark E. 01 August 2017 (has links)
Statistical learning offers a trove of opportunities for problems where a large amount of data is available but falls short when data are limited. For example, in medicine, statistical learning has been used to outperform dermatologists in diagnosing melanoma visually from millions of photos of skin lesions. However, many other medical applications of this kind of learning are made impossible by the lack of sufficient learning data, for example, performing similar diagnosis of soft tissue tumors within the body based on radiological imagery of blood vessel development. A key challenge underlying this situation is that many statistical learning approaches rely on unstructured data representations, such as strings of text or raw images, that do not intrinsically incorporate structural information. Shape grammar, pioneered by the design community, is a way of using visual rules to define the underlying structure of geometric data. Shape grammar rules are replacement rules: the left side of a rule is a search pattern, and the right side is a replacement pattern that can replace the left side wherever it is found. Traditionally, shape grammars have been assembled by hand through observation, making them slow to use and limiting their application to complex data. This work introduces a way to automate the generation of shape grammars and a technique for using grammars for classification in situations with limited data. A method for automatically inducing grammars from graph-based data using a simple recursive algorithm, providing non-probabilistic rulesets, is introduced. The algorithm uses iterative data segmentation to establish multi-scale shape rules, and can do so from a single dataset. Additionally, this automatic grammar induction algorithm has been extended to apply to high-dimensional data in nonvisual domains, for example, graphs such as social networks.
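A shape-grammar replacement rule of the kind described above can be sketched on a labeled graph. This is a minimal illustration only; the graph encoding, rule format, and function names here are hypothetical, not the thesis's actual implementation:

```python
# Hypothetical sketch: a shape-grammar-style rule as edge replacement on a
# labeled, undirected graph (dict of node -> set of neighbors). The left
# side of the rule is a search pattern (a pair of node labels); the right
# side is a replacement (here, merging the matched edge into one node).

def find_match(graph, labels, lhs):
    """Return the first edge (u, v) whose node labels match the rule's left side."""
    a, b = lhs
    for u, nbrs in graph.items():
        for v in nbrs:
            if labels[u] == a and labels[v] == b:
                return u, v
    return None

def apply_rule(graph, labels, lhs, rhs_label):
    """Where the left side is found, replace the matched edge with a single
    node carrying the rule's right-side label. Returns True if applied."""
    match = find_match(graph, labels, lhs)
    if match is None:
        return False
    u, v = match
    merged = (graph[u] | graph[v]) - {u, v}   # u inherits v's neighbors
    del graph[v], labels[v]
    graph[u], labels[u] = merged, rhs_label
    for nbrs in graph.values():               # re-point edges that referenced v
        if v in nbrs:
            nbrs.discard(v)
            nbrs.add(u)
    return True
```

Applied to the path A—B—C with the rule (A, B) → AB, this collapses the first edge and yields the graph AB—C.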
We validated our method by comparing our results to existing grammars of historic buildings and products and found that it performed comparably to grammars made by humans. The induction method was extended by introducing a classification approach based on mapping grammar rule occurrences to dimensions in a high-dimensional vector space. With this representation, data samples can be analyzed and quickly classified without the need for data-intensive statistical learning. We validated this method by performing sensitivity tests on key graph augmentations and found that our method was comparably sensitive to, and significantly faster at learning than, related existing methods at detecting graph differences across cases. The automated grammar technique and the grammar-based classification technique were used together to classify magnetic resonance imaging (MRI) of the brains of 17 individuals and showed that our methods could detect a variety of vasculature-borne condition indicators with short- and long-term health implications. Through this study we demonstrate that automated grammar-based representations can be used for efficient classification of anomalies in abstract domains such as design and biological tissue analysis.
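The idea of mapping grammar rule occurrences to vector dimensions can be sketched as follows. This is a toy illustration assuming a fixed rule vocabulary and cosine similarity to per-class prototype counts; the actual classification procedure in the work may differ:

```python
import math

# Hypothetical sketch: each dimension of the vector space corresponds to one
# grammar rule; a sample is represented by how often each rule occurs in it.

def rule_vector(rule_counts, vocabulary):
    """Map a dict of rule-occurrence counts onto a fixed-dimension vector."""
    return [rule_counts.get(r, 0) for r in vocabulary]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def classify(sample_counts, prototypes, vocabulary):
    """Assign the class whose prototype rule vector is closest in cosine terms."""
    v = rule_vector(sample_counts, vocabulary)
    return max(prototypes,
               key=lambda c: cosine(v, rule_vector(prototypes[c], vocabulary)))
```

Because classification reduces to a vector comparison, no data-intensive statistical training step is required once the rule occurrences have been counted.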
2

Scalable semi-supervised grammar induction using cross-linguistically parameterized syntactic prototypes

Boonkwan, Prachya January 2014 (has links)
This thesis is about the task of unsupervised parser induction: automatically learning grammars and parsing models from raw text. We endeavor to induce such parsers by observing sequences of terminal symbols. We focus on overcoming the problem of frequent collocation, a major source of error in grammar induction. For example, since a verb and a determiner tend to co-occur in a verb phrase, the probability of attaching the determiner to the verb is sometimes higher than that of attaching the core noun to the verb, resulting in the erroneous attachment *((Verb Det) Noun) instead of (Verb (Det Noun)). Although frequent collocation is at the heart of grammar induction, it can also distort the grammar distribution. Natural language grammars follow a Zipfian (power-law) distribution, in which the frequency of any grammar rule is inversely proportional to its rank in the frequency table. We believe that covering the most frequent grammar rules in grammar induction will have a strong impact on accuracy. We propose an efficient approach to grammar induction guided by cross-linguistic language parameters. Our language parameters consist of 33 parameters of frequent basic word orders, which are easy to elicit from grammar compendiums or short interviews with naïve language informants. These parameters are designed to capture frequent word orders in the Zipfian distribution of natural language grammars, while the rest of the grammar, including exceptions, can be automatically induced from unlabeled data. The language parameters shrink the search space of the grammar induction problem by exploiting both word order information and predefined attachment directions. The contribution of this thesis is three-fold. (1) We show that the language parameters are adequately generalizable cross-linguistically, as our grammar induction experiments are carried out on 14 languages on top of a simple unsupervised grammar induction system.
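How word-order parameters can shrink the attachment search space might be sketched like this. The parameter table and representation below are invented for exposition; the thesis's actual 33-parameter inventory is richer:

```python
# Hypothetical sketch: word-order parameters as preferred attachment
# directions between coarse categories; candidate dependency arcs that
# contradict a stated parameter are pruned from the induction search space.

PARAMETERS = {                    # (head, dependent) -> side of the dependent
    ("Verb", "Noun"): "right",    # VO language: the object follows the verb
    ("Noun", "Det"):  "left",     # the determiner precedes the noun
}

def allowed(head_pos, head_i, dep_pos, dep_i):
    """Keep a candidate arc unless a parameter explicitly forbids its
    direction; unparameterized pairs are left to be induced from data."""
    direction = PARAMETERS.get((head_pos, dep_pos))
    if direction is None:
        return True
    actual = "right" if dep_i > head_i else "left"
    return actual == direction
```

Pruning arcs this way addresses the collocation problem noted above: an arc such as a determiner attaching leftward to a verb can be excluded outright rather than competing on co-occurrence counts.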
(2) Our specification of language parameters improves the accuracy of unsupervised parsing even when the parser is exposed to much less frequent linguistic phenomena in longer sentences, with accuracy decreasing by less than 10%. (3) We investigate the prevalent sources of error in grammar induction, which point to room for accuracy improvement. The proposed language parameters efficiently cover the most frequent grammar rules in natural languages. With only 10 man-hours for preparing syntactic prototypes, our approach improves the accuracy of directed dependency recovery over the state-of-the-art, completely unsupervised parser of Gillenwater et al. (2010) in: (1) Chinese by 30.32%, (2) Swedish by 28.96%, (3) Portuguese by 37.64%, (4) Dutch by 15.17%, (5) German by 14.21%, (6) Spanish by 13.53%, (7) Japanese by 13.13%, (8) English by 12.41%, (9) Czech by 9.16%, (10) Slovene by 7.24%, (11) Turkish by 6.72%, and (12) Bulgarian by 5.96%. It is noted that although the directed dependency accuracies of some languages are below 60%, their TEDEVAL scores are still satisfactory (approximately 80%). This suggests that our parsed trees are, in fact, closely related to the gold-standard trees despite the discrepancy of annotation schemes. We perform an over- and under-generation error analysis and find three prevalent problems that cause errors in the experiments: (1) PP attachment, (2) discrepancies between dependency annotation schemes, and (3) rich morphology. The methods presented in this thesis were originally presented in Boonkwan and Steedman (2011). The thesis presents a great deal more detail in the design of cross-linguistic language parameters, the algorithm of lexicon inventory construction, experiment results, and error analysis.
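Directed dependency recovery, the metric reported above, is conventionally the fraction of tokens assigned their gold-standard head. A minimal sketch, assuming one head index per token (the thesis's exact evaluation protocol, e.g. punctuation handling, may differ):

```python
def directed_accuracy(gold_heads, pred_heads):
    """Fraction of tokens whose predicted head index matches the gold head
    (0 conventionally denotes the artificial root)."""
    assert len(gold_heads) == len(pred_heads)
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return correct / len(gold_heads)
```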
3

Using Zipf Frequencies As A Representativeness Measure In Statistical Active Learning Of Natural Language

Cobanoglu, Onur 01 June 2008 (has links) (PDF)
Active learning has proven to be a successful strategy for the quick development of corpora to be used in statistical induction of natural language. The vast majority of studies in this field have concentrated on finding and testing various informativeness measures for samples; however, representativeness measures for samples have not been thoroughly studied. In this thesis, we introduce a novel representativeness measure which, being based on Zipf's law, is model-independent and validated both theoretically and empirically. Experiments conducted on the WSJ corpus with a wide-coverage parser show that our representativeness measure leads to better performance than previously introduced representativeness measures when used with most of the known informativeness measures.
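A Zipf-frequency-based representativeness score can be sketched generically: samples whose tokens are frequent under the corpus's Zipfian unigram distribution score as more typical. This is an illustrative stand-in, not the thesis's exact measure:

```python
from collections import Counter

def zipf_representativeness(tokens, corpus_counts, total):
    """Average unigram probability of the sample's tokens under the corpus
    frequency distribution; higher means more representative of the corpus."""
    return sum(corpus_counts.get(t, 0) for t in tokens) / (total * len(tokens))

# Example: score two candidate samples against a tiny corpus.
corpus = "the cat sat on the mat the end".split()
counts = Counter(corpus)
total = sum(counts.values())
```

In an active-learning loop, such a score would be combined with an informativeness measure so that the selected samples are both hard for the current model and typical of the corpus.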
4

Scalable Detection and Extraction of Data in Lists in OCRed Text for Ontology Population Using Semi-Supervised and Unsupervised Active Wrapper Induction

Packer, Thomas L 01 October 2014 (has links) (PDF)
Lists of records in machine-printed documents contain much useful information. As one example, the thousands of family history books scanned, OCRed, and placed on-line by FamilySearch.org probably contain hundreds of millions of fact assertions about people, places, family relationships, and life events. Data like this cannot be fully utilized until a person or process locates the data in the document text, extracts it, and structures it with respect to an ontology or database schema. Yet, in the family history industry and other industries, data in lists go largely unused because no known approach adequately addresses all of the costs, challenges, and requirements of a complete end-to-end solution to this task. The diverse information is costly to extract because many kinds of lists appear even within a single document, differing from each other in both structure and content. The lists' records and component data fields are usually not set apart explicitly from the rest of the text, especially in a corpus of OCRed historical documents. OCR errors and the lack of document structure (e.g. HTML tags) make list content hard to recognize by a software tool developed without a substantial amount of highly specialized, hand-coded knowledge or machine learning supervision. Making an approach that is not only accurate but also sufficiently scalable in terms of time and space complexity to process a large corpus efficiently is especially challenging. In this dissertation, we introduce a novel family of scalable approaches to list discovery and ontology population. Its contributions include the following. We introduce the first general-purpose methods of which we are aware for both list detection and wrapper induction for lists in OCRed or other plain text.
We formally outline a mapping between in-line labeled text and populated ontologies, effectively reducing the ontology population problem to a sequence labeling problem, opening the door to applying sequence labelers and other common text tools to the goal of populating a richly structured ontology from text. We provide a novel admissible heuristic for inducing regular expression wrappers using an A* search. We introduce two ways of modeling list-structured text with a hidden Markov model. We present two query strategies for active learning in a list-wrapper induction setting. Our primary contributions are two complete and scalable wrapper-induction-based solutions to the end-to-end challenge of finding lists, extracting data, and populating an ontology. The first has linear time and space complexity and extracts highly accurate information at a low cost in terms of user involvement. The second has time and space complexity that are linear in the size of the input text and quadratic in the length of an output record and achieves higher F1-measures for extracted information as a function of supervision cost. We measure the performance of each of these approaches and show that they perform better than strong baselines, including variations of our own approaches and a conditional random field-based approach.
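A regular-expression wrapper of the kind induced here can be illustrated with a toy record format. The pattern and field names below are hypothetical examples for exposition, not actual induced output:

```python
import re

# Hypothetical sketch: an induced wrapper for one list style in plain text,
# "Surname, Given, b. YYYY". Named groups become record fields that a later
# step would map onto ontology properties (person, birth year, etc.).

WRAPPER = re.compile(
    r"(?P<surname>[A-Z][a-z]+),\s+(?P<given>[A-Z][a-z]+),\s+b\.\s+(?P<year>\d{4})"
)

def extract_records(text):
    """Apply the wrapper across plain text, emitting one field dict per record."""
    return [m.groupdict() for m in WRAPPER.finditer(text)]
```

Scanning the text with a fixed pattern is linear in the input size, which is consistent with the scalability requirement stated above; induction would search over candidate patterns like this one rather than writing them by hand.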
