  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.

Unsupervised Information Extraction From Text - Extraction and Clustering of Relations between Entities

Wang, Wei 16 May 2013 (has links) (PDF)
Unsupervised information extraction in the open domain has recently gained importance by loosening the constraints on the strict definition of the extracted information, allowing the design of more open information extraction systems. In this new domain of unsupervised information extraction, this thesis focuses on the tasks of extracting and clustering relations between entities at a large scale. The objective of relation extraction is to discover unknown relations in texts. A relation prototype is first defined, with which candidate relation instances are initially extracted under a minimal criterion. To guarantee the validity of the extracted relation instances, a two-step filtering procedure is applied: the first step uses filtering heuristics to remove a large number of false relations efficiently, and the second uses statistical models to refine the selection of relation candidates. The objective of relation clustering is to organize the extracted relation instances into clusters so that their relation types can be characterized by the formed clusters and a synthetic view can be offered to end-users. A multi-level clustering procedure is designed, which takes into account both the massive data and the diversity of linguistic phenomena. First, the basic clustering groups relation instances that are similar in their linguistic expression, using only simple similarity measures on a bag-of-words representation of relation instances, to form highly homogeneous basic clusters. Second, the semantic clustering aims at grouping basic clusters whose relation instances share the same semantic meaning, dealing more particularly with phenomena such as synonymy and more complex paraphrase. Different similarity measures, based either on resources such as WordNet or on a distributional thesaurus, are analyzed at the level of words, relation instances and basic clusters. 
Moreover, a topic-based relation clustering is proposed to take thematic information into account so that more precise semantic clusters can be formed. Finally, the thesis also tackles the problem of clustering evaluation in the context of unsupervised information extraction, using both internal and external measures. For the evaluations with external measures, an interactive and efficient way of building a reference of relation clusters is proposed. The application of this method on a newspaper corpus results in a large reference, against which different clustering methods are evaluated.
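As an illustration of the basic clustering step described above — grouping relation instances by simple bag-of-words similarity — the following is a minimal, self-contained sketch. The greedy single-pass strategy, the 0.5 threshold and the example relations are all assumptions made for the sketch, not the thesis's actual algorithm or data.

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def basic_clusters(instances, threshold=0.5):
    """Greedy single-pass clustering: attach each relation instance to the
    first cluster whose first member is similar enough, else open a new one."""
    clusters = []  # each cluster: list of (text, bag-of-words)
    for text in instances:
        bow = Counter(text.lower().split())
        for cluster in clusters:
            if cosine(bow, cluster[0][1]) >= threshold:
                cluster.append((text, bow))
                break
        else:
            clusters.append([(text, bow)])
    return [[t for t, _ in c] for c in clusters]

relations = [
    "Paris is the capital of France",
    "Berlin is the capital of Germany",
    "the company acquired a startup",
]
print(basic_clusters(relations))
```

Here the two "capital of" instances share enough surface vocabulary to fall into one basic cluster, while the acquisition relation stays separate; semantic clustering would then be needed to merge clusters that paraphrase each other without sharing words.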

Automatic lemmatisation for Afrikaans / by Hendrik J. Groenewald

Groenewald, Hendrik Johannes January 2006 (has links)
A lemmatiser is an important component of various human language technology applications for any language. At present, a rule-based lemmatiser for Afrikaans already exists, but this lemmatiser produces disappointingly low accuracy figures. The performance of the current lemmatiser serves as motivation for developing another lemmatiser based on an approach other than language-specific rules. The alternative method of lemmatiser construction investigated in this study is memory-based learning. Thus, in this research project we develop an automatic lemmatiser for Afrikaans called Lia, "Lemma-identifiseerder vir Afrikaans" 'Lemmatiser for Afrikaans'. In order to construct Lia, the following research objectives are set: i) to define the classes for Afrikaans lemmatisation, ii) to determine the influence of data size and various feature options on the performance of Lia, iii) to automatically determine the algorithm and parameter settings that deliver the best performance in terms of linguistic accuracy, execution time and memory usage. In order to achieve the first objective, we investigate the processes of inflection and derivation in Afrikaans, since automatic lemmatisation requires a clear distinction between inflection and derivation. We proceed to define the inflectional categories for Afrikaans, which represent a number of affixes that should be removed from word-forms during lemmatisation. The classes for automatic lemmatisation in Afrikaans are derived from these affixes. It is subsequently shown that accuracy, as well as memory usage and execution time, increases as the amount of training data is increased, and that the various feature options have a significant effect on the performance of Lia. 
The algorithmic parameters and data representation that deliver the best results are determined by the use of PSearch, a programme that implements Wrapped Progressive Sampling in order to determine a set of possibly optimal algorithmic parameters for each of the TiMBL classification algorithms. Evaluation indicates that an accuracy figure of 92.8% is obtained when training Lia with the best-performing parameters for the IB1 algorithm on feature-aligned data with 20 features. This result indicates that memory-based learning is indeed more suitable than rule-based methods for Afrikaans lemmatiser construction. / Thesis (M.Ing. (Computer and Electronic Engineering))--North-West University, Potchefstroom Campus, 2007.
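A toy illustration of memory-based lemmatisation in the IB1 spirit: a word is classified by its nearest neighbour over right-aligned final-character features, and the class names a suffix-stripping operation that yields the lemma. The memory entries, feature window and class labels below are invented for the sketch; they are not Lia's actual training data, features or classes.

```python
def features(word, n=4):
    """Right-aligned final-character features, padded with '_'."""
    return tuple(word.rjust(n, "_")[-n:])

# Tiny illustrative memory of (word, class) pairs, where the class names the
# suffix to strip to obtain the lemma. Entries are made up for the sketch.
MEMORY = [
    ("boeke", "strip-e"),    # books   -> boek
    ("appels", "strip-s"),   # apples  -> appel
    ("loop", "strip-0"),     # walk    -> walk (no change)
]

def overlap(f1, f2):
    """IB1-style similarity: number of matching feature positions."""
    return sum(a == b for a, b in zip(f1, f2))

def classify(word):
    """1-nearest-neighbour lookup in memory (the core of memory-based learning)."""
    return max(MEMORY, key=lambda ex: overlap(features(word), features(ex[0])))[1]

def lemmatise(word):
    suffix = classify(word).split("-")[1]
    return word[:-len(suffix)] if suffix != "0" else word

print(lemmatise("tafels"))  # "tafel" under this toy memory
```

The point of the sketch is the absence of hand-written rules: adding more (word, class) pairs to the memory is the only way the lemmatiser improves, which mirrors the data-size experiments reported above.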

Outomatiese Afrikaanse tekseenheididentifisering / deur Martin J. Puttkammer

Puttkammer, Martin Johannes January 2006 (has links)
An important core technology in the development of human language technology applications is an automatic morphological analyser. Such a morphological analyser consists of various modules, one of which is a tokeniser. At present no tokeniser exists for Afrikaans, and it has therefore been impossible to develop a morphological analyser for Afrikaans. Thus, in this research project such a tokeniser is being developed, and the project therefore has two objectives: i) to postulate a tag set for integrated tokenisation, and ii) to develop an algorithm for integrated tokenisation. In order to achieve the first objective, a tag set for the tagging of sentences, named entities, words, abbreviations and punctuation is proposed specifically for the annotation of Afrikaans texts. It consists of 51 tags, which can be expanded in future in order to establish a larger, more specific tag set. The postulated tag set can also be simplified according to the level of specificity required by the user. It is subsequently shown that an effective tokeniser cannot be developed using only linguistic, or only statistical, methods. This is due to the complexity of the task: rule-based modules should be used for certain processes (for example sentence recognition), while other processes (for example named-entity recognition) can only be executed successfully by means of a machine-learning module. It is argued that a hybrid system (a system in which rule-based and statistical components are integrated) would achieve the best results on Afrikaans tokenisation. Various rule-based and statistical techniques, including a TiMBL-based classifier, are then employed to develop such a hybrid tokeniser for Afrikaans. The final tokeniser achieves an f-score of 97.25% when the complete set of tags is used. For sentence recognition an f-score of 100% is achieved. The tokeniser also recognises 81.39% of named entities. 
When a simplified tag set (consisting of only 12 tags) is used to annotate named entities, the f-score rises to 94.74%. The conclusion of the study is that a hybrid approach is indeed suitable for Afrikaans sentencisation, named-entity recognition and tokenisation. The tokeniser will improve if it is trained with more data, while the expansion of gazetteers as well as the tag set will also lead to a more accurate system. / Thesis (M.A. (Applied Language and Literary Studies))--North-West University, Potchefstroom Campus, 2006.
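The rule-based half of such a hybrid tokeniser might look like the sketch below: whitespace splitting plus a rule that detaches sentence-final punctuation unless the token is a known abbreviation, with named-entity decisions left to a machine-learning module in a full system. The tag names and the abbreviation list are illustrative assumptions, not the thesis's 51-tag set.

```python
import re

# Illustrative Afrikaans abbreviations; a real system would use gazetteers.
ABBREVIATIONS = {"dr.", "prof.", "mnr.", "bv."}

def rule_tokenise(text):
    """Rule-based tokenisation: split on whitespace, then detach sentence-final
    punctuation unless the token is a known abbreviation."""
    tokens = []
    for raw in text.split():
        if raw.lower() in ABBREVIATIONS:
            tokens.append((raw, "ABBR"))
        elif re.fullmatch(r"\w+[.!?]", raw):
            tokens.append((raw[:-1], "WORD"))
            tokens.append((raw[-1], "SENT_END"))
        else:
            tokens.append((raw, "WORD"))
    return tokens

print(rule_tokenise("Mnr. Smith loop huis toe."))
```

The abbreviation check is exactly the kind of process the abstract assigns to rules, while deciding that "Smith" is a named entity is the kind of decision the abstract argues needs a statistical classifier.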

Reëlgebaseerde klemtoontoekenning in 'n grafeem-na-foneemstelsel vir Afrikaans / E.W. Mouton

Mouton, Elsie Wilhelmina January 2010 (has links)
Text-to-speech systems are currently of great importance in the community. One core technology in this human language technology resource is stress assignment, which plays an important role in any text-to-speech system. At present no automatic stress assigner for Afrikaans exists. For these reasons, the two most important aims of this project are: a) to develop a complete and accurate set of stress rules for Afrikaans that can be implemented in an automatic stress assigner, and b) to develop an effective and highly accurate stress assigner in order to assign stress to Afrikaans words quickly and effectively. A set of stress rules for Afrikaans was developed in order to reach the first goal. It consists of 18 rules that are divided into groups for words that contain a schwa, derivations, and disyllabic, trisyllabic and polysyllabic simplex words. Next, different approaches that can be used to develop a stress assigner were examined, and the rule-based approach was used to implement the developed stress rules within the stress assigner. The programming language Perl was chosen for the implementation of the rules. The chosen algorithm was used to generate a stress assigner for Afrikaans by implementing the stress rules developed. The hyphenator Calomo and the compound analyser CKarma were used to hyphenate all the test data and to detect word boundaries within compounds. A dataset of 10 000 correctly annotated tokens was developed during the testing process. The evaluation of the stress assigner consists of four phases. During the first phase, the stress assigner was evaluated on the 10 000 tokens and achieved an accuracy of 92.09%. The grapheme-to-phoneme converter was evaluated with the same data and scored 91.9%. The influence of various factors on stress assignment was determined, and it was established that stress assignment is an essential component of rule-based grapheme-to-phoneme conversion. 
In conclusion, it can be said that the stress assigner achieved satisfactory results, and that the stress assigner can be successfully utilized in future projects to develop training data for further experiments with stress assignment and grapheme-to-phoneme conversion for Afrikaans. Experiments can be conducted in future with data-driven approaches that possibly may lead to better results in Afrikaans stress assignment and grapheme-to-phoneme conversion. / Thesis (M.A. (Applied Language and Literary Studies))--North-West University, Potchefstroom Campus, 2010.
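A rule cascade of this kind can be sketched as an ordered list of checks over a syllabified word, where the first matching rule decides which syllable is stressed. The three rules below are invented for illustration only; they are not among the thesis's 18 rules, and the long-vowel test is deliberately crude.

```python
def has_long_vowel(syl):
    """Crude check for a long vowel spelled as a doubled letter, e.g. 'aa', 'oo'."""
    return any(syl[i] == syl[i + 1] and syl[i] in "aeou" for i in range(len(syl) - 1))

def assign_stress(syllables):
    """Toy rule cascade: rules fire in order, the first match wins.
    Returns the index of the stressed syllable."""
    # Rule 1: stress the first syllable containing a long vowel.
    for i, syl in enumerate(syllables):
        if has_long_vowel(syl):
            return i
    # Rule 2: disyllabic words default to initial stress.
    if len(syllables) == 2:
        return 0
    # Rule 3: otherwise default to penultimate stress.
    return max(len(syllables) - 2, 0)

print(assign_stress(["ta", "fel"]))       # disyllabic default -> 0
print(assign_stress(["ka", "me", "ra"]))  # penultimate default -> 1
```

The ordered-cascade shape also shows why the abstract's pipeline needs a hyphenator and compound analyser first: the rules operate on syllables and word boundaries, not raw orthography.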

Natural language processing techniques for the purpose of sentinel event information extraction

Barrett, Neil 23 November 2012 (has links)
An approach to biomedical language processing is to apply existing natural language processing (NLP) solutions to biomedical texts. Often, existing NLP solutions are less successful in the biomedical domain than in non-biomedical domains (e.g., newspaper text). Biomedical NLP is likely best served by methods, information and tools that account for its particular challenges. In this thesis, I describe an NLP system specifically engineered for sentinel event extraction from clinical documents. The NLP system's design accounts for several biomedical NLP challenges. The specific contributions are as follows. - Biomedical tokenizers differ, lack consensus over output tokens and are difficult to extend. I developed an extensible tokenizer, providing a tokenizer design pattern and implementation guidelines. It evaluated as equivalent to a leading biomedical tokenizer (MedPost). - Biomedical part-of-speech (POS) taggers are often trained on non-biomedical corpora and applied to biomedical corpora, which results in a decrease in tagging accuracy. I built a token-centric POS tagger, TcT, that is more accurate than three existing POS taggers (mxpost, TnT and Brill) when trained on a non-biomedical corpus and evaluated on biomedical corpora. TcT achieves this increase in tagging accuracy by ignoring previously assigned POS tags and restricting the tagger's scope to the current token, the previous token and the following token. - Two parsers, MST and Malt, have been evaluated using perfect POS tag input. Given that perfect input is unlikely in biomedical NLP tasks, I evaluated these two parsers on imperfect POS tag input and compared their results. MST was most affected by imperfectly POS-tagged biomedical text. I attributed MST's drop in performance to verbs and adjectives, where MST had more potential for performance loss than Malt. 
I attributed Malt's resilience to POS tagging errors to its use of a rich feature set and a local scope in decision making. - Previous automated clinical coding (ACC) research focuses on mapping narrative phrases to terminological descriptions (e.g., concept descriptions). These methods make little or no use of the additional semantic information available through topology. I developed a token-based ACC approach that encodes tokens and manipulates token-level encodings by mapping linguistic structures to topological operations in SNOMED CT. My ACC method recalled most concepts given their descriptions and performed significantly better than MetaMap. I extended my contributions for the purpose of sentinel event extraction from clinical letters. The extensions account for negation in text, use medication brand names during ACC and model (coarse) temporal information. My software system's performance is similar to state-of-the-art results. Given all of the above, my thesis is a blueprint for building a biomedical NLP system. Furthermore, my contributions likely apply to NLP systems in general. / Graduate
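The token-centric idea behind TcT — predict a tag from the previous, current and next token only, never from previously assigned tags — can be sketched as a simple frequency model with back-off. This is a hypothetical miniature to make the scope restriction concrete, not TcT's actual implementation.

```python
from collections import Counter, defaultdict

class TokenCentricTagger:
    """Tag is predicted from (previous token, current token, next token) only;
    previously assigned tags are never consulted. Backs off from the trigram
    token context to the bare token, then to a default tag."""
    def __init__(self):
        self.context = defaultdict(Counter)  # (prev, cur, nxt) -> tag counts
        self.unigram = defaultdict(Counter)  # cur -> tag counts

    def train(self, tagged_sentences):
        for sent in tagged_sentences:
            tokens = [t for t, _ in sent]
            for i, (tok, tag) in enumerate(sent):
                prev = tokens[i - 1] if i > 0 else "<s>"
                nxt = tokens[i + 1] if i < len(tokens) - 1 else "</s>"
                self.context[(prev, tok, nxt)][tag] += 1
                self.unigram[tok][tag] += 1

    def tag(self, tokens):
        out = []
        for i, tok in enumerate(tokens):
            prev = tokens[i - 1] if i > 0 else "<s>"
            nxt = tokens[i + 1] if i < len(tokens) - 1 else "</s>"
            counts = (self.context.get((prev, tok, nxt))
                      or self.unigram.get(tok)
                      or Counter({"NN": 1}))  # default for unseen tokens
            out.append(counts.most_common(1)[0][0])
        return out

tagger = TokenCentricTagger()
tagger.train([[("the", "DT"), ("cell", "NN"), ("divides", "VBZ")]])
print(tagger.tag(["the", "cell", "divides"]))  # ['DT', 'NN', 'VBZ']
```

Because no decision depends on an earlier predicted tag, one wrong prediction cannot cascade into the rest of the sentence, which is the resilience property the abstract attributes to the restricted scope.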

Intensional Context-Free Grammar

Little, Richard 02 January 2014 (has links)
The purpose of this dissertation is to develop a new generative grammar based on the principles of intensional logic. More specifically, the goal is to create a psychologically real grammar model for use in natural language processing. The new grammar consists of a set of context-free rewrite rules tagged with intensional versions. Most generative grammars, such as transformational grammar, lexical-functional grammar and head-driven phrase structure grammar, extend traditional context-free grammars with a mechanism for dealing with contextual information, such as the subcategorization of words and agreement between different phrasal elements. In these grammars there is not enough separation between the utterances of a language and the context in which they are uttered. Their models of language seem to assume that context is in some way encapsulated in the words of the language, instead of the other way around. In intensional logic, the truth of a statement is considered in the context in which it is uttered, unlike traditional predicate logic, in which truth is assigned in a vacuum, regardless of when or where a statement may have been made. To date, the application of the principles of intensionality to natural language has been confined to semantic theory. We remedy this by applying the ideas of intensional logic to syntactic context, resulting in intensional context-free grammar. Our grammar takes full advantage of the simplicity and elegance of context-free grammars while accounting for information beyond the sentence itself in a realistic way. Sentence derivation is entirely encapsulated in the context of its utterance. In fact, for any particular context, the entire language of the grammar is encapsulated in that context. This is evidenced by our proof that the language of an intensional grammar is a set of context-free languages, indexed by context. To further support our claims, we design and implement a small fragment of English using the grammar. 
The English grammar is capable of generating both passive and active sentences that include a subject, verb and up to two optional objects. Furthermore, we have implemented a partial French to English translation system that uses a single language dimension to initiate a translation. This allows us to include multiple languages in one grammar, unlike other systems which must separate the grammars of each language. This result has led this author to believe that we have created a grammar that is a viable candidate for a true Universal Grammar, far exceeding our initial goals. / Graduate / 0984 / 0800 / 0290 / rlittle@uvic.ca
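One way to picture context-free rules "tagged with intensional versions" is to guard each expansion with a context along a single dimension, as in the translation experiment's language dimension: for a fixed context, what remains is an ordinary context-free grammar. The encoding below is a hypothetical toy, not the dissertation's formalism.

```python
# Each nonterminal maps to alternative expansions guarded by a context
# (here a single 'lang' dimension). Grammar and lexicon are invented.
RULES = {
    "S":  [({"lang": "en"}, ["NP", "VP"]), ({"lang": "fr"}, ["NP", "VP"])],
    "NP": [({"lang": "en"}, ["the", "N"]), ({"lang": "fr"}, ["le", "N"])],
    "VP": [({"lang": "en"}, ["V"]),        ({"lang": "fr"}, ["V"])],
    "N":  [({"lang": "en"}, ["dog"]),      ({"lang": "fr"}, ["chien"])],
    "V":  [({"lang": "en"}, ["sleeps"]),   ({"lang": "fr"}, ["dort"])],
}

def generate(symbol, context):
    """Leftmost derivation: expand a symbol using the first rule whose guard
    is satisfied by the current context; terminals pass through unchanged."""
    if symbol not in RULES:
        return [symbol]
    for guard, expansion in RULES[symbol]:
        if all(context.get(k) == v for k, v in guard.items()):
            return [tok for part in expansion for tok in generate(part, context)]
    raise ValueError(f"no rule for {symbol} in context {context}")

print(" ".join(generate("S", {"lang": "en"})))  # the dog sleeps
print(" ".join(generate("S", {"lang": "fr"})))  # le chien dort
```

Fixing the context to `{"lang": "en"}` selects one context-free grammar from the family, which echoes the result quoted above that the language of an intensional grammar is a set of context-free languages indexed by context.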

Answer Localization System Using Discourse Evaluation

Sualp, Merter 01 December 2004 (has links) (PDF)
The words in a language not only help us to construct sentences but also carry other features that we usually underestimate. Each word relates itself to the remaining ones in some way. In our daily lives, we make extensive use of these relations in many areas, question direction being one of them. In this work, it is investigated whether the relations between words can be useful for question direction, and an approach to question direction is presented. In addition, a tool based on this approach is built for a course given in Turkish. The relations between words are represented by a semantic network for nouns and verbs. By passing through the whole course material and using the relations of meronymy (nouns only); synonymy, antonymy, hypernymy and coordination (both nouns and verbs); and entailment and causality (verbs only), the semantic network, which is the backbone of the application, is constructed. The end product of our research consists of three modules: getting the question from the user and constructing the set of words related to the words that make up the question; scoring each course section by comparing the words of the question set with the words in the section; and presenting the sections that may contain the answer. The sections of the course are taken as the units to be evaluated. The chat logs, which span three years of the course, were obtained with permission, and questions were extracted from them. These questions were used for testing the constructed application.
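The three modules above can be caricatured in a few lines: a hand-made relation table stands in for the semantic network, and section scoring is plain word overlap with the expanded question set. All names and data here are invented for illustration; the actual system uses a WordNet-style network built from Turkish course material.

```python
# Stand-in for the semantic network: each word maps to its related words.
RELATED = {
    "engine": {"motor", "car", "piston"},
    "fuel": {"petrol", "gasoline", "energy"},
}

def expand(question_words):
    """Module 1: the question's words plus everything related to them."""
    expanded = set(question_words)
    for w in question_words:
        expanded |= RELATED.get(w, set())
    return expanded

def score(section_text, expanded):
    """Module 2: one point per section word found in the expanded set."""
    return sum(1 for w in section_text.lower().split() if w in expanded)

def locate(question, sections):
    """Module 3: rank sections by score, best candidates first."""
    expanded = expand(question.lower().split())
    return sorted(sections, key=lambda s: score(sections[s], expanded), reverse=True)

sections = {
    "sec1": "the motor converts petrol into motion",
    "sec2": "the history of the bicycle",
}
print(locate("engine fuel", sections))  # ['sec1', 'sec2']
```

Note that `sec1` is ranked first despite sharing no literal word with the question; the relations supply the bridge from "engine" to "motor" and from "fuel" to "petrol", which is exactly the effect the thesis investigates.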

Examining the Clinical Utility and Predictive Validity of Dimensional Models of Psychopathology

Love, Patrick K 08 1900 (has links)
The Diagnostic and Statistical Manual of Mental Disorders (DSM) arranges co-occurring clusters of symptoms into distinct disorder categories, which theoretically have specific etiologies, pathologies, and treatments. However, researchers and clinicians alike have consistently found DSM diagnoses to have high rates of comorbidity and low diagnostic specificity, and no disorder has proven to be a discrete category. There is mounting evidence that dimensional taxonomies more accurately capture the underlying structure of mental illness and clinical presentations. The recently proposed hierarchical taxonomy of psychopathology (HiTOP) presumes to address the issues of categorical nosologies using a data-driven approach to create a dimensional model of psychopathology. However, heretofore there have been no empirical examinations of HiTOP's ability to predict psychotherapy treatment outcomes. This study compared the predictive validity of DSM, RDoC, and HiTOP criteria using natural language processing on free-text narrative notes. Of the three GMMs run, only the model using DSM criteria as predictors had adequate model fit. Additionally, none of the nosologies significantly predicted treatment course. Implications for the application of RDoC and HiTOP are discussed.

Noun phrase generation for situated dialogs

Stoia, Laura Cristina. January 2007 (has links)
Thesis (Ph. D.)--Ohio State University, 2007. / Title from first page of PDF file. Includes bibliographical references (p. 154-163).
