201

Automatic generation of factual questions from video documentaries

Skalban, Yvonne January 2013 (has links)
Questioning sessions are an essential part of teachers’ daily instructional activities. Questions are used to assess students’ knowledge and comprehension and to promote learning. The manual creation of such learning material is a laborious and time-consuming task. Research in Natural Language Processing (NLP) has shown that Question Generation (QG) systems can be used to efficiently create high-quality learning materials to support teachers in their work and students in their learning process. A number of successful QG applications for education and training have been developed, but these focus mainly on supporting reading materials. However, digital technology is always evolving; there is an ever-growing amount of multimedia content available, and more and more delivery methods for audio-visual content are emerging and easily accessible. At the same time, research provides empirical evidence that multimedia use in the classroom has beneficial effects on student learning. Thus, there is a need to investigate whether QG systems can be used to assist teachers in creating assessment materials from these different types of media that are being employed in classrooms. This thesis serves to explore how NLP tools and techniques can be harnessed to generate questions from non-traditional learning materials, in particular videos. A QG framework which allows the generation of factual questions from video documentaries has been developed, and a number of evaluations to analyse the quality of the produced questions have been performed. The developed framework uses several readily available NLP tools to generate questions from the subtitles accompanying a video documentary. The reason for choosing video documentaries is two-fold: firstly, they are frequently used by teachers and, secondly, their factual nature lends itself well to question generation, as will be explained within the thesis. The questions generated by the framework can be used as a quick way of testing students’ comprehension of what they have learned from the documentary. As part of this research project, the characteristics of documentary videos and their subtitles were analysed, and the methodology was adapted to exploit these characteristics. An evaluation of the system output by domain experts showed promising results but also revealed that generating even shallow questions is a task which is far from trivial. To this end, the evaluation and subsequent error analysis contribute to the literature by highlighting the challenges QG from documentary videos can face. In a user study, it was investigated whether questions generated automatically by the system developed as part of this thesis and a state-of-the-art system can successfully be used to assist multimedia-based learning. Using a novel evaluation methodology, the feasibility of using a QG system’s output as ‘pre-questions’ was examined, with different types of pre-questions (text-based and with images) used. The psychometric parameters of the questions generated automatically by the two systems and of those generated manually were compared. The results indicate that the presence of pre-questions (preferably with images) improves the performance of test-takers, and they highlight that the psychometric parameters of the questions generated by the system are comparable to, if not better than, those of the state-of-the-art system. In another experiment, the productivity of question creation, in terms of the time taken to generate questions manually versus the time taken to post-edit system-generated questions, was analysed. A post-editing tool which allows for the tracking of several statistics, such as edit distance measures and editing time, was used. The quality of questions before and after post-editing was also analysed. Not only did the experiments provide quantitative data about automatically and manually generated questions, but qualitative data in the form of user feedback, which provides an insight into how users perceived the quality of the questions, was also gathered.
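A minimal sketch of the kind of subtitle-to-question pipeline the abstract describes, using spaCy's named-entity recogniser and simple wh-word templates. The tool choice and the templating rule are illustrative assumptions, not the thesis's actual framework, and the output is deliberately shallow:

```python
# Illustrative sketch only: a naive wh-question generator over subtitle lines,
# loosely following the pipeline shape described above (parse a declarative
# sentence, pick an answer phrase, replace it with a wh-word). Even this
# shallow step needs filtering in practice, as the abstract notes.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

WH_FOR_ENTITY = {"PERSON": "Who", "GPE": "Where", "LOC": "Where", "DATE": "When"}

def questions_from_subtitle(line: str):
    """Turn one declarative subtitle line into zero or more factual questions."""
    doc = nlp(line)
    questions = []
    for ent in doc.ents:
        wh = WH_FOR_ENTITY.get(ent.label_)
        if wh is None:
            continue
        # Replace the entity span with the wh-word to form a shallow question,
        # keeping the entity text as the expected answer.
        q = (line[:ent.start_char] + wh + line[ent.end_char:]).rstrip(". ") + "?"
        questions.append((q, ent.text))
    return questions

if __name__ == "__main__":
    for q, a in questions_from_subtitle("Marie Curie discovered polonium."):
        print(q, "->", a)  # e.g. "Who discovered polonium?" -> "Marie Curie"
```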
202

Automatic error detection in non-native English

De Felice, Rachele January 2008 (has links)
This thesis describes the development of DAPPER ('Determiner And PrePosition Error Recogniser'), a system designed to automatically acquire models of occurrence for English prepositions and determiners to allow for the detection and correction of errors in their usage, especially in the writing of non-native speakers of the language. Prepositions and determiners are focused on because they are parts of speech whose usage is particularly challenging to acquire, both for students of the language and for natural language processing tools. The work presented in this thesis proposes to address this problem by developing a system which can acquire models of correct preposition and determiner occurrence, and can use this knowledge to identify divergences from these models as errors. The contexts of these parts of speech are represented by a sophisticated feature set, incorporating a variety of semantic and syntactic elements. DAPPER is found to perform well on preposition and determiner selection tasks in correct native English text. Results on each preposition and determiner are discussed in detail to understand the possible reasons for variations in performance, and whether these are due to problems with the structure of DAPPER or to deeper linguistic reasons. An in-depth analysis of all features used is also offered, quantifying the contribution of each feature individually. This can help establish if the decision to include complex semantic and syntactic features is justified in the context of this task. Finally, the performance of DAPPER on non-native English text is assessed. The system is found to be robust when applied to text which does not contain any preposition or determiner errors. On an error correction task, results are mixed: DAPPER shows promising results on preposition selection and determiner confusion (definite vs. indefinite) errors, but is less successful in detecting errors involving missing or extraneous determiners. Several characteristics of learner writing are described, to gain a clearer understanding of what problems arise when natural language processing tools are used with this kind of text. It is concluded that the construction of contextual models is a viable approach to the task of preposition and determiner selection, despite outstanding issues pertaining to the domain of non-native writing.
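The contextual-model idea described above can be illustrated with a small sketch: train a classifier to predict which preposition a context licenses, and flag the writer's choice when it diverges from the model's prediction. The features and toy data below are invented placeholders, not DAPPER's actual feature set or training data:

```python
# Sketch of a contextual-model approach to preposition selection, in the spirit
# of the abstract (not DAPPER itself): learn which preposition a context
# predicts, and flag the writer's choice as a possible error if it diverges.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training contexts: the features stand in for the richer syntactic and
# semantic features the thesis describes.
train = [
    ({"head": "depend", "object": "weather"}, "on"),
    ({"head": "interested", "object": "music"}, "in"),
    ({"head": "arrive", "object": "station"}, "at"),
    ({"head": "rely", "object": "data"}, "on"),
]
X, y = zip(*train)

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)

def check(context, written):
    """Return a suggested preposition if the model disagrees with the writer's."""
    predicted = model.predict([context])[0]
    return None if predicted == written else predicted

print(check({"head": "depend", "object": "weather"}, "of"))  # expected: 'on'
```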
203

Wide-coverage parsing for Turkish

Çakici, Ruket January 2009 (has links)
Wide-coverage parsing is an area that attracts much attention in natural language processing research. This is due to the fact that it is the first step to many other applications in natural language understanding, such as question answering. Supervised learning using human-labelled data is currently the best performing method. Therefore, there is great demand for annotated data. However, human annotation is very expensive, and the amount of annotated data available is almost always much less than is needed to train well-performing parsers. This is the motivation behind making the best use of the data available. Turkish presents a challenge both because syntactically annotated Turkish data is relatively small and because Turkish is highly agglutinative, hence unusually sparse at the whole-word level. The METU-Sabancı Treebank is a dependency treebank of 5620 sentences with surface dependency relations and morphological analyses for words. We show that including even the crudest forms of morphological information extracted from the data boosts the performance of both generative and discriminative parsers, contrary to received opinion concerning English. We induce word-based and morpheme-based CCG grammars from the Turkish dependency treebank. We use these grammars to train a state-of-the-art CCG parser that predicts long-distance dependencies in addition to the ones that other parsers are capable of predicting. We also use the correct CCG categories as simple features in a graph-based dependency parser and show that this improves the parsing results. We show that a morpheme-based CCG lexicon for Turkish is able to address many problems, such as conflicts of semantic scope and the recovery of long-range dependencies, while obtaining smoother statistics from the models. CCG handles linguistic phenomena such as local and long-range dependencies more naturally and effectively than other linguistic theories, while potentially supporting semantic interpretation in parallel. Using morphological information and a morpheme-cluster-based lexicon improves the performance both quantitatively and qualitatively for Turkish. We also provide an improved version of the treebank, which will be released by kind permission of METU and Sabancı.
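A small illustration of why morpheme-level representation helps for an agglutinative language: distinct Turkish word forms share recurring morphemes, so morpheme-based counts are far less sparse than whole-word counts. The segmentations below are simplified for illustration and are not drawn from the treebank:

```python
# Sketch: sparsity at the whole-word level vs. shared evidence at the
# morpheme level for Turkish. Segmentations are simplified illustrations.
from collections import Counter

words = ["evlerimizden", "evlerimize", "evde"]          # "from our houses", "to our houses", "in the house"
segmented = [["ev", "ler", "imiz", "den"],
             ["ev", "ler", "imiz", "e"],
             ["ev", "de"]]

# Each full word form occurs once, but the stem "ev" and plural "ler" recur,
# giving a statistical model more evidence per parameter.
print(Counter(words))
print(Counter(m for segs in segmented for m in segs))
```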
204

Active learning: an explicit treatment of unreliable parameters

Becker, Markus January 2008 (has links)
Active learning reduces annotation costs for supervised learning by concentrating labelling efforts on the most informative data. Most active learning methods assume that the model structure is fixed in advance and focus upon improving parameters within that structure. However, this is not appropriate for natural language processing, where the model structure and associated parameters are determined using labelled data. Applying traditional active learning methods to natural language processing can fail to produce the expected reductions in annotation cost. We show that one of the reasons for this problem is that active learning can only select examples which are already covered by the model. In this thesis, we better tailor active learning to the needs of natural language processing as follows. We formulate the Unreliable Parameter Principle: Active learning should explicitly and additionally address unreliably trained model parameters in order to optimally reduce classification error. In order to do so, we should target both missing events and infrequent events. We demonstrate the effectiveness of such an approach for a range of natural language processing tasks: prepositional phrase attachment, sequence labelling, and syntactic parsing. For prepositional phrase attachment, the explicit selection of unknown prepositions significantly improves coverage and classification performance for all examined active learning methods. For sequence labelling, we introduce a novel active learning method which explicitly targets unreliable parameters by selecting sentences with many unknown words and a large number of unobserved transition probabilities. For parsing, targeting unparseable sentences significantly improves coverage and f-measure in active learning.
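A hedged sketch of the selection idea summarised above, targeting missing events by preferring sentences with many words the current model has never seen. The scoring function is an illustrative simplification, not the thesis's exact method:

```python
# Sketch: prefer sentences containing many words the current model has never
# seen, so that labelling them fills in missing or unreliable parameters.
def unknown_word_score(sentence, known_vocab):
    tokens = sentence.lower().split()
    unknown = sum(1 for t in tokens if t not in known_vocab)
    return unknown / max(len(tokens), 1)

def select_for_annotation(pool, known_vocab, k=2):
    # rank unlabelled sentences by their proportion of unknown words
    return sorted(pool, key=lambda s: unknown_word_score(s, known_vocab),
                  reverse=True)[:k]

vocab = {"the", "cat", "sat", "on", "mat"}
pool = ["the cat sat on the mat",
        "quantum decoherence limits qubit fidelity",
        "the mat sat on the cat"]
print(select_for_annotation(pool, vocab))
```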
205

Logarithmic opinion pools for conditional random fields

Smith, Andrew January 2007 (has links)
Since their recent introduction, conditional random fields (CRFs) have been successfully applied to a multitude of structured labelling tasks in many different domains. Examples include natural language processing (NLP), bioinformatics and computer vision. Within NLP itself we have seen many different application areas, such as named entity recognition, shallow parsing, information extraction from research papers and language modelling. Most of this work has demonstrated the need, directly or indirectly, to employ some form of regularisation when applying CRFs in order to overcome the tendency for these models to overfit. To date, a popular method for regularising CRFs has been to fit a Gaussian prior distribution over the model parameters. In this thesis we explore other methods of CRF regularisation, investigating their properties and comparing their effectiveness. We apply our ideas to sequence labelling problems in NLP, specifically part-of-speech tagging and named entity recognition. We start with an analysis of conventional approaches to CRF regularisation, and investigate possible extensions to such approaches. In particular, we consider choices of prior distribution other than the Gaussian, including the Laplacian and Hyperbolic; we look at the effect of regularising different features separately, to differing degrees, and explore how we may define an appropriate level of regularisation for each feature; we investigate the effect of allowing the mean of a prior distribution to take on non-zero values; and we look at the impact of relaxing the feature expectation constraints satisfied by a standard CRF, leading to a modified CRF model we call the inequality CRF. Our analysis leads to the general conclusion that although there is some capacity for improvement of conventional regularisation through modification and extension, this is quite limited. Conventional regularisation with a prior is in general hampered by the need to fit a hyperparameter or set of hyperparameters, which can be an expensive process. We then approach the CRF overfitting problem from a different perspective. Specifically, we introduce a form of CRF ensemble called a logarithmic opinion pool (LOP), where CRF distributions are combined under a weighted product. We show how a LOP has theoretical properties which provide a framework for designing new overfitting reduction schemes in terms of diverse models, and demonstrate how such diverse models may be constructed in a number of different ways. Specifically, we show that by constructing CRF models from manually crafted partitions of a feature set and combining them with equal weight under a LOP, we may obtain an ensemble that significantly outperforms a standard CRF trained on the entire feature set, and is competitive in performance to a standard CRF regularised with a Gaussian prior. The great advantage of the LOP approach is that, unlike the Gaussian prior method, it does not require us to search a hyperparameter space. Having demonstrated the success of LOPs in the simple case, we then move on to consider more complex uses of the framework. In particular, we investigate whether it is possible to further improve the LOP ensemble by allowing parameters in different models to interact during training in such a way that diversity between the models is encouraged. Lastly, we show how the LOP approach may be used as a remedy for a problem from which standard CRFs can sometimes suffer.
In certain situations, negative effects may be introduced to a CRF by the inclusion of highly discriminative features. An example of this is provided by gazetteer features, which encode a word's presence in a gazetteer. We show how LOPs may be used to reduce these negative effects, and so provide some insight into how gazetteer features may be more effectively handled in CRFs, and log-linear models in general.
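The weighted-product combination referred to above is the standard formulation of a logarithmic opinion pool; in the notation used here (ours, not necessarily the thesis's), the per-model CRF distributions p_α with weights w_α summing to one are combined as:

```latex
% Logarithmic opinion pool over CRF distributions p_\alpha with weights w_\alpha
% (standard formulation; notation is illustrative).
p_{\mathrm{LOP}}(y \mid x)
  = \frac{1}{Z_{\mathrm{LOP}}(x)} \prod_{\alpha} p_{\alpha}(y \mid x)^{w_{\alpha}},
\qquad
Z_{\mathrm{LOP}}(x) = \sum_{y'} \prod_{\alpha} p_{\alpha}(y' \mid x)^{w_{\alpha}}
```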
206

Automation of summarization evaluation methods and their application to the summarization process

Nahnsen, Thade January 2011 (has links)
Summarization is the process of creating a more compact textual representation of a document or a collection of documents. In view of the vast increase in electronically available information sources in the last decade, filters such as automatically generated summaries are becoming ever more important to facilitate the efficient acquisition and use of required information. Different methods using natural language processing (NLP) techniques are being used to this end. One of the shallowest approaches is the clustering of available documents and the representation of the resulting clusters by one of the documents; an example of this approach is the Google News website. It is also possible to augment the clustering of documents with a summarization process, which would result in a more balanced representation of the information in the cluster, NewsBlaster being an example. However, while some systems are already available on the web, summarization is still considered a difficult problem in the NLP community. One of the major problems hampering the development of proficient summarization systems is the evaluation of the (true) quality of system-generated summaries. This is exemplified by the fact that the current state-of-the-art evaluation method to assess the information content of summaries, the Pyramid evaluation scheme, is a manual procedure. In this light, this thesis has three main objectives:
1. The development of a fully automated evaluation method. The proposed scheme is rooted in the ideas underlying the Pyramid evaluation scheme and makes use of deep syntactic information and lexical semantics. Its performance improves notably on previous automated evaluation methods.
2. The development of an automatic summarization system which draws on the conceptual idea of the Pyramid evaluation scheme and the techniques developed for the proposed evaluation system. The approach features an algorithm for determining the pyramid and bases importance on the number of occurrences of the variable-sized contributors of the pyramid, as opposed to the word-based methods exploited elsewhere.
3. The development of a text coherence component that can be used for obtaining the best ordering of the sentences in a summary.
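A simplified sketch of pyramid-style content scoring, in which content units are weighted by how many reference summaries express them; the thesis's automated variant matches contributors using syntactic and lexical-semantic information rather than the exact identifiers assumed here:

```python
# Sketch of pyramid-style content scoring (simplified illustration only).
from collections import Counter

def pyramid_score(candidate_scus, reference_scu_lists):
    # weight of a content unit = number of reference summaries it appears in
    weights = Counter(scu for ref in reference_scu_lists for scu in set(ref))
    observed = sum(weights[scu] for scu in set(candidate_scus))
    # best achievable total for a summary expressing the same number of units
    top = sorted(weights.values(), reverse=True)[:len(set(candidate_scus))]
    return observed / sum(top) if top else 0.0

refs = [["A", "B", "C"], ["A", "B"], ["A", "D"]]
print(pyramid_score(["A", "C"], refs))  # (3 + 1) / (3 + 2) = 0.8
```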
207

Generating affective natural language for parents of neonatal infants

Mahamood, Saad Ali January 2010 (has links)
The thesis presented here describes original research in the field of Natural Language Generation (NLG). NLG is the subfield of artificial intelligence that is concerned with the automatic production of documents from underlying data. This thesis in particular focuses on developing novel methods for generating text that take into consideration the recipient’s level of stress as a factor in adapting the resultant textual output. This consideration was particularly salient given the domain in which this research was conducted: providing information for parents of pre-term infants during neonatal intensive care (NICU), a highly technical and stressful environment for parents, in which emotional sensitivity must be shown in the presentation of information. We have investigated the emotional and informational needs of these parents through an extensive review of past literature and two separate research studies with former and current NICU parents. The NLG system built for this research is called BabyTalk Family (BT-Family), a system that can produce for parents a textual summary of the medical events that have occurred for a baby in the NICU in the last twenty-four hours. The novelty of this system is that it is capable of estimating the level of stress of the recipient and, by using several affective NLG strategies, it is able to tailor its output for a stressed audience, unlike traditional NLG systems, where the output remains unchanged regardless of the emotional state of the recipient. The key innovation in this system was the integration of several affective strategies in the Document Planner for tailoring textual output for stressed recipients. BT-Family’s output was evaluated with thirteen parents who had previously had a baby in neonatal care. We developed a methodology for an evaluation that involved a direct comparison between stressed and unstressed text for the same given medical scenario, for variables such as preference, understandability, helpfulness, and emotional appropriateness. The results obtained showed that the parents overwhelmingly preferred the stressed text for all of the variables measured.
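A minimal sketch of what stress-aware tailoring in a document planner can look like; the rule and wording below are invented for illustration and are not BT-Family's actual affective strategies:

```python
# Illustrative sketch only: choose a realisation strategy based on an
# estimated stress level. The rule and phrasing are invented placeholders.
def realise_event(event, stress_level):
    neutral = f"{event['baby']} required {event['intervention']} overnight."
    if stress_level == "high":
        # soften the framing and add an explicitly reassuring follow-up clause
        return (f"{event['baby']} needed some extra help "
                f"({event['intervention']}) overnight, which is common in the "
                f"unit, and the team is monitoring this closely.")
    return neutral

event = {"baby": "Your baby", "intervention": "CPAP support"}
print(realise_event(event, "high"))
```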
208

Using natural language generation to provide access to semantic metadata

Hielkema, Feikje January 2010 (has links)
In recent years, the use of metadata to describe and share resources has grown in importance, especially in the context of the Semantic Web. However, access to metadata is difficult for users without experience with description logic or formal languages, and currently this description applies to most web users. There is a strong need for interfaces that provide easy access to semantic metadata, enabling novice users to browse, query and create it easily. This thesis describes a natural language generation interface to semantic metadata called LIBER (Language Interface for Browsing and Editing Rdf), driven by domain ontologies which are integrated with domain-specific linguistic information. LIBER uses the linguistic information to generate fluent descriptions and search terms through syntactic aggregation. The tool contains three modules to support metadata creation, querying and browsing, which implement the WYSIWYM (What You See Is What You Meant) natural language generation approach. Users can add and remove information by editing system-generated feedback texts. Two studies have been conducted to evaluate LIBER’s usability and compare it to a different Semantic Web interface. The studies showed that subjects with no prior experience of the Semantic Web could use LIBER effectively to create, search and browse metadata, and they were a useful source of ideas on how to improve LIBER’s usability. However, the results of these studies were less positive than we had hoped, and users actually preferred the other Semantic Web tool. This has raised questions about which user audience LIBER should aim for, and the extent to which the underlying ontologies influence the usability of the interface. LIBER’s portability to other domains is supported by a tool with which ontology developers without a background in linguistics can prepare their ontologies for use in LIBER by adding the necessary linguistic information.
209

Implication textuelle et réécriture / Textual Entailment and rewriting

Bedaride, Paul 18 October 2010 (has links)
This thesis presents several contributions on the theme of recognising textual entailment (RTE). RTE is the human capacity, given two texts, to determine whether the meaning of the second text can be deduced from the meaning of the first. One of the contributions made to the field is a hybrid RTE system which takes the analyses of an existing stochastic parser, labels them with semantic roles, transforms the resulting structures into logical formulas using rewrite rules, and finally tests entailment with proof tools. The other contribution of this thesis is the generation of finely annotated test suites with a uniform distribution of phenomena, coupled with a new method for evaluating systems that uses the error-mining techniques developed by the parsing community, allowing a better identification of systems' limitations. For this, we create a set of semantic formulas and then generate the corresponding annotated syntactic realisations using an existing generation system. We then test whether or not there is an entailment between each pair of possible syntactic realisations. Finally, using an algorithm we developed, we select a subset of this set of problems of a given size that satisfies a number of constraints.
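A hedged sketch of the pipeline shape described above: semantic-role structures rewritten into logical atoms, followed by an entailment check. A real system would hand the formulas to a theorem prover; the check below is a naive set-inclusion stand-in:

```python
# Sketch only: semantic-role structures -> logical atoms -> naive entailment test.
def to_atoms(srl):
    """srl: dict like {'predicate': 'buy', 'Arg0': 'john', 'Arg1': 'car'}"""
    pred = srl["predicate"]
    return {f"{pred}({role.lower()}={val})"
            for role, val in srl.items() if role != "predicate"}

def entails(text_srls, hypothesis_srls):
    text_atoms = set().union(*(to_atoms(s) for s in text_srls))
    hyp_atoms = set().union(*(to_atoms(s) for s in hypothesis_srls))
    # every atom asserted by the hypothesis must be asserted by the text
    return hyp_atoms <= text_atoms

text = [{"predicate": "buy", "Arg0": "john", "Arg1": "car", "ArgM-TMP": "yesterday"}]
hyp = [{"predicate": "buy", "Arg0": "john", "Arg1": "car"}]
print(entails(text, hyp))  # True: the hypothesis only drops the temporal modifier
```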
210

Natural language processing of online propaganda as a means of passively monitoring an adversarial ideology

Holm, Raven R. 03 1900 (has links)
Approved for public release; distribution is unlimited / Reissued 30 May 2017 with Second Reader’s non-NPS affiliation added to title page. / Online propaganda embodies a potent new form of warfare, one that extends the strategic reach of our adversaries and overwhelms analysts. Foreign organizations have effectively leveraged an online presence to influence elections and distance-recruit. The Islamic State has also shown proficiency in outsourcing violence, proving that propaganda can enable an organization to wage physical war at very little cost and without the resources traditionally required. To augment new initiatives to counter foreign propaganda, this thesis presents a pipeline for defining, detecting and monitoring ideology in text. A corpus of 3,049 modern online texts was assembled and two classifiers were created: one for detecting authorship and another for detecting ideology. The classifiers demonstrated 92.70% recall and 95.84% precision in detecting authorship, and detected ideological content with 76.53% recall and 95.61% precision. Both classifiers were combined to simulate how an ideology can be detected and how its composition could be passively monitored across time. Implementation of such a system could conserve manpower in the intelligence community and add a new dimension to analysis. Although this pipeline makes assumptions about the quality and integrity of its input, it is a novel contribution to the fields of Natural Language Processing and Information Warfare. / Lieutenant, United States Coast Guard
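A sketch of the two-stage classification pipeline described above, with TF-IDF features and logistic regression standing in for whatever the thesis actually used; the data is a toy placeholder, not the 3,049-text corpus, and the metric calls are shown only to illustrate how precision and recall would be computed:

```python
# Illustrative sketch of a text-classification stage for ideology detection;
# a second, identically shaped classifier could handle authorship.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import precision_score, recall_score

def train_classifier(texts, labels):
    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))
    clf.fit(texts, labels)
    return clf

# Toy placeholder data; a real evaluation needs a held-out test set.
texts = ["join our cause and fight", "weather forecast for tuesday",
         "the struggle demands sacrifice", "recipe for lentil soup"]
ideology = [1, 0, 1, 0]

clf = train_classifier(texts, ideology)
pred = clf.predict(texts)  # predicting on training data here only for illustration
print(precision_score(ideology, pred), recall_score(ideology, pred))
```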
