Global ETD Search

271	Approaches to natural language processing in app development Djoweini, Camran, Hellberg, Henrietta January 2018 Natural language processing is an evolving field that is not yet fully established. A high demand for natural language processing in applications creates a need for good development tools and implementation approaches suited to the engineers behind the applications. This project approaches the field from an engineering point of view to research approaches, tools, and techniques that are readily available today for development of natural language processing support. The information retrieval sub-area of natural language processing was examined through a case study, where prototypes were developed to get a deeper understanding of the tools and techniques used for such tasks from an engineering point of view. We found that there are two major approaches to developing natural language processing support for applications: high-level and low-level approaches. A categorization of tools and frameworks belonging to the two approaches, as well as the source code, documentation, and evaluations of two prototypes developed as part of the research, are presented. The choice of approach, tools, and techniques should be based on the specifications and requirements of the final product, and both levels have their own pros and cons. The results of the report are, to a large extent, generalizable, as many different natural language processing tasks can be solved using similar solutions even if their goals vary. / Computational linguistics (natural language processing) is an area of computer science that is not yet fully established. High demand for natural language support in applications creates a need for approaches and tools adapted to engineers.
This project approaches the area from an engineer's point of view to investigate the approaches, tools, and techniques currently available for developing natural language support in applications. The sub-area of information retrieval was examined through a case study, in which prototypes were developed to build a deeper understanding of the tools and techniques used in the area. We concluded that tools and techniques can be categorized into two groups, depending on how far removed the developer is from the underlying processing of the language. A categorization of tools and techniques, together with the source code, documentation, and evaluation of the prototypes, is presented as the result. The choice of approach, techniques, and tools should be based on the requirements and specifications of the finished product. The results of the study are largely generalizable, since solutions to many problems in the area are similar even when the final goals differ. Natural language processing information retrieval voice-control implementation approaches NLP. Computer and Information Sciences
272	Determining Whether and When People Participate in the Events They Tweet About Sanagavarapu, Krishna Chaitanya 05 1900 This work describes an approach to determine whether people participate in the events they tweet about. Specifically, we determine whether people are participants in events with respect to the tweet timestamp. We target all events expressed by verbs in tweets, including past and present events as well as events that may occur in the future. We define event participants as people directly involved in an event, regardless of whether they are the agent, the recipient, or play another role. We present an annotation effort, guidelines, and quality analysis with 1,096 event mentions. We discuss the label distributions and event behavior in the annotated corpus. We also explain the features used and a standard supervised machine learning approach to automatically determine if and when the author is a participant in the event in the tweet. We discuss trends in the results obtained and draw important conclusions. Twitter events author participation machine learning natural language processing social media corpus analysis Computer Science Microblogs. Discourse analysis.
273	On Semantic Cognition, Inductive Generalization, and Language Models Kanishka Misra (9708551) 05 September 2023 Our ability to understand language and perform reasoning crucially relies on a robust system of semantic cognition (G. L. Murphy, 2002; Rogers & McClelland, 2004; Rips et al., 2012; Lake & Murphy, 2021): processes that allow us to learn, update, and produce inferences about everyday concepts (e.g., cat, chair), properties (e.g., has fur, can be sat on), categories (e.g., mammals, furniture), and relations (e.g., is-a, taller-than). Meanwhile, recent progress in the field of natural language processing (NLP) has led to the development of language models (LMs): sophisticated neural networks that are trained to predict words in context (Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020), and as a result build representations that encode the knowledge present in the statistics of their training environment. These models have achieved impressive levels of performance on a range of tasks that require sophisticated semantic knowledge (e.g., question answering and natural language inference), often even reaching human parity. To what extent do LMs capture the nuances of human conceptual knowledge and reasoning? Centering around this broad question, this dissertation uses core ideas in human semantic cognition as guiding principles and lays the groundwork for effective evaluation and improvement of conceptual understanding in LMs.
In particular, I build on prior work that focuses on characterizing what semantic knowledge is made available in the behavior and representations of LMs, and extend it by additionally proposing tests that focus on functional consequences of acquiring basic semantic knowledge.
I primarily focus on inductive generalization (Hayes & Heit, 2018), the unique ability of humans to rely on acquired conceptual knowledge to project or generalize novel information, as a context within which we can analyze LMs’ encoding of conceptual knowledge. I do this because the literature surrounding inductive generalization contains a variety of empirical regularities that map to specific conceptual abstractions and shed light on how humans store, organize, and use conceptual knowledge. Before explicitly analyzing LMs for these empirical regularities, I test them on two other contexts, which also feature the role of inductive generalization. First, I test the extent to which LMs demonstrate typicality effects, a robust finding in the human categorization literature where certain members of a category are considered to be more central to the category than others. Specifically, I test the behavior of 19 different LMs on two contexts where typicality effects modulate human behavior: 1) verification of sentences expressing taxonomic category membership, and 2) projecting novel properties from individual category members to the entire category. In both tests, LMs achieved positive but modest correlations with human typicality ratings, suggesting that they can, to a non-trivial extent, capture subtle differences between category members. Next, I propose a new benchmark to test the robustness of LMs in attributing properties to everyday concepts, and in making inductive leaps to endow properties to novel concepts.
On testing 31 different LMs for these capacities, I find that while they can correctly attribute properties to everyday concepts and even predict the properties of novel concepts in simple settings, they struggle to do so robustly. Combined with the analyses of typicality effects, these results suggest that the ability of LMs to demonstrate impressive conceptual knowledge and reasoning behavior can be explained by their sensitivities to shallow predictive cues. When these cues are carefully controlled for, LMs show critical failures in demonstrating robust conceptual understanding. Finally, I develop a framework that can allow us to characterize the extent to which the distributed representations learned by LMs can encode principles and abstractions that characterize inductive behavior of humans. This framework operationalizes inductive generalization as the behavior of an LM after its representations have been partially exposed (via gradient-based learning) to novel conceptual information. To simulate this behavior, the framework uses LMs that are endowed with human-elicited property knowledge, by training them to evaluate the truth of sentences attributing properties to concepts. I apply this framework to test four different LMs on 13 different inductive phenomena documented for humans (Osherson et al., 1990; Heit & Rubinstein, 1994). Results from these analyses suggest that building representations from word distributions can successfully allow the encoding of many abstract principles that can guide inductive behavior in the models, such as sensitivity to conceptual similarity, hierarchical organization of categories, reasoning about category coverage, and sample size.
At the same time, the tested models also systematically failed at demonstrating certain phenomena, showcasing their inability to demonstrate pragmatic reasoning, preference to rely on shallow statistical cues, and lack of context sensitivity with respect to high-level intuitive theories. Natural language processing Computational linguistics Cognition Language Models Artificial Intelligence Large Language Models Concepts and Categories Inductive Reasoning Machine Learning Natural Language Processing
274	Framtidens cybersäkerhet : en studie om hur Natural Language Processing påverkar dagens cybersäkerhetsarbete / The Future of Cybersecurity : A Study on How Natural Language Processing Impacts Today's Cybersecurity Efforts Grönstedt Söderberg, Olle, Mattsson, Fredrik January 2024 Since the launch of OpenAI's generative chatbot ChatGPT at the end of 2022, interest in artificial intelligence (AI), and specifically Natural Language Processing (NLP), has increased markedly. Through its ability to interpret and generate human language, NLP has already transformed several industries and sparked debate among researchers: some see AI as one of the most significant innovations ever, while others warn that the rapid pace of technological development leads to new and changed risks. This study aims to investigate cybersecurity experts' views on risks related to the use of NLP and its impact on cybersecurity work. Through interviews and surveys, the study has identified several risks that are made more effective by the use of NLP-based services. The survey results show which risks cybersecurity experts rate highest in terms of likelihood and potential harm. The ratings are made with the CIA framework in mind (Confidentiality, Integrity, Availability), a proven security model used to maintain good information security and cybersecurity. The interview results provide insight into the respondents' underlying reasoning and also emphasize the importance of awareness when using NLP-based services. Overall, the study gives the reader an understanding of the risks associated with natural language processing and insight into the factors that cybersecurity experts take into account when assessing these risks. The three risks that the study identified as particularly prominent were spear-phishing, malicious code, and data leaks.
Natural language processing Cybersecurity Artificial intelligence CIA Risks Information Systems, Social aspects
275	Measuring Syntactic Development in L2 Writing: Fine Grained Indices of Syntactic Complexity and Usage-Based Indices of Syntactic Sophistication Kyle, Kristopher 09 May 2016 Syntactic complexity has been an area of significant interest in L2 writing development studies over the past 45 years. Despite the regularity with which syntactic complexity measures have been employed, the construct is still relatively under-developed, and, as a result, the cumulative results of syntactic complexity studies can appear opaque. At least three reasons exist for the current state of affairs, namely the lack of consistency and clarity with which indices of syntactic complexity have been described, the overly broad nature of the indices that have been regularly employed, and the omission of indices that focus on usage-based perspectives. This study seeks to address these three gaps through the development and validation of the Tool for the Automatic Assessment of Syntactic Sophistication and Complexity (TAASSC). TAASSC measures large- and fine-grained clausal and phrasal indices of syntactic complexity and usage-based frequency/contingency indices of syntactic sophistication. Using TAASSC, this study will address L2 writing development in two main ways: through the examination of syntactic development longitudinally and through the examination of human judgments of writing proficiency (e.g., expert ratings of TOEFL essays). This study will have important implications for second language acquisition, second language writing, and language assessment. Second language acquisition Syntactic complexity Writing development Language use Language assessment Natural language processing
276	Efficient algorithms for infinite-state recursive stochastic models and Newton's method Stewart, Alistair Mark January 2015 Some well-studied infinite-state stochastic models give rise to systems of nonlinear equations. These systems of equations have solutions that are probabilities, generally probabilities of termination in the model. We are interested in finding efficient, preferably polynomial time, algorithms for calculating probabilities associated with these models. The chief tool we use to solve systems of polynomial equations will be Newton’s method as suggested by [EY09]. The main contribution of this thesis is to the analysis of this and related algorithms. We give polynomial-time algorithms for calculating probabilities for broad classes of models for which none were known before. Stochastic models that give rise to such systems of equations include such classic and heavily-studied models as Multi-type Branching Processes, Stochastic Context-Free Grammars (SCFGs) and Quasi Birth-Death Processes. We also consider models that give rise to infinite-state Markov Decision Processes (MDPs) by giving algorithms for approximating optimal probabilities and finding policies that give probabilities close to the optimal probability, in several classes of infinite-state MDPs. Our algorithms for analysing infinite-state MDPs rely on a non-trivial generalization of Newton’s method that works for the max/min polynomial systems that arise as Bellman optimality equations in these models. For SCFGs, which are used in statistical natural language processing, in addition to approximating termination probabilities, we analyse algorithms for approximating the probability that a grammar produces a given string, or produces a string in a given regular language. In most cases, we show that we can calculate an approximation to the relevant probability in time polynomial in the size of the model and the number of bits of desired precision.
We also consider more general systems of monotone polynomial equations. For such systems we cannot give a polynomial-time algorithm, which pre-existing hardness results render unlikely, but we can still give an algorithm with a complexity upper bound which is exponential only in some parameters that are likely to be bounded for the monotone polynomial equations that arise for many interesting stochastic models. 512
277	Characterization of Prose by Rhetorical Structure for Machine Learning Classification Java, James 01 January 2015 Measures of classical rhetorical structure in text can improve accuracy in certain types of stylistic classification tasks such as authorship attribution. This research augments the relatively scarce work in the automated identification of rhetorical figures and uses the resulting statistics to characterize an author's rhetorical style. These characterizations of style can then become part of the feature set of various classification models. Our Rhetorica software identifies 14 classical rhetorical figures in free English text, with generally good precision and recall, and provides summary measures to use in descriptive or classification tasks. Classification models trained on Rhetorica's rhetorical measures paired with lexical features typically performed better at authorship attribution than either set of features used individually. The rhetorical measures also provide new stylistic quantities for describing texts, authors, genres, etc. Authorship attribution Machine learning Natural language processing Rhetoric Computer Science Computer Sciences Rhetoric
278	Iterated learning framework for unsupervised part-of-speech induction Christodoulopoulos, Christos January 2013 Computational approaches to linguistic analysis have been used for more than half a century. The main tools come from the field of Natural Language Processing (NLP) and are based on rule-based or corpora-based (supervised) methods. Despite the undeniable success of supervised learning methods in NLP, they have two main drawbacks: on the practical side, it is expensive to produce the manual annotation (or the rules) required and it is not easy to find annotators for less common languages. A theoretical disadvantage is that the computational analysis produced is tied to a specific theory or annotation scheme. Unsupervised methods offer the possibility to expand our analyses into more resource-poor languages, and to move beyond the conventional linguistic theories. They are a way of observing patterns and regularities emerging directly from the data and can provide new linguistic insights. In this thesis I explore unsupervised methods for inducing parts of speech across languages. I discuss the challenges in evaluation of unsupervised learning and at the same time, by looking at the historical evolution of part-of-speech systems, I make the case that the compartmentalised, traditional pipeline approach of NLP is not ideal for the task. I present a generative Bayesian system that makes it easy to incorporate multiple diverse features, spanning different levels of linguistic structure, like morphology, lexical distribution, syntactic dependencies and word alignment information that allow for the examination of cross-linguistic patterns.
I test the system using features provided by unsupervised systems in a pipeline mode (where the output of one system is the input to another) and show that the performance of the baseline (distributional) model increases significantly, reaching and in some cases surpassing the performance of state-of-the-art part-of-speech induction systems. I then turn to the unsupervised systems that provided these sources of information (morphology, dependencies, word alignment) and examine the way that part-of-speech information influences their inference. Having established a bi-directional relationship between each system and my part-of-speech inducer, I describe an iterated learning method, where each component system is trained using the output of the other system in each iteration. The iterated learning method improves the performance of both component systems in each task. Finally, using this iterated learning framework, and by using parts of speech as the central component, I produce chains of linguistic structure induction that combine all the component systems to offer a more holistic view of NLP. To show the potential of this multi-level system, I demonstrate its use ‘in the wild’. I describe the creation of a vastly multilingual parallel corpus based on 100 translations of the Bible in a diverse set of languages. Using the multi-level induction system, I induce cross-lingual clusters, and provide some qualitative results of my approach. I show that it is possible to discover similarities between languages that correspond to ‘hidden’ morphological, syntactic or semantic elements. 006.3
279	Automatic generation of factual questions from video documentaries Skalban, Yvonne January 2013 Questioning sessions are an essential part of teachers’ daily instructional activities. Questions are used to assess students’ knowledge and comprehension and to promote learning. The manual creation of such learning material is a laborious and time-consuming task. Research in Natural Language Processing (NLP) has shown that Question Generation (QG) systems can be used to efficiently create high-quality learning materials to support teachers in their work and students in their learning process. A number of successful QG applications for education and training have been developed, but these focus mainly on supporting reading materials. However, digital technology is always evolving; there is an ever-growing amount of multimedia content available, and more and more delivery methods for audio-visual content are emerging and easily accessible. At the same time, research provides empirical evidence that multimedia use in the classroom has beneficial effects on student learning. Thus, there is a need to investigate whether QG systems can be used to assist teachers in creating assessment materials from these different types of media that are being employed in classrooms. This thesis serves to explore how NLP tools and techniques can be harnessed to generate questions from non-traditional learning materials, in particular videos. A QG framework which allows the generation of factual questions from video documentaries has been developed and a number of evaluations to analyse the quality of the produced questions have been performed. The developed framework uses several readily available NLP tools to generate questions from the subtitles accompanying a video documentary.
The reason for choosing video documentaries is two-fold: firstly, they are frequently used by teachers and secondly, their factual nature lends itself well to question generation, as will be explained within the thesis. The questions generated by the framework can be used as a quick way of testing students’ comprehension of what they have learned from the documentary. As part of this research project, the characteristics of documentary videos and their subtitles were analysed and the methodology has been adapted to be able to exploit these characteristics. An evaluation of the system output by domain experts showed promising results but also revealed that generating even shallow questions is a task which is far from trivial. To this end, the evaluation and subsequent error analysis contribute to the literature by highlighting the challenges QG from documentary videos can face. In a user study, it was investigated whether questions generated automatically by the system developed as part of this thesis and a state-of-the-art system can successfully be used to assist multimedia-based learning. Using a novel evaluation methodology, the feasibility of using a QG system’s output as ‘pre-questions’ was examined, with different types of pre-questions (text-based and with images) used. The psychometric parameters of the automatically generated questions by the two systems and of those generated manually were compared. The results indicate that the presence of pre-questions (preferably with images) improves the performance of test-takers and they highlight that the psychometric parameters of the questions generated by the system are comparable to, if not better than, those of the state-of-the-art system. In another experiment, the productivity of questions in terms of time taken to generate questions manually vs. time taken to post-edit system-generated questions was analysed.
A post-editing tool which allows for the tracking of several statistics such as edit distance measures, editing time, etc., was used. The quality of questions before and after post-editing was also analysed. Not only did the experiments provide quantitative data about automatically and manually generated questions, but qualitative data in the form of user feedback, which provides an insight into how users perceived the quality of questions, was also gathered. 006.35
280	Semantic annotation of Chinese texts with message structures based on HowNet Wong, Ping-wai., 黃炳蔚. January 2007 published_or_final_version / abstract / Humanities / Doctoral / Doctor of Philosophy Chinese language - Semantics. Semantics - Data processing.

Search results