301

Automated Learning of Event Coding Dictionaries for Novel Domains with an Application to Cyberspace

Radford, Benjamin James January 2016
Event data provide high-resolution and high-volume information about political events. From COPDAB to KEDS, GDELT, ICEWS, and PHOENIX, event datasets and the frameworks that produce them have supported a variety of research efforts across fields, including political science. While these datasets are machine-coded from vast amounts of raw text input, they nonetheless require substantial human effort to produce and update sets of required dictionaries. I introduce a novel method for generating large dictionaries appropriate for event-coding given only a small sample dictionary. This technique leverages recent advances in natural language processing and deep learning to greatly reduce the researcher-hours required to go from defining a new domain-of-interest to producing structured event data that describes that domain. An application to cybersecurity is described, and both the generated dictionaries and the resultant event data are examined. The cybersecurity event data are also examined in relation to existing datasets in related domains.
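To make the dictionary-expansion idea above concrete, here is a minimal Python sketch that grows a small seed dictionary by taking nearest neighbors in a pretrained word-embedding space. It illustrates the general technique only, not the dissertation's actual pipeline; the embedding file path and seed terms are assumptions.

```python
# A minimal sketch of embedding-based dictionary expansion, assuming a
# pretrained word2vec-format embedding file; an illustration of the general
# technique, not the dissertation's actual method.
from gensim.models import KeyedVectors

# Hypothetical pretrained embeddings (the path is an assumption).
vectors = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)

# Small seed dictionary of event-coding verbs for a new domain (placeholders).
seed_terms = ["attack", "breach", "hack"]

def expand_dictionary(seeds, topn=20):
    """Collect nearest neighbors of each seed term as candidate dictionary entries."""
    candidates = set()
    for term in seeds:
        if term in vectors:
            for neighbor, _score in vectors.most_similar(term, topn=topn):
                candidates.add(neighbor)
    # Candidates would still be reviewed by a researcher before use.
    return sorted(candidates - set(seeds))

print(expand_dictionary(seed_terms))
```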
302

ASKNet : automatically creating semantic knowledge networks from natural language text

Harrington, Brian January 2009
This thesis details the creation of ASKNet (Automated Semantic Knowledge Network), a system for creating large scale semantic networks from natural language texts. Using ASKNet as an example, we will show that by using existing natural language processing (NLP) tools, combined with a novel use of spreading activation theory, it is possible to efficiently create high quality semantic networks on a scale never before achievable. The ASKNet system takes naturally occurring English texts (e.g., newspaper articles) and processes them using existing NLP tools. It then uses the output of those tools to create semantic network fragments representing the meaning of each sentence in the text. Those fragments are then combined by a spreading activation based algorithm that attempts to decide which portions of the networks refer to the same real-world entity. This allows ASKNet to combine the small fragments together into a single cohesive resource, which has more expressive power than the sum of its parts. Systems aiming to build semantic resources have typically either overlooked information integration completely, or else dismissed it as being AI-complete, and thus unachievable. In this thesis we will show that information integration is both an integral component of any semantic resource, and achievable through a combination of NLP technologies and novel applications of spreading activation theory. While extraction and integration of all knowledge within a text may be AI-complete, we will show that by processing large quantities of text efficiently, we can compensate for minor processing errors and missed relations with volume and creation speed. If relations are too difficult to extract, or we are unsure which nodes should integrate at any given stage, we can simply leave them to be picked up later when we have more information or come across a document which explains the concept more clearly. ASKNet is primarily designed as a proof of concept system. However, this thesis will show that it is capable of creating semantic networks larger than any existing similar resource in a matter of days, and furthermore that the networks it creates are of sufficient quality to be used for real world tasks. We will demonstrate that ASKNet can be used to judge semantic relatedness of words, achieving results comparable to the best state-of-the-art systems.
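As a toy illustration of spreading-activation-based node merging (not ASKNet's implementation), the sketch below propagates activation outward from two candidate nodes in a small hand-built network fragment and scores how similar their activation profiles are; the graph, decay factor, and merge threshold are all assumptions.

```python
# A toy sketch of spreading activation over a semantic-network fragment,
# used to score whether two nodes likely refer to the same entity.
# Illustration of the general idea only; graph, decay, and threshold are assumptions.
from collections import defaultdict

graph = {  # undirected toy network: node -> weighted neighbors
    "Paris": {"France": 1.0, "Eiffel Tower": 0.8},
    "France": {"Paris": 1.0, "Europe": 0.6},
    "Eiffel Tower": {"Paris": 0.8},
    "Europe": {"France": 0.6},
    "capital of France": {"France": 0.9, "Eiffel Tower": 0.5},
}

def spread(source, steps=3, decay=0.5):
    """Propagate activation outward from a source node for a few steps."""
    activation = defaultdict(float)
    activation[source] = 1.0
    for _ in range(steps):
        new = defaultdict(float, activation)
        for node, act in activation.items():
            for neighbor, weight in graph.get(node, {}).items():
                new[neighbor] += act * weight * decay
        activation = new
    return activation

def overlap(a, b):
    """Cosine-style overlap of two activation profiles."""
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    norm = (sum(v * v for v in a.values()) * sum(v * v for v in b.values())) ** 0.5
    return dot / norm if norm else 0.0

score = overlap(spread("Paris"), spread("capital of France"))
print(f"merge score: {score:.2f}")  # merge the nodes if the score clears a threshold
```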
303

The language of humour

Mihalcea, Rada January 2010
Humour is one of the most interesting and puzzling aspects of human behaviour. Despite the attention it has received from fields such as philosophy, linguistics, and psychology, there have been only a few attempts to create computational models for humour recognition and analysis. In this thesis, I use corpus-based approaches to formulate and test hypotheses concerned with the processing of verbal humour. The thesis makes two important contributions. First, it provides empirical evidence that computational approaches can be successfully applied to the task of humour recognition. Through experiments performed on very large data sets, I show that automatic classification techniques can be effectively used to distinguish between humorous and non-humorous texts, using content-based features or models of incongruity. Moreover, using a method for measuring feature saliency, I identify and validate several dominant word classes that can be used to characterize humorous text. Second, the thesis provides corpus-based support toward the validity of previously formulated linguistic theories, indicating that humour is primarily due to incongruity and humour-specific language. Experiments performed on collections of verbal humour show that both incongruity and content-based features can be successfully used to model humour, and that these features are even more effective when used in tandem.
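A minimal sketch of content-based humour classification is shown below, using a bag-of-words Naive Bayes model from scikit-learn. The four example sentences are invented placeholders rather than the corpora used in the thesis, and the feature set is far simpler than the content and incongruity features described above.

```python
# A minimal sketch of content-based humour classification with a bag-of-words
# Naive Bayes model; the example sentences are invented placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "I told my computer I needed a break and it froze.",
    "Why don't scientists trust atoms? They make up everything.",
    "The committee will meet on Tuesday to review the budget.",
    "Interest rates were left unchanged at the quarterly meeting.",
]
labels = [1, 1, 0, 0]  # 1 = humorous, 0 = non-humorous

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["My wallet is like an onion: opening it makes me cry."]))
```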
304

Modeling Synergistic Relationships Between Words and Images

Leong, Chee Wee 12 1900
Texts and images provide alternative, yet orthogonal views of the same underlying cognitive concept. By uncovering synergistic, semantic relationships that exist between words and images, I am working to develop novel techniques that can help improve tasks in natural language processing, as well as effective models for text-to-image synthesis, image retrieval, and automatic image annotation. Specifically, in my dissertation, I will explore the interoperability of features between language and vision tasks. In the first part, I will show how it is possible to apply features generated using evidence gathered from text corpora to solve the image annotation problem in computer vision, without the use of any visual information. In the second part, I will address research in the reverse direction, and show how visual cues can be used to improve tasks in natural language processing. Importantly, I propose a novel metric to estimate the similarity of words by comparing the visual similarity of concepts invoked by these words, and show that it can be used further to advance the state-of-the-art methods that employ corpus-based and knowledge-based semantic similarity measures. Finally, I attempt to construct a joint semantic space connecting words with images, and synthesize an evaluation framework to quantify cross-modal semantic relationships that exist between arbitrary pairs of words and images. I study the effectiveness of unsupervised, corpus-based approaches to automatically derive the semantic relatedness between words and images, and perform empirical evaluations by measuring its correlation with human annotators.
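The following sketch illustrates the general idea of a visually grounded relatedness measure: represent each word by the centroid of image-feature vectors for images associated with that word, then compare centroids by cosine similarity. The feature matrices are random placeholders, and this is not the specific metric developed in the dissertation.

```python
# A sketch of word relatedness estimated from the visual side: each word is
# represented by the centroid of (assumed precomputed) image feature vectors,
# and relatedness is the cosine of the centroids. Feature values below are
# random placeholders, not real image features.
import numpy as np

rng = np.random.default_rng(0)
image_features = {  # word -> matrix of per-image feature vectors (placeholders)
    "cat": rng.random((5, 128)),
    "kitten": rng.random((4, 128)),
    "truck": rng.random((6, 128)),
}

def visual_relatedness(word_a, word_b):
    """Cosine similarity of the mean image-feature vectors of two words."""
    a = image_features[word_a].mean(axis=0)
    b = image_features[word_b].mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(visual_relatedness("cat", "kitten"), visual_relatedness("cat", "truck"))
```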
305

Lexical mechanics: Partitions, mixtures, and context

Williams, Jake Ryland 01 January 2015
Highly structured for efficient communication, natural languages are complex systems. Unlike in their computational cousins, functions and meanings in natural languages are relative, frequently prescribed to symbols through unexpected social processes. Despite grammar and definition, the presence of metaphor can leave unwitting language users "in the dark," so to speak. This is not problematic, but rather an important operational feature of languages, since the lifting of meaning onto higher-order structures allows individuals to compress descriptions of regularly-conveyed information. This compressed terminology, often only appropriate when taken locally (in context), is beneficial in an enormous world of novel experience. However, what is natural for a human to process can be tremendously difficult for a computer.

When a sequence of words (a phrase) is to be taken as a unit, suppose the choice of words in the phrase is subordinate to the choice of the phrase, i.e., there exists an inter-word dependence owed to membership within a common phrase. This word selection process is not one of independent selection, and so is capable of generating word-frequency distributions that are not accessible via independent selection processes. We have shown in Ch. 2 through analysis of thousands of English texts that empirical word-frequency distributions possess these word-dependence anomalies, while phrase-frequency distributions do not. In doing so, this study has also led to the development of a novel, general, and mathematical framework for the generation of frequency data for phrases, opening up the field of mass-preserving mesoscopic lexical analyses.

A common oversight in many studies of the generation and interpretation of language is the assumption that separate discourses are independent. However, even when separate texts are each produced by means of independent word selection, it is possible for their composite distribution of words to exhibit dependence. Succinctly, different texts may use a common word or phrase for different meanings, and so exhibit disproportionate usages when juxtaposed. To support this theory, we have shown in Ch. 3 that the act of combining distinct texts to form large 'corpora' results in word-dependence irregularities. This not only settles a 15-year discussion, challenging the current major theory, but also highlights an important practice necessary for successful computational analysis: the retention of meaningful separations in language.

We must also consider how language speakers and listeners navigate such a combinatorially vast space for meaning. Dictionaries (or, the collective editorial communities behind them) are smart. They know all about the lexical objects they define, but we ask about the latent information they hold, or should hold, about related, undefined objects. Based solely on the text as data, in Ch. 4 we build on our result in Ch. 2 and develop a model of context defined by the structural similarities of phrases. We then apply this model to define measures of meaning in a corpus-guided experiment, computationally detecting entries missing from a massive, collaborative online dictionary known as the Wiktionary.
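As a rough, corpus-free illustration of the contrast between word-frequency and phrase-frequency counts, the sketch below partitions a sample sentence at punctuation and a few function words and counts the resulting chunks; this crude partition rule only hints at the mass-preserving phrase partitions developed in the thesis.

```python
# A small sketch contrasting word counts with counts of simple "phrases"
# obtained by partitioning text at punctuation and a handful of stopwords.
# The partition rule and sample text are placeholders, not the thesis's framework.
import re
from collections import Counter

STOPS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def words(text):
    return re.findall(r"[a-z']+", text.lower())

def phrases(text):
    """Split on punctuation and stopwords; keep the contiguous runs in between."""
    chunks, current = [], []
    for token in re.findall(r"[a-z']+|[.,;:!?]", text.lower()):
        if token in STOPS or re.fullmatch(r"[.,;:!?]", token):
            if current:
                chunks.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        chunks.append(" ".join(current))
    return chunks

sample = "The spotted owl nests in old growth forest, and the spotted owl is protected."
print(Counter(words(sample)).most_common(3))
print(Counter(phrases(sample)).most_common(3))
```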
306

High-Performance Knowledge-Based Entity Extraction

Middleton, Anthony M. 01 January 2009
Human language records most of the information and knowledge produced by organizations and individuals. The machine-based process of analyzing information in natural language form is called natural language processing (NLP). Information extraction (IE) is the process of analyzing machine-readable text and identifying and collecting information about specified types of entities, events, and relationships. Named entity extraction is an area of IE concerned specifically with recognizing and classifying proper names for persons, organizations, and locations from natural language. Extant approaches to the design and implementation of named entity extraction systems include: (a) knowledge-engineering approaches which utilize domain experts to hand-craft NLP rules to recognize and classify named entities; (b) supervised machine-learning approaches in which a previously tagged corpus of named entities is used to train algorithms which incorporate statistical and probabilistic methods for NLP; or (c) hybrid approaches which incorporate aspects of both methods described in (a) and (b). Performance for IE systems is evaluated using the metrics of precision and recall, which measure the accuracy and completeness of the IE task. Previous research has shown that utilizing a large knowledge base of known entities has the potential to improve overall entity extraction precision and recall performance. Although existing methods typically incorporate dictionary-based features, these dictionaries have been limited in size and scope. The problem addressed by this research was the design, implementation, and evaluation of a new high-performance knowledge-based hybrid processing approach and associated algorithms for named entity extraction, combining rule-based natural language parsing and memory-based machine learning classification facilitated by an extensive knowledge base of existing named entities. The hybrid approach implemented by this research resulted in improved precision and recall performance approaching human-level capability compared to existing methods measured using a standard test corpus. The system design incorporated a parallel processing system architecture with capabilities for managing a large knowledge base and providing high throughput potential for processing large collections of natural language text documents.
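The knowledge-base side of such a system can be illustrated with a simple longest-match gazetteer lookup, sketched below. The tiny dictionary stands in for the large knowledge base described above, and the rule-based parsing and memory-based classification stages are omitted.

```python
# A sketch of the knowledge-base side of entity extraction: a greedy
# longest-match lookup of known entity names in tokenized text.
# The tiny gazetteer is a placeholder for a large knowledge base.
GAZETTEER = {
    ("barack", "obama"): "PERSON",
    ("new", "orleans"): "LOCATION",
    ("nova", "southeastern", "university"): "ORGANIZATION",
}
MAX_LEN = max(len(name) for name in GAZETTEER)

def extract_entities(tokens):
    """Greedy longest-match scan over lowercased tokens."""
    entities, i = [], 0
    lowered = [t.lower() for t in tokens]
    while i < len(tokens):
        for span in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            candidate = tuple(lowered[i : i + span])
            if candidate in GAZETTEER:
                entities.append((" ".join(tokens[i : i + span]), GAZETTEER[candidate]))
                i += span
                break
        else:
            i += 1
    return entities

print(extract_entities("Barack Obama spoke in New Orleans .".split()))
```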
307

A Semi-Supervised Information Extraction Framework for Large Redundant Corpora

Normand, Eric 19 December 2008
The vast majority of text freely available on the Internet is not available in a form that computers can understand. There have been numerous approaches to automatically extract information from human-readable sources. The most successful attempts rely on vast training sets of data. Others have succeeded in extracting restricted subsets of the available information. These approaches have limited use and require domain knowledge to be coded into the application. The current thesis proposes a novel framework for Information Extraction. From large sets of documents, the system develops statistical models of the data the user wishes to query, models which generally avoid the limitations and complexity of most Information Extraction systems. The framework uses a semi-supervised approach to minimize human input. It also eliminates the need for external Named Entity Recognition systems by relying on freely available databases. The final result is a query-answering system which extracts information from large corpora with a high degree of accuracy.
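A generic bootstrapping loop in the spirit of semi-supervised information extraction is sketched below: seed (entity, attribute) pairs induce lexical patterns from a toy corpus, and the patterns then extract new pairs. The corpus and seeds are placeholders, and the thesis's statistical models are not reproduced.

```python
# A generic bootstrapping sketch: seed pairs induce simple lexical patterns,
# which then extract new pairs from the corpus. Corpus and seeds are placeholders.
import re

corpus = [
    "Paris is the capital of France.",
    "Ottawa is the capital of Canada.",
    "Canberra is the capital of Australia.",
]
seeds = {("Paris", "France")}

def induce_patterns(sentences, pairs):
    """Turn each sentence containing a seed pair into a regex pattern."""
    patterns = set()
    for s in sentences:
        for x, y in pairs:
            if x in s and y in s:
                pattern = re.escape(s).replace(re.escape(x), r"(\w+)", 1)
                pattern = pattern.replace(re.escape(y), r"(\w+)", 1)
                patterns.add(pattern)
    return patterns

def apply_patterns(sentences, patterns):
    found = set()
    for s in sentences:
        for p in patterns:
            m = re.fullmatch(p, s)
            if m and len(m.groups()) == 2:
                found.add(m.groups())
    return found

patterns = induce_patterns(corpus, seeds)
# Extracts (Ottawa, Canada) and (Canberra, Australia) alongside the seed pair.
print(apply_patterns(corpus, patterns))
```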
308

An empirical study of semantic similarity in WordNet and Word2Vec

Handler, Abram 18 December 2014
This thesis performs an empirical analysis of Word2Vec by comparing its output to WordNet, a well-known, human-curated lexical database. It finds that Word2Vec tends to uncover more of certain types of semantic relations than others, with Word2Vec returning more hypernyms, synonyms, and hyponyms than meronyms or holonyms. It also shows the probability that neighbors separated by a given cosine distance in Word2Vec are semantically related in WordNet. This result both adds to our understanding of the still poorly understood Word2Vec and helps to benchmark new semantic tools built from word vectors.
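A sketch of the kind of comparison described above: take a word's nearest neighbors in an embedding space and check whether each neighbor stands in a WordNet relation (synonym, hypernym, or hyponym) to the query word. The embedding file path is an assumption, and the snippet illustrates the idea rather than the thesis's exact experimental protocol.

```python
# A sketch comparing embedding neighbors against WordNet relations.
# The embedding path is an assumption; NLTK's WordNet data must be installed.
from gensim.models import KeyedVectors
from nltk.corpus import wordnet as wn

vectors = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)

def wordnet_related(a, b):
    """True if b appears among a's synonyms, hypernyms, or hyponyms in WordNet."""
    for synset in wn.synsets(a):
        related = set(synset.lemma_names())
        for rel in synset.hypernyms() + synset.hyponyms():
            related.update(rel.lemma_names())
        if b in related:
            return True
    return False

query = "dog"
for neighbor, cosine in vectors.most_similar(query, topn=10):
    print(f"{neighbor:15s} cosine={cosine:.3f} wordnet={wordnet_related(query, neighbor)}")
```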
309

New Methods for Large-Scale Analyses of Social Identities and Stereotypes

Joseph, Kenneth 01 June 2016
Social identities, the labels we use to describe ourselves and others, carry with them stereotypes that have significant impacts on our social lives. Our stereotypes, sometimes without us knowing, guide our decisions on whom to talk to and whom to stay away from, whom to befriend and whom to bully, whom to treat with reverence and whom to view with disgust. Despite these impacts of identities and stereotypes on our lives, existing methods used to understand them are lacking. In this thesis, I first develop three novel computational tools that further our ability to test and utilize existing social theory on identity and stereotypes. These tools include a method to extract identities from Twitter data, a method to infer affective stereotypes from newspaper data and a method to infer both affective and semantic stereotypes from Twitter data. Case studies using these methods provide insights into Twitter data relevant to the Eric Garner and Michael Brown tragedies and both Twitter and newspaper data from the “Arab Spring”. Results from these case studies motivate the need for not only new methods for existing theory, but new social theory as well. To this end, I develop a new sociotheoretic model of identity labeling - how we choose which label to apply to others in a particular situation. The model combines data, methods and theory from the social sciences and machine learning, providing an important example of the surprisingly rich interconnections between these fields.
310

Automatic text summarization of Swedish news articles

Lehto, Niko, Sjödin, Mikael January 2019
With an increasing amount of textual information available, there is also an increased need to make this information more accessible. Our paper describes a modified TextRank model and investigates the different methods available to use automatic text summarization as a means for summary creation of Swedish news articles. To evaluate our model we focused on intrinsic evaluation methods, in part through content evaluation in the form of measuring referential clarity and non-redundancy, and in part through text quality evaluation measures in the form of keyword retention and ROUGE evaluation. The results acquired indicate that stemming and improved stop word capabilities can have a positive effect on the ROUGE scores. The addition of redundancy checks also seems to have a positive effect on avoiding repetition of information. Keyword retention decreased somewhat, however. Lastly, all methods had some trouble with dangling anaphora, showing a need for further work within anaphora resolution.
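A compact TextRank-style extractive summarizer is sketched below: sentences are graph nodes, edge weights come from word overlap, and PageRank scores computed by power iteration select the summary sentences. The Swedish-specific stemming, stop-word handling, redundancy checks, and ROUGE evaluation discussed above are omitted, and the sample article is a placeholder.

```python
# A compact TextRank-style sketch: sentences are nodes, edge weights are
# word-overlap similarity, and PageRank scores rank sentences for the summary.
import re
import numpy as np

def split_sentences(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def similarity(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / (np.log(len(wa) + 1) + np.log(len(wb) + 1))

def summarize(text, n_sentences=2, damping=0.85, iterations=50):
    sents = split_sentences(text)
    k = len(sents)
    weights = np.array([[similarity(a, b) if a != b else 0.0 for b in sents] for a in sents])
    row_sums = weights.sum(axis=1, keepdims=True)
    transition = np.divide(weights, row_sums,
                           out=np.full_like(weights, 1.0 / k), where=row_sums > 0)
    scores = np.full(k, 1.0 / k)
    for _ in range(iterations):  # power iteration for PageRank
        scores = (1 - damping) / k + damping * transition.T @ scores
    top = sorted(np.argsort(scores)[-n_sentences:])
    return " ".join(sents[i] for i in top)

article = ("Regeringen presenterade en ny budget under onsdagen. "
           "Budgeten innehåller satsningar på skola och vård. "
           "Oppositionen kritiserade satsningarna på vård och skola. "
           "Debatten i riksdagen väntas fortsätta under veckan.")
print(summarize(article))
```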
