291

Natural language processing of online propaganda as a means of passively monitoring an adversarial ideology

Holm, Raven R.
Approved for public release; distribution is unlimited.

Reissued 30 May 2017 with Second Reader's non-NPS affiliation added to title page.

Online propaganda embodies a potent new form of warfare: one that extends the strategic reach of our adversaries and overwhelms analysts. Foreign organizations have effectively leveraged an online presence to influence elections and distance-recruit. The Islamic State has also shown proficiency in outsourcing violence, proving that propaganda can enable an organization to wage physical war at very little cost and without the resources traditionally required. To augment new counter-foreign-propaganda initiatives, this thesis presents a pipeline for defining, detecting, and monitoring ideology in text. A corpus of 3,049 modern online texts was assembled and two classifiers were created: one for detecting authorship and another for detecting ideology. The classifiers demonstrated 92.70% recall and 95.84% precision in detecting authorship, and detected ideological content with 76.53% recall and 95.61% precision. Both classifiers were combined to simulate how an ideology can be detected and how its composition could be passively monitored across time. Implementation of such a system could conserve manpower in the intelligence community and add a new dimension to analysis. Although this pipeline makes presumptions about the quality and integrity of its input, it is a novel contribution to the fields of Natural Language Processing and Information Warfare.

Lieutenant, United States Coast Guard
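The two-stage design described above can be illustrated with a brief, hypothetical sketch (not the thesis's actual models or data): one classifier flags likely authorship, a second scores flagged documents for ideological content, and precision and recall are computed on labelled examples.

```python
# Illustrative sketch only: tiny stand-in corpus, off-the-shelf TF-IDF + logistic
# regression; the thesis corpus had 3,049 texts and its own feature design.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.pipeline import make_pipeline

texts = ["propaganda appeal text", "neutral news report", "recruiting message", "sports summary"]
author_labels = [1, 0, 1, 0]    # 1 = adversarial outlet, 0 = other
ideology_labels = [1, 0, 1, 0]  # 1 = ideological content, 0 = not

authorship_clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, author_labels)
ideology_clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, ideology_labels)

# Stage 1: flag documents by likely authorship; Stage 2: score the flagged ones
# for ideology, which is how a stream could be passively monitored over time.
stream = ["new online post", "another document"]
flagged = [d for d, a in zip(stream, authorship_clf.predict(stream)) if a == 1]
if flagged:
    print(ideology_clf.predict(flagged))

# Evaluation mirrors the reported metrics (precision and recall).
pred = ideology_clf.predict(texts)
print(precision_score(ideology_labels, pred), recall_score(ideology_labels, pred))
```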
292

Disentangling Discourse: Networks, Entropy, and Social Movements

Gallagher, Ryan 01 January 2017
Our daily online conversations with friends, family, colleagues, and strangers weave an intricate network of interactions. From these networked discussions emerge themes and topics that transcend the scope of any individual conversation. In turn, these themes direct the discourse of the network and continue to ebb and flow as the interactions between individuals shape the topics themselves. This rich loop between interpersonal conversations and overarching topics is a wonderful example of a complex system: the themes of a discussion are more than just the sum of its parts. Some of the most socially relevant topics emerging from these online conversations are those pertaining to racial justice issues. Since the shooting of Black teenager Michael Brown by White police officer Darren Wilson in Ferguson, Missouri, the protest hashtag #BlackLivesMatter has amplified critiques of extrajudicial shootings of Black Americans. In response to #BlackLivesMatter, other online users have adopted #AllLivesMatter, a counter-protest hashtag whose content argues that equal attention should be given to all lives regardless of race. Together these contentious hashtags each shape clashing narratives that echo previous civil rights battles and illustrate ongoing racial tension between police officers and Black Americans. These narratives have taken place on a massive scale with millions of online posts and articles debating the sentiments of "black lives matter" and "all lives matter." Since no one person could possibly read everything written in this debate, comprehensively understanding these conversations and their underlying networks requires us to leverage tools from data science, machine learning, and natural language processing. In Chapter 2, we utilize methodology from network science to measure to what extent #BlackLivesMatter and #AllLivesMatter are "slacktivist" movements, and the effect this has on the diversity of topics discussed within these hashtags. In Chapter 3, we precisely quantify the ways in which the discourse of #BlackLivesMatter and #AllLivesMatter diverge through the application of information-theoretic techniques, validating our results at the topic level from Chapter 2. These entropy-based approaches provide the foundation for powerful automated analysis of textual data, and we explore more generally how they can be used to construct a human-in-the-loop topic model in Chapter 4. Our work demonstrates that there is rich potential for weaving together social science domain knowledge with computational tools in the study of language, networks, and social movements.
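One way to make the information-theoretic comparison of Chapter 3 concrete is a small sketch of Jensen-Shannon divergence between the word distributions of two hashtag corpora. The example word counts below are invented placeholders, not the study's data, and the generic divergence measure stands in for the thesis's specific techniques.

```python
# Rough sketch: quantify how the word usage of two corpora diverges.
from collections import Counter
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word-frequency Counters."""
    vocab = set(p) | set(q)
    def prob(c, w): return c.get(w, 0) / sum(c.values())
    P = {w: prob(p, w) for w in vocab}
    Q = {w: prob(q, w) for w in vocab}
    M = {w: 0.5 * (P[w] + Q[w]) for w in vocab}
    def kl(a, b): return sum(a[w] * math.log2(a[w] / b[w]) for w in vocab if a[w] > 0)
    return 0.5 * kl(P, M) + 0.5 * kl(Q, M)

# Placeholder samples standing in for the two hashtag corpora.
blm = Counter("justice for black lives matter police accountability".split())
alm = Counter("all lives matter equal attention for everyone".split())
print(js_divergence(blm, alm))   # 0 = identical usage, 1 = disjoint vocabularies
```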
293

Exploration des réseaux de neurones à base d'autoencodeur dans le cadre de la modélisation des données textuelles

Lauly, Stanislas January 2016
Since the mid-2000s, a new approach in machine learning, deep learning, has been gaining popularity. This approach has demonstrated its effectiveness at solving various problems by improving on results obtained by techniques that were then considered the state of the art, as has been the case in object recognition and in speech recognition. Given this, applying deep networks to Natural Language Processing (NLP) is a logical next step. This thesis explores different neural network structures for modelling written text, focusing on models that are simple, powerful, and fast to train.
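As a rough illustration of the kind of model explored (not the thesis's actual architectures), the sketch below trains a small autoencoder over bag-of-words document vectors in PyTorch; the vocabulary size, code size, and random stand-in documents are all assumptions.

```python
# Minimal autoencoder sketch for text: encode bag-of-words vectors into a
# low-dimensional code and reconstruct them, trained with a reconstruction loss.
import torch
import torch.nn as nn

vocab_size, code_size = 2000, 50

model = nn.Sequential(
    nn.Linear(vocab_size, code_size), nn.Sigmoid(),   # encoder: document -> code
    nn.Linear(code_size, vocab_size),                 # decoder: code -> reconstruction
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()                      # binary bag-of-words targets

docs = torch.bernoulli(torch.full((64, vocab_size), 0.01))   # stand-in documents
for _ in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(docs), docs)
    loss.backward()
    optimizer.step()

codes = model[:2](docs)   # low-dimensional representations of each document
```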
294

Automated Learning of Event Coding Dictionaries for Novel Domains with an Application to Cyberspace

Radford, Benjamin James January 2016
Event data provide high-resolution and high-volume information about political events. From COPDAB to KEDS, GDELT, ICEWS, and PHOENIX, event datasets and the frameworks that produce them have supported a variety of research efforts across fields, including political science. While these datasets are machine-coded from vast amounts of raw text input, they nonetheless require substantial human effort to produce and update the sets of required dictionaries. I introduce a novel method for generating large dictionaries appropriate for event coding given only a small sample dictionary. This technique leverages recent advances in natural language processing and deep learning to greatly reduce the researcher-hours required to go from defining a new domain of interest to producing structured event data that describe that domain. An application to cybersecurity is described, and both the generated dictionaries and the resultant event data are examined. The cybersecurity event data are also examined in relation to existing datasets in related domains.
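A hedged sketch of the general dictionary-expansion idea follows: grow a small seed dictionary by taking nearest neighbours in a pretrained word-embedding space. The embedding file name and seed terms are placeholders, and the thesis's actual method may differ in its details.

```python
# Expand a seed dictionary using embedding-space nearest neighbours (gensim).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)  # hypothetical path

seed_dictionary = ["malware", "phishing", "botnet"]   # tiny cyber-domain seed
expanded = set(seed_dictionary)
for term in seed_dictionary:
    if term in vectors:
        # add the closest terms in embedding space as candidate dictionary entries
        expanded.update(w for w, _ in vectors.most_similar(term, topn=20))

print(sorted(expanded))  # candidates would still be reviewed before event coding
```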
295

ASKNet: automatically creating semantic knowledge networks from natural language text

Harrington, Brian January 2009
This thesis details the creation of ASKNet (Automated Semantic Knowledge Network), a system for creating large-scale semantic networks from natural language texts. Using ASKNet as an example, we will show that by using existing natural language processing (NLP) tools, combined with a novel use of spreading activation theory, it is possible to efficiently create high-quality semantic networks on a scale never before achievable. The ASKNet system takes naturally occurring English texts (e.g., newspaper articles) and processes them using existing NLP tools. It then uses the output of those tools to create semantic network fragments representing the meaning of each sentence in the text. Those fragments are then combined by a spreading-activation-based algorithm that attempts to decide which portions of the networks refer to the same real-world entity. This allows ASKNet to combine the small fragments into a single cohesive resource, which has more expressive power than the sum of its parts. Systems aiming to build semantic resources have typically either overlooked information integration completely, or else dismissed it as being AI-complete, and thus unachievable. In this thesis we will show that information integration is both an integral component of any semantic resource and achievable through a combination of NLP technologies and novel applications of spreading activation theory. While extraction and integration of all knowledge within a text may be AI-complete, we will show that by processing large quantities of text efficiently, we can compensate for minor processing errors and missed relations with volume and creation speed. If relations are too difficult to extract, or we are unsure which nodes should integrate at any given stage, we can simply leave them to be picked up later when we have more information or come across a document which explains the concept more clearly. ASKNet is primarily designed as a proof-of-concept system. However, this thesis will show that it is capable of creating semantic networks larger than any existing similar resource in a matter of days, and furthermore that the networks it creates are of sufficient quality to be used for real-world tasks. We will demonstrate that ASKNet can be used to judge the semantic relatedness of words, achieving results comparable to the best state-of-the-art systems.
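The spreading-activation component can be sketched on a toy graph: activation starts at a source node and decays as it propagates, and the resulting scores suggest how strongly other nodes relate to the source. The graph, decay rate, and step count below are invented for illustration and are not ASKNet's actual algorithm.

```python
# Toy spreading-activation sketch over a small semantic network.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("bank", "money"), ("money", "loan"),
                  ("bank", "river"), ("river", "water")])

def spread_activation(graph, source, decay=0.5, steps=3):
    activation = {source: 1.0}
    frontier = {source}
    for _ in range(steps):
        next_frontier = set()
        for node in frontier:
            for neighbour in graph.neighbors(node):
                gained = activation[node] * decay
                if gained > activation.get(neighbour, 0.0):
                    activation[neighbour] = gained
                    next_frontier.add(neighbour)
        frontier = next_frontier
    return activation

print(spread_activation(G, "bank"))   # higher values = more strongly related nodes
```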
296

The language of humour

Mihalcea, Rada January 2010
Humour is one of the most interesting and puzzling aspects of human behaviour. Despite the attention it has received from fields such as philosophy, linguistics, and psychology, there have been only a few attempts to create computational models for humour recognition and analysis. In this thesis, I use corpus-based approaches to formulate and test hypotheses concerned with the processing of verbal humour. The thesis makes two important contributions. First, it brings empirical evidence that computational approaches can be successfully applied to the task of humour recognition. Through experiments performed on very large data sets, I show that automatic classification techniques can be effectively used to distinguish between humorous and non-humorous texts, using content-based features or models of incongruity. Moreover, using a method for measuring feature saliency, I identify and validate several dominant word classes that can be used to characterize humorous text. Second, the thesis provides corpus-based support toward the validity of previously formulated linguistic theories, indicating that humour is primarily due to incongruity and humour-specific language. Experiments performed on collections of verbal humour show that both incongruity and content-based features can be successfully used to model humour, and that these features are even more effective when used in tandem.
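As a minimal illustration of feature saliency (not the thesis's actual measure), the sketch below scores words by a smoothed log-odds ratio between a "humorous" and a "non-humorous" sample; both samples are invented placeholders.

```python
# Score how characteristic each word is of the humorous sample vs. the serious one.
from collections import Counter
import math

humorous = Counter("why did the chicken cross the road to get to the other side".split())
serious = Counter("the committee reviewed the quarterly report on road maintenance".split())
vocab = set(humorous) | set(serious)

def log_odds(word, a, b, alpha=0.5):
    # add-alpha smoothing so unseen words do not produce log(0) or division by zero
    pa = (a[word] + alpha) / (sum(a.values()) + alpha * len(vocab))
    pb = (b[word] + alpha) / (sum(b.values()) + alpha * len(vocab))
    return math.log(pa / pb)

salient = sorted(vocab, key=lambda w: log_odds(w, humorous, serious), reverse=True)
print(salient[:5])   # words most characteristic of the humorous sample
```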
297

Modeling Synergistic Relationships Between Words and Images

Leong, Chee Wee
Texts and images provide alternative, yet orthogonal views of the same underlying cognitive concept. By uncovering synergistic, semantic relationships that exist between words and images, I am working to develop novel techniques that can help improve tasks in natural language processing, as well as effective models for text-to-image synthesis, image retrieval, and automatic image annotation. Specifically, in my dissertation, I will explore the interoperability of features between language and vision tasks. In the first part, I will show how it is possible to apply features generated using evidence gathered from text corpora to solve the image annotation problem in computer vision, without the use of any visual information. In the second part, I will address research in the reverse direction, and show how visual cues can be used to improve tasks in natural language processing. Importantly, I propose a novel metric to estimate the similarity of words by comparing the visual similarity of concepts invoked by these words, and show that it can be used further to advance the state-of-the-art methods that employ corpus-based and knowledge-based semantic similarity measures. Finally, I attempt to construct a joint semantic space connecting words with images, and synthesize an evaluation framework to quantify cross-modal semantic relationships that exist between arbitrary pairs of words and images. I study the effectiveness of unsupervised, corpus-based approaches to automatically derive the semantic relatedness between words and images, and perform empirical evaluations by measuring its correlation with human annotators.
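The evaluation step described at the end can be sketched as follows: compute a similarity measure (here, cosine similarity over made-up "visual" concept vectors) for word pairs and correlate it with human relatedness judgements using Spearman's rho. All vectors and ratings below are illustrative assumptions.

```python
# Correlate a model-derived similarity score with human relatedness judgements.
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical visual feature vectors for the concepts invoked by each word pair.
pairs = {("car", "truck"):  (np.array([0.9, 0.1, 0.3]), np.array([0.8, 0.2, 0.4])),
         ("car", "banana"): (np.array([0.9, 0.1, 0.3]), np.array([0.1, 0.9, 0.7])),
         ("cat", "dog"):    (np.array([0.2, 0.8, 0.5]), np.array([0.3, 0.7, 0.6]))}
human_ratings = [9.0, 1.5, 8.0]   # stand-in annotator scores for the same pairs

model_scores = [cosine(u, v) for u, v in pairs.values()]
rho, _ = spearmanr(model_scores, human_ratings)
print(rho)   # agreement between the visual-similarity measure and human judgements
```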
298

Lexical mechanics: Partitions, mixtures, and context

Williams, Jake Ryland 01 January 2015
Highly structured for efficient communication, natural languages are complex systems. Unlike in their computational cousins, functions and meanings in natural languages are relative, frequently prescribed to symbols through unexpected social processes. Despite grammar and definition, the presence of metaphor can leave unwitting language users "in the dark," so to speak. This is not problematic, but rather an important operational feature of languages, since the lifting of meaning onto higher-order structures allows individuals to compress descriptions of regularly-conveyed information. This compressed terminology, often only appropriate when taken locally (in context), is beneficial in an enormous world of novel experience. However, what is natural for a human to process can be tremendously difficult for a computer. When a sequence of words (a phrase) is to be taken as a unit, suppose the choice of words in the phrase is subordinate to the choice of the phrase, i.e., there exists an inter-word dependence owed to membership within a common phrase. This word selection process is not one of independent selection, and so is capable of generating word-frequency distributions that are not accessible via independent selection processes. We have shown in Ch. 2 through analysis of thousands of English texts that empirical word-frequency distributions possess these word-dependence anomalies, while phrase-frequency distributions do not. In doing so, this study has also led to the development of a novel, general, and mathematical framework for the generation of frequency data for phrases, opening up the field of mass-preserving mesoscopic lexical analyses. A common oversight in many studies of the generation and interpretation of language is the assumption that separate discourses are independent. However, even when separate texts are each produced by means of independent word selection, it is possible for their composite distribution of words to exhibit dependence. Succinctly, different texts may use a common word or phrase for different meanings, and so exhibit disproportionate usages when juxtaposed. To support this theory, we have shown in Ch. 3 that the act of combining distinct texts to form large 'corpora' results in word-dependence irregularities. This not only settles a 15-year discussion, challenging the current major theory, but also highlights an important practice necessary for successful computational analysis---the retention of meaningful separations in language. We must also consider how language speakers and listeners navigate such a combinatorially vast space for meaning. Dictionaries (or, the collective editorial communities behind them) are smart. They know all about the lexical objects they define, but we ask about the latent information they hold, or should hold, about related, undefined objects. Based solely on the text as data, in Ch. 4 we build on our result in Ch. 2 and develop a model of context defined by the structural similarities of phrases. We then apply this model to define measures of meaning in a corpus-guided experiment, computationally detecting entries missing from a massive, collaborative online dictionary known as the Wiktionary.
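In the spirit of the Chapter 2 analysis, a compact sketch of comparing word-level and phrase-level rank-frequency distributions is given below, using bigrams as a crude stand-in for the thesis's phrase partitioning and a toy string in place of the thousands of texts studied.

```python
# Compare word and bigram rank-frequency distributions for a toy text.
from collections import Counter

text = "the quick brown fox jumps over the lazy dog the quick fox".split()

word_counts = Counter(text)
bigram_counts = Counter(zip(text, text[1:]))   # crude stand-in for phrase partitioning

for name, counts in [("words", word_counts), ("bigrams", bigram_counts)]:
    print(name, counts.most_common(3))   # top of each rank-frequency distribution
```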
299

High-Performance Knowledge-Based Entity Extraction

Middleton, Anthony M. 01 January 2009
Human language records most of the information and knowledge produced by organizations and individuals. The machine-based process of analyzing information in natural language form is called natural language processing (NLP). Information extraction (IE) is the process of analyzing machine-readable text and identifying and collecting information about specified types of entities, events, and relationships. Named entity extraction is an area of IE concerned specifically with recognizing and classifying proper names for persons, organizations, and locations from natural language. Extant approaches to the design and implementation of named entity extraction systems include: (a) knowledge-engineering approaches, which utilize domain experts to hand-craft NLP rules to recognize and classify named entities; (b) supervised machine-learning approaches, in which a previously tagged corpus of named entities is used to train algorithms that incorporate statistical and probabilistic methods for NLP; or (c) hybrid approaches, which incorporate aspects of both methods described in (a) and (b). Performance of IE systems is evaluated using the metrics of precision and recall, which measure the accuracy and completeness of the IE task. Previous research has shown that utilizing a large knowledge base of known entities has the potential to improve overall entity extraction precision and recall performance. Although existing methods typically incorporate dictionary-based features, these dictionaries have been limited in size and scope. The problem addressed by this research was the design, implementation, and evaluation of a new high-performance knowledge-based hybrid processing approach and associated algorithms for named entity extraction, combining rule-based natural language parsing and memory-based machine learning classification facilitated by an extensive knowledge base of existing named entities. The hybrid approach implemented by this research resulted in improved precision and recall performance approaching human-level capability compared to existing methods, measured using a standard test corpus. The system design incorporated a parallel processing architecture with capabilities for managing a large knowledge base and providing high throughput potential for processing large collections of natural language text documents.
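A minimal sketch of the dictionary-based side of entity extraction, together with the precision/recall evaluation mentioned above, follows; the knowledge base here is a toy list, whereas the thesis combined an extensive knowledge base with rule-based parsing and memory-based classification.

```python
# Toy dictionary lookup for entity extraction plus precision/recall scoring.
knowledge_base = {"acme corp": "ORGANIZATION", "john smith": "PERSON", "paris": "LOCATION"}

def extract(text, kb):
    lowered = text.lower()
    # return every knowledge-base entry found as a substring of the text
    return [(entity, label) for entity, label in kb.items() if entity in lowered]

gold = {("acme corp", "ORGANIZATION"), ("paris", "LOCATION")}
predicted = set(extract("Acme Corp opened an office in Paris last year.", knowledge_base))

tp = len(predicted & gold)
precision = tp / len(predicted) if predicted else 0.0
recall = tp / len(gold) if gold else 0.0
print(precision, recall)
```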
300

A Semi-Supervised Information Extraction Framework for Large Redundant Corpora

Normand, Eric 19 December 2008
The vast majority of text freely available on the Internet is not available in a form that computers can understand. There have been numerous approaches to automatically extract information from human-readable sources. The most successful attempts rely on vast training sets of data. Others have succeeded in extracting restricted subsets of the available information. These approaches have limited use and require domain knowledge to be coded into the application. The current thesis proposes a novel framework for Information Extraction. From large sets of documents, the system develops statistical models of the data the user wishes to query, which generally avoid the limitations and complexity of most Information Extraction systems. The framework uses a semi-supervised approach to minimize human input. It also eliminates the need for external Named Entity Recognition systems by relying on freely available databases. The final result is a query-answering system which extracts information from large corpora with a high degree of accuracy.
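The semi-supervised, redundancy-driven idea can be sketched as a simple bootstrapping loop: start from a few seed facts, learn textual patterns that express them, and apply those patterns to harvest new candidates. The corpus, seeds, and pattern format below are invented for illustration and are not the thesis's framework.

```python
# Toy bootstrapping sketch: seed facts -> textual patterns -> new candidate facts.
import re

corpus = ["Paris is the capital of France.",
          "Tokyo is the capital of Japan.",
          "Berlin remains the capital of Germany."]
seeds = {("Paris", "France")}

# Learn simple patterns from sentences containing a seed pair.
patterns = set()
for sentence in corpus:
    for x, y in seeds:
        if x in sentence and y in sentence:
            patterns.add(sentence.replace(x, "{X}").replace(y, "{Y}"))

# Turn each pattern into a regex and apply it to the rest of the corpus.
harvested = set(seeds)
for pattern in patterns:
    regex = re.escape(pattern).replace(r"\{X\}", r"(\w+)").replace(r"\{Y\}", r"(\w+)")
    for sentence in corpus:
        match = re.match(regex, sentence)
        if match:
            harvested.add(match.groups())
print(harvested)   # redundancy across documents supplies new (capital, country) pairs
```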
