311
The language of humour. Mihalcea, Rada (January 2010).
Humour is one of the most interesting and puzzling aspects of human behaviour. Despite the attention it has received from fields such as philosophy, linguistics, and psychology, there have been only a few attempts to create computational models for humour recognition and analysis. In this thesis, I use corpus-based approaches to formulate and test hypotheses concerned with the processing of verbal humour. The thesis makes two important contributions. First, it provides empirical evidence that computational approaches can be successfully applied to the task of humour recognition. Through experiments performed on very large data sets, I show that automatic classification techniques can be effectively used to distinguish between humorous and non-humorous texts, using content-based features or models of incongruity. Moreover, using a method for measuring feature saliency, I identify and validate several dominant word classes that can be used to characterize humorous text. Second, the thesis provides corpus-based support for the validity of previously formulated linguistic theories, indicating that humour is primarily due to incongruity and humour-specific language. Experiments performed on collections of verbal humour show that both incongruity and content-based features can be successfully used to model humour, and that these features are even more effective when used in tandem.
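As an illustrative aside (not part of the abstract above), a minimal sketch of a content-based humour/non-humour classifier of the kind such corpus experiments rely on. The toy texts, labels, and model choice are placeholders, not the thesis's actual system:

```python
# Minimal sketch: bag-of-words content features feeding a simple classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "I used to be a banker, but I lost interest.",                  # humorous one-liner
    "The committee will meet on Tuesday at noon.",                  # non-humorous
    "Why don't scientists trust atoms? They make up everything.",   # humorous one-liner
    "The report summarizes quarterly revenue figures.",             # non-humorous
]
labels = [1, 0, 1, 0]  # 1 = humorous, 0 = non-humorous

# TF-IDF over unigrams and bigrams stands in for the content-based features.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["He told a chemistry joke, but got no reaction."]))
```

On realistic data the feature set and classifier would need far more care, but the overall pipeline shape stays the same.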
312
Lexical mechanics: Partitions, mixtures, and context. Williams, Jake Ryland (01 January 2015).
Highly structured for efficient communication, natural languages are complex systems. Unlike in their computational cousins, functions and meanings in natural languages are relative, frequently prescribed to symbols through unexpected social processes. Despite grammar and definition, the presence of metaphor can leave unwitting language users "in the dark," so to speak. This is not problematic, but rather an important operational feature of languages, since the lifting of meaning onto higher-order structures allows individuals to compress descriptions of regularly-conveyed information. This compressed terminology, often only appropriate when taken locally (in context), is beneficial in an enormous world of novel experience. However, what is natural for a human to process can be tremendously difficult for a computer. When a sequence of words (a phrase) is to be taken as a unit, suppose the choice of words in the phrase is subordinate to the choice of the phrase, i.e., there exists an inter-word dependence owed to membership within a common phrase. This word selection process is not one of independent selection, and so is capable of generating word-frequency distributions that are not accessible via independent selection processes. We have shown in Ch. 2 through analysis of thousands of English texts that empirical word-frequency distributions possess these word-dependence anomalies, while phrase-frequency distributions do not. In doing so, this study has also led to the development of a novel, general, and mathematical framework for the generation of frequency data for phrases, opening up the field of mass-preserving mesoscopic lexical analyses. A common oversight in many studies of the generation and interpretation of language is the assumption that separate discourses are independent. However, even when separate texts are each produced by means of independent word selection, it is possible for their composite distribution of words to exhibit dependence. Succinctly, different texts may use a common word or phrase for different meanings, and so exhibit disproportionate usages when juxtaposed. To support this theory, we have shown in Ch. 3 that the act of combining distinct texts to form large 'corpora' results in word-dependence irregularities. This not only settles a 15-year discussion, challenging the current major theory, but also highlights an important practice necessary for successful computational analysis---the retention of meaningful separations in language. We must also consider how language speakers and listeners navigate such a combinatorially vast space for meaning. Dictionaries (or, the collective editorial communities behind them) are smart. They know all about the lexical objects they define, but we ask about the latent information they hold, or should hold, about related, undefined objects. Based solely on the text as data, in Ch. 4 we build on our result in Ch. 2 and develop a model of context defined by the structural similarities of phrases. We then apply this model to define measures of meaning in a corpus-guided experiment, computationally detecting entries missing from a massive, collaborative online dictionary known as the Wiktionary.
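As a hedged illustration of the word-versus-phrase contrast discussed above (not the thesis's mathematical framework), the sketch below computes a word-frequency distribution and a phrase-frequency distribution obtained by randomly partitioning a toy text into short multi-word chunks:

```python
# Illustrative sketch: word frequencies vs. frequencies of randomly partitioned phrases.
import random
from collections import Counter

text = ("the cat sat on the mat and the dog sat on the rug "
        "while the cat watched the dog").split()

word_freq = Counter(text)

random.seed(0)
phrases, i = [], 0
while i < len(text):
    k = random.randint(1, 3)                 # toy partition: chunks of 1-3 words
    phrases.append(" ".join(text[i:i + k]))
    i += k
phrase_freq = Counter(phrases)

print(word_freq.most_common(3))
print(phrase_freq.most_common(3))
```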
313
Study of the ambiguity of queries in a news search engine: exploitation of contextual clues. Lalleman, Fanny (26 November 2013).
In this thesis, we consider the ambiguity of queries submitted to a search engine in a particular domain, namely news. We build on recent work in information retrieval (IR) showing that contextual information helps to identify and address the information need more adequately. On this basis, we hypothesize that the elements of information available in an IR application (contexts present in the document collection, repetitions and reformulations of queries, the diachronic dimension of the search) can help us examine this problem of ambiguity. We also postulate that ambiguity will manifest itself in the results returned by a search engine. To evaluate these hypotheses, we set up a framework for studying query ambiguity based on a method of thematic categorization of queries, which relies on an expert categorization. We then show that this ambiguity differs from the ambiguity identified by an encyclopedic resource such as Wikipedia. We evaluate this categorization framework with two user tests. Finally, we carry out a study based on a set of contextual clues in order to capture the overall behaviour of a query.
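One simple way to make the idea that "ambiguity manifests in the results" concrete is to measure how spread out the thematic categories of a query's top results are. The sketch below is only an illustration under that assumption; the categories and queries are invented and this is not the categorization framework of the thesis:

```python
# Illustrative sketch: entropy of the thematic categories of a query's top results
# as a rough proxy for query ambiguity.
from collections import Counter
from math import log2

def category_entropy(result_categories):
    counts = Counter(result_categories)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Hypothetical categories of the top-10 results for two news queries.
print(category_entropy(["politics"] * 9 + ["sport"]))              # low entropy: focused query
print(category_entropy(["politics", "sport", "economy", "culture"] * 2
                       + ["science", "health"]))                    # high entropy: ambiguous query
```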
314
High-Performance Knowledge-Based Entity Extraction. Middleton, Anthony M. (01 January 2009).
Human language records most of the information and knowledge produced by organizations and individuals. The machine-based process of analyzing information in natural language form is called natural language processing (NLP). Information extraction (IE) is the process of analyzing machine-readable text and identifying and collecting information about specified types of entities, events, and relationships.
Named entity extraction is an area of IE concerned specifically with recognizing and classifying proper names for persons, organizations, and locations from natural language. Extant approaches to the design and implementation of named entity extraction systems include: (a) knowledge-engineering approaches which utilize domain experts to hand-craft NLP rules to recognize and classify named entities; (b) supervised machine-learning approaches in which a previously tagged corpus of named entities is used to train algorithms which incorporate statistical and probabilistic methods for NLP; or (c) hybrid approaches which incorporate aspects of both methods described in (a) and (b).
Performance for IE systems is evaluated using the metrics of precision and recall which measure the accuracy and completeness of the IE task. Previous research has shown that utilizing a large knowledge base of known entities has the potential to improve overall entity extraction precision and recall performance. Although existing methods typically incorporate dictionary-based features, these dictionaries have been limited in size and scope.
The problem addressed by this research was the design, implementation, and evaluation of a new high-performance knowledge-based hybrid processing approach and associated algorithms for named entity extraction, combining rule-based natural language parsing and memory-based machine learning classification facilitated by an extensive knowledge base of existing named entities. The hybrid approach implemented by this research resulted in improved precision and recall performance approaching human-level capability compared to existing methods measured using a standard test corpus. The system design incorporated a parallel processing system architecture with capabilities for managing a large knowledge base and providing high throughput potential for processing large collections of natural language text documents.
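The precision and recall metrics mentioned above can be made concrete with a small invented example; the gold and predicted entity sets below are placeholders, not results from this research:

```python
# Illustrative sketch: scoring an entity extractor against a gold-standard annotation.
gold = {("Barack Obama", "PERSON"), ("Chicago", "LOCATION"), ("IBM", "ORGANIZATION")}
predicted = {("Barack Obama", "PERSON"), ("IBM", "ORGANIZATION"), ("Tuesday", "PERSON")}

true_positives = len(gold & predicted)
precision = true_positives / len(predicted)   # correctness of what was extracted
recall = true_positives / len(gold)           # completeness w.r.t. the gold standard
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```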
315
A Semi-Supervised Information Extraction Framework for Large Redundant Corpora. Normand, Eric (19 December 2008).
The vast majority of text freely available on the Internet is not available in a form that computers can understand. There have been numerous approaches to automatically extract information from human-readable sources. The most successful attempts rely on vast training sets of data. Others have succeeded in extracting restricted subsets of the available information. These approaches have limited use and require domain knowledge to be coded into the application. The current thesis proposes a novel framework for Information Extraction. From large sets of documents, the system develops statistical models of the data the user wishes to query, generally avoiding the limitations and complexity of most Information Extraction systems. The framework uses a semi-supervised approach to minimize human input. It also eliminates the need for external Named Entity Recognition systems by relying on freely available databases. The final result is a query-answering system which extracts information from large corpora with a high degree of accuracy.
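As a hedged illustration of one idea the abstract alludes to, replacing a trained NER component with a lookup against a freely available database (a gazetteer), here is a minimal sketch; the entries and text are invented and this is not the thesis framework:

```python
# Illustrative sketch: gazetteer lookup in place of a trained NER component.
gazetteer = {
    "new orleans": "LOCATION",
    "university of new orleans": "ORGANIZATION",
    "eric normand": "PERSON",
}

def tag_entities(text, gazetteer):
    # Naive substring matching; overlapping matches (e.g. a location inside an
    # organization name) are not resolved here.
    lowered = text.lower()
    return [(name, label) for name, label in gazetteer.items() if name in lowered]

print(tag_entities("Eric Normand studied at the University of New Orleans.", gazetteer))
```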
316
An empirical study of semantic similarity in WordNet and Word2Vec. Handler, Abram (18 December 2014).
This thesis performs an empirical analysis of Word2Vec by comparing its output to WordNet, a well-known, human-curated lexical database. It finds that Word2Vec tends to uncover more of certain types of semantic relations than others, returning more hypernyms and synonyms than hyponyms or holonyms. It also shows the probability that neighbors separated by a given cosine distance in Word2Vec are semantically related in WordNet. This result both adds to our understanding of the still poorly understood Word2Vec and helps to benchmark new semantic tools built from word vectors.
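A minimal sketch of this kind of comparison is shown below. It assumes a pretrained Word2Vec file and the NLTK WordNet data are already downloaded (the file path is a placeholder), and it is not the analysis pipeline used in the thesis:

```python
# Illustrative sketch: check whether Word2Vec neighbours appear among a word's
# WordNet synonyms, hypernyms, or hyponyms.
from gensim.models import KeyedVectors
from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def wordnet_relatives(word):
    related = set()
    for synset in wn.synsets(word):
        related.update(synset.lemma_names())            # synonyms
        for hyper in synset.hypernyms():
            related.update(hyper.lemma_names())         # hypernyms
        for hypo in synset.hyponyms():
            related.update(hypo.lemma_names())          # hyponyms
    return {w.lower().replace("_", " ") for w in related}

relatives = wordnet_relatives("dog")
for neighbour, cosine in vectors.most_similar("dog", topn=10):
    print(neighbour, round(cosine, 3), neighbour.lower() in relatives)
```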
317
New Methods for Large-Scale Analyses of Social Identities and Stereotypes. Joseph, Kenneth (01 June 2016).
Social identities, the labels we use to describe ourselves and others, carry with them stereotypes that have significant impacts on our social lives. Our stereotypes, sometimes without us knowing, guide our decisions on whom to talk to and whom to stay away from, whom to befriend and whom to bully, whom to treat with reverence and whom to view with disgust. Despite these impacts of identities and stereotypes on our lives, existing methods used to understand them are lacking. In this thesis, I first develop three novel computational tools that further our ability to test and utilize existing social theory on identity and stereotypes. These tools include a method to extract identities from Twitter data, a method to infer affective stereotypes from newspaper data and a method to infer both affective and semantic stereotypes from Twitter data. Case studies using these methods provide insights into Twitter data relevant to the Eric Garner and Michael Brown tragedies and both Twitter and newspaper data from the “Arab Spring”. Results from these case studies motivate the need for not only new methods for existing theory, but new social theory as well. To this end, I develop a new sociotheoretic model of identity labeling - how we choose which label to apply to others in a particular situation. The model combines data, methods and theory from the social sciences and machine learning, providing an important example of the surprisingly rich interconnections between these fields.
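Purely as an illustration of what extracting identity labels from Twitter text could look like (a simple pattern-based pass is an assumption on my part; the thesis's method is considerably more sophisticated, and the tweets below are invented):

```python
# Illustrative sketch: pattern-based extraction of candidate identity labels.
import re

IDENTITY_PATTERN = re.compile(r"\bi(?:'m| am) an? ([a-z]+)", re.IGNORECASE)

tweets = [
    "I'm a teacher and proud of it",
    "honestly I am an engineer who never sleeps",
    "great game last night!",
]

for tweet in tweets:
    for label in IDENTITY_PATTERN.findall(tweet):
        print(label.lower())
```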
318
Automatic text summarization of Swedish news articles. Lehto, Niko; Sjödin, Mikael (January 2019).
With an increasing amount of textual information available, there is also an increased need to make this information more accessible. Our paper describes a modified TextRank model and investigates the methods available for using automatic text summarization to create summaries of Swedish news articles. To evaluate our model we focused on intrinsic evaluation methods: in part through content evaluation, in the form of measuring referential clarity and non-redundancy, and in part through text quality evaluation measures, in the form of keyword retention and ROUGE evaluation. The results acquired indicate that stemming and improved stop word capabilities can have a positive effect on the ROUGE scores. The addition of redundancy checks also seems to have a positive effect on avoiding repetition of information. Keyword retention decreased somewhat, however. Lastly, all methods had some trouble with dangling anaphora, showing a need for further work on anaphora resolution.
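A minimal sketch of the generic TextRank idea (not the modified model described above): sentences become graph nodes, edges carry TF-IDF cosine similarities, and PageRank scores select the summary sentences. The example sentences are invented:

```python
# Illustrative sketch: extractive summarization with PageRank over sentence similarities.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The city council approved the new budget on Monday.",
    "The budget includes increased funding for public schools.",
    "Local weather was mild over the weekend.",
    "School funding has been a contested issue for years.",
]

tfidf = TfidfVectorizer().fit_transform(sentences)
similarity = cosine_similarity(tfidf)

graph = nx.from_numpy_array(similarity)   # nodes are sentence indices, edges are weighted
scores = nx.pagerank(graph)               # TextRank is PageRank on this sentence graph

top = sorted(scores, key=scores.get, reverse=True)[:2]
summary = " ".join(sentences[i] for i in sorted(top))
print(summary)
```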
319
A SENTIMENT BASED AUTOMATIC QUESTION-ANSWERING FRAMEWORK. Qiaofei Ye (14 May 2019).
With the rapid growth and maturity of the Question-Answering (QA) domain, non-factoid Question-Answering tasks are in high demand. However, existing Question-Answering systems are either fact-based, or highly keyword-related and hard-coded. Moreover, if QA is to become more personable, the sentiment of the question and answer should be taken into account. However, little research has been done on non-factoid Question-Answering systems based on sentiment analysis, which would enable a system to retrieve answers in a more emotionally intelligent way. This study investigates to what extent the prediction of the best answer could be improved by adding an extended representation of sentiment information into non-factoid Question-Answering.
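To make the idea concrete, here is a minimal sketch of re-ranking candidate answers by combining a relevance score with a sentiment-match term. The toy lexicon, weights, and texts are assumptions for illustration only, not the framework proposed in the thesis:

```python
# Illustrative sketch: sentiment-aware re-ranking of candidate answers.
POSITIVE = {"great", "love", "happy", "wonderful", "recommend"}
NEGATIVE = {"terrible", "hate", "awful", "disappointed", "avoid"}

def polarity(text):
    words = [w.strip(".,!?") for w in text.lower().split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def rank_answers(question, candidates):
    q_pol = polarity(question)
    scored = []
    for relevance, answer in candidates:
        # Reward answers whose polarity sign matches the question's polarity sign.
        sentiment_match = 1.0 if (polarity(answer) >= 0) == (q_pol >= 0) else 0.0
        scored.append((0.7 * relevance + 0.3 * sentiment_match, answer))
    return sorted(scored, reverse=True)

question = "I love this phone, is the battery life great too?"
candidates = [(0.80, "The battery is terrible and I was disappointed."),
              (0.75, "Battery life is wonderful, easily a full day.")]
print(rank_answers(question, candidates)[0][1])
```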
320
Using Machine Learning to Learn from Bug Reports: Towards Improved Testing Efficiency. Ingvarsson, Sanne (January 2019).
The evolution of a software system originates from its changes, whether they come from changed user needs or adaptation to its current environment. These changes are as encouraged as they are inevitable, although every change to a software system comes with a risk of introducing an error or a bug. This thesis aimed to investigate the possibilities of using the descriptions in bug reports as a decision basis for detecting the provenance of a bug by using machine learning. K-means and agglomerative clustering were applied to free-text documents using Natural Language Processing to initially divide the investigated software system into subparts. Topic labelling was then performed on the resulting clusters to find suitable names and gain an overall understanding of the clusters. Finally, it was investigated whether it was possible to identify which clusters were more likely to be the origin of a bug and should therefore be tested more thoroughly. By evaluating a subset of known causes, it was found that possible direct connections could be found in 50% of the cases, and that this number increased to 58% when the causes were attached to clusters.
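A minimal sketch of the clustering step described above, using TF-IDF features with k-means and agglomerative clustering; the bug summaries and cluster count are invented, not data from the studied system:

```python
# Illustrative sketch: clustering free-text bug descriptions.
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

bug_reports = [
    "App crashes when the login button is pressed",
    "Login screen freezes after entering the password",
    "PDF export produces an empty document",
    "Exported PDF is missing all images",
]

X = TfidfVectorizer(stop_words="english").fit_transform(bug_reports)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
agglo_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X.toarray())

print("k-means:      ", list(kmeans_labels))
print("agglomerative:", list(agglo_labels))
```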