281

Efficient computation of advanced skyline queries.

Yuan, Yidong, Computer Science & Engineering, Faculty of Engineering, UNSW January 2007 (has links)
Skyline has been proposed as an important operator for many applications, such as multi-criteria decision making, data mining and visualization, and user-preference queries. Due to its importance, skyline and its computation have recently received considerable attention from the database research community. All the existing techniques, however, focus on conventional databases; they are not applicable to online computation environments such as data streams. In addition, the existing studies consider only the efficiency of skyline computation, while the fundamental problem of the semantics of skylines remains open. In this thesis, we study three problems of skyline computation: (1) online skyline computation over data streams; (2) skyline cube computation and its analysis; and (3) the top-k most representative skyline. To tackle the problem of online skyline computation, we develop a novel framework which converts the more expensive multi-dimensional skyline computation into stabbing queries in 1-dimensional space. Based on this framework, a rigorous theoretical analysis of the time complexity of online skyline computation is provided. Then, efficient algorithms are proposed to support ad hoc and continuous skyline queries over data streams. Inspired by the idea of the data cube, we propose a novel concept of the skyline cube, which consists of the skylines of all possible non-empty subsets of a given full space. We identify the unique sharing strategies for skyline cube computation and develop two efficient algorithms which compute the skyline cube in a bottom-up and a top-down manner, respectively. Finally, a theoretical framework to answer the question about the semantics of skylines, along with an analysis of multidimensional subspace skylines, is presented. Motivated by the fact that the full skyline may be less informative because it generally consists of a large number of skyline points, we propose a novel skyline operator -- the top-k most representative skyline. The top-k most representative skyline operator selects the k skyline points such that the number of data points dominated by at least one of these k skyline points is maximized. To compute the top-k most representative skyline, two efficient algorithms and their theoretical analysis are presented.
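The dominance relation and the top-k most representative skyline can be made concrete with a small sketch. The Python snippet below is an illustration only, not the thesis's algorithms: it computes a skyline by naive pairwise dominance checks and greedily approximates the top-k most representative skyline; the hotel data and all function names are invented for the example.

```python
# Illustrative sketch only (not the thesis's algorithms): a naive skyline via
# pairwise dominance checks, and a greedy approximation of the top-k most
# representative skyline. Data and names are invented for the example.

def dominates(p, q):
    """p dominates q if p is no worse in every dimension and strictly better
    in at least one (here, smaller values are better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Points not dominated by any other point (naive O(n^2) check)."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

def top_k_representative(points, k):
    """Greedily pick skyline points that dominate the most uncovered data points."""
    covered, chosen = set(), []
    candidates = skyline(points)
    for _ in range(min(k, len(candidates))):
        best = max(candidates,
                   key=lambda s: sum(1 for i, p in enumerate(points)
                                     if i not in covered and dominates(s, p)))
        chosen.append(best)
        candidates.remove(best)
        covered |= {i for i, p in enumerate(points) if dominates(best, p)}
    return chosen

# Hotels described by (price, distance to beach); lower is better in both.
hotels = [(50, 3.0), (60, 2.0), (80, 1.0), (70, 2.5), (90, 1.5), (85, 2.0)]
print(skyline(hotels))                  # [(50, 3.0), (60, 2.0), (80, 1.0)]
print(top_k_representative(hotels, 2))  # [(60, 2.0), (80, 1.0)]
```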
282

Interactive Visualizations of Natural Language

Collins, Christopher 06 August 2010 (has links)
While linguistic skill is a hallmark of humanity, the increasing volume of linguistic data each of us faces is causing individual and societal problems — ‘information overload’ is a commonly discussed condition. Tasks such as finding the most appropriate information online, understanding the contents of a personal email repository, and translating documents from another language are now commonplace. These tasks need not cause stress and feelings of overload: the human intellectual capacity is not the problem. Rather, the computational interfaces to linguistic data are problematic — there exists a Linguistic Visualization Divide in the current state of the art. Through five design studies, this dissertation combines sophisticated natural language processing algorithms with information visualization techniques grounded in evidence of human visuospatial capabilities. The first design study, Uncertainty Lattices, augments real-time computer-mediated communication, such as cross-language instant messaging chat and automatic speech recognition. By providing explicit indications of algorithmic confidence, the visualization enables informed decisions about the quality of computational outputs. Two design studies explore the space of content analysis. DocuBurst is an interactive visualization of document content, which spatially organizes words using an expert-created ontology. Broadening from single documents to document collections, Parallel Tag Clouds combine keyword extraction and coordinated visualizations to provide comparative overviews across subsets of a faceted text corpus. Finally, two studies address visualization for natural language processing research. The Bubble Sets visualization draws secondary set relations around arbitrary collections of items, such as a linguistic parse tree. From this design study we propose a theory of spatial rights to consider when assigning visual encodings to data. Expanding considerations of spatial rights, we present a formalism to organize the variety of approaches to coordinated and linked visualization, and introduce VisLink, a new method to relate and explore multiple 2D visualizations in 3D space. Inter-visualization connections allow for cross-visualization queries and support high-level comparison between visualizations. From the design studies we distill challenges common to visualizing language data, including maintaining legibility, supporting detailed reading, addressing data scale challenges, and managing problems arising from semantic ambiguity.
283

Exploiting Linguistic Knowledge to Infer Properties of Neologisms

Cook, C. Paul 14 February 2011 (has links)
Neologisms, or newly coined words, pose problems for natural language processing (NLP) systems. Due to the recency of their coinage, neologisms are typically not listed in computational lexicons---dictionary-like resources that many NLP applications depend on. Therefore, when a neologism is encountered in a text being processed, the performance of an NLP system will likely suffer due to the missing word-level information. Identifying and documenting the usage of neologisms is also a challenge in lexicography, the making of dictionaries. The traditional approach to these tasks has been to read large amounts of text manually. However, due to the vast quantities of text being produced nowadays, particularly in electronic media such as blogs, it is no longer possible to manually analyze it all in search of neologisms. Methods for automatically identifying and inferring syntactic and semantic properties of neologisms would therefore address problems encountered in both natural language processing and lexicography. Because neologisms are typically infrequent due to their recent addition to the language, approaches to automatically learning word-level information that rely on statistical distributional information are in many cases inappropriate. Moreover, neologisms occur in many domains and genres, and therefore approaches relying on domain-specific resources are also inappropriate. The hypothesis of this thesis is that knowledge about etymology---including word formation processes and types of semantic change---can be exploited for the acquisition of aspects of the syntax and semantics of neologisms. Evidence supporting this hypothesis is found in three case studies: lexical blends (e.g., "webisode", a blend of "web" and "episode"), text messaging forms (e.g., "any1" for "anyone"), and ameliorations and pejorations (e.g., the use of "sick" to mean 'excellent', an amelioration). Moreover, this thesis presents the first computational work on lexical blends and on ameliorations and pejorations, and the first unsupervised approach to text message normalization.
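As a rough illustration of the kind of inference involved for lexical blends, the sketch below (not the thesis's method) generates candidate source-word pairs for a blend by matching its prefixes and suffixes against a word list; the toy lexicon and the minimum split length are assumptions made for the example.

```python
# Illustrative sketch (not the thesis's method): propose candidate source-word
# pairs for a lexical blend by matching prefixes and suffixes against a lexicon.

WORDS = {"web", "website", "episode", "anyone", "webisode"}  # toy lexicon

def blend_candidates(blend, lexicon, min_len=2):
    """Return (prefix_word, suffix_word) pairs whose parts could form the blend."""
    pairs = []
    for i in range(min_len, len(blend) - min_len + 1):
        prefix, suffix = blend[:i], blend[i:]
        prefix_words = [w for w in lexicon if w.startswith(prefix) and w != blend]
        suffix_words = [w for w in lexicon if w.endswith(suffix) and w != blend]
        pairs.extend((pw, sw) for pw in prefix_words for sw in suffix_words)
    return sorted(set(pairs))

print(blend_candidates("webisode", WORDS))
# [('web', 'episode'), ('website', 'episode')]
```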
284

Detection of Speculative Language in Scientific Literature

Moncecchi, Guillermo 11 March 2013 (has links) (PDF)
This thesis proposes a methodology for solving certain classification problems, notably sequential classification tasks in Natural Language Processing. To improve classification results, we propose an error-driven iterative approach that integrates expert knowledge, represented as "knowledge rules", into the learning process. We applied the methodology to two tasks related to the detection of speculation ("hedging") in scientific literature: hedge cue identification and hedge cue scope detection. The results are promising: for the first task, we improved the baseline F-score by 2.5 points by integrating data on the co-occurrence of speculative cues. For the second task, integrating syntactic information and rules for syntactic pruning improved the classification F-score from 0.712 to 0.835. Compared with state-of-the-art methods, these results are very good, and they suggest that the approach of improving classifiers based solely on errors made on a corpus can also be applied to other similar tasks. Furthermore, this thesis proposes a class schema for representing the analysis of a sentence in a single structure that integrates the results of different linguistic analyses. This makes it easier to manage the iterative classifier-improvement process, in which different sets of learning attributes are used at each iteration. We also propose storing the attributes in a relational model, rather than in the classical textual structures, to ease the analysis and manipulation of the learned data.
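As a rough illustration of the evaluation setting, the sketch below is an assumption for illustration, not the thesis's system: it tags hedge cues with a simple cue lexicon and computes the F-score used to report results; the cue list and toy sentence are invented.

```python
# Illustrative sketch only: lexicon-based hedge cue tagging and the F-score
# used to evaluate it. Cue list and toy data are assumptions, not thesis data.

SPECULATIVE_CUES = {"may", "might", "suggest", "suggests", "possibly", "appears"}

def tag_cues(sentence):
    """Mark tokens that belong to the cue lexicon as speculative cues."""
    return [tok.lower() in SPECULATIVE_CUES for tok in sentence]

def f_score(gold, pred):
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

sentence = ["These", "results", "suggest", "that", "the", "protein", "may", "bind"]
gold     = [False,   False,     True,      False,  False, False,     True,  False]
print(f_score(gold, tag_cues(sentence)))   # 1.0 on this toy example
```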
287

Topical Opinion Retrieval

Skomorowski, Jason January 2006 (has links)
With a growing amount of subjective content distributed across the Web, there is a need for a domain-independent information retrieval system that supports ad hoc retrieval of documents expressing opinions on the specific topic of the user's query. While the research area of opinion detection and sentiment analysis has received much attention in recent years, little research has been done on identifying subjective content targeted at a specific topic, i.e., expressing topical opinion. This thesis presents a novel method for ad hoc retrieval of documents that contain subjective content on the topic of the query. Documents are ranked by the likelihood that each document expresses an opinion on a query term, approximated as the likelihood that any occurrence of the query term is modified by a subjective adjective. A domain-independent, user-based evaluation of the proposed methods was conducted and shows statistically significant gains over Google ranking as the baseline.
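The ranking idea can be sketched in a few lines. The snippet below is an illustration only, not the thesis's implementation: it scores a document by the fraction of query-term occurrences preceded by a subjective adjective, with a placeholder adjective list and simple adjacency standing in for syntactic modification.

```python
# Illustrative sketch (not the thesis's implementation) of topical opinion
# scoring: what fraction of query-term mentions are modified by a subjective
# adjective? Adjacency approximates modification; the adjective list is a toy.

SUBJECTIVE_ADJECTIVES = {"great", "terrible", "awful", "amazing", "poor", "excellent"}

def topical_opinion_score(document_tokens, query_term):
    """Fraction of query-term occurrences preceded by a subjective adjective."""
    hits = [i for i, tok in enumerate(document_tokens) if tok.lower() == query_term]
    if not hits:
        return 0.0
    modified = sum(1 for i in hits
                   if i > 0 and document_tokens[i - 1].lower() in SUBJECTIVE_ADJECTIVES)
    return modified / len(hits)

doc = "This is a terrible camera ; the camera battery is fine".split()
print(topical_opinion_score(doc, "camera"))   # 0.5: one of two mentions is modified
```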
288

An Investigation of Word Sense Disambiguation for Improving Lexical Chaining

Enss, Matthew January 2006 (has links)
This thesis investigates how word sense disambiguation affects lexical chains, and proposes an improved model for lexical chaining in which word sense disambiguation is performed prior to lexical chaining. A lexical chain is a set of words from a document that are related in meaning. Lexical chains can be used to identify the dominant topics in a document, as well as where changes in topic occur. This makes them useful for applications such as topic segmentation and document summarization.

However, polysemous words are an inherent problem for algorithms that find lexical chains, as the intended meaning of a polysemous word must be determined before its semantic relations to other words can be determined. For example, the word "bank" should only be placed in a chain with "money" if, in the context of the document, "bank" refers to a place that deals with money rather than a river bank. The process by which the intended senses of polysemous words are determined is word sense disambiguation. To date, lexical chaining algorithms have performed word sense disambiguation as part of the overall process of building lexical chains. Because the intended senses of polysemous words must be determined before words can be properly chained, we propose that word sense disambiguation should be performed before lexical chaining occurs. Furthermore, if word sense disambiguation is performed prior to lexical chaining, then it can be done with any available disambiguation method, without regard to how lexical chains will be built afterwards. Therefore, the most accurate available method for word sense disambiguation should be applied prior to the creation of lexical chains.

We perform an experiment to demonstrate the validity of the proposed model. We compare the lexical chains produced in two cases:

1. Lexical chaining is performed as normal on a corpus of documents that has not been disambiguated.
2. Lexical chaining is performed on the same corpus, but all the words have been correctly disambiguated beforehand.

We show that the lexical chains created in the second case are more correct than the chains created in the first. This result demonstrates that accurate word sense disambiguation performed prior to the creation of lexical chains does lead to better lexical chains being produced, confirming that our model for lexical chaining is an improvement upon previous approaches.
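A minimal sketch of the proposed ordering, disambiguation first and chaining second, is given below. It is an illustration under simplifying assumptions, not the thesis's chaining algorithm: senses are assumed to be already assigned, and the sense-relatedness table is a toy stand-in for a resource such as WordNet.

```python
# Minimal sketch (assumed, not the thesis's algorithm): build lexical chains
# from tokens whose senses were disambiguated beforehand. Relatedness is a toy
# lookup table; a real system would consult WordNet or a similar resource.

RELATED_SENSES = {
    frozenset({"bank.n.financial", "money.n.currency"}),
    frozenset({"bank.n.river", "water.n.liquid"}),
}

def related(sense_a, sense_b):
    return sense_a == sense_b or frozenset({sense_a, sense_b}) in RELATED_SENSES

def build_chains(disambiguated_tokens):
    """disambiguated_tokens: list of (word, sense_id); returns lists of words."""
    chains = []   # each chain: (list_of_words, set_of_senses)
    for word, sense in disambiguated_tokens:
        for words, senses in chains:
            if any(related(sense, s) for s in senses):
                words.append(word)
                senses.add(sense)
                break
        else:
            chains.append(([word], {sense}))
    return [words for words, _ in chains]

tokens = [("bank", "bank.n.financial"), ("money", "money.n.currency"),
          ("water", "water.n.liquid"), ("bank", "bank.n.river")]
print(build_chains(tokens))
# [['bank', 'money'], ['water', 'bank']] -- the two senses of "bank" land in different chains
```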
289

From Atoms to the Solar System: Generating Lexical Analogies from Text

Chiu, Pei-Wen Andy January 2006 (has links)
A lexical analogy is two pairs of words (w1, w2) and (w3, w4) such that the relation between w1 and w2 is identical or similar to the relation between w3 and w4. For example, (abbreviation, word) forms a lexical analogy with (abstract, report), because in both cases the former is a shortened version of the latter. Lexical analogies are of theoretic interest because they represent a second order similarity measure: relational similarity. Lexical analogies are also of practical importance in many applications, including text-understanding and learning ontological relations.

This thesis presents a novel system that generates lexical analogies from a corpus of text documents. The system is motivated by a well-established theory of analogy-making, and views lexical analogy generation as a series of three processes: identifying pairs of words that are semantically related, finding clues to characterize their relations, and generating lexical analogies by matching pairs of words with similar relations. The system uses a dependency grammar to characterize semantic relations, and applies machine learning techniques to determine their similarities. Empirical evaluation shows that the system performs remarkably well, generating lexical analogies at a precision of over 90%.
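As a rough illustration of matching word pairs by relational similarity, the sketch below (not the thesis's system, which uses dependency-grammar features and machine learning) represents the relation between two words by the words that appear between them and compares pairs with cosine similarity; the three-sentence corpus is a toy example.

```python
# Illustrative sketch (not the thesis's system): characterize the relation of a
# word pair by the in-between words in sentences containing both, then compare
# relation vectors of two pairs with cosine similarity. Corpus is a toy.

from collections import Counter
from math import sqrt

CORPUS = [
    "an abbreviation is a shortened form of a word",
    "an abstract is a shortened form of a report",
    "an atom is a small part of a molecule",
]

def relation_vector(w1, w2, corpus):
    """Bag of words appearing between w1 and w2 in sentences containing both."""
    vec = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        if w1 in tokens and w2 in tokens:
            i, j = tokens.index(w1), tokens.index(w2)
            vec.update(tokens[min(i, j) + 1:max(i, j)])
    return vec

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

r1 = relation_vector("abbreviation", "word", CORPUS)
r2 = relation_vector("abstract", "report", CORPUS)
print(cosine(r1, r2))   # 1.0 on this toy corpus -> candidate lexical analogy
```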
290

'Healthy' Coreference: Applying Coreference Resolution to the Health Education Domain

Hirtle, David Z. January 2008 (has links)
This thesis investigates coreference and its resolution within the domain of health education. Coreference is the relationship between two linguistic expressions that refer to the same real-world entity, and resolution involves identifying this relationship among sets of referring expressions. The coreference resolution task is considered among the most difficult of problems in Artificial Intelligence; in some cases, resolution is impossible even for humans. For example, "she" in the sentence "Lynn called Jennifer while she was on vacation" is genuinely ambiguous: the vacationer could be either Lynn or Jennifer.

There are three primary motivations for this thesis. The first is that health education has never before been studied in this context. So far, the vast majority of coreference research has focused on news. Secondly, achieving domain-independent resolution is unlikely without understanding the extent to which coreference varies across different genres. Finally, coreference pervades language and is an essential part of coherent discourse. Its effective use is a key component of easy-to-understand health education materials, where readability is paramount.

No suitable corpus of health education materials existed, so our first step was to create one. The comprehensive analysis of this corpus, which required manual annotation of coreference, confirmed our hypothesis that the coreference used in health education differs substantially from that in previously studied domains. This analysis was then used to shape the design of a knowledge-lean algorithm for resolving coreference. This algorithm performed surprisingly well on this corpus, e.g., successfully resolving over 85% of all pronouns when evaluated on unseen data.

Despite the importance of coreferentially annotated corpora, only a handful are known to exist, likely because of the difficulty and cost of reliably annotating coreference. The paucity of genres represented in these existing annotated corpora creates an implicit bias in domain-independent coreference resolution. In an effort to address these issues, we plan to make our health education corpus available to the wider research community, hopefully encouraging a broader focus in the future.
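A knowledge-lean pronoun resolver can be sketched as follows; this is an illustration only, not the thesis's algorithm, using a simplified agreement table and recency to pick an antecedent. It also shows why the "Lynn called Jennifer" example above stays genuinely ambiguous for such heuristics.

```python
# Illustrative sketch only (not the thesis's algorithm): resolve a pronoun to
# the most recent preceding mention whose gender/number features agree with it.
# The mention list and agreement table are simplified assumptions.

PRONOUN_AGREEMENT = {
    "she": {"female", "singular"},
    "he": {"male", "singular"},
    "it": {"neuter", "singular"},
    "they": {"plural"},
}

def resolve_pronoun(pronoun, preceding_mentions):
    """preceding_mentions: list of (text, features) in document order.
    Return the nearest compatible antecedent, or None."""
    required = PRONOUN_AGREEMENT[pronoun.lower()]
    for text, features in reversed(preceding_mentions):
        if required <= features:          # all required features present
            return text
    return None

mentions = [("Lynn", {"female", "singular"}),
            ("Jennifer", {"female", "singular"}),
            ("the phone", {"neuter", "singular"})]
print(resolve_pronoun("she", mentions))   # "Jennifer" -- recency alone cannot
                                          # settle genuinely ambiguous cases
print(resolve_pronoun("it", mentions))    # "the phone"
```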
