281 |
Large-scale semi-supervised learning for natural language processingBergsma, Shane A 11 1900 (has links)
Natural Language Processing (NLP) develops computational approaches to processing language data. Supervised machine learning has become the dominant methodology of modern NLP. The performance of a supervised NLP system crucially depends on the amount of data available for training. In the standard supervised framework, if a sequence of words was not encountered in the training set, the system can only guess at its label at test time. The cost of producing labeled training examples is a bottleneck for current NLP technology. On the other hand, a vast quantity of unlabeled data is freely available.
This dissertation proposes effective, efficient, versatile methodologies for 1) extracting useful information from very large (potentially web-scale) volumes of unlabeled data and 2) combining such information with standard supervised machine learning for NLP. We demonstrate novel ways to exploit unlabeled data, we scale these approaches to make use of all the text on the web, and we show improvements on a variety of challenging NLP tasks. This combination of learning from both labeled and unlabeled data is often referred to as semi-supervised learning.
Although lacking manually-provided labels, the statistics of unlabeled patterns can often distinguish the correct label for an ambiguous test instance. In the first part of this dissertation, we propose to use the counts of unlabeled patterns as features in supervised classifiers, with these classifiers trained on varying amounts of labeled data. We propose a general approach for integrating information from multiple, overlapping sequences of context for lexical disambiguation problems. We also show how standard machine learning algorithms can be modified to incorporate a particular kind of prior knowledge: knowledge of effective weightings for count-based features. We also evaluate performance within and across domains for two generation and two analysis tasks, assessing the impact of combining web-scale counts with conventional features. In the second part of this dissertation, rather than using the aggregate statistics as features, we propose to use them to generate labeled training examples. By automatically labeling a large number of examples, we can train powerful discriminative models, leveraging fine-grained features of input words.
|
282 |
Grapheme-to-phoneme conversion and its application to transliterationJiampojamarn, Sittichai 06 1900 (has links)
Grapheme-to-phoneme conversion (G2P) is the task of converting a word, represented by a sequence of graphemes, to its pronunciation, represented by a sequence of phonemes. The G2P task plays a crucial role in speech synthesis systems, and is an important part of other applications, including spelling correction and speech-to-speech machine translation. G2P conversion is a complex task, for which a number of diverse solutions have been proposed. In general, the problem is challenging because the source string does not unambiguously specify the target representation. In addition, the training data include only example word
pairs without the structural information of subword alignments.
In this thesis, I introduce several novel approaches for G2P conversion. My contributions can be categorized into (1) new alignment models and (2) new output generation models. With respect to alignment models, I present techniques including many-to-many alignment, phonetic-based alignment, alignment by integer linear programing and alignment-by-aggregation. Many-to-many alignment is designed to replace the one-to-one
alignment that has been used almost exclusively in the past. The new many-to-many alignments are more precise and accurate in expressing grapheme-phoneme relationships. The other proposed alignment approaches attempt to advance the training method beyond the use of Expectation-Maximization (EM). With respect to generation models, I first describe a framework for integrating many-to-many alignments and language models for grapheme classification. I then propose joint processing for G2P using online discriminative training. I integrate a generative joint n-gram model into the discriminative framework. Finally, I apply the proposed G2P systems to name transliteration generation and mining tasks. Experiments show that the proposed system achieves state-of-the-art performance in both the G2P and name transliteration tasks.
|
283 |
Automatic Text Ontological Representation and Classification via Fundamental to Specific Conceptual Elements (TOR-FUSE)Razavi, Amir Hossein 16 July 2012 (has links)
In this dissertation, we introduce a novel text representation method mainly used for text classification purpose. The presented representation method is initially based on a variety of closeness relationships between pairs of words in text passages within the entire corpus. This representation is then used as the basis for our multi-level lightweight ontological representation method (TOR-FUSE), in which documents are represented based on their contexts and the goal of the learning task. The method is unlike the traditional representation methods, in which all the documents are represented solely based on the constituent words of the documents, and are totally isolated from the goal that they are represented for. We believe choosing the correct granularity of representation features is an important aspect of text classification. Interpreting data in a more general dimensional space, with fewer dimensions, can convey more discriminative knowledge and decrease the level of learning perplexity. The multi-level model allows data interpretation in a more conceptual space, rather than only containing scattered words occurring in texts. It aims to perform the extraction of the knowledge tailored for the classification task by automatic creation of a lightweight ontological hierarchy of representations. In the last step, we will train a tailored ensemble learner over a stack of representations at different conceptual granularities. The final result is a mapping and a weighting of the targeted concept of the original learning task, over a stack of representations and granular conceptual elements of its different levels (hierarchical mapping instead of linear mapping over a vector). Finally the entire algorithm is applied to a variety of general text classification tasks, and the performance is evaluated in comparison with well-known algorithms.
|
284 |
Automatic Concept-Based Query Expansion Using Term Relational Pathways Built from a Collection-Specific Association ThesaurusLyall-Wilson, Jennifer Rae January 2013 (has links)
The dissertation research explores an approach to automatic concept-based query expansion to improve search engine performance. It uses a network-based approach for identifying the concept represented by the user's query and is founded on the idea that a collection-specific association thesaurus can be used to create a reasonable representation of all the concepts within the document collection as well as the relationships these concepts have to one another. Because the representation is generated using data from the association thesaurus, a mapping will exist between the representation of the concepts and the terms used to describe these concepts. The research applies to search engines designed for use in an individual website with content focused on a specific conceptual domain. Therefore, both the document collection and the subject content must be well-bounded, which affords the ability to make use of techniques not currently feasible for general purpose search engine used on the entire web.
|
285 |
Grapheme-to-phoneme conversion and its application to transliterationJiampojamarn, Sittichai Unknown Date
No description available.
|
286 |
Large-scale semi-supervised learning for natural language processingBergsma, Shane A Unknown Date
No description available.
|
287 |
Entwicklung eines Werkzeugs zur onlinebasierten Bestimmung typenspezifischer LernpräferenzenWortmann, Frank, Frießem, Martina, Zülch, Joachim 25 October 2013 (has links) (PDF)
Die multimediale Aufbereitung von Lerninhalten zu E-Learning Einheiten ist heute in der Aus- und Weiterbildung allgegenwärtig. Häufig wird dabei jedoch die Perspektive der Nutzer der virtuellen Lehr-/Lernsysteme vernachlässigt [1]. In der Regel erfolgt eine standardisierte Aufbereitung der Lerninhalte, welche während der Konzeptionsphase die späteren Konsumenten der E-Learning Einheiten weitestgehend außer Acht lässt. Entsprechend werden meistens weder die Potenziale zur Kostenersparnis durch Teilstandardisierung von Weiterbildungsmodulen noch zur Qualitätssteigerung durch verstärkte Individualisierung annähernd ausgeschöpft. [2] Dabei bleiben individuellen Lernpräferenzen der Teilnehmer unberücksichtigt und Potenziale zur Verbesserung der Lernergebnisse werden nicht genutzt. Um diese Lücke zu schließen, wird basierend auf der Theorie der neurolinguistischen Programmierung ein Online-Werkzeug entwickelt, welches die zur Informationsaufnahme präferierten Sinneskanäle der einzelnen Teilnehmer analysiert und auswertet. Mit diesem Wissen kann die zukünftige Entwicklung und Aufbereitung von E-Learning Inhalten individueller auf die jeweiligen Konsumenten zugeschnitten werden. Zusätzlich soll das Online-Werkzeug den Nutzern Empfehlungen geben, wie sie ihren Lernerfolg verbessern können.
|
288 |
AIM - A Social Media Monitoring System for Quality EngineeringBank, Mathias 27 June 2013 (has links) (PDF)
In the last few years the World Wide Web has dramatically changed the way people are communicating with each other. The growing availability of Social Media Systems like Internet fora, weblogs and social networks ensure that the Internet is today, what it was originally designed for: A technical platform in which all users are able to interact with each other. Nowadays, there are billions of user comments available discussing all aspects of life and the data source is still growing.
This thesis investigates, whether it is possible to use this growing amount of freely provided user comments to extract quality related information. The concept is based on the observation that customers are not only posting marketing relevant information. They also publish product oriented content including positive and negative experiences. It is assumed that this information represents a valuable data source for quality analyses: The original voices of the customers promise to specify a more exact and more concrete definition of \"quality\" than the one that is available to manufacturers or market researchers today.
However, the huge amount of unstructured user comments makes their evaluation very complex. It is impossible for an analysis protagonist to manually investigate the provided customer feedback. Therefore, Social Media specific algorithms have to be developed to collect, pre-process and finally analyze the data. This has been done by the Social Media monitoring system AIM (Automotive Internet Mining) that is the subject of this thesis. It investigates how manufacturers, products, product features and related opinions are discussed in order to estimate the overall product quality from the customers\\\' point of view. AIM is able to track different types of data sources using a flexible multi-agent based crawler architecture. In contrast to classical web crawlers, the multi-agent based crawler supports individual crawling policies to minimize the download of irrelevant web pages. In addition, an unsupervised wrapper induction algorithm is introduced to automatically generate content extraction parameters which are specific for the crawled Social Media systems. The extracted user comments are analyzed by different content analysis algorithms to gain a deeper insight into the discussed topics and opinions. Hereby, three different topic types are supported depending on the analysis needs.
* The creation of highly reliable analysis results is realized by using a special context-aware taxonomy-based classification system.
* Fast ad-hoc analyses are applied on top of classical fulltext search capabilities.
* Finally, AIM supports the detection of blind-spots by using a new fuzzified hierarchical clustering algorithm. It generates topical clusters while supporting multiple topics within each user comment.
All three topic types are treated in a unified way to enable an analysis protagonist to apply all methods simultaneously and in exchange. The systematically processed user comments are visualized within an easy and flexible interactive analysis frontend. Special abstraction techniques support the investigation of thousands of user comments with minimal time efforts. Hereby, specifically created indices show the relevancy and customer satisfaction of a given topic. / In den letzten Jahren hat sich das World Wide Web dramatisch verändert. War es vor einigen Jahren noch primär eine Informationsquelle, in der ein kleiner Anteil der Nutzer Inhalte veröffentlichen konnte, so hat sich daraus eine Kommunikationsplattform entwickelt, in der jeder Nutzer aktiv teilnehmen kann. Die dadurch enstehende Datenmenge behandelt jeden Aspekt des täglichen Lebens. So auch Qualitätsthemen. Die Analyse der Daten verspricht Qualitätssicherungsmaßnahmen deutlich zu verbessern. Es können dadurch Themen behandelt werden, die mit klassischen Sensoren schwer zu messen sind. Die systematische und reproduzierbare Analyse von benutzergenerierten Daten erfordert jedoch die Anpassung bestehender Tools sowie die Entwicklung neuer Social-Media spezifischer Algorithmen. Diese Arbeit schafft hierfür ein völlig neues Social Media Monitoring-System, mit dessen Hilfe ein Analyst tausende Benutzerbeiträge mit minimaler Zeitanforderung analysieren kann. Die Anwendung des Systems hat einige Vorteile aufgezeigt, die es ermöglichen, die kundengetriebene Definition von \"Qualität\" zu erkennen.
|
289 |
Προδιαγραφές μιας καινοτόμας πλατφόρμας ηλεκτρονικής μάθησης που ενσωματώνει τεχνικές επεξεργασίας φυσικής γλώσσαςΦερφυρή, Ναυσικά 04 September 2013 (has links)
Ζούμε σε μια κοινωνία στην οποία η χρήση της τεχνολογίας έχει εισβάλει δυναμικά στην καθημερινότητα.Η εκπαίδευση δεν θα μπορούσε να μην επηρεαστεί απο τις Νέες Τεχνολογίες.Ήδη,όροι όπως “Ηλεκτρονική Μάθηση” και ”Ασύγχρονη Τηλε-εκπαίδευση” έχουν δημιουργήσει νέα δεδομένα στην κλασική Εκπαίδευση. Με τον όρο ασύγχρονη τηλε-εκπαίδευση εννοούμε μια διαδικασία ανταλλαγής μάθησης μεταξύ εκπαιδευτή - εκπαιδευομένων,που πραγματοποιείται ανεξάρτητα χρόνου και τόπου. Ηλεκτρονική Μάθηση είναι η χρήση των νέων πολυμεσικών τεχνολογιών και του διαδικτύου για τη βελτίωση της ποιότητας της μάθησης,διευκολύνοντας την πρόσβαση σε πηγές πληροφοριών και σε υπηρεσίες καθώς και σε ανταλλαγές και εξ'αποστάσεως συνεργασίες.Ο όρος καλύπτει ένα ευρύ φάσμα εφαρμογών και διαδικασιών,όπως ηλεκτρονικές τάξεις και ψηφιακές συνεργασίες, μάθηση βασιζόμενη στους ηλεκτρονικούς υπολογιστές και στις τεχνολογίες του παγκόσμιου ιστού. Κάποιες απο τις βασικές απαιτήσεις που θα πρέπει να πληρούνται για την δημιουργία μιας πλατφόρμας ηλεκτρονικής μάθησης είναι: Να υποστηρίζει τη δημιουργία βημάτων συζήτησης (discussion forums) και “δωματίων συζήτησης”(chat rooms),να υλοποιεί ηλεκτρονικό ταχυδρομείο,να έχει φιλικό περιβάλλον τόσο για το χρήστη/μαθητή όσο και για το χρήστη/καθηγητή,να υποστηρίζει προσωποποίηση(customization)του περιβάλλοντος ανάλογα με το χρήστη.Επίσης να κρατάει πληροφορίες(δημιουργία profiles)για το χρήστη για να τον “βοηθάει”κατά την πλοήγηση,να υποστηρίζει την εύκολη δημιουργία διαγωνισμάτων(online tests), να υποστηρίζει την παρουσίαση πολυμεσικών υλικών. Ως επεξεργασία φυσικής γλώσσας (NLP) ορίζουμε την υπολογιστική ανάλυση αδόμητων δεδομένων σε κείμενα, με σκοπό την επίτευξη μηχανικής κατανόησης του κειμένου αυτού.Είναι η επεξεργασία προτάσεων που εισάγονται ή διαβάζονται από το σύστημα,το οποίο απαντά επίσης με προτάσεις με τρόπο τέτοιο που να θυμίζει απαντήσεις μορφωμένου ανθρώπου. Βασικό ρόλο παίζει η γραμματική,το συντακτικό,η ανάλυση των εννοιολογικών στοιχείων και γενικά της γνώσης, για να γίνει κατανοητή η ανθρώπινη γλώσσα από τη μηχανή. Οι βασικές τεχνικές επεξεργασίας φυσικού κειμένου βασίζονται στις γενικές γνώσεις σχετικά με τη φυσική γλώσσα.Χρησιμοποιούν ορισμένους απλούς ευρετικούς κανόνες οι οποίοι στηρίζονται στη συντακτική και σημασιολογική προσέγγιση και ανάλυση του κειμένου.Ορισμένες τεχνικές που αφορούν σε όλα τα πεδία εφαρμογής είναι: ο διαμερισμός στα συστατικά στοιχεία του κειμένου (tokenization), η χρήση της διάταξης του κειμένου (structural data mining), η απαλοιφή λέξεων που δεν φέρουν ουσιαστική πληροφορία (elimination of insignificant words),η γραμματική δεικτοδότηση (PoS tagging), η μορφολογική ανάλυση και η συντακτική ανάλυση. Στόχος της παρούσας διπλωματικής είναι να περιγράψει και να αξιολογήσει πως οι τεχνικές επεξεργασίας της φυσικής γλώσσας (NLP), θα μπορούσαν να αξιοποιηθούν για την ενσωμάτωση τους σε πλατφόρμες ηλεκτρονικής μάθησης.Ο μεγάλος όγκος δεδομένων που παρέχεται μέσω μιας ηλεκτρονικής πλατφόρμας μάθησης, θα πρέπει να μπορεί να διαχειριστεί , να διανεμηθεί και να ανακτηθεί σωστά.Κάνοντας χρήση των τεχνικών NLP θα παρουσιαστεί μια καινοτόμα πλατφόρμα ηλεκτρονικής μάθησης,εκμεταλεύοντας τις υψηλού επιπέδου τεχνικές εξατομίκευσης, την δυνατότητα εξαγωγής συμπερασμάτων επεξεργάζοντας την φυσική γλώσσα των χρηστών προσαρμόζοντας το προσφερόμενο εκπαιδευτικό υλικό στις ανάγκες του κάθε χρήστη. / We live in a society in which the use of technology has entered dynamically in our life,the education could not be influenced by new Technologies. Terms such as "e-Learning" and "Asynchronous e-learning" have created new standards in the classical Education.
By the term “asynchronous e-learning” we mean a process of exchange of learning between teacher & student, performed regardless of time and place.
E-learning is the use of new multimedia technologies and the Internet to improve the quality of learning by facilitating access to information resources and services as well as remote exchanges .The term covers a wide range of applications and processes, such electronic classrooms, and digital collaboration, learning based on computers and Web technologies.
Some of the basic requirements that must be met to establish a platform for e-learning are: To support the creation of forums and chat rooms, to deliver email, has friendly environment for both user / student and user / teacher, support personalization depending to the user . Holding information (creating profiles) for the user in order to provide help in the navigation, to support easy creating exams (online tests), to support multimedia presentation materials.
As natural language processing (NLP) define the computational analysis of unstructured data in text, to achieve mechanical understanding of the text. To elaborate proposals that imported or read by the system, which also responds by proposals in a manner that reminds answers of educated man. A key role is played by the grammar, syntax, semantic analysis of data and general knowledge to understand the human language of the machine.
The main natural text processing techniques based on general knowledge about natural language .This techniques use some simple heuristic rules based on syntactic and semantic analysis of the text. Some of the techniques pertaining to all fields of application are: tokenization, structural data mining, elimination of insignificant words, PoS tagging, analyzing the morphological and syntactic analysis.
The aim of this study is to describe and evaluate how the techniques of natural language processing (NLP), could be used for incorporation into e-learning platforms. The large growth of data delivered through an online learning platform, should be able to manage, distributed and retrieved. By the use of NLP techniques will be presented an innovative e-learning platform, using the high level personalization techniques, the ability to extract conclusions digesting the user's natural language by customizing the offered educational materials to the needs of each user .
|
290 |
Data-driven language understanding for spoken dialogue systemsMrkšić, Nikola January 2018 (has links)
Spoken dialogue systems provide a natural conversational interface to computer applications. In recent years, the substantial improvements in the performance of speech recognition engines have helped shift the research focus to the next component of the dialogue system pipeline: the one in charge of language understanding. The role of this module is to translate user inputs into accurate representations of the user goal in the form that can be used by the system to interact with the underlying application. The challenges include the modelling of linguistic variation, speech recognition errors and the effects of dialogue context. Recently, the focus of language understanding research has moved to making use of word embeddings induced from large textual corpora using unsupervised methods. The work presented in this thesis demonstrates how these methods can be adapted to overcome the limitations of language understanding pipelines currently used in spoken dialogue systems. The thesis starts with a discussion of the pros and cons of language understanding models used in modern dialogue systems. Most models in use today are based on the delexicalisation paradigm, where exact string matching supplemented by a list of domain-specific rephrasings is used to recognise users' intents and update the system's internal belief state. This is followed by an attempt to use pretrained word vector collections to automatically induce domain-specific semantic lexicons, which are typically hand-crafted to handle lexical variation and account for a plethora of system failure modes. The results highlight the deficiencies of distributional word vectors which must be overcome to make them useful for downstream language understanding models. The thesis next shifts focus to overcoming the language understanding models' dependency on semantic lexicons. To achieve that, the proposed Neural Belief Tracking (NBT) model forsakes the use of standard one-hot n-gram representations used in Natural Language Processing in favour of distributed representations of user utterances, dialogue context and domain ontologies. The NBT model makes use of external lexical knowledge embedded in semantically specialised word vectors, obviating the need for domain-specific semantic lexicons. Subsequent work focuses on semantic specialisation, presenting an efficient method for injecting external lexical knowledge into word vector spaces. The proposed Attract-Repel algorithm boosts the semantic content of existing word vectors while simultaneously inducing high-quality cross-lingual word vector spaces. Finally, NBT models powered by specialised cross-lingual word vectors are used to train multilingual belief tracking models. These models operate across many languages at once, providing an efficient method for bootstrapping language understanding models for lower-resource languages with limited training data.
|
Page generated in 0.0272 seconds