  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Tagging a Morphologically Complex Language Using an Averaged Perceptron Tagger: The Case of Icelandic

Östling, Robert January 2013 (has links)
In this paper, we experiment with using Stagger, an open-source implementation of an Averaged Perceptron tagger, to tag Icelandic, a morphologically complex language. By adding language-specific linguistic features and using IceMorphy, an unknown word guesser, we obtain state-of-the-art tagging accuracy of 92.82%. Furthermore, by adding data from a morphological database and word embeddings induced from an unannotated corpus, the accuracy increases to 93.84%. This is equivalent to an error reduction of 5.5% compared to the previously best tagger for Icelandic, consisting of linguistic rules and a Hidden Markov Model.
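To make the method concrete, here is a minimal sketch of the averaged-perceptron update that taggers like Stagger build on. The features, tag names and toy examples below are invented for illustration, not taken from Stagger or the Icelandic corpus:

```python
from collections import defaultdict

class AveragedPerceptron:
    """Multiclass averaged perceptron (Collins-style): the final weights
    are averaged over all update steps, which stabilises the model."""
    def __init__(self):
        self.weights = defaultdict(lambda: defaultdict(float))
        self._totals = defaultdict(lambda: defaultdict(float))
        self._stamps = defaultdict(lambda: defaultdict(int))
        self.t = 0  # update counter

    def predict(self, features, tags):
        scores = {tag: sum(self.weights[f][tag] for f in features) for tag in tags}
        return max(tags, key=lambda tag: scores[tag])

    def update(self, truth, guess, features):
        self.t += 1
        if truth == guess:
            return
        for f in features:
            for tag, delta in ((truth, 1.0), (guess, -1.0)):
                # bank the current weight for the steps it was live
                self._totals[f][tag] += (self.t - self._stamps[f][tag]) * self.weights[f][tag]
                self._stamps[f][tag] = self.t
                self.weights[f][tag] += delta

    def average(self):
        for f, tag_weights in self.weights.items():
            for tag in tag_weights:
                self._totals[f][tag] += (self.t - self._stamps[f][tag]) * tag_weights[tag]
                tag_weights[tag] = self._totals[f][tag] / max(self.t, 1)

# Invented toy data: suffix features hinting at two made-up tags.
TAGS = ["noun", "verb"]
data = [({"suffix=ur", "word=hestur"}, "noun"),
        ({"suffix=ar", "word=talar"}, "verb")]
p = AveragedPerceptron()
for _ in range(5):
    for feats, gold in data:
        p.update(gold, p.predict(feats, TAGS), feats)
p.average()
print(p.predict({"suffix=ar"}, TAGS))  # verb
```

Averaging the weights over all update steps makes the final model far less sensitive to the order in which training examples are seen than a vanilla perceptron.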
2

Sentiment analysis of products’ reviews containing English and Hindi texts

Singh, J.P., Rana, Nripendra P., Alkhowaiter, W. 26 September 2020 (has links)
Online shopping is growing rapidly because of the convenience of buying from home and comparing products through reviews written by other purchasers. When people buy a product, they express their emotions about it in the form of a review. In the Indian context, reviews are found to contain Hindi text along with English, and much of the Hindi text contains opinionated words like bahut achha, bakbas, pesa wasool, etc. We have tried to identify the different Hindi texts appearing in product reviews written on Indian e-commerce portals. We have also developed a system that takes reviews containing both Hindi and English text and determines the sentiment expressed for each attribute of the product, as well as an overall assessment of the product.
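As a rough illustration of lexicon-based scoring over such code-mixed reviews (the mini-lexicon, its scores and the sample reviews below are invented; the paper's actual resources and method are not shown here):

```python
# Invented mini-lexicon of romanised-Hindi and English opinion phrases;
# scores are illustrative only. Spellings follow the abstract above.
LEXICON = {
    "bahut achha": 2, "pesa wasool": 2, "achha": 1, "good": 1,
    "bakbas": -2, "bad": -1,
}

def review_sentiment(text):
    """Score a mixed Hindi/English review by summing lexicon hits;
    longer phrases are matched first to avoid partial double counts."""
    text = text.lower()
    score = 0
    for phrase in sorted(LEXICON, key=len, reverse=True):
        if phrase in text:
            score += LEXICON[phrase]
            text = text.replace(phrase, " ")
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(review_sentiment("Battery life bahut achha, totally pesa wasool"))  # positive
print(review_sentiment("Screen quality is bakbas"))  # negative
```

Naive substring matching like this is only a sketch; a real system would tokenise, handle negation, and attach scores to specific product attributes.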
3

Unsupervised Natural Language Processing for Knowledge Extraction from Domain-specific Textual Resources

Hänig, Christian 25 April 2013 (has links) (PDF)
This thesis aims to develop a Relation Extraction algorithm to extract knowledge from automotive data. While most approaches to Relation Extraction are evaluated only on newspaper data dealing with general relations from the business world, their applicability to other data sets is not well studied. Part I of this thesis deals with the theoretical foundations of Information Extraction algorithms. Text mining cannot be seen as the simple application of data mining methods to textual data; instead, sophisticated methods have to be employed to accurately extract knowledge from text, which can then be mined using statistical methods from the field of data mining. Information Extraction itself can be divided into two subtasks: Entity Detection and Relation Extraction. The detection of entities is very domain-dependent due to terminology, abbreviations and general language use within the given domain, so this task has to be solved for each domain using thesauri or another type of lexicon. Supervised approaches to Named Entity Recognition will not achieve reasonable results unless they have been trained for the given type of data. The task of Relation Extraction can basically be approached by pattern-based and kernel-based algorithms. The latter achieve state-of-the-art results on newspaper data and point out the importance of linguistic features. In order to analyze relations contained in textual data, syntactic features like part-of-speech tags and syntactic parses are essential. Chapter 4 presents the machine learning approaches and linguistic foundations essential for syntactic annotation of textual data and Relation Extraction. Chapter 6 analyzes the performance of state-of-the-art algorithms for POS tagging, syntactic parsing and Relation Extraction on automotive data. The finding is that supervised methods trained on newspaper corpora do not achieve accurate results when applied to automotive data, for various reasons.
Besides low-quality text, the nature of automotive relations poses the main challenge. Automotive relation types of interest (e.g. component – symptom) are rather arbitrary compared to well-studied relation types like is-a or is-head-of. In order to achieve acceptable results, algorithms have to be trained directly on this kind of data. As the manual annotation of data for each language and data type is too costly and inflexible, unsupervised methods are the ones to rely on. Part II deals with the development of dedicated algorithms for all three essential tasks. Unsupervised POS tagging (Chapter 7) is a well-studied task for which accurate algorithms exist, but none of them disambiguates high-frequency words; only out-of-lexicon words are disambiguated. Most high-frequency words bear syntactic information, so it is very important to differentiate between their different functions, and domain languages in particular contain ambiguous, highly frequent words bearing semantic information (e.g. pump). To improve POS tagging, an algorithm for disambiguation is developed and used to enhance an existing state-of-the-art tagger. This approach is based on context clustering, which is used to detect a word type's different syntactic functions. Evaluation shows that tagging accuracy is raised significantly. An approach to unsupervised syntactic parsing (Chapter 8) is developed to satisfy the requirements of Relation Extraction: high-precision results on nominal and prepositional phrases, as they contain the entities relevant for Relation Extraction, and accurate shallow parsing, which facilitates Relation Extraction more than deep binary parsing does. Endocentric and exocentric constructions can be distinguished, which improves phrase labeling. unsuParse is based on preferred positions of word types within phrases to detect phrase candidates.
Iterating the detection of simple phrases successively induces deeper structures. The proposed algorithm fulfills all the demanded criteria and achieves competitive results on standard evaluation setups. Syntactic Relation Extraction (Chapter 9) is an approach exploiting syntactic statistics and text characteristics to extract relations between previously annotated entities. The approach is based on entity distributions given in a corpus and thus provides a way to extend text mining processes to new data in an unsupervised manner. Evaluation on two different languages and two different text types from the automotive domain shows that it achieves accurate results on repair-order data. Results are less accurate on internet data, but the tasks of sentiment analysis and extraction of the opinion target can be mastered. Thus, the incorporation of internet data is possible and important, as it provides useful insight into the customer's thoughts. To conclude, this thesis presents a complete unsupervised workflow for Relation Extraction – except for the highly domain-dependent Entity Detection task – improving the performance of each of the involved subtasks compared to state-of-the-art approaches. Furthermore, this work applies Natural Language Processing methods and Relation Extraction approaches to real-world data, unveiling challenges that do not occur in high-quality newspaper corpora.
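The context-clustering idea behind the unsupervised disambiguation step can be caricatured in a few lines. The corpus, the ambiguous word and the single-feature heuristic below are all invented stand-ins for the thesis's actual clustering over full context distributions:

```python
# Invented toy corpus in which "pump" occurs as both a noun and a verb.
corpus = [
    "the pump is broken", "a pump was replaced",
    "we pump the fluid", "they pump water daily",
]

def context_signatures(word, sentences):
    """Collect the immediate left/right neighbours of each occurrence."""
    sigs = []
    for s in sentences:
        toks = s.split()
        for i, t in enumerate(toks):
            if t == word:
                left = toks[i - 1] if i > 0 else "<s>"
                right = toks[i + 1] if i + 1 < len(toks) else "</s>"
                sigs.append((left, right))
    return sigs

def cluster_by_left_context(sigs, determiners=("the", "a", "an")):
    """Crude one-feature split: a determiner on the left suggests a
    nominal use; a real system would cluster full context vectors."""
    clusters = {"nominal": [], "other": []}
    for left, right in sigs:
        clusters["nominal" if left in determiners else "other"].append((left, right))
    return clusters

clusters = cluster_by_left_context(context_signatures("pump", corpus))
print(len(clusters["nominal"]), len(clusters["other"]))  # 2 2
```

Splitting one word type into several context-specific pseudo-types in this spirit is what allows a downstream unsupervised tagger to assign them different tags.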
5

Linguistic Knowledge Transfer for Enriching Vector Representations

Kim, Joo-Kyung 12 December 2017 (has links)
No description available.
6

Part-of-Speech Bootstrapping Using Lexically-Specific Frames

Leibbrandt, Richard Eduard, richard.leibbrandt@flinders.edu.au January 2009 (has links)
The work in this thesis presents and evaluates a number of strategies by which English-learning children might discover the major open-class parts-of-speech in English (nouns, verbs and adjectives) on the basis of purely distributional information. Previous work has shown that parts-of-speech can be readily induced from the distributional patterns in which words occur. The research reported in this thesis extends and improves on this previous work in two major ways, related to the constructional status of the utterance contexts used for distributional analysis, and to the way in which previous studies have dealt with categorial ambiguity. Previous studies that have induced parts-of-speech from word distributions have done so on the basis of fixed “windows” of words that occur before and after the word in focus. These contexts are often not constructions of the language in question, and hence have dubious status as elements of linguistic knowledge. A great deal of recent evidence (e.g. Lieven, Pine & Baldwin, 1997; Tomasello, 1992) has suggested that children’s early language may be organized around a number of lexically-specific constructional frames with slots, such as “a X”, “you X it”, “draw X on X”. The work presented here investigates the possibility that constructions such as these may be a more appropriate domain for the distributional induction of parts-of-speech. This would open up the possibility of a treatment of part-of-speech induction that is more closely integrated with the acquisition of syntax. Three strategies to discover lexically-specific frames in the speech input to children are presented. Two of these strategies are based on the interplay between more and less frequent words in English utterances: the more frequent words, which are typically function words or light verbs, are taken to provide the schematic “backbone” of an utterance. 
The third strategy is based around pairs of words in which the occurrence of one word is highly predictable from that of the other, but not vice versa; from these basic slot-filler relationships, larger frames are assembled. These techniques were implemented computationally and applied to a corpus of child-directed speech. Each technique yielded a large set of lexically-specific frames, many of which could plausibly be regarded as constructions. In a comparison with a manual analysis of the same corpus by Cameron-Faulkner, Lieven and Tomasello (2003), it is shown that most of the constructional frames identified in the manual analysis were also produced by the automatic techniques. After the identification of potential constructional frames, parts-of-speech were formed from the patterns of co-occurrence of words in particular constructions, by means of hierarchical clustering. The resulting clusters are shown to be quite similar to the major English parts-of-speech of nouns, verbs and adjectives. Each individual word token was assigned a part-of-speech on the basis of its constructional context. This categorization was evaluated empirically against the part-of-speech assigned to the word in question in the original corpus. The resulting categorization is shown to be, to a great extent, in agreement with the manual categorization. These strategies deal with the categorial ambiguity of words by allowing the frame context to determine part-of-speech. However, many of the frames produced were themselves ambiguous cues to part-of-speech. For this reason, strategies are presented to deal with both word and context ambiguity. Three such strategies are proposed. One considers membership of a part-of-speech to be a matter of degree for both word and contextual frame. A second strategy attempts to discretely assign multiple parts-of-speech to words and constructions in a way that imposes internal consistency in the corpus.
The third strategy attempts to assign only the minimally-required multiple categories to words and constructions so as to provide a parsimonious description of the data. Each of these techniques was implemented and applied to each of the three frame discovery techniques, thereby providing category information about both the frame and the word. The subsequent assignment of parts-of-speech was done by combining word and frame information, and is shown to be far more accurate than the categorization based on frames alone. This approach can be regarded as addressing certain objections against the distributional method that have been raised by Pinker (1979, 1984, 1987). Lastly, a framework for extending this research is outlined that allows semantic information to be incorporated into the process of category induction.
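A toy version of the frame-discovery idea, replacing each word in an utterance with a slot and collecting the words that fill it (the utterances are invented; the thesis's actual discovery strategies are more elaborate):

```python
from collections import Counter, defaultdict

# Invented child-directed utterances.
utterances = [
    "a ball", "a dog", "a cup",
    "you push it", "you throw it", "you kick it",
]

def frames_and_fillers(utts):
    """For each utterance, replace each word in turn with a slot X,
    yielding candidate frames, and count the words filling each slot."""
    fillers = defaultdict(Counter)
    for u in utts:
        toks = u.split()
        for i, w in enumerate(toks):
            frame = " ".join(toks[:i] + ["X"] + toks[i + 1:])
            fillers[frame][w] += 1
    return fillers

fillers = frames_and_fillers(utterances)
# Words sharing a frame pattern like nouns ("a X") or verbs ("you X it").
print(sorted(fillers["a X"]))       # ['ball', 'cup', 'dog']
print(sorted(fillers["you X it"]))  # ['kick', 'push', 'throw']
```

Clustering words by the frames they share, rather than by fixed word windows, is what ties the part-of-speech induction to constructional knowledge.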
7

A Hybrid Environment for Syntax-Semantic Tagging

Padró, Lluís 06 February 1998 (has links)
The thesis describes the application of the relaxation labelling algorithm to NLP disambiguation. Language is modelled through context constraints inspired by Constraint Grammars. The constraints enable the use of a real value stating "compatibility". The technique is applied to POS tagging, Shallow Parsing and Word Sense Disambiguation. Experiments and results are reported. The proposed approach enables the use of multi-feature constraint models, the simultaneous resolution of several NL disambiguation tasks, and the collaboration of linguistic and statistical models.
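A stripped-down sketch of the relaxation-labelling iteration for one word (the support scores here are fixed, invented numbers; in the thesis they express constraint compatibilities and are recomputed from neighbouring words at each step):

```python
def relax(labels, support, steps=10):
    """Relaxation labelling for a single word: the probability of each
    candidate label is repeatedly scaled by its support and renormalised,
    so the best-supported label comes to dominate."""
    p = {l: 1.0 / len(labels) for l in labels}
    for _ in range(steps):
        raw = {l: p[l] * (1.0 + support[l]) for l in labels}
        z = sum(raw.values())
        p = {l: raw[l] / z for l in labels}
    return p

p = relax(["NOUN", "VERB"], {"NOUN": 0.8, "VERB": 0.2})
print(max(p, key=p.get))  # NOUN
```

Because support is a real value rather than a binary constraint, linguistic and statistical evidence can be mixed in the same update, which is the point the abstract makes.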
8

Leis de Escala nos gastos com saneamento básico: dados do SIOP e DOU / Scaling Patterns in Basic Sanitation Expenditure: data from SIOP and DOU

Ribeiro, Ludmila Deute 14 March 2019 (has links)
Starting in the late 20th century, the Brazilian federal government created several programs to increase access to water and sanitation. Although these programs improved water access, sanitation was generally overlooked: while piped water supply and waste collection are available in the majority of Brazilian municipalities, sewage collection is still spatially concentrated in the Southeast region and in the most urbanized areas. To explain this spatially concentrated pattern, it is frequently assumed that city size matters for sanitation provision, especially for sewage collection. As cities grow in size, one should expect economies of scale in sanitation infrastructure volume. Economies of scale in infrastructure mean a decrease in basic sanitation costs proportional to city size, leading to an expected power-law relationship between expenditure on sanitation and city size. Using population, N(t), as the measure of city size at time t, power-law scaling for infrastructure takes the form Y(t) = Y0·N(t)^β, where β ≈ 0.8 < 1, Y denotes infrastructure volume and Y0 is a constant. Many diverse properties of cities, from patent production and personal income to electrical cable length, are power-law functions of population size with scaling exponents β that fall into distinct universality classes: quantities reflecting wealth creation and innovation have β ≈ 1.2 > 1 (increasing returns), whereas those accounting for infrastructure display β ≈ 0.8 < 1 (economies of scale). We verified this relationship using data from the federal government's Integrated Planning and Budgeting System (SIOP), covering non-onerous transfers for basic sanitation provided for in the Annual Budget Law (LOA).
Preliminary results from SIOP show that federal transfers for basic sanitation decrease proportionally to the size of the beneficiary municipality. For the initial budget allocation, β was found to be roughly 0.63 for municipalities above two thousand inhabitants, 0.92 for municipalities above twenty thousand inhabitants, and 1.18 for municipalities above fifty thousand inhabitants. The second data source is the Diário Oficial da União (DOU), the federal government journal for publishing official acts. DOU data provide information not only about grants but also about basic sanitation loans backed by FGTS funds. To extract data from the DOU we applied Natural Language Processing (NLP) techniques.
These techniques often work better when the algorithms are provided with annotations, metadata that give additional information about the text. We therefore built a database of annotated DOU texts and used it to train a bidirectional LSTM model applied to POS tagging and named-entity recognition. Preliminary results obtained in this way are reported in the text.
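The scaling exponent β can be estimated as the slope of an ordinary least-squares fit on log-log axes; a sketch on synthetic data (all figures below are generated for illustration, not SIOP values):

```python
import math
import random

# Synthetic municipalities obeying Y = Y0 * N**beta with beta = 0.8.
random.seed(0)
beta_true, y0 = 0.8, 2.0
pops = [10 ** random.uniform(3, 6) for _ in range(200)]
spend = [y0 * n ** beta_true * math.exp(random.gauss(0, 0.05)) for n in pops]

# The OLS slope on log-log axes is the scaling-exponent estimate:
# log Y = log Y0 + beta * log N.
xs = [math.log(n) for n in pops]
zs = [math.log(y) for y in spend]
mx, mz = sum(xs) / len(xs), sum(zs) / len(zs)
beta_hat = (sum((x - mx) * (z - mz) for x, z in zip(xs, zs))
            / sum((x - mx) ** 2 for x in xs))
print(round(beta_hat, 2))  # close to 0.8
```

A fitted slope below 1 indicates economies of scale, while a slope above 1 indicates increasing returns, matching the universality classes the abstract describes.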
9

Word Classes in Language Modelling

Erikson, Emrik, Åström, Marcus January 2024 (has links)
This thesis concerns itself with word classes and their application to language modelling. Considering a purely statistical Markov model trained on sequences of word classes in the Swedish language, different problems in language engineering are examined. Problems considered are part-of-speech tagging, evaluating text modifiers such as translators with the help of probability measurements and matrix norms, and lastly detecting different types of text using the Fourier transform of cross-entropy sequences of word classes. The results show that the word-class language model is quite weak by itself but that it is able to improve part-of-speech tagging for 1- and 2-letter models. There are indications that a stronger word-class model could aid 3-letter and potentially even stronger models. For evaluating modifiers the model is often able to distinguish between shuffled and sometimes translated text, as well as to assign a score as to how much a text has been modified. Future work on this should however take better care to ensure large enough test data. The results from the Fourier approach indicate that a Fourier analysis of the cross-entropy sequence between word classes may allow the model to distinguish A.I.-generated text as well as translated text from human-written text. Future work on machine learning word-class models could be carried out to get further insights into the role of word-class models in modern applications. The results could also give interesting insights in linguistic research regarding word classes.
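A minimal sketch of the kind of word-class bigram model with cross-entropy scoring the abstract describes (the classes and training sequences below are invented; the thesis works with Swedish data):

```python
import math
from collections import Counter

# Invented word-class sequences standing in for a tagged corpus.
train = ["DT NN VB DT NN", "DT NN VB", "NN VB DT NN"]

bigrams, unigrams = Counter(), Counter()
for seq in train:
    toks = ["<s>"] + seq.split()
    for a, b in zip(toks, toks[1:]):
        bigrams[(a, b)] += 1
        unigrams[a] += 1

def cross_entropy(seq, alpha=0.1, vocab=5):
    """Per-class cross entropy (bits) of a class sequence under the
    add-alpha-smoothed bigram model."""
    toks = ["<s>"] + seq.split()
    h = 0.0
    for a, b in zip(toks, toks[1:]):
        p = (bigrams[(a, b)] + alpha) / (unigrams[a] + alpha * vocab)
        h -= math.log2(p)
    return h / (len(toks) - 1)

# A sequence resembling the training data scores much lower entropy.
print(cross_entropy("DT NN VB") < cross_entropy("VB VB VB"))  # True
```

A sequence of per-position cross-entropy values computed this way is exactly the kind of signal whose Fourier transform the thesis inspects when trying to separate text types.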
