1

An investigation into lemmatization in Southern Sotho

Makgabutlane, Kelebohile Hilda 01 1900 (has links)
Lemmatization refers to the process whereby a lexicographer assigns a specific place in a dictionary to the word form regarded as the most basic among its related forms. In Bantu languages, formative elements can be added to one another in an often seemingly interminable series until quite long words are produced, which makes lemmatization a particularly interesting problem. Given the productive morphology of Southern Sotho, it is instructive to observe how lexicographers handle the morphological complexities they face when arranging lexical items. This study shows that adhering to the traditional method of alphabetization runs into difficulties. It does not aim to propose solutions, but it does point out considerations that should be borne in mind in the process of lemmatization. / African Languages / M.A. (African Languages)
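For readers approaching this computationally, the lexicographic move the abstract describes, filing many morphologically related forms under one basic form that carries the dictionary entry, can be sketched as a simple lookup table. The word forms below are hypothetical placeholders, not actual Southern Sotho data:

```python
# Minimal sketch of lemma-based dictionary lookup: many related surface forms
# are filed under one basic form that carries the entry.
# All word forms below are hypothetical placeholders, not Southern Sotho data.
LEMMA_INDEX = {
    "reka": "reka",        # hypothetical basic form
    "rekile": "reka",      # hypothetical related forms mapped to it
    "rekisa": "reka",
    "rekisitse": "reka",
}

DICTIONARY = {
    "reka": "entry text filed under the basic form",
}

def look_up(surface_form: str) -> str | None:
    """Reduce a surface form to its lemma, then fetch the dictionary entry."""
    lemma = LEMMA_INDEX.get(surface_form)
    return DICTIONARY.get(lemma) if lemma else None

print(look_up("rekisitse"))  # resolves to the entry filed under "reka"
```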
2

Solución para determinar la relevancia de un texto por medio del nivel de subjetividad en textos digitales / A solution for determining the relevance of a text by means of the level of subjectivity in digital texts

Pajuelo Huayta, Luis Enrique, Gómez Mandujano, Juan Carlos 27 September 2019 (has links)
Today the internet is the most widely used medium and hosts a vast amount of textual information on diverse topics. In most cases, however, this information is not governed by any information-quality criteria, since any user can publish and edit content; this creates the need for automated procedures that can filter the content of texts on the web. The main objective of the project is to implement a solution that identifies the degree of subjectivity of a text on the basis of a data dictionary, made possible by implementing processes that help determine the subjectivity of texts. The software developed in the project is based on open-source software and analyzes and stores a set of words according to a frequency weighting of the subjectivity estimated for each distribution, thereby building a dictionary. To this end, all words are reduced to their base form, regardless of morphological variation, using natural language processing techniques. The result of the project is a software solution that computes the degree of subjectivity; it processes the information, stores it, and presents it through reports. The solution was validated to verify its effectiveness, and the results showed a satisfactory effectiveness rate. / Thesis
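A minimal sketch of the kind of pipeline the abstract describes: reduce each word to its base form, look it up in a subjectivity dictionary, and aggregate the weights into a document-level score. The lexicon contents and the averaging rule are illustrative assumptions, not the thesis's actual dictionary or formula, and NLTK's WordNet lemmatizer stands in for the (Spanish-oriented) NLP tooling used in the project:

```python
# Sketch: score a text's subjectivity from a weighted word dictionary.
# Lexicon values and the mean-based scoring rule are illustrative assumptions.
import re
from nltk.stem import WordNetLemmatizer  # pip install nltk; nltk.download("wordnet")

# Hypothetical subjectivity lexicon: lemma -> weight in [0, 1].
SUBJECTIVITY_LEXICON = {
    "terrible": 0.9,
    "beautiful": 0.8,
    "think": 0.6,
    "be": 0.1,
    "report": 0.2,
}

lemmatizer = WordNetLemmatizer()

def subjectivity_score(text: str) -> float:
    """Average the lexicon weights of the lemmatized words found in the text."""
    tokens = re.findall(r"[a-záéíóúñü]+", text.lower())
    lemmas = [lemmatizer.lemmatize(tok, pos="v") for tok in tokens]
    weights = [SUBJECTIVITY_LEXICON[l] for l in lemmas if l in SUBJECTIVITY_LEXICON]
    return sum(weights) / len(weights) if weights else 0.0

print(subjectivity_score("I think the report is terrible"))
```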
3

The lemmatization of Tshivenda lexical items

Mantsha, Avhavhudzani Virginia January 2012 (has links)
Thesis (M.A. (African Languages)) -- University of Limpopo, 2013 / The study focuses on the lemmatization of lexical items in Tshivenḓa. It was conducted by reviewing selected Tshivenḓa dictionaries; the lexical items investigated were nouns, locatives, verbs and adjectives. The analysis examined the approaches used in the macro- and micro-structural treatment of these important lexical items in dictionaries. The study also covered the treatment of the morphological, syntactic and semantic aspects of these lexical items in Tshivenḓa. The research concludes with recommendations that will help dictionary compilers overcome the challenges they experience when lemmatizing nouns, locatives, verbs and adjectives.
4

Making Sense of Online Reviews: A Machine Learning Approach: An Abstract

Harrison, Dana E., Ajjan, Haya 01 January 2020 (has links)
It is estimated that 80% of companies' data is unstructured. Unstructured data, that is, data not organized in predefined numerical values, continues to grow at a rapid pace. Images, text, videos and voice are all examples of unstructured data. Companies can use this type of data to leverage novel insights unavailable through more easily managed, structured data. Unstructured data, however, poses a challenge, since it often requires substantial coding before an analysis can be performed. The purpose of this study is to describe the steps and introduce computational methods that can be adopted to explore unstructured online reviews. The unstructured nature of online reviews requires extensive text analytics processing. This study introduces methods for text analytics including tokenization at the sentence level, lemmatization or stemming to reduce inflectional forms of the words appearing in the text, and a ‘bag of n-grams’ approach. We also introduce lexicon-based feature engineering and methods for developing new lexicons that capture theoretically established constructs and relationships specific to the domain of study. The numeric features generated in the analysis are then analyzed using machine learning algorithms. This process can be applied to the analysis of other unstructured data, such as dyadic information exchange between customer service, salespeople, customers and channel members. Although these examples are not comprehensive, companies can apply the results of unstructured data analysis to examine a variety of outcomes related to customer decisions, channel management and the mitigation of potential crises. Understanding interdisciplinary methods of analyzing unstructured data is critical as the availability of this type of data continues to accelerate and enables researchers to develop theoretical contributions within the marketing discipline.
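The processing steps named here (sentence-level tokenization, stem/lemma normalization, a bag of n-grams, and a machine learning algorithm) compose into a short pipeline. The sketch below uses NLTK and scikit-learn; the two-review "dataset" and its sentiment labels are toy placeholders, not data from the study:

```python
# Sketch of the described pipeline: tokenize -> stem -> bag of n-grams -> classifier.
# The two-review "dataset" and its labels are toy placeholders.
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize  # nltk.download("punkt")
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

stemmer = PorterStemmer()

def normalize(review: str) -> str:
    """Sentence-tokenize, word-tokenize, and stem to collapse inflected forms."""
    stems = [
        stemmer.stem(word)
        for sentence in sent_tokenize(review)
        for word in word_tokenize(sentence)
    ]
    return " ".join(stems)

reviews = ["The delivery was fast and the product works great.",
           "Terrible support, the device stopped working after a week."]
labels = [1, 0]  # toy sentiment labels

# 'Bag of n-grams': unigram + bigram counts become numeric features.
vectorizer = CountVectorizer(ngram_range=(1, 2))
features = vectorizer.fit_transform(normalize(r) for r in reviews)

model = LogisticRegression().fit(features, labels)
print(model.predict(vectorizer.transform([normalize("works great")])))
```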
5

O efeito do uso de diferentes formas de extração de termos na compreensibilidade e representatividade dos termos em coleções textuais na língua portuguesa / The effect of using different forms of terms extraction on its comprehensibility and representability in Portuguese textual domains

Conrado, Merley da Silva 10 September 2009 (has links)
Term extraction from text collections, a pre-processing activity in Text Mining, can serve many purposes in knowledge extraction processes. These terms must be extracted carefully, since the results of the whole process depend largely on the "quality" of the terms obtained. In this work, term "quality" covers both the representativeness of the terms in the domain in question and their comprehensibility. Given this importance, this work evaluates the effect of different term simplification techniques on the comprehensibility and representativeness of terms in Portuguese text collections. The terms were extracted following the methodology presented in this work, using three techniques: stemming (radicalização), lemmatization and nominalization (substantivação). To support this methodology, a term extraction tool, ExtraT (Ferramenta para Extração de Termos), was developed. To guarantee the quality of the extracted terms, they were evaluated both objectively and subjectively. The subjective evaluations, carried out with the help of domain specialists, cover the representativeness of the terms in their respective documents, the comprehensibility of the terms obtained with each technique, and the specialists' overall preference for each technique. The objective evaluations, assisted by a purpose-built tool, TaxEM (Taxonomia em XML da Embrapa), and by Thesagro (National Agricultural Thesaurus), consider the number of terms extracted by each technique as well as the representativeness of those terms with respect to their documents, using the CTW (Context Term Weight) measure as support. Eight collections of real texts from the agribusiness domain were used in the experimental evaluation. As a result, positive and negative characteristics of each term simplification technique are pointed out, showing that the choice of technique for this domain depends on the main pre-established goal, which may range from the need for terms that are comprehensible to the user to the need to work with a smaller number of terms.
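To make the simplification techniques concrete: stemming cuts a form down to a stem, while lemmatization maps it to a dictionary form. The sketch below contrasts the two on Portuguese word forms, using NLTK's RSLP stemmer for Portuguese together with a tiny hand-written lemma table standing in for a real lemmatizer (the table entries are illustrative, not output of the thesis's ExtraT tool); nominalization is only hinted at in a closing comment:

```python
# Contrast of two term-simplification techniques on Portuguese word forms.
# The lemma table is a tiny hand-made stand-in, not a real lemmatizer.
from nltk.stem import RSLPStemmer  # pip install nltk; nltk.download("rslp")

stemmer = RSLPStemmer()

# Hypothetical lemma table: surface form -> dictionary form.
LEMMAS = {
    "extraídos": "extrair",
    "extração": "extração",
    "termos": "termo",
}

for form in ["extraídos", "extração", "termos"]:
    print(f"{form}: stem={stemmer.stem(form)}, lemma={LEMMAS.get(form, '?')}")

# Nominalization (substantivação) would additionally rewrite verbs/adjectives
# as nouns, e.g. "extrair" -> "extração", which requires derivational resources.
```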
6

Určení základního tvaru slova / Determination of the basic form of words

Šanda, Pavel January 2011 (has links)
Lemmatization is an important preprocessing step for many text mining applications. The lemmatization process is similar to stemming, with the difference that it determines not only the word stem but the basic form of the word, here using the Brute Force and Suffix Stripping methods. The main aim of this thesis is to present methods for the algorithmic improvement of Czech lemmatization. The training data sets created for this work form part of the thesis and can be freely used for student and academic work dealing with similar problems.
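The Suffix Stripping idea the abstract mentions can be sketched in a few lines: strip a known inflectional suffix, attach a candidate base-form ending, and keep the result if it appears in a lexicon of known lemmas. The suffix rules and mini-lexicon below are illustrative stand-ins, far simpler than real Czech morphology or the thesis's training data:

```python
# Suffix-stripping sketch: strip an inflectional ending, try a candidate
# base-form ending, accept the first result found in a lexicon of known lemmas.
# Rules and lexicon are illustrative toys, far simpler than real Czech morphology.

# (stripped suffix, candidate replacement) pairs, tried longest-first.
SUFFIX_RULES = [
    ("ového", "ý"),   # toy rule: adjective genitive -> base form
    ("ami", "a"),     # toy rule: noun instrumental plural -> base form
    ("u", ""),
    ("y", "a"),
]

LEXICON = {"žena", "hrad", "nový"}  # known basic forms

def lemmatize(word: str) -> str | None:
    if word in LEXICON:
        return word
    for suffix, replacement in sorted(SUFFIX_RULES, key=lambda r: -len(r[0])):
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in LEXICON:
                return candidate
    return None  # fall back to Brute Force lookup / manual resolution

print(lemmatize("ženami"))  # -> "žena"
print(lemmatize("nového"))  # -> "nový"
```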
7

Paralelní korpusový manažer / Parallel Corpus Manager

Kouřil, Jan January 2011 (has links)
The goal of this diploma project was to implement a parallel corpus manager that can align parallel texts in different languages and insert them into a corpus, where several further processing functions are provided. The program offers automatic text alignment and interactive editing of the alignment; the aligned texts are then inserted into the corpus. The program can work with multiple corpora, and a parallel corpus is always identified by a pair of languages. Within a corpus it is possible to search by many categories, view and edit particular selections, lemmatize and morphologically tag texts, sort selections, import and export data, edit the corpus in various ways for easier navigation, and add new expressions to managed dictionaries. The individual chapters introduce corpus linguistics, the theory of parallel text alignment, morphological tagging and lemmatization, the external tools used in the program, the most common subtitle formats, and the implementation of the particular problems.
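Automatic alignment of parallel texts is commonly done with length-based methods in the spirit of Gale-Church. The toy aligner below only handles 1:1 pairings by comparing character lengths and flags suspicious pairs for the kind of interactive editing the abstract mentions; a real implementation must also score 1:2, 2:1 and 0:1 pairings via dynamic programming, and nothing here is taken from the actual program:

```python
# Toy length-based aligner: pair sentences 1:1 and flag suspicious pairs
# whose character-length ratio suggests the alignment drifted.
# A real Gale-Church aligner also considers 1:2 / 2:1 / 0:1 beads via DP.
from itertools import zip_longest

def align_1_to_1(src_sents: list[str], tgt_sents: list[str], max_ratio: float = 1.8):
    pairs = []
    for src, tgt in zip_longest(src_sents, tgt_sents, fillvalue=""):
        ratio = (len(src) + 1) / (len(tgt) + 1)
        suspicious = ratio > max_ratio or ratio < 1 / max_ratio
        pairs.append((src, tgt, suspicious))  # suspicious pairs go to manual editing
    return pairs

english = ["The cat sat on the mat.", "It was raining."]
czech = ["Kočka seděla na rohožce.", "Pršelo."]
for src, tgt, flag in align_1_to_1(english, czech):
    print(("CHECK " if flag else "OK    ") + f"{src} <-> {tgt}")
```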
8

Rychlá adaptace počítačové podpory hry Krycí jména pro nové jazyky / Fast Adaptation of Codenames Computer Assistant for New Languages

Jareš, Petr January 2021 (has links)
This thesis extends an artificial-player system for the word-association game Codenames so that support for new languages can be added easily. The system can play Codenames as a guessing player, as a clue giver or, by combining the two, as a player of the Duet version. For the analysis of different languages the neural toolkit Stanza is used; it is language-independent and enables automated processing of many languages, here mainly lemmatization and part-of-speech tagging for the selection of clues in the game. Several models were tested for evaluating word associations, with the best results obtained by pointwise mutual information (PMI) and the predictive fastText model. The system supports playing Codenames in 36 languages covering 8 different alphabets.
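Of the association models mentioned, pointwise mutual information is simple enough to show inline: PMI(x, y) = log(p(x, y) / (p(x) p(y))), estimated from co-occurrence counts. The corpus counts below are fabricated for illustration, and the Stanza lemmatization step the thesis relies on is reduced to a comment:

```python
# PMI between a candidate clue and a board word, from co-occurrence counts.
# PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ); all counts below are fabricated.
# In the actual system, words would first be lemmatized (e.g. with Stanza).
import math
from collections import Counter

word_counts = Counter({"river": 120, "bank": 80, "money": 150})
pair_counts = Counter({("river", "bank"): 30, ("money", "bank"): 25})
total_words = 10_000      # total tokens in the (hypothetical) corpus
total_pairs = 5_000       # total co-occurrence windows

def pmi(x: str, y: str) -> float:
    p_xy = pair_counts[(x, y)] / total_pairs
    p_x = word_counts[x] / total_words
    p_y = word_counts[y] / total_words
    return math.log(p_xy / (p_x * p_y))

# Rank candidate clues for the board word "bank" by association strength.
for clue in ["river", "money"]:
    print(clue, round(pmi(clue, "bank"), 2))
```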
9

Popis staročeské apelativní deklinace (se zřetelem k automatické morfologické analýze textů ve Staročeské textové bance) / Description of Old Czech Common Nouns Declension (with regard to Automatic Morphological Analysis of Texts in Old Czech Text Bank)

Synková, Pavlína January 2017 (has links)
The thesis aims at an explicit description of Old Czech common-noun declension with regard to its application in a tool for automatic morphological analysis of (digitized) Old Czech texts. The description is intended to serve as a basis for the automatic generation of word forms (together with their appropriate morphological information and lemma), which will then be used to assign morphological categories (gender, number, case) and a lemma to word forms occurring in Old Czech digitized texts. The thesis thus lays the groundwork for the first step in transforming the text banks that currently exist for the Old Czech period into an Old Czech corpus offering more possibilities for linguistic research. The Old Czech period is defined as running from the beginning of the 14th century (more precisely, from the period when the first coherent texts written in Czech appeared) to approximately the end of the 15th century. Nouns were chosen for this work because they cover approximately 30% of texts in present-day Czech, the highest share of all parts of speech. Old Czech texts are taken into account only in transcribed form (based on the transcription rules used in the Old Czech Text Bank developed at the Institute of the Czech Language of the Academy of Sciences of the Czech Republic). On the one...
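The intended use, generating all word forms of a lemma from a declension description and then inverting that table to assign lemma and categories to forms found in texts, can be sketched as follows. The endings shown are a simplified modern-Czech-like hard-masculine paradigm, not the thesis's actual Old Czech paradigms (which also cover the dual and period-specific variants):

```python
# Sketch: generate forms from (stem + paradigm endings), then invert the table
# so that forms occurring in texts can be assigned lemma + case/number.
# Endings are simplified modern-Czech-like, NOT the thesis's Old Czech data.
PARADIGM_HRAD = {  # hard inanimate masculine, singular + plural only
    ("sg", "nom"): "", ("sg", "gen"): "u", ("sg", "dat"): "u",
    ("sg", "acc"): "", ("sg", "loc"): "u", ("sg", "ins"): "em",
    ("pl", "nom"): "y", ("pl", "gen"): "ů", ("pl", "dat"): "ům",
    ("pl", "acc"): "y", ("pl", "loc"): "ech", ("pl", "ins"): "y",
}

def generate(stem: str, lemma: str, paradigm: dict) -> dict:
    """Map each generated form to its (lemma, number, case) analyses."""
    analyses: dict[str, list] = {}
    for (number, case), ending in paradigm.items():
        analyses.setdefault(stem + ending, []).append((lemma, number, case))
    return analyses

FORM_TABLE = generate("hrad", "hrad", PARADIGM_HRAD)
print(FORM_TABLE["hrady"])  # ambiguous: nom/acc/ins plural of "hrad"
```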
