Global ETD Search

1	BNS informacinių žinučių analizė teminiu aspektu / Topic analysis in news items of BNS news agency
Grigaitytė, Justina, 17 June 2010
This thesis addresses topic detection, treated as the classification of texts into predefined categories, that is, the grouping of textual data by topic. News agencies divide their reports into groups and subgroups by topic, and this work is done manually: a person reads each text and assigns it to a topic. As the media and news portals continue to grow, sorting news automatically rather than by hand is becoming increasingly relevant; automating the process could help news portals distribute reports while saving time and effort. The object of the thesis is the reports issued by the BNS press centre in 2007. The aim is to determine how individual words help identify the topic of a text. Three methods are applied to detect the topic: frequent words, bigrams (two-word combinations), and keywords. The thesis consists of three parts. The first part covers the theoretical background: topic detection, text classification, and the language of news. A review of the features of news reports shows that this informational genre stands out for its concise texts and plain statement of facts. It is also hypothesized that the accuracy of topic detection depends on the length and relevance of a report. The second part describes how the lists of frequent words and bigrams were compiled and how keywords were extracted. After a review of how news is divided by topic, a list of topics was compiled, and on that basis the frequent word and bigram lists were...
Philology; Topic detection; Frequent words; Keywords; Bigrams
2	Large Vocabulary Continuous Speech Recognition For Turkish Using HTK
Comez, Murat Ali, 01 January 2003
This study aims to build a new language model that can be used in a Turkish large vocabulary continuous speech recognition system. Turkish is a very productive language in terms of word forms because of its agglutinative nature. For languages such as Turkish, the vocabulary size is far from acceptable: from a single simple stem, thousands of new word forms can be generated using inflectional or derivational suffixes. In this thesis, words are parsed into their stems and endings, where an ending comprises the suffixes attached to the associated root. A search network based on bigrams is then constructed. Bigrams are obtained either from stems and endings or from stems only. The proposed language model is based on bigrams obtained from stems only. All work is done in the HTK (Hidden Markov Model Toolkit) environment, except parsing and network transformation. Besides offering a new language model for Turkish, this study includes a comprehensive examination of the concepts behind state-of-the-art speech recognition systems. To gain a good command of these concepts and processes, isolated word, connected word and continuous speech recognition tasks are performed. The experimental results for these tasks are also given.
3	Language Modeling For Turkish Continuous Speech Recognition
Sahin, Serkan, 01 December 2003
This study aims to build a new language model for Turkish continuous speech recognition. Turkish is a very productive language in terms of word forms because of its agglutinative nature. For languages such as Turkish, the vocabulary size is far from acceptable: from a single simple stem, thousands of new words can be generated using inflectional and derivational suffixes. In this work, words are parsed into their stems and endings. First, endings are treated as words, and bigram probabilities are obtained from stems and endings. Then, bigram probabilities are obtained from the stems only. Single-pass recognition was performed using these bigram probabilities. Next, two-pass recognition was performed: the previous bigram probabilities were used to create word lattices, trigram probabilities were obtained from a larger text, and one-best results were obtained from the word lattices and trigram probabilities. All work is done in the Hidden Markov Model Toolkit (HTK) environment, except parsing and network transformation.
TK7800-8360 Electronics
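The stem-and-ending bigram estimation described in the two abstracts above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy stem lexicon, the longest-prefix splitting rule, and the function names `split_stem_ending` and `bigram_probabilities` are all assumptions made for illustration, and no smoothing is applied.

```python
from collections import Counter

def split_stem_ending(word, stems):
    """Split a word into (stem, ending) using the longest known stem prefix.

    Hypothetical stand-in for a real morphological parser.
    """
    for cut in range(len(word), 0, -1):
        if word[:cut] in stems:
            return word[:cut], word[cut:]
    return word, ""  # unknown word: treat the whole word as its stem

def bigram_probabilities(tokens):
    """Maximum-likelihood estimates P(w2 | w1) = count(w1, w2) / count(w1)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    left_counts = Counter(tokens[:-1])  # how often each token starts a bigram
    return {pair: c / left_counts[pair[0]] for pair, c in pair_counts.items()}

stems = {"ev", "git", "kal"}                 # assumed toy stem lexicon
words = ["evler", "gitti", "evde", "kaldı"]  # toy token stream, not real data

# Variant 1: bigrams over stems only.
stem_tokens = [split_stem_ending(w, stems)[0] for w in words]
stem_bigrams = bigram_probabilities(stem_tokens)

# Variant 2: endings are treated as separate tokens alongside stems.
unit_tokens = [part for w in words for part in split_stem_ending(w, stems) if part]
unit_bigrams = bigram_probabilities(unit_tokens)
```

Reducing words to stems shrinks the vocabulary, so each bigram is observed more often; the cost is that the information carried by the endings is lost unless they are kept as separate tokens, as in the second variant.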
4	Formulaic Sequences in Business and Academic Writing of English Learners
Xia, Detong, 23 May 2022
No description available.
Education; English As A Second Language; Linguistics; lexical bundles; phrase frames; bigrams; business emails; workplace discourse; business English learners; working professionals
5	Modality in Spiritual Literature : A Corpus Aided Discourse Study on Sadhguru and Eckhart Tolle
Bhowmik, Showmik Joy, January 2022
This study investigates how two spiritual teachers from different parts of the world interact with their devotees, what the probable impacts of that interaction are, and whether they speak similarly or differently, based on their use of modal auxiliary verbs and pronouns. Linguistically speaking, mystics mostly address their audience or readers with expressions of certainty, possibility, obligation and so on; modal auxiliary verbs carry such expressions, which makes them worth studying. The two primary texts were chosen for being contemporary and popular. One was authored by Sadhguru Jaggi Vasudev, a spiritual teacher, international spokesperson and popular author; the other by Eckhart Tolle, a spiritual teacher and best-selling author. A corpus-assisted discourse approach was taken to examine modal auxiliary verbs and their pronoun bigrams, using AntConc and Log-likelihood Calculator. Both quantitative and qualitative approaches were used in the analysis. The findings suggest three things. First, both authors use similar types of modal verbs (epistemic) in most cases. Second, comparing the types of modal verbs (epistemic/deontic) reveals significant differences: when the authors use epistemic modals, their choice of bigrams addresses different audience types, and they approach a concept differently. Sadhguru (2020) addresses general readers, whereas Tolle (2004) addresses readers who need spiritual guidance. Finally, the chosen modal verbs mostly express certainty, which keeps the mood of the book calm and content for the readers. To sum up, spiritual teachers mostly speak from experience and express certainty and possibility, though they address their readers differently.
Spirituality; mystics; modality; modal shading; epistemic; deontic; bigrams; CADS; General Language Studies and Linguistics; Specific Literatures
6	Aprendizado semissupervisionado multidescrição em classificação de textos / Multi-view semi-supervised learning in text classification
Braga, Ígor Assis, 23 April 2010
Semi-supervised learning algorithms learn from a combination of labeled and unlabeled data. Thus, they can be applied in domains where few labeled examples and a vast amount of unlabeled examples are available. Furthermore, semi-supervised learning algorithms may achieve better performance than supervised learning algorithms trained on the same few labeled examples. A powerful approach to semi-supervised learning, called multi-view learning, can be used whenever the training examples are described by two or more disjoint sets of attributes. Text classification is a domain in which semi-supervised learning algorithms have shown some success. However, multi-view semi-supervised learning has not yet been well explored in this domain, despite the possibility of describing textual documents in a myriad of ways. The aim of this work is to analyze the effectiveness of multi-view semi-supervised learning in text classification, using unigrams and bigrams as two distinct descriptions of text documents. To this end, we initially consider the widely adopted CO-TRAINING multi-view algorithm and propose some modifications to it in order to deal with the problem of contention points. We also propose the COAL algorithm, which further improves CO-TRAINING by incorporating active learning as a way of dealing with contention points. A thorough experimental evaluation of these algorithms was conducted on real text data sets. The results show that the COAL algorithm, using unigrams as one description of text documents and bigrams as another, achieves significantly better performance than a single-view semi-supervised algorithm. Taking into account the good results obtained by COAL, we conclude that the use of unigrams and bigrams as two distinct descriptions of text documents can be very effective.
Machine learning; Multi-view learning; Semi-supervised learning; Self-training; Co-training; COAL; Text classification; Unigrams; Bigrams
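As a rough illustration of what "two distinct descriptions" of one document can look like, the Python sketch below builds a unigram view and a bigram view from the same text. It covers only the feature-extraction step, not CO-TRAINING or COAL themselves; the whitespace tokenizer and the names `ngrams` and `two_views` are assumptions made for this example.

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def two_views(text):
    """Describe one document by two disjoint attribute sets.

    Tokenization by lowercasing and whitespace splitting is a simplifying
    assumption; a real pipeline would likely also handle punctuation.
    """
    tokens = text.lower().split()
    unigram_view = set(ngrams(tokens, 1))
    bigram_view = set(ngrams(tokens, 2))
    return unigram_view, bigram_view

unigram_view, bigram_view = two_views("Semi-supervised learning uses unlabeled data")
```

In a multi-view setting, a separate classifier would be trained on each view. Because every bigram attribute contains a space and no unigram attribute does, the two attribute sets cannot overlap, which matches the requirement that the views be disjoint.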
