91

Coh-Metrix-Dementia: análise automática de distúrbios de linguagem nas demências utilizando Processamento de Línguas Naturais / Coh-Metrix-Dementia: automatic analysis of language impairment in dementia using Natural Language Processing

Andre Luiz Verucci da Cunha 27 October 2015 (has links)
(Background) According to the World Health Organization, dementia is a costly social issue whose management will be a challenge in the coming decades. Common dementias include the well-known Alzheimer's Disease (AD). Another, less-known syndrome, Mild Cognitive Impairment (MCI), is relevant for being the initial clinically defined stage of AD. Even though MCI is less familiar to the public, patients with a particular variant of this syndrome, amnestic MCI, progress to AD at a considerably higher rate than the general population. The diagnosis of dementia and related syndromes is based on the analysis of the patient's linguistic and cognitive abilities. Classical exams include fluency, naming, and repetition tests. However, recent research has increasingly recognized the importance of discourse analysis, especially of narratives, as a more suitable alternative, particularly for MCI detection. (Gap) While qualitative discourse analysis can determine the nature of the patient's disease, quantitative analysis can reveal the extent of the existing brain damage. The greatest challenge in quantitative discourse analysis is that a rigorous and thorough evaluation of oral production is very labor-intensive, which hinders its large-scale adoption. In this scenario, computerized analyses become of increasing interest. Automated discourse analysis tools aimed at diagnosing language-impairing dementias already exist for English, but no such work had been done for Portuguese until now. (Goal) This project aims to create a unified environment, entitled Coh-Metrix-Dementia, that uses Natural Language Processing (NLP) and Machine Learning resources and tools to enable automated dementia analysis and classification, initially focusing on AD and MCI. (Hypothesis) Building on Coh-Metrix-Port, the Brazilian Portuguese adaptation of the Coh-Metrix environment, and adding twenty-five new metrics for measuring syntactic complexity, idea density, and textual coherence through latent semantics, it is possible to classify narratives of healthy, AD, and MCI subjects with a machine learning approach, with accuracy comparable to that of classical tests. (Conclusion) In our experiments, it was possible to separate patients into control, MCI, and AD groups with an F-measure of 81.7%, and to separate controls from MCI patients with an F-measure of 90%. These results indicate that the Coh-Metrix-Dementia metrics are a very promising resource for the early detection of decline in language abilities.
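To make the reported setup concrete, here is a minimal sketch (not the thesis's code) of the classification-and-evaluation pattern the abstract describes: numeric text metrics feed a classifier that separates control, MCI, and AD narratives, scored with the macro-averaged F-measure. The three feature columns and all data below are invented stand-ins for Coh-Metrix-Dementia metrics.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Hypothetical metric matrix: columns stand in for, e.g., syntactic
# complexity, idea density, and latent-semantic coherence scores.
X = rng.normal(size=(60, 3))
y = np.repeat([0, 1, 2], 20)  # 0=control, 1=MCI, 2=AD (toy labels)

clf = RandomForestClassifier(random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
print(f"macro F1 across folds: {scores.mean():.3f}")
```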
92

The Virtual Landscape of Geological Information: Topics, Methods, and Rhetoric in Modern Geology

Fratesi, Sarah Elizabeth 03 November 2008 (has links)
Geology is undergoing changes that could influence its knowledge claims, reportedly becoming more laboratory-based, technology-driven, quantitative, and physics- and chemistry-rich over time. This dissertation uses techniques from information science and linguistics to examine the geologic literature of 1945-2005. It consists of two studies: an examination of the geological literature as an expanding network of related subdisciplines, and an investigation of the linguistic and graphical argumentation strategies within geological journal articles. The first investigation is a large-scale study of topics within articles from 67 geologic journals. Clustering of subdiscipline journals based on titles and keywords reveals six major areas of geology: sedimentology/stratigraphy, oceans/climate, solid-earth, earth-surface, hard-rock, and paleontology. Citation maps reveal similar relationships. Text classification of titles and keywords from general-geology journals reveals that geological research has shifted away from economic geology towards physics- and chemistry-based topics. Geological literature has grown and fragmented ("twigged") over time, sustained in its extreme specialization by the scientific collaborations characteristic of "big science." The second investigation is a survey of linguistic and graphic features within geological journal articles that signal certain types of scientific activity and reasoning. Longitudinal studies show that "classical geology" articles within Geological Society of America Bulletin became shorter and more graphically dense between 1945 and 2005. Maps and graphs replaced less-efficient text, photographs, and sketches. Linguistic markers reveal increases in formal scientific discourse, specialized vocabulary, and reading difficulty. Studies comparing GSA Bulletin to five subdiscipline journals reveal that, in 2005, GSA Bulletin, AAPG Bulletin, and Journal of Sedimentary Research had similar graphic profiles and presented both field and laboratory data, while Ground Water, Journal of Geophysical Research - Solid Earth, and Geochimica et Cosmochimica Acta had more equations, graphs, and numerical-modeling results than the other journals. The dissertation concludes that geology evolves by spawning physics- and chemistry-rich subdisciplines with distinct methodologies. Publishing geologists accommodate increased theoretical rigor not by using the classic hallmarks of hard science (e.g., equations), but by mobilizing spatial arguments within an increasingly dense web of linguistic and graphical signs. Substantial differences in topic, methodology, and argumentation between subdisciplines manifest the multifaceted and complex constitution of geology and geologic philosophy.
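The journal-clustering step can be illustrated in miniature. This is a hedged sketch of the general technique (TF-IDF over title/keyword text, then k-means), not the dissertation's pipeline; the journal descriptors below are invented, and the real study clustered 67 journals into six areas rather than this toy's three.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

journals = {  # invented keyword profiles for illustration only
    "Journal of Sedimentary Research": "sediment stratigraphy basin facies deposition",
    "Paleobiology": "fossil taxa extinction diversity paleontology",
    "Geochimica et Cosmochimica Acta": "isotope geochemistry trace element mantle",
    "Journal of Geophysical Research": "seismic crust mantle tomography geophysics",
    "Geomorphology": "river erosion hillslope landscape surface",
    "Climate of the Past": "ocean climate proxy ice core paleoclimate",
}
X = TfidfVectorizer().fit_transform(journals.values())
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
for name, lab in sorted(zip(journals, labels), key=lambda t: t[1]):
    print(lab, name)
```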
93

Enrichissement des Modèles de Classification de Textes Représentés par des Concepts / Improving text-classification models using the bag-of-concept paradigm

Risch, Jean-Charles 27 June 2017 (has links)
Most text-classification methods use the "bag of words" paradigm to represent texts. However, as Bloehdorn and Hotho have noted, this representation has four limits: (1) some words are polysemous, (2) others can be synonyms and yet be treated as distinct features, (3) some words are strongly semantically linked without this being taken into account, and (4) certain words lose their meaning when extracted from their noun phrase. To overcome these problems, some methods no longer represent texts with words but with concepts extracted from a domain ontology (the "bag of concepts"), integrating the notion of meaning into the model. Models based on the bag-of-concepts representation remain little used because of their unsatisfactory results, so several methods have been proposed to enrich text features with new concepts extracted from knowledge bases. My work follows these approaches by proposing a model-enrichment step using an associated domain ontology. I proposed two measures to estimate how strongly these new concepts belong to each category. Using the naive Bayes classifier, I tested and compared my contributions on the labeled Ohsumed corpus together with the Disease Ontology. The satisfactory results led me to analyse more precisely the role of semantic relations in the enrichment step. This further work was the subject of a second experiment evaluating the contributions of the hierarchical relations of hypernymy and hyponymy.
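A minimal bag-of-concepts sketch follows, under stated assumptions: the word-to-concept table is a toy stand-in for a Disease Ontology lookup, and the mapping-plus-naive-Bayes flow shows only the general idea, not the thesis's two enrichment measures.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

CONCEPT_MAP = {  # hypothetical word -> concept table standing in for an ontology
    "flu": "influenza", "grippe": "influenza",
    "tumor": "neoplasm", "cancer": "neoplasm",
}

def to_concepts(text):
    # Replace each known word by its concept; keep unknown words as-is.
    return " ".join(CONCEPT_MAP.get(w, w) for w in text.lower().split())

docs = ["flu outbreak reported", "grippe season begins",
        "tumor growth observed", "cancer cells multiply"]
labels = ["infection", "infection", "oncology", "oncology"]

vec = CountVectorizer()
X = vec.fit_transform(to_concepts(d) for d in docs)
clf = MultinomialNB().fit(X, labels)
print(clf.predict(vec.transform([to_concepts("seasonal grippe cases rise")])))
```

Because "flu" and "grippe" collapse onto the same concept, the classifier generalizes across synonyms that a plain bag of words would keep apart.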
94

Multi-Label Text Classification with Transfer Learning for Policy Documents : The Case of the Sustainable Development Goals

Rodríguez Medina, Samuel January 2019 (has links)
We created and analyzed a text classification dataset from freely available web documents related to the United Nations' Sustainable Development Goals. We then used it to train and compare different multi-label text classifiers, with the aim of exploring methods that facilitate the search for information in this type of document. We explored the effectiveness of deep learning and transfer learning in text classification by fine-tuning different pre-trained language representations: Word2Vec, GloVe, ELMo, ULMFiT and BERT. We also compared these approaches against a baseline of more traditional algorithms without transfer learning, namely multinomial Naive Bayes, logistic regression, k-nearest neighbors and Support Vector Machines. We then analyzed the results of our experiments quantitatively and qualitatively. The best results in terms of micro-averaged F1 score and AUROC are obtained by BERT. However, the second-best classifier in terms of micro-averaged F1 score is the Support Vector Machine, closely followed by logistic regression, both of which have the advantage of being far less computationally expensive than BERT. The results also show a close relation between the size of our dataset and the effectiveness of the classifiers.
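The traditional baseline described above can be sketched in a few lines. This is an illustrative setup, not the thesis's experiment: the mini-corpus and SDG tags are invented, and the micro-averaged F1 is computed on the training set only to show the metric.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

docs = ["end poverty and hunger", "clean water and sanitation for all",
        "affordable clean energy", "no poverty and decent work",
        "water quality and marine life", "renewable energy and climate"]
tags = [{"sdg1", "sdg2"}, {"sdg6"}, {"sdg7"}, {"sdg1", "sdg8"},
        {"sdg6", "sdg14"}, {"sdg7", "sdg13"}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)                     # documents x labels indicator matrix
X = TfidfVectorizer().fit_transform(docs)
clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)  # one binary SVM per label
print(f1_score(Y, clf.predict(X), average="micro"))
```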
95

MaSTA: a text-based machine learning approach for systems-of-systems in the big data context / MaSTA: uma abordagem de aprendizado de máquina orientado a textos para sistemas-de-sistemas no contexto de big data

Bianchi, Thiago 11 April 2019 (has links)
Systems-of-systems (SoS) have gained a very important status in industry and academia as an answer to the growing complexity of software-intensive systems. SoS are particular in the sense that their capabilities transcend the mere sum of the capabilities of their diverse, independent constituents. In parallel, the current growth in the amount of data collected in different formats is impressive and imposes a considerable challenge for researchers and professionals, characterizing the Big Data context. In this scenario, Machine Learning techniques have been increasingly explored to analyze and extract relevant knowledge from such data. SoS also generate a large amount of data and textual information, and in many situations users of SoS need to manually register unstructured, critical texts, e.g., work orders and service requests, and map them to structured information. These tasks are repetitive, time- and effort-consuming, and error-prone. The main objective of this thesis is to present MaSTA, an approach composed of an innovative classification method to infer classifiers from large textual collections and an evaluation method that measures the reliability and performance levels of such classifiers. To evaluate the effectiveness of MaSTA, we conducted an experiment with a commercial SoS used by large companies, which provided us with four datasets containing nearly one million records related to three classification tasks. This experiment indicated that MaSTA is capable of automatically classifying the documents and of improving user accuracy by reducing the list of possible classifications. Moreover, it indicated that MaSTA is a scalable solution for Big Data scenarios, in which document collections have hundreds of thousands (even millions) of documents, possibly produced by different constituents of an SoS.
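One common way to scale text classification to near-million-record collections like those described above is streaming: a stateless hashing vectorizer plus an incrementally trained linear model, so the corpus never has to sit in memory. This is a generic sketch under that assumption, not MaSTA itself; the label set and records are invented.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vec = HashingVectorizer(n_features=2**18)  # stateless: no vocabulary pass needed
clf = SGDClassifier(random_state=0)
classes = ["work_order", "service_request"]  # hypothetical label set

def record_batches():
    # Stand-in for a database cursor over a very large collection.
    yield ["replace pump seal", "request new access badge"], classes
    yield ["repair conveyor motor", "request portal account"], classes

for texts, labels in record_batches():
    clf.partial_fit(vec.transform(texts), labels, classes=classes)

print(clf.predict(vec.transform(["request parking access"])))
```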
96

Multi-scale analysis of languages and knowledge through complex networks / Análise multi-escala de línguas e conhecimento por meio de redes complexas

Arruda, Henrique Ferraz de 24 January 2019 (has links)
Many different aspects of natural languages and their related dynamics have been studied. In the case of languages, some quantitative analyses have been carried out using stochastic models. Furthermore, natural languages can be understood as complex systems, so the set of tools developed to analyse complex networks, which are computationally represented as graphs, can also be used to analyse natural languages; these tools can likewise represent and analyse related dynamics taking place on the networks. Note that knowledge is intrinsically related to language, because language is the vehicle human beings use to transmit discoveries, and language itself is also a type of knowledge. This thesis is divided into two types of analysis: (i) texts and (ii) dynamical aspects. In the first part, we propose network representations of text at different scales of analysis, starting from the analysis of writing style with word adjacency (co-occurrence) networks, which capture local patterns of words, and moving to a mesoscopic representation, which is created from chunks of text and captures information about the unfolding of the story. In the second part, we consider structure and dynamics related to knowledge and language, starting from the largest scale, at which we study the connectivity between applied and theoretical physics. Next, we simulate knowledge acquisition by researchers in a multi-agent dynamics, and by an intelligent problem-solving machine represented as a network. At the smallest scale considered, we simulate the transmission of networks, treating the data as a series of organized symbols obtained from a dynamics. To improve transmission speed, the series can be compressed; for that, we employ information theory and Huffman coding. The proposed network-based approaches were found to be suitable for the analyses employed at all of the tested scales.
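The compression step mentioned at the end, Huffman coding of a symbol series, fits in a short self-contained sketch. This is a generic implementation for illustration, not the thesis's code, and the input series is invented.

```python
import heapq
from collections import Counter

def huffman_codes(series):
    # Heap of (frequency, tiebreak, partial code table); repeatedly merge
    # the two rarest subtrees, prefixing their codes with 0 and 1.
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(Counter(series).items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in a.items()}
        merged.update({s: "1" + c for s, c in b.items()})
        heapq.heappush(heap, (f1 + f2, i, merged))
        i += 1
    return heap[0][2]

series = "AABACABAD"          # toy symbol series from some dynamics
codes = huffman_codes(series)
print(codes, "".join(codes[s] for s in series))
```

Frequent symbols receive short codes, so the encoded series is shorter than a fixed-width encoding, which is exactly why compression speeds up transmission.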
97

Utility of Feedback Given by Students During Courses

Atkisson, Michael Alton 01 July 2017 (has links)
This two-article dissertation summarizes the end-of-course survey and formative feedback literatures, and proposes actionability as a useful construct in the analysis of feedback captured from students in real time during their courses. The present inquiry grew out of my work as the founder of DropThought Education, a division of DropThought. DropThought Education was a student feedback system that helped instructional designers, instructors, and educational systems use feedback from students to improve learning and the student experience. To find out whether the DropThought style of feedback was more effective than other forms of capturing and analyzing student feedback, I needed to (1) examine the formative feedback literature and (2) test DropThought-style feedback against traditional feedback forms. The method and theory proposed demonstrate that feedback from students can be specific and actionable when captured in the moment, at the level of students' activities, in their own words. Applications of the real-time feedback approach are relevant to practitioners and researchers alike, whether an instructor looking to improve her class activities or a learning scientist carrying out interventionist, design-based research.
98

Text Classification in Turkish Marketing Domain and Context-Sensitive Ad Distribution

Engin, Melih 01 February 2009 (has links) (PDF)
Online advertising has a continuously increasing popularity, and the target audience of this new advertising method is huge. There is also another rapidly growing group related to internet advertising: web publishers. Contextual advertising systems make it easier for publishers to present online ads on their web sites, since these systems automatically divert ads to web sites with related content. Web publishers join ad networks and gain revenue by enabling ads to be displayed on their sites. The accuracy of automated ad systems in determining ad-context relevance is therefore crucial. In this thesis we construct a method for the semantic classification of web site contexts in Turkish and develop an ad-serving system to display context-related ads on web documents. The classification method uses both semantic and statistical techniques. Because the method is supervised, it needs processed sample data for learning classification rules; we therefore generate a Turkish marketing dataset and use it in our classification approaches. We build successful classification methods using different feature spaces and support vector machine configurations, and our results provide a detailed comparison of these methods.
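The ad-serving side, ranking ads by relevance to a page's content, can be illustrated with a simple vector-space sketch. This is a generic TF-IDF cosine-similarity approach under stated assumptions, not the thesis's semantic method, and the ads and page text are invented (and in English for readability).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

ads = ["cheap flights to istanbul", "running shoes sale",
       "hotel deals and city breaks"]
page = "planning a weekend trip: flights, hotels and sightseeing tips"

vec = TfidfVectorizer().fit(ads + [page])       # shared vocabulary
scores = cosine_similarity(vec.transform([page]), vec.transform(ads))[0]
for ad, s in sorted(zip(ads, scores), key=lambda t: -t[1]):
    print(f"{s:.2f}  {ad}")                     # highest-scoring ad is served
```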
99

Collocation Segmentation for Text Chunking / Teksto skaidymas pastoviųjų junginių segmentais

Daudaravičius, Vidas 04 February 2013 (has links)
Segmentation is a widely used paradigm in text processing; rule-based, statistical and hybrid methods are employed to perform it. This dissertation introduces a new type of segmentation - collocation segmentation - and a new method to perform it, and applies them to three different text processing tasks. In lexicography, collocation segmentation makes it possible to use large corpora to evaluate the usage and importance of terminology over time, allowing important methodological shifts in the history of research, as well as currently active research areas, to be identified quickly. Text categorization results can also be improved using collocation segmentation: the study shows that collocation segmentation, without any other language resources, achieves better results than the widely used n-gram techniques combined with POS (Part-of-Speech) processing tools. Finally, preprocessing data with collocation segmentation and integrating these segments into a Statistical Machine Translation system improves the translation results, and diverse word combinability measures influence the final collocation segmentation, and thus the translation results, in different ways. The new collocation segmentation method is simple, efficient and applicable to language processing for diverse applications.
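A toy sketch of the idea behind collocation segmentation follows: place a segment boundary wherever adjacent words do not combine strongly. This is not the dissertation's algorithm; as a crude stand-in for the word combinability measures it studies, the sketch uses raw bigram frequency, merging only pairs that repeat in the corpus.

```python
from collections import Counter

def collocation_segments(tokens, min_count=2):
    # Crude combinability score: raw bigram frequency. A boundary is
    # placed wherever the adjacent pair is not repeated in the corpus.
    bigrams = Counter(zip(tokens, tokens[1:]))
    segments, current = [], [tokens[0]]
    for a, b in zip(tokens, tokens[1:]):
        if bigrams[(a, b)] >= min_count:
            current.append(b)            # combinable: extend the segment
        else:
            segments.append(current)     # boundary: start a new segment
            current = [b]
    segments.append(current)
    return segments

tokens = "new york is a big city and new york never sleeps".split()
print(collocation_segments(tokens))      # keeps "new york" as one segment
```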
