Global ETD Search

101	Comparing Text Classification Libraries in Scala and Python : A comparison of precision and recall Garamvölgyi, Filip, Henning Bruce, August January 2021 In today’s internet era, more text than ever is being uploaded online. The text comes in many forms, such as social media posts, business reviews, and many more. For various reasons, there is an interest in analyzing the uploaded text. For instance, an airline business could ask their customers to review the service they have received. The feedback would be collected by asking the customer to leave a review and a score. A common scenario is a review with a good score that contains negative aspects. It is preferable to avoid a situation where the entirety of the review is regarded as positive because of the score if there are negative aspects mentioned. A solution to this would be to analyze each sentence of a review and classify it as negative, neutral, or positive depending on how the sentence is perceived. With the amount of text uploaded today, it is not feasible to manually analyze text. Automatically classifying text by a set of criteria is called text classification. The process of specifically classifying text by how it is perceived is a subcategory of text classification known as sentiment analysis. Positive, neutral, and negative would be the sentiments to classify. The most popular frameworks associated with the implementation of sentiment analyzers are developed in the programming language Python. However, over the years, text classification has grown in popularity. This growth has caused new frameworks to be developed in new programming languages. Scala is one of the programming languages that has had new frameworks developed to work with sentiment analysis. However, in comparison to Python, it has fewer available resources. Python has more libraries to work with, more documentation, and more community support online. 
There are even fewer resources regarding sentiment analysis in a less common language such as Swedish. The problem is that no one has compared a sentiment analyzer for Swedish text implemented in Scala with one implemented in Python. The purpose of this thesis is to compare the recall and precision of a sentiment analyzer implemented in Scala with those of one implemented in Python. The goal of this thesis is to increase the knowledge regarding the state of text classification for less common natural languages in Scala. To conduct the study, a qualitative approach with the support of quantitative data was used. Two kinds of sentiment analyzers were implemented in Scala and Python. The first classified text as either positive or negative (binary sentiment analysis); the second also classified text as neutral (multiclass sentiment analysis). To perform the comparative study, the implemented analyzers classified text with known sentiments. The quality of the classifications was measured using their F1-score. The results showed that Python had better recall and quality for both tasks. In the binary task, the difference between the two implementations was smaller. The resources from Python were more specialized for Swedish and did not seem to be as affected by the small dataset used as the resources in Scala. Scala had an F1-score of 0.78 for binary sentiment analysis and 0.65 for multiclass sentiment analysis. Python had an F1-score of 0.83 for binary sentiment analysis and 0.78 for multiclass sentiment analysis. / I dagens internetera laddas mer text upp än någonsin online. Texten finns i många former, till exempel inlägg på sociala medier, företagsrecensioner och många fler. Av olika skäl finns det ett intresse av att analysera den uppladdade texten. Till exempel kan ett flygbolag be sina kunder att lämna omdömen om tjänsten de nyttjat. Feedbacken samlas in genom att be kunden lämna ett omdöme och ett betyg. 
Ett vanligt scenario är en recension med ett bra betyg som innehåller negativa aspekter. Det är att föredra att undvika en situation där hela recensionen anses vara positiv på grund av poängen, om det nämnts negativa aspekter. En lösning på detta skulle vara att analysera varje mening i en recension och klassificera den som negativ, neutral eller positiv beroende på hur meningen uppfattas. Med den mängd text som laddas upp idag är det inte möjligt att manuellt analysera text. Att automatiskt klassificera text efter en uppsättning kriterier kallas textklassificering. Processen att specifikt klassificera text efter hur den uppfattas är en underkategori av textklassificering som kallas sentimentanalys. Positivt, neutralt och negativt skulle vara sentiment att klassificera. De mest populära ramverken för implementering av sentimentanalysatorer utvecklas i programmeringsspråket Python. Men genom åren har textklassificering ökat i popularitet. Ökningen i popularitet har gjort att nya ramverk utvecklats för nya programmeringsspråk. Scala är ett av programmeringsspråken som har utvecklat nya ramverk för att arbeta med sentimentanalys. I jämförelse med Python har den dock mindre tillgängliga resurser. Python har mer bibliotek, dokumentation och mer stöd online. Det finns ännu färre resurser när det gäller sentimentanalyser på ett mindre vanligt språk som svenska. Problemet är att ingen har jämfört en sentimentanalysator för svensk text implementerad med Scala och jämfört den med Python. Syftet med denna avhandling är att jämföra precision och recall på en sentimentanalysator implementerad i Scala med Python. Målet med denna avhandling är att öka kunskapen om tillståndet för textklassificering för mindre vanliga naturliga språk i Scala. För att genomföra studien användes ett kvalitativt tillvägagångssätt med stöd av kvantitativa data. Två typer av sentimentanalysatorer implementerades i Scala och Python. 
Den första klassificerade texten som antingen positiv eller negativ (binär sentimentanalys), den andra sentimentanalysatorn skulle också klassificera text som neutral (sentimentanalys i flera klasser). För att utföra den jämförande studien skulle de implementerade analysatorerna utföra klassificering på text med kända sentiment. Klassificeringarnas kvalitet mättes med deras F1-poäng. Resultaten visade att Python hade bättre precision och recall för båda uppgifterna. I den binära uppgiften var det inte lika stor skillnad mellan de två implementeringarna. Resurserna från Python var mer specialiserade för svenska och verkade inte påverkas lika mycket av den lilla dataset som används som resurserna i Scala. Scala hade ett F1-poäng på 0,78 för binär sentimentanalys och 0,65 för sentimentanalys i flera klasser. Python hade ett F1-poäng på 0,83 för binär sentimentanalys och 0,78 för sentimentanalys i flera klasser. LaBSE Spark NLP NLP Text classification Scala LaBSE Spark NLP NLP Textklassificering Scala Computer and Information Sciences Data- och informationsvetenskap
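Entry 101 reports its results as F1-scores, which combine precision and recall into one number. The following minimal Python sketch shows how the three quantities relate for a single class; the function name and the toy labels are illustrative assumptions and are not taken from the thesis or its dataset.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels, made up for illustration only.
y_true = ["pos", "pos", "neg", "neg", "pos"]
y_pred = ["pos", "neg", "neg", "pos", "pos"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred, "pos")
```

For a multiclass task such as the three-sentiment setting, per-class scores like these are typically averaged; the abstract does not say which averaging scheme was used.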
102	Extending a Text Classifier to Multiple Languages / Utöka en textklassificeringsmodell till flera språk Byström, Albin January 2021 This thesis explores the possibility of extending monolingual and bilingual text classifiers to multiple languages. Two different language models are explored: language-aligned word embeddings and a transformer model. The goal was to take a classifier based on Swedish and English samples and extend it to Danish, German, and Finnish samples. The results show that extending a text classifier by word-embedding alignment or by fine-tuning a multilingual transformer model is possible, but with varying accuracy depending on the language. / Denna avhandling undersöker möjligheten att utvidga enspråkiga och tvåspråkiga textklassificatorer till flera språk. Två olika språkmodeller utforskas, justeras ordinbäddningar och en transformatormodell. Målet var att ta en klassificerare baserad på svenska och engelska texter och utvidga den till danska, tyska och finska texter. Resultatet visar att det är möjligt att utöka en textklassificering med ordinbäddning eller genom att finjustera en flerspråkig transformatormodell, men träffsäkerheten varierar beroende på språk. Natural language processing Multilingual Transformer Word embeddings Text classification Språkteknologi Flerspråkig Transformator Ordinbäddningar Textklassificering Computer and Information Sciences Data- och informationsvetenskap
103	Uncertainty Estimation on Natural Language Processing He, Jianfeng 15 May 2024 Text plays a pivotal role in our daily lives, encompassing various forms such as social media posts, news articles, books, reports, and more. Consequently, Natural Language Processing (NLP) has garnered widespread attention. This technology empowers us to undertake tasks like text classification, entity recognition, and even crafting responses within a dialogue context. However, despite the expansive utility of NLP, it frequently necessitates a critical decision: whether to place trust in a model's predictions. To illustrate, consider a state-of-the-art (SOTA) model entrusted with diagnosing a disease or assessing the veracity of a rumor. An incorrect prediction in such scenarios can have dire consequences, impacting individuals' health or tarnishing their reputation. Consequently, it becomes imperative to establish a reliable method for evaluating the reliability of an NLP model's predictions, which is our focus: uncertainty estimation in NLP. Though many works have researched uncertainty estimation or NLP, the combination of these two domains is rare. This is because most NLP research emphasizes model prediction performance but tends to overlook the reliability of NLP model predictions. Additionally, current uncertainty estimation models may not be suitable for NLP due to the unique characteristics of NLP tasks, such as the need for more fine-grained information in named entity recognition. Therefore, this dissertation proposes novel uncertainty estimation methods for different NLP tasks by considering the NLP task's distinct characteristics. The NLP tasks are categorized into natural language understanding (NLU) and natural language generation (NLG, such as text summarization). Among the NLU tasks, understanding can be considered from two views: global-view (e.g. text classification at document level) and local-view (e.g. 
natural language inference at sentence level and named entity recognition at token level). As a result, we research uncertainty estimation on three tasks: text classification, named entity recognition, and text summarization. Besides, because few-shot text classification has captured much attention recently, we also research the uncertainty estimation on few-shot text classification. For the first topic, uncertainty estimation on text classification, few uncertainty models focus on improving the performance of text classification where human resources are involved. In response to this gap, our research focuses on enhancing the accuracy of uncertainty scores by bolstering the confidence associated with winning scores. We introduce MSD, a novel model comprising three distinct components: 'mix-up,' 'self-ensembling,' and 'distinctiveness score.' The primary objective of MSD is to refine the accuracy of uncertainty scores by mitigating the issue of overconfidence in winning scores while simultaneously considering various categories of uncertainty. MSD can integrate seamlessly with different deep neural networks. Extensive experiments with ablation settings are conducted on four real-world datasets, resulting in consistently competitive improvements. Our second topic focuses on uncertainty estimation on few-shot text classification (UEFTC), which has few or even only one available support sample for each class. UEFTC represents an underexplored research domain where, due to limited data samples, a UEFTC model predicts an uncertainty score to assess the likelihood of classification errors. However, traditional uncertainty estimation models in text classification are ill-suited for UEFTC since they demand extensive training data, while UEFTC operates in a few-shot scenario, typically providing just a few support samples, or even just one, per class. To tackle this challenge, we introduce Contrastive Learning from Uncertainty Relations (CLUR) as a solution tailored for UEFTC. 
CLUR exhibits the unique capability to be effectively trained with only one support sample per class, aided by pseudo uncertainty scores. A distinguishing feature of CLUR is its autonomous learning of these pseudo uncertainty scores, in contrast to previous approaches that relied on manual specification. Our investigation of CLUR encompasses four model structures, allowing us to evaluate the performance of three commonly employed contrastive learning components in the context of UEFTC. Our findings highlight the effectiveness of two of these components. Our third topic focuses on uncertainty estimation on sequential labeling. Sequential labeling involves the task of assigning labels to individual tokens in a sequence, exemplified by Named Entity Recognition (NER). Despite significant advancements in enhancing NER performance in prior research, the realm of uncertainty estimation for NER (UE-NER) remains relatively uncharted but is of paramount importance. This topic focuses on UE-NER, seeking to gauge uncertainty scores for NER predictions. Previous models for uncertainty estimation often overlook two distinctive attributes of NER: the interrelation among entities (where the learning of one entity's embedding depends on others) and the challenges posed by incorrect span predictions in entity extraction. To address these issues, we introduce the Sequential Labeling Posterior Network (SLPN), designed to estimate uncertainty scores for the extracted entities while considering uncertainty propagation from other tokens. Additionally, we have devised an evaluation methodology tailored to the specific nuances of wrong-span cases. Our fourth topic focuses on an overlooked question that persists regarding the evaluation reliability of uncertainty estimation in text summarization (UE-TS). 
Text summarization, a key task in natural language generation (NLG), holds significant importance, particularly in domains where inaccuracies can have serious consequences, such as healthcare. UE-TS has garnered attention due to the potential risks associated with erroneous summaries. However, the reliability of evaluating UE-TS methods raises concerns, stemming from the interdependence between uncertainty model metrics and the wide array of NLG metrics. To address these concerns, we introduce a comprehensive UE-TS benchmark incorporating twenty-six NLG metrics across four dimensions. This benchmark evaluates the uncertainty estimation capabilities of two large language models and one pre-trained language model across two datasets. Additionally, it assesses the effectiveness of fourteen common uncertainty estimation methods. Our study underscores the necessity of utilizing diverse, uncorrelated NLG metrics and uncertainty estimation techniques for a robust evaluation of UE-TS methods. / Doctor of Philosophy / Text is integral to our daily activities, appearing in various forms such as social media posts, news articles, books, and reports. We rely on text for communication, information dissemination, and decision-making. Given its ubiquity, the ability to process and understand text through Natural Language Processing (NLP) has become increasingly important. NLP technology enables us to perform tasks like text classification, which involves categorizing text into predefined labels, and named entity recognition (NER), which identifies specific entities such as names, dates, and locations within text. Additionally, NLP facilitates generating coherent and contextually appropriate responses in conversational agents, enhancing human-computer interaction. However, the reliability of NLP models is crucial, especially in sensitive applications like medical diagnoses, where errors can have severe consequences. 
This dissertation focuses on uncertainty estimation in NLP, a less explored but essential area. Uncertainty estimation helps evaluate the confidence of NLP model predictions. We propose new methods tailored to various NLP tasks, acknowledging their unique needs. NLP tasks are divided into natural language understanding (NLU) and natural language generation (NLG). Within NLU, we look at tasks from two perspectives: a global view (e.g., document-level text classification) and a local view (e.g., sentence-level inference and token-level entity recognition). Our research spans text classification, named entity recognition (NER), and text summarization, with a special focus on few-shot text classification due to its recent prominence. For text classification, we introduce the MSD model, which includes three components to enhance uncertainty score accuracy and address overconfidence issues. This model integrates seamlessly with different neural networks and shows consistent improvements in experiments. For few-shot text classification, we develop Contrastive Learning from Uncertainty Relations (CLUR), designed to work effectively with minimal support samples per class. CLUR autonomously learns pseudo uncertainty scores, demonstrating effectiveness with various contrastive learning components. In NER, we address the unique challenges of entity interrelation and span prediction errors. We propose the Sequential Labeling Posterior Network (SLPN) to estimate uncertainty scores while considering uncertainty propagation from other tokens. For text summarization, we create a benchmark with tens of metrics to evaluate uncertainty estimation methods across two datasets. This benchmark helps assess the reliability of these methods, highlighting the need for diverse, uncorrelated metrics. Overall, our work advances the understanding and implementation of uncertainty estimation in NLP, providing more reliable and accurate predictions across different tasks. 
Uncertainty Estimation Bayesian Neural Network Evidential Neural Network Text Classification Few-Shot Named Entity Recognition Text Summarization
104	ENHANCING ELECTRONIC HEALTH RECORDS SYSTEMS AND DIAGNOSTIC DECISION SUPPORT SYSTEMS WITH LARGE LANGUAGE MODELS Furqan Ali Khan (19203916) 26 July 2024 Within Electronic Health Record (EHR) Systems, physicians face extensive documentation, leading to alarming mental burnout. The disproportionate focus on data entry over direct patient care underscores a critical concern. Integration of Natural Language Processing (NLP) powered EHR systems offers relief by reducing time and effort in record maintenance. Our research introduces the Automated Electronic Health Record System, which not only transcribes dialogues but also employs advanced clinical text classification. With an accuracy exceeding 98.97%, it saves over 90% of time compared to manual entry, as validated on MIMIC III and MIMIC IV datasets. In addition to our system's advancements, we explore integration of Diagnostic Decision Support System (DDSS) leveraging Large Language Models (LLMs) and transformers, aiming to refine healthcare documentation and improve clinical decision-making. We explore the advantages, like enhanced accuracy and contextual understanding, as well as the challenges, including computational demands and biases, of using various LLMs. Natural language processing EHR systems BERT for biomedical data NLP-Natural Language Processing Speech to text text classification model
105	Arabic language processing for text classification : contributions to Arabic root extraction techniques, building an Arabic corpus, and to Arabic text classification techniques Al-Nashashibi, May Yacoub Adib January 2012 The impact and dynamics of Internet-based resources for Arabic-speaking users are increasing in significance, depth and breadth at a higher pace than ever, and thus require updated mechanisms for computational processing of Arabic texts. Arabic is a complex language and as such requires in-depth investigation for analysis and improvement of available automatic processing techniques such as root extraction methods or text classification techniques, and for developing text collections that are already labeled, whether with single or multiple labels. This thesis proposes new ideas and methods to improve available automatic processing techniques for Arabic texts. Any automatic processing technique requires data in order to be used, critically reviewed and assessed, and here an attempt to develop a labeled Arabic corpus is also proposed. This thesis is composed of three parts: 1- Arabic corpus development, 2- proposing, improving and implementing root extraction techniques, and 3- proposing and investigating the effect of different pre-processing methods on single-labeled text classification methods for Arabic. This thesis first develops an Arabic corpus that is prepared to be used here for testing root extraction methods as well as single-label text classification techniques. It also enhances a rule-based root extraction method by handling irregular cases (that appear in about 34% of texts). It proposes and implements two expanded algorithms as well as an adjustment for a weight-based method. It also adds the algorithm that handles irregular cases to all of these methods and compares the performances of the proposed methods with those of the original ones. 
This thesis thus develops a root extraction system that handles foreign Arabized words by constructing a list of about 7,000 foreign words. The output of the technique with the best accuracy in extracting the correct stem and root for the respective words in texts, which is an enhanced rule-based method, is used in the third part of this thesis. This thesis finally proposes and implements a variant term frequency inverse document frequency weighting method, and investigates the effect of using different choices of features in document representation on single-label text classification performance (words, stems or roots, as well as adding to these choices their respective phrases). This thesis applies forty-seven classifiers on all proposed representations and compares their performances. One challenge for researchers in Arabic text processing is that root extraction techniques reported in the literature are either not accessible or require a long time to be reproduced, while a labeled benchmark Arabic text corpus is not fully available online. Also, to date few machine learning techniques have been investigated for Arabic, and these used the usual preprocessing steps before classification. Such challenges are addressed in this thesis by developing a new labeled Arabic text corpus for extended applications of computational techniques. Results of the issues investigated here show that proposing and implementing an algorithm that handles irregular words in Arabic did improve the performance of all implemented root extraction techniques. The performance of the algorithm that handles such irregular cases is evaluated in terms of accuracy improvement and execution time. Its efficiency is investigated with different document lengths and empirically is found to be linear in time for document lengths less than about 8,000. Among the implemented root extraction methods, the rule-based technique improves the most when the irregular cases handling algorithm is included. 
This thesis validates that choosing roots or stems instead of words in documents representations indeed improves single-label classification performance significantly for most used classifiers. However, the effect of extending such representations with their respective phrases on single-label text classification performance shows that it has no significant improvement. Many classifiers were not yet tested for Arabic such as the ripple-down rule classifier. The outcome of comparing the classifiers' performances concludes that the Bayesian network classifier performance is significantly the best in terms of accuracy, training time, and root mean square error values for all proposed and implemented representations. 492.7
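Entry 105 builds on term frequency inverse document frequency (TF-IDF) weighting over words, stems, or roots. As a point of reference, here is a minimal Python sketch of the standard textbook weighting, tf(t, d) · log(N / df(t)); it is not the variant the thesis proposes, and the function name and placeholder tokens are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


# Placeholder tokens standing in for words, stems, or roots.
docs = ["a a b", "b c"]
weights = tf_idf(docs)
```

A term that occurs in every document (here `b`) gets weight zero, which is why the choice of feature unit (word, stem, or root) changes the representation: merging surface forms into one root changes both the term counts and the document frequencies.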
106	[en] SQLLOMINING: FINDING LEARNING OBJECTS USING MACHINE LEARNING METHODS / [pt] SQLLOMINING: OBTENÇÃO DE OBJETOS DE APRENDIZAGEM UTILIZANDO TÉCNICAS DE APRENDIZADO DE MÁQUINA SUSANA ROSICH SOARES VELLOSO 04 December 2007 [pt] Objetos de Aprendizagem ou Learning Objects (LOs) são porções de material didático tais como textos que podem ser reutilizados na composição de outros objetos maiores (aulas ou cursos). Um dos problemas da reutilização de LOs é descobri-los em seus contextos ou documentos texto originais tais como livros, e artigos. Visando a obtenção de LOs, este trabalho apresenta um processo que parte da extração, tratamento e carga de uma base de dados textual e em seguida, baseando-se em técnicas de aprendizado de máquina, uma combinação de EM (Expectation-Maximization) e um classificador Bayesiano, classifica-se os textos extraídos. Tal processo foi implementado em um sistema chamado SQLLOMining, que usa SQL como linguagem de programação e técnicas de mineração de texto na busca de LOs. / [en] Learning Objects (LOs) are pieces of instructional material, such as traditional texts, that can be reused in the composition of more complex objects like classes or courses. There are some difficulties in the process of LO reuse. One of them is finding pieces of documents that can be used as LOs. In this work we present a process that, in its search for LOs, starts by extracting, transforming and loading a text database and then continues by clustering these texts, using a machine learning method that combines EM (Expectation-Maximization) and a Bayesian classifier. We implemented that process in a system called SQLLOMining that uses the SQL language and text mining methods in the search for LOs. [pt] APRENDIZADO DE MAQUINA [en] MACHINE LEARNING [pt] ONTOLOGIA [en] ONTOLOGY [pt] BANCO DE DADOS [en] DATABASE [pt] EDUCACAO VIA WEB [en] E-LEARNING [pt] OBJETOS DE APRENDIZADO [en] LEARNING OBJECTS [pt] CLASSIFICACAO DE TEXTOS [en] TEXT CLASSIFICATION
107	[en] A STUDY OF MULTILABEL TEXT CLASSIFICATION ALGORITHMS USING NAIVE-BAYES / [pt] UM ESTUDO DE ALGORITMOS PARA CLASSIFICAÇÃO AUTOMÁTICA DE TEXTOS UTILIZANDO NAIVE-BAYES DAVID STEINBRUCH 12 March 2007 [pt] A quantidade de informação eletrônica vem crescendo de forma acelerada, motivada principalmente pela facilidade de publicação e divulgação que a Internet proporciona. Desta forma, é necessária a organização da informação de forma a facilitar a sua aquisição. Muitos trabalhos propuseram resolver este problema através da classificação automática de textos associando a eles vários rótulos (classificação multirótulo). No entanto, estes trabalhos transformam este problema em subproblemas de classificação binária, considerando que existe independência entre as categorias. Além disso, utilizam limiares (thresholds), que são muito específicos para o conjunto de treinamento utilizado, não possuindo grande capacidade de generalização na aprendizagem. Esta dissertação propõe dois algoritmos de classificação automática de textos baseados no algoritmo multinomial naive Bayes e sua utilização em um ambiente on-line de classificação automática de textos com realimentação de relevância pelo usuário. Para testar a eficiência dos algoritmos propostos, foram realizados experimentos na base de notícias Reuters 21578 e na base de documentos médicos Ohsumed. / [en] The amount of electronic information has been growing fast, mainly due to the ease of publication and dissemination that the Internet provides. Therefore, information must be organised so as to facilitate its retrieval. Many works have addressed this problem through automatic text classification, associating several labels with each text (multilabel classification). However, those works have transformed this problem into binary classification subproblems, assuming that the categories are independent of one another. 
Moreover, they have used thresholds, which are very specific to the document base used to train the classifier and therefore do not generalise well in the learning process. This thesis proposes two text classifiers based on the multinomial naive Bayes algorithm and their usage in an on-line text classification environment with user relevance feedback. In order to test the efficiency of the proposed algorithms, experiments were performed on the Reuters 21578 news base and on the Ohsumed medical document base. [pt] APRENDIZADO DE MAQUINA [en] MACHINE LEARNING [pt] INTERNET [en] INTERNET [pt] CATEGORIZACAO DE TEXTOS [en] TEXT CATEGORIZATION [pt] CLASSIFICACAO DE TEXTOS [en] TEXT CLASSIFICATION [pt] MULTIROTULO [en] MULTILABEL [pt] NAIVE-BAYES [en] NAIVE-BAYES
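Entry 107 builds on the multinomial naive Bayes algorithm. The Python sketch below shows the textbook single-label version with Laplace smoothing, only to make the baseline concrete; it is not either of the two multilabel algorithms the thesis proposes, and the function names, the smoothing parameter `alpha`, and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log-priors and smoothed per-class word log-likelihoods."""
    vocab = {word for doc in docs for word in doc.split()}
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {}
    for label, n_class_docs in class_doc_counts.items():
        total_words = sum(word_counts[label].values())
        denom = total_words + alpha * len(vocab)  # Laplace-smoothed denominator
        log_prior = math.log(n_class_docs / len(docs))
        log_like = {w: math.log((word_counts[label][w] + alpha) / denom) for w in vocab}
        model[label] = (log_prior, log_like)
    return model


def predict(model, doc):
    """Return the class with the highest log-score; unseen words are skipped."""
    scores = {}
    for label, (log_prior, log_like) in model.items():
        scores[label] = log_prior + sum(log_like[w] for w in doc.split() if w in log_like)
    return max(scores, key=scores.get)


# Toy training data, made up for illustration only.
docs = ["stocks fall on oil price", "oil price rises",
        "patient shows fever", "fever and cough in patient"]
labels = ["news", "news", "medical", "medical"]
model = train_multinomial_nb(docs, labels)
prediction = predict(model, "oil price shock")
```

A multilabel setup of the kind the abstract criticises would train one such binary classifier per category and apply a threshold to each score.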
108	Detection of Frozen Video Subtitles Using Machine Learning Sjölund, Jonathan January 2019 When subtitles are burned into a video, an error can sometimes occur in the encoder that results in the same subtitle being burned into several frames, resulting in subtitles becoming frozen. This thesis provides a way to detect frozen video subtitles with the help of an implemented text detector and classifier. Two types of classifiers, naïve classifiers and machine learning classifiers, are tested and compared on a variety of different videos to see how much a machine learning approach can improve the performance. The naïve classifiers are evaluated using ground truth data to gain an understanding of the importance of good text detection. To understand the difficulty of the problem, two different machine learning classifiers are tested, logistic regression and random forests. The result shows that machine learning improves the performance over using naïve classifiers by improving the specificity from approximately 87.3% to 95.8% and improving the accuracy from 93.3% to 95.5%. Random forests achieve the best overall performance, but the difference compared to when using logistic regression is small enough that more computationally complex machine learning classifiers are not necessary. Using the ground truth shows that the weaker naïve classifiers would be improved by at least 4.2% accuracy, thus a better text detector is warranted. This thesis shows that machine learning is a viable option for detecting frozen video subtitles. Machine learning Text detection Text localization Text extraction Frozen subtitles Burnt-in subtitles Hardcoded subtitles Classification Text classification Frozen subtitle classification Computer Sciences Datavetenskap (datalogi)
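Entry 108 reports both specificity and accuracy. The short Python sketch below shows how each is derived from confusion-matrix counts; the function name and the counts are made up for illustration and do not come from the thesis.

```python
def specificity_and_accuracy(tp, tn, fp, fn):
    """Compute specificity (true-negative rate) and accuracy from raw counts."""
    specificity = tn / (tn + fp)                # share of actual negatives classified as negative
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all cases classified correctly
    return specificity, accuracy


# Hypothetical counts for a frozen / not-frozen classifier.
specificity, accuracy = specificity_and_accuracy(tp=40, tn=50, fp=5, fn=5)
```

Because specificity looks only at the negative class, it can move by a different amount than accuracy when the error pattern changes, as in the figures the abstract quotes.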
109	Automatic Analysis of Blend Words / Analyse automatique de mots mélangés Warintarawej, Pattaraporn 04 April 2013 Mélanger des parties de mots est une façon qui peut sembler étonnante pour produire de nouvelles formes linguistiques. Cela est devenu une manière très utilisée pour inventer des noms pour le quotidien, les noms de marque, les noms utilisés dans les codes informatiques des logiciels, par exemple avec alicament (aliment and médicament), aspivenin (aspirer and venin). Il existe plusieurs façons de mélanger des mots pour en former d'autres, ce qui rend difficile l'analyse des mots produits. Dans cette thèse, nous nous proposons une approche d'analyse automatique des évocations de mots produits à l'aide de mélanges, en considérant des méthodes de classification de type top-k. Nous comparons trois méthodes d'analyse des parties d'un mot : n-grammes, syllabes et cellules morpho-phonologiques. Nous proposons deux algorithmes d'extraction des syllabes ainsi que des méthodes d'évaluation. L'algorithme Enqualitum est proposé pour identifier les mots étant évoqués par le mot analysé. Notre proposition a été utilisée en particulier dans le domaine de l'analyse automatique en génie logiciel pour lequel nous avons proposé l'algorithme Sword pour produire un découpage pertinent des noms apparaissant dans les programmes. Les expérimentations ont démontré l'intérêt de nos propositions. / Lexical blending is a striking case of morphological productivity, involving the coinage of a new lexeme by fusing parts of at least two source words. Since new things need new words, blending has become a frequent and productive means of word creation, as in smog (smoke and fog) or alicament (aliment and médicament, a French blend word). The challenge is to design methods to discover how the first source word and the second source word are combined. The thesis aims at the automatic analysis of blend words in order to find the source words they evoke. 
The contributions of the thesis can be divided into two main parts. First, as a contribution to automatic blend word analysis, we develop top-k classification and its evaluation framework to predict the concepts of blend words. We investigate three different features of words: character N-grams, syllables and morpho-phonological stems. Moreover, we propose a novel approach to automatically identify blend source words, named Enqualitum. The experiments are conducted on both synthetic French blend words and words from a French thesaurus. Second, as a contribution to software engineering, we apply the idea of learning character patterns of identifiers to predict the concepts of source code, and we also introduce a method to automate semantic context in source code. The experiments are conducted on real identifier names from open source software packages. The results show the usefulness and the effectiveness of our proposed approaches. Mots mélangés; N-grams; Linguistique; Classification de Textes; Génie Logiciel; Compréhension automatique de programmes; Blend words; N-grams; Linguistics; Text classification; Identifier names; Automatic Software Understanding
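To make the notions of character N-gram features and top-k prediction in the abstract above more concrete, here is a minimal sketch. It does not reproduce Enqualitum, Sword, or the classifier used in the thesis; as a stand-in, it ranks candidate source words by the Jaccard overlap of their character n-grams with the blend. The function names, the boundary markers `^` and `$`, and the candidate list are all assumptions made for illustration.

```python
from typing import List, Set


def char_ngrams(word: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of a word, with boundary markers."""
    padded = f"^{word.lower()}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def top_k_sources(blend: str, candidates: List[str], k: int = 2, n: int = 3) -> List[str]:
    """Rank candidates by n-gram Jaccard overlap with the blend; keep the best k."""
    blend_grams = char_ngrams(blend, n)

    def score(candidate: str) -> float:
        grams = char_ngrams(candidate, n)
        union = blend_grams | grams
        return len(blend_grams & grams) / len(union) if union else 0.0

    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(candidates, key=lambda c: (-score(c), c))
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical candidate vocabulary for the blend "smog".
    print(top_k_sources("smog", ["smoke", "fog", "snow", "frog"], k=2, n=2))
```

A top-k evaluation would then count a prediction as correct when the true source word appears anywhere among the k returned candidates, rather than only in first place.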
110	"Classificação de páginas na internet" / "Internet pages classification" Martins Júnior, José 11 April 2003 (has links) O grande crescimento da Internet ocorreu a partir da década de 1990 com o surgimento dos provedores comerciais de serviços, e resulta principalmente da boa aceitação e vasta disseminação do uso da Web. O grande problema que afeta a escalabilidade e o uso de tal serviço refere-se à organização e à classificação de seu conteúdo. Os engenhos de busca atuais possibilitam a localização de páginas na Web pela comparação léxica de conjuntos de palavras perante os conteúdos dos hipertextos. Tal mecanismo mostra-se ineficaz quando da necessidade pela localização de conteúdos que expressem conceitos ou objetos, a exemplo de produtos à venda oferecidos em sites de comércio eletrônico. A criação da Web Semântica foi anunciada no ano de 2000 para esse propósito, visando o estabelecimento de novos padrões para a representação formal de conteúdos nas páginas Web. Com sua implantação, cujo prazo inicialmente previsto foi de dez anos, será possível a expressão de conceitos nos conteúdos dos hipertextos, que representarão objetos classificados por uma ontologia, viabilizando assim o uso de sistemas, baseados em conhecimento, implementados por agentes inteligentes de software. O projeto DEEPSIA foi concebido como uma solução centrada no comprador, ao contrário dos atuais Market Places, para resolver o problema da localização de páginas Web com a descrição de produtos à venda, fazendo uso de métodos de classificação de textos, apoiados pelos algoritmos k-NN e C4.5, no suporte ao processo decisório realizado por um agente previsto em sua arquitetura, o Crawler Agent. Os testes com o sistema em sites brasileiros denotaram a necessidade pela sua adaptação em diversos aspectos, incluindo-se o processo decisório envolvido, que foi abordado pelo presente trabalho. 
A solução para o problema envolveu a aplicação e a avaliação do método Support Vector Machines, e é descrita em detalhes. / The huge growth of the Internet has been occurring since the 1990s with the arrival of commercial Internet service providers. One important reason is the good acceptance and wide dissemination of the Web. The main problem that affects its scalability and usage is the organization and classification of its content. Current search engines make it possible to locate Web pages by lexical comparison between sets of words and the contents of hypertexts. Such mechanisms are ineffective for finding content that expresses concepts or objects, such as products for sale on electronic commerce sites. The Semantic Web was announced in 2000 for this purpose, envisioning the establishment of new standards for the formal representation of content in Web pages. With its implementation, initially planned to take ten years, it will be possible to express concepts in hypertext content that represent objects classified by an ontology, making possible the use of knowledge-based systems implemented by intelligent software agents. The DEEPSIA project was conceived as a purchaser-centered solution, in contrast to current Market Places, to solve the problem of finding Web pages that describe products for sale. It uses text classification methods, based on the k-NN and C4.5 algorithms, to support the decision process carried out by a specific agent in its architecture, the Crawler Agent. Tests of the system on Brazilian sites revealed the need to adapt it in several respects, including the decision process involved, which is the focus of the present work. The solution to the problem involves the application and evaluation of the Support Vector Machines method, and it is described in detail.
agent; agente; Classificação de Textos; comércio eletrônico; DEEPSIA; electronic commerce; ontologia; ontology; Support Vector Machines; text classification; Web
