Global ETD Search

41	Anti-Spam Study: an Alliance-based Approach
Chiu, Yu-fen, 12 September 2006
The growing problem of spam has created a need for reliable anti-spam filters. Many filtering techniques, together with machine learning and data mining, are used to reduce the amount of spam. Such algorithms can achieve very high accuracy, but at the cost of some false positives, and false positives are generally prohibitively expensive in the real world. Much work has been done to improve specific algorithms for detecting spam, but less has been reported on leveraging multiple algorithms in email analysis. This study presents an alliance-based approach to classify, discover, and exchange interesting information on spam. The spam filter in this study is built on a combination of rough set theory (RST), a genetic algorithm (GA), and the XCS classifier system. RST can process imprecise and incomplete data such as spam. The GA speeds up the search for the optimal solution (i.e., the rules used to block spam). The reinforcement learning in XCS is a good mechanism for suggesting the appropriate classification of an email. The results of alliance-based spam filtering are evaluated with several statistical methods, and the filter performs well. Two main conclusions can be drawn from this study: (1) rules exchanged with other mail servers do help the filter block more spam than before; (2) a combination of algorithms both improves accuracy and reduces false positives in spam detection.
Keywords: Rough set theory; Reinforcement learning; XCS classifier system; Spam; Text classification
42	Emotion Analysis Of Turkish Texts By Using Machine Learning Methods
Boynukalin, Zeynep, 01 July 2012
Automatically analysing the emotion in texts is of increasing interest in today's research. The aim is to develop a machine that can detect the type of a user's emotion from his or her text. Emotion classification of English texts has been studied by several researchers, with promising results. In this thesis, an emotion classification study on Turkish texts is introduced. To the best of our knowledge, this is the first study on emotion analysis of Turkish texts. Some well-defined datasets exist in English for emotion classification, but we could not find datasets in Turkish suitable for this study. Therefore, another important contribution is the creation of a new Turkish dataset for emotion analysis. The dataset is generated by combining two types of sources. Several classification algorithms are applied to the dataset and the results are compared. Owing to the nature of the Turkish language, new features are added to the existing methods to improve the success of the proposed method.
Subject: QA 76.75-76.765 Computer Software
43	A Self-Constructing Fuzzy Feature Clustering for Text Categorization
Liu, Ren-jia, 26 August 2009
Feature clustering is a powerful method for reducing the dimensionality of feature vectors in text classification. In this paper, we propose a fuzzy similarity-based self-constructing algorithm for feature clustering. The words in the feature vector of a document set are grouped into clusters based on a similarity test, so that words that are similar to each other fall into the same cluster. Each cluster is characterized by a membership function with a statistical mean and deviation. When all the words have been fed in, a desired number of clusters is formed automatically. We then have one extracted feature for each cluster. The extracted feature corresponding to a cluster is a weighted combination of the words contained in the cluster. With this algorithm, the derived membership functions closely match and properly describe the real distribution of the training data. In addition, the user need not specify the number of extracted features in advance, so trial and error in determining the appropriate number of extracted features is avoided. The 20 Newsgroups data set and the Cade 12 web directory serve as our experimental data, and we adopt the support vector machine to classify the documents. Experimental results show that our method runs faster and obtains better extracted features than other methods.
Keywords: text classification; feature reduction; feature clustering; feature extraction; fuzzy clustering; fuzzy similarity
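The self-constructing step described in this abstract can be illustrated with a small sketch. This is not the thesis's implementation: the Gaussian-style membership function, the threshold `rho`, the initial deviation `sigma0`, and the way the mean and deviation are recomputed from all members are assumptions made for illustration, and the two-dimensional word patterns are invented.

```python
import math

def membership(x, mean, dev):
    """Gaussian-style membership of pattern x in a cluster (product over dimensions)."""
    return math.prod(math.exp(-((xi - m) / d) ** 2) for xi, m, d in zip(x, mean, dev))

def self_constructing_clusters(patterns, rho=0.5, sigma0=0.2):
    """Feed patterns one at a time; open a new cluster when no existing one fits.

    rho (similarity threshold) and sigma0 (initial deviation) are hypothetical
    parameters chosen for this sketch.
    """
    clusters = []  # each cluster: {"members": [...], "mean": [...], "dev": [...]}
    for x in patterns:
        scores = [membership(x, c["mean"], c["dev"]) for c in clusters]
        best = max(range(len(clusters)), key=scores.__getitem__, default=None)
        if best is None or scores[best] < rho:
            # No cluster passes the similarity test: start a new one at x.
            clusters.append({"members": [x], "mean": list(x), "dev": [sigma0] * len(x)})
        else:
            # Join the best cluster and refresh its mean and deviation.
            c = clusters[best]
            c["members"].append(x)
            n = len(c["members"])
            dims = list(zip(*c["members"]))
            c["mean"] = [sum(d) / n for d in dims]
            c["dev"] = [
                math.sqrt(sum((v - m) ** 2 for v in d) / n) + sigma0
                for d, m in zip(dims, c["mean"])
            ]
    return clusters

# Invented word patterns, e.g. per-class occurrence probabilities of four words.
word_patterns = [(0.9, 0.1), (0.85, 0.15), (0.1, 0.9), (0.12, 0.88)]
clusters = self_constructing_clusters(word_patterns)
```

The number of clusters is never passed in; it falls out of the threshold test, which is the property the abstract highlights.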
44	Análise de sentimentos baseada em aspectos e atribuições de polaridade / Aspect-based sentiment analysis and polarity assignment
Kauer, Anderson Uilian, January 2016
With the growing expansion of the Web, more and more users share their opinions on experiences they have had. These opinions are, in most cases, represented in the form of unstructured text. Sentiment Analysis (or Opinion Mining) is the research area dedicated to the computational study of the opinions and feelings expressed in texts, typically classifying them according to their polarity (i.e., as positive or negative). As online sales and social networking sites become major sources of opinions, there is a growing need for tools that automatically classify opinions and identify which aspect of the evaluated entity they refer to. In this work, we propose methods aimed at two key points in the treatment of such opinions: (i) aspect-based sentiment analysis and (ii) polarity assignment. For aspect-based sentiment analysis, we developed a method that identifies expressions mentioning aspects and entities in a text, using natural language processing tools combined with machine learning algorithms. For polarity assignment, we developed a method that uses 24 attributes extracted from the ranking generated by a search engine to build machine learning models. Furthermore, the method does not rely on linguistic resources and can be applied to noisy data. Experiments on real datasets show that, in both contributions, our results using a small number of attributes were close to those of the baselines. Moreover, for polarity assignment, the results are comparable to those of state-of-the-art methods that use more complex techniques.
Keywords: Texts: analysis; Data mining; Emotions; Opinion mining; Sentiment analysis; Aspect extraction; Text classification
45	Classificação de textos usando ontologias (Text classification using ontologies)
Guevara, Juan Florencio Valdivia, January 2016
Advisor: Profa. Dra. Debora Maria Rossi de Medeiros. Master's dissertation, Universidade Federal do ABC, Graduate Program in Computer Science (Programa de Pós-Graduação em Ciência da Computação), 2016.
In several knowledge areas, one of the main ways of spreading information is through text documents. Examples are websites, scientific papers, blogs, social media posts, and product/service reviews. Thus, automatically extracting information from this type of data source becomes an important task. One of the most classic information extraction tasks on text documents is classification. This task consists of automatically assigning the category to which a text belongs, based on a previously categorized set of texts. Extracting information from textual documents is, in general, a challenging task because it deals with unstructured data, since the same piece of information can be expressed in many different ways. In this context, an ontology may be a powerful tool to aid information extraction from texts. In a nutshell, ontologies are dictionaries of concepts linked by semantic relations. This work investigates the use of ontologies in the task of text classification. We propose an approach in which new features describing the texts of a collection are created from the concepts of an ontology. Experiments were carried out on benchmark text collections widely used by the scientific community. The proposed approach outperformed the conventional approach in specific scenarios. These scenarios point to areas of potential for the new approach, which will be explored further in future work.
Keywords: Text classification (Classificação de textos); Ontologies (Ontologias)
46	Chinese Text Classification Based On Deep Learning
Wang, Xutao, January 2018
Text classification has long been a concern in natural language processing, especially now that data volumes are growing massively with the development of the internet. The recurrent neural network (RNN) is one of the most popular methods for natural language processing because its recurrent architecture gives it the ability to process sequential information. Meanwhile, the convolutional neural network (CNN) has shown its ability to extract features from visual imagery. This paper combines the advantages of RNN and CNN and proposes a model called BLSTM-C for Chinese text classification. BLSTM-C begins with a bidirectional long short-term memory (BLSTM) layer, a special kind of RNN, which produces a sequence output based on both past and future context. It then feeds this sequence to a CNN layer, which extracts features from it. We evaluate the BLSTM-C model on several tasks, such as sentiment classification and category classification, and the results show our model's remarkable performance on these text tasks.
Keywords: Text classification; Recurrent neural network; Convolutional neural network; Computer Systems / Datorsystem
47	Disaster tweet classification using parts-of-speech tags: a domain adaptation approach
Robinson, Tyler, January 1900
Master of Science / Department of Computer Science / Doina Caragea
Twitter is one of the most active social media sites today. Almost everyone is using it, as it is a medium by which people stay in touch and inform others about events in their lives. Among many other types of events, people tweet about disasters. Both man-made and natural disasters, unfortunately, occur all the time. When these tragedies transpire, people tend to cope in their own ways. Some of the most popular ways people convey their feelings about disaster events are offering or asking for support, providing valuable information about the disaster, and voicing their disapproval of those who may be the cause. However, not all of the tweets posted during a disaster are guaranteed to be useful or informative to authorities or to the general public. As the number of tweets posted during a disaster can reach the hundreds of thousands, it is necessary to automatically distinguish tweets that provide useful information from those that don't. Manual annotation cannot scale to such a large number of tweets, as it takes significant time and effort, which makes it unsuitable for real-time disaster tweet annotation. Alternatively, supervised machine learning has traditionally been used to learn classifiers that can quickly annotate new, unseen tweets. But supervised machine learning algorithms make use of labeled training data from the disaster of interest, which is presumably not available for a current target disaster. However, it is reasonable to assume that some amount of labeled data is available for a prior source disaster.
Therefore, domain adaptation algorithms that use labeled data from a source disaster to learn classifiers for the target disaster provide a promising direction in tweet classification for disaster management. In prior work, domain adaptation algorithms have been trained on tweets represented as bags of words. In this research, I studied the effect of part-of-speech (POS) tag unigrams and bigrams on the performance of the domain adaptation classifiers. Specifically, I used POS tag unigram and bigram features in conjunction with a Naive Bayes Domain Adaptation algorithm to learn classifiers from source labeled data together with target unlabeled data, and subsequently used the resulting classifiers to classify target disaster tweets. The main research question addressed through this work was whether POS tags can help improve the performance of classifiers learned from tweet bag-of-words representations only. Experimental results show that POS tags can improve the performance of classifiers learned from words only, but not always. Furthermore, the results show that POS tag bigrams contain more information than POS tag unigrams, as the classifiers learned from bigrams perform better than those learned from unigrams.
Keywords: Domain Adaptation; Text classification; Tweet; Disaster management; Part of speech; Naive Bayes
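As a rough illustration of the POS tag unigram and bigram features mentioned in this abstract, the sketch below counts tag n-grams for one already-tagged tweet. The function name `pos_ngram_features`, the underscore-joined feature names, and the example tag sequence are hypothetical; the thesis's tagger and its Naive Bayes domain adaptation step are not shown.

```python
from collections import Counter

def pos_ngram_features(tags, n_values=(1, 2)):
    """Count POS tag n-grams (unigrams and bigrams by default) for one tweet."""
    feats = Counter()
    for n in n_values:
        for i in range(len(tags) - n + 1):
            # Join the n consecutive tags into one feature name, e.g. "DT_NN".
            feats["_".join(tags[i:i + n])] += 1
    return feats

# Hypothetical tagger output for a four-token tweet.
example_tags = ["NNP", "VBZ", "DT", "NN"]
features = pos_ngram_features(example_tags)
```

Counts like these could be appended to a bag-of-words vector before training, which is one plausible way to combine the two representations the abstract compares.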
48	A Study of Text Mining Framework for Automated Classification of Software Requirements in Enterprise Systems
January 2016
Abstract: Text classification is a rapidly evolving area of data mining, while requirements engineering is a less-explored area of software engineering that deals with the process of defining, documenting, and maintaining a software system's requirements. When researchers decided to blend these two streams, research emerged on automating the classification of software requirements statements into categories easily comprehensible to developers, for faster development and delivery; until now this was mostly done manually by software engineers, a tedious job. However, most of that research focused on classifying non-functional requirements, which pertain to intangible features such as security, reliability, and quality. Automatically classifying functional requirements, those pertaining to how the system will function, is a challenging task, especially for requirements belonging to different and large enterprise systems. It requires exploiting text mining capabilities. This thesis investigates the results of text classification applied to functional software requirements by creating a framework in R that uses algorithms and techniques such as k-nearest neighbors, support vector machines, boosting, bagging, maximum entropy, neural networks, and random forests in an ensemble approach. The study was conducted by collecting and visualizing relevant enterprise data that had previously been classified manually and was then used to train the model. Key components for training included the frequency of terms in the documents and the level of cleanliness of the data. The model was applied to test data and validated by studying and comparing measures such as precision, recall, and accuracy.
Dissertation/Thesis: Masters Thesis, Engineering, 2016
Keywords: Computer science; Engineering; data analytics; R; requirements classification; text classification; text mining
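The evaluation measures named in this abstract (precision, recall, accuracy) can be computed as in the minimal sketch below. It is written in Python rather than the thesis's R framework, and the label names and example predictions are invented for illustration.

```python
def binary_metrics(y_true, y_pred, positive):
    """Return (precision, recall, accuracy) treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return precision, recall, accuracy

# Invented labels: "F" = functional requirement, "N" = non-functional requirement.
y_true = ["F", "F", "N", "N", "F"]
y_pred = ["F", "N", "N", "F", "F"]
precision, recall, accuracy = binary_metrics(y_true, y_pred, positive="F")
```

Precision asks how many predicted positives were right, recall asks how many actual positives were found, and accuracy counts all correct predictions, so the three can diverge on imbalanced data.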
49	A comperative study of text classification models on invoices : The feasibility of different machine learning algorithms and their accuracy
Ekström, Linus; Augustsson, Andreas, January 2018
Text classification is becoming more important for companies in a world where an increasing amount of digital data is made available. The aim is to investigate whether five different machine learning algorithms can be used to automate the classification of invoice data, and to see which one achieves the highest accuracy. At a later stage, the algorithms are combined in an attempt to achieve better results. N-grams are used, and results are compared in terms of total classification accuracy for each algorithm. A Python library, scikit-learn, which implements the chosen algorithms, was used. Data was collected and generated to represent the data present on a real invoice from which data has been extracted. The results of this thesis show that it is possible to use machine learning for this type of problem. The highest-scoring algorithm (LinearSVC from scikit-learn) classifies 86% of all samples correctly, 16 percentage points above the acceptable level of 70%.
Keywords: Machine learning; text classification; invoices; supervised learning; information retrieval; ensemble learning; Computer Sciences / Datavetenskap (datalogi)
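The abstract says the algorithms are combined at a later stage but does not say how. One common way to combine classifiers is majority voting, sketched below purely as an assumption; the invoice category labels and the three prediction lists are invented, and ties are resolved in favour of the label seen first.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label lists (all the same length) by majority vote."""
    combined = []
    for votes in zip(*predictions):
        # most_common(1) returns the label with the highest count for this sample.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Invented predictions from three classifiers on three invoice lines.
classifier_outputs = [
    ["rent", "travel", "office"],
    ["rent", "office", "office"],
    ["travel", "travel", "office"],
]
ensemble_labels = majority_vote(classifier_outputs)
```

With an odd number of classifiers and two candidate labels per sample a strict majority always exists; with more labels, a tie-breaking rule such as the first-seen one used here has to be chosen deliberately.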
