131 |
Otimização e análise das máquinas de vetores de suporte aplicadas à classificação de documentos. / Optimization and analysis of support vector machine applied to text classification. Eduardo Akira Kinto, 17 June 2011 (has links)
A análise das informações armazenadas é fundamental para qualquer tomada de decisão, mas para isso ela deve estar organizada e permitir fácil acesso. Quando temos um volume de dados muito grande, esta tarefa torna-se muito mais complicada do ponto de vista computacional. É fundamental, então, haver mecanismos eficientes para análise das informações. As Redes Neurais Artificiais (RNA), as Máquinas de Vetores-Suporte (Support Vector Machine - SVM) e outros algoritmos são frequentemente usados para esta finalidade. Neste trabalho, iremos explorar o SMO (Sequential Minimal Optimization) e alterá-lo, com a finalidade de atingir um tempo de treinamento menor, mas, ao mesmo tempo, manter a capacidade de classificação. São duas as alterações propostas: uma no seu algoritmo de treinamento e outra na sua arquitetura. A primeira modificação do SMO proposta neste trabalho é permitir a atualização de candidatos ao vetor suporte no mesmo ciclo de atualização de um coeficiente de Lagrange. Dos algoritmos que codificam o SVM, o SMO é um dos mais rápidos e um dos que menos consomem memória. A complexidade computacional do SMO é menor com relação aos demais algoritmos porque ele não trabalha com inversão de uma matriz de kernel. Esta matriz, que é quadrada, costuma ter um tamanho proporcional ao número de amostras que compõem os chamados vetores-suporte. A segunda proposta para diminuir o tempo de treinamento do SVM consiste na subdivisão ordenada do conjunto de treinamento, utilizando-se a dimensão de maior entropia. Esta subdivisão difere das abordagens tradicionais pelo fato de as amostras não serem constantemente submetidas repetidas vezes ao treinamento do SVM. Finalmente, é aplicado o SMO proposto para classificação de documentos ou textos por meio de uma abordagem nova, a classificação de uma-classe usando classificadores binários. Como em toda classificação de documentos, a análise dos atributos é uma etapa fundamental, e aqui uma nova contribuição é apresentada. 
Utilizamos a correlação total ponto a ponto para seleção das palavras que formam o vetor de índices de palavras. / Stored data analysis is very important for decision making in any business, but to accomplish this task the data must be organized so that it can be easily accessed. When we have a huge amount of information, data analysis becomes a computationally hard job, so it is essential to have efficient mechanisms for information analysis. Artificial neural networks (ANN), support vector machines (SVM) and other algorithms are frequently used for this purpose, including the analysis of very large volumes of information. In this work we explore the sequential minimal optimization (SMO) algorithm, a learning algorithm for the SVM, and modify it aiming at a lower training time while maintaining its classification generalization capacity. Two modifications are proposed to the SMO, one in the training algorithm and another in its architecture. The first modification enables more than one Lagrange coefficient update per cycle, by also updating the neighbor samples of the current working set. Among the many SVM implementations, SMO was chosen because it is one of the fastest and least memory-consuming ones. The computational complexity of the SMO is lower than that of other SVM algorithms because it does not require inverting a huge kernel matrix. Matrix inversion is one of the most time-consuming steps of SVM training, and the size of this square matrix grows with the number of support vectors in the sample set. The second modification proposes the creation of an ordered subdivision of the training set along a single reference dimension, chosen by an entropy measure. This subdivision differs from other division-based SVM architectures because samples are not repeatedly resubmitted to training. Finally, the improved SMO is applied to a one-class-like classification task of documents using binary classifiers.
Every document classification problem needs a good feature vector (feature selection and dimensionality reduction); we propose in this work a novel feature indexing mechanism using the pointwise total correlation.
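Since this abstract centers on the SMO training step, a minimal sketch of the classical pairwise update may help. This is an illustrative fragment with an invented toy dataset and a linear kernel; it shows only the standard two-coefficient update, not the thesis's modified version (which also refreshes support-vector candidates in the same cycle), and it omits the threshold update.

```python
# Minimal sketch of one SMO pair update for a linear-kernel SVM.
# Illustrative only: data and variable names are invented.

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def smo_pair_update(X, y, alpha, b, i, j, C=1.0):
    """Analytically optimize alpha[i], alpha[j], keeping the others fixed."""
    Kii, Kjj, Kij = dot(X[i], X[i]), dot(X[j], X[j]), dot(X[i], X[j])

    # Decision value f(x) = sum_k alpha_k y_k K(x_k, x) + b
    def f(x):
        return sum(alpha[k] * y[k] * dot(X[k], x) for k in range(len(X))) + b

    Ei, Ej = f(X[i]) - y[i], f(X[j]) - y[j]
    eta = Kii + Kjj - 2.0 * Kij        # curvature along the constraint line
    if eta <= 0:
        return alpha, b                 # skip degenerate pairs
    aj_new = alpha[j] + y[j] * (Ei - Ej) / eta
    # Clip alpha[j] to the feasible box [L, H]
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    aj_new = min(max(aj_new, L), H)
    ai_new = alpha[i] + y[i] * y[j] * (alpha[j] - aj_new)
    alpha[i], alpha[j] = ai_new, aj_new
    return alpha, b                     # (threshold b update omitted for brevity)

# Tiny linearly separable toy set
X = [[2.0, 2.0], [1.0, 1.0], [-1.0, -1.0], [-2.0, -2.0]]
y = [1, 1, -1, -1]
alpha = [0.0, 0.0, 0.0, 0.0]
alpha, b = smo_pair_update(X, y, alpha, 0.0, 1, 2)
print(alpha)
```

The point to notice is that each step touches only two coefficients; the thesis's first modification widens what gets refreshed inside this same cycle.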
|
132 |
Uso de Seleção de Características da Wikipedia na Classificação Automática de Textos. / Selection of Wikipedia features for automatic text classification. Alvarenga, Leonel Diógenes Carvalhaes, 20 September 2012 (has links)
Fundação de Amparo à Pesquisa do Estado de Goiás - FAPEG / The traditional methods of text classification typically represent documents only as a
set of words, also known as a Bag of Words (BOW). Several studies have shown good results when making use of thesauri and encyclopedias as external information sources, aiming to expand the BOW representation through the identification of synonymy and hyponymy relationships between terms present in a document collection. However, the expansion process may introduce terms that lead to an erroneous classification. In this work, we propose the use of feature selection measures to select the features extracted from Wikipedia, in order to improve the effectiveness of the expansion process. The study also proposes a feature selection measure called Tendency Factor to One Category (TF1C); the experiments showed that this measure is competitive with Information Gain, Gain Ratio and Chi-squared in this process, delivering the best gains in microF1 and macroF1 in most experiments. Using all the features selected in this process assisted the classification more stably, whereas restricting their insertion only to documents of the classes in which these features are well scored by the selection measures showed lower performance. When applied to the Reuters-21578, Ohsumed first-20000 and 20Newsgroups collections, our approach to feature selection reduced the noise insertion inherent in the expansion process, leveraged the use of hyponyms, and demonstrated that synonymy relationships from Wikipedia can also be used in document expansion, increasing the effectiveness of automatic text classification. / Os métodos tradicionais de classificação de textos normalmente representam documentos
apenas como um conjunto de palavras, também conhecido como BOW (do inglês, Bag of Words). Vários estudos têm mostrado bons resultados ao utilizar-se de tesauros e enciclopédias como fontes externas de informações, objetivando expandir a representação BOW a partir da identificação de relacionamentos de sinonímia e hiponímia entre os termos presentes em uma coleção de documentos. Todavia, o processo de expansão pode introduzir termos que conduzam a uma classificação errônea do documento. No presente trabalho, propõe-se a aplicação de medidas de avaliação de termos para a seleção de características extraídas da Wikipédia, com o objetivo de melhorar a eficácia de sua utilização durante o processo de expansão de documentos. O estudo também propõe uma medida de seleção de características denominada Fator de Tendência a uma Categoria (FT1C), de modo que os experimentos realizados demonstraram que esta medida apresenta desempenho competitivo com as medidas Information Gain, Gain Ratio e Chi-squared, neste processo, apresentando os melhores ganhos de microF1 e macroF1 na maioria dos experimentos realizados. O uso integral das características selecionadas neste processo demonstrou auxiliar a classificação de forma mais estável, ao passo que apresentou menor desempenho ao se restringir sua inserção somente aos documentos das classes em que estas características são bem pontuadas pelas medidas de seleção. Ao ser aplicada nas coleções Reuters-21578, Ohsumed first-20000 e 20Newsgroups, a abordagem com seleção de características permitiu a redução da inserção de ruídos inerentes ao processo de expansão e potencializou o uso de hipônimos, assim como demonstrou que as relações de sinonímia da Wikipédia também podem ser utilizadas na expansão de documentos, elevando a eficácia da classificação automática de textos.
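Of the baseline measures this abstract mentions, chi-squared is the easiest to sketch. The fragment below scores a (term, class) pair from a 2×2 contingency table of document counts; the counts and the example terms are invented for illustration.

```python
def chi_squared(n11, n10, n01, n00):
    """Chi-squared score of a (term, class) pair from a 2x2 contingency table.
    n11: docs in the class containing the term; n10: docs outside the class
    containing the term; n01: in-class docs without the term; n00: the rest."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return num / den if den else 0.0

# Hypothetical counts: a term that co-occurs strongly with one class
# versus a term that is spread evenly across classes.
strong = chi_squared(49, 3, 141, 45807)
weak = chi_squared(10, 12, 180, 45798)
print(strong > weak)
```

Higher scores mean a stronger term/class association, which is what makes the measure usable as a filter on candidate expansion features.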
|
133 |
Inferência das áreas de atuação de pesquisadores / Inference of the area of expertise of researchers. Felipe Penhorate Carvalho da Fonseca, 30 January 2018 (has links)
Atualmente, existe uma grande gama de dados acadêmicos disponíveis na web. Com estas informações é possível realizar tarefas como descoberta de especialistas em uma dada área, identificação de potenciais bolsistas de produtividade, sugestão de colaboradores, entre outras diversas. Contudo, o sucesso destas tarefas depende da qualidade dos dados utilizados, pois dados incorretos ou incompletos tendem a prejudicar o desempenho dos algoritmos aplicados. Diversos repositórios de dados acadêmicos não contêm ou não exigem a informação explícita das áreas de atuação dos pesquisadores. Nos dados dos currículos Lattes essa informação existe, porém é inserida manualmente pelo pesquisador sem que haja nenhum tipo de validação (e potencialmente possui informações desatualizadas, faltantes ou mesmo incorretas). O presente trabalho utilizou técnicas de aprendizado de máquina na inferência das áreas de atuação de pesquisadores com base nos dados cadastrados na plataforma Lattes. Os títulos da produção científica foram utilizados como fonte de dados, sendo estes enriquecidos com informações semanticamente relacionadas presentes em outras bases, além de adotar representações diversas para o texto dos títulos e outras informações acadêmicas como orientações e projetos de pesquisa. Objetivou-se avaliar se o enriquecimento dos dados melhora o desempenho dos algoritmos de classificação testados, além de analisar a contribuição de fatores como métricas de redes sociais, idioma dos títulos e a própria estrutura hierárquica das áreas de atuação no desempenho dos algoritmos. A técnica proposta pode ser aplicada a diferentes dados acadêmicos (não sendo restrita a dados presentes na plataforma Lattes), mas os dados oriundos dessa plataforma foram utilizados para os testes e validações da solução proposta. Como resultado, identificou-se que a técnica utilizada para realizar o enriquecimento do texto não auxiliou na melhoria da precisão da inferência. 
Todavia, as métricas de redes sociais e representações numéricas melhoram a inferência quando comparadas com técnicas do estado da arte, assim como o uso da própria estrutura hierárquica de classes, que retornou os melhores resultados dentre os obtidos. / Nowadays, there is a wide range of academic data available on the web. With this information, it is possible to solve tasks such as the discovery of specialists in a given area, identification of potential productivity-grant holders, suggestion of collaborators, among others. However, the success of these tasks depends on the quality of the data used, since incorrect or incomplete data tend to impair the performance of the applied algorithms. Several academic data repositories do not contain, or do not require, explicit information on the researchers' areas of expertise. In the data of the Lattes curricula this information exists, but it is inserted manually by each researcher without any kind of validation (and is thus potentially outdated, missing or even incorrect). The present work applied machine learning techniques to infer the researchers' areas of expertise based on the data registered in the Lattes platform. The titles of the scientific production were used as the data source, enriched with semantically related information from other bases; diverse representations were adopted for the text of the titles and for other academic information, such as advisorships and research projects. The objective of this dissertation was to evaluate whether the data enrichment improves the performance of the classification algorithms tested, and to analyze the contribution of factors such as social network metrics, the language of the titles and the hierarchical structure of the areas to the performance of the algorithms.
The proposed technique can be applied to other academic data (it is not restricted to the Lattes platform), but data from this platform were used for the tests and validation of the proposed solution. As a result, we identified that the text enrichment technique did not improve the accuracy of the inference. However, social network metrics and numerical representations improved the inference when compared to state-of-the-art techniques, as did the use of the hierarchical class structure itself, which returned the best results among those obtained.
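The hierarchical idea — first predict the coarse area, then choose a fine area only among its subareas — can be sketched with a toy word-overlap scorer. The corpus, labels and scoring below are invented stand-ins; the thesis works with real Lattes data and proper classifiers.

```python
from collections import Counter

# Toy (title, coarse_area, fine_area) triples — hypothetical Lattes-like data
corpus = [
    ("redes neurais para visão computacional", "computacao", "inteligencia artificial"),
    ("aprendizado de máquina em textos", "computacao", "inteligencia artificial"),
    ("complexidade de algoritmos de grafos", "computacao", "teoria da computacao"),
    ("síntese de polímeros condutores", "quimica", "fisico-quimica"),
]

def profile(titles):
    """Bag-of-words profile of a set of titles."""
    c = Counter()
    for t in titles:
        c.update(t.split())
    return c

def overlap(query, prof):
    return sum(prof[w] for w in query.split())

def hierarchical_predict(title):
    # Stage 1: pick the coarse area with the largest word overlap
    coarse_titles = {}
    for t, c, f in corpus:
        coarse_titles.setdefault(c, []).append(t)
    coarse = max(coarse_titles,
                 key=lambda c: overlap(title, profile(coarse_titles[c])))
    # Stage 2: pick the fine area, restricted to the predicted coarse area
    fine_titles = {}
    for t, c, f in corpus:
        if c == coarse:
            fine_titles.setdefault(f, []).append(t)
    fine = max(fine_titles,
               key=lambda f: overlap(title, profile(fine_titles[f])))
    return coarse, fine

print(hierarchical_predict("aprendizado de máquina para grafos"))
```

Restricting stage 2 to the children of the predicted coarse class is the structural shortcut that, per the abstract, gave the best results.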
|
134 |
Integrating Structure and Meaning: Using Holographic Reduced Representations to Improve Automatic Text Classification. Fishbein, Jonathan Michael, January 2008 (has links)
Current representation schemes for automatic text classification treat documents as syntactically unstructured collections of words (Bag-of-Words) or 'concepts' (Bag-of-Concepts). Past attempts to encode syntactic structure have treated part-of-speech information as another word-like feature, but have been shown to be less effective than non-structural approaches. We propose a new representation scheme using Holographic Reduced Representations (HRRs) as a technique to encode both semantic and syntactic structure, though in very different ways. This method is unique in the literature in that it encodes the structure across all features of the document vector while preserving text semantics. Our method does not increase the dimensionality of the document vectors, allowing for efficient computation and storage. We present the results of various Support Vector Machine classification experiments that demonstrate the superiority of this method over Bag-of-Concepts representations and improvement over Bag-of-Words in certain classification contexts.
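The HRR binding operation is circular convolution; a role vector bound to a filler can later be approximately unbound with circular correlation. A minimal sketch with invented vectors follows — the role here is an identity-like impulse, so binding leaves the filler unchanged and the numbers are easy to check.

```python
# Sketch of HRR binding (circular convolution) and unbinding
# (circular correlation). Vectors are tiny invented examples; real HRRs
# use high-dimensional random vectors, where unbinding is only approximate.

def circular_convolution(x, y):
    n = len(x)
    return [sum(x[k] * y[(i - k) % n] for k in range(n)) for i in range(n)]

def circular_correlation(x, y):
    n = len(x)
    return [sum(x[k] * y[(i + k) % n] for k in range(n)) for i in range(n)]

role = [1.0, 0.0, 0.0, 0.0]      # identity-like role vector (illustrative)
filler = [0.2, 0.5, -0.1, 0.3]   # e.g. a word's semantic vector
trace = circular_convolution(role, filler)
recovered = circular_correlation(role, trace)
print(trace, recovered)
```

The key property for the thesis is that the bound trace has the same dimensionality as its inputs, so structure can be layered into a document vector without growing it.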
|
136 |
A Document Similarity Measure and Its Applications. Gan, Zih-Dian, 07 September 2011 (has links)
In this work, we propose a novel similarity measure for document data processing and apply it to text classification and clustering. For two documents, the proposed measure takes three cases into account: (a) the feature considered appears in both documents, (b) the feature appears in only one document, and (c) the feature appears in neither document. For the first case, we give a lower bound and decrease the similarity according to the difference between the feature values of the two documents. For the second case, we give a fixed value, disregarding the magnitude of the feature value. For the last case, the feature contributes nothing to the similarity. We apply the measure to the similarity-based single-label classifier k-NN and the multi-label classifier ML-KNN, and extend it to measure the similarity between a document and a set of documents in a k-means-like clustering algorithm, comparing its effectiveness with that of other measures. Experimental results show that our proposed method works more effectively than the others.
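The three-case design can be sketched directly. The lower bound and fixed presence value below are invented parameters, not the thesis's actual settings.

```python
def doc_similarity(d1, d2, lower_bound=0.5, presence_value=0.2):
    """Per-term similarity, treating the three cases separately.
    d1, d2 map terms to non-negative weights; parameters are hypothetical."""
    vocab = set(d1) | set(d2)
    if not vocab:
        return 0.0
    total = 0.0
    for term in vocab:
        a, b = d1.get(term, 0.0), d2.get(term, 0.0)
        if a > 0 and b > 0:
            # Case (a): present in both — lower-bounded, decreasing with the
            # gap between the two feature values
            total += lower_bound + (1.0 - lower_bound) * min(a, b) / max(a, b)
        elif a > 0 or b > 0:
            # Case (b): present in only one — fixed value, magnitude ignored
            total += presence_value
        # Case (c): absent from both — contributes nothing (never reached,
        # since vocab only holds terms present in at least one document)
    return total / len(vocab)

d1 = {"svm": 3.0, "kernel": 1.0}
d2 = {"svm": 3.0, "cluster": 2.0}
print(doc_similarity(d1, d2))
```

A measure shaped like this plugs straight into k-NN-style classifiers or a k-means-like loop wherever a cosine similarity would normally go.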
|
137 |
Discovering Discussion Activity Flows in an On-line Forum Using Data Mining Techniques. Hsieh, Lu-shih, 22 July 2008 (has links)
In the Internet era, more and more courses are taught through a course management system (CMS) or learning management system (LMS). In an asynchronous virtual learning environment, an instructor needs to be aware of the progress of discussions in forums, and may intervene if necessary in order to facilitate students' learning. This research proposes a discussion forum activity flow tracking system, called FAFT (Forum Activity Flow Tracer), to automatically monitor the discussion activity flow of threaded forum postings in a CMS/LMS. As CMS/LMS platforms become popular in facilitating learning activities, the proposed FAFT can help instructors identify students' interaction types in discussion forums.
FAFT adopts modern data/text mining techniques to discover the patterns of forum discussion activity flows, which instructors can use to facilitate online learning activities. FAFT consists of two subsystems: activity classification (AC) and activity flow discovery (AFD). A posting can be perceived as a type of announcement, questioning, clarification, interpretation, conflict, or assertion. AC adopts a cascade model to classify the various activity types of posts in a discussion thread. The empirical evaluation of the classified types, from a repository of postings in earth science on-line courses in a senior high school, shows that AC can effectively facilitate the coding process, and that the cascade model can deal with the imbalanced distribution of discussion postings.
AFD adopts a hidden Markov model (HMM) to discover the activity flows. A discussion activity flow can be presented as an HMM diagram that an instructor can use to predict which type of activity flow a discussion thread may follow. The empirical results of the HMM, from an online forum on the earth science subject in a senior high school, show that FAFT can effectively predict the type of a discussion activity flow. Thus, the proposed FAFT can be embedded in a course management system to automatically predict the activity flow type of a discussion thread, and in turn reduce the teachers' load in managing online discussion forums.
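The AFD idea — scoring an observed sequence of activity types under an HMM — can be sketched with the forward algorithm. The states, symbols and every probability below are invented for illustration, not taken from the thesis.

```python
# Forward algorithm over a tiny, invented HMM of discussion activity flows.
states = ["question_driven", "conflict_driven"]
start = {"question_driven": 0.6, "conflict_driven": 0.4}
trans = {
    "question_driven": {"question_driven": 0.7, "conflict_driven": 0.3},
    "conflict_driven": {"question_driven": 0.4, "conflict_driven": 0.6},
}
emit = {
    "question_driven": {"questioning": 0.5, "clarification": 0.4, "conflict": 0.1},
    "conflict_driven": {"questioning": 0.2, "clarification": 0.2, "conflict": 0.6},
}

def sequence_likelihood(obs):
    """P(obs) under the HMM, summed over all hidden state paths."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: emit[s][o] * sum(alpha[r] * trans[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())

print(sequence_likelihood(["questioning", "clarification", "conflict"]))
```

Fitting one such HMM per flow type and picking the type whose model gives the observed thread the highest likelihood is one plausible reading of how prediction would work here.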
|
138 |
Rough set-based reasoning and pattern mining for information filtering. Zhou, Xujuan, January 2008 (has links)
An information filtering (IF) system monitors an incoming document stream to find the documents that match the information needs specified by the user profiles. Learning to use the user profiles effectively is one of the most challenging tasks when developing an IF system. With the document selection criteria better defined based on the users' needs, filtering large streams of information can be more efficient and effective. To learn the user profiles, term-based approaches have been widely used in the IF community because of their simplicity and directness, and they are relatively well established. However, these approaches have problems when dealing with polysemy and synonymy, which often lead to an information overload problem. Recently, pattern-based approaches (or Pattern Taxonomy Models (PTM) [160]) have been proposed for IF by the data mining community. These approaches are better at capturing semantic information and have shown encouraging results for improving the effectiveness of IF systems. On the other hand, pattern discovery from large data streams is not computationally efficient, and these approaches have to deal with low-frequency pattern issues. The measures used by data mining techniques to learn the profile (for example, "support" and "confidence") have turned out to be unsuitable for filtering, as they can lead to a mismatch problem. This thesis combines rough set-based (term-based) reasoning and pattern mining in a unified framework for information filtering to overcome the aforementioned problems. The system consists of two stages: a topic filtering stage and a pattern mining stage. The topic filtering stage is intended to minimize information overload by filtering out the most likely irrelevant information based on the user profiles. A novel user-profile learning method and a theoretical model of threshold setting have been developed using rough set decision theory.
The second stage (pattern mining) aims at solving the problem of information mismatch. This stage is precision-oriented. A new document-ranking function has been derived by exploiting the patterns in the pattern taxonomy; the most likely relevant documents are assigned higher scores by the ranking function. Because relatively few documents remain after the first stage, the computational cost is markedly reduced; at the same time, pattern discovery yields more accurate results. The overall performance of the system was improved significantly. The new two-stage information filtering model has been evaluated by extensive experiments. Tests were based on well-known IR benchmarking processes, using the latest version of the Reuters dataset, namely the Reuters Corpus Volume 1 (RCV1). The performance of the new two-stage model was compared with both term-based and data mining-based IF models. The results demonstrate that the proposed information filtering system significantly outperforms the other IF systems, such as the traditional Rocchio IF model, state-of-the-art term-based models including BM25 and Support Vector Machines (SVM), and the Pattern Taxonomy Model (PTM).
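The two-stage pipeline can be sketched as a cheap threshold filter followed by pattern-based ranking of the survivors. The profile terms, patterns, weights and threshold below are all invented; the thesis derives its profiles and threshold from rough set decision theory and its patterns from a pattern taxonomy.

```python
# Sketch of the two-stage idea: stage 1 discards likely irrelevant
# documents, stage 2 ranks the (few) survivors by pattern evidence.
profile = {"svm": 2.0, "kernel": 1.5, "filtering": 1.0}        # stage-1 topic profile
patterns = {("support", "vector"): 3.0, ("rough", "set"): 2.5}  # stage-2 patterns
THRESHOLD = 1.5                                                 # invented cutoff

def topic_score(doc_words):
    return sum(profile.get(w, 0.0) for w in doc_words)

def pattern_score(doc_words):
    ws = set(doc_words)
    # A pattern contributes its weight only if all its terms co-occur
    return sum(w for p, w in patterns.items() if set(p) <= ws)

def two_stage_filter(docs):
    survivors = [d for d in docs if topic_score(d) >= THRESHOLD]   # stage 1
    return sorted(survivors, key=pattern_score, reverse=True)      # stage 2

docs = [
    ["svm", "kernel", "support", "vector"],
    ["svm", "news"],
    ["cooking", "recipes"],
]
ranked = two_stage_filter(docs)
print(ranked)
```

The design point the abstract makes is visible even in this toy: the expensive pattern scoring runs only on documents that already cleared the cheap topic filter.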
|
139 |
Modelo preditivo de situações como apoio à consciência situacional e ao processo decisório em sistemas de resposta à emergência / Situations predictive model for aiding situation awareness and the decision process in emergency response systems. Berti, Claudia Beatriz, 28 August 2017 (has links)
Situation Awareness (SAW) is a concept widely used in areas that require critical decision making, and refers to the ability of an individual or team to perceive, understand and anticipate the future state of a current situation, under the influence of the dynamic and critical nature of events. SAW is considered the main precursor of the decision-making process. In the emergency response area, obtaining and maintaining SAW demands great effort from the human operator: the cognitive overload of the activity, the high level of stress involved in handling calls, and exhausting shifts may all reflect negatively on the response process and, consequently, on the decision process as a whole. Decision support systems that address aspects of SAW can contribute to enriching and maintaining the operator's SAW and to the decision-making process. Given this context, this work presents a Predictive Situations Model to systematize the development of modules that support the human operator's SAW in emergency response systems, using the service models and protocols of the institutions as prototypical situations. The model aims at predicting, or identifying early, the situation while the emergency call is still being handled. A Conceptual Model was also developed, which guided the construction of the Predictive Model and will serve as a basis for further developments. So-called human sensors and social sensors have become important sources of information, especially in social networks. To process such data, text classification methods have been used with satisfactory results in areas such as education, security, entertainment and commerce. In the emergency response domain, the object of this thesis, human sensors are the main source of information, and machine learning techniques such as text classifiers are promising alternatives.
To be validated, the Predictive Situations Model was implemented with the creation of a vocabulary based on the actual decision-making models of the Military Police of the State of São Paulo (PMESP) and the implementation of two classification methods (Bag of Words and Naïve Bayes). Tests were performed with four different types of input instances (sentences). For all the metrics analyzed (precision, accuracy and recall), the tests demonstrated the superiority of the Naïve Bayes algorithm. The difference in hit rate relative to the Bag of Words algorithm, for the class of instances with the highest degree of identification difficulty, was over 37%. These results demonstrate the good potential of the Predictive Situations Model to complement existing emergency service systems, allowing more effective call handling and reducing the cognitive overload to which attendants are routinely subjected. / Consciência da situação ou consciência situacional (Situation Awareness – SAW) é um conceito amplamente utilizado em áreas que requerem tomada de decisão crítica, e se refere à habilidade de um indivíduo ou equipe de percepção, compreensão e antecipação de estado futuro de uma situação corrente, que é influenciada pela dinamicidade e natureza crítica de eventos. SAW é considerada como principal precursora do processo decisório. Em domínios, por exemplo, de resposta à emergência, obter e manter SAW requer do operador humano grande esforço, pela sobrecarga cognitiva exigida na atividade, alto nível de estresse que envolve o atendimento, turnos exaustivos que podem refletir negativamente no processo de atendimento e consequentemente no processo decisório como um todo. Sistemas de apoio à tomada de decisão que contemplam aspectos da SAW podem contribuir no enriquecimento e manutenção da SAW do operador e no processo decisório.
Diante desse contexto, este trabalho apresenta um Modelo Preditivo de Situações para sistematizar o desenvolvimento de módulos de apoio a SAW de operadores humanos em sistemas de resposta à emergência, que prevê a utilização de modelos de atendimento e protocolos das instituições atuando como situações prototípicas. Objetivamente o modelo propõe a previsão e ou a identificação prematura da situação em tempo real ao atendimento da emergência. Conjuntamente foi desenvolvido um Modelo Conceitual que norteou a construção do Modelo Preditivo e servirá como base a outros desenvolvimentos. Atualmente os denominados sensores humanos e sensores sociais, especialmente de redes sociais, estão sendo utilizados, de forma crescente, como importantes fontes de informação para a melhor compreensão de situações em diferentes áreas de aplicação. No domínio de resposta à emergência, objeto de estudo desta tese, os sensores humanos são a principal fonte de informação, sobre a qual técnicas de aprendizagem de máquina como classificadores de texto foram aplicadas com resultados muito positivos. Para ser validado, o Modelo Preditivo de Situações foi implementado com a criação de um vocabulário baseado nos modelos decisórios reais da Polícia Militar do Estado de São Paulo (PMESP) e com o desenvolvimento de algoritmos de dois métodos classificadores (Bag of Words e Naïve Bayes). Testes foram realizados com quatro tipos diferentes de instâncias de entrada (frases). Para todas as métricas analisadas (precisão, acurácia e cobertura) os testes demonstraram superioridade do algoritmo Naïve Bayes. A diferença entre a taxa de acerto em relação ao algoritmo Bag of Words para a classe de instâncias com maior grau de dificuldade de identificação foi superior a 37%. 
Tais resultados demonstraram bom potencial do Modelo Preditivo de Situações de colaborar com os sistemas já existentes de atendimento emergencial, possibilitando maior efetividade no atendimento e diminuição da sobrecarga cognitiva a que são submetidos os atendentes cotidianamente.
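A minimal multinomial Naïve Bayes over short phrases, of the kind compared in these experiments, can be sketched as follows. The training phrases and class names are invented; no PMESP vocabulary or protocol is reproduced here.

```python
import math
from collections import Counter, defaultdict

# Invented toy training set of short emergency phrases
train = [
    ("roubo de veículo na avenida", "roubo"),
    ("assalto a mão armada", "roubo"),
    ("acidente com vítima na rodovia", "acidente"),
    ("colisão entre dois carros", "acidente"),
]

class_counts = Counter(c for _, c in train)
word_counts = defaultdict(Counter)
vocab = set()
for text, c in train:
    for w in text.split():
        word_counts[c][w] += 1
        vocab.add(w)

def predict(text):
    best, best_lp = None, -math.inf
    for c in class_counts:
        lp = math.log(class_counts[c] / len(train))   # class prior
        total = sum(word_counts[c].values())
        for w in text.split():
            # Laplace smoothing so unseen words do not zero the probability
            lp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

print(predict("roubo de carro armado"))
```

The smoothing step is what lets such a classifier cope with the hardest instance classes the abstract mentions, where many words of the incoming sentence were never seen in training.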
|
140 |
[en] MACHINE LEARNING FOR SENTIMENT CLASSIFICATION / [pt] APRENDIZADO DE MÁQUINA PARA O PROBLEMA DE SENTIMENT CLASSIFICATION. PEDRO OGURI, 18 May 2007 (has links)
[pt] Sentiment Analysis é um problema de categorização de texto no qual deseja-se identificar opiniões favoráveis e desfavoráveis com relação a um tópico. Um exemplo destes tópicos de interesse são organizações e seus produtos. Neste problema, documentos são classificados pelo sentimento, conotação, atitudes e opiniões, ao invés de se restringir aos fatos neles descritos. O principal desafio em Sentiment Classification é identificar como sentimentos são expressados em textos e se tais sentimentos indicam uma opinião positiva (favorável) ou negativa (desfavorável) com relação a um tópico. Devido ao crescente volume de dados disponível na Web, onde todos tendem a ser geradores de conteúdo e a expressar opiniões sobre os mais variados assuntos, técnicas de Aprendizado de Máquina vêm se tornando cada vez mais atraentes. Nesta dissertação investigamos métodos de Aprendizado de Máquina para Sentiment Analysis. Apresentamos alguns modelos de representação de documentos, como saco de palavras e N-grama. Testamos os classificadores SVM (Máquina de Vetores Suporte) e Naive Bayes com diferentes modelos de representação textual e comparamos seus desempenhos.
/ [en] Sentiment Analysis is a text categorization problem in which we want to identify favorable and unfavorable opinions towards a given topic. Examples of such topics are organizations and their products. In this problem, documents are classified according to their sentiment, connotation, attitudes and opinions instead of being limited to the facts described in them. The main challenge in Sentiment Classification is identifying how sentiments are expressed in texts and whether they indicate a positive (favorable) or negative (unfavorable) opinion towards a topic. Due to the growing volume of information available online, in an environment where we all tend to be content generators and express opinions on a variety of subjects, Machine Learning techniques have become more and more attractive. In this dissertation, we investigate Machine Learning methods applied to Sentiment Analysis. We present document representation models such as bag-of-words and N-grams. We compare the performance of the Naive Bayes and Support Vector Machine classifiers for each proposed model.
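The two representation models this dissertation compares can be sketched in a few lines; the sample review below is invented.

```python
def bag_of_words(text):
    """Unigram features: each word stands alone."""
    return text.split()

def ngrams(text, n=2):
    """N-gram features: adjacent words are kept together."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

review = "not a good product"
print(bag_of_words(review))  # unigram view: "good" looks positive in isolation
print(ngrams(review))        # bigram view keeps "not a" together
```

The example hints at why representation choice matters for sentiment: under bag-of-words the token "good" is detached from its negation, while bigrams preserve some of that local context for the classifier.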
|