About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
131

Integrating Structure and Meaning: Using Holographic Reduced Representations to Improve Automatic Text Classification

Fishbein, Jonathan Michael January 2008 (has links)
Current representation schemes for automatic text classification treat documents as syntactically unstructured collections of words (Bag-of-Words) or 'concepts' (Bag-of-Concepts). Past attempts to encode syntactic structure have treated part-of-speech information as just another word-like feature, and have been shown to be less effective than non-structural approaches. We propose a new representation scheme that uses Holographic Reduced Representations (HRRs) to encode both semantic and syntactic structure, each captured in a different way. This method is unique in the literature in that it encodes structure across all features of the document vector while preserving text semantics. Our method does not increase the dimensionality of the document vectors, allowing for efficient computation and storage. We present results from a range of Support Vector Machine classification experiments that demonstrate the superiority of this method over Bag-of-Concepts representations and an improvement over Bag-of-Words in certain classification contexts.
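The property the abstract emphasizes — combining two vectors without increasing dimensionality — comes from the circular-convolution binding operation at the heart of Plate's HRRs. The sketch below illustrates that general operation only; the dimension and the role/filler names are invented, and the thesis's actual document-encoding pipeline is more involved.

```python
# Minimal sketch of HRR binding by circular convolution (pure Python).
# The role/filler interpretation and the dimension n are hypothetical.
import math
import random

def circular_convolution(a, b):
    """Bind two equal-length vectors; the result keeps the same dimensionality."""
    n = len(a)
    return [sum(a[i] * b[(k - i) % n] for i in range(n)) for k in range(n)]

def random_vector(n, seed):
    """HRR vectors are typically drawn i.i.d. from N(0, 1/n)."""
    rng = random.Random(seed)
    sd = 1.0 / math.sqrt(n)
    return [rng.gauss(0.0, sd) for _ in range(n)]

n = 512
role = random_vector(n, seed=1)    # e.g. a syntactic-role vector
filler = random_vector(n, seed=2)  # e.g. a word's semantic vector
bound = circular_convolution(role, filler)
assert len(bound) == n  # binding does not grow the document vector
```

Because binding preserves dimensionality, sums of many role-filler bindings can be stored in a single fixed-size document vector, which is what makes the approach efficient to compute and store.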
133

A Document Similarity Measure and Its Applications

Gan, Zih-Dian 07 September 2011 (has links)
In this paper, we propose a novel similarity measure for document data processing and apply it to text classification and clustering. For two documents, the proposed measure distinguishes three cases: (a) the feature under consideration appears in both documents, (b) it appears in only one document, and (c) it appears in neither document. For the first case, we start from a lower bound and decrease the similarity according to the difference between the feature values of the two documents. For the second case, we assign a fixed value regardless of the magnitude of the feature value. For the last case, the feature contributes nothing to the similarity. We apply the measure to the similarity-based single-label classifier k-NN and the multi-label classifier ML-KNN, and adapt these properties to measure the similarity between a document and a specific set for document clustering (a k-means-like algorithm), comparing its effectiveness with other measures. Experimental results show that our proposed method works more effectively than the others.
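The three cases above can be sketched directly. Note this is a hedged illustration: the lower bound, the fixed "presence" value, and the normalization are hypothetical parameters standing in for the exact functional forms the thesis defines.

```python
# Illustrative three-case similarity over feature-value dictionaries.
# lower_bound and presence_value are invented parameters, not the thesis's.
def document_similarity(d1, d2, lower_bound=0.5, presence_value=0.1):
    features = set(d1) | set(d2)
    if not features:
        return 0.0
    total = 0.0
    for f in features:
        if f in d1 and f in d2:
            # Case (a): start from a lower bound, then decrease the
            # similarity with the difference between the feature values.
            diff = abs(d1[f] - d2[f]) / max(d1[f], d2[f])
            total += lower_bound + (1.0 - lower_bound) * (1.0 - diff)
        else:
            # Case (b): a fixed value, regardless of magnitude.
            total += presence_value
        # Case (c) -- feature in neither document -- contributes nothing,
        # so iterating over the union of features suffices.
    return total / len(features)

a = {"text": 3.0, "mining": 1.0}
b = {"text": 3.0, "cluster": 2.0}
sim = document_similarity(a, b)  # shared "text" matches exactly -> 0.4 here
```

Dropping case (c) entirely is what distinguishes this family of measures from, say, cosine over a full vocabulary vector, where jointly absent features still affect the norm.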
134

Discovering Discussion Activity Flows in an On-line Forum Using Data Mining Techniques

Hsieh, Lu-shih 22 July 2008 (has links)
In the Internet era, more and more courses are taught through a course management system (CMS) or learning management system (LMS). In an asynchronous virtual learning environment, an instructor needs to be aware of the progress of discussions in forums, and may intervene if necessary in order to facilitate students' learning. This research proposes a discussion forum activity flow tracking system, called FAFT (Forum Activity Flow Tracer), to automatically monitor the discussion activity flow of threaded forum postings in a CMS/LMS. As CMS/LMS platforms become popular for facilitating learning activities, the proposed FAFT can help instructors identify students' interaction types in discussion forums. FAFT adopts modern data/text mining techniques to discover the patterns of forum discussion activity flows, which instructors can use to facilitate online learning activities. FAFT consists of two subsystems: activity classification (AC) and activity flow discovery (AFD). A posting can be perceived as a type of announcement, questioning, clarification, interpretation, conflict, or assertion. AC adopts a cascade model to classify the various activity types of posts in a discussion thread. An empirical evaluation of the classified types, drawn from a repository of postings in earth science online courses at a senior high school, shows that AC can effectively facilitate the coding process, and that the cascade model can deal with the imbalanced distribution of discussion postings. AFD adopts a hidden Markov model (HMM) to discover the activity flows. A discussion activity flow can be presented as an HMM diagram that an instructor can use to predict which discussion activity flow type a thread may follow. The empirical results of the HMM, from an online forum for an earth science subject at a senior high school, show that FAFT can effectively predict the type of a discussion activity flow.
Thus, the proposed FAFT can be embedded in a course management system to automatically predict the activity flow type of a discussion thread, and in turn reduce teachers' workload in managing online discussion forums.
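An HMM over discussion activities can be sketched with the standard forward recursion. Everything below is invented for illustration — the states, transition and emission probabilities are toy values, not FAFT's trained parameters — but it shows how a sequence of classified posts (the AC stage's output) would be scored.

```python
# Toy HMM over discussion activity types, scored with the forward algorithm.
# All probabilities are invented; a trained model would supply real ones.
STATES = ["questioning", "clarification", "assertion"]
START = {"questioning": 0.6, "clarification": 0.2, "assertion": 0.2}
TRANS = {
    "questioning":   {"questioning": 0.2, "clarification": 0.5, "assertion": 0.3},
    "clarification": {"questioning": 0.3, "clarification": 0.3, "assertion": 0.4},
    "assertion":     {"questioning": 0.4, "clarification": 0.2, "assertion": 0.4},
}
# Emission: probability of observing a classifier-assigned label
# while the flow is in a given hidden state (mostly diagonal here).
EMIT = {s: {o: (0.8 if o == s else 0.1) for o in STATES} for s in STATES}

def sequence_likelihood(observations):
    """P(observations) under the toy HMM, via the forward recursion."""
    alpha = {s: START[s] * EMIT[s][observations[0]] for s in STATES}
    for obs in observations[1:]:
        alpha = {
            s: sum(alpha[p] * TRANS[p][s] for p in STATES) * EMIT[s][obs]
            for s in STATES
        }
    return sum(alpha.values())

p = sequence_likelihood(["questioning", "clarification", "assertion"])
```

Comparing such likelihoods across several trained HMMs (one per flow type) is one standard way to predict which flow type a thread is following.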
135

Rough set-based reasoning and pattern mining for information filtering

Zhou, Xujuan January 2008 (has links)
An information filtering (IF) system monitors an incoming document stream to find the documents that match the information needs specified by the user profiles. Learning to use the user profiles effectively is one of the most challenging tasks when developing an IF system. With document selection criteria better defined according to users' needs, filtering large streams of information can be more efficient and effective. To learn the user profiles, term-based approaches have been widely used in the IF community because of their simplicity and directness, and they are relatively well established. However, these approaches have problems when dealing with polysemy and synonymy, which often lead to information overload. Recently, pattern-based approaches (or Pattern Taxonomy Models (PTM) [160]) have been proposed for IF by the data mining community. These approaches are better at capturing semantic information and have shown encouraging results for improving the effectiveness of IF systems. On the other hand, pattern discovery from large data streams is not computationally efficient, and these approaches also have to deal with low-frequency patterns. The measures used by data mining techniques (for example, "support" and "confidence") to learn the profile have turned out to be unsuitable for filtering, and can lead to a mismatch problem. This thesis uses rough set-based (term-based) reasoning and pattern mining as a unified framework for information filtering to overcome the aforementioned problems. The system consists of two stages: a topic filtering stage and a pattern mining stage. The topic filtering stage is intended to minimize information overload by filtering out the most likely irrelevant information based on the user profiles. A novel user-profile learning method and a theoretical model of threshold setting have been developed using rough set decision theory.
The second stage (pattern mining) aims at solving the information mismatch problem and is precision-oriented. A new document-ranking function has been derived by exploiting the patterns in the pattern taxonomy; the most likely relevant documents are assigned higher scores by this function. Because relatively few documents remain after the first stage, the computational cost is markedly reduced, and at the same time pattern discovery yields more accurate results. The overall performance of the system is improved significantly. The new two-stage information filtering model has been evaluated by extensive experiments based on well-known IR benchmarking processes, using the latest version of the Reuters dataset, namely the Reuters Corpus Volume 1 (RCV1). The performance of the new two-stage model was compared with both term-based and data mining-based IF models. The results demonstrate that the proposed information filtering system significantly outperforms the other IF systems, including the traditional Rocchio IF model and state-of-the-art term-based models such as BM25, Support Vector Machines (SVM), and the Pattern Taxonomy Model (PTM).
136

Modelo preditivo de situações como apoio à consciência situacional e ao processo decisório em sistemas de resposta à emergência / Situations predictive model for aid situation awareness and decision process in emergency response systems

Berti, Claudia Beatriz 28 August 2017 (has links)
Situation Awareness (SAW) is a concept widely used in areas that require critical decision making. It refers to the ability of an individual or team to perceive, understand, and anticipate the future state of a current situation, which is influenced by the dynamic and critical nature of events, and it is considered the main precursor of the decision-making process. In the emergency response domain, obtaining and maintaining SAW demands great effort from the human operator: the cognitive overload required by the activity, the high level of stress involved in handling calls, and exhausting shifts can all negatively affect the response process and, consequently, the decision process as a whole.
Decision support systems that address aspects of SAW can contribute to enriching and maintaining the operator's SAW and to the decision-making process. Given this context, this work presents a Situational Predictive Model to systematize the development of modules that support the human operator's SAW in emergency response systems, using the service models and protocols of the institutions involved as prototypical situations. Concretely, the model proposes the prediction, or early identification, of the situation while the caller is still receiving emergency assistance. A Conceptual Model was also developed; it guided the construction of the Predictive Model and will serve as a basis for further developments. So-called human sensors and social sensors, especially on social networks, have become important sources of information. To process such data, text classification methods have been applied with satisfactory results in areas including education, security, entertainment, and commerce. In the emergency response domain, the object of this thesis, human sensors are the main source of information, and machine learning techniques such as text classifiers are promising alternatives. To validate it, the Predictive Situations Model was implemented by creating a vocabulary based on the actual decision-making models of the Military Police of the State of São Paulo (PMESP) and by implementing two classification methods (Bag of Words and Naïve Bayes). Tests were performed with four different types of input instances (sentences). For all the metrics analyzed (precision, accuracy, and coverage), the tests showed the superiority of the Naïve Bayes algorithm; for the class of instances with the highest degree of identification difficulty, the difference in hit rate relative to the Bag of Words algorithm was over 37%.
These results demonstrate the good potential of the Predictive Situations Model to complement existing emergency service systems, allowing more effective call handling and reducing the cognitive overload to which operators are routinely subjected.
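A multinomial Naïve Bayes classifier of the kind compared in this thesis can be sketched in a few lines. The emergency-call phrases and class labels below are invented English stand-ins for the PMESP vocabulary, and the smoothing choice (Laplace) is an assumption; only the general technique matches the abstract.

```python
# Minimal multinomial Naive Bayes over bag-of-words counts.
# Training sentences and labels are hypothetical illustrations.
import math
from collections import Counter, defaultdict

def train(samples):
    """samples: list of (text, label) pairs. Returns the model triple."""
    priors, counts, vocab = Counter(), defaultdict(Counter), set()
    for text, label in samples:
        priors[label] += 1
        for w in text.lower().split():
            counts[label][w] += 1
            vocab.add(w)
    return priors, counts, vocab

def predict(model, text):
    priors, counts, vocab = model
    total = sum(priors.values())
    best, best_lp = None, float("-inf")
    for label in priors:
        lp = math.log(priors[label] / total)
        denom = sum(counts[label].values()) + len(vocab)
        for w in text.lower().split():
            lp += math.log((counts[label][w] + 1) / denom)  # Laplace smoothing
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train([
    ("armed robbery in progress", "robbery"),
    ("suspect fled after robbery", "robbery"),
    ("car crash on the highway", "accident"),
    ("vehicle accident with injuries", "accident"),
])
label = predict(model, "robbery suspect armed")  # -> "robbery"
```

Its robustness to word order and sparse vocabularies is one plausible reason such a probabilistic model outperformed a plain term-matching baseline on short, noisy call transcripts.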
137

[en] MACHINE LEARNING FOR SENTIMENT CLASSIFICATION / [pt] APRENDIZADO DE MÁQUINA PARA O PROBLEMA DE SENTIMENT CLASSIFICATION

PEDRO OGURI 18 May 2007 (has links)
[en] Sentiment Analysis is a text categorization problem in which we want to identify favorable and unfavorable opinions towards a given topic. Examples of such topics are organizations and their products. In this problem, documents are classified according to their sentiment, connotation, attitudes, and opinions instead of being limited to the facts they describe. The main challenge in Sentiment Classification is identifying how sentiments are expressed in texts and whether they indicate a positive (favorable) or negative (unfavorable) opinion towards a topic. Due to the growing volume of information available online, in an environment where we all tend to be content generators and express opinions on a variety of subjects, Machine Learning techniques have become more and more attractive.
In this dissertation, we investigate Machine Learning methods applied to Sentiment Analysis. We present document representation models such as bag-of-words and N-grams, and compare the performance of the Naive Bayes and Support Vector Machine classifiers for each proposed model.
138

Explorer et apprendre à partir de collections de textes multilingues à l'aide des modèles probabilistes latents et des réseaux profonds / Mining and learning from multilingual text collections using topic models and word embeddings

Balikas, Georgios 20 October 2017 (has links)
Text is one of the most pervasive and persistent sources of information. Content analysis of text, in its broad sense, refers to methods for studying and retrieving information from documents. Nowadays, with ever-increasing amounts of text becoming available online in several languages and different styles, content analysis of text is of tremendous importance, as it enables a variety of applications. To this end, unsupervised representation learning methods such as topic models and word embeddings constitute prominent tools. The goal of this dissertation is to study and address challenging problems in this area, focusing both on the design of novel text mining algorithms and tools and on studying how these tools can be applied to text collections written in one or several languages. In the first part of the thesis we focus on topic models, and more precisely on how to incorporate prior information about text structure into such models. Topic models are built on the bag-of-words premise, and therefore words are exchangeable. While this assumption simplifies the calculation of the conditional probabilities, it results in a loss of information. To overcome this limitation, we propose two mechanisms that extend topic models by integrating knowledge of text structure. We assume that the documents are partitioned into thematically coherent text segments. The first mechanism assigns the same topic to the words of a segment. The second capitalizes on the properties of copulas, a tool mainly used in the fields of economics and risk management, which models the joint probability density of random variables while having access only to their marginals. The second part of the thesis explores bilingual topic models for comparable corpora with explicit document alignments. Typically, a document collection for such models is in the form of comparable document pairs.
The documents of a pair are written in different languages and are thematically similar. Unless they are translations, the documents of a pair are similar only to some extent. Meanwhile, representative topic models assume that the documents have identical topic distributions, which is a strong and limiting assumption. To overcome it, we propose novel bilingual topic models that incorporate the notion of cross-lingual similarity between the documents of a pair into their generative and inference processes. Calculating this cross-lingual document similarity is a task in itself, which we propose to address using cross-lingual word embeddings. The last part of the thesis concerns the use of word embeddings and neural networks for three text mining applications. First, we discuss polylingual document classification, where we argue that translations of a document can be used to enrich its representation. Using an auto-encoder to obtain these robust document representations, we demonstrate improvements in the task of multi-class document classification. Second, we explore multi-task sentiment classification of tweets, arguing that jointly training classification systems on correlated tasks can improve the obtained performance. To this end, we show how to achieve state-of-the-art performance on a sentiment classification task using recurrent neural networks. The third application we explore is cross-lingual information retrieval: given a document written in one language, the task consists of retrieving the most similar documents from a pool of documents written in another language. In this line of research, we show that important improvements can be obtained by adapting the transportation problem to the task of estimating document distances.
139

[en] STOCK MARKET BEHAVIOR PREDICTION USING FINANCIAL NEWS IN PORTUGUESE / [pt] PREDIÇÃO DO COMPORTAMENTO DO MERCADO FINANCEIRO UTILIZANDO NOTÍCIAS EM PORTUGUÊS

HERALDO PIMENTA BORGES FILHO 27 August 2015 (has links)
[en] A set of financial theories, such as the efficient market hypothesis and the theory of the random walk, says it is impossible to predict the future of the stock market based on currently available information. However, recent research has proven otherwise by finding a relationship between the content of a news item and the current behavior of a stock. Our goal is to design and implement a prediction algorithm that uses financial news about publicly traded companies to predict stock behavior on the stock exchange. We use an approach based on machine learning for the task of predicting whether a stock will move up, down, or stay neutral, using quantitative and qualitative information such as financial news. We evaluate our system on a dataset of six thousand news items, and our experiments indicate an accuracy of 68.57 percent for the task.
140

A Semantic Triplet Based Story Classifier

January 2013 (has links)
abstract: Text classification, in the artificial intelligence domain, is an activity in which text documents are automatically classified into predefined categories using machine learning techniques. An example is classifying uncategorized news articles into predefined categories such as "Business", "Politics", "Education", "Technology", etc. This thesis follows a supervised machine learning approach, in which a module is first trained with pre-classified training data and the class of test data is then predicted. Good feature extraction is an important step in the machine learning approach, and hence the main component of this text classifier is semantic-triplet-based features, in addition to traditional features such as standard keyword-based features and statistical features based on shallow parsing (such as the density of POS tags and named entities). A triplet {Subject, Verb, Object} in a sentence is defined as a relation between the subject and the object, the relation being the predicate (verb). Triplet extraction is a five-step process that takes as input a corpus of web text documents, each consisting of one or more paragraphs, ranging from RSS feeds to lists from extremist websites. The input corpus feeds into the "Pronoun Resolution" step, which uses a heuristic approach to identify the noun phrases referenced by pronouns. The next step, "SRL Parser", is a shallow semantic parser that converts the pronoun-resolved paragraphs into an annotated predicate-argument format. The output of the SRL parser is processed by the "Triplet Extractor" algorithm, which forms triplets of the form {Subject, Verb, Object}. Generalization and reduction of the triplet features is the next step: the reduced feature representation decreases computing time, yields better discriminatory behavior, and mitigates the curse of dimensionality. For training and testing, a ten-fold cross-validation approach is followed.
In each round, an SVM classifier is trained with 90% of the labeled (training) data, and in the testing phase the classes of the remaining 10% of unlabeled (testing) data are predicted. In conclusion, this thesis proposes a model with semantic-triplet-based features for story classification, and its effectiveness is demonstrated against other traditional features used in the literature for text classification tasks. / Dissertation/Thesis / M.S. Computer Science 2013
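The target structure of the pipeline above — a {Subject, Verb, Object} triplet per sentence — can be illustrated with a deliberately naive extractor. The real system relies on pronoun resolution and an SRL parser; this sketch merely splits on a tiny invented verb list to show what the "Triplet Extractor" step produces.

```python
# Toy {Subject, Verb, Object} extraction; KNOWN_VERBS is a hypothetical
# stand-in for what a shallow semantic (SRL) parser would identify.
KNOWN_VERBS = {"acquired", "announced", "attacked", "built"}

def extract_triplet(sentence):
    """Return (subject, verb, object) around the first known verb, else None."""
    words = sentence.rstrip(".").split()
    for i, w in enumerate(words):
        if w.lower() in KNOWN_VERBS and 0 < i < len(words) - 1:
            return (" ".join(words[:i]), w, " ".join(words[i + 1:]))
    return None

triplet = extract_triplet("The company acquired a small startup.")
# -> ("The company", "acquired", "a small startup")
```

Each extracted triplet then becomes a (generalized, reduced) feature for the SVM, alongside the keyword and shallow-parsing features the abstract mentions.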
