281

A Conditional Random Field (CRF) Based Machine Learning Framework for Product Review Mining

Ming, Yue January 2019 (has links)
The task of opinion mining from product reviews has typically been addressed with rule-based approaches or generative learning models such as hidden Markov models (HMMs). This paper introduces a discriminative model using linear-chain Conditional Random Fields (CRFs), which can naturally incorporate arbitrary, non-independent features of the input without assuming conditional independence among the features or a particular distribution of the inputs. The framework first performs part-of-speech (POS) tagging over each word in the sentences of the review text. Performance is evaluated on three criteria: precision, recall and F-score. The results show that this approach is effective for this type of natural language processing (NLP) task. The framework then extracts the keywords associated with each product feature and summarizes them into concise lists that are simple and intuitive for people to read.
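The tagging step can be illustrated by the decoding side of a linear-chain model. The sketch below is a minimal Viterbi decoder over hand-set additive potentials; in a trained CRF these scores would be sums of learned feature weights. The tag set and all weights are invented for illustration, not taken from the thesis.

```python
def viterbi(obs, states, emit, trans, start):
    """Most likely tag sequence under a linear-chain model.

    start[s], trans[p][s] and emit[s][w] are additive (log-space) scores;
    in a trained CRF they would be sums of learned feature weights.
    Unknown words get a flat penalty of -5.0.
    """
    V = [{s: start[s] + emit[s].get(obs[0], -5.0) for s in states}]
    back = []
    for w in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + trans[p][s])
            col[s] = V[-1][prev] + trans[prev][s] + emit[s].get(w, -5.0)
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    tag = max(states, key=lambda s: V[-1][s])
    path = [tag]
    for ptr in reversed(back):        # follow back-pointers to recover the path
        path.append(ptr[path[-1]])
    return path[::-1]

# Toy model: tags and weights are invented for illustration.
STATES = ["DET", "NOUN", "VERB"]
START = {"DET": 1.0, "NOUN": 0.0, "VERB": -1.0}
EMIT = {"DET": {"the": 2.0},
        "NOUN": {"battery": 2.0, "review": 2.0},
        "VERB": {"lasts": 2.0}}
TRANS = {"DET": {"DET": -2.0, "NOUN": 2.0, "VERB": -2.0},
         "NOUN": {"DET": -1.0, "NOUN": 0.0, "VERB": 1.0},
         "VERB": {"DET": 1.0, "NOUN": 0.0, "VERB": -2.0}}

print(viterbi(["the", "battery", "lasts"], STATES, EMIT, TRANS, START))
# → ['DET', 'NOUN', 'VERB']
```

Training would then adjust the feature weights so that decoded sequences match annotated data; decoding itself is unchanged.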
282

Ukhetho : A Text Mining Study Of The South African General Elections

Moodley, Avashlin January 2019 (has links)
The elections in South Africa are contested by multiple political parties appealing to a diverse population from a variety of socioeconomic backgrounds. As a result, a rich source of discourse is created to inform voters about election-related content. Two common sources of information that help voters with their decision are news articles and tweets; this study aims to understand the discourse in these two sources using natural language processing. Topic modelling techniques, Latent Dirichlet Allocation and Non-negative Matrix Factorization, are applied to distil the breadth of information collected about the elections into topics. The topics produced are subjected to further analysis that uncovers similarities between topics, links topics to dates and events, and provides a summary of the discourse that existed prior to the South African general elections. The primary focus is on the 2019 elections; however, election-related articles from 2014 and 2019 were also compared to understand how the discourse has changed. / Mini Dissertation (MIT (Big Data Science))--University of Pretoria, 2019. / Computer Science / MIT (Big Data Science) / Unrestricted
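Of the two topic-modelling techniques, Non-negative Matrix Factorization is the easier to sketch from first principles: a document-term count matrix V is factored into document-topic weights W and topic-term weights H. The toy below uses Lee-Seung multiplicative updates on an invented four-document matrix; a real study would use an optimized library implementation.

```python
import random

def nmf(V, k, iters=500, seed=0):
    """Toy NMF: factor nonnegative V (list of rows, n x m) into W (n x k)
    and H (k x m) with Lee-Seung multiplicative updates on squared error."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    eps = 1e-9
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        WtV = [[sum(W[i][a] * V[i][j] for i in range(n)) for j in range(m)]
               for a in range(k)]
        WtW = [[sum(W[i][a] * W[i][b] for i in range(n)) for b in range(k)]
               for a in range(k)]
        WtWH = [[sum(WtW[a][b] * H[b][j] for b in range(k)) for j in range(m)]
                for a in range(k)]
        H = [[H[a][j] * WtV[a][j] / (WtWH[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        VHt = [[sum(V[i][j] * H[a][j] for j in range(m)) for a in range(k)]
               for i in range(n)]
        HHt = [[sum(H[a][j] * H[b][j] for j in range(m)) for b in range(k)]
               for a in range(k)]
        WHHt = [[sum(W[i][b] * HHt[b][a] for b in range(k)) for a in range(k)]
                for i in range(n)]
        W = [[W[i][a] * VHt[i][a] / (WHHt[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return W, H

# Toy document-term counts: two docs about one theme, two about another.
V = [[2, 2, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 2, 2],
     [0, 0, 1, 1]]
W, H = nmf(V, k=2)
```

Each row of H is then read off as a "topic" by listing its highest-weighted terms; LDA arrives at a comparable decomposition probabilistically.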
283

Zlepšení předpovědi sociálních značek využitím Data Mining / Improved Prediction of Social Tags Using Data Mining

Harár, Pavol January 2015 (has links)
This master’s thesis uses text mining to predict the tags of articles. It describes an iterative way of handling big data files: parsing the data, cleaning it, and scoring the terms in each article with TF-IDF. The flow of the program, written in Python 3.4.3, is described in detail. The result of processing more than 1 million articles from the Wikipedia database is a dictionary of English terms; using this dictionary, one can determine the most important terms of an article within a corpus. The relevance of the resulting tags validates the method used here.
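The TF-IDF scoring at the core of such a pipeline can be sketched in a few lines. This is an illustrative toy on invented token lists, not the thesis's actual program:

```python
import math
from collections import Counter

def tfidf(docs):
    """Score each term of each document by tf-idf.

    docs is a list of token lists; returns one {term: score} dict per doc.
    tf is the relative frequency in the doc, idf is log(N / document freq).
    """
    N = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))            # count each term once per document
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (tf[t] / len(doc)) * math.log(N / df[t]) for t in tf})
    return scores

# Invented mini-corpus; the highest-scoring terms become tag candidates.
docs = [["python", "code", "python"], ["code", "tea"], ["tea", "cake"]]
scores = tfidf(docs)
```

Terms frequent in one article but rare across the corpus score highest, which is exactly the property wanted for tag candidates.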
284

Klasifikace textu pomocí metody SVM / Text Classification with the SVM Method

Synek, Radovan January 2010 (has links)
This thesis deals with text mining. It focuses on document classification and related techniques, mainly data preprocessing. The project also introduces the SVM method, which was chosen for classification, and covers the design and testing of the implemented application.
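The classification method it names can be illustrated with a minimal linear SVM trained by Pegasos-style stochastic subgradient descent on the hinge loss. The bag-of-words vectors below are invented, and a real application would use an optimized SVM library; this is only a sketch of the principle.

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Pegasos-style SGD on the hinge loss; labels y must be -1 or +1."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    t = 0
    idx = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            t += 1
            eta = 1.0 / (lam * t)                       # decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1 - eta * lam) * wj for wj in w]      # regularization shrink
            if margin < 1:                              # hinge active: push margin
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w

def predict(w, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1

# Toy bag-of-words vectors over a 2-word vocabulary, labels +1 / -1.
X = [[2, 0], [1, 0], [0, 2], [0, 1]]
y = [1, 1, -1, -1]
w = train_linear_svm(X, y)
```

In a real text classifier, each vector position would hold the (weighted) count of one vocabulary term after preprocessing.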
285

Algoritmus pro detekci pozitívního a negatívního textu / The algorithm for the detection of positive and negative text

Musil, David January 2016 (has links)
As information and communication technology develops swiftly, the amount of information produced by various sources grows as well. Sorting this data and extracting knowledge from it requires significant effort that cannot easily be supplied by humans, so machine processing takes over. Detecting emotion in text data is an interesting, widely applied and rapidly expanding area of research. The purpose of this thesis is to create a system for detecting positive and negative emotion in text, and to evaluate its performance. The system was created in the Java programming language and supports training on large amounts of data (Big Data) using the Spark library. The thesis describes the structure of the database used as the source of input data and how its text is handled. The classifier model was created using Support Vector Machines and optimized with the n-gram method.
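The n-gram features mentioned at the end are easy to sketch. Word n-grams let a classifier see short phrases such as "not good" as single features, which plain unigrams cannot capture; the helper below is a generic illustration, not the thesis's Java code.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous word n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def features(text, n_max=2):
    """Bag of 1..n_max-grams as a sparse count vector (a Counter)."""
    tokens = text.lower().split()
    feats = Counter()
    for n in range(1, n_max + 1):
        feats.update(ngrams(tokens, n))
    return feats
```

Feeding such feature vectors to the SVM lets negations and fixed phrases influence the positive/negative decision directly.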
286

Modélisation automatique des conversations en tant que processus d'intentions de discours interdépendantes / Automatically modeling conversations as processes of interrelated speech Intentions

Epure, Elena Viorica 14 December 2018 (has links)
La prolifération des données numériques a permis aux communautés de scientifiques et de praticiens de créer de nouvelles technologies basées sur les données pour mieux connaître les utilisateurs finaux et en particulier leur comportement. L’objectif est alors de fournir de meilleurs services et un meilleur support aux personnes dans leur expérience numérique. La majorité de ces technologies créées pour analyser le comportement humain utilisent très souvent des données de logs générées passivement au cours de l’interaction homme-machine. Une particularité de ces traces comportementales est qu’elles sont enregistrées et stockées selon une structure clairement définie. En revanche, les traces générées de manière proactive sont très peu structurées et représentent la grande majorité des données numériques existantes. De plus, les données non structurées se trouvent principalement sous forme de texte. À ce jour, malgré la prédominance des données textuelles et la pertinence des connaissances comportementales dans de nombreux domaines, les textes numériques sont encore insuffisamment étudiés en tant que traces du comportement humain pour révéler automatiquement des connaissances détaillées sur le comportement. L’objectif de recherche de cette thèse est de proposer une méthode indépendante du corpus pour exploiter automatiquement les communications asynchrones en tant que traces de comportement générées de manière proactive afin de découvrir des modèles de processus de conversations, axés sur des intentions de discours et des relations, toutes deux exhaustives et détaillées. Plusieurs contributions originales sont faites. Il y est menée la seule revue systématique existante à ce jour sur la modélisation automatique des conversations asynchrones avec des actes de langage. Une taxonomie des intentions de discours est dérivée de la linguistique pour modéliser la communication asynchrone. 
Comparée à toutes les taxonomies des travaux connexes, celle proposée est indépendante du corpus, à la fois plus détaillée et exhaustive dans le contexte donné, et son application par des non-experts est prouvée au travers d’expériences approfondies. Une méthode automatique, indépendante du corpus, pour annoter les énoncés de communication asynchrone avec la taxonomie des intentions de discours proposée, est conçue sur la base d’un apprentissage automatique supervisé. Pour cela, deux corpus "ground-truth" validés sont créés et trois groupes de caractéristiques (discours, contenu et conversation) sont conçus pour être utilisés par les classificateurs. En particulier, certaines des caractéristiques du discours sont nouvelles et définies en considérant des moyens linguistiques pour exprimer des intentions de discours, sans s’appuyer sur le contenu explicite du corpus, le domaine ou les spécificités des types de communication asynchrones. Une méthode automatique basée sur la fouille de processus est conçue pour générer des modèles de processus d’intentions de discours interdépendantes à partir de tours de parole, annotés avec plusieurs labels par phrase. Comme la fouille de processus repose sur des logs d’événements structurés et bien définis, un algorithme est proposé pour produire de tels logs d’événements à partir de conversations. Par ailleurs, d’autres solutions pour transformer les conversations annotées avec plusieurs labels par phrase en logs d’événements, ainsi que l’impact des différentes décisions sur les modèles comportementaux en sortie, sont analysées afin d’alimenter de futures recherches. Des expériences et des validations qualitatives à la fois en médecine et en analyse conversationnelle montrent que la solution proposée donne des résultats fiables et pertinents. Cependant, des limitations sont également identifiées ; elles devront être abordées dans de futurs travaux. 
/ The proliferation of digital data has enabled scientific and practitioner communities to create new data-driven technologies to learn about user behaviors in order to deliver better services and support to people in their digital experience. The majority of these technologies extensively derive value from data logs passively generated during the human-computer interaction. A particularity of these behavioral traces is that they are structured. However, the pro-actively generated text across the Internet is highly unstructured and represents the overwhelming majority of behavioral traces. To date, despite its prevalence and the relevance of behavioral knowledge to many domains, such as recommender systems, cyber-security and social network analysis, the digital text is still insufficiently tackled as a trace of human behavior from which extensive insights into behavior could be revealed automatically. The main objective of this thesis is to propose a corpus-independent method to automatically exploit asynchronous communication as pro-actively generated behavior traces in order to discover process models of conversations, centered on comprehensive speech intentions and relations. The solution is built in three iterations, following a design science approach. Multiple original contributions are made. The only systematic study to date on the automatic modeling of asynchronous communication with speech intentions is conducted. A speech intention taxonomy is derived from linguistics to model asynchronous communication and, compared to all taxonomies from the related works, it is corpus-independent and comprehensive (both finer-grained and exhaustive in the given context), and its application by non-experts is proven feasible through extensive experiments. A corpus-independent, automatic method to annotate utterances of asynchronous communication with the proposed speech intention taxonomy is designed based on supervised machine learning. 
For this, validated ground-truth corpora are created and groups of features (discourse, content and conversation-related) are engineered to be used by the classifiers. In particular, some of the discourse features are novel and defined by considering linguistic means to express speech intentions, without relying on the corpus's explicit content or domain, or on specificities of the asynchronous communication types. Then, an automatic method based on process mining is designed to generate process models of interrelated speech intentions from conversation turns annotated with multiple speech intentions per sentence. As process mining relies on well-defined, structured event logs, an algorithm to produce such logs from conversations is proposed. Additionally, an extensive design rationale on how conversations annotated with multiple labels per sentence can be transformed into event logs, and on the impact of different decisions on the output behavioral models, is released to support future research. Experiments and qualitative validations in medicine and conversation analysis show that the proposed solution yields reliable and relevant results, but limitations are also identified, to be addressed in future work.
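The conversion step from multi-label conversation turns to a process-mining event log can be sketched as follows. The field layout (case id, activity, timestamp) is the generic event-log shape process mining expects; the toy conversation and label names are assumptions for illustration, not the thesis's algorithm.

```python
def to_event_log(conversations):
    """Flatten annotated conversations into (case_id, activity, timestamp)
    events: one case per conversation, one event per speech-intention label.
    A sentence may carry several intention labels, hence the inner loop."""
    log = []
    for case_id, turns in conversations.items():
        for timestamp, labels in turns:
            for label in labels:
                log.append((case_id, label, timestamp))
    log.sort(key=lambda e: (e[0], e[2]))   # order events within each case
    return log

# Toy conversation: two turns, the second carrying two intentions.
conv = {"thread-1": [(1, ["Question"]), (2, ["Answer", "Suggestion"])]}
log = to_event_log(conv)
```

A process-discovery algorithm can then mine the ordering relations between intention labels across many such cases.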
287

Classification of Stock Exchange News

Kroha, Petr, Baeza-Yates, Ricardo 24 November 2004 (has links)
In this report we investigate how much similarity good news and bad news may have in the context of long-term market trends. We discuss the relation between text mining, classification, and information retrieval. We present examples that use identical sets of words yet have quite different meanings, and examples that can be interpreted in either a positive or a negative sense, so that the decision is as difficult after reading them as before. These examples show that methods of information retrieval alone are not strong enough to solve such problems. To search for common properties in groups of news items we used classifiers (e.g. the naive Bayes classifier), after finding that diagnostic methods did not deliver reasonable results. For our experiments we used historical data concerning the German market index DAX 30. / In diesem Bericht untersuchen wir, wieviel Ähnlichkeit gute und schlechte Nachrichten im Kontext von Langzeitmarkttrends besitzen. Wir diskutieren die Verbindungen zwischen Text Mining, Klassifikation und Information Retrieval. Wir präsentieren Beispiele, die identische Wortmengen verwenden, aber trotzdem recht unterschiedliche Bedeutungen besitzen; Beispiele, die sowohl positiv als auch negativ interpretiert werden können. Sie zeigen Probleme auf, die mit Methoden des Information Retrieval nicht gelöst werden können. Um nach Gemeinsamkeiten in Nachrichtengruppen zu suchen, verwendeten wir Klassifikatoren (z.B. Naive Bayes), nachdem wir herausgefunden hatten, dass der Einsatz von diagnostizierenden Methoden keine vernünftigen Resultate erzielte. Für unsere Experimente nutzten wir historische Daten des Deutschen Aktienindex DAX 30.
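The naive Bayes classifier mentioned above can be sketched as a multinomial model with add-one (Laplace) smoothing. The toy headlines below are invented stand-ins, not items from the DAX news corpus:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Multinomial naive Bayes; docs are token lists, labels class names."""
    classes = sorted(set(labels))
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, c in zip(docs, labels):
        counts[c].update(doc)
    vocab = {w for doc in docs for w in doc}
    return prior, counts, vocab

def predict_nb(model, doc):
    """Pick the class with the highest smoothed log-posterior."""
    prior, counts, vocab = model
    best, best_lp = None, -math.inf
    for c in prior:
        total = sum(counts[c].values())
        lp = math.log(prior[c])
        for w in doc:
            lp += math.log((counts[c][w] + 1) / (total + len(vocab)))  # Laplace
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Invented toy headlines, tokenized, with "good"/"bad" market labels.
docs = [["profit", "rises"], ["index", "rises"],
        ["loss", "falls"], ["index", "falls"]]
labels = ["good", "good", "bad", "bad"]
model = train_nb(docs, labels)
```

The report's point survives even in the toy: a headline built only of shared words such as "index" carries almost no class signal, which is exactly the ambiguity the examples in the text demonstrate.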
288

Fast Data Analysis Methods For Social Media Data

Nhlabano, Valentine Velaphi 07 August 2018 (has links)
The advent of Web 2.0 technologies, which support the collaborative and participatory creation and publishing of social media content by all users in the form of user-generated content and social networks, has led to the creation of vast amounts of structured, semi-structured and unstructured data. The sudden rise of social media has led to its wide adoption by organisations of all sizes worldwide, eager to take advantage of this new way of communicating and engaging with their stakeholders in ways that were unimaginable before. Data generated from social media is highly unstructured, which makes it challenging for most organisations, whose tools are normally designed for handling and analysing structured data from business transactions. The research reported in this dissertation investigated fast and efficient methods for retrieving, storing and analysing unstructured data from social media in order to make crucial and informed business decisions on time. Sentiment analysis was conducted on Twitter posts, called tweets. Twitter, one of the most widely adopted social network services, provides an API (Application Programming Interface) for researchers and software developers to connect to the Twitter database and collect public data sets from it. A Twitter application was created and used to collect streams of real-time public data via the Twitter source provided by Apache Flume, efficiently storing this data in the Hadoop Distributed File System (HDFS). Apache Flume is a distributed, reliable, and available system used to efficiently collect, aggregate and move large amounts of log data from many different sources to a centralized data store such as HDFS. Apache Hadoop is an open-source software library that runs on low-cost commodity hardware and can store, manage and analyse large amounts of both structured and unstructured data quickly, reliably and flexibly at low cost. 
A lexicon-based sentiment analysis approach was taken, with the AFINN-111 lexicon used for scoring. The Twitter data was analysed from HDFS using a Java MapReduce implementation. MapReduce is a programming model and associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. The results demonstrate that this approach is a fast, efficient and economical way to analyse unstructured data from social media in real time. / Dissertation (MSc)--University of Pretoria, 2019. / National Research Foundation (NRF) - Scarce skills / Computer Science / MSc / Unrestricted
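The lexicon-based scoring step reduces to summing word valences. Below is a plain-Python sketch with a tiny hand-picked fragment standing in for the lexicon (the real AFINN-111 list maps roughly 2,477 words and phrases to integer valences between -5 and +5); it illustrates the scoring rule only, not the dissertation's Java MapReduce job.

```python
# A few entries in the style of AFINN-111; treat the exact valences as
# illustrative assumptions rather than the official list.
LEXICON = {"good": 3, "great": 3, "love": 3, "bad": -3, "awful": -3, "hate": -3}

def score(tweet):
    """Sum the valence of every known word; unknown words contribute 0."""
    return sum(LEXICON.get(w.strip(".,!?"), 0) for w in tweet.lower().split())

def label(tweet):
    """Map the total valence to a coarse sentiment label."""
    s = score(tweet)
    return "positive" if s > 0 else "negative" if s < 0 else "neutral"
```

In the MapReduce setting, the map phase would emit per-tweet scores computed exactly like this, and the reduce phase would aggregate them.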
289

Análisis de la relación existente entre el ewom generado por consumidoras de servicios de belleza en Facebook en Lima Metropolitana, de acuerdo con su puntuación o recomendación. Un enfoque desde el Text mining

Torres Shuan, Nicole Ailen 28 November 2019 (has links)
Tema: Análisis de la relación existente entre el ewom generado por consumidoras de servicios de belleza en Facebook en Lima Metropolitana, de acuerdo con su puntuación o recomendación. Un enfoque desde el Text mining. Objetivo: Usar técnicas de minería de texto en el procesamiento del ewom que ayuden a explicar el grado de relación existente entre la puntuación/valoración y el sentimiento de un comentario en el sector de belleza. La presente investigación tiene como tema central el estudio del boca a boca electrónico, también conocido como ewom, y la coherencia existente entre esta variable y su acompañante (valoración/puntuación); esta relación es medida a través de indicadores propuestos, de los cuales la variable "sentimiento" representa la mayor influencia en el modelo. Para lograr el objetivo propuesto se realizaron estudios de tipo cualitativo y cuantitativo. El desarrollo cualitativo se centró en investigar el accionar de las consumidoras al dejar una opinión a modo de cocreación de valor para con las empresas del rubro de la belleza. El estudio cuantitativo fue progresivo, ya que involucró el uso de diversas herramientas para el resultado final: en primera instancia se recolectó la base de opiniones, se aplicaron filtros, se analizaron los sentimientos de los comentarios con el software "Semantria for Excel" y, por último, se realizó el análisis de regresiones con la herramienta estadística SPSS. Es importante reconocer que los dos tipos de investigación ayudaron a afianzar el modelo, ya que permitieron conocer el comportamiento actual de las usuarias peruanas de servicios de belleza en el canal digital, y si su aporte (ewom) estaba asociado con el sentimiento relativo a la satisfacción del servicio recibido. Al finalizar la investigación, se proponen recomendaciones a nivel digital (online) y de servicio (offline) para generar una mayor satisfacción en las usuarias. 
/ Topic: Analysis of the relation between the ewom generated by consumers of beauty services on Facebook in Metropolitan Lima and the assigned score or recommendation. An approach from text mining. Objective: To use text mining techniques in processing the ewom to help explain the degree of relation between the score/valuation and the sentiment of a comment in the beauty sector. The central theme of this research is the study of electronic word of mouth, also known as ewom, and the coherence between this variable and its companion (assessment/score); this relationship is measured through proposed indicators, of which the "feeling" (sentiment) variable has the greatest influence on the model. To achieve the proposed objective, qualitative and quantitative studies were carried out. The qualitative work focused on researching the actions of consumers when leaving an opinion as a form of co-creation of value with companies in the beauty sector. The quantitative study was progressive, since it involved the use of various tools to reach the final result: first, the set of opinions was gathered and filters were applied; then the sentiment of the comments was analysed with the "Semantria for Excel" software; and finally, regression analysis was performed with the SPSS statistical tool. It is important to recognize that the two types of research helped to strengthen the model, since they revealed the current behaviour of Peruvian users of beauty services on the digital channel and whether their contribution (ewom) was associated with the sentiment related to satisfaction with the service received. At the end of the research, recommendations are proposed at the digital (online) and service (offline) levels to generate greater satisfaction among users. / Trabajo de investigación
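The relation being measured (sentiment of a comment versus its star score) ultimately reduces to a correlation or regression over paired observations. A minimal Pearson correlation, with invented (sentiment, rating) pairs standing in for the study's data:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented (sentiment score, star rating) pairs for illustration only.
sentiment = [-0.8, -0.2, 0.1, 0.6, 0.9]
stars = [1, 2, 3, 4, 5]
r = pearson(sentiment, stars)
```

A coefficient near +1 would indicate the coherence the study looks for: comments whose text reads positive also carry high ratings.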
290

Srovnání sylabů předmětů na různých univerzitách dolováním znalosti z textu / Comparing Course Syllabi at Different Universities by Text Mining

Moravcová, Libuše January 2018 (has links)
The thesis focuses on how to obtain the most accurate information about universities, faculties, fields of study, and the syllabi of their particular subjects using text-mining tools. The first part describes the basics of text mining and related topics, including collecting the text data and translating it into English. In the next phase, a database is generated from the accumulated data entries. The following step aims to obtain the best-matching results, such as specific phrases. The thesis ends with a procedure for evaluating and summarizing the results; where problems arise, possible solutions or alternatives are suggested.
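Matching syllabi across universities typically reduces to a similarity measure over term vectors. A minimal bag-of-words cosine similarity, on invented toy syllabi rather than the thesis's data:

```python
import math
from collections import Counter

def cosine(tokens_a, tokens_b):
    """Cosine similarity of two documents as bag-of-words count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy syllabi, already tokenized and translated to English.
syllabus_a = ["data", "mining", "classification", "clustering"]
syllabus_b = ["data", "mining", "text", "classification"]
```

Ranking syllabus pairs by this score surfaces the closest-matching courses; weighting the counts by TF-IDF would further de-emphasize terms common to all syllabi.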