Global ETD Search

291	Classification of Stock Exchange News Kroha, Petr, Baeza-Yates, Ricardo 24 November 2004 In this report we investigate how much similarity good news and bad news may have in the context of long-term market trends. We discuss the relation between text mining, classification, and information retrieval. We present examples that use an identical set of words but have quite different meanings, and examples that can be interpreted in either a positive or a negative sense, so that the decision remains as difficult after reading them as before. Our examples show that methods of information retrieval are not strong enough to solve the problems specified above. To search for common properties in groups of news items we used classifiers (e.g. the naive Bayes classifier) after we found that diagnostic methods did not deliver reasonable results. For our experiments we used historical data concerning the German market index DAX 30. / In diesem Bericht untersuchen wir, wieviel Ähnlichkeit gute und schlechte Nachrichten im Kontext von Langzeitmarkttrends besitzen. Wir diskutieren die Verbindungen zwischen Text Mining, Klassifikation und Information Retrieval. Wir präsentieren Beispiele, die identische Wortmengen verwenden, aber trotzdem recht unterschiedliche Bedeutungen besitzen; Beispiele, die sowohl positiv als auch negativ interpretiert werden können. Sie zeigen Probleme auf, die mit Methoden des Information Retrieval nicht gelöst werden können. Um nach Gemeinsamkeiten in Nachrichtengruppen zu suchen, verwendeten wir Klassifikatoren (z.B. Naive Bayes), nachdem wir herausgefunden hatten, dass der Einsatz von diagnostizierenden Methoden keine vernünftigen Resultate erzielte. Für unsere Experimente nutzten wir historische Daten des Deutschen Aktienindex DAX 30. info:eu-repo/classification/ddc/330 ddc:330 Aktienbörse Automatische Klassifikation Bayes-Verfahren Information Retrieval Text Mining
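The naive Bayes classification mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tokenizer, the Laplace smoothing, the function names and the toy headlines are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately crude tokenizer."""
    return text.lower().split()


def train_naive_bayes(labeled_docs):
    """Collect per-class document and word counts from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log-posterior (Laplace smoothing)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy headlines; not data from the report.
TRAINING = [
    ("profit rises sharply", "good"),
    ("record profit and strong growth", "good"),
    ("profit falls sharply", "bad"),
    ("losses widen and growth slows", "bad"),
]
MODEL = train_naive_bayes(TRAINING)
```

Because this model treats a text as a bag of words, two headlines built from the same words in a different order receive the same label, which is one way to see why texts with identical word sets but different meanings are hard to separate.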
292	Fast Data Analysis Methods For Social Media Data Nhlabano, Valentine Velaphi 07 August 2018 The advent of Web 2.0 technologies, which support the creation and publishing of social media content in a collaborative and participatory way by all users in the form of user-generated content and social networks, has led to the creation of vast amounts of structured, semi-structured and unstructured data. The sudden rise of social media has led to their wide adoption by organisations of various sizes worldwide in order to take advantage of this new way of communicating and engaging with their stakeholders in ways that were unimaginable before. Data generated from social media is highly unstructured, which makes it challenging for most organisations, which are normally used to handling and analysing structured data from business transactions. The research reported in this dissertation was carried out to investigate fast and efficient methods for retrieving, storing and analysing unstructured data from social media in order to make crucial and informed business decisions on time. Sentiment analysis was conducted on Twitter data (tweets). Twitter, one of the most widely adopted social network services, provides an API (Application Programming Interface) for researchers and software developers to connect to and collect public data sets from the Twitter database. A Twitter application was created and used to collect streams of real-time public data via a Twitter source provided by Apache Flume and to store this data efficiently in the Hadoop Distributed File System (HDFS). Apache Flume is a distributed, reliable, and available system used to efficiently collect, aggregate and move large amounts of log data from many different sources to a centralized data store such as HDFS.
Apache Hadoop is an open source software library that runs on low-cost commodity hardware and can store, manage and analyse large amounts of both structured and unstructured data quickly, reliably, and flexibly at low cost. A lexicon-based sentiment analysis approach was taken and the AFINN-111 lexicon was used for scoring. The Twitter data was analysed from HDFS using a Java MapReduce implementation. MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. The results demonstrate that it is fast, efficient and economical to use this approach to analyse unstructured data from social media in real time. / Dissertation (MSc)--University of Pretoria, 2019. / National Research Foundation (NRF) - Scarce skills / Computer Science / MSc / Unrestricted Big Data Machine Learning Sentiment Analysis Text Mining Apache Hadoop UCTD
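The lexicon-based scoring in a map/reduce style described in this abstract might look roughly like the sketch below. The dissertation used a Java MapReduce job on Hadoop; this single-process Python version only illustrates the idea. The tiny `LEXICON` is a hypothetical stand-in for AFINN-111, and the function names are assumptions.

```python
from collections import defaultdict

# Hypothetical stand-in for a valence lexicon such as AFINN-111, which maps
# words to integer scores; these few entries are illustrative only.
LEXICON = {"good": 3, "great": 3, "bad": -3, "terrible": -3}


def map_tweet(tweet_id, text):
    """Map step: emit (tweet_id, score) for every lexicon word in the tweet."""
    for word in text.lower().split():
        token = word.strip(".,!?#@\"'")  # drop surrounding punctuation
        if token in LEXICON:
            yield tweet_id, LEXICON[token]


def reduce_scores(pairs):
    """Reduce step: sum the emitted scores for each tweet_id."""
    totals = defaultdict(int)
    for tweet_id, score in pairs:
        totals[tweet_id] += score
    return dict(totals)


def score_tweets(tweets):
    """Run map then reduce in-process over a {tweet_id: text} mapping."""
    pairs = []
    for tweet_id, text in tweets.items():
        pairs.extend(map_tweet(tweet_id, text))
    return reduce_scores(pairs)
```

In this sketch a tweet that contains no lexicon word emits nothing in the map step and therefore does not appear in the result; a real job would have to decide whether to report such tweets as neutral.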
293	Análisis de la relación existente entre el ewom generado por consumidoras de servicios de belleza en Facebook en Lima Metropolitana, de acuerdo con su puntuación o recomendación. Un enfoque desde el Text mining Torres Shuan, Nicole Ailen 28 November 2019 Tema: Análisis de la relación existente entre el ewom generado por consumidoras de servicios de belleza en Facebook en Lima Metropolitana, de acuerdo con su puntuación o recomendación. Un enfoque desde el Text mining. Objetivo: Usar técnicas de minería de texto en el procesamiento del ewom que ayude a explicar el grado de relación existente entre la puntuación/ valoración y el sentimiento de un comentario en el sector de belleza. La presente investigación tiene como tema central el estudio del boca a boca electrónico o también conocido como ewom; y la coherencia existente entre esta variable y su acompañante (valoración/ puntuación); esta relación es medida a través de indicadores propuestos, de los cuales una variable representa la mayor influencia en el modelo, esta es la variable “sentimiento”. Para poder lograr el objetivo propuesto se realizaron estudios de tipo cualitativo y cuantitativo. El desarrollo cualitativo se centró en investigar el accionar de las consumidoras al dejar una opinión a modo de cocreación de valor para con las empresas del rubro de la belleza. El estudio cuantitativo fue progresivo; ya que, involucró el uso de diversas herramientas para el resultado final; en primera instancia se recolectó la base de opiniones, se aplicaron filtros, se analizaron los sentimientos de los comentarios con el software de “semantria for Excel”; y, por último, se realizó el análisis de regresiones con la herramienta estadística SPSS.
Es importante reconocer que los dos tipos de investigación ayudaron a afianzar el modelo; ya que, permitieron conocer el comportamiento actual de las usuarias peruanas de servicio de belleza en el canal digital; y si su aporte (ewom) estaba asociado con el sentimiento relativo a la satisfacción del servicio recibido. Al finalizar la investigación, se proponen recomendaciones a nivel digital (online) y servicio (offline) para generar una mayor satisfacción en las usuarias. / Topic: Analysis of the relationship between the ewom generated by consumers of beauty services on Facebook in Metropolitan Lima and the assigned score or recommendation. An approach from text mining. Objective: To use text mining techniques in processing ewom to help explain the degree of relationship between the score/rating and the sentiment of a comment in the beauty sector. The central theme of this research is the study of electronic word-of-mouth, also known as ewom, and the coherence between this variable and its accompanying variable (rating/score); this relationship is measured through proposed indicators, of which one variable, "sentiment", has the greatest influence on the model. To achieve the proposed objective, both qualitative and quantitative studies were carried out. The qualitative study focused on investigating how consumers act when leaving an opinion as a form of value co-creation with companies in the beauty sector. The quantitative study was progressive, since it involved the use of various tools to reach the final result: first, the base of opinions was collected and filters were applied; then the sentiment of the comments was analyzed with the "semantria for Excel" software; and finally, regression analysis was performed with the SPSS statistical tool.
It is important to recognize that the two types of research helped to strengthen the model, since they revealed the current behavior of Peruvian users of beauty services on the digital channel, and whether their contribution (ewom) was associated with the sentiment relating to satisfaction with the service received. At the end of the research, recommendations are proposed at the digital (online) and service (offline) levels to generate greater satisfaction among users. / Trabajo de investigación Minería de texto Redes sociales Servicio de estética Text mining Social media Aesthetics services
294	Srovnání sylabů předmětů na různých univerzitách dolováním znalosti z textu [Comparison of subject syllabi at different universities through text mining] Moravcová, Libuše January 2018 The thesis focuses on how to obtain the most accurate information about universities, faculties, fields of study, and the syllabi of particular subjects at those universities using text-mining tools. The first part describes the basics of text mining and related topics, and the collection and preparation of the text data, including its translation into English. In the next phase, a database is generated from the accumulated data entries. The following step aims to obtain the best-matching results, such as specific phrases. Evaluation and summarization are carried out at the end of the thesis. Where problems arise, possible solutions or alternatives are suggested.
295	Získavanie a analýza dát pre oblasť crowdfundingu [Data acquisition and analysis for the crowdfunding domain] Koštial, Martin January 2019 The thesis deals with the acquisition of crowdfunding data and its analysis. The theoretical part describes the available technologies and algorithms for data analysis. The practical part carries out the data collection and then applies data mining and text mining algorithms to the collected data.
296	Analýza textových používateľských hodnotení vybranej skupiny produktov [Analysis of textual user reviews of a selected group of products] Valovič, Roman January 2019 This work focuses on the design of a system that identifies frequently discussed product features in product reviews, summarizes them, and displays them to the user in terms of sentiment. The work deals with natural language processing, with a specific focus on the Czech language. The reader is introduced to methods of text preprocessing and their impact on the quality of the analysis results. The most frequently discussed product features are identified by cluster analysis using the K-Means algorithm, on the assumption that sufficiently internally homogeneous clusters will represent the individual product features. A new area explored in this work is the representation of documents using word embeddings, and the potential of using the resulting vector space as input for machine learning algorithms.
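The K-Means clustering step named in this abstract can be sketched as follows. This is a minimal version of Lloyd's algorithm under stated assumptions, not the thesis code: points are plain tuples (in the thesis setting they would be document or word vectors), and the first `k` points seed the centroids purely to keep the example deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, iterations=20):
    """Lloyd's algorithm. Seeding with the first k points is an assumption
    made for determinism; random or k-means++ seeding is more usual."""
    centroids = [tuple(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean(members)
    return assignments, centroids
```

For example, `k_means([(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (0.2, 0.1)], k=2)` separates the three points near the origin from the two points near (5, 5), even though both seeds start in the same group.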
297	Ursäkta, vi har lite bråttom : Om automatisering för att effektivisera tillgängliggörandet av affärstryck / Would You Mind Hurrying Up Please : On Automatization as a Means of Improved Efficiency When Cataloging Commercial Ephemera Hellgren, Andreas January 2019 The demand on research libraries to digitize their collections as a means of increasing the availability of those collections is increasing. However, a prerequisite for this is the cataloging of the collections, a task commonly associated with large demands on time and other resources. One way of handling this might be to apply automatization as part of the cataloging process. This thesis examines the possibilities of using automatization when cataloging commercial ephemera. For this, focus is directed towards the features of the material, the process of cataloging, and the demands placed on the catalogued material by its various users, using a theory based on Monica J. Bates' (2002) Cascade model. A case study, consisting of observations based on contextual inquiry and interviews partly using photo elicitation, finds automatization of cataloging to be a possible way to improve availability, but not without its own complications and demands on resources. In conclusion, suggestions are made for considerations libraries should be aware of before automatization is implemented at research libraries. contextual inquiry photo elicitation automatisering maskininlärning text mining katalogisering efemärt tryck affärstryck Information Studies Biblioteks- och informationsvetenskap
298	Automatisk synonymgenerering med Word2Vec for query expansion inom e-handel Kojic, Kemal, Petersson, Emil January 2018 I detta arbete undersöks hur väl automatisk synonymgenerering genom maskininlärnings-metoden Word2Vec, som tränats över en datamängd från Google News på hundra miljarder ord, lämpar sig för query expansion inom ehandel. Detta görs genom användning av produkt- och eventdata från ett välkänt modebolag där synonymer genereras utifrån söksträngar som loggats i eventdata genom olika metoder som i sin tur bildar synonymböcker som används i framtida sökningar med hjälp av query expansion. För att kunna besvara studiens forskningsfrågor utförs först en kvantitativ analys. Denna analys utförs på data som matchade köp, produktträffar, no-hits och söktid. Information om denna data genereras utifrån en söksimulator som simulerar loggade händelser från användarsessioner i ett ehandelssystem. Därefter filtreras de genererade synonymböckerna genom att ta bort synonymer som är kopplade till de söksträngar som producerat ett sämre resultat i simuleringen med synonymer, än utan. För att validera vårt resultat från den kvantitativa analysen utförs även en kvalitativ analys på skillnaden i sökresultatet som de olika metoderna tar fram, där vi undersöker vad det är för produkter som tas fram med hjälp av synonymerna, för att undersöka dess relevans. Våra tester uppvisar att ett lägre tröskelvärde leder till fler produkträffar och minskar antalet no-hits. Antalet produktträffar ökades med mellan 4%-10%, no-hits reducerades med mellan 11%-22%. I de fall där söksträngen har tilldelats bra synonymer påverkas relevansen av produkterna positivt då fler relevanta produkter dyker upp i sökresultatet. I de fall där söksträngen har tilldelats mindre bra synonymer påverkas relevansen av produkterna negativt då vissa irrelevanta produkter dyker upp i sökresultatet som användaren antagligen inte vill se i sitt sökresultat.
I alla fall där de automatiskt genererade synonymerna används så befinner sig majoriteten av alla köpta produkter i den första halvan av sökresultatet, däremot minskar antalet köpta produkter på den första platsen i sökresultatet i alla fallen. / In this thesis, we examine whether automatic synonym generation using the machine learning algorithm Word2Vec, trained on a Google News data set containing a hundred billion words, is suitable for query expansion in e-commerce. This is examined using product and event data from a well-known fashion company: synonyms are generated by different methods from search queries logged in the event data, producing thesauri that are used in future searches through query expansion. In order to answer the thesis' research questions, a quantitative analysis is first performed. This analysis is performed on data such as matched purchases, product matches, no-hits and search time. Information about this data is generated by a search simulator that simulates logged events from user sessions in an e-commerce system. The generated thesauri are then filtered by removing synonyms connected to search queries that produced worse results in the simulation with synonyms than without. In order to validate the results of the quantitative analysis, a qualitative analysis is also performed on the differences between the search results that the different methods produce. In this qualitative analysis we examine which products the added synonyms retrieve, in order to assess their relevance. Our tests show that a lower threshold leads to more product hits and fewer no-hits: the number of product hits increased by 4%-10%, and the number of no-hits was reduced by 11%-22%.
In all of the tests using automatically generated synonyms, the majority of the purchased products appear in the first half of the search result; however, the number of purchases in the first position of the search result was reduced in every test. synonymgenerering query expansion e-handel word2vec text mining Engineering and Technology Teknik och teknologier
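The threshold-based synonym selection for query expansion described in this abstract can be illustrated as below. This is a sketch under stated assumptions, not the thesis code: the three-dimensional vectors in `EMBEDDINGS` are hypothetical stand-ins for trained Word2Vec vectors, and cosine similarity is assumed as the similarity measure.

```python
import math

# Hypothetical 3-dimensional vectors standing in for Word2Vec embeddings,
# which in practice have hundreds of dimensions.
EMBEDDINGS = {
    "sneakers": (0.9, 0.1, 0.0),
    "trainers": (0.8, 0.2, 0.1),
    "boots": (0.5, 0.5, 0.1),
    "scarf": (0.0, 0.2, 0.9),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def expand_query(term, threshold):
    """Return the term plus every other vocabulary word whose cosine
    similarity to it is at least the threshold, most similar first."""
    if term not in EMBEDDINGS:
        return [term]  # unknown terms are searched as typed
    scored = [
        (cosine_similarity(EMBEDDINGS[term], vector), word)
        for word, vector in EMBEDDINGS.items()
        if word != term
    ]
    ranked = sorted(scored, reverse=True)
    return [term] + [word for score, word in ranked if score >= threshold]
```

With these toy vectors, a threshold of 0.9 expands "sneakers" only to "trainers", while lowering it to 0.7 also admits "boots". That mirrors the trade-off the abstract reports: a lower threshold yields more hits but risks less relevant products.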
299	Semantik und Sentiment: Konzepte, Verfahren und Anwendungen von Text-Mining [Semantics and Sentiment: Concepts, Methods and Applications of Text Mining] Neubauer, Nicolas 06 June 2014 This thesis addresses two areas of data mining, or more precisely text mining, together with the associated algorithmic methods and concepts, and examines possible application scenarios. On the one hand, it discusses the field of semantic similarity: in short, the question of how to determine algorithmically how closely two terms or concepts are related. Technology built on the knowledge that, for example, "rain" can be a component of "weather" enables a wide range of applications. The thesis gives an overview of the established literature, divides the research field roughly into the two schools of knowledge-based and statistical methods, and contributes to each by examining existing similarity measures and presenting new ones. A study with human participants, and the data set derived from it, finally provides insight into people's preferences regarding their perception of similarity. On the other hand, there is the field of sentiment mining, which attempts to identify and classify moods and opinions algorithmically from large collections of unstructured text, such as messages from Twitter or other social networks. After a review of the related literature, the construction of a new test data set is motivated and the results of creating it are described. On this new basis, a large number of approaches and classification methods are evaluated in detail. Finally, the practical usability of the results is demonstrated in various application scenarios, including product presentations and media or public events such as the Bundestag election. text mining data mining natural language processing sentiment semantics semantik 54.82 - Textverarbeitung E.0 - GENERAL ddc:000
300	Contributions to Computational Methods for Association Extraction from Biomedical Data: Applications to Text Mining and In Silico Toxicology Raies, Arwa B. 29 November 2018 The task of association extraction involves identifying links between different entities. Here, we make contributions to two applications related to the biomedical field. The first application is in the domain of text mining and aims at extracting associations between methylated genes and diseases from the biomedical literature. Gathering such associations can benefit disease diagnosis and treatment decisions. We developed the DDMGD database to provide a comprehensive repository of information related to genes methylated in diseases, gene expression, and disease progression. Using DEMGD, a text mining system that we developed, and with additional post-processing, we extracted ~100,000 such associations from free text. The accuracy of the extracted associations is 82%, as estimated on 2,500 hand-curated entries. The second application is in the domain of computational toxicology, which aims at identifying relationships between chemical compounds and toxicity effects. Identifying toxicity effects of chemicals is a necessary step in many processes, including drug design. To extract these associations, we propose using multi-label classification (MLC) methods. These methods have not undergone comprehensive benchmarking in the domain of predictive toxicology that could help in identifying guidelines for overcoming their existing deficiencies. Therefore, we performed extensive benchmarking and analysis of ~19,000 MLC models. We demonstrated variability in the performance of these models under several conditions and determined the best performing model, which achieves an accuracy of 91% on an independent testing set. Finally, we propose a novel framework, LDR (learning from dense regions), for developing MLC and multi-target regression (MTR) models from datasets with missing labels.
The framework is generic, so it can be applied to predict associations between samples and discrete or continuous labels. Our assessment shows that LDR performed better than the baseline approach (i.e., the binary relevance algorithm) when evaluated using four MLC and five MTR datasets. LDR achieved accuracy scores of up to 97% on testing MLC datasets, and R2 scores of up to 88% on testing MTR datasets. Additionally, we developed a novel method for minority oversampling to tackle the problem of imbalanced MLC datasets. Our method improved the precision score of LDR by 10%. machine learning text mining computational toxicology DNA methylation multi-label classification multi-target regression
