  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
291

Získavanie a analýza dát pre oblasť crowdfundingu / Data Acquisition and Analysis for the Crowdfunding Domain

Koštial, Martin January 2019
The thesis deals with the acquisition of crowdfunding data and its analysis. The theoretical part describes the available technologies and algorithms for data analysis. In the practical part, data collection is implemented, and data mining and text mining algorithms are applied to the collected data.
292

Analýza textových používateľských hodnotení vybranej skupiny produktov / Analysis of Textual User Reviews for a Selected Group of Products

Valovič, Roman January 2019
This work focuses on the design of a system that identifies frequently discussed product features in product reviews, summarizes them, and displays them to the user together with their sentiment. The work deals with natural language processing, with a specific focus on the Czech language. The reader is introduced to text-preprocessing methods and their impact on the quality of the analysis results. The most frequently discussed product features are identified by cluster analysis using the K-Means algorithm, under the assumption that sufficiently internally homogeneous clusters will represent the individual features of the products. A further area explored in this work is the representation of documents using word embeddings and their potential as vector-space input for machine learning algorithms.
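To make the clustering step concrete, here is a minimal Python sketch of the technique the abstract names: K-Means applied to word-embedding vectors, so that internally homogeneous clusters approximate frequently discussed product features. The toy corpus, parameters, and cluster count are illustrative assumptions, not the author's setup.

```python
# Minimal sketch: cluster review terms by their word embeddings with K-Means,
# so that internally homogeneous clusters approximate product features.
# Corpus and parameters are illustrative only.
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# Toy tokenized reviews (in practice: preprocessed Czech product reviews).
reviews = [
    ["battery", "lasts", "long", "charging", "slow"],
    ["display", "bright", "colors", "sharp"],
    ["battery", "drains", "fast", "charging", "quick"],
    ["screen", "display", "resolution", "sharp"],
]

# Train small embeddings on the review corpus (gensim 4.x API assumed).
model = Word2Vec(sentences=reviews, vector_size=50, window=3, min_count=1, seed=1)

terms = list(model.wv.key_to_index)   # vocabulary of review terms
vectors = model.wv[terms]             # their embedding vectors

# Cluster the term vectors; each cluster is a candidate product feature.
km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(vectors)

for cluster_id in range(km.n_clusters):
    members = [t for t, label in zip(terms, km.labels_) if label == cluster_id]
    print(f"feature cluster {cluster_id}: {members}")
```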
293

Ursäkta, vi har lite bråttom : Om automatisering för att effektivisera tillgängliggörandet av affärstryck / Would You Mind Hurrying Up Please : On Automatization as a Means of Improved Efficiency When Cataloging Commercial Ephemera

Hellgren, Andreas January 2019
The demand on research libraries to digitize their collections as a means of increasing their availability is increasing. A prerequisite for this, however, is cataloging the collections – a task commonly associated with large demands on time and other resources. One way of handling this might be to apply automatization as part of the cataloging process. This thesis examines the possibilities of using automatization when cataloging commercial ephemera. For this, focus is directed towards the features of the material, the process of cataloging, and the demands placed on the catalogued material by its various users, using a theory based on Monica J. Bates's (2002) Cascade model. Through a case study, consisting of observations based on contextual inquiry and interviews partly using photo elicitation, automatization of cataloging is found to be a possible way to improve availability, but not without its own complications and demands on resources. In conclusion, suggestions are made for considerations libraries should be aware of before automatization is implemented at research libraries.
294

Using a Text Mining Approach to Examine Online Learning Research Trends of the Past 20 Years (1997-2016)

Keahey, Heather Lynn 12 1900
The purpose of this research is to identify longitudinal trends relevant to online learning research within 15 highly regarded, peer-reviewed publications in educational technology and online education. Online instruction has become a popular form of education delivery across academic institutions. A review of the literature on the topic shows that missing from the corpus is a trend analysis focused on online learning research across multiple journals. Previous efforts to establish trends in online learning are narrow in focus, using only one journal or a shortened time frame. This metatrend analysis employed text mining techniques to examine twenty years (1997-2016) of published research in an effort to establish past, present and emerging trends within the published literature. A general bibliometric analysis is offered, highlighting prolific and yearly journal publications. Meaningful trending terms used during the twenty-year time period were identified and analyzed. A cluster analysis performed on the extracted data provides a single-layer taxonomy of online learning research. Time trends within the clusters were identified to offer a more in-depth analysis. Trends revealed during the research indicate a changing relationship between online learning and distance education. A strong emphasis on students and learning was noted as a consistent trend throughout the literature. Emerging categories recognized include openness and mobility, game-based learning, and MOOCs. The intention of the research is to offer an overview of trends in online learning research in order to contribute to the ongoing dialogue concerning the development and delivery of online education.
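As a rough illustration of the kind of term-trend analysis described above, the sketch below counts how often selected terms occur in abstracts grouped by publication year. The corpus, terms, and years are invented; the actual study works on twenty years of journal articles.

```python
# Minimal sketch of a term-trend analysis over year-tagged texts.
# Texts, terms, and years below are made up for illustration.
from collections import Counter, defaultdict

corpus = [
    (1998, "distance education via television and correspondence study"),
    (2008, "online learning communities and student interaction"),
    (2015, "mooc participation and learner engagement at scale"),
]

terms_of_interest = {"distance", "online", "mooc"}
yearly_counts = defaultdict(Counter)

for year, text in corpus:
    for token in text.lower().split():
        if token in terms_of_interest:
            yearly_counts[year][token] += 1

# Print the frequency of each tracked term per year to reveal trends.
for year in sorted(yearly_counts):
    print(year, dict(yearly_counts[year]))
```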
295

Automatisk synonymgenerering med Word2Vec for query expansion inom e-handel / Automatic Synonym Generation with Word2Vec for Query Expansion in E-commerce

Kojic, Kemal, Petersson, Emil January 2018
This thesis examines how well automatic synonym generation with the machine learning method Word2Vec, trained on a Google News dataset of about one hundred billion words, is suited to query expansion in e-commerce. This is investigated using product and event data from a well-known fashion company: synonyms are generated, by several methods, from search queries logged in the event data, yielding synonym books (thesauri) that are then used in subsequent searches through query expansion. To answer the research questions, a quantitative analysis is first performed on measures such as matched purchases, product hits, no-hits and search time, produced by a search simulator that replays logged events from user sessions in an e-commerce system. The generated synonym books are then filtered by removing synonyms tied to search queries that produced worse results in the simulation with synonyms than without them. To validate the quantitative results, a qualitative analysis is also performed on the differences between the search results produced by the various methods, examining which products the synonyms retrieve and how relevant they are.
Our tests show that a lower similarity threshold leads to more product hits and fewer no-hits: product hits increased by 4%-10% and no-hits were reduced by 11%-22%. Where a search query is assigned good synonyms, the relevance of the results improves, as more relevant products appear; where the synonyms are poor, relevance suffers, as some irrelevant products that the user presumably does not want to see show up in the results. In all cases where the automatically generated synonyms are used, the majority of purchased products appear in the first half of the search results, although the number of purchased products in the first position of the results decreases in all cases.
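A minimal sketch of the core technique discussed in this abstract, generating a synonym book from a pretrained Word2Vec model and using it for query expansion, might look as follows in Python. The model file, similarity threshold, and example queries are assumptions for illustration, not the authors' configuration.

```python
# Minimal sketch: build a synonym book from pretrained Word2Vec vectors and
# use it for query expansion. Path, threshold, and queries are illustrative.
from gensim.models import KeyedVectors

# e.g. the publicly available GoogleNews vectors; the file path is assumed.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

def build_synonym_book(queries, threshold=0.6, topn=5):
    """Map each logged search term to Word2Vec neighbours above a similarity threshold."""
    book = {}
    for query in queries:
        if query in vectors.key_to_index:
            neighbours = vectors.most_similar(query, topn=topn)
            book[query] = [word for word, score in neighbours if score >= threshold]
    return book

def expand_query(query, synonym_book):
    """Return the original term plus any recorded synonyms for it."""
    return [query] + synonym_book.get(query, [])

logged_queries = ["sneakers", "jacket", "dress"]   # assumed example queries
synonyms = build_synonym_book(logged_queries, threshold=0.6)
print(expand_query("sneakers", synonyms))
```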
296

Semantik und Sentiment: Konzepte, Verfahren und Anwendungen von Text-Mining / Semantics and Sentiment: Concepts, Methods and Applications of Text Mining

Neubauer, Nicolas 06 June 2014
This thesis addresses two areas of data mining and text mining, their associated algorithms and concepts, and examines possible application scenarios. On the one hand, the field of semantic similarity is discussed: in short, the question of how to determine algorithmically how much two terms or concepts have to do with each other. Technology built around the knowledge that, for example, "rain" can be a part of "weather" enables a wide range of applications. The thesis surveys the relevant literature, divides the field roughly into the two schools of knowledge-based and statistical methods, and contributes to each by examining existing similarity measures and introducing new ones. A study with human participants, and the dataset derived from it, finally provides insight into people's preferences regarding their perception of similarity. On the other hand stands the field of sentiment mining, which attempts to identify and classify moods and opinions algorithmically in large collections of unstructured text, such as messages from Twitter or other social networks. After a review of the related literature, the construction of a new test dataset is motivated and the results of collecting it are described. On this new basis, a detailed evaluation of a variety of approaches and classification methods is carried out. Finally, the practical usefulness of the results is demonstrated in several application scenarios involving product presentations as well as media and public events such as the German federal election.
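To illustrate the two schools of semantic similarity mentioned in the abstract, the sketch below contrasts a knowledge-based measure (WordNet path similarity) with a statistical one (cosine similarity of word vectors). It uses English resources purely for illustration and is not the thesis' own measure.

```python
# Minimal sketch: a knowledge-based similarity (WordNet path similarity) versus
# a statistical one (cosine similarity of word vectors).
import numpy as np
from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet")

def knowledge_based_similarity(word_a, word_b):
    """Best path similarity between any pair of synsets of the two words."""
    scores = [
        s1.path_similarity(s2) or 0.0
        for s1 in wn.synsets(word_a)
        for s2 in wn.synsets(word_b)
    ]
    return max(scores, default=0.0)

def statistical_similarity(vec_a, vec_b):
    """Cosine similarity between two word vectors (e.g. from Word2Vec)."""
    return float(np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))

print(knowledge_based_similarity("rain", "weather"))
```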
297

Contributions to Computational Methods for Association Extraction from Biomedical Data: Applications to Text Mining and In Silico Toxicology

Raies, Arwa B. 29 November 2018
The task of association extraction involves identifying links between different entities. Here, we make contributions to two applications related to the biomedical field. The first application is in the domain of text mining, aiming at extracting associations between methylated genes and diseases from the biomedical literature. Gathering such associations can benefit disease diagnosis and treatment decisions. We developed the DDMGD database to provide a comprehensive repository of information related to genes methylated in diseases, gene expression, and disease progression. Using DEMGD, a text mining system that we developed, and additional post-processing, we extracted ~100,000 such associations from free text. The accuracy of the extracted associations is 82%, as estimated on 2,500 hand-curated entries. The second application is in the domain of computational toxicology, which aims at identifying relationships between chemical compounds and toxicity effects. Identifying the toxicity effects of chemicals is a necessary step in many processes, including drug design. To extract these associations, we propose using multi-label classification (MLC) methods. These methods have not undergone comprehensive benchmarking in the domain of predictive toxicology, benchmarking that could help identify guidelines for overcoming their existing deficiencies. Therefore, we performed extensive benchmarking and analysis of ~19,000 MLC models. We demonstrated variability in the performance of these models under several conditions and determined the best performing model, which achieves an accuracy of 91% on an independent testing set. Finally, we propose a novel framework, LDR (learning from dense regions), for developing MLC and multi-target regression (MTR) models from datasets with missing labels. The framework is generic, so it can be applied to predict associations between samples and discrete or continuous labels. Our assessment shows that LDR performed better than the baseline approach (i.e., the binary relevance algorithm) when evaluated using four MLC and five MTR datasets. LDR achieved accuracy scores of up to 97% on the testing MLC datasets, and R² scores of up to 88% on the testing MTR datasets. Additionally, we developed a novel method for minority oversampling to tackle the problem of imbalanced MLC datasets. Our method improved the precision score of LDR by 10%.
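For readers unfamiliar with the binary relevance baseline mentioned above, here is a minimal sketch: one independent classifier is fitted per label, and subset accuracy is reported. The random data stands in for chemical descriptors and toxicity labels and is not the thesis' benchmark setup.

```python
# Minimal sketch of binary relevance multi-label classification:
# one independent classifier per label. Data is random and purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                      # chemical descriptors (assumed)
Y = (rng.random(size=(200, 3)) < 0.3).astype(int)   # 3 binary toxicity labels (assumed)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

# Binary relevance: fit one LogisticRegression per label, independently.
model = MultiOutputClassifier(LogisticRegression(max_iter=1000))
model.fit(X_train, Y_train)

Y_pred = model.predict(X_test)
# Subset accuracy: fraction of samples with every label predicted correctly.
print("subset accuracy:", accuracy_score(Y_test, Y_pred))
```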
298

Complexity penalized methods for structured and unstructured data

Goeva, Aleksandrina 08 November 2017
A fundamental goal of statisticians is to make inferences from the sample about characteristics of the underlying population. This is an inverse problem, since we are trying to recover a feature of the input from observations on an output. Towards this end, we consider complexity penalized methods, because they balance goodness of fit and generalizability of the solution. The data from the underlying population may come in diverse formats - structured or unstructured - such as probability distributions, text tokens, or graph characteristics. Depending on the defining features of the problem, we can choose the appropriate complexity-penalized approach and assess the quality of the estimate produced by it. Favorable characteristics are strong theoretical guarantees of closeness to the true value, and interpretability. Our work fits within this framework and spans the areas of simulation optimization, text mining and network inference. The first problem we consider is model calibration under the assumption that, given a hypothesized input model, we can use stochastic simulation to obtain its corresponding output observations. We formulate it as a stochastic program by maximizing the entropy of the input distribution subject to moment matching. We then propose an iterative scheme via simulation to approximately solve it. We prove convergence of the proposed algorithm under appropriate conditions and demonstrate the performance via numerical studies. The second problem we consider is summarizing text documents through an inferred set of topics. We propose a frequentist reformulation of a Bayesian regularization scheme. Through our complexity-penalized perspective we lend further insight into the nature of the loss function and the regularization achieved through the priors in the Bayesian formulation. The third problem is concerned with the impact of sampling on the degree distribution of a network. Under many sampling designs, we have a linear inverse problem characterized by an ill-conditioned matrix. We investigate the theoretical properties of an approximate solution for the degree distribution found by regularizing the solution of the ill-conditioned least squares objective. In particular, we study the rate at which the penalized solution tends to the true value as a function of network size and sampling rate.
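The flavour of the third problem, stabilizing an ill-conditioned linear inverse problem with a complexity penalty, can be sketched as a ridge-regularized least-squares solve. The matrix, noise level, and penalty weight below are synthetic; the thesis' actual estimator for the degree distribution is more involved.

```python
# Minimal sketch: penalized (ridge) solution of an ill-conditioned linear
# inverse problem A x = b, compared with plain least squares.
import numpy as np

rng = np.random.default_rng(1)
n = 20
# Nearly collinear columns make A ill-conditioned.
base = rng.normal(size=(n, 1))
A = np.hstack([base + 1e-4 * rng.normal(size=(n, 1)) for _ in range(5)])
x_true = np.array([1.0, 0.5, 0.0, -0.5, 2.0])
b = A @ x_true + 0.01 * rng.normal(size=n)

lam = 1e-2  # complexity penalty weight (illustrative)
# Ridge-regularized solution: argmin ||A x - b||^2 + lam * ||x||^2
x_pen = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)

# Unpenalized least squares for comparison (can be wildly unstable here).
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

print("penalized:", np.round(x_pen, 3))
print("plain LS :", np.round(x_ls, 3))
```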
299

Standardization of textual data for comprehensive job market analysis / Normalisation textuelle pour une analyse exhaustive du marché de l'emploi

Malherbe, Emmanuel 18 November 2016
With a large share of job adverts and candidate profiles available online, e-recruitment constitutes a rich object of study. These documents are unstructured texts, and the large number and heterogeneity of recruitment websites imply a profusion of vocabularies and nomenclatures. To make these data easier to work with, Multiposting, a French company specializing in e-recruitment tools, supported this thesis, in particular by providing millions of digital résumés and job adverts aggregated from public sources. One of the difficulties when dealing with this type of raw textual data is being able to grasp the concepts contained in it, since the concepts behind the words are intelligible only to humans; inferring such structured attributes from raw text is the problem of standardization tackled in this thesis. The aim of standardization is a unified process providing values in a nomenclature (by definition, a finite set of meaningful concepts), so that the resulting attributes form a structured representation of the information. This process translates each document into a common language, allowing all of the data to be aggregated into an exploitable and understandable format.
Several questions are however raised: are the websites' structured data usable for a unified standardization? What structure of nomenclature is best suited to standardization, and how can it be leveraged? Is it possible to automatically build such a nomenclature from scratch, or to manage the standardization process without one? To illustrate the various obstacles of standardization, the examples we study include inferring the skills or the professional category of a job advert, or the level of training of a candidate profile. One of the challenges of e-recruitment is that the concepts are continuously evolving, which means that standardization must keep up with job-market trends. In light of this, we propose a set of machine learning models that require minimal supervision and can easily adapt to the evolution of the nomenclatures. The questions raised found partial answers in Case Based Reasoning, semi-supervised Learning-to-Rank, latent variable models, and in leveraging the evolving sources of the semantic web and social media. The different models proposed were tested on real-world data before being implemented in an industrial environment. The resulting standardization is at the core of SmartSearch, a project which provides a comprehensive analysis of the job market.
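As a toy illustration of standardization, mapping raw text onto values of a fixed nomenclature, the sketch below matches a noisy job title to its closest nomenclature entry by character n-gram TF-IDF similarity. It is only a simplistic stand-in for the thesis' models (case-based reasoning, learning-to-rank, latent variable models), and all names are invented.

```python
# Minimal sketch of standardization: map a raw job title onto the closest
# entry of a fixed nomenclature using TF-IDF cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nomenclature = [
    "software engineer",
    "data scientist",
    "sales representative",
    "human resources manager",
]

# Character n-grams are robust to typos and word-form variation.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
nomenclature_vectors = vectorizer.fit_transform(nomenclature)

def standardize(raw_title):
    """Return the nomenclature value closest to the raw job title."""
    sims = cosine_similarity(vectorizer.transform([raw_title]), nomenclature_vectors)
    return nomenclature[sims.argmax()]

print(standardize("senior sofware engeneer (python)"))   # typos on purpose
```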
300

Benchmarking authorship attribution techniques using over a thousand books by fifty Victorian era novelists

Gungor, Abdulmecit 03 April 2018
Indiana University-Purdue University Indianapolis (IUPUI) / Authorship attribution (AA) is the process of identifying the author of a given text; from a machine learning perspective, it can be seen as a classification problem. The literature offers many classification methods and associated feature extraction techniques. In this thesis, we explore information retrieval techniques such as Word2Vec, paragraph2vec, and other useful feature selection and extraction techniques for a given text, combined with different classifiers. We performed experiments on novels extracted from the GDELT database using features such as bag of words, n-grams, and newly developed techniques like Word2Vec. To improve the success rate, we combined several useful features, including a diversity measure of the text, bag of words, bigrams, and specific words that are written differently by English and American authors. Support vector machine classifiers of the nu-SVC type are observed to give the best success rates on the stacked feature set. The main purpose of this work is to lay out the foundations of feature extraction techniques in AA: lexical, character-level, syntactic, semantic, and application-specific features. We also aim to offer a new data resource for the authorship attribution research community and to demonstrate how it can be used to extract features for any kind of AA problem. The dataset we introduce consists of works by Victorian-era authors, and the main feature extraction techniques are shown with exemplary code snippets for audiences in different knowledge domains. Feature extraction approaches and their implementation with different classifiers are presented simply enough that the work can also serve as a beginner's introduction to AA. Some feature extraction techniques introduced in this work are also meant to be employed in other NLP tasks, such as sentiment analysis with Word2Vec or text summarization, and can be applied directly to our dataset. We also introduce several methods for using the extracted features in different methodologies, such as feature-stack engineering with different classifiers, or using Word2Vec to create sentence-level vectors.
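A minimal sketch of authorship attribution framed as classification, in the spirit of this abstract: TF-IDF n-gram features fed to a nu-SVC classifier. The four toy passages below stand in for the thesis' corpus of Victorian-era novels, and the parameters are illustrative.

```python
# Minimal sketch: authorship attribution as text classification with
# TF-IDF word n-gram features and a nu-SVC classifier. Toy corpus only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import NuSVC

texts = [
    "It was the best of times, it was the worst of times",
    "It is a truth universally acknowledged that a single man",
    "Whether I shall turn out to be the hero of my own life",
    "My dear Mr Bennet, have you heard that Netherfield Park is let",
]
authors = ["Dickens", "Austen", "Dickens", "Austen"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), lowercase=True),  # word uni/bigram features
    NuSVC(nu=0.5, kernel="linear"),
)
model.fit(texts, authors)

print(model.predict(["Please sir, I want some more"]))  # toy query passage
```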
