Global ETD Search

511	Shlukování textových dat / Text Data Clustering Leixner, Petr January 2010 (has links) Process of text data clustering can be used to analysis, navigation and structure large sets of texts or hypertext documents. The basic idea is to group the documents into a set of clusters on the basis of their similarity. The well-known methods of text clustering, however, do not really solve the specific problems of text clustering like high dimensionality of the input data, very large size of the databases and understandability of the cluster description. This work deals with mentioned problems and describes the modern method of text data clustering based on the use of frequent term sets, which tries to solve deficiencies of other clustering methods.
512	Analýza sentimentu s využitím dolování dat / Sentiment Analysis with Use of Data Mining Sychra, Martin January 2016 (has links) The theme of the work is sentiment analysis, especially in terms of informatics (marginally from a linguistic point of view). The linguistic part discusses the term sentiment and language methods for its analysis, e.g. lemmatization, POS tagging, using the list of stopwords etc. More attention is paid to the structure of the sentiment analyzer which is based on some of the machine learning methods (support vector machines, Naive Bayes and maximum entropy classification). On the basis of the theoretical background, a functional analyzer is projected and implemented. The experiments are focused mainly on comparing the classification methods and on the benefits of using the individual preprocessing methods. The success rate of the constructed classifier reaches up to 84 % in the cross-validation.
513	Extraction d’Information pour les réseaux de régulation de la graine chez Arabidopsis Thaliana. / Information Extraction for the Seed Development Regulatory Networks of Arabidopsis Thaliana. Valsamou, Dialekti 17 January 2017 (has links) Même si l’information est abondante dans le monde, l’information structurée, prête à être utilisée est rare. Ce travail propose l’Extraction d’Information (EI) comme une approche efficace pour la production de l’information structurée, utilisable sur la biologie, en présentant une tâche complète d’EI sur un organisme modèle, Arabidopsis thaliana. Un système d’EI se charge d’extraire les parties de texte les plus significatives et d’identifier leurs relations sémantiques. En collaboration avec des experts biologistes sur la plante A. Thaliana un modèle de connaissance a été conçu. Son objectif est de formaliser la connaissance nécessaire pour bien décrire le domaine du développement de la graine. Ce modèle contient toutes les entités et relations les connectant qui sont essentielles et peut être directement utilisé par des algorithmes. En parallèle ce modèle a été testé et appliqué sur un ensemble d’articles scientifiques du domaine, le corpus nécessaire pour l’entraînement de l’apprentissage automatique. Les experts ont annoté le texte en utilisant les entités et relations du modèle. Le modèle et le corpus annoté sont les premiers proposés pour le développement de la graine, et parmi les rares pour A. Thaliana, malgré son importance biologique. Ce modèle réconcilie les besoins d’avoir un modèle assez complexe pour bien décrirele domaine, et d’avoir assez de généralité pour pouvoir utiliser des méthodes d’apprentissage automatique. Une approche d’extraction de relations (AlvisRE) a également été élaborée et développée. Une fois les entités reconnues, l’extracteur de relations cherche à détecter les cas où le texte mentionne une relation entre elles, et identifier précisément de quel type de relation du modèle il s’agit. L’approche AlvisRE est basée sur la similarité textuelle et utilise à la fois des informations lexiques,syntactiques et sémantiques. Dans les expériences réalisées, AlvisRE donne des résultats qui sont équivalents et parfois supérieurs à l’état de l’art. En plus, AlvisRE a l’avantage de la modularité et adaptabilité en utilisant des informations sémantiques produites automatiquement. Ce dernier caractéristique permet d’attendre des performances équivalentes dans d’autres domaines. / While information is abundant in the world, structured, ready-to-use information is rare. Thiswork proposes Information Extraction (IE) as an efficient approach for producing structured,usable information on biology, by presenting a complete IE task on a model biological organism,Arabidopsis thaliana. Information Extraction is the process of extracting meaningful parts of text and identifying their semantic relations.In collaboration with experts on the plant A. Thaliana, a knowledge model was conceived. The goal of this model is providing a formal representation of the knowledge that is necessary to sufficiently describe the domain of grain development. This model contains all the entities and the relations between them which are essential and it can directly be used by algorithms. Inparallel, this model was tested and applied on a set of scientific articles of the domain. These documents constitute the corpus which is needed to train machine learning algorithms. Theexperts annotated the text using the entities and relations of the model. This corpus and this model are the first available for grain development and among very few on A. Thaliana, despite the latter’s importance in biology. This model manages to answer both needs of being complexenough to describe the domain well, and of having enough generalization for machine learning.A relation extraction approach (AlvisRE) was also elaborated and developed. After entityre cognition, the relation extractor tries to detect the cases where the text mentions that twoentities are in a relation, and identify precisely to which type of the model these relations belongto. AlvisRE’s approach is based on textual similarity and it uses all types of information available:lexical, syntactic and semantic. In the tests conducted, AlvisRE had results that are equivalentor sometimes better than the state of the art. Additionally, AlvisRE has the advantage of being modular and adaptive by using semantic information that was produced automatically. This last feature allows me to expect similar performance in other domains. Extraction d'information Fouille de données Traitement automatique de langues Bioinformatique Apprentissage automatique Fouille de texte Information Extraction Data Mining Natural Language Processing Bioinformatics Machine Learning Text Mining
514	Serviceorientiertes Text Mining am Beispiel von Entitätsextrahierenden Diensten Pfeifer, Katja 16 June 2014 (has links) Der Großteil des geschäftsrelevanten Wissens liegt heute als unstrukturierte Information in Form von Textdaten auf Internetseiten, in Office-Dokumenten oder Foreneinträgen vor. Zur Extraktion und Verwertung dieser unstrukturierten Informationen wurde eine Vielzahl von Text-Mining-Lösungen entwickelt. Viele dieser Systeme wurden in der jüngeren Vergangenheit als Webdienste zugänglich gemacht, um die Verwertung und Integration zu vereinfachen. Die Kombination verschiedener solcher Text-Mining-Dienste zur Lösung konkreter Extraktionsaufgaben erscheint vielversprechend, da so bestehende Stärken ausgenutzt, Schwächen der Systeme minimiert werden können und die Nutzung von Text-Mining-Lösungen vereinfacht werden kann. Die vorliegende Arbeit adressiert die flexible Kombination von Text-Mining-Diensten in einem serviceorientierten System und erweitert den Stand der Technik um gezielte Methoden zur Auswahl der Text-Mining-Dienste, zur Aggregation der Ergebnisse und zur Abbildung der eingesetzten Klassifikationsschemata. Zunächst wird die derzeit existierende Dienstlandschaft analysiert und aufbauend darauf eine Ontologie zur funktionalen Beschreibung der Dienste bereitgestellt, so dass die funktionsgesteuerte Auswahl und Kombination der Text-Mining-Dienste ermöglicht wird. Des Weiteren werden am Beispiel entitätsextrahierender Dienste Algorithmen zur qualitätssteigernden Kombination von Extraktionsergebnissen erarbeitet und umfangreich evaluiert. Die Arbeit wird durch zusätzliche Abbildungs- und Integrationsprozesse ergänzt, die eine Anwendbarkeit auch in heterogenen Dienstlandschaften, bei denen unterschiedliche Klassifikationsschemata zum Einsatz kommen, gewährleisten. Zudem werden Möglichkeiten der Übertragbarkeit auf andere Text-Mining-Methoden erörtert. info:eu-repo/classification/ddc/004 ddc:004
515	Sdílená ekonomika v kontextu postmateriálních hodnot: případ segmentu ubytování v Praze / Sharing Economy in the Context of Postmaterial Values: The Case of Accommodation Segment in Prague Svobodová, Tereza January 2020 (has links) This master's thesis is about the success of sharing economy in the accommodation segment in Prague. The thesis is based on theories conceptualizing sharing economy as a result of social and value change, not only as technological one. Using online review data, the user experience of shared accommodation via Airbnb and traditional via Booking are compared. Analysis is conducted with focus on users' satisfied needs and fulfilled values. For processing the data, text mining techniques (topic modelling and sentiment analysis) were employed. The major result is that in Prague the models of sharing economy accommodation meets the growing need in society to fulfil post-material values in the market much better than the models of traditional accommodation (hotels, hostels, boarding houses). In their experiences, Airbnb users reflect social and emotional values more often, even though most sharing economy accommodations in Prague do not involve any physical sharing with the host. The thesis thus brings a unique perspective on the Airbnb phenomenon in the Czech context and contributes to the discussion of why the market share of the sharing economy in the accommodation segment in Prague has been growing, while traditional models stagnated.
516	Alternative Media Online News during the Covid-19 Pandemic: within a Swedish Context : A comparative content analysis of Alternative Media and Mainstream media newspapers online in Sweden during the coverage of the coronavirus pandemic. Ekberg, Robin, Svensson, Gina Michelle January 2021 (has links) Technology has allowed for the ability to create online platforms for sources ofreliable and unreliable news media. It is therefore important to understand the roleand relevance of alternative news media today and how disinformation is spreadonline. In this paper, we will examine the role of alternative news media websites inSweden and how it compares to the mainstream media websites spread of informationduring the coverage of the Covid-19 pandemic. We will also explore what thedifference is in their portrayal of events during the coronavirus, and what makes thisdifference appealing to certain readers. Using Google, we searched the top ten articlesfrom four online media news sources. Two of which sources were mainstream mediasources and the other two sources as alternative media. These articles were searchedfor during a specific timeframe and with the keywords such as “Corona” and “Covid19”. The dates for the timeframe of this experiment come from when the coronavirusbroke out in Sweden to one year after the event occurred. The top ten search resultsfrom each news source, as provided to us by Google’s algorithm, were placedthrough a text analysis tool called Voyant. The data findings are presented in threeformats: a Word Cloud, TermsBerry and Distinctive Words Comparison. The resultsof the experiment show a stark contrast in the difference in reporting on thecoronavirus topic between different online newspaper media, specifically alternativemedia news sources. Further research is recommended on a larger scale within thetopic of online alternative media. Text Mining Online Media Bias Digital Humanities Alternative Media Mainstream Media Social Informatics Coronavirus Covid-19 Voyant Information Systems, Social aspects
517	Une approche générique pour l'analyse croisant contenu et usage des sites Web par des méthodes de bipartitionnement / A generic approach to combining web content and usage analysis using biclustering algorithms Charrad, Malika 22 March 2010 (has links) Dans cette thèse, nous proposons une nouvelle approche WCUM (Web Content and Usage Mining based approach) permettant de relier l'analyse du contenu à l'analyse de l'usage d'un site Web afin de mieux comprendre le comportement général des visiteurs du site. Ce travail repose sur l'utilisation de l'algorithme CROKI2 de classification croisée implémenté selon deux stratégies d'optimisation différentes que nous comparons à travers des expérimentations sur des données générées artificiellement. Afin de pallier le problème de détermination du nombre de classes sur les lignes et les colonnes, nous proposons de généraliser certains indices proposés initialement pour évaluer les partitions obtenues par des algorithmes de classification simple, aux algorithmes de classification simultanée. Pour évaluer la performance de ces indices nous proposons un algorithme de génération de biclasses artificielles pour effectuer des simulations et valider les résultats. Des expérimentations sur des données artificielles ainsi qu'une application sur des données réelles ont été réalisées pour évaluer l'efficacité de l'approche proposée. / In this thesis, we propose a new approach WCUM (Web Content and Usage Mining based approach) for linking content analysis to usage analysis of a website to better understand the general behavior of the web site visitors. This work is based on the use of the block clustering algorithm CROKI2 implemented by two different strategies of optimization that we compared through experiments on artificially generated data. To mitigate the problem of determination of the number of clusters on rows and columns, we suggest to generalize the use of some indices originally proposed to evaluate the partitions obtained by clustering algorithms to evaluate bipartitions obtained by simultaneous clustering algorithms. To evaluate the performance of these indices on data with biclusters structure, we proposed an algorithm for generating artificial data to perform simulations and validate the results. Experiments on artificial data as well as on real data were realized to estimate the efficiency of the proposed approach. Classification simultanée Algorithme Croki2 Biclustering Fouille du web Classification croisée Fouille de l'usage du web Block clustering Nombre de biclasses Fouille du contenu du web Web Usage Mining Web Content Mining Text mining
518	A visual analytics approach for multi-resolution and multi-model analysis of text corpora : application to investigative journalism / Une approche de visualisation analytique pour une analyse multi-résolution de corpus textuels : application au journalisme d’investigation Médoc, Nicolas 16 October 2017 (has links) À mesure que la production de textes numériques croît exponentiellement, un besoin grandissant d’analyser des corpus de textes se manifeste dans beaucoup de domaines d’application, tant ces corpus constituent des sources inépuisables d’information et de connaissance partagées. Ainsi proposons-nous dans cette thèse une nouvelle approche de visualisation analytique pour l’analyse de corpus textuels, mise en œuvre pour les besoins spécifiques du journalisme d’investigation. Motivées par les problèmes et les tâches identifiés avec une journaliste d’investigation professionnelle, les visualisations et les interactions ont été conçues suivant une méthodologie centrée utilisateur, impliquant l’utilisateur durant tout le processus de développement. En l’occurrence, les journalistes d’investigation formulent des hypothèses, explorent leur sujet d’investigation sous tous ses angles, à la recherche de sources multiples étayant leurs hypothèses de travail. La réalisation de ces tâches, très fastidieuse lorsque les corpus sont volumineux, requiert l’usage de logiciels de visualisation analytique se confrontant aux problématiques de recherche abordées dans cette thèse. D’abord, la difficulté de donner du sens à un corpus textuel vient de sa nature non structurée. Nous avons donc recours au modèle vectoriel et son lien étroit avec l’hypothèse distributionnelle, ainsi qu’aux algorithmes qui l’exploitent pour révéler la structure sémantique latente du corpus. Les modèles de sujets et les algorithmes de biclustering sont efficaces pour l’extraction de sujets de haut niveau. Ces derniers correspondent à des groupes de documents concernant des sujets similaires, chacun représenté par un ensemble de termes extraits des contenus textuels. Une telle structuration par sujet permet notamment de résumer un corpus et de faciliter son exploration. Nous proposons une nouvelle visualisation, une carte pondérée des sujets, qui dresse une vue d’ensemble des sujets de haut niveau. Elle permet d’une part d’interpréter rapidement les contenus grâce à de multiples nuages de mots, et d’autre part, d’apprécier les propriétés des sujets telles que leur taille relative et leur proximité sémantique. Bien que l’exploration des sujets de haut niveau aide à localiser des sujets d’intérêt ainsi que leur voisinage, l’identification de faits précis, de points de vue ou d’angles d’analyse, en lien avec un événement ou une histoire, nécessite un niveau de structuration plus fin pour représenter des variantes de sujet. Cette structure imbriquée révélée par Bimax, une méthode de biclustering basée sur des motifs avec chevauchement, capture au sein des biclusters les co-occurrences de termes partagés par des sous-ensembles de documents pouvant dévoiler des faits, des points de vue ou des angles associés à des événements ou des histoires communes. Cette thèse aborde les problèmes de visualisation de biclusters avec chevauchement en organisant les biclusters terme-document en une hiérarchie qui limite la redondance des termes et met en exergue les parties communes et distinctives des biclusters. Nous avons évalué l’utilité de notre logiciel d’abord par un scénario d’utilisation doublé d’une évaluation qualitative avec une journaliste d’investigation. En outre, les motifs de co-occurrence des variantes de sujet révélées par Bima. sont déterminés par la structure de sujet englobante fournie par une méthode d’extraction de sujet. Cependant, la communauté a peu de recul quant au choix de la méthode et son impact sur l’exploration et l’interprétation des sujets et de ses variantes. Ainsi nous avons conduit une expérience computationnelle et une expérience utilisateur contrôlée afin de comparer deux méthodes d’extraction de sujet. D’un côté Coclu. est une méthode de biclustering disjointe, et de l’autre, hirarchical Latent Dirichlet Allocation (hLDA) est un modèle de sujet probabiliste dont les distributions de probabilité forment une structure de bicluster avec chevauchement. (...) / As the production of digital texts grows exponentially, a greater need to analyze text corpora arises in various domains of application, insofar as they constitute inexhaustible sources of shared information and knowledge. We therefore propose in this thesis a novel visual analytics approach for the analysis of text corpora, implemented for the real and concrete needs of investigative journalism. Motivated by the problems and tasks identified with a professional investigative journalist, visualizations and interactions are designed through a user-centered methodology involving the user during the whole development process. Specifically, investigative journalists formulate hypotheses and explore exhaustively the field under investigation in order to multiply sources showing pieces of evidence related to their working hypothesis. Carrying out such tasks in a large corpus is however a daunting endeavor and requires visual analytics software addressing several challenging research issues covered in this thesis. First, the difficulty to make sense of a large text corpus lies in its unstructured nature. We resort to the Vector Space Model (VSM) and its strong relationship with the distributional hypothesis, leveraged by multiple text mining algorithms, to discover the latent semantic structure of the corpus. Topic models and biclustering methods are recognized to be well suited to the extraction of coarse-grained topics, i.e. groups of documents concerning similar topics, each one represented by a set of terms extracted from textual contents. We provide a new Weighted Topic Map visualization that conveys a broad overview of coarse-grained topics by allowing quick interpretation of contents through multiple tag clouds while depicting the topical structure such as the relative importance of topics and their semantic similarity. Although the exploration of the coarse-grained topics helps locate topic of interest and its neighborhood, the identification of specific facts, viewpoints or angles related to events or stories requires finer level of structuration to represent topic variants. This nested structure, revealed by Bimax, a pattern-based overlapping biclustering algorithm, captures in biclusters the co-occurrences of terms shared by multiple documents and can disclose facts, viewpoints or angles related to events or stories. This thesis tackles issues related to the visualization of a large amount of overlapping biclusters by organizing term-document biclusters in a hierarchy that limits term redundancy and conveys their commonality and specificities. We evaluated the utility of our software through a usage scenario and a qualitative evaluation with an investigative journalist. In addition, the co-occurrence patterns of topic variants revealed by Bima. are determined by the enclosing topical structure supplied by the coarse-grained topic extraction method which is run beforehand. Nonetheless, little guidance is found regarding the choice of the latter method and its impact on the exploration and comprehension of topics and topic variants. Therefore we conducted both a numerical experiment and a controlled user experiment to compare two topic extraction methods, namely Coclus, a disjoint biclustering method, and hierarchical Latent Dirichlet Allocation (hLDA), an overlapping probabilistic topic model. The theoretical foundation of both methods is systematically analyzed by relating them to the distributional hypothesis. The numerical experiment provides statistical evidence of the difference between the resulting topical structure of both methods. The controlled experiment shows their impact on the comprehension of topic and topic variants, from analyst perspective. (...) Visualisation analytique Fouille de texte Modèles de sujet Co-clustering Étude utilisateur Journalisme d'investigation Visual analytics Text mining Topic models Co-clustering User study Investigative journalism 005.74
519	Condition-specific differential subnetwork analysis for biological systems Jhamb, Deepali 04 1900 (has links) Indiana University-Purdue University Indianapolis (IUPUI) / Biological systems behave differently under different conditions. Advances in sequencing technology over the last decade have led to the generation of enormous amounts of condition-specific data. However, these measurements often fail to identify low abundance genes/proteins that can be biologically crucial. In this work, a novel text-mining system was first developed to extract condition-specific proteins from the biomedical literature. The literature-derived data was then combined with proteomics data to construct condition-specific protein interaction networks. Further, an innovative condition-specific differential analysis approach was designed to identify key differences, in the form of subnetworks, between any two given biological systems. The framework developed here was implemented to understand the differences between limb regeneration-competent Ambystoma mexicanum and –deficient Xenopus laevis. This study provides an exhaustive systems level analysis to compare regeneration competent and deficient subnetworks to show how different molecular entities inter-connect with each other and are rewired during the formation of an accumulation blastema in regenerating axolotl limbs. This study also demonstrates the importance of literature-derived knowledge, specific to limb regeneration, to augment the systems biology analysis. Our findings show that although the proteins might be common between the two given biological conditions, they can have a high dissimilarity based on their biological and topological properties in the subnetwork. The knowledge gained from the distinguishing features of limb regeneration in amphibians can be used in future to chemically induce regeneration in mammalian systems. The approach developed in this dissertation is scalable and adaptable to understand differential subnetworks between any two biological systems. This methodology will not only facilitate the understanding of biological processes and molecular functions which govern a given system but also provide novel intuitions about the pathophysiology of diseases/conditions. Limb regeneration Text mining Differential network analysis Subnetwork analysis Concept based mining Extremities (Anatomy) -- Regeneration Extremities (Anatomy) -- Physiology Text processing (Computer science) Data mining
520	Text Mining for Social Harm and Criminal Justice Applications Pandey, Ritika 08 1900 (has links) Indiana University-Purdue University Indianapolis (IUPUI) / Increasing rates of social harm events and plethora of text data demands the need of employing text mining techniques not only to better understand their causes but also to develop optimal prevention strategies. In this work, we study three social harm issues: crime topic models, transitions into drug addiction and homicide investigation chronologies. Topic modeling for the categorization and analysis of crime report text allows for more nuanced categories of crime compared to official UCR categorizations. This study has important implications in hotspot policing. We investigate the extent to which topic models that improve coherence lead to higher levels of crime concentration. We further explore the transitions into drug addiction using Reddit data. We proposed a prediction model to classify the users’ transition from casual drug discussion forum to recovery drug discussion forum and the likelihood of such transitions. Through this study we offer insights into modern drug culture and provide tools with potential applications in combating opioid crises. Lastly, we present a knowledge graph based framework for homicide investigation chronologies that may aid investigators in analyzing homicide case data and also allow for post hoc analysis of key features that determine whether a homicide is ultimately solved. For this purpose we perform named entity recognition to determine witnesses, detectives and suspects from chronology, use keyword expansion to identify various evidence types and finally link these entities and evidence to construct a homicide investigation knowledge graph. We compare the performance over several choice of methodologies for these sub-tasks and analyze the association between network statistics of knowledge graph and homicide solvability. Text mining Machine learning Data mining Computer Science Social Harm Criminal Justice Information retrieval Natural Language Processing Crime topic modeling addiction analysis homicide investigation analysis

Search results