  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
51

Pretopology and Topic Modeling for Complex Systems Analysis: Application on Document Classification and Complex Network Analysis

Bui, Quang Vu 27 September 2018 (has links)
The work of this thesis presents the development of algorithms for document classification on the one hand, and complex network analysis on the other, based on pretopology, a theory that models the concept of proximity. The first part develops a framework for document clustering by combining topic modeling and pretopology. Our contribution proposes using topic distributions extracted from a topic modeling treatment as input for classification methods. In this approach, we investigated two aspects: determining an appropriate distance between documents by studying the relevance of probabilistic and vector-based measures, and performing groupings according to several criteria using a pseudo-distance defined from pretopology. The second part introduces a general framework for modeling complex networks by developing a reformulation of stochastic pretopology, and proposes the Pretopology Cascade Model as a general model of information diffusion. In addition, we proposed an agent-based model, Textual-ABM, to analyze dynamic complex networks associated with textual information using the author-topic model, and introduced Textual-Homo-IC, an independent cascade model of resemblance, in which homophily is measured on textual content obtained by topic modeling.
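The multi-criteria grouping idea can be sketched with a pretopological pseudo-closure operator: a set of documents is expanded by any document that is close enough, under at least a threshold number of distance criteria, to the current set, and iterating this operator to a fixed point yields a group. The distance functions, `eps`, and `k` below are illustrative assumptions, not the thesis's actual parameters.

```python
# Sketch of a pretopological pseudo-closure over documents (assumed setup).
# A document joins the set A if it is within `eps` of A under at least
# `k` of the supplied distance criteria.

def pseudo_closure(A, universe, criteria, eps=0.5, k=2):
    """One application of the pseudo-closure operator a(A)."""
    expanded = set(A)
    for x in universe - set(A):
        # Count the criteria under which x is close to the set A.
        satisfied = sum(
            1 for dist in criteria
            if min(dist(x, a) for a in A) <= eps
        )
        if satisfied >= k:
            expanded.add(x)
    return expanded

def closed_set(A, universe, criteria, **kw):
    """Iterate the pseudo-closure until a fixed point (the closure)."""
    current = set(A)
    while True:
        nxt = pseudo_closure(current, universe, criteria, **kw)
        if nxt == current:
            return current
        current = nxt

# Toy example: documents as (topic-weight, length) points,
# with one distance criterion per coordinate.
docs = {(0.1, 0.2), (0.2, 0.3), (0.9, 0.9), (0.15, 0.25)}
criteria = [
    lambda x, y: abs(x[0] - y[0]),  # closeness in topic space
    lambda x, y: abs(x[1] - y[1]),  # closeness in length
]
print(closed_set({(0.1, 0.2)}, docs, criteria, eps=0.2, k=2))
```

Because the closure is a fixed point of set expansion rather than a partition induced by a single metric, groups built this way can follow several proximity criteria at once, which is the point of the pseudo-distance approach.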
52

A Document Similarity Measure and Its Applications

Gan, Zih-Dian 07 September 2011 (has links)
In this paper, we propose a novel similarity measure for document data processing and apply it to text classification and clustering. For two documents, the proposed measure distinguishes three cases: (a) the feature considered appears in both documents, (b) the feature appears in only one document, and (c) the feature appears in neither document. For the first case, we give a lower bound and decrease the similarity according to the difference between the feature values in the two documents. For the second case, we assign a fixed value regardless of the magnitude of the feature value. For the last case, the feature does not contribute to the similarity. We apply the measure to the similarity-based single-label classifier k-NN and the multi-label classifier ML-KNN, and adapt it to measure the similarity between a document and a set of documents for document clustering, i.e., in a k-means-like algorithm, to compare its effectiveness with other measures. Experimental results show that our proposed method works more effectively than the others.
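The three-case rule can be sketched as follows. The lower bound `lb`, the fixed value for the one-sided case, and the averaging aggregation are illustrative assumptions, since the abstract does not give the exact formula.

```python
# Sketch of a three-case per-feature similarity (assumed parameters).
def feature_similarity(a, b, lb=0.5, one_sided=0.2):
    """Similarity contribution of one feature with values a, b >= 0."""
    if a > 0 and b > 0:
        # Case (a): present in both -- start from a lower bound and
        # add more the closer the two feature values are.
        return lb + (1 - lb) * (1 - abs(a - b) / max(a, b))
    if a > 0 or b > 0:
        # Case (b): present in only one -- a fixed value,
        # regardless of the magnitude.
        return one_sided
    # Case (c): absent from both -- contributes nothing.
    return None

def doc_similarity(u, v):
    """Average the per-feature contributions, skipping case (c)."""
    scores = [s for s in (feature_similarity(a, b) for a, b in zip(u, v))
              if s is not None]
    return sum(scores) / len(scores) if scores else 0.0

print(doc_similarity([1.0, 2.0, 0.0], [1.0, 1.0, 0.0]))
```

Note that skipping case (c) is what distinguishes this family of measures from plain cosine or Euclidean comparisons, where shared zero features silently inflate or dilute the score.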
53

Fuzzy Cluster-Based Query Expansion

Tai, Chia-Hung 29 July 2004 (has links)
Advances in information and network technologies have fostered the creation and availability of a vast amount of online information, typically in the form of text documents. Information retrieval (IR) pertains to determining the relevance between a user query and documents in the target collection, then returning those documents that are likely to satisfy the user's information needs. One challenging issue in IR is word mismatch, which occurs when concepts can be described by different words in the user queries and/or documents. Query expansion is a promising approach for dealing with word mismatch in IR. In this thesis, we develop a fuzzy cluster-based query expansion technique to solve the word mismatch problem. Using existing expansion techniques (i.e., global analysis and non-fuzzy cluster-based query expansion) as performance benchmarks, our empirical results suggest that the fuzzy cluster-based query expansion technique can provide a more accurate query result than the benchmark techniques can.
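The core idea — expand a query with terms from fuzzy term clusters, weighted by degree of membership — can be sketched as below. The membership matrix, the product scoring, and the threshold are illustrative assumptions; the thesis's actual fuzzy clustering procedure may differ.

```python
# Sketch: expand a query using fuzzy term-cluster memberships.
# memberships[t][c] is the (assumed) degree to which term t belongs
# to cluster c, with degrees summing to 1 per term.
memberships = {
    "car":     {0: 0.9, 1: 0.1},
    "auto":    {0: 0.8, 1: 0.2},
    "vehicle": {0: 0.7, 1: 0.3},
    "bank":    {0: 0.1, 1: 0.9},
}

def expand_query(query_terms, memberships, threshold=0.3):
    """Add terms that share clusters with a query term, scored by the
    sum over clusters of the product of membership degrees."""
    expanded = {t: 1.0 for t in query_terms}
    for q in query_terms:
        for t, clusters in memberships.items():
            if t in expanded:
                continue
            score = sum(memberships[q][c] * w for c, w in clusters.items())
            if score >= threshold:
                expanded[t] = score
    return expanded

print(expand_query(["car"], memberships))
```

Because membership is graded rather than hard, a term like "vehicle" can contribute to the expansion with a smaller weight instead of being dropped entirely, which is the advantage fuzzy clustering offers over crisp cluster-based expansion.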
54

Incorporating semantic and syntactic information into document representation for document clustering

Wang, Yong, January 2005 (has links)
Thesis (Ph.D.) -- Mississippi State University, Department of Computer Science and Engineering. Title from title screen. Includes bibliographical references.
55

Selecting candidate labels for hierarchical document clusters using association rules

Fabiano Fernandes dos Santos 17 September 2010 (has links)
One way to extract and organize knowledge, which has received much attention in recent years, is through a structural representation divided into hierarchically related topics. Once this hierarchical structure is built, it is necessary to find labels for each of the obtained clusters, since most algorithms do not produce simple conceptual descriptions and the interpretation of these clusters is a difficult task for users. The related works consider each document as a bag-of-words and do not explicitly explore the relationships between the terms of the documents in a cluster. However, these relationships can provide important information for deciding which terms should be chosen as descriptors of the nodes, and they can be represented by association rules. This work therefore aims to evaluate the use of association rules to support the identification of labels for hierarchical document clusters. To this end, we propose the SeCLAR (Selecting Candidate Labels using Association Rules) method, which explores association rules to select good candidate labels for hierarchical clusters of documents. The method generates association rules from transactions built from each document in the collection, and uses the relationship information between the nodes of the hierarchical clustering to select candidate labels. The experimental results show that it is possible to obtain a significant improvement in precision and recall over traditional methods.
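Mining association rules from per-document transactions can be sketched with a minimal support/confidence pass. The toy transactions, the thresholds, and the restriction to single-term rules are simplifying assumptions for illustration.

```python
from itertools import combinations

# Each document becomes a transaction: the set of terms it contains.
transactions = [
    {"neural", "network", "learning"},
    {"neural", "network", "training"},
    {"association", "rules", "mining"},
    {"neural", "learning", "training"},
]

def pairwise_rules(transactions, min_support=0.5, min_confidence=0.7):
    """Return rules (lhs -> rhs) between single terms that meet the
    minimum support and confidence thresholds."""
    n = len(transactions)
    terms = set().union(*transactions)
    count = lambda items: sum(1 for t in transactions if items <= t)
    rules = []
    for a, b in combinations(sorted(terms), 2):
        for lhs, rhs in ((a, b), (b, a)):
            support = count({lhs, rhs}) / n
            if support < min_support or count({lhs}) == 0:
                continue
            confidence = count({lhs, rhs}) / count({lhs})
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules

for lhs, rhs, sup, conf in pairwise_rules(transactions):
    print(f"{lhs} -> {rhs}  support={sup:.2f} confidence={conf:.2f}")
```

A label-selection step along SeCLAR's lines would then rank a node's terms by the rules they participate in, rather than by raw frequency alone.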
56

Towards automated learning from software development issues : Analyzing open source project repositories using natural language processing and machine learning techniques

Salov, Aleksandar January 2017 (has links)
This thesis presents an in-depth investigation of how natural language processing and machine learning techniques can be used to perform a comprehensive analysis of programming issues found in open source project repositories hosted on GitHub. The research focuses on issues gathered from a number of JavaScript repositories, based on their user-generated textual descriptions. The primary goal of the study is to explore how natural language processing and machine learning methods can facilitate the identification and categorization of distinct issue types. The research then goes one step further and investigates how these same techniques can support users in searching for potential solutions to these issues. For this purpose, an initial proof-of-concept implementation is developed, which collects over 30 000 JavaScript issues from over 100 GitHub repositories. The system extracts the titles of the issues, cleans and processes the data, and supplies it to an unsupervised clustering model that tries to uncover discernible similarities and patterns within the examined dataset. The main system is supplemented by a dedicated web application prototype, which enables users to apply the underlying machine learning model to find solutions to their programming-related issues. The implementation is then evaluated through a number of measures. First, the trained clustering model is assessed by two independent groups of external reviewers - one group of fellow researchers and another of practitioners in the software industry - to determine whether the resulting categories contain distinct types of issues.
Then, to find out whether the system can facilitate the search for issue solutions, the web application prototype is tested in a series of user sessions with participants who are not only representative of the main target group that can benefit most from such a system, but who also have a mixture of practical and theoretical backgrounds. The results of this research demonstrate that the proposed solution can effectively categorize issues according to their type, based solely on the user-generated free-text title. This provides strong evidence that natural language processing and machine learning techniques can be used to analyze issues and automate the overall learning process. However, the study was unable to conclusively determine whether these same methods can aid the search for issue solutions. Nevertheless, the thesis provides a detailed account of how this problem was addressed and can therefore serve as the basis for future research.
57

Near-Duplicate Detection Using Instance Level Constraints

Patel, Vishal 08 1900 (has links) (PDF)
For the task of near-duplicate document detection, comparison approaches based on the bag-of-words representations used in the information retrieval community are not sufficiently accurate. This work presents a novel approach for the setting in which instance-level constraints are given for documents and near-duplicates must be retrieved for a new query document. The framework incorporates the instance-level constraints and clusters documents into groups using a novel clustering approach, Grouped Latent Dirichlet Allocation (gLDA). A distance metric is then learned for each cluster using the large margin nearest neighbor algorithm, and documents are finally ranked for a given new, unseen document using the learned distance metrics. Experimental results on various datasets demonstrate that our clustering method (gLDA with side constraints) performs better than other clustering methods, and that the overall approach outperforms other near-duplicate detection algorithms.
58

Classification of electronic documents using cluster analysis

Ševčík, Radim January 2009 (has links)
The current age is characterised by unprecedented information growth, in both volume and complexity. Most of this information is available in digital form, so it can be analyzed using cluster analysis. We attempted to classify the documents of the 20 Newsgroups collection on the basis of their content alone. The aim was to assess the available clustering methods across a variety of applications. After transforming the documents into a binary vector representation, we performed several experiments in the CLUTO toolkit and measured entropy, purity, and execution time. For a small number of clusters, the direct method (essentially hierarchical) gave the best results, but for larger numbers it was repeated bisection (a divisive method). The agglomerative method proved unsuitable. Using simulation, we estimated the optimal number of clusters to be 10. For this solution, we described in detail the features of each cluster obtained with the repeated bisection method and the i2 criterion function. Future work should focus on implementing binary clustering in a programming language such as Perl or C++. The results of this work may be of interest to web search engine developers and electronic catalogue administrators.
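The entropy and purity measures used to evaluate a clustering against reference classes can be sketched with their standard definitions; the class labels below are toy data, not the 20 Newsgroups results.

```python
from math import log2

def purity(clusters):
    """clusters: list of lists of true class labels, one list per cluster.
    Purity is the fraction of documents that belong to their cluster's
    majority class (higher is better)."""
    total = sum(len(c) for c in clusters)
    return sum(max(c.count(l) for l in set(c)) for c in clusters) / total

def entropy(clusters):
    """Weighted average of each cluster's class-label entropy
    (lower is better)."""
    total = sum(len(c) for c in clusters)
    h = 0.0
    for c in clusters:
        probs = [c.count(l) / len(c) for l in set(c)]
        h += (len(c) / total) * -sum(p * log2(p) for p in probs)
    return h

# Toy clustering of 6 documents over classes "a" and "b".
clusters = [["a", "a", "b"], ["b", "b", "b"]]
print(purity(clusters), entropy(clusters))
```

A perfect clustering has purity 1.0 and entropy 0.0; comparing methods on both, as done here, guards against a method that scores well on one measure by producing many tiny clusters.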
59

Semantic Topic Modeling and Trend Analysis

Mann, Jasleen Kaur January 2021 (has links)
This thesis focuses on finding an end-to-end unsupervised solution to a two-step problem: extracting semantically meaningful topics from a large temporal text corpus, and analyzing the trends of these topics over time. To achieve this, the focus is on using the latest developments in Natural Language Processing (NLP) related to pre-trained language models like Google's Bidirectional Encoder Representations from Transformers (BERT) and other BERT-based models. These transformer-based pre-trained language models provide word and sentence embeddings based on the context of the words. The results are then compared with traditional machine learning techniques for topic modeling, to evaluate whether the quality of topic models has improved and how dependent the techniques are on manually defined model hyperparameters and data preprocessing. Such topic models provide a good mechanism for summarizing and organizing a large text corpus and give an overview of how the topics evolve with time. In the context of research publications or scientific journals, such an analysis of the corpus can give an overview of research interest areas and how these interests have evolved over the years. The dataset used for this thesis consists of research articles and papers from a single journal, the 'Journal of Cleaner Productions', which contained more than 24000 research articles at the time of this project. We started by implementing Latent Dirichlet Allocation (LDA) topic modeling. In the next step, we implemented LDA along with document clustering to get topics within these clusters. This gave us an idea of the dataset and also provided a benchmark. After obtaining some base results, we explored transformer-based contextual word and sentence embeddings to evaluate whether they lead to more meaningful, contextual, and semantic topics. For document clustering, we used K-means clustering.
In this thesis, we also discuss methods to optimally visualize the topics and how they change over the years. Finally, we conclude with a method that leverages contextual embeddings from BERT and Sentence-BERT to solve this problem and achieve semantically meaningful topics. We also discuss the results from traditional machine learning techniques and their limitations.
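Once each document has been assigned a dominant topic and has a publication year, the trend-analysis step reduces to counting topic shares per year. The sketch below uses invented document records, not the journal data, and assumes the dominant topic is, say, the argmax of each document's topic distribution.

```python
from collections import Counter, defaultdict

# (year, dominant_topic) pairs -- hypothetical assignments.
docs = [
    (2019, "recycling"), (2019, "recycling"), (2019, "emissions"),
    (2020, "emissions"), (2020, "emissions"), (2020, "recycling"),
]

def topic_trends(docs):
    """Fraction of each year's documents assigned to each topic."""
    per_year = defaultdict(Counter)
    for year, topic in docs:
        per_year[year][topic] += 1
    return {
        year: {t: c / sum(counts.values()) for t, c in counts.items()}
        for year, counts in per_year.items()
    }

trends = topic_trends(docs)
for year in sorted(trends):
    print(year, trends[year])
```

Plotting each topic's yearly fraction as a line then gives the kind of trend view the thesis describes, independent of whether the topics came from LDA or from clustered BERT embeddings.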
60

Analysis and Data Extraction from a Set of Documents Merged Together

Jarolím, Jordán January 2018 (has links)
This thesis deals with mining relevant information from documents and automatically splitting multiple documents that have been merged together. It describes the design and implementation of software for data mining from documents and for automatic document splitting. The thesis covers methods for acquiring textual data from scanned documents, named entity recognition, document clustering, their supporting algorithms, and metrics for automatic document splitting. Furthermore, the algorithm of the implemented software is explained, and the tools and techniques it uses are described. Lastly, the success rate of the implemented software is evaluated, and possible extensions and further development of this work are discussed.
