1 |
Improving Document Clustering by Refining Overlapping Cluster Regions
Upadhye, Akshata Rajendra (January 2022)
No description available.
|
2 |
Distributed Document Clustering and Cluster Summarization in Peer-to-Peer Environments
Hammouda, Khaled M. (January 2007)
This thesis addresses difficult challenges in distributed document clustering and cluster summarization. Mining large document collections poses many challenges, one of which is extracting topics or summaries from documents so that clustering results can be interpreted. Another important challenge, driven by new trends in distributed repositories and peer-to-peer computing, is that document data is becoming more distributed.
We introduce a solution for interpreting document clusters using keyphrase extraction from multiple documents simultaneously. We also introduce two solutions for the problem of distributed document clustering in peer-to-peer environments, each satisfying a different goal: maximizing local clustering quality through collaboration, and maximizing global clustering quality through cooperation.
The keyphrase extraction algorithm, called CorePhrase, efficiently extracts and scores candidate keyphrases from a document cluster. It models document collections as a graph and uses graph mining to extract frequent and significant phrases, which are then used to label the clusters. Results show that CorePhrase can extract keyphrases relevant to the documents in a cluster with very high accuracy. Although this algorithm can be used to summarize centralized clusters, it is specifically employed within distributed clustering both to boost distributed clustering accuracy and to provide summaries for distributed clusters.
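The idea of labeling a cluster with phrases shared across its documents can be sketched as follows. This is a minimal illustration, not the CorePhrase algorithm: it swaps the graph-based model described above for plain word n-gram counting, and the function names, the scoring rule (document frequency times phrase length), and the toy cluster are all assumptions made for the example.

```python
import re
from collections import Counter


def ngrams(text, max_len=3):
    """Return the set of word n-grams (1..max_len words) in one document."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    grams = set()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            grams.add(" ".join(words[i:i + n]))
    return grams


def shared_phrases(docs, min_docs=2, max_len=3, top_k=5):
    """Rank phrases that occur in at least min_docs documents of a cluster.

    Hypothetical scoring: document frequency times phrase length in words,
    so a longer shared phrase outranks the single words it contains.
    """
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(ngrams(doc, max_len))
    scored = [
        (df * len(phrase.split()), phrase)
        for phrase, df in doc_freq.items()
        if df >= min_docs
    ]
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [phrase for _, phrase in scored[:top_k]]


# Toy cluster (illustrative data, not from the thesis).
cluster = [
    "Distributed document clustering in peer-to-peer networks",
    "A survey of document clustering methods",
    "Document clustering with keyphrase labels",
]
labels = shared_phrases(cluster)
print(labels)
```

On this toy cluster the top-ranked label is the two-word phrase common to all three titles. A real system would also need stop-word handling and a significance measure, which the sketch omits.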
The first method for distributed document clustering is called collaborative peer-to-peer document clustering, which models nodes in a peer-to-peer network as collaborative nodes with the goal of improving the quality of individual local clustering solutions. This is achieved through the exchange of local cluster summaries between peers, followed by recommendation of documents to be merged into remote clusters. Results on large sets of distributed document collections show that: (i) such a collaboration technique achieves significant improvement in the final clustering of individual nodes; (ii) networks with a larger number of nodes generally achieve greater improvements in clustering after collaboration relative to the initial clustering before collaboration, while on the other hand they tend to achieve lower absolute clustering quality than networks with fewer nodes; and (iii) as more overlap of the data is introduced across the nodes, collaboration tends to have little effect on improving clustering quality.
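The exchange-then-recommend step described above can be sketched in a few lines. This is an assumption-laden illustration rather than the thesis's protocol: the summary format (summed term counts), the cosine-similarity test, the threshold value, and all names are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for one document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def summarize(cluster_docs):
    """Cluster summary a peer would send: summed term counts (assumed format)."""
    summary = Counter()
    for doc in cluster_docs:
        summary.update(vectorize(doc))
    return summary


def recommend(local_docs, remote_summary, threshold=0.5):
    """Local documents similar enough to a remote cluster to be recommended."""
    return [
        doc
        for doc in local_docs
        if cosine(vectorize(doc), remote_summary) >= threshold
    ]


# Peer A summarizes one of its clusters and sends the summary to peer B.
peer_a_cluster = ["graph mining of phrases", "phrase graph mining methods"]
summary_a = summarize(peer_a_cluster)

# Peer B checks its own documents against the received summary.
peer_b_docs = ["mining graph phrases", "cooking pasta recipes"]
recommended = recommend(peer_b_docs, summary_a, threshold=0.3)
print(recommended)
```

Only the summary crosses the network in this sketch; peer B's documents stay local until one is recommended.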
The second method for distributed document clustering is called hierarchically-distributed document clustering. Unlike the collaborative model, this model aims at producing one clustering solution across the whole network. It specifically addresses scalability of network size, and consequently the distributed clustering complexity, by modeling the distributed clustering problem as a hierarchy of node neighborhoods. Summarization of the global distributed clusters is achieved through a distributed version of the CorePhrase algorithm. Results on large document sets show that: (i) distributed clustering accuracy is not affected by increasing the number of nodes for single-level networks; (ii) a decent speedup can be achieved by making the hierarchy taller, but at the expense of clustering quality, which degrades as we go up the hierarchy; (iii) in networks that grow arbitrarily, data becomes more fragmented across neighborhoods, causing poor centroid generation, which suggests that the number of nodes in the network should not be increased beyond a certain level without increasing the data set size; and (iv) distributed cluster summarization can produce accurate summaries similar to those produced by centralized summarization.
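One way to picture summaries flowing up a hierarchy of neighborhoods is count-weighted centroid merging, sketched below. This is an assumed simplification, not the thesis's algorithm: it handles a single cluster, so no centroid matching between nodes is needed, and the node names, neighborhood layout, and data are hypothetical.

```python
def local_summary(vectors):
    """Return (centroid, count) for one node's document vectors."""
    count = len(vectors)
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / count for i in range(dim)]
    return centroid, count


def merge(summaries):
    """Merge (centroid, count) pairs by count-weighted averaging."""
    total = sum(count for _, count in summaries)
    dim = len(summaries[0][0])
    centroid = [sum(c[i] * n for c, n in summaries) / total for i in range(dim)]
    return centroid, total


# Hypothetical two-level hierarchy: three nodes in two neighborhoods.
node_data = {
    "n1": [[1.0, 0.0], [3.0, 0.0]],
    "n2": [[0.0, 2.0]],
    "n3": [[2.0, 2.0], [4.0, 6.0], [0.0, 4.0]],
}
neighborhoods = [["n1", "n2"], ["n3"]]

# Level 1: each neighborhood merges the summaries of its member nodes.
level1 = [
    merge([local_summary(node_data[n]) for n in hood]) for hood in neighborhoods
]

# Level 2 (root): merge the neighborhood summaries into one global centroid.
root_centroid, root_count = merge(level1)
print(root_centroid, root_count)
```

In this one-cluster toy case the merged centroid equals the centroid of the pooled data, so the sketch does not reproduce the quality loss reported above for taller hierarchies; that loss arises once each node must cluster its own small, fragmented share of the data.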
The proposed algorithms offer a high degree of flexibility, scalability, and interpretability for large distributed document collections. Achieving the same results using current methodologies requires centralizing the data first, which is sometimes not feasible.
|
4 |
Document Clustering Interface
Johnson, Samuel (January 2014)
This project created a first-step prototype interface for a document clustering search engine. The goal is to serve the needs of people with reading difficulties while also being a useful tool for general users trying to find relevant but easy-to-read documents. The hypothesis is that minimizing the amount of text and focusing on graphical representation will make the service easier to use for all users. The interface was developed using previously established personas and evaluated by general users (i.e. not users with reading disabilities) to see whether it was easy to use and to understand without tooltips and tutorials. The results showed that even though the participants understood the interface and found it intuitive, there was still some information they thought was missing, such as an explanation of the reading indexes and how they determined readability.
|
5 |
Contribuições para a construção de taxonomias de tópicos em domínios restritos utilizando aprendizado estatístico / Contributions to topic taxonomy construction in a specific domain using statistical learning
Moura, Maria Fernanda (26 October 2009)
Text mining provides powerful techniques to help meet the current need to understand and organize huge amounts of textual documents. One way to do this is to build topic taxonomies from these documents. Topic taxonomies can be used to organize the documents, preferably in hierarchies, and to identify groups of related documents and their descriptors. Constructing high-quality topic taxonomies, whether manually, automatically or semi-automatically, is not a trivial task. This work aims to use text mining techniques to build topic taxonomies for restricted knowledge domains, helping the domain expert to understand and organize document collections. The knowledge domain is restricted so that only unsupervised statistical learning methods are needed, applied to bag-of-words representations of the texts. These representations are independent of the context of the words in the documents and, consequently, in the domain. Thus, restricting the domain is expected to reduce errors in interpreting the results. The proposed methodology for topic taxonomy construction is an instantiation of the text mining process. At each step of the process, solutions are proposed and adapted to the specific needs of topic taxonomy construction, among them some innovative contributions to the state of the art. In particular, this work contributes to the state of the art in three ways: the selection of n-gram attributes in text mining tasks, two models for hierarchical document cluster labeling, and a validation model for the hierarchical document cluster labeling process. Additional contributions include adaptations and methodologies for choosing attribute selection processes, attribute generation, taxonomy visualization, and reduction of the taxonomies obtained. Finally, the proposed methodology was applied to real problems, with good results.
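To make the bag-of-words and n-gram labeling ideas concrete, here is a minimal sketch of picking descriptors for one cluster. It is not one of the thesis's labeling models: the scoring rule (share of cluster documents containing a term minus the share of other documents containing it), the function names, and the toy documents are assumptions for illustration.

```python
from collections import Counter


def bag_of_ngrams(text, n_max=2):
    """Set of word n-grams (1..n_max words); word order beyond n_max is ignored."""
    words = text.lower().split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(words) - n + 1)
    }


def doc_frequency(docs):
    """Number of documents in which each n-gram appears."""
    freq = Counter()
    for doc in docs:
        freq.update(bag_of_ngrams(doc))
    return freq


def label_cluster(cluster_docs, other_docs, top_k=2):
    """Pick n-grams frequent inside the cluster and rare outside it.

    Hypothetical score: in-cluster document share minus out-of-cluster share.
    """
    inside = doc_frequency(cluster_docs)
    outside = doc_frequency(other_docs)
    n_in = max(len(cluster_docs), 1)
    n_out = max(len(other_docs), 1)
    scored = [(inside[t] / n_in - outside[t] / n_out, t) for t in inside]
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [term for _, term in scored[:top_k]]


# Toy documents (illustrative only).
cluster_docs = ["soil nitrogen levels", "soil nitrogen cycle"]
other_docs = ["soil erosion control", "cattle feed quality"]
labels = label_cluster(cluster_docs, other_docs)
print(labels)
```

The word "soil" is common inside the cluster but also appears outside it, so it scores lower than the terms specific to the cluster. In a hierarchy, the same comparison could be run between a node and its siblings, though that extension is not shown here.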
|