FP-growth approach for document clustering

Akbar, Monika. January 2008
Thesis (MS)--Montana State University--Bozeman, 2008. / Typescript. Chairperson, Graduate Committee: Rafal A. Angryk. Includes bibliographical references (leaves 58-61).

Feature Translation-based Multilingual Document Clustering Technique

Liao, Shan-Yu 08 August 2006
Document clustering automatically organizes a document collection into distinct groups of similar documents on the basis of their contents. Most existing document clustering techniques deal with monolingual documents (i.e., documents written in one language). However, with the trend of globalization and advances in Internet technology, an organization or individual often generates or acquires, and subsequently archives, documents in different languages, thus creating the need for multilingual document clustering (MLDC). Motivated by this need, this study designs a translation-based MLDC technique. Our empirical evaluation results show that the proposed multilingual document clustering technique achieves satisfactory clustering effectiveness, measured by both cluster recall and cluster precision.

Le document de voyage : traces et cheminements hybrides comme médiateurs de savoirs / The travel document : traces and hybrid pathways as mediators of knowledge

Roux, Sabine 28 September 2012
A study, based on a selection of eleven travel documents from the sixteenth to the twentieth century (Léry, L’Histoire d’un voyage faict en la terre du Brésil (1578); Bougainville, Voyage autour du monde; Lapérouse, Voyage autour du monde sur l’Astrolabe et la Boussole (1785-1788); Baudin, Journal du voyage aux Antilles de La Belle Angélique (1796-1798); Darwin, Voyage d’un naturaliste autour du monde : fait à bord du navire le Beagle de 1831 à 1836; Arseniev, Aux confins de l’Amour and Dersou Ouzala; Charcot, Le Français au Pôle Sud; Lévi-Strauss, Tristes Tropiques; Leiris, L’Afrique fantôme; Malaurie, Hummocks and Les derniers rois de Thulé; Bonnerave, Carnets de terrain and Nouveaux Indiens), of the travel document as a textual object that combines science and literature, fiction and documentary, to produce complex forms of knowledge. Hybrid pathways seem to allow forms of knowledge to circulate from a travel document that attempts to give an account of an experience. The travel document can then be regarded as a source-material document that potentially contains the capacity to generate other documents. Like a rhizome, any point of which can be connected with any other, this first document (an ethnologist’s notebook or a logbook, for example), which contains scientific information, connects with other heterogeneous documents (an edition of a travel journal intended for the public, a scientific article written from the notebook or the expedition report, a novel written from this first document, a scientific theory, a performance…) to circulate knowledge.

Recognition and representation of user interest

Badi, Rajiv Ravindranath 25 April 2007
With the growth of the internet and other media of communication, locating information on a topic of interest is less a problem of finding related documents than of determining which particular documents are valuable. Often, the desired information is obscured within a long list of resources. Users become inundated with so much information that sifting through it takes up most of the time spent on a given information task. Users look at multiple documents at once to find answers to their questions, and switch between documents to get the “complete” picture. New systems are needed that help users cull through related documents to gain the information they need. As a part of the Document Triage Project, we have been looking at ways to help users sift through information. The Document Triage Project is developing tools to recognize, represent, communicate, and visualize user interest across applications. The topic of this thesis is recognizing user interest and providing an infrastructure to represent that interest so that it can be shared across the software applications involved in triage. Based on this inferred interest, applications can help users in their triage task by providing visualizations or other functionality. The applications could involve one or many reading interfaces (e.g., a browser or an editor), an information organizing system (e.g., Visual Knowledge Builder) and search interfaces (the application providing the document collection; e.g., a search engine). To recognize user interest, data is gathered from the user’s reading, navigational and interpretive activities. Algorithms based on statistical models and qualitative analyses of user behavior in triage are used to infer interest. A lightweight infrastructure called the Interest Profile Manager has been developed for the representation of interest values and the corresponding metadata. The Interest Profile Manager also provides text processing capability, interest analysis functionality, sharing of data across applications and event propagation.

Measuring interestingness of documents using variability

KONDI CHANDRASEKARAN, PRADEEP KUMAR 01 February 2012
Data are being generated at an astronomical pace. With the rapid technological advances in data storage techniques, storing and transmitting copious amounts of data has become easy and hassle-free. However, exploring those abundant data and finding the interesting items has always been a major challenge and a cumbersome process for people in all industrial sectors. A model that ranks data by interest will help save the time spent on large amounts of data. In this research we concentrate specifically on ranking the text documents in corpora according to "interestingness". We design a state-of-the-art empirical model to rank documents according to "interestingness". The model is cost-efficient, fast and automated to an extent that requires minimal human intervention. We identify different categories of documents based on their word-usage patterns, which in turn classify them as interesting, mundane or anomalous documents. The model is a novel approach that does not depend on the semantics of the words used in the document but is based on the repetition of words and the rate of introduction of new words in the document. The model is a generic design that can be applied to a document corpus of any size from any domain. The model can also be used to rank new documents introduced into the corpus. We formulate a couple of normalization techniques that can be used to neutralize the impact of variable document length. We use three approaches, namely dictionary-based data compression, analysis of the rate of new word occurrences and Singular Value Decomposition (SVD). To test the model we use a variety of corpora, namely US Diplomatic Cable releases by Wikileaks, US Presidents' State of the Union Addresses, the Open American National Corpus and 20 Newsgroups articles. The techniques have various pre-processing steps, which are fully automated. We compare the results of the three techniques and examine the level of agreement between pairs of techniques using a statistical measure called the Jaccard coefficient. This approach can also be used to detect unusual and anomalous documents within the corpus. The results also contradict the assumptions made by Simon and Yule in deriving an equation for a general text generation model. / Thesis (Master, Computing) -- Queen's University, 2012-01-31 15:28:04.177
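The Jaccard coefficient mentioned in the abstract above is the size of the intersection of two sets divided by the size of their union. The sketch below shows one way it could be used to measure agreement between two ranking techniques, by comparing the sets of documents each places in its top k. The top-k comparison, the function names and the document identifiers are assumptions made for illustration; the abstract does not say how the thesis applies the coefficient.

```python
def jaccard(a, b):
    """Return |A ∩ B| / |A ∪ B| for two collections of hashable items."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Convention chosen for this sketch: two empty sets agree fully.
        return 1.0
    return len(set_a & set_b) / len(union)


def top_k_agreement(ranking_a, ranking_b, k):
    """Jaccard coefficient of the top-k documents of two rankings (hypothetical helper)."""
    return jaccard(ranking_a[:k], ranking_b[:k])


# Hypothetical rankings produced by two of the techniques, most interesting first.
compression_rank = ["d3", "d1", "d7", "d2", "d5"]
new_word_rank = ["d1", "d3", "d4", "d7", "d9"]

agreement = top_k_agreement(compression_rank, new_word_rank, k=4)
print(f"top-4 agreement: {agreement:.2f}")
```

A value of 1 means both techniques select exactly the same top-k documents, and 0 means they share none.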

An historical commentary to the Thirteenth Sibylline Oracle

Potter, D. S. January 1984
No description available.

The use of graph theory in modelling thematic structure in the content of documents

Farbey, B. A. January 1984
No description available.

A software toolkit for handprinted form readers

Cracknell, Christopher Robert William January 1999
No description available.

An associative text FILTER for micro-computer based document retrieval

Barros, Silvano Piedade Venacio January 1983
No description available.

On Efficient processing of XML data and their applications

Shui, William Miao, Computer Science & Engineering, Faculty of Engineering, UNSW January 2007
The development of high-throughput genome sequencing and protein structure determination techniques has provided researchers with a wealth of biological data. However, providing an integrated analysis can be difficult due to the incompatibilities of data formats between providers and applications, the strict schema constraints imposed by data providers, and the lack of infrastructure for easily accommodating new semantic information. To address these issues, this thesis first proposes to use Extensible Markup Language (XML) [26] and its supporting query languages as the underlying technology to facilitate seamless, integrated access to the sum of heterogeneous biological data and services. XML is used due to its semi-structured nature and its ability to easily encapsulate both contextual and semantic information. The tree representation of an XML document enables applications to easily traverse and access data within the document without prior knowledge of its schema. However, in the process of constructing the framework, we identified a number of issues related to the performance of XML technologies, more specifically the performance of the XML query processor, the data store and the transformation processor. Hence, this thesis also focuses on finding new solutions to address these issues. For the XML query processor, we propose an efficient structural join algorithm that can be implemented on top of existing relational databases. Experiments show the proposed method outperforms previous work in both queries and updates. For complicated XML query patterns, a new twig join algorithm called CTwigStack is proposed in this thesis. In essence, the new approach only produces and merges partial solution nodes that satisfy the entire twig query pattern tree. Experiments show the proposed algorithm outperforms previous methods in most cases. For more general cases, a mixed mode twig join is proposed, which combines CTwigStack with the existing twig join algorithms. Extensive experimental results have shown the superior effectiveness of both CTwigStack and the mixed mode twig join. By combining it with existing system information, the mixed mode twig join can serve as a framework for plan selection during XML query optimization. For the XML transformation component, a novel stand-alone, memory-conscious XSLT processor is proposed in this thesis, which requires only a single pass over the input XML dataset. This enables fast transformation of streaming XML data and better handling of complicated XPath selection patterns, including aggregate predicate functions such as the XPath count function. Ultimately, based on the nature of the proposed framework, we believe that solving the performance issues related to the underlying XML components can subsequently lead to a more robust framework for integrating heterogeneous biological data sources and services.
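The abstract above notes that the tree representation of an XML document lets an application traverse its data without prior knowledge of the schema. A minimal sketch of that idea, using Python's standard `xml.etree.ElementTree` module, is given here. The sample record, its element names and the `walk` helper are invented for illustration and are not taken from the thesis.

```python
import xml.etree.ElementTree as ET

# Hypothetical biological record; the element names are made up for this example.
SAMPLE = """
<entry id="P1">
  <protein>
    <name>Hypothetical protein</name>
    <organism>E. coli</organism>
  </protein>
  <sequence length="4">MKVL</sequence>
</entry>
"""


def walk(element, path=""):
    """Yield (path, attributes, text) for every node, depth first.

    Nothing here depends on knowing the tag names in advance: the
    structure is discovered by following each element's children.
    """
    current = f"{path}/{element.tag}"
    text = (element.text or "").strip()
    yield current, dict(element.attrib), text
    for child in element:
        yield from walk(child, current)


rows = list(walk(ET.fromstring(SAMPLE)))
for node_path, attributes, text in rows:
    print(node_path, attributes, repr(text))
```

The same function works unchanged on a document with a different vocabulary, which is the property the abstract relies on when integrating heterogeneous sources.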

