1

Assisting Reading and Analysis of Text Documents by Visualization

Maloney, Ross J. January 2005 (has links)
The research reported here examined the use of computer-generated graphics as a means of assisting humans to analyse text documents that have not been subject to markup. The approach taken was to survey available visualization techniques across a broad selection of disciplines, including applications to text documents, to group those techniques using a taxonomy proposed in this research, and then to develop a selection of techniques that serve the text-analysis objective. Development of the selected techniques from their fundamental basis, through their visualization, to their demonstration in application comprises most of the body of this research. A scientific orientation employing measurements, combined with visual depiction and explanation of each technique using limited mathematics, is adopted, rather than fully exploiting any one of the resulting techniques to perform a complete text document analysis. Both visualization techniques that apply directly to the text and those that exploit measurements produced by associated techniques are considered. Both approaches employ visualization to assist the human viewer in discovering patterns, which are then used in the analysis of the document. In the measurement case, this requires consideration of data with more than three dimensions, which poses a visualization difficulty; several techniques for overcoming this problem are proposed. Word frequencies, Zipf considerations, parallel coordinates, colour maps, Cusum plots, and fractal dimensions are some of the techniques considered. One direct application of visualization to text documents is to assist the reading of a document by de-emphasising selected words, fading them on the display from which they are read. Three techniques are proposed for the automatic selection of which words to fade. An experiment using such word-fading techniques is reported: it indicated that some readers do have improved reading speed under such conditions, but others do not.
The experimental design enabled the separation of the group whose reading times decreased from the remaining readers. Measurements of comprehension errors made under different types of word fading were shown not to increase beyond those obtained under normal reading conditions. A visualization based on categorising the words in a text document is proposed, in contrast to visualizations of measurements based on counts. The result is a visual impression of the word composition of a document and of the evolution of that composition within it. The text documents used to demonstrate these techniques include English novels and short stories, emails, and a series of eighteenth-century newspaper articles known as the Federalist Papers. This range of documents was needed because not all analysis techniques are applicable to all types of documents. This research proposes that interactive use of the techniques at hand, in a non-prescribed order, can yield useful results in a document analysis. An example is author attribution, i.e. assigning authorship of documents via patterns characteristic of an individual's writing style; different visual techniques can be used to explore the patterns of writing in given text documents. A software toolkit as a platform for implementing the proposed interactive analysis of text documents is described, and how the techniques could be integrated into such a toolkit is outlined. A prototype of software implementing such a toolkit is included in this research, and issues relating to the implementation of each technique used are also outlined.
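The word-frequency and Zipf analyses mentioned in the abstract can be sketched in a few lines: count the words of an unmarked-up text, rank them by frequency, and inspect whether rank × frequency stays roughly constant, as Zipf's law predicts. The tokenisation rule and the sample text below are illustrative assumptions, not details taken from the thesis.

```python
import re
from collections import Counter

def zipf_profile(text):
    """Rank words by frequency; under Zipf's law, rank * frequency is
    roughly constant, so the last column flags departures from it."""
    words = re.findall(r"[a-z']+", text.lower())
    ranked = Counter(words).most_common()
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]

sample = ("the quick brown fox jumps over the lazy dog "
          "the dog barks and the fox runs")
for row in zipf_profile(sample)[:5]:
    print(row)   # (rank, word, frequency, rank * frequency)
```

On a document of realistic length, plotting log-frequency against log-rank (or just scanning the last column) gives the kind of visual Zipf check the abstract describes.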
2

Integritet och långsiktig användbarhet hos textdokument : En avvägningsproblematik vid digitalt bevarande / Integrity and long-term Usability in Text Documents : Trade-offs in the Context of Digital Preservation

Pettersson, Karl January 2015 (has links)
This thesis concerns a potential trade-off between integrity and long-term usability in the choice of file formats for the preservation of text documents. Five common formats are discussed: plain text, PDF/A, Office Open XML Document, Open Document Text, and Markdown. The formats are compared with respect to four criteria related to integrity, usability, and the records continuum model: support by widely used software, stability, rendering of contents, and reusability. It is concluded that no single format is optimal with respect to all four criteria when it comes to preserving typical documents in a modern environment, with more or less complex formatting and document structure. Therefore, the feasibility of using two or more formats for the preservation of a single document (e.g. PDF/A combined with Markdown and/or Office Open XML) is discussed. The importance of integrity and long-term usability must be weighed against the costs of preserving documents in multiple formats. This is a two-year master's thesis in Archival Science, Library and Museum Studies.
3

Shlukování textových dat / Text Data Clustering

Leixner, Petr January 2010 (has links)
The process of text data clustering can be used to analyse, navigate, and structure large sets of text or hypertext documents. The basic idea is to group the documents into a set of clusters on the basis of their similarity. The well-known methods of text clustering, however, do not really solve the specific problems of the task, such as the high dimensionality of the input data, the very large size of the databases, and the understandability of the cluster descriptions. This work deals with these problems and describes a modern method of text data clustering based on the use of frequent term sets, which tries to overcome the deficiencies of other clustering methods.
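A minimal sketch of the frequent-term-set idea: treat each document as a set of terms, find term combinations whose co-occurrence reaches a minimum support, and read each frequent set's covering documents as a candidate cluster whose description is the term set itself, which is what makes the clusters understandable. Restricting candidates to term pairs, the whitespace tokenisation, and the sample documents are simplifying assumptions, not the method as implemented in the thesis.

```python
from itertools import combinations

def frequent_term_sets(docs, min_support):
    """Find term pairs occurring together in at least min_support documents.
    Returns {term_pair: covering document indices}; each covering list is a
    candidate cluster described by the pair itself."""
    term_sets = [set(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*term_sets))
    frequent = {}
    for pair in combinations(vocab, 2):
        cover = [i for i, ts in enumerate(term_sets) if set(pair) <= ts]
        if len(cover) >= min_support:
            frequent[pair] = cover
    return frequent

docs = [
    "apple banana cherry",
    "apple banana date",
    "banana cherry date",
    "apple cherry",
]
fts = frequent_term_sets(docs, min_support=2)
for terms, cluster in sorted(fts.items()):
    print(terms, "->", cluster)
```

A full method would grow candidate sets beyond pairs (Apriori-style) and resolve the overlap between covering sets, but the sketch shows why the approach sidesteps high-dimensional vector spaces: documents are compared only through shared term sets.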
4

Metody extrakce informace z textových dokumentů / Methods for Information Extraction in Text Documents

Sychra, Tomáš January 2008 (has links)
Knowledge discovery in text documents is a part of data mining; text documents, however, have different properties from regular databases. This project contains an overview of methods for knowledge discovery in text documents. The most frequent task in this area is document classification, and various approaches to text classification are described. Finally, I present the Winnow algorithm, which should perform better than other classification algorithms. A description of the Winnow implementation and an overview of experimental results are included.
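Winnow itself is a standard online learner over Boolean features (e.g. word presence in a document) that uses multiplicative weight updates, which is what makes it robust when only a few of many features are relevant. Below is a minimal sketch of the classic promotion/demotion variant with threshold n/2; the toy data and parameter choices are illustrative, not taken from the project.

```python
def winnow_train(samples, n_features, alpha=2.0, epochs=10):
    """Winnow: online learning with multiplicative updates.
    samples is a list of (set of active feature indices, label),
    with labels in {0, 1}."""
    weights = [1.0] * n_features
    threshold = n_features / 2
    for _ in range(epochs):
        for active, label in samples:
            score = sum(weights[i] for i in active)
            pred = 1 if score >= threshold else 0
            if pred == 0 and label == 1:        # false negative: promote
                for i in active:
                    weights[i] *= alpha
            elif pred == 1 and label == 0:      # false positive: demote
                for i in active:
                    weights[i] /= alpha
    return weights, threshold

def winnow_predict(weights, threshold, active):
    return 1 if sum(weights[i] for i in active) >= threshold else 0

# Toy task: the label is 1 exactly when feature 0 or feature 1 is active.
samples = [({0, 2}, 1), ({1, 3}, 1), ({2, 3}, 0), ({4, 5}, 0)]
weights, threshold = winnow_train(samples, n_features=8)
```

Because updates are multiplicative, the number of mistakes Winnow makes grows only logarithmically with the number of irrelevant features, which suits the very high-dimensional bag-of-words representations typical of text classification.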
5

Reprezentace textu a její vliv na kategorizaci / Representation of Text and Its Influence on Categorization

Šabatka, Ondřej January 2010 (has links)
The thesis deals with machine processing of textual data. The theoretical part describes issues related to natural language processing and introduces different ways of pre-processing and representing text. The thesis also focuses on the use of N-grams as features for document representation and describes some algorithms used for their extraction. The next part outlines the classification methods used. In the practical part, an application for pre-processing and creating different representations of textual data is designed and implemented. In the experiments performed, the influence of these representations on the accuracy of classification algorithms is analysed.
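N-gram extraction, one common way of producing the features this abstract mentions, can be sketched as a sliding window over characters or tokens; the window definition and sample strings below are illustrative assumptions, not the thesis's own extraction algorithms.

```python
from collections import Counter

def char_ngrams(text, n):
    """Slide a window of length n over the text, producing overlapping n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(tokens, n):
    """The same idea over a token list, yielding tuples of n consecutive words."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(text, n):
    """Bag-of-n-grams document representation: n-gram -> occurrence count."""
    return Counter(char_ngrams(text, n))

print(char_ngrams("text", 2))                      # character bigrams
print(word_ngrams("to be or not to be".split(), 2))  # word bigrams
```

The resulting counts (or their presence/absence) form the feature vectors whose effect on classification accuracy the practical part of the thesis measures.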
