1.
Assisting Reading and Analysis of Text Documents by Visualization. Maloney, Ross J. January 2005 (has links)
The research reported here examined the use of computer-generated graphics as a means of assisting humans to analyse text documents that have not been subject to markup. The approach taken was to survey available visualization techniques across a broad selection of disciplines, including applications to text documents, to group those techniques using a taxonomy proposed in this research, and then to develop a selection of techniques that serve the text analysis objective. Development of the selected techniques from their fundamental basis, through their visualization, to their demonstration in application comprises most of the body of this research. A scientific orientation is adopted, employing measurements combined with visual depiction and explanation of each technique using limited mathematics, rather than fully applying any one resulting technique to a complete text document analysis.
Visualization techniques that apply directly to the text, and those that exploit measurements produced by associated techniques, are both considered. Both approaches employ visualization to help the human viewer discover patterns that are then used in the analysis of the document. In the measurement case this requires handling data of more than three dimensions, which poses a visualization difficulty; several techniques for overcoming this problem are proposed. Word frequencies, Zipf considerations, parallel coordinates, colour maps, Cusum plots, and fractal dimensions are among the techniques considered.
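As an illustration of the word-frequency and Zipf-style measurements listed above, a minimal sketch (not the thesis's own code; fitting the Zipf exponent by least squares on the log-log rank-frequency data is an assumed choice) might be:

```python
from collections import Counter
import math

def zipf_profile(text):
    """Rank words by frequency and estimate the Zipf exponent.

    Returns (ranked, slope), where ranked is the (word, count) list in
    descending frequency and slope is a least-squares fit of
    log(frequency) against log(rank); Zipf's law predicts a slope near -1.
    """
    counts = Counter(text.lower().split())
    ranked = counts.most_common()
    xs = [math.log(rank) for rank, _ in enumerate(ranked, start=1)]
    ys = [math.log(freq) for _, freq in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return ranked, slope
```

The slope could then be compared across documents, or the ranked counts fed into a parallel-coordinates or colour-map display.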
One direct application of visualization to text documents is to assist reading of a document by de-emphasising selected words, fading them on the display from which the document is read. Three techniques are proposed for automatically selecting which words to fade.
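One plausible selection rule, sketched here as an assumption rather than as one of the three techniques actually proposed, is to fade the most frequent words in the document, on the grounds that they are the most predictable:

```python
from collections import Counter

def words_to_fade(text, fraction=0.2):
    """Select the most frequent words in the text for fading.

    Frequent words (articles, prepositions) tend to be predictable, so
    de-emphasising them may leave comprehension intact. `fraction` is the
    share of distinct words to fade -- an assumed knob, not a parameter
    taken from the thesis.
    """
    counts = Counter(text.lower().split())
    k = max(1, int(len(counts) * fraction))
    return {word for word, _ in counts.most_common(k)}

def render_with_fading(text, faded):
    """Mark faded words in brackets to simulate reduced display intensity."""
    return " ".join(f"[{w}]" if w.lower() in faded else w
                    for w in text.split())
```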
An experiment using these word fading techniques is reported. It indicated that some readers achieve improved reading speed under such conditions while others do not, and the experimental design enabled the group whose reading times decreased to be separated from the remaining readers. Comprehension errors measured under the different types of word fading were shown not to increase beyond those made under normal reading conditions.
A visualization based on categorising the words in a text document is proposed, in contrast to visualizations of count-based measurements. The result is a visual impression of the word composition of the document and of the evolution of that composition within it.
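A categorisation-based view of this kind could be sketched as follows; the category scheme here is purely illustrative and is not the one used in the thesis. The resulting category sequence could be rendered as a coloured strip to show how composition evolves through the document:

```python
def categorise(word):
    """Assign a word to a coarse category (an illustrative scheme)."""
    if word.isdigit():
        return "number"
    if len(word) <= 3:
        return "short"
    if word[0].isupper():
        return "proper"
    return "content"

def category_stream(text):
    """Map a document to its sequence of word categories, giving a
    visual impression of composition and its evolution when the
    sequence is rendered, e.g. as one colour per category."""
    return [categorise(w.strip(".,;:!?")) for w in text.split()]
```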
The text documents used to demonstrate these techniques include English novels and short stories, emails, and a series of eighteenth-century newspaper articles known as the Federalist Papers. This range of documents was needed because not all analysis techniques are applicable to all types of documents. This research proposes that interactive use of the techniques at hand, in no prescribed order, can yield useful results in a document analysis. One example is author attribution, i.e. assigning authorship of documents via patterns characteristic of an individual's writing style; different visual techniques can be used to explore the patterns of writing in given text documents.
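Style-based attribution of the kind mentioned above is often approached through function-word frequencies, the approach made famous by studies of the Federalist Papers. A minimal sketch under that assumption (the word list and the distance measure are illustrative choices, not the thesis's):

```python
from collections import Counter

# Function words are useful for attribution because their rates are
# largely topic-independent; this particular list is an assumed choice.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "upon", "by", "a"]

def style_vector(text):
    """Relative frequency of each function word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]

def style_distance(a, b):
    """Euclidean distance between two style vectors; smaller values
    suggest more similar writing patterns."""
    return sum((x - y) ** 2
               for x, y in zip(style_vector(a), style_vector(b))) ** 0.5
```

A disputed document could then be compared against samples of each candidate author, with the vectors themselves plotted on parallel coordinates for visual inspection.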
A software toolkit as a platform for implementing the proposed interactive analysis of text documents is described, and how the techniques could be integrated into such a toolkit is outlined. A prototype implementation of the toolkit is included in this research, together with issues relating to the implementation of each technique used.
2.
Integritet och långsiktig användbarhet hos textdokument : En avvägningsproblematik vid digitalt bevarande / Integrity and long-term Usability in Text Documents : Trade-offs in the Context of Digital Preservation. Pettersson, Karl January 2015 (has links)
This thesis is about a potential trade-off between integrity and long-term usability in the choice of file formats for preservation of text documents. Five common formats are discussed: plain text, PDF/A, Office Open XML Document, Open Document Text, and Markdown. The formats are compared with respect to four criteria related to integrity and usability and to the records continuum model: support by widely used software, stability, rendering of contents, and reusability. It is concluded that no single format is optimal with respect to all four criteria when it comes to preserving typical documents in a modern environment, with more or less complex formatting and document structure. Therefore, the feasibility of using two or more formats for preservation of a single document (e.g. PDF/A combined with Markdown and/or Office Open XML) is discussed. It is necessary to weigh the importance of integrity and long-term usability against the costs of preserving documents in multiple formats. This is a two-year master's thesis in Archival Science, Library and Museum studies.
3.
Shlukování textových dat / Text Data Clustering. Leixner, Petr January 2010 (has links)
The process of text data clustering can be used to analyse, navigate, and structure large sets of texts or hypertext documents. The basic idea is to group the documents into a set of clusters on the basis of their similarity. The well-known methods of text clustering, however, do not really address the specific problems of the task: the high dimensionality of the input data, the very large size of the databases, and the understandability of the cluster descriptions. This work deals with these problems and describes a modern method of text data clustering based on frequent term sets, which tries to overcome the deficiencies of other clustering methods.
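The frequent-term-set idea can be sketched as follows. This simplified version enumerates only single terms and pairs, where a full method would grow larger candidate sets Apriori-style; the point of the sketch is that each cluster is described by the terms themselves, which is what makes the description understandable:

```python
from itertools import combinations

def frequent_term_sets(docs, min_support):
    """Find term sets (here limited to singletons and pairs) that appear
    in at least `min_support` documents -- a simplified stand-in for an
    Apriori-style enumeration."""
    doc_terms = [set(d.lower().split()) for d in docs]
    vocab = set().union(*doc_terms)
    candidates = ([frozenset([t]) for t in vocab]
                  + [frozenset(p) for p in combinations(sorted(vocab), 2)])
    frequent = {}
    for ts in candidates:
        cover = {i for i, terms in enumerate(doc_terms) if ts <= terms}
        if len(cover) >= min_support:
            frequent[ts] = cover
    return frequent

def cluster_by_term_sets(docs, min_support):
    """Use each frequent term set's covering documents as a (possibly
    overlapping) cluster, labelled by the terms themselves."""
    return {tuple(sorted(ts)): sorted(cover)
            for ts, cover in frequent_term_sets(docs, min_support).items()}
```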
4.
Metody extrakce informace z textových dokumentů / Methods for Information Extraction in Text Documents. Sychra, Tomáš January 2008 (has links)
Knowledge discovery in text documents is a part of data mining; text documents, however, have different properties from regular databases. This project contains an overview of methods for knowledge discovery in text documents. The most frequent task in this area is document classification, and various approaches to text classification are described. Finally, I present the Winnow algorithm, which is expected to outperform the other classification algorithms considered. A description of the Winnow implementation and an overview of experimental results are given.
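For reference, a minimal version of Winnow over binary word-presence features might look like this; the halving on demotion shown here is the Winnow2-style variant, one of several standard choices:

```python
class Winnow:
    """Winnow (Littlestone, 1988): a linear classifier with
    multiplicative weight updates, well suited to text where only a few
    of many binary features are relevant."""

    def __init__(self, n_features):
        self.w = [1.0] * n_features
        self.threshold = float(n_features)

    def predict(self, x):
        # x is a binary feature vector, e.g. word-presence indicators
        return int(sum(wi for wi, xi in zip(self.w, x) if xi)
                   >= self.threshold)

    def update(self, x, y):
        """Update only on mistakes: promote (double) the active weights
        on a false negative, demote (halve) them on a false positive."""
        if self.predict(x) == y:
            return
        factor = 2.0 if y == 1 else 0.5
        self.w = [wi * factor if xi else wi
                  for wi, xi in zip(self.w, x)]
```

Its mistake bound grows only logarithmically with the number of irrelevant features, which is why it is attractive for high-dimensional text data.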
5.
Reprezentace textu a její vliv na kategorizaci / Representation of Text and Its Influence on Categorization. Šabatka, Ondřej January 2010 (has links)
The thesis deals with machine processing of textual data. The theoretical part describes issues related to natural language processing and introduces different ways of pre-processing and representing text. The thesis also focuses on the use of N-grams as features for document representation and describes algorithms used for their extraction. The next part outlines the classification methods used. In the practical part, an application for pre-processing and for creating different textual data representations is designed and implemented. The experiments performed analyse the influence of these representations on the accuracy of the classification algorithms.
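N-gram extraction of the kind described can be sketched as follows; character n-grams are shown here, and the parameter values are illustrative rather than those used in the thesis:

```python
from collections import Counter

def char_ngrams(text, n):
    """Extract overlapping character n-grams, a representation that
    needs no language-specific tokenisation and is robust to
    inflection and small spelling variations."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_features(text, n=3, top_k=100):
    """Represent a document by its most frequent character n-grams,
    ready to be used as classification features."""
    return Counter(char_ngrams(text, n)).most_common(top_k)
```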
6.
Разработка системы валидации текстовых документов : магистерская диссертация / Development of a Validation System for Text Documents. Майнгерт, В. А., Mayngert, V. A. January 2024 (links)
Работа посвящена разработке системы валидации текстовых документов для оптимизации проверки на соответствие стандартам коммерческих организаций. / The work is dedicated to the development of a validation system for text documents to optimize compliance checking with the standards of commercial organizations.