Global ETD Search

21	Semi-supervised document clustering with active learning. / CUHK electronic theses & dissertations collection / January 2008 / Most existing semi-supervised document clustering approaches are model-based and can be treated as parametric models that assume the underlying clusters follow a certain pre-defined distribution. In our semi-supervised document clustering, each cluster is represented by a non-parametric probability distribution. Two approaches are designed for incorporating pairwise constraints into document clustering. The first, the term-to-term relationship approach (TR), uses pairwise constraints to capture term-to-term dependence relationships. The second, the linear combination approach (LC), combines the clustering objective function with the user-provided constraints linearly. Extensive experimental results show that our proposed framework is effective. / This thesis presents a new framework for automatically partitioning text documents that takes into account constraints given by users. Semi-supervised document clustering is developed based on pairwise constraints. Unlike traditional semi-supervised document clustering approaches, which assume that pairwise constraints are prepared by the user beforehand, we develop a novel framework for automatically discovering pairwise constraints that reveal the user's grouping preference. An active learning approach for choosing informative document pairs is designed by measuring the amount of information that can be obtained by revealing judgments of document pairs. For this purpose, three models, namely the uncertainty model, the generation error model, and the term-to-term relationship model, are designed for measuring the informativeness of document pairs from different perspectives. A dependent active learning approach is developed by extending the active learning approach to avoid redundant document pair selection. Two models are investigated for estimating the likelihood that a document pair is redundant to previously selected document pairs, namely the KL divergence model and the symmetric model. / Huang, Ruizhang. / Adviser: Wai Lam. / Source: Dissertation Abstracts International, Volume: 70-06, Section: B, page: 3600. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2008. / Includes bibliographical references (leaves 117-123). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstracts in English and Chinese. / School code: 1307. / Cluster analysis--Computer programs; Document clustering; Text processing (Computer science)
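The abstract above names a KL divergence model and a symmetric model but does not define them. As a rough illustration only, the sketch below computes the KL divergence between two smoothed term distributions and a symmetrized variant. The function names, the add-alpha smoothing, and the averaging used for the symmetric form are all assumptions made for this example, not the thesis's formulation.

```python
import math
from collections import Counter


def term_distribution(tokens, vocabulary, alpha=1.0):
    """Smoothed unigram distribution over a fixed vocabulary (add-alpha smoothing, assumed)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {t: (counts[t] + alpha) / total for t in vocabulary}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same vocabulary."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)


def symmetric_divergence(p, q):
    """Average of KL(p || q) and KL(q || p); one common way to symmetrize (assumed)."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```

Smoothing keeps every probability strictly positive, so the logarithm is always defined. A small divergence between a candidate and an already selected item could then be read as a sign of redundancy, though how the thesis turns such a score into a likelihood is not stated in the abstract.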
22	Geometric and topological approaches to semantic text retrieval. / CUHK electronic theses & dissertations collection / January 2007 / In the first part of this thesis, we present a new understanding of the latent semantic space of a dataset from the dual perspective, which relaxes the above assumed conditions and leads naturally to a unified kernel function for a class of vector space models. New semantic analysis methods based on the unified kernel function are developed, which combine the advantages of LSI and GVSM. We also show that the new methods are stable with respect to the choice of rank, i.e., even if the selected rank is quite far from the optimal one, the retrieval performance does not degrade much. The experimental results of our methods on the standard test sets are promising. / In the second part of this thesis, we propose that the mathematical structure of simplexes can be attached to a term-document matrix in the vector-space model (VSM) for information retrieval. The Q-analysis devised by R. H. Atkin may then be applied to analyze the topological structure of the simplexes and their corresponding dataset. Experimental results of this analysis reveal a correlation between the effectiveness of LSI and the topological structure of the dataset. Using the information obtained from the topological analysis, we develop a new query expansion method. Experimental results show that our method can enhance the performance of VSM for datasets over which LSI is not effective. Finally, the notion of homology is introduced to the topological analysis of datasets, and its possible relation to word sense disambiguation is studied through a simple example. / With the vast amount of textual information available today, the task of designing effective and efficient retrieval methods becomes more important and complex. The Basic Vector Space Model (BVSM) is well known in information retrieval. Unfortunately, it cannot retrieve all relevant documents since it is based on literal term matching. The Generalized Vector Space Model (GVSM) and Latent Semantic Indexing (LSI) are two well-known semantic retrieval methods, in which some underlying latent semantic structures in the dataset are assumed. However, their assumptions about where the semantic structure is located are rather strong. Moreover, the performance of LSI can differ greatly across datasets, and the questions of which characteristics of a dataset contribute to this difference, and why, are not fully understood. The present thesis focuses on providing answers to these two questions. / Li, Dandan. / "August 2007." / Adviser: Chung-Ping Kwong. / Source: Dissertation Abstracts International, Volume: 69-02, Section: B, page: 1108. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2007. / Includes bibliographical references (p. 118-120). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract in English and Chinese. / School code: 1307. / Information retrieval; Semantics--Data processing; Text processing (Computer science)
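The limitation of literal term matching mentioned in the abstract above can be seen on a toy collection. The sketch below compares a basic vector space match with a simplified GVSM-style match, in which each text is projected onto document space through the term-document matrix so that terms co-occurring in the same documents become related. The helper names and the three-document collection are hypothetical, and this textbook-style GVSM simplification is not the unified kernel developed in the thesis.

```python
import math


def build_term_rows(docs):
    """Map each term to its row of the term-document matrix (raw counts)."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    return {t: [tokens.count(t) for tokens in tokenized] for t in vocab}


def bvsm_vector(text, term_rows):
    """Literal term-count vector over the vocabulary."""
    tokens = text.lower().split()
    return [tokens.count(t) for t in term_rows]


def gvsm_vector(text, term_rows):
    """Sum the term-document rows of the text's terms (a projection onto document space)."""
    n_docs = len(next(iter(term_rows.values())))
    vec = [0.0] * n_docs
    for t in text.lower().split():
        for j, count in enumerate(term_rows.get(t, [])):
            vec[j] += count
    return vec


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = ["car engine", "automobile engine", "banana fruit"]
rows = build_term_rows(docs)
query = "car"
literal = cosine(bvsm_vector(query, rows), bvsm_vector(docs[1], rows))
generalized = cosine(gvsm_vector(query, rows), gvsm_vector(docs[1], rows))
```

Here `literal` is zero because "car" never appears in the second document, while `generalized` is positive because "car" and "automobile" both co-occur with "engine" somewhere in the collection.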
23	New learning strategies for automatic text categorization. / January 2001 / Lai Kwok-yin. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2001. / Includes bibliographical references (leaves 125-130). / Abstracts in English and Chinese. / Chapter 1 --- Introduction / Chapter 1.1 --- Automatic Textual Document Categorization / Chapter 1.2 --- Meta-Learning Approach For Text Categorization / Chapter 1.3 --- Contributions / Chapter 1.4 --- Organization of the Thesis / Chapter 2 --- Related Work / Chapter 2.1 --- Existing Automatic Document Categorization Approaches / Chapter 2.2 --- Existing Meta-Learning Approaches For Information Retrieval / Chapter 2.3 --- Our Meta-Learning Approaches / Chapter 3 --- Document Pre-Processing / Chapter 3.1 --- Document Representation / Chapter 3.2 --- Classification Scheme Learning Strategy / Chapter 4 --- Linear Combination Approach / Chapter 4.1 --- Overview / Chapter 4.2 --- Linear Combination Approach - The Algorithm / Chapter 4.2.1 --- Equal Weighting Strategy / Chapter 4.2.2 --- Weighting Strategy Based On Utility Measure / Chapter 4.2.3 --- Weighting Strategy Based On Document Rank / Chapter 4.3 --- Comparisons of Linear Combination Approach and Existing Meta-Learning Methods / Chapter 4.3.1 --- LC versus Simple Majority Voting / Chapter 4.3.2 --- LC versus BORG / Chapter 4.3.3 --- LC versus Restricted Linear Combination Method / Chapter 5 --- The New Meta-Learning Model - MUDOF / Chapter 5.1 --- Overview / Chapter 5.2 --- Document Feature Characteristics / Chapter 5.3 --- Classification Errors / Chapter 5.4 --- Linear Regression Model / Chapter 5.5 --- The MUDOF Algorithm / Chapter 6 --- Incorporating MUDOF into Linear Combination approach / Chapter 6.1 --- Background / Chapter 6.2 --- Overview of MUDOF2 / Chapter 6.3 --- Major Components of the MUDOF2 / Chapter 6.4 --- The MUDOF2 Algorithm / Chapter 7 --- Experimental Setup / Chapter 7.1 --- Document Collection / Chapter 7.2 --- Evaluation Metric / Chapter 7.3 --- Component Classification Algorithms / Chapter 7.4 --- Categorical Document Feature Characteristics for MUDOF and MUDOF2 / Chapter 8 --- Experimental Results and Analysis / Chapter 8.1 --- Performance of Linear Combination Approach / Chapter 8.2 --- Performance of the MUDOF Approach / Chapter 8.3 --- Performance of MUDOF2 Approach / Chapter 9 --- Conclusions and Future Work / Chapter 9.1 --- Conclusions / Chapter 9.2 --- Future Work / Chapter A --- Details of Experimental Results for Reuters-21578 corpus / Chapter B --- Details of Experimental Results for OHSUMED corpus / Bibliography / Text processing (Computer science); Computer algorithms
24	Language and representation : the recontextualisation of participants, activities and reactions / Van Leeuwen, Theo / January 1993 / Doctor of Philosophy / This thesis proposes a model for the description of social practice which analyses social practices into the following elements: (1) the participants of the practice; (2) the activities which constitute the practice; (3) the performance indicators which stipulate how the activities are to be performed; (4) the dress and body grooming for the participants; (5) the times when, and (6) the locations where, the activities take place; (7) the objects, tools and materials required for performing the activities; and (8) the eligibility conditions for the participants and their dress, the objects, and the locations, that is, the characteristics these elements must have to be eligible to participate in, or be used in, the social practice. / Applied linguistics. Language and culture. Sociolinguistics. Text processing (Computer science)
25	Contextual Advertising Online / Pettersson, Jimmie / January 2008 / The internet advertising market is growing much faster than any other advertising vertical. The technology for serving advertising online is moving increasingly toward automated processes that analyze the page content and the user’s preferences and then match the ads with these parameters. / The task at hand was to research and find methods suitable for matching web documents to ads automatically, build a prototype system, evaluate it and suggest areas for further development. The goals of the system were high throughput, accurate ad matching and fast response times. For the system to be scalable, human input was to be required only when adding ads to the system. / The prototype system is based on the vector space model and a tf-idf weighting scheme. The cosine coefficient was used in the system to quantify the similarity between a web document and an ad. / A technique called stemming was also implemented in the system, together with a clustering solution that aided the ad matching in cases where few matches could be made on the keywords attached to the ads. The system was built with a threaded structure to improve throughput and scalability. / The test results show that ads can be matched accurately to a website’s content using the vector space model and the cosine coefficient. The tests also show that stemming has a positive effect on the ad matching accuracy. / Advertising contextual online text processing Information technology Informationsteknik
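The matching step described in the abstract above, tf-idf weighting in a vector space model with cosine similarity between a web document and an ad, can be sketched as follows. This is a minimal illustration under stated assumptions, not the prototype itself: the function names, the whitespace tokenizer, and the smoothed idf formula are choices made for this example, and stemming, clustering and threading are omitted.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a fuller system would also apply stemming."""
    return text.lower().split()


def tf_idf_vectors(docs):
    """Return one sparse tf-idf vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf (assumed variant): stays positive even for terms in every document.
        vectors.append(
            {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1.0) for t in tf}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_ads(page, ads):
    """Return (score, ad_index) pairs sorted from best to worst match."""
    vectors = tf_idf_vectors([page] + ads)
    page_vec, ad_vecs = vectors[0], vectors[1:]
    return sorted(((cosine(page_vec, v), i) for i, v in enumerate(ad_vecs)), reverse=True)
```

Calling `rank_ads(page_text, list_of_ad_texts)` yields the ads ordered by similarity to the page. Computing idf over the page and the ads together keeps the sketch short; a production system would more likely precompute ad vectors against a fixed corpus.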
26	A method for finding common attributes in hetrogenous DoD databases / Zobair, Hamza A. / January 2004 / Thesis (M.S. in Software Engineering)--Naval Postgraduate School, June 2004. / Thesis advisor(s): Valdis Berzins. Includes bibliographical references (p. 179). Also available online.
27	Unsupervised partial parsing / Ponvert, Elias Franchot / 25 October 2011 / The subject of this thesis is the problem of learning to discover grammatical structure from raw text alone, without access to explicit instruction or annotation, in particular by a computer or computational process; in other words, unsupervised parser induction, or simply unsupervised parsing. This work presents a method for raw text unsupervised parsing that is simple but nevertheless achieves state-of-the-art results on treebank-based direct evaluation. The approach to unsupervised parsing presented in this dissertation constrains learned models differently from previous work. Specifically, I focus on a sub-task of full unsupervised parsing called unsupervised partial parsing. In essence, the strategy is to learn to segment a string of tokens into a set of non-overlapping constituents or chunks, each of which may be one or more tokens in length. This strategy has a number of advantages: it is fast and scalable, it is based on well-understood and extensible natural language processing techniques, and it produces predictions about human language structure which are useful for human language technologies. The models developed for unsupervised partial parsing recover base noun phrases and local constituent structure with high accuracy compared to strong baselines. Finally, these models may be applied in a cascaded fashion for the prediction of full constituent trees: first segmenting a string of tokens into local phrases, then re-segmenting to predict higher-level constituent structure. This simple strategy leads to an unsupervised parsing model which produces state-of-the-art results for constituent parsing of English, German and Chinese. This thesis presents, evaluates and explores these models and strategies. / text Computational linguistics Natural language processing Unsupervised Parsing Chunking Text processing
28	Latent semantic sentence clustering for multi-document summarization / Geiss, Johanna / January 2011 / No description available. / 004
29	The design considerations for display oriented proportional text editors using bit-mapped graphics display systems / Ganguli, Nitu. / January 1987 / No description available. / Text editors (Computer programs); Text processing (Computer science)
30	Learning to Read Bushman: Automatic Handwriting Recognition for Bushman Languages / Williams, Kyle / 01 January 2012 / The Bleek and Lloyd Collection contains notebooks that document the tradition, language and culture of the Bushman people who lived in South Africa in the late 19th century. Transcriptions of these notebooks would allow for the provision of services such as text-based search and text-to-speech. However, these notebooks are currently only available in the form of digital scans, and the manual creation of transcriptions is a costly and time-consuming process. Thus, automatic methods could serve as an alternative approach to creating transcriptions of the text in the notebooks. In order to evaluate the use of automatic methods, a corpus of Bushman texts and their associated transcriptions was created. The creation of this corpus involved: the development of a custom method for encoding the Bushman script, which contains complex diacritics; the creation of a tool for creating and transcribing the texts in the notebooks; and the running of a series of workshops in which the tool was used to create the corpus. The corpus was used to evaluate various techniques for automatically transcribing its texts in order to determine which approaches were best suited to the complex Bushman script. These techniques included the use of Support Vector Machines, Artificial Neural Networks and Hidden Markov Models as machine learning algorithms, which were coupled with different descriptive features. The effect of the texts used for training the machine learning algorithms was also investigated, as was the use of a statistical language model. It was found that, for Bushman word recognition, the use of a Support Vector Machine with Histograms of Oriented Gradient features resulted in the best performance and, for Bushman text line recognition, Marti & Bunke features resulted in the best performance when used with Hidden Markov Models. The automatic transcription of the Bushman texts proved to be difficult, and the performance of the different recognition systems was largely affected by the complexities of the Bushman script. It was also found that, besides influencing which techniques may be the most appropriate for automatic handwriting recognition, the texts used in an automatic handwriting recognition system also play a large role in determining whether or not automatic recognition should be attempted at all. / I.7 DOCUMENT AND TEXT PROCESSING; H.3 INFORMATION STORAGE AND RETRIEVAL
