Global ETD Search

1	Probabilistic model-based clustering of complex data Zhong, Shi 28 August 2008 Not available / text Cluster analysis--Computer programs
2	Semi-supervised document clustering with active learning. / CUHK electronic theses & dissertations collection January 2008 Most existing semi-supervised document clustering approaches are model-based and can be treated as parametric models that assume the underlying clusters follow a certain pre-defined distribution. In our semi-supervised document clustering, each cluster is represented by a non-parametric probability distribution. Two approaches are designed for incorporating pairwise constraints into document clustering. The first, the term-to-term relationship approach (TR), uses pairwise constraints to capture term-to-term dependence relationships. The second, the linear combination approach (LC), combines the clustering objective function with the user-provided constraints linearly. Extensive experimental results show that our proposed framework is effective. / This thesis presents a new framework for automatically partitioning text documents that takes into consideration constraints given by users. Semi-supervised document clustering is developed based on pairwise constraints. Unlike traditional semi-supervised document clustering approaches, which assume pairwise constraints are prepared by the user beforehand, we develop a novel framework for automatically discovering pairwise constraints that reveal the user's grouping preference. An active learning approach for choosing informative document pairs is designed by measuring the amount of information that can be obtained by revealing judgments of document pairs. For this purpose, three models, namely the uncertainty model, the generation error model, and the term-to-term relationship model, are designed for measuring the informativeness of document pairs from different perspectives. A dependent active learning approach is developed by extending the active learning approach to avoid redundant document pair selection. 
Two models are investigated for estimating the likelihood that a document pair is redundant to previously selected document pairs, namely the KL divergence model and the symmetric model. / Huang, Ruizhang. / Adviser: Wai Lam. / Source: Dissertation Abstracts International, Volume: 70-06, Section: B, page: 3600. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2008. / Includes bibliographical references (leaves 117-123). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstracts in English and Chinese. / School code: 1307. Cluster analysis--Computer programs Document clustering Text processing (Computer science)
3	Automatic text summarization using lexical chains : algorithms and experiments Kolla, Maheedhar, University of Lethbridge. Faculty of Arts and Science January 2004 Summarization is a complex task that requires understanding of the document content to determine the importance of the text. Lexical cohesion is a method to identify connected portions of the text based on the relations between the words in the text. Lexical cohesive relations can be represented using lexical chains. Lexical chains are sequences of semantically related words spread over the entire text. Lexical chains are used in a variety of Natural Language Processing (NLP) and Information Retrieval (IR) applications. In this thesis, we propose a lexical chaining method that includes the glossary relations in the chaining process. These relations enable us to identify topically related concepts, for instance dormitory and student, and thereby enhance the identification of cohesive ties in the text. We then present methods that use the lexical chains to generate summaries by extracting sentences from the document(s). Headlines are generated by filtering out the portions of the extracted sentences that do not contribute to the meaning of the sentence. The generated headlines can be used in real-world applications to skim through the document collections in a digital library. Multi-document summarization is gaining demand with the explosive growth of online news sources. It requires identification of the several themes present in the collection to attain good compression and avoid redundancy. In this thesis, we propose methods to group the portions of the texts of a document collection into meaningful clusters. Clustering enables us to extract the various themes of the document collection. Sentences from clusters can then be extracted to generate a summary for the multi-document collection. Clusters can also be used to generate summaries with respect to a given query. 
We designed a system to compute lexical chains for the given text and use them to extract the salient portions of the document. Some specific tasks considered are: headline generation, multi-document summarization, and query-based summarization. Our experimental evaluation shows that efficient summaries can be extracted for the above tasks. / viii, 80 leaves : ill. ; 29 cm. Dissertations, Academic Automatic abstracting Cluster analysis -- Computer programs Computational linguistics
4	Robust clustering algorithms Gupta, Pramod 05 April 2011 One of the most widely used techniques for data clustering is agglomerative clustering. Such algorithms have long been used across many different fields, ranging from computational biology to social sciences to computer vision, in part because they are simple and their output is easy to interpret. However, many of these algorithms lack any performance guarantees when the data is noisy, incomplete, or has outliers, which is the case for most real-world data. It is well known that standard linkage algorithms perform extremely poorly in the presence of noise. In this work we propose two new robust algorithms for bottom-up agglomerative clustering and give formal theoretical guarantees for their robustness. We show that our algorithms can be used to cluster accurately in cases where the data satisfies a number of natural properties and where the traditional agglomerative algorithms fail. We also extend our algorithms to an inductive setting with similar guarantees, in which we randomly choose a small subset of points from a much larger instance space, generate a hierarchy over this sample, and then insert the rest of the points into it to generate a hierarchy over the entire instance space. We then carry out a systematic experimental analysis of various linkage algorithms, compare their performance on a variety of real-world data sets, and show that our algorithms handle various forms of noise much better than other hierarchical algorithms. Robust algorithms Hierarchical clustering Unsupervised learning Clustering Machine learning Cluster analysis Cluster analysis Computer programs Algorithms
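As a point of reference for the abstract above, the following is a minimal sketch of standard single-linkage agglomerative clustering on one-dimensional points. It is not one of the robust algorithms proposed in the thesis; the function name `single_linkage` and the toy data are illustrative assumptions. The example suggests how a single outlier can cause a standard linkage method to merge two well-separated groups.

```python
def single_linkage(points, num_clusters):
    """Bottom-up clustering of 1-D points with single linkage.

    Hypothetical helper for illustration only: starts with one cluster
    per point and repeatedly merges the two clusters whose closest
    members are nearest to each other.
    """
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Two well-separated groups are recovered when the data is clean.
clean = [0, 1, 2, 10, 11, 12]
clean_result = single_linkage(clean, 2)

# One distant outlier takes a cluster for itself, so the two real
# groups end up merged into a single cluster.
noisy = clean + [100]
noisy_result = single_linkage(noisy, 2)
```

With the clean data the two groups come back as separate clusters; with the outlier added, the outlier forms its own cluster and the two groups are merged.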
5	Automatic text summarization in digital libraries Mlynarski, Angela, University of Lethbridge. Faculty of Arts and Science January 2006 A digital library is a collection of services and information objects for storing, accessing, and retrieving digital objects. Automatic text summarization presents salient information in a condensed form suitable for user needs. This thesis amalgamates digital libraries and automatic text summarization by extending the Greenstone Digital Library software suite to include the University of Lethbridge Summarizer. The tool generates summaries, nouns, and noun phrases for use as metadata for searching and browsing digital collections. Digital collections of newspapers, PDFs, and eBooks were created with summary metadata. PDF documents were processed the fastest at 1.8 MB/hr, followed by the newspapers at 1.3 MB/hr, with eBooks being the slowest at 0.9 MB/hr. Qualitative analysis of four genres (newspaper, M.Sc. thesis, novel, and poetry) revealed that narrative newspapers were most suitable for automatically generated summarization. The other genres suffered from incoherence and information loss. Overall, summaries for digital collections are suitable when used with newspaper documents and unsuitable for other genres. / xiii, 142 leaves ; 28 cm. Dissertations, Academic Digital libraries Automatic abstracting Cluster analysis -- Computer programs Computational linguistics
6	Building a high-resolution scalable visualization wall Li, Zhenni, Carlisle, W. Homer. January 2006 Thesis (M.S.)--Auburn University, 2006. / Abstract. Vita. Includes bibliographic references (p. 56-59).
7	A Robust Data Obfuscation Technique for Privacy Preserving Collaborative Filtering Parameswaran, Rupa 10 May 2006 Privacy is defined as the freedom from unauthorized intrusion. The availability of personal information through online databases, such as government records, medical records, and voters' lists, poses a threat to personal privacy. The concern over individual privacy has led to the development of legal codes for safeguarding privacy in several countries. However, the ignorance of individuals and loopholes in the systems have led to information breaches even in the presence of such rules and regulations. Protecting data privacy requires modification of the data itself. The term data obfuscation is used to refer to the class of algorithms that modify the values of the data items without distorting the usefulness of the data. The main goal of this thesis is the development of a data obfuscation technique that provides robust privacy protection with minimal loss in usability of the data. Although medical and financial services are two of the major areas where information privacy is a concern, privacy breaches are not restricted to these domains. One of the areas where the concern over data privacy is of growing interest is collaborative filtering. Collaborative filtering systems are widely used in e-commerce applications to provide recommendations to users regarding products that might be of interest to them. The prediction accuracy of these systems depends on the size and accuracy of the data provided by users. However, the lack of sufficient guidelines governing the use and distribution of user data raises concerns over individual privacy. Users often provide only the minimal information required for accessing these e-commerce services. The lack of rules governing the use and distribution of data prevents the sharing of data among different communities for collaborative filtering. 
The goals of this thesis are (a) the definition of a standard for classifying data obfuscation (DO) techniques, (b) the development of a robust cluster-preserving data obfuscation algorithm, and (c) the design and implementation of a privacy-preserving shared collaborative filtering framework using the data obfuscation algorithm. Data privacy Data obfuscation Data mining Collaborative filtering Database security Data mining Cluster analysis Computer programs
8	A novel framework for binning environmental genomic fragments Yang, Bin, 杨彬 January 2010 published_or_final_version / Computer Science / Master / Master of Philosophy Genomics - Data processing. Genomes - Data processing. Microbial ecology - Data processing. Cluster analysis - Data processing. Cluster analysis - Computer programs.
9	Approximation algorithms for a graph-cut problem with applications to a clustering problem in bioinformatics Choudhury, Salimur Rashid, University of Lethbridge. Faculty of Arts and Science January 2008 Clusters in protein interaction networks can potentially help identify functional relationships among proteins. We study the clustering problem by modeling it as a graph cut problem. Given an edge-weighted graph, the goal is to partition the graph into a prescribed number of subsets obeying some capacity constraints, so as to maximize the total weight of the edges that are within a subset. Identification of a dense subset might shed some light on the biological function of all the proteins in the subset. We study integer programming formulations and exhibit large integrality gaps for various formulations. This is indicative of the difficulty in obtaining constant factor approximation algorithms using the primal-dual schema. We propose three approximation algorithms for the problem. We evaluate the algorithms on the database of interacting proteins and on randomly generated graphs. Our experiments show that the algorithms are fast and have good performance ratios in practice. / xiii, 71 leaves : ill. ; 29 cm. Cluster analysis -- Computer programs Proteins -- Research -- Data processing Graph algorithms Bioinformatics Dissertations, Academic Electronic dissertations
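The objective described in the abstract above can be illustrated with a small sketch. It assumes that a capacity constraint means an upper bound on subset size, which the abstract does not specify, and it uses exhaustive search on a toy graph rather than any of the thesis's approximation algorithms. All function names and data are hypothetical.

```python
from collections import Counter
from itertools import product


def intra_subset_weight(edges, assignment):
    """Total weight of edges whose endpoints fall in the same subset."""
    return sum(w for u, v, w in edges if assignment[u] == assignment[v])


def is_feasible(assignment, capacity):
    """Assumed capacity rule: no subset holds more than `capacity` nodes."""
    sizes = Counter(assignment.values())
    return all(size <= capacity for size in sizes.values())


def brute_force_partition(nodes, edges, num_subsets, capacity):
    """Try every labeling of the nodes; only viable for tiny graphs."""
    best_weight, best_assignment = None, None
    for labels in product(range(num_subsets), repeat=len(nodes)):
        assignment = dict(zip(nodes, labels))
        if not is_feasible(assignment, capacity):
            continue
        weight = intra_subset_weight(edges, assignment)
        if best_weight is None or weight > best_weight:
            best_weight, best_assignment = weight, assignment
    return best_weight, best_assignment


# Toy interaction graph: two strongly connected pairs, weakly linked.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b", 5), ("c", "d", 4), ("a", "c", 1), ("b", "d", 1)]
best_weight, best_assignment = brute_force_partition(
    nodes, edges, num_subsets=2, capacity=2
)
```

On this toy graph the best feasible partition groups `a` with `b` and `c` with `d`. Exhaustive search grows exponentially with the number of nodes, which is one reason approximation algorithms are of interest for this kind of problem.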
