Global ETD Search

11	Development of Personalized Document Clustering Technique for Accommodating Hierarchical Categorization Preferences Lee, Kuan-yi 27 July 2006 (has links) With the advances in information and networking technologies and the proliferation of e-commerce and knowledge management applications, individuals and organizations generate and acquire a tremendous amount of online information that is typically available as textual documents. To manage the ever-increasing volume of documents, an individual or organization frequently organizes his/her documents into a set or hierarchy of categories in order to facilitate document management and subsequent information access and browsing. Furthermore, document clustering is an intentional act that reflects individual preferences with regard to the semantic coherency and relevant categorization of documents. Hence, effective document-clustering must consider individual preferences for supporting personalization in document categorization and should be capable of organizing documents into a category hierarchy. However, document-clustering research traditionally has been anchored in analyses of document content. As a consequence, most existing document-clustering techniques are not tailored to individuals' preferences and therefore are unable to facilitate personalization. On the other hand, existing document-clustering techniques generally are designed to generate from a document collection a set of document clusters rather than a hierarchy of document clusters. In response, we develop in this study a hierarchical personalized document-clustering (HPEC) technique that takes into account an individual's folder hierarchy representing the individual's categorization preferences and produces document-clusters in a hierarchical structure for the target individual. 
Our empirical evaluation results suggest that the proposed HPEC technique outperformed its benchmark technique (i.e., HAC+P) in cluster recall while maintaining the same level of cluster precision and location discrepancy as its benchmark technique did. Hierarchical document management Personalized document clustering Text mining Personalization Document clustering
12	Semi-supervised document clustering with active learning. / CUHK electronic theses & dissertations collection January 2008 (has links) Most existing semi-supervised document clustering approaches are model-based and can be treated as parametric models that assume the underlying clusters follow a certain pre-defined distribution. In our semi-supervised document clustering, each cluster is represented by a non-parametric probability distribution. Two approaches are designed for incorporating pairwise constraints in the document clustering approach. The first approach, the term-to-term relationship approach (TR), uses pairwise constraints for capturing term-to-term dependence relationships. The second approach, the linear combination approach (LC), combines the clustering objective function with the user-provided constraints linearly. Extensive experimental results show that our proposed framework is effective. / This thesis presents a new framework for automatically partitioning text documents, taking into consideration constraints given by users. Semi-supervised document clustering is developed based on pairwise constraints. Different from traditional semi-supervised document clustering approaches, which assume pairwise constraints to be prepared by the user beforehand, we develop a novel framework for automatically discovering pairwise constraints revealing the user's grouping preference. An active learning approach for choosing informative document pairs is designed by measuring the amount of information that can be obtained by revealing judgments of document pairs. For this purpose, three models, namely, the uncertainty model, the generation error model, and the term-to-term relationship model, are designed for measuring the informativeness of document pairs from different perspectives. A dependent active learning approach is developed by extending the active learning approach to avoid redundant document pair selection. 
Two models are investigated for estimating the likelihood that a document pair is redundant to previously selected document pairs, namely, KL divergence model and symmetric model. / Huang, Ruizhang. / Adviser: Wai Lam. / Source: Dissertation Abstracts International, Volume: 70-06, Section: B, page: 3600. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2008. / Includes bibliographical references (leaves 117-123). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstracts in English and Chinese. / School code: 1307. Cluster analysis--Computer programs Document clustering Text processing (Computer science)
13	Search Term Selection and Document Clustering for Query Suggestion Zhang, Xiaomin 06 1900 (has links) In order to improve a user's query and help the user quickly satisfy his/her information need, most search engines provide query suggestions that are meant to be relevant alternatives to the user's query. This thesis builds on the query suggestion system and evaluation methodology described in Shen Jiang's Masters thesis (2008). Jiang's system constructs query suggestions by searching for lexical aliases of web documents and then applying query search to the lexical aliases. A lexical alias for a web document is a list of terms that return the web document in a top-ranked position. Query search is a search process that finds useful combinations of search terms. The main focus of this thesis is to supply alternatives for the components of Jiang's system. We suggest three term scoring mechanisms and generalize Jiang's lexical alias search to be a general search for terms that are useful for constructing good query suggestions. We also replace Jiang's top-down query search by a bottom-up beam search method. We experimentally show that our query suggestion method improves Jiang's system by 30% for short queries and 90% for long queries using Jiang's evaluation method. In addition, we add new evidence supporting Jiang's conclusion that terms in the user's initial query terms are important to include in the query suggestions. In addition, we explore the usefulness of document clustering in creating query suggestions. Our experimental results are the opposite of what we expected: query suggestion based on clustering does not perform nearly as well, in terms of the "coverage" scores we are using for evaluation, as our best method that is not based on document clustering.
14	A clustering scheme for large high-dimensional document datasets Chen, Jing-wen 09 August 2007 (has links) Document clustering methods are attracting more and more attention. Because of the high dimensionality and the large volume of data, clustering methods usually need a lot of time to compute. We propose a scheme to make a clustering algorithm much faster than the original. We partition the whole dataset into several parts. First, we use one of these parts for clustering. Then, according to the labels obtained from clustering, we reduce the number of features by a certain ratio. We add another part of the data, convert these data to the lower dimension, and cluster them again. We repeat this until all partitions are used. According to the experimental results, this scheme may run twice as fast as the original clustering method. Dimension reduction high-dimensional data clustering text mining Document clustering
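The feature-reduction step in the abstract above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how features are ranked, so the scoring rule (spread of per-cluster means) and the names `reduce_features`, `project`, and `keep_ratio` are hypothetical, not the author's method.

```python
def reduce_features(rows, labels, keep_ratio):
    """Pick the features that best separate the current clusters.

    rows: equal-length numeric feature vectors, one per document.
    labels: cluster label of each row from the previous clustering pass.
    keep_ratio: fraction of features to keep (assumed parameter).
    The scoring rule below is an assumption; the abstract does not specify one.
    """
    n_features = len(rows[0])
    clusters = sorted(set(labels))
    scores = []
    for j in range(n_features):
        # Mean value of feature j inside each cluster.
        means = []
        for c in clusters:
            vals = [row[j] for row, lab in zip(rows, labels) if lab == c]
            means.append(sum(vals) / len(vals))
        grand = sum(means) / len(means)
        # Features whose cluster means differ a lot score high.
        scores.append(sum((m - grand) ** 2 for m in means))
    n_keep = max(1, int(n_features * keep_ratio))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:n_keep])


def project(rows, kept):
    """Convert documents to the lower dimension by keeping only chosen features."""
    return [[row[j] for j in kept] for row in rows]


first_part = [[5, 1, 0], [4, 1, 0], [0, 1, 3], [0, 1, 4]]
first_labels = [0, 0, 1, 1]          # labels from clustering the first part
kept = reduce_features(first_part, first_labels, keep_ratio=0.7)
next_part = project([[7, 1, 2]], kept)  # next partition, converted before re-clustering
```

In a full loop, the projected data would be clustered again and the reduction repeated for each remaining partition.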
15	Detecting Weak Signals by Internet-Based Environmental Scanning Tabatabaei, Nasim January 2011 (has links) Firms in highly dynamic environments focusing on innovation in their products and services often encounter elevated amounts of uncertainty regarding the future direction of technological change. Finding reliable and imbedded information enhances a firm’s ability to tackle new markets and take advantage of possible hidden opportunities. To reduce uncertainty, obtain hidden knowledge, and gain competitive advantage, environmental scanning, which is one of the main components of foresight, is recommended by scholars of strategic management. The process of detecting weak signals for shedding light on what one authority calls “blurry future zones” (Day & Schoemaker, 2005, p.1) has currently been receiving attention in environmental scanning studies. Some studies emphasize the importance of the subject; yet they offer few practical methodologies for actual cases. To help address this gap, this research introduces a new approach for detecting weak signals during Internet-based environmental scanning by applying the Cluto toolkit (see Section 4.7) plus using human judgment. This novel methodology is applied to the application of Micro Tiles, a recent innovative product of a digital display company located in Ontario, Canada, Christie Digital Company. In the conduct of this exploratory research, about 40,000 HTML pages were retrieved from the Internet in a search during 2009. To extract weak signals information from the retrieved unstructured texts, documents were grouped into a number of clusters by the CLUTO software. Two subject matter experts compared and evaluated the cluster results for the purpose of finding potentially relevant information in regard to the company’s strategic intent. 
Analyzing the clusters, the experts reduced the number of clustered documents from the original corpus into smaller sets with the goal of finding more relevant and unexpected documents (weak signals). The relevancy and expectedness of information in documents were two measurements as related to weak signals. The trends of the study indicate that as anticipated both experts found more unexpected documents in the smaller sets rather than the larger ones. Moreover, regarding one expert’s analysis, the smaller sets contain documents that are more relevant to the domain of interest. Overall, according to one expert, documents existing in the smaller sets display more weak signals. This emerging methodology offers a practical procedure to apply web-based information in the development of a company’s environmental scanning procedures. Using this methodology, managers can employ both computer tools and human sense-making methods to detect potential weak signals and reduce certain biases in the detection process. environmental scanning foresight weak signals document clustering CLUTO Management Sciences
16	Evolutionary Approach for Supporting Document Category Hierarchy Management Wu, Ming-jung 02 February 2004 (has links) Observations of textual document management by individuals and organizations have suggested the popularity of using categories (e.g., folders) to organize, archive and access documents. Document grouping is an intentional act, reflecting a user's preferential perspective on semantic coherency or relevant groupings between subjects. Although becoming less adequate as new documents are accumulated, the existing category set or hierarchy may preserve to some extent the user's preferential perspective on document grouping. Thus, when deriving a new category set or hierarchy, the category set or hierarchy previously established by the user (i.e., semantic coherency of the documents embedded in the existing category set or category hierarchy) should be taken into consideration. In this study, we have proposed an evolution-based technique, Category Hierarchy Evolution (CHE), for managing category hierarchy rather than category set. Specifically, in CHE, the overall similarity between two documents is measured not only by their content similarity but also by their location similarity in the existing category hierarchy. Our empirical evaluation results suggest that the proposed CHE technique outperformed the discovery-based technique (i.e., the traditional content-based document-clustering technique). Data Mining Category Evolution Document Clustering Hierarchical Categorization
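The idea of combining content similarity with location similarity, as described in the abstract above, might look like the sketch below. The abstract does not give the formulas, so the shared-prefix location measure, the weighted sum, and the `alpha` weight are all assumptions made for illustration, not the CHE definition.

```python
def location_similarity(path_a, path_b):
    """Assumed measure: share of leading folders two category paths have in common."""
    shared = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        shared += 1
    longest = max(len(path_a), len(path_b))
    # Two documents at the root (empty paths) are treated as co-located.
    return shared / longest if longest else 1.0


def overall_similarity(content_sim, path_a, path_b, alpha=0.5):
    """Assumed combination: weighted sum of content and location similarity.

    content_sim: a precomputed content similarity in [0, 1].
    alpha: hypothetical weight on content; (1 - alpha) goes to location.
    """
    return alpha * content_sim + (1 - alpha) * location_similarity(path_a, path_b)


# Two documents with modest content overlap filed under sibling folders.
score = overall_similarity(0.4, ("work", "reports"), ("work", "memos"))
```

With a weighting like this, documents the user already filed close together stay close even when their wording differs, which is the preference-preserving effect the abstract describes.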
17	A Novelty-based Clustering Method for On-line Documents Khy, Sophoin, Ishikawa, Yoshiharu, Kitagawa, Hiroyuki January 2007 (has links) No description available. document clustering forgetting factor incremental processing novelty on-line documents
18	Individualiai klasifikuotų dokumentų klasterizavimo metodas / Clustering Method for Personally Classified Documents Žalinauskas, Marius 22 May 2006 (has links) Traditional clustering methods, where documents are represented by term frequency vectors, are not very suitable for Lithuanian document clustering, as there is no freely available morphological analyzer or stemmer to make compact term dictionaries. It is still possible to cluster Lithuanian documents using loose term dictionaries, but as Lithuanian is a highly synthetic language, a significant increase in resources and possibly inaccurate or distorted results must be taken into account. In this master thesis a clustering method for personally classified documents is developed to overcome the shortcomings of traditional document clustering stated above. In the new method, documents are represented by tag frequency vectors, pair-wise similarities are measured by the cosine coefficient, and clustering itself is performed using an experimentally selected bisecting K‑means algorithm. Experiments comparing the developed method with traditional document clustering using a loose term dictionary showed that the former copes better with large document collections and/or a large number of clusters. At the same time, subjective clustering estimation showed that even when the new method demonstrates larger entropy and lower purity values, it still surpasses the traditional method in how sensible its clusters are. Informatics Klasterizavimas Document clustering Žymės Tags Dokumentų klasterizavimas
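The representation described in the abstract above, tag frequency vectors compared by the cosine coefficient, can be sketched as follows. The example tags and the function names are made up for illustration, and the bisecting K-means step is not shown.

```python
import math
from collections import Counter


def tag_vector(tags):
    """Build a tag frequency vector: how often each tag occurs on one document."""
    return Counter(tags)


def cosine(u, v):
    """Cosine coefficient between two sparse tag frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a document without tags is similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical personally assigned tags for three documents.
doc_a = tag_vector(["python", "clustering", "python"])
doc_b = tag_vector(["python", "clustering"])
doc_c = tag_vector(["cooking"])

sim_ab = cosine(doc_a, doc_b)  # shared tags, so close to 1
sim_ac = cosine(doc_a, doc_c)  # no shared tags, so 0
```

Because the vectors are built from tags rather than word forms, no stemmer or morphological analyzer is needed, which is the motivation the abstract gives for Lithuanian text.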
19	Search Term Selection and Document Clustering for Query Suggestion Zhang, Xiaomin Unknown Date No description available.
