161 |
Event Episode Discovery from Document Sequences: A Temporal-based Approach
Chiang, Yu-Sheng 07 September 2005
Recent advances in information and networking technologies have contributed significantly to global connectivity and greatly facilitated and fostered information creation, distribution, and access. The resultant ever-increasing volume of online textual documents creates an urgent need for new text mining techniques that can intelligently and automatically extract implicit and potentially useful knowledge from these documents for decision support. This research focuses on identifying and discovering event episodes together with their temporal relationships that occur frequently (referred to as evolution patterns in this study) in sequences of documents. The discovery of such evolution patterns can be applied in such domains as knowledge management and used to facilitate existing document management and retrieval techniques (e.g., event tracking). Specifically, we propose and design an evolution pattern (EP) discovery technique for mining evolution patterns from sequences of documents. We experimentally evaluate our proposed EP technique in the context of facilitating event tracking. Measured by miss and false alarm rates, the evolution-pattern supported event-tracking (EPET) technique exhibits better tracking effectiveness than a traditional event-tracking technique. The encouraging performance of the EPET technique demonstrates the potential usefulness of evolution patterns in supporting event tracking and suggests that the proposed EP technique could effectively discover event episodes and evolution patterns in sequences of documents.
|
162 |
Clustering Multilingual Documents: A Latent Semantic Indexing Based Approach
Lin, Chia-min 09 February 2006
Document clustering automatically organizes a document collection into distinct groups of similar documents on the basis of their contents. Most existing document clustering techniques deal with monolingual documents (i.e., documents written in one language). However, with the trend of globalization and advances in Internet technology, an organization or individual often generates or acquires, and subsequently archives, documents in different languages, thus creating the need for multilingual document clustering (MLDC). Motivated by this need, this study designs a Latent Semantic Indexing (LSI) based MLDC technique. Our empirical evaluation results show that the proposed LSI-based multilingual document clustering technique achieves satisfactory clustering effectiveness, measured by both cluster recall and cluster precision.
|
163 |
Development of Personalized Document Clustering Technique for Accommodating Hierarchical Categorization Preferences
Lee, Kuan-yi 27 July 2006
With the advances in information and networking technologies and the proliferation of e-commerce and knowledge management applications, individuals and organizations generate and acquire a tremendous amount of online information that is typically available as textual documents. To manage the ever-increasing volume of documents, an individual or organization frequently organizes his/her documents into a set or hierarchy of categories in order to facilitate document management and subsequent information access and browsing. Furthermore, document clustering is an intentional act that reflects individual preferences with regard to the semantic coherency and relevant categorization of documents. Hence, effective document clustering must consider individual preferences to support personalization in document categorization and should be capable of organizing documents into a category hierarchy. However, document-clustering research traditionally has been anchored in analyses of document content. As a consequence, most existing document-clustering techniques are not tailored to individuals' preferences and therefore are unable to facilitate personalization. In addition, existing document-clustering techniques generally are designed to generate from a document collection a set of document clusters rather than a hierarchy of document clusters. In response, we develop in this study a hierarchical personalized document-clustering (HPEC) technique that takes into account an individual's folder hierarchy representing the individual's categorization preferences and produces document clusters in a hierarchical structure for the target individual. Our empirical evaluation results suggest that the proposed HPEC technique outperformed its benchmark technique (i.e., HAC+P) in cluster recall while maintaining the same level of cluster precision and location discrepancy as the benchmark.
|
164 |
Cross-Lingual Category Integration Technique
Tzeng, Guo-han 30 August 2006
With the emergence of the Internet, many innovative and interesting applications from different countries have been stimulated, and e-commerce is becoming more and more pervasive. Under this scenario, a tremendous amount of information expressed in different languages is exchanged and shared by not only organizations but also individuals in the modern global environment. A large proportion of this information is typically formatted and available as textual documents and managed by using categories. Consequently, the development of a practical and effective technique for cross-lingual category integration (CLCI) becomes an essential issue. Several category integration techniques have been proposed, but all of them deal with category integration involving only monolingual documents. In response, in this study, we combine existing cross-lingual text categorization techniques with an existing monolingual category integration technique (specifically, Enhanced Naive Bayes) and propose a CLCI solution to address cross-lingual category integration. Our empirical evaluation results demonstrate the feasibility and superior effectiveness of the proposed CLCI technique.
|
165 |
Preference-Anchored Document Clustering Technique: Effects of Term Relationships and Thesaurus
Lin, Hao-hsiang 30 August 2006
According to the context theory of classification, the document-clustering behaviors of individuals not only involve the attributes (including contents) of documents but also depend on who is doing the task and in what context. Thus, effective document-clustering techniques need to take into account users' categorization preferences so that they can generate document clusters from different preferential perspectives. The Preference-Anchored Document Clustering (PAC) technique was proposed for supporting preference-based document clustering. Specifically, PAC takes a user's categorization preference into consideration and subsequently generates a set of document clusters from this specific preferential perspective. In this study, we investigate two research questions concerning the PAC technique. The first research question is "whether the incorporation of the broader-term expansion (i.e., the proposed PAC2 technique in this study) will improve the effectiveness of preference-based document-clustering," whereas the second research question is "whether the use of a statistical-based thesaurus constructed from a larger document corpus will improve the effectiveness of preference-based document-clustering." Compared with the effectiveness achieved by PAC, our empirical results show that the proposed PAC2 technique neither improves nor deteriorates the effectiveness of preference-based document clustering when the complete set of anchoring terms is used. However, when only a partial set of anchoring terms is provided, PAC2 does not improve and even deteriorates the effectiveness of preference-based document clustering. As to the second research question, our empirical results suggest that the use of a statistical-based thesaurus constructed from a larger document corpus (i.e., the ACM corpus consisting of 14,729 documents) does not improve the effectiveness of PAC and PAC2 for preference-based document clustering.
|
166 |
Personalized and Context-aware Document Clustering
Yang, Chin-Sheng 15 July 2007
To manage the ever-increasing volume of documents, organizations and individuals typically organize documents into categories (or category hierarchies) to facilitate their document management and support subsequent document retrieval and access. Document clustering is an intentional act that should reflect individuals' preferences with regard to the semantic coherency or relevant categorization of documents and should conform to the context of a target task under investigation. Thus, effective document clustering techniques need to take into account a user's categorization context defined by or relevant to the target task under consideration. However, existing document clustering techniques generally are anchored in purely content-based analysis and therefore are not able to facilitate personalized or context-aware document clustering. In response, we design, implement, and empirically evaluate three document clustering techniques capable of facilitating personalized or contextual document clustering. First, we extend an existing document clustering technique (specifically, the partial-clustering-based personalized document-clustering (PEC) approach) and propose the Collaborative Filtering-based personalized document-Clustering (CFC) technique to overcome the problem of small-sized partial clustering encountered by the PEC technique. In particular, the CFC technique expands the size of a user's partial clustering based on the partial clusterings of other users with similar categorization preferences. Second, to support contextual document clustering, we design and implement a Context-Aware document-Clustering (CAC) technique by taking into consideration a user's categorization preference (i.e., a set of anchoring terms) relevant to the context of a target task and a statistical-based thesaurus constructed from the World Wide Web (WWW) via a search engine. Third, in response to the problem of a small-sized set of anchoring terms, which can greatly degrade the effectiveness of the CAC technique, we extend CAC and propose a Collaborative Filtering-based Context-Aware document Clustering (CF-CAC) technique. Our empirical evaluation results suggest that our proposed CFC, CAC, and CF-CAC techniques better support the need for personalized and contextual document clustering than do their benchmark techniques.
|
167 |
Text Mining: A Burgeoning Quality Improvement Tool
J. Mohammad, Mohammad Alkin Cihad 01 November 2007
While the amount of textual data available to us is constantly increasing, managing the texts by human effort is clearly inadequate for the volume and complexity of the information involved. Consequently, the need for automated extraction of useful knowledge from huge amounts of textual data to assist human analysis is apparent. Text mining (TM) is a mostly automated technique that aims to discover knowledge from textual data. In this thesis, the notion of text mining, its techniques, and its applications are presented. In particular, the study provides the definition and an overview of concepts in text categorization, including document representation models, weighting schemes, feature selection methods, feature extraction, performance measures, and machine learning techniques. The thesis details the functionality of text mining as a quality improvement tool. It carries out an extensive survey of text mining applications within the service sector and the manufacturing industry. It presents two broad experimental studies exploring the potential use of text mining for the hotel industry (comment card analysis) and for an automobile manufacturer (miles per gallon analysis).
Keywords: Text Mining, Text Categorization, Quality Improvement, Service Sector, Manufacturing Industry.
|
168 |
Sentiment Analysis In Turkish
Erogul, Umut 01 June 2009
Sentiment analysis is the automatic classification of a text, trying to determine the attitude of the writer with respect to a specific topic. The attitude may be either their judgment or evaluation, their feelings or the intended emotional communication.
The recent increase in the use of review sites and blogs has made a great amount of subjective data available. Nowadays, it is nearly impossible to manually process all the relevant data available, and as a consequence, the importance given to the automatic classification of unformatted data has increased.
To date, all of the research carried out on sentiment analysis has focused on the English language. In this thesis, two Turkish datasets tagged with sentiment information are introduced, and existing methods for English are applied to these datasets. This thesis also suggests new methods for Turkish sentiment analysis.
|
169 |
Improving Search Result Clustering By Integrating Semantic Information From Wikipedia
Calli, Cagatay 01 September 2010
Suffix Tree Clustering (STC) is a search result clustering (SRC) algorithm focused on generating overlapping clusters with meaningful labels in linear time. It showed the feasibility of SRC, but subsequent studies later introduced description-first algorithms that generate better labels and achieve higher precision. Still, STC remained the fastest SRC algorithm, and studies appeared that addressed different problems of STC. In this thesis, semantic relations between cluster labels and documents are exploited to filter out noisy labels and improve the merging phase of STC. Wikipedia is used to identify these relations, and methods for integrating semantic information into STC are suggested. Semantic features are shown to be effective for the SRC task when used together with term frequency vectors. Furthermore, there have been no SRC studies on Turkish up to now. In this thesis, a dataset for Turkish is introduced and a number of methods are tested on Turkish.
|
170 |
Acquisition Of Liver Specific Parasites-bacteria-drugs-diseases-genes Knowledge From Medline
Yildirim, Pinar 01 April 2011
Biomedical literature such as MEDLINE articles is a rich resource for discovering and tracking disease and drug knowledge. For example, information regarding the drugs that are used for a particular disease or the changes in drug usage over time is valuable. However, this information is buried in thousands of MEDLINE articles. Acquiring knowledge from these articles requires complex processes that depend on biomedical text mining techniques. Today, parasitic and bacterial diseases affect hundreds of millions of people worldwide. They result in significant mortality and devastating social and economic consequences. Many control and eradication programs are conducted around the world, and many drugs have been developed for diseases caused by parasites and bacteria. In this study, research was conducted on parasites and bacteria affecting the liver, and treatment drugs were tested. Relationships between these diseases and genes, along with parasites and bacteria, were also searched using data mining and biomedical text mining techniques. This study reveals that the treatment of parasites and bacteria seems to have been stable over the last four decades. The methodology introduced in this study also presents a reference model for acquiring medical knowledge from the literature.
|