191 |
Preference-Anchored Document Clustering Technique: Effects of Term Relationships and Thesaurus
Lin, Hao-hsiang 30 August 2006
According to the context theory of classification, the document-clustering behavior of individuals involves not only the attributes (including contents) of documents but also depends on who is doing the task and in what context. Thus, effective document-clustering techniques need to take users' categorization preferences into account so that they can generate document clusters from different preferential perspectives. The Preference-Anchored Document Clustering (PAC) technique was proposed to support preference-based document-clustering. Specifically, PAC takes a user's categorization preference into consideration and generates a set of document clusters from this specific preferential perspective. In this study, we investigate two research questions concerning the PAC technique. The first research question is "whether the incorporation of broader-term expansion (i.e., the PAC2 technique proposed in this study) will improve the effectiveness of preference-based document-clustering," whereas the second research question is "whether the use of a statistical-based thesaurus constructed from a larger document corpus will improve the effectiveness of preference-based document-clustering." Compared with the effectiveness achieved by PAC, our empirical results show that the proposed PAC2 technique neither improves nor degrades the effectiveness of preference-based document-clustering when the complete set of anchoring terms is used. However, when only a partial set of anchoring terms is provided, PAC2 does not improve, and can even degrade, the effectiveness of preference-based document-clustering. As to the second research question, our empirical results suggest that the use of a statistical-based thesaurus constructed from a larger document corpus (i.e., the ACM corpus consisting of 14,729 documents) does not improve the effectiveness of PAC or PAC2 for preference-based document-clustering.
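The idea of broader-term expansion mentioned in this abstract can be illustrated with a minimal sketch. It is not the thesis's PAC2 implementation: the function name `expand_with_broader_terms`, the dictionary-based thesaurus, and the toy terms are all hypothetical, and the sketch only assumes that each anchoring term may map to a list of broader terms.

```python
def expand_with_broader_terms(anchoring_terms, broader_of):
    """Return the anchoring terms plus their broader terms.

    `broader_of` is a hypothetical thesaurus: a dict mapping a term to a
    list of broader terms. Order is preserved and duplicates are dropped.
    """
    expanded = []
    seen = set()
    for term in anchoring_terms:
        # Keep the original term first, then any broader terms it maps to.
        for candidate in [term, *broader_of.get(term, [])]:
            if candidate not in seen:
                seen.add(candidate)
                expanded.append(candidate)
    return expanded


# Toy thesaurus and anchoring terms, invented for illustration only.
toy_thesaurus = {"k-means": ["clustering"], "svm": ["classification"]}
expanded_terms = expand_with_broader_terms(
    ["k-means", "svm", "clustering"], toy_thesaurus
)
print(expanded_terms)
```

Whether such an expansion helps is exactly the empirical question the abstract reports on; the sketch only shows what "adding broader terms to a set of anchoring terms" could look like.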
|
192 |
Personalized and Context-aware Document Clustering
Yang, Chin-Sheng 15 July 2007
To manage the ever-increasing volume of documents, organizations and individuals typically organize documents into categories (or category hierarchies) to facilitate their document management and support subsequent document retrieval and access. Document clustering is an intentional act that should reflect individuals' preferences with regard to the semantic coherency or relevant categorization of documents and should conform to the context of a target task under investigation. Thus, effective document clustering techniques need to take into account a user's categorization context defined by or relevant to the target task under consideration. However, existing document clustering techniques generally rely on pure content-based analysis and therefore are not able to facilitate personalized or context-aware document clustering. In response, we design, implement, and empirically evaluate three document clustering techniques capable of facilitating personalized or contextual document clustering. First, we extend an existing document clustering technique (specifically, the partial-clustering-based personalized document-clustering (PEC) approach) and propose the Collaborative Filtering-based personalized document-Clustering (CFC) technique to overcome the problem of small-sized partial clustering encountered by the PEC technique. In particular, the CFC technique expands the size of a user's partial clustering based on the partial clusterings of other users with similar categorization preferences. Second, to support contextual document clustering, we design and implement a Context-Aware document-Clustering (CAC) technique by taking into consideration a user's categorization preference (i.e., a set of anchoring terms) relevant to the context of a target task and a statistical-based thesaurus constructed from the World Wide Web (WWW) via a search engine.
Third, in response to the problem of a small set of anchoring terms, which can greatly degrade the effectiveness of the CAC technique, we extend CAC and propose a Collaborative Filtering-based Context-Aware document Clustering (CF-CAC) technique. Our empirical evaluation results suggest that our proposed CFC, CAC, and CF-CAC techniques better support the need for personalized and contextual document clustering than do their benchmark techniques.
|
193 |
Text Mining: A Burgeoning Quality Improvement Tool
J. Mohammad, Mohammad Alkin Cihad 01 November 2007
While the amount of textual data available to us is constantly increasing, managing
the texts by human effort alone is clearly inadequate for the volume and complexity
of the information involved. Consequently, the need for automated extraction of
useful knowledge from huge amounts of textual data to assist human analysis is
apparent. Text mining (TM) is a mostly automated technique that aims to discover
knowledge from textual data. In this thesis, the notion of text mining, its
techniques, and its applications are presented. In particular, the study provides
the definition and an overview of concepts in text categorization, including
document representation models, weighting schemes, feature selection methods,
feature extraction, performance measures, and machine learning techniques. The
thesis details the functionality of text mining as a quality improvement tool. It
carries out an extensive survey of text mining applications within the service
sector and the manufacturing industry. It presents two broad experimental studies
on the potential use of text mining in the hotel industry (comment card analysis)
and for an automobile manufacturer (miles per gallon analysis).
Keywords: Text Mining, Text Categorization, Quality Improvement, Service Sector,
Manufacturing Industry.
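As an illustration of the "weighting schemes" this abstract mentions, here is a minimal sketch of one common scheme, TF-IDF. The abstract does not say which schemes the thesis covers, so the choice of TF-IDF, the function name `tf_idf`, the whitespace tokenization, and the toy comment-card texts are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute a simple TF-IDF weight for every term in every document.

    Term frequency is the relative frequency of a term within a document;
    inverse document frequency is log(N / df), where df is the number of
    documents containing the term. Tokenization is a plain whitespace split.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Count in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append(
            {
                term: (count / total) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return weights


# Toy comment-card style texts, invented for illustration only.
toy_weights = tf_idf(["clean room", "noisy room"])
print(toy_weights)
```

In this toy example, a term that occurs in every document ("room") receives weight zero, while terms specific to one document receive positive weight, which is the intuition behind using such a scheme before categorization.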
|
194 |
Sentiment Analysis In Turkish
Erogul, Umut 01 June 2009
Sentiment analysis is the automatic classification of a text to determine the attitude of the writer with respect to a specific topic. The attitude may be the writer's judgment or evaluation, their feelings, or the intended emotional communication.
The recent increase in the use of review sites and blogs has made a great amount of subjective data available. Nowadays, it is nearly impossible to manually process all the relevant data available, and as a consequence, the importance given to the automatic classification of unformatted data has increased.
To date, all of the research carried out on sentiment analysis has focused on the English language. In this thesis, two Turkish datasets tagged with sentiment information are introduced, and existing methods for English are applied to these datasets. This thesis also suggests new methods for Turkish sentiment analysis.
|
195 |
Improving Search Result Clustering By Integrating Semantic Information From Wikipedia
Calli, Cagatay 01 September 2010
Suffix Tree Clustering (STC) is a search result clustering (SRC) algorithm focused on generating overlapping clusters with meaningful labels in linear time. It showed the feasibility of SRC, but over time, subsequent studies introduced description-first algorithms that generate better labels and achieve higher precision. Still, STC has remained the fastest SRC algorithm, and studies addressing various problems of STC have appeared. In this thesis, semantic relations between cluster labels and documents are exploited to filter out noisy labels and improve the merging phase of STC. Wikipedia is used to identify these relations, and methods for integrating semantic information into STC are suggested. Semantic features are shown to be effective for the SRC task when used together with term frequency vectors. Furthermore, there have been no SRC studies on Turkish until now. In this thesis, a dataset for Turkish is introduced and a number of methods are tested on Turkish.
|
196 |
Acquisition Of Liver Specific Parasites-bacteria-drugs-diseases-genes Knowledge From Medline
Yildirim, Pinar 01 April 2011
Biomedical literature such as MEDLINE articles is a rich resource for discovering and tracking disease and drug knowledge. For example, information regarding the drugs that are used with a particular disease, or the changes in drug usage over time, is valuable. However, this information is buried in thousands of MEDLINE articles. Acquiring knowledge from these articles requires complex processes that depend on biomedical text mining techniques. Today, parasitic and bacterial diseases affect hundreds of millions of people worldwide. They result in significant mortality and devastating social and economic consequences. Many control and eradication programs are conducted around the world, and many drugs have been developed for diseases caused by parasites and bacteria. This study examines parasites and bacteria affecting the liver and the drugs used to treat them. In addition, relationships between these diseases and genes, along with parasites and bacteria, were explored using data mining and biomedical text mining techniques. This study reveals that the treatment of parasites and bacteria appears to have remained stable over the last four decades. The methodology introduced in this study also presents a reference model for acquiring medical knowledge from the literature.
|
197 |
Ontology Based Text Mining In Turkish Radiology Reports
Deniz, Onur 01 January 2012
A vast number of radiology reports are produced in hospitals. Because these reports are written in free-text format and contain errors due to rapid production, it is becoming increasingly difficult for radiologists and physicians to reach meaningful information. Although the application of ontologies to biomedical text mining has gained increasing interest in recent years, less work has been done on ontology-based retrieval tasks in the Turkish language. In this work, an information extraction and retrieval system based on the SNOMED-CT ontology is proposed for Turkish radiology reports. The main purpose of this work is to utilize the semantic relations in the ontology to improve the precision and recall of search results in the domain. Practical problems encountered, such as spelling errors and the segmentation and tokenization of unstructured medical reports, are also addressed.
|
198 |
Emotion Analysis Of Turkish Texts By Using Machine Learning Methods
Boynukalin, Zeynep 01 July 2012
Automatically analysing the emotion in texts is of increasing interest in today's research.
The aim is to develop a machine that can detect the type of a user's emotion from his/her text.
Emotion classification of English texts has been studied by several researchers, and promising
results have been achieved. In this thesis, an emotion classification study on Turkish texts is
introduced. To the best of our knowledge, this is the first study on emotion analysis of Turkish
texts. For English, there exist some well-defined datasets for the purpose of emotion
classification, but we could not find datasets in Turkish suitable for this study. Therefore,
another important contribution is the creation of a new dataset in Turkish for emotion analysis.
The dataset is generated by combining two types of sources. Several classification algorithms
are applied to the dataset and the results are compared. Due to the nature of the Turkish
language, new features are added to the existing methods to improve the success of the proposed
method.
|
199 |
The Structure of University Students' Image of "Not Getting a Job" and Career Indecision: An Examination Using Text Mining
SUGIMOTO, Hideharu, 杉本, 英晴 31 March 2009
No description available.
|
200 |
Discovery of Evolution Patterns from Sequences of Documents
Chang, Yu-Hsiu 06 August 2001
Due to the ever-increasing volume of textual documents, text mining is a rapidly growing application of knowledge discovery in databases. Past text mining techniques predominantly concentrated on discovering intra-document patterns from textual documents, such as text categorization, document clustering, query expansion, and event tracking. Mining inter-document patterns from textual documents has been largely ignored in the literature. This research focuses on discovering inter-document patterns, called evolution patterns, from document sequences and proposes the evolution pattern discovery (EPD) technique for mining evolution patterns from a set of ordered sequences of documents. The discovery of evolution patterns can be applied in such domains as environmental scanning and knowledge management, and can be used to facilitate existing document management and retrieval techniques (e.g., event tracking).
|