1 |
Latent semantic analysis and classification modeling in applications for social movement theory / Spomer, Judith E., January 2008 (has links) (PDF)
Thesis (M.S.)--Central Connecticut State University, 2008. / Thesis advisor: Roger Bilisoly. "... in partial fulfillment of the requirements for the degree of Master of Science in Data Mining." Includes bibliographical references (leaves 122-127). Also available via the World Wide Web.
|
2 |
A machine learning approach for plagiarism detection / Alsallal, M. January 2016 (has links)
Plagiarism detection is gaining increasing importance due to requirements for integrity in education. Existing research has investigated the problem of plagiarism detection with varying degrees of success. The literature reveals two main methods for detecting plagiarism, namely extrinsic and intrinsic. This thesis develops two novel approaches, one for each method. Firstly, a novel extrinsic method for detecting plagiarism is proposed. The method is based on four well-known techniques, namely Bag of Words (BOW), Latent Semantic Analysis (LSA), stylometry, and Support Vector Machines (SVM). The LSA application was fine-tuned to take in stylometric features (the most common words) in order to characterise document authorship, as described in Chapter 4. The results revealed that LSA-based stylometry outperformed the traditional LSA application. Support vector machine based algorithms were used to perform the classification procedure, predicting which author wrote a particular book under test. The proposed method successfully addressed the limitations of semantic characteristics and identified the document source by assigning the book under test to the right author in most cases. Secondly, the intrinsic detection method relies on the statistical properties of the most common words. LSA was applied to a group of most common words (MCWs) to extract their usage patterns based on the transitivity property of LSA. The feature sets of the intrinsic model were based on the frequency of the most common words, their relative frequencies in series, and the deviation of these frequencies across all books by a particular author. The intrinsic method aims to generate a model of author “style” by revealing a set of characteristic features of authorship. The model-generation procedure focuses on just one author, in an attempt to summarise aspects of that author’s style in a definitive and clear-cut manner. The thesis also proposes a novel experimental methodology for testing the performance of both extrinsic and intrinsic plagiarism detection methods. This methodology relies upon the CEN (Corpus of English Novels) dataset, dividing it into training and test sets in a novel manner. Both approaches were evaluated using the well-known leave-one-out cross-validation method. Results indicated that by integrating deep analysis (LSA) and stylometric analysis, hidden changes can be identified whether or not a reference collection exists.
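A minimal sketch of the extrinsic pipeline this abstract describes, assuming scikit-learn and a toy corpus; the 500-word vocabulary cap, SVD rank, and classifier settings are illustrative assumptions, not the thesis's configuration:

```python
# LSA over most-common-word features followed by SVM authorship attribution.
# Texts and author labels are toy placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.svm import LinearSVC

books = [
    "it is a truth universally acknowledged that a single man ...",
    "it was the best of times it was the worst of times ...",
    "the family of dashwood had long been settled in sussex ...",
]
authors = ["austen", "dickens", "austen"]

pipeline = make_pipeline(
    # Keep only the most frequent words -- the stylometric features (MCWs)
    # the abstract describes.
    CountVectorizer(max_features=500),
    # LSA: a low-rank SVD of the term-document matrix (tiny rank for a
    # tiny toy corpus).
    TruncatedSVD(n_components=2),
    Normalizer(copy=False),
    LinearSVC(),
)
pipeline.fit(books, authors)
print(pipeline.predict(["elinor and marianne walked out together ..."]))
```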
|
3 |
Novelty Detection by Latent Semantic Indexing / Zhang, Xueshan January 2013 (has links)
As a new topic in text mining, novelty detection is a natural extension of information retrieval systems, or search engines. By refining raw search results, filtering out old news, and keeping only the novel messages, it spares modern readers the nightmare of information overload. One of the difficulties in novelty detection is the inherent ambiguity of language, the carrier of information. Among the sources of ambiguity, synonymy proves to be a notable factor. To address this issue, previous studies mainly employed WordNet, a lexical database that can be perceived as a thesaurus. Rather than borrowing a dictionary, we propose a statistical approach employing Latent Semantic Indexing (LSI) to learn semantic relationships automatically with the help of external language resources.
An immediate problem in applying LSI, which involves matrix factorization, is that the dataset in novelty detection is dynamic and constantly changing. To imitate the real-world scenario, texts are ordered chronologically and examined one by one. Each text is compared only with those that appeared earlier, while later ones remain unknown. As a result, the data matrix starts as a one-row vector representing the first report and gains a new row at the bottom every time a new document is read. Such a changing dataset makes it hard to apply matrix methods directly. Although LSI has long been acknowledged as an effective text mining method for capturing semantic structure, it had not previously been used in novelty detection, nor had other statistical treatments. We address this by introducing an external text source to build the latent semantic space, onto which the incoming news vectors are projected.
We used the Reuters-21578 dataset and the TREC data as sources of latent semantic information. Topics were divided by year and type in order to take the differences between them into account. Results showed that LSI, though very effective in traditional information retrieval tasks, brought only a slight improvement in performance for some data types. The extent of improvement depended on the similarity between the news data and the external information. An examination of the co-occurrence matrix attributed this limited performance to the unique features of microblogs: their short sentence lengths and restricted vocabulary make it very hard to recover and exploit latent semantic information via traditional data structures.
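A minimal sketch of the core mechanism this abstract describes, assuming scikit-learn: the latent space is fitted on an external corpus, each incoming document is projected into it, and a document counts as novel when it resembles nothing seen earlier. The corpus, rank, and threshold are illustrative assumptions:

```python
# Novelty detection via an LSI space built from external text. All data and
# the 0.8 similarity threshold are toy assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

external_corpus = [
    "oil prices rose sharply in asian markets today",
    "the central bank cut interest rates again",
    "heavy rain caused flooding across the region",
]
incoming = [
    "interest rates were cut by the central bank",
    "the bank cut rates for a second time",
    "a flood warning was issued after heavy rain",
]

vectorizer = TfidfVectorizer()
# Build the latent semantic space from the external source, not from the
# (initially empty, constantly growing) news stream itself.
lsi = TruncatedSVD(n_components=2).fit(vectorizer.fit_transform(external_corpus))

THRESHOLD = 0.8  # assumed cutoff: above it, a document counts as old news
seen = []
for doc in incoming:
    v = lsi.transform(vectorizer.transform([doc]))  # project into LSI space
    novel = all(cosine_similarity(v, u)[0, 0] < THRESHOLD for u in seen)
    print(f"{'novel' if novel else 'old':5s} {doc}")
    seen.append(v)
```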
|
4 |
Summary-based document categorization with LSI / Liu, Hsiao-Wen 14 February 2007 (has links)
Text categorization, which automatically assigns documents to the appropriate pre-defined category or categories, is essential to facilitating the efficient and effective retrieval of desired documents from a huge text repository, e.g., the World Wide Web. Most techniques, however, suffer from the feature selection problem and the vocabulary mismatch problem. A few research works have addressed text categorization via text summarization to reduce the size of documents, and consequently the number of features to consider, while some have proposed using latent semantic indexing (LSI) to reveal the true meaning of a term via its association with other terms. Few works, however, have studied the joint effect of text summarization and semantic dimension reduction. The objective of this research is thus to propose a practical approach, SBDR, to deal with the above difficulties in text categorization tasks.
Two experiments were conducted to validate the proposed approach. In the first experiment, the results show that text summarization does improve categorization performance. In addition, when selecting important sentences, association terms of both noun-noun and noun-verb pairs should be considered. Results of the second experiment indicate slightly better performance when adopting LSI exclusively (i.e., no summarization) than with SBDR (i.e., with summarization). Nonetheless, the minor accuracy reduction is largely compensated for by the computational time saved when LSI operates on summarized text. The feasibility of the SBDR approach is thus justified.
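A minimal sketch of the summarize-then-LSI idea, assuming scikit-learn; the crude frequency-based sentence scorer stands in for the noun-noun/noun-verb association scoring, and all data and settings are illustrative rather than the SBDR algorithm itself:

```python
# Summary-based categorization: extract a short summary per document, then
# classify in an LSI-reduced space. Scorer, rank, and data are toy assumptions.
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def summarize(text, k=2):
    # Keep the k sentences whose words are most frequent in the document --
    # a crude stand-in for association-based sentence selection.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    freq = Counter(text.lower().replace(".", " ").split())
    ranked = sorted(sentences,
                    key=lambda s: -sum(freq[w] for w in s.lower().split()))
    return ". ".join(ranked[:k])

docs = [
    "the match ended in a draw. the striker scored twice. fans cheered loudly.",
    "parliament passed the budget. the vote was close. ministers debated for hours.",
]
labels = ["sports", "politics"]
summaries = [summarize(d) for d in docs]

model = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=1),
                      LogisticRegression())
model.fit(summaries, labels)
print(model.predict([summarize("the striker scored and the match was won.")]))
```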
|
5 |
Latent semantic web service directory and composition framework: a thesis / Yick, (Winnie) Yuki B.; Haungs, Michael L. January 2009 (has links)
Thesis (M.S.)--California Polytechnic State University, 2009. / Mode of access: Internet. Title from PDF title page; viewed on Jan. 6, 2010. Major professor: Dr. Michael Haungs. "Presented to the faculty of California Polytechnic State University, San Luis Obispo." "In partial fulfillment of the requirements for the degree [of] Master of Science in Computer Science." "Aug 2009." Includes bibliographical references (p. 76-78).
|
6 |
Text clustering and active learning using an LSI subspace signature model and query expansion / Zhu, Weizhong; Allen, Robert B. January 2009 (has links)
Thesis (Ph.D.)--Drexel University, 2009. / Includes abstract and vita. Includes bibliographical references (leaves 115-121).
|
7 |
Computational modelling of the language production system: semantic memory, conflict monitoring, and cognitive control processes / Hockey, Andrew. January 2006 (has links) (PDF)
Thesis (M.Phil.)--University of Queensland, 2007. / Includes bibliography.
|
8 |
Augmenting expertise: Toward computer-enhanced clinical comprehension / Cohen, Trevor January 2007 (has links)
Cognitive studies of clinical comprehension reveal that expert clinicians are distinguished by their superior ability to recognize meaningful patterns of data in clinical narratives. For example, in psychiatry, the findings of hallucinations and delusions suggest the subdiagnostic hypothesis of a psychotic episode, which in turn suggests several diagnoses, including schizophrenia. This dissertation describes the design and evaluation of a system that aims to simulate an important aspect of expert comprehension: the ability to recognize clusters of findings that support subdiagnostic hypotheses. The broad range of content in psychiatric narrative presents a formidable barrier to achieving this goal, as it contains general concepts and descriptions of the subjective experience of psychiatric patients in addition to general medical and psychiatric concepts. Lexically driven language processing of such narrative would require the exhaustive predefinition of every concept likely to be encountered. In contrast, Latent Semantic Analysis (LSA) is a corpus-based statistical model of language that learns human-like estimates of the similarity between concepts from a text corpus. LSA is adapted to create trainable models of subdiagnostic hypotheses, which are then used to recognize related elements in psychiatric discharge summary text. The system is evaluated against an independently annotated set of psychiatric discharge summaries. System-rater agreement approached rater-rater agreement, providing support for the practical application of vector-based models of meaning in domains with broad conceptual territory. Other applications and implications are discussed, including the presentation of a prototype user interface designed to enhance novice comprehension of psychiatric discourse.
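A minimal sketch of the recognition step this abstract describes, assuming scikit-learn: each subdiagnostic hypothesis is represented as the centroid of training sentences in an LSA space, and new narrative text is scored against those centroids. The tiny corpus, hypothesis labels, and rank are illustrative assumptions, not the dissertation's trained models:

```python
# Score narrative text against LSA centroids of subdiagnostic hypotheses.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

training = {
    "psychosis": ["patient reports auditory hallucinations",
                  "fixed persecutory delusions were noted"],
    "depression": ["low mood and poor sleep for weeks",
                   "loss of interest and feelings of worthlessness"],
}
corpus = [s for sents in training.values() for s in sents]

vectorizer = TfidfVectorizer()
lsa = TruncatedSVD(n_components=2).fit(vectorizer.fit_transform(corpus))

def embed(texts):
    # Project raw text into the LSA space.
    return lsa.transform(vectorizer.transform(texts))

# One centroid per hypothesis: the mean of its training sentences.
centroids = {h: embed(sents).mean(axis=0, keepdims=True)
             for h, sents in training.items()}

sentence = "patient describes persecutory delusions and auditory hallucinations"
for hypothesis, centroid in centroids.items():
    score = cosine_similarity(embed([sentence]), centroid)[0, 0]
    print(f"{hypothesis}: {score:.2f}")
```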
|
9 |
Incremental Aspect Model Learning on Streaming Documents / Wu, Cheng-Wei 16 August 2010 (has links)
Owing to the development of the Internet, excessive online data drive users to apply tools that assist them in obtaining desired and useful information. Information retrieval techniques serve as one of the major assistance tools that ease users' information-processing loads. However, most current IR models do not consider processing streaming information, which essentially characterizes today's Web environment. Re-building models from scratch on the full dataset every time new information arrives is impractical, inefficient, and costly. Instead, under such a dynamic environment, IR models that can be adapted to streaming information incrementally should be considered.
Therefore, this research proposes an IR-related technique, the incremental aspect model (ISM), which not only uncovers latent aspects from the collected documents but also adapts the aspect model to streaming documents chronologically. There are two stages in ISM: in Stage I, we employ the probabilistic latent semantic indexing (PLSI) technique to build a primary aspect model; in Stage II, with out-of-date data removed and new data folded in, the aspect model is expanded using a derived spectral method whenever significant new aspects emerge.
Three experiments were conducted to verify ISM. Results from the first two experiments show the robust performance of ISM in incremental text clustering tasks. In Experiment III, ISM performs storyline tracking on the 2010 Soccer World Cup event, illustrating ISM's ability to incrementally discover the different themes around the event at any time. The feasibility of our proposed approach in real applications is thus justified.
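A minimal sketch of the Stage II folding-in step, assuming NumPy: the aspect-conditional word distributions P(w|z) of a trained PLSI model are held fixed while EM estimates the aspect mixture P(z|d) of one new streaming document. The vocabulary and the "trained" parameters are toy assumptions, not the thesis's ISM implementation:

```python
# PLSI folding-in: estimate P(z|d) for a new document with P(w|z) fixed.
import numpy as np

vocab = ["goal", "match", "player", "vote", "party", "election"]
# Assumed trained P(w|z) for two aspects (each row sums to 1).
p_w_given_z = np.array([
    [0.30, 0.30, 0.30, 0.03, 0.03, 0.04],  # aspect 0: sports
    [0.03, 0.04, 0.03, 0.30, 0.30, 0.30],  # aspect 1: politics
])

def fold_in(word_counts, p_w_given_z, n_iter=50):
    """EM for P(z|d) of one new document, keeping P(w|z) fixed."""
    k = p_w_given_z.shape[0]
    p_z_given_d = np.full(k, 1.0 / k)       # uniform start
    for _ in range(n_iter):
        # E-step: responsibilities P(z | d, w), shape (k, |V|).
        joint = p_z_given_d[:, None] * p_w_given_z
        resp = joint / joint.sum(axis=0, keepdims=True)
        # M-step: re-estimate only the document's aspect mixture.
        p_z_given_d = (resp * word_counts).sum(axis=1)
        p_z_given_d /= p_z_given_d.sum()
    return p_z_given_d

new_doc = np.array([3, 2, 2, 0, 0, 1])      # word counts over vocab
print(fold_in(new_doc, p_w_given_z))        # dominated by the sports aspect
```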
|