Global ETD Search

81	Rival penalized competitive learning for content-based indexing. January 1998 by Lau Tak Kan. / Thesis (M.Phil.)--Chinese University of Hong Kong, 1998. / Includes bibliographical references (leaves 100-108). / Abstract also in Chinese. / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Background --- p.1 / Chapter 1.2 --- Problem Defined --- p.5 / Chapter 1.3 --- Contributions --- p.5 / Chapter 1.4 --- Thesis Organization --- p.7 / Chapter 2 --- Content-based Retrieval Multimedia Database Background and Indexing Problem --- p.8 / Chapter 2.1 --- Feature Extraction --- p.8 / Chapter 2.2 --- Nearest-neighbor Search --- p.10 / Chapter 2.3 --- Content-based Indexing Methods --- p.15 / Chapter 2.4 --- Indexing Problem --- p.22 / Chapter 3 --- Data Clustering Methods for Indexing --- p.25 / Chapter 3.1 --- Proposed Solution to Indexing Problem --- p.25 / Chapter 3.2 --- Brief Description of Several Clustering Methods --- p.26 / Chapter 3.2.1 --- K-means --- p.26 / Chapter 3.2.2 --- Competitive Learning (CL) --- p.27 / Chapter 3.2.3 --- Rival Penalized Competitive Learning (RPCL) --- p.29 / Chapter 3.2.4 --- General Hierarchical Clustering Methods --- p.31 / Chapter 3.3 --- Why RPCL? --- p.32 / Chapter 4 --- Non-hierarchical RPCL Indexing --- p.33 / Chapter 4.1 --- The Non-hierarchical Approach --- p.33 / Chapter 4.2 --- Performance Experiments --- p.34 / Chapter 4.2.1 --- Experimental Setup --- p.35 / Chapter 4.2.2 --- Experiment 1: Test for Recall and Precision Performance --- p.38 / Chapter 4.2.3 --- Experiment 2: Test for Different Sizes of Input Data Sets --- p.45 / Chapter 4.2.4 --- Experiment 3: Test for Different Numbers of Dimensions --- p.49 / Chapter 4.2.5 --- Experiment 4: Compare with Actual Nearest-neighbor Results --- p.53 / Chapter 4.3 --- Chapter Summary --- p.55 / Chapter 5 --- Hierarchical RPCL Indexing --- p.56 / Chapter 5.1 --- The Hierarchical Approach --- p.56 / Chapter 5.2 --- The Hierarchical RPCL Binary Tree (RPCL-b-tree) --- p.58 / Chapter 5.3 --- Insertion --- p.61 / Chapter 5.4 --- Deletion --- p.63 / Chapter 5.5 --- Searching --- p.63 / Chapter 5.6 --- Experiments --- p.69 / Chapter 5.6.1 --- Experimental Setup --- p.69 / Chapter 5.6.2 --- Experiment 5: Test for Different Node Sizes --- p.72 / Chapter 5.6.3 --- Experiment 6: Test for Different Sizes of Data Sets --- p.75 / Chapter 5.6.4 --- Experiment 7: Test for Different Data Distributions --- p.78 / Chapter 5.6.5 --- Experiment 8: Test for Different Numbers of Dimensions --- p.80 / Chapter 5.6.6 --- Experiment 9: Test for Different Numbers of Database Objects Retrieved --- p.83 / Chapter 5.6.7 --- Experiment 10: Test with VP-tree --- p.86 / Chapter 5.7 --- Discussion --- p.90 / Chapter 5.8 --- A Relationship Formula --- p.93 / Chapter 5.9 --- Chapter Summary --- p.96 / Chapter 6 --- Conclusion --- p.97 / Chapter 6.1 --- Future Works --- p.97 / Chapter 6.2 --- Conclusion --- p.98 / Bibliography --- p.100 Multimedia systems Indexing Cluster analysis Information retrieval
82	An effective Chinese indexing method based on partitioned signature files. January 1998 by Wong Chi Yin. / Thesis (M.Phil.)--Chinese University of Hong Kong, 1998. / Includes bibliographical references (leaves 107-114). / Abstract also in Chinese. / Abstract --- p.ii / Acknowledgements --- p.vi / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Introduction to Chinese IR --- p.1 / Chapter 1.2 --- Contributions --- p.3 / Chapter 1.3 --- Organization of this Thesis --- p.5 / Chapter 2 --- Background --- p.6 / Chapter 2.1 --- Indexing methods --- p.6 / Chapter 2.1.1 --- Full-text scanning --- p.7 / Chapter 2.1.2 --- Inverted files --- p.7 / Chapter 2.1.3 --- Signature files --- p.9 / Chapter 2.1.4 --- Clustering --- p.10 / Chapter 2.2 --- Information Retrieval Models --- p.10 / Chapter 2.2.1 --- Boolean model --- p.11 / Chapter 2.2.2 --- Vector space model --- p.11 / Chapter 2.2.3 --- Probabilistic model --- p.13 / Chapter 2.2.4 --- Logical model --- p.14 / Chapter 3 --- Investigation of Segmentation on the Vector Space Retrieval Model --- p.15 / Chapter 3.1 --- Segmentation of Chinese Texts --- p.16 / Chapter 3.1.1 --- Character-based segmentation --- p.16 / Chapter 3.1.2 --- Word-based segmentation --- p.18 / Chapter 3.1.3 --- N-Gram segmentation --- p.21 / Chapter 3.2 --- Performance Evaluation of Three Segmentation Approaches --- p.23 / Chapter 3.2.1 --- Experimental Setup --- p.23 / Chapter 3.2.2 --- Experimental Results --- p.24 / Chapter 3.2.3 --- Discussion --- p.29 / Chapter 4 --- Signature File Background --- p.32 / Chapter 4.1 --- Superimposed coding --- p.34 / Chapter 4.2 --- False drop probability --- p.36 / Chapter 5 --- Partitioned Signature File Based On Chinese Word Length --- p.39 / Chapter 5.1 --- Fixed Weight Block (FWB) Signature File --- p.41 / Chapter 5.2 --- Overview of PSFC --- p.45 / Chapter 5.3 --- Design Considerations --- p.50 / Chapter 6 --- New Hashing Techniques for Partitioned Signature Files --- p.59 / Chapter 6.1 --- Direct Division Method --- p.61 / Chapter 6.2 --- Random Number Assisted Division Method --- p.62 / Chapter 6.3 --- Frequency-based hashing method --- p.64 / Chapter 6.4 --- Chinese character-based hashing method --- p.68 / Chapter 7 --- Experiments and Results --- p.72 / Chapter 7.1 --- Performance evaluation of partitioned signature file based on Chinese word length --- p.74 / Chapter 7.1.1 --- Retrieval Performance --- p.75 / Chapter 7.1.2 --- Signature Reduction Ratio --- p.77 / Chapter 7.1.3 --- Storage Requirement --- p.79 / Chapter 7.1.4 --- Discussion --- p.81 / Chapter 7.2 --- Performance evaluation of different dynamic signature generation methods --- p.82 / Chapter 7.2.1 --- Collision --- p.84 / Chapter 7.2.2 --- Retrieval Performance --- p.86 / Chapter 7.2.3 --- Discussion --- p.89 / Chapter 8 --- Conclusions and Future Work --- p.91 / Chapter 8.1 --- Conclusions --- p.91 / Chapter 8.2 --- Future work --- p.95 / Chapter A --- Notations of Signature Files --- p.96 / Chapter B --- False Drop Probability --- p.98 / Chapter C --- Experimental Results --- p.103 / Bibliography --- p.107 Chinese language--Data processing Indexing Information retrieval
83	A machine learning approach for plagiarism detection Alsallal, M. January 2016 Plagiarism detection is gaining increasing importance due to requirements for integrity in education. Existing research has investigated the problem of plagiarism detection with varying degrees of success. The literature reveals two main methods for detecting plagiarism, namely extrinsic and intrinsic. This thesis develops two novel approaches to address both of these methods. First, a novel extrinsic method for detecting plagiarism is proposed. The method is based on four well-known techniques, namely Bag of Words (BOW), Latent Semantic Analysis (LSA), Stylometry and Support Vector Machines (SVM). The LSA application was fine-tuned to take in the stylometric features (most common words) in order to characterise the document authorship, as described in chapter 4. The results revealed that LSA-based stylometry outperformed the traditional LSA application. Support vector machine based algorithms were used to perform the classification procedure in order to predict which author has written a particular book being tested. The proposed method successfully addressed the limitations of semantic characteristics and identified the document source by assigning the book being tested to the right author in most cases. Secondly, the intrinsic detection method relies on the statistical properties of the most common words. LSA was applied in this method to a group of most common words (MCWs) to extract their usage patterns based on the transitivity property of LSA. The feature sets of the intrinsic model were based on the frequency of the most common words, their relative frequencies in series, and the deviation of these frequencies across all books for a particular author. The intrinsic method aims to generate a model of author “style” by revealing a set of certain features of authorship. 
The model’s generation procedure focuses on just one author as an attempt to summarise aspects of an author’s style in a definitive and clear-cut manner. The thesis has also proposed a novel experimental methodology for testing the performance of both extrinsic and intrinsic methods for plagiarism detection. This methodology relies upon the CEN (Corpus of English Novels) training dataset, but divides that dataset up into training and test datasets in a novel manner. Both approaches have been evaluated using the well-known leave-one-out-cross-validation method. Results indicated that by integrating deep analysis (LSA) and Stylometric analysis, hidden changes can be identified whether or not a reference collection exists. 006.3
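The most-common-word (MCW) features described in the abstract above can be illustrated with a minimal sketch. It is not the thesis's implementation: the whitespace tokenizer and the function names `most_common_words` and `mcw_profile` are assumptions made for illustration, and only the relative-frequency feature is shown (not the in-series frequencies or per-author deviations).

```python
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def most_common_words(books: List[str], k: int) -> List[str]:
    """Return the k most frequent words over the whole collection."""
    counts: Counter = Counter()
    for book in books:
        counts.update(tokenize(book))
    return [word for word, _ in counts.most_common(k)]


def mcw_profile(book: str, vocabulary: List[str]) -> List[float]:
    """Relative frequency of each vocabulary word within one book."""
    tokens = tokenize(book)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty text
    return [counts[word] / total for word in vocabulary]


books = ["the cat and the dog", "the bird and the fish and the frog"]
vocabulary = most_common_words(books, 2)
profiles = [mcw_profile(book, vocabulary) for book in books]
```

Each profile is a fixed-length vector, so it could in principle be fed to a classifier or to LSA; how the thesis combines these features is described only at the level of the abstract.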
84	Continuous memories for representing sets of vectors and image collections / Mémoires continues représentant des ensembles de vecteurs et des collections d’images Iscen, Ahmet 25 September 2017 Cette thèse étudie l'indexation et le mécanisme d'expansion de requête en recherche d'image. L'indexation sacrifie la qualité de la recherche pour une plus grande efficacité; l'expansion de requête prend ce compromis dans l'autre sens : il améliore la qualité de la recherche avec un coût en complexité additionnel. Nous proposons des solutions pour les deux approches qui utilisent une représentation continue d'un ensemble de vecteurs. Pour l'indexation, notre solution est basée sur le test par groupe. Chaque vecteur image est assigné à un groupe, et chaque groupe est représenté par un seul vecteur. C'est la représentation continue de l'ensemble des vecteurs du groupe. L'optimisation de cette représentation pour produire un bon test d'appartenance donne une solution basée sur la pseudo-inverse de Moore-Penrose. Elle montre des performances supérieures à celles d'une somme basique des vecteurs du groupe. Nous proposons aussi une alternative suivant au plus près les vecteurs-images de la base. Elle optimise conjointement l'assignation des vecteurs images à des groupes ainsi que la représentation vectorielle de ces groupes. La deuxième partie de la thèse étudie le mécanisme d'expansion de requête au moyen d'un graphe pondéré représentant les vecteurs images. Cela permet de retrouver des images similaires le long d'une même variété géométrique, mais éloignées en distance Euclidienne. Nous donnons une implémentation ultra-rapide de ce mécanisme en créant des représentations vectorielles incorporant la diffusion. Ainsi, le mécanisme d'expansion se réduit à un simple produit scalaire entre les représentations vectorielles lors de la requête. 
Les deux parties de la thèse fournissent une analyse théorique et un travail expérimental approfondi utilisant les protocoles et les jeux de données standards en recherche d'images. Les méthodes proposées ont des performances supérieures à l'état de l'art. / In this thesis, we study the indexing and query expansion problems in image retrieval. The former sacrifices accuracy for efficiency, whereas the latter takes the opposite perspective and improves accuracy at additional cost. Our proposed solutions to both problems use continuous representations of a set of vectors. We turn our attention to indexing first, and follow the group testing scheme. We assign each dataset vector to a group, and represent each group with a single vector. We propose memory vectors, whose solution is optimized under the membership test hypothesis. The optimal solution for this problem is based on the Moore-Penrose pseudo-inverse, and shows superior performance compared to basic sum pooling. We also provide a data-driven approach that optimizes the assignment and representation jointly. The second half of the manuscript focuses on the query expansion problem, representing a set of vectors with weighted graphs. This allows us to retrieve objects that lie on the same manifold but are farther away in Euclidean space. We improve the efficiency of our technique even further by creating high-dimensional diffusion embeddings offline, so that they can be compared with a simple dot product at query time. For both problems, we provide thorough experiments and analysis on well-known image retrieval benchmarks and show the improvements achieved by the proposed methods. Vision par ordinateur Indexation Computer vision Indexing
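The contrast the abstract draws between sum pooling and a pseudo-inverse construction can be sketched as follows. This is an illustrative sketch, not the thesis's code: it assumes the vectors of a group are linearly independent, writes the group as a matrix X with one vector per column, and builds the memory vector as m = X (XᵀX)⁻¹ 1, which equals (X⁺)ᵀ 1. The helper names and the small Gauss-Jordan solver are assumptions chosen to keep the example free of third-party libraries.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def sum_memory(group: List[Vector]) -> Vector:
    """Sum pooling: add the group's vectors component-wise."""
    return [sum(column) for column in zip(*group)]


def solve(a: List[Vector], b: Vector) -> Vector:
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def pinv_memory(group: List[Vector]) -> Vector:
    """m = X (X^T X)^{-1} 1, assuming linearly independent group vectors."""
    gram = [[dot(u, v) for v in group] for u in group]
    weights = solve(gram, [1.0] * len(group))
    return [sum(w * x for w, x in zip(weights, column)) for column in zip(*group)]


group = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0]]
m_sum = sum_memory(group)    # membership scores depend on how the vectors overlap
m_pinv = pinv_memory(group)  # every group member scores 1 against this vector
scores = [dot(x, m_pinv) for x in group]
```

Under this construction every member of the group has a dot product of 1 with the memory vector, which is what makes it usable as a membership test; with sum pooling the score of each member also includes its similarity to the other members of the group.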
85	Indexing methods for multimedia data objects given pair-wise distances. January 1997 by Chan Mei Shuen Polly. / Thesis (M.Phil.)--Chinese University of Hong Kong, 1997. / Includes bibliographical references (leaves 67-70). / Abstract --- p.ii / Acknowledgement --- p.iii / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Definitions --- p.3 / Chapter 1.2 --- Thesis Overview --- p.5 / Chapter 2 --- Background and Related Work --- p.6 / Chapter 2.1 --- Feature-Based Index Structures --- p.6 / Chapter 2.2 --- Distance Preserving Methods --- p.8 / Chapter 2.3 --- Distance-Based Index Structures --- p.9 / Chapter 2.3.1 --- The Vantage-Point Tree Method --- p.10 / Chapter 3 --- The Problem of Distance Preserving Methods in Querying --- p.12 / Chapter 3.1 --- Some Experimental Results --- p.13 / Chapter 3.2 --- Discussion --- p.15 / Chapter 4 --- Nearest Neighbor Search in VP-trees --- p.17 / Chapter 4.1 --- The sigma-factor Algorithm --- p.18 / Chapter 4.2 --- The Constant-α Algorithm --- p.22 / Chapter 4.3 --- The Single-Pass Algorithm --- p.24 / Chapter 4.4 --- Discussion --- p.25 / Chapter 4.5 --- Performance Evaluation --- p.26 / Chapter 4.5.1 --- Experimental Setup --- p.27 / Chapter 4.5.2 --- Results --- p.28 / Chapter 5 --- Update Operations on VP-trees --- p.41 / Chapter 5.1 --- Insert --- p.41 / Chapter 5.2 --- Delete --- p.48 / Chapter 5.3 --- Performance Evaluation --- p.51 / Chapter 6 --- Minimizing Distance Computations --- p.57 / Chapter 6.1 --- A Single Vantage Point per Level --- p.58 / Chapter 6.2 --- Reuse of Vantage Points --- p.59 / Chapter 6.3 --- Performance Evaluation --- p.60 / Chapter 7 --- Conclusions and Future Work --- p.63 / Chapter 7.1 --- Future Work --- p.65 / Bibliography --- p.67 Database management Indexing Multimedia systems Information retrieval
86	Indexing techniques for object-oriented databases. January 1996 by Frank Hing-Wah Luk. / Thesis (M.Phil.)--Chinese University of Hong Kong, 1996. / Includes bibliographical references (leaves 92-95). / Abstract --- p.ii / Acknowledgement --- p.iii / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Motivation --- p.1 / Chapter 1.2 --- The Problem in Object-Oriented Database Indexing --- p.2 / Chapter 1.3 --- Contributions --- p.3 / Chapter 1.4 --- Thesis Organization --- p.4 / Chapter 2 --- Object-oriented Data Model --- p.5 / Chapter 2.1 --- Object-oriented Data Model --- p.5 / Chapter 2.2 --- Object and Object Identifiers --- p.6 / Chapter 2.3 --- Complex Attributes and Methods --- p.6 / Chapter 2.4 --- Class --- p.8 / Chapter 2.4.1 --- Inheritance Hierarchy --- p.8 / Chapter 2.4.2 --- Aggregation Hierarchy --- p.8 / Chapter 2.5 --- Sample Object-Oriented Database Schema --- p.9 / Chapter 3 --- Indexing in Object-Oriented Databases --- p.10 / Chapter 3.1 --- Introduction --- p.10 / Chapter 3.2 --- Indexing on Inheritance Hierarchy --- p.10 / Chapter 3.3 --- Indexing on Aggregation Hierarchy --- p.13 / Chapter 3.4 --- Indexing on Integrated Support --- p.16 / Chapter 3.5 --- Indexing on Method Invocation --- p.18 / Chapter 3.6 --- Indexing on Overlapping Path Expressions --- p.19 / Chapter 4 --- Triple Node Hierarchy --- p.23 / Chapter 4.1 --- Introduction --- p.23 / Chapter 4.2 --- Triple Node --- p.25 / Chapter 4.3 --- Triple Node Hierarchy --- p.26 / Chapter 4.3.1 --- Construction of the Triple Node Hierarchy --- p.26 / Chapter 4.3.2 --- Updates in the Triple Node Hierarchy --- p.31 / Chapter 4.4 --- Cost Model --- p.33 / Chapter 4.4.1 --- Storage --- p.33 / Chapter 4.4.2 --- Query Cost --- p.35 / Chapter 4.4.3 --- Update Cost --- p.35 / Chapter 4.5 --- Evaluation --- p.37 / Chapter 4.6 --- Summary --- p.42 / Chapter 5 --- Triple Node Hierarchy in Both Aggregation and Inheritance Hierarchies --- p.43 / Chapter 5.1 --- Introduction --- p.43 / Chapter 5.2 --- Preliminaries --- p.44 / Chapter 5.3 --- Class-Hierarchy Tree --- p.45 / Chapter 5.4 --- The Nested CH-tree --- p.47 / Chapter 5.4.1 --- Construction --- p.47 / Chapter 5.4.2 --- Retrieval --- p.48 / Chapter 5.4.3 --- Update --- p.48 / Chapter 5.5 --- Cost Model --- p.49 / Chapter 5.5.1 --- Assumptions --- p.51 / Chapter 5.5.2 --- Storage --- p.52 / Chapter 5.5.3 --- Query Cost --- p.52 / Chapter 5.5.4 --- Update Cost --- p.53 / Chapter 5.6 --- Evaluation --- p.55 / Chapter 5.6.1 --- Storage Cost --- p.55 / Chapter 5.6.2 --- Query Cost --- p.57 / Chapter 5.6.3 --- Update Cost --- p.62 / Chapter 5.7 --- Summary --- p.63 / Chapter 6 --- Decomposition of Path Expressions --- p.65 / Chapter 6.1 --- Introduction --- p.65 / Chapter 6.2 --- Configuration on Path Expressions --- p.67 / Chapter 6.2.1 --- Single Path Expression --- p.67 / Chapter 6.2.2 --- Overlapping Path Expressions --- p.68 / Chapter 6.3 --- New Algorithm --- p.70 / Chapter 6.3.1 --- Example --- p.72 / Chapter 6.4 --- Evaluation --- p.75 / Chapter 6.5 --- Summary --- p.76 / Chapter 7 --- Conclusion and Future Research --- p.77 / Chapter 7.1 --- Conclusion --- p.77 / Chapter 7.2 --- Future Research --- p.78 / Chapter A --- Evaluation of some Parameters in Chapter 5 --- p.79 / Chapter B --- Cost Model for Nested-Inherited Index --- p.82 / Chapter B.1 --- Storage --- p.82 / Chapter B.2 --- Query Cost --- p.84 / Chapter B.3 --- Update --- p.84 / Chapter C --- Algorithm constructing a minimum auxiliary set of J Is --- p.87 / Chapter D --- Estimation on the number of possible combinations --- p.89 / Bibliography --- p.92 Indexing Database management Object-oriented databases
87	A leitura documentária de bibliotecários jurídicos : um estudo realizado a partir de aspectos da semiose e teoria da inferência observados na estrutura textual de doutrina / Reis, Daniela Majorie Akama dos January 2019 Orientadora: Mariângela Spotti Lopes Fujita / Banca: Carlos Cândido de Almeida / Banca: Franciele Marques Redigolo / Banca: Dulce Amélia de Brito Neves / Banca: Gercina Ângela Borém de Oliveira Lima / Resumo: A leitura documentária é realizada durante a análise de assunto, considerada a primeira etapa de vários processos, incluindo a indexação e a catalogação de assunto. Seu objetivo é desvendar o aboutness de documentos. Diversos são seus produtos, como termos extraídos de um documento para compor um índice (no caso da indexação) ou para compor registros bibliográficos em um catálogo (no caso da catalogação de assunto). Cada profissional que efetua a prática da leitura documentária é único e, como consequência disso, a análise do documento nunca ocorrerá da mesma forma. Vários fatores devem ser levados em conta, quando se estuda o processo de leitura documentária feito por profissionais da informação, como estratégias de leitura, conhecimento prévio, domínio de atuação e tipo de estrutura do documento analisado. O problema da pesquisa consiste na necessidade de avançar em estudos sobre processos metacognitivos, na leitura documentária de bibliotecários do domínio jurídico, utilizando teorias associadas à construção de significados. Aspectos que relacionam a semiótica à leitura viabilizaram a proposta de examinar a leitura documentária de livros do domínio jurídico, por meio de aspectos da teoria da inferência, especificamente os conceitos de abdução, dedução e indução. Com esses conceitos, busca-se mapear os processos mentais interpretativos dos profissionais nesse domínio, durante a leitura documentária. A coleta de dados foi realizada adotando-se a técnica introspectiva de Prot... 
(Resumo completo, clicar acesso eletrônico abaixo) / Abstract: Documentary reading is performed during subject analysis, considered the first stage of various processes, including indexing and subject cataloging. It aims to unveil the aboutness of documents. It results in several products, such as terms extracted from a document to compose an index (in the case of indexing), or to compose bibliographic records in a catalog (in the case of subject cataloging). Each professional who performs documentary reading is unique, and as a consequence, the analysis of a document will never occur in the same way. Several factors should be considered when studying the process of documentary reading carried out by information professionals, such as reading strategies, previous knowledge, domain, and the type of structure of the document analyzed. The research problem is the need to advance studies on metacognitive processes in the documentary reading of law librarians, using theories related to the construction of meaning. Aspects that relate semiotics to reading enabled the proposal to examine the documentary reading of books in the legal domain through aspects of inference theory, specifically the concepts of abduction, deduction and induction. With these concepts, we seek to map the interpretive mental processes of professionals in this field during documentary reading. The data collection was performed using the introspective technique of Individual Verbal Protocol, applied to law librarians. The data obtained was analyzed using cat... (Complete abstract click electronic access below) / Doutor Direito - Documentação. Indexação. Catalogação. Análise documentária. Indexing
88	A DHT-Based Grid Resource Indexing and Discovery Scheme Teo, Yong Meng, March, Verdi, Wang, Xianbing 01 1900 This paper presents a DHT-based grid resource indexing and discovery (DGRID) approach. With DGRID, resource-information data is stored in its own administrative domain, and each domain, represented by an index server, is virtualized into several nodes (virtual servers) according to the number of resource types it has. All nodes are then arranged as a structured overlay network, or distributed hash table (DHT). Compared with existing grid resource indexing and discovery schemes, the benefits of DGRID include improving the security of domains, increasing the availability of data, and eliminating stale data. / Singapore-MIT Alliance (SMA) Grid resource indexing and discovery DHT availability
89	Novelty Detection by Latent Semantic Indexing Zhang, Xueshan January 2013 As a new topic in text mining, novelty detection is a natural extension of information retrieval systems, or search engines. By refining raw search results, filtering out old news and keeping only the novel messages, it saves people from the nightmare of information overload. One of the difficulties in novelty detection is the inherent ambiguity of language, which is the carrier of information. Among the sources of ambiguity, synonymy proves to be a notable factor. To address this issue, previous studies mainly employed WordNet, a lexical database which can be perceived as a thesaurus. Rather than borrowing a dictionary, we proposed a statistical approach employing Latent Semantic Indexing (LSI) to learn semantic relationships automatically with the help of language resources. To apply LSI, which involves matrix factorization, an immediate problem is that the dataset in novelty detection is dynamic and changing constantly. To imitate a real-world scenario, texts are ranked in chronological order and examined one by one. Each text is only compared with those that appeared earlier, while later ones remain unknown. As a result, the data matrix starts as a one-row vector representing the first report, and has a new row added at the bottom every time we read a new document. Such a changing dataset makes it hard to employ matrix methods directly. Although LSI has long been acknowledged as an effective text mining method when considering semantic structure, it has never been used in novelty detection, nor have other statistical treatments. We tried to change this situation by introducing an external text source to build the latent semantic space, onto which the incoming news vectors were projected. We used the Reuters-21578 dataset and the TREC data as sources of latent semantic information. 
Topics were divided into years and types in order to take the differences between them into account. Results showed that LSI, though very effective in traditional information retrieval tasks, yielded only a slight improvement in performance for some data types. The extent of improvement depended on the similarity between the news data and the external information. An examination of the co-occurrence matrix attributed this limited performance to the unique features of microblogs. Their short sentence lengths and restricted dictionary made it very hard to recover and exploit latent semantic information via traditional data structures. novelty detection latent semantic indexing Statistics
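The streaming procedure described in the abstract above (project each incoming text into a latent space built from an external source, then compare it only with earlier texts) can be sketched as follows. This is a hedged illustration rather than the thesis's method: the `basis` matrix stands in for a projection that would in practice come from a factorization of an external corpus, and the cosine measure, the `threshold` value and the function names are assumptions.

```python
import math
from typing import List

Vector = List[float]


def project(basis: List[Vector], doc: Vector) -> Vector:
    """Map a term-count vector into the latent space (one coordinate per basis row)."""
    return [sum(b * t for b, t in zip(row, doc)) for row in basis]


def cosine(a: Vector, b: Vector) -> float:
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def flag_novel(basis: List[Vector], docs: List[Vector], threshold: float = 0.8) -> List[bool]:
    """Examine documents in chronological order; compare each only with earlier ones."""
    seen: List[Vector] = []
    flags: List[bool] = []
    for doc in docs:
        latent = project(basis, doc)
        most_similar = max((cosine(latent, old) for old in seen), default=0.0)
        flags.append(most_similar < threshold)  # novel if nothing earlier is close
        seen.append(latent)
    return flags


# Hypothetical 2-dimensional latent space in which terms 0 and 1 are synonyms.
basis = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
flags = flag_novel(basis, docs)
```

In this toy setup the second document shares no term with the first, yet it is flagged as not novel because both map to the same latent coordinate; that is the synonymy effect the abstract hopes the latent space will capture.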
90	Indexing Compressed Text He, Meng January 2003 As a result of the rapid growth of the volume of electronic data, text compression and indexing techniques are receiving more and more attention. These two issues are usually treated as independent problems, but approaches that combine them have recently attracted the attention of researchers. In this thesis, we review and test some of the more effective and some of the more theoretically interesting techniques. Various compression and indexing techniques are presented, and we also present two compressed text indices. Based on these techniques, we implement a compressed full-text index, so that compressed texts can be indexed to support fast queries without decompressing the whole texts. The experiments show that our index is compact and supports fast search. Computer Science text compression text indexing
