1. Cross-Lingual Text Categorization. Lin, Yen-Ting. 29 July 2004.
With the emergence and proliferation of Internet services and e-commerce applications, a tremendous amount of information is accessible online, typically in the form of textual documents. To facilitate subsequent access to and use of this information, the efficient and effective management of the ever-increasing volume of textual documents, and specifically their categorization, is essential to organizations and individuals. Existing text categorization techniques focus mainly on categorizing monolingual documents. However, with the globalization of business environments and advances in Internet technology, an organization or individual often retrieves and archives documents in different languages, thus creating the need for cross-lingual text categorization. Motivated by the significance of and need for such a technique, this thesis designs a cross-lingual text categorization technique with two different category assignment methods, namely individual-based and cluster-based. The empirical evaluation results show that the technique performs well and that the cluster-based method outperforms the individual-based one.
2. Cross-Lingual Text Categorization: A Training-corpus Translation-based Approach. Hsu, Kai-hsiang. 21 July 2005.
Text categorization deals with the automatic learning of a text categorization model from a training set of preclassified documents on the basis of their contents, and with the assignment of unclassified documents to appropriate categories. Most existing text categorization techniques deal with monolingual documents (i.e., all documents are written in one language) during both model learning and category assignment (or prediction). However, with the globalization of business environments and advances in Internet technology, an organization or individual often generates or acquires, and subsequently archives, documents in different languages, thus creating the need for cross-lingual text categorization (CLTC). Existing studies on CLTC focus on the prediction-corpus translation-based approach, which lacks a systematic mechanism for reducing translation noise, thus limiting its cross-lingual categorization effectiveness. Motivated by the need to provide more effective CLTC support, we design a training-corpus translation-based CLTC approach. Using the prediction-corpus translation-based approach as the performance benchmark, our empirical evaluation results show that the proposed CLTC approach achieves significantly better classification effectiveness than the benchmark approach in both Chinese and English corpora.
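A minimal sketch of the training-corpus translation idea, assuming scikit-learn; translate() is a hypothetical stand-in for whatever machine-translation component is used, and the thesis's translation-noise reduction mechanism is not reproduced here:

```python
# Sketch: training-corpus translation-based CLTC (scikit-learn assumed;
# translate() is a hypothetical stand-in for any machine-translation system).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def translate(doc: str, src: str, tgt: str) -> str:
    """Hypothetical MT call; replace with a real translation service."""
    raise NotImplementedError

def train_cltc(train_docs, train_labels, src_lang, tgt_lang):
    # Translate the *training* corpus into the prediction language once,
    # then learn an ordinary monolingual classifier on the translations.
    translated = [translate(d, src_lang, tgt_lang) for d in train_docs]
    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    return model.fit(translated, train_labels)

# Prediction-language documents are then classified directly, with no
# per-document translation at prediction time (for Chinese text, a word
# segmenter would be needed before TF-IDF vectorization):
# model = train_cltc(english_docs, labels, "en", "zh")
# predicted = model.predict(chinese_docs)
```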
3. Poly-Lingual Text Categorization. Shih, Hui-Hua. 09 August 2006.
With the rapid emergence and proliferation of the Internet and the trend toward globalization, a tremendous number of textual documents written in different languages are electronically accessible online. Efficiently and effectively managing these documents is essential to organizations and individuals. Although poly-lingual text categorization (PLTC) can be approached as a set of independent monolingual classifiers, this naïve approach employs only the training documents of the same language to construct each monolingual classifier and fails to exploit the opportunity offered by poly-lingual training documents. Motivated by the significance of and need for such a poly-lingual text categorization technique, we propose a PLTC technique that takes into account the training documents of all languages when constructing the monolingual classifier for a specific language. Using the independent monolingual text categorization (MnTC) technique as our performance benchmark, our empirical evaluation results show that the proposed PLTC technique achieves higher classification accuracy than the benchmark technique in both English and Chinese corpora. In addition, our empirical results suggest that the proposed PLTC technique is robust across the range of training sizes investigated.
4. litsift: Automated Text Categorization in Bibliographic Search. Faulstich, Lukas C.; Stadler, Peter F.; Thurner, Caroline; Witwer, Christina. 07 January 2019.
In bioinformatics there exist research topics that cannot be uniquely characterized by a set of keywords, because relevant keywords are (i) also heavily used in other contexts and (ii) often omitted in relevant documents, since the context is clear to the target audience. Information retrieval interfaces such as entrez/Pubmed produce either low precision or low recall in such cases. To achieve high recall at reasonable precision, the results of a broad information retrieval search have to be filtered to remove irrelevant documents. We use automated text categorization for this purpose. In this study we use the topic of conserved secondary RNA structures in viral genomes as a running example. Pubmed result sets for two virus groups, Picornaviridae and Flaviviridae, were manually labeled by human experts. We evaluated various classifiers from the Weka toolkit together with different feature selection methods to assess whether classifiers trained on documents dedicated to one virus group can be successfully applied to filter literature on other virus groups. Our results indicate that in this domain a bibliographic search tool trained on a reference corpus may significantly reduce the amount of time needed for extensive literature searches.
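The filtering step amounts to training a binary relevant/irrelevant classifier on the hand-labeled result set for one virus group and applying it to result sets retrieved for others. A rough sketch with scikit-learn standing in for the Weka classifiers used in the study (all variable and function names are illustrative):

```python
# Sketch: filter a broad PubMed result set with a classifier trained on a
# hand-labeled reference corpus (scikit-learn stands in for Weka here).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def build_filter(abstracts, labels, n_features=500):
    # labels are "relevant" / "irrelevant", assigned by human experts;
    # n_features must not exceed the extracted vocabulary size.
    model = make_pipeline(
        TfidfVectorizer(stop_words="english"),
        SelectKBest(chi2, k=n_features),   # one of several selection methods tried
        MultinomialNB(),
    )
    return model.fit(abstracts, labels)

# Train on the Picornaviridae corpus, filter the Flaviviridae result set:
# f = build_filter(picorna_abstracts, picorna_labels)
# kept = [a for a, y in zip(flavi_abstracts, f.predict(flavi_abstracts))
#         if y == "relevant"]
```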
5. Mining-Based Category Evolution for Text Databases. Dong, Yuan-Xin. 18 July 2000.
As text repositories grow in number and size and global connectivity improves, the amount of online information in the form of free-format text is growing extremely rapidly. In many large organizations, huge volumes of textual information are created and maintained, and there is a pressing need to support efficient and effective information retrieval, filtering, and management. Text categorization is essential to the efficient management and retrieval of documents. Past research on text categorization mainly focused on developing or adopting statistical classification or inductive learning methods for automatically discovering text categorization patterns from a training set of manually categorized documents. However, as documents accumulate, the predefined categories may no longer capture their characteristics. In this study, we propose a mining-based category evolution (MiCE) technique that adjusts the categories based on the existing categories and their associated documents. According to the empirical evaluation results, the proposed technique is more effective than the discovery-based category management approach, is insensitive to the quality of the original categories, and is capable of improving classification accuracy.
6. Machine learning for text categorization: Experiments using clustering and classification. Bikki, Poojitha. January 1900.
Master of Science / Department of Computer Science / William H. Hsu / This work describes a comparative study of empirical methods for categorizing news articles within text corpora: unsupervised learning for an unlabeled corpus of text documents and supervised learning for a hand-labeled corpus. The goal of text categorization is to organize natural language (i.e., human language) documents into categories that are either predefined or inherently grouped by similar meaning. The first approach, automatic classification of texts, is useful when handling massive amounts of data and has many applications, such as automated indexing of scientific articles, spam filtering, and classification of news articles. Classification using supervised or semi-supervised inductive learning involves labeled data, which can be expensive to acquire and may require a semantically deep understanding of the texts. The second approach falls under the general rubric of document clustering, based on the statistical distribution and co-occurrence of words in a full-text document. Developing a full pipeline for document categorization draws on methods from information retrieval (IR), natural language processing (NLP), and machine learning (ML).
In this project, experiments are conducted on two text corpora: news aggregator data, which contains news headlines collected from a web aggregator, and a news data set consisting of original news articles from the British Broadcasting Corporation (BBC). First, the training data is developed from these corpora. Next, common types of supervised classifiers, such as linear models, Bayesian models, ensemble models, and support vector machines (SVMs), are trained on the labeled data, and the trained classification models are used to predict the category of an article from its text. The results are analyzed and compared to determine the best-performing model. Then, two unsupervised learning techniques, k-means and Latent Dirichlet Allocation (LDA), are applied to obtain clusters of data points. k-means separates the documents into disjoint clusters of similar news, while LDA, which treats each document as a mixture of topics, is used to find latent topics in the text. Finally, visualizations of the results are produced for evaluation: to allow qualitative assessment of cluster separation in the case of unsupervised learning, or to understand the confusion matrix for the supervised classification task through heat-map visualization, along with precision, recall, and other holistic metrics. From an application standpoint, the unsupervised techniques can be used to find news articles that are similar in content and can be categorized under a specific topic.
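A condensed sketch of the two experimental tracks, assuming scikit-learn; corpus loading is elided and the function names are illustrative:

```python
# Sketch of the supervised and unsupervised tracks described above
# (scikit-learn assumed; docs/labels come from the corpora, loading elided).
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def supervised_side(docs, labels):
    # Train an SVM on labeled articles and report precision/recall per class.
    X_train, X_test, y_train, y_test = train_test_split(docs, labels)
    tfidf = TfidfVectorizer(stop_words="english")
    clf = LinearSVC().fit(tfidf.fit_transform(X_train), y_train)
    print(classification_report(y_test, clf.predict(tfidf.transform(X_test))))
    return clf

def unsupervised_side(docs, n_topics=5):
    # k-means yields disjoint clusters of similar news ...
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
    clusters = KMeans(n_clusters=n_topics).fit_predict(tfidf)
    # ... while LDA treats each document as a mixture of latent topics.
    counts = CountVectorizer(stop_words="english").fit_transform(docs)
    topic_mix = LatentDirichletAllocation(n_components=n_topics).fit_transform(counts)
    return clusters, topic_mix
```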
7. Reengineering PhysNet in the uPortal framework. Zhou, Ye. 11 July 2003.
A Digital Library (DL) is an electronic information storage system focused on meeting the information-seeking needs of its constituents.
Because modern DLs must keep pace with continual advances in technologies across all fields, interoperability among DLs is often hard to achieve. With the advent of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and Open Digital Libraries (ODL), lightweight protocols show a promising future in promoting DL interoperability. Furthermore, a DL can be envisaged as a network of independent components working collaboratively through simple standardized protocols. Prior work with ODL shows the feasibility of building componentized DLs with techniques that are a precursor to web services designs.
In our study, we elaborate on the feasibility of applying web services to DL design. DL services are modeled as a set of web services offering information dissemination through the Simple Object Access Protocol (SOAP). Additionally, a flexible DL user-interface assembly framework is offered in order to build DLs with customization and personalization. Our hypothesis is demonstrated in the PhysNet reengineering project. / Master of Science
8. Técnicas de classificação textual utilizando grafos / Text classification techniques using graphs. Silva, Allef Páblo Araújo da. 15 March 2019.
The large volume of textual information being generated at every moment makes it necessary to constantly improve systems capable of classifying texts into specific categories. This categorization aims, for example, to separate news items indexed by search engines, identify the authorship of old books and letters, or detect plagiarism in scientific articles. Existing content-based textual classification techniques, despite achieving good quantitative performance, still have difficulty dealing with the semantic aspects of texts written in natural language. In this sense, alternative approaches have been proposed, such as those based on complex networks, which take into account only the relationships between words. In this study, we model texts as complex networks and extract metrics typically used in the study of such networks to serve as classifier attributes; a problem of authorship recognition in short texts illustrates the techniques described.
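A minimal sketch of the network-based modeling, assuming networkx and scikit-learn; the abstract does not list the exact metrics or classifier used, so common network measures and a generic classifier stand in:

```python
# Sketch: model a text as a word co-occurrence network and use global
# network metrics as classifier features (networkx and scikit-learn
# assumed; the study's exact metrics are not named in the abstract).
import networkx as nx
from sklearn.ensemble import RandomForestClassifier

def text_to_network(text, window=2):
    words = text.lower().split()            # a real pipeline would tokenize/lemmatize
    g = nx.Graph()
    g.add_nodes_from(words)
    for i, w in enumerate(words):
        for other in words[i + 1 : i + window]:
            if other != w:                  # skip self-loops
                g.add_edge(w, other)        # link words co-occurring in the window
    return g

def network_features(g):
    degrees = [d for _, d in g.degree()]
    return [
        g.number_of_nodes(),
        g.number_of_edges(),
        sum(degrees) / max(len(degrees), 1),  # mean degree
        nx.average_clustering(g),             # clustering coefficient
        nx.density(g),
    ]

# Authorship recognition then reduces to ordinary feature-based learning:
# X = [network_features(text_to_network(t)) for t in texts]
# clf = RandomForestClassifier().fit(X, authors)
```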
9. Induction in Hierarchical Multi-label Domains with Focus on Text Categorization. Dendamrongvit, Sareewan. 02 May 2011.
Induction of classifiers from sets of preclassified training examples is one of the most popular machine learning tasks. This dissertation focuses on the techniques needed in the field of automated text categorization. Here, each document can be labeled with more than one class, sometimes with many classes. Moreover, the classes are hierarchically organized, their mutual relations typically expressed in terms of a generalization tree. Both aspects (multi-label classification and hierarchically organized classes) have so far received inadequate attention. The existing literature largely assumes that it is enough to induce a separate binary classifier for each class, and the question of class hierarchy is rarely addressed. This, however, ignores some serious problems. For one thing, induction of thousands of classifiers from hundreds of thousands of examples described by tens of thousands of features (a common case in automated text categorization) incurs prohibitive computational costs: even a single binary classifier in domains of this kind often takes hours, even days, to induce. For another, the fact that the classes are hierarchically organized affects how we view the classification performance of the induced classifiers.

The presented work proposes a technique referred to by the acronym "H-kNN-plus." The technique combines support vector machines and nearest-neighbor classifiers with the intention of capitalizing on the strengths of both. As for performance evaluation, a variety of measures have been used to evaluate hierarchical classifiers, including standard non-hierarchical criteria that assign the same weight to different types of error. The author proposes a performance measure that overcomes some of their weaknesses.

The dissertation begins with a study of (non-hierarchical) multi-label classification. One reason for the poor performance of earlier techniques is the class-imbalance problem: a small number of positive examples outnumbered by a great many negative examples. Another difficulty is that each class tends to have its own set of characteristic features, which means that most of the binary classifiers are induced from examples described by predominantly irrelevant features. Addressing these weaknesses by majority-class undersampling and feature selection, the proposed technique significantly improves the overall classification performance. Even more challenging is the issue of hierarchical classification. Here, the dissertation introduces a new induction mechanism, H-kNN-plus, and subjects it to extensive experiments with two real-world datasets. The results indicate its superiority, in these domains, over earlier work in terms of both prediction performance and computational cost.
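The non-hierarchical part of the approach, one binary classifier per class with majority-class undersampling and per-class feature selection, can be sketched as follows (scikit-learn assumed; this is not the H-kNN-plus induction mechanism itself):

```python
# Sketch: binary-relevance multi-label induction with majority-class
# undersampling and per-class feature selection (scikit-learn assumed;
# this is not the H-kNN-plus algorithm itself).
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def train_one_label(docs, labelsets, label, k=1000, ratio=3):
    pos = [d for d, ls in zip(docs, labelsets) if label in ls]
    neg = [d for d, ls in zip(docs, labelsets) if label not in ls]
    # Majority-class undersampling: keep at most `ratio` negatives per positive.
    neg = random.sample(neg, min(len(neg), ratio * len(pos)))
    X, y = pos + neg, [1] * len(pos) + [0] * len(neg)
    # Per-class feature selection: each label keeps its own characteristic
    # features instead of a mostly irrelevant global feature set.
    clf = make_pipeline(TfidfVectorizer(), SelectKBest(chi2, k=k), LinearSVC())
    return clf.fit(X, y)

# One binary classifier per label; a document's predicted label set is the
# set of labels whose classifier fires:
# models = {lab: train_one_label(docs, labelsets, lab) for lab in all_labels}
# predicted = {lab for lab, m in models.items() if m.predict([new_doc])[0] == 1}
```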
10. Topic-Oriented Collaborative Web Crawling. Chung, Chiasen. January 2001.
A web crawler is a program that "walks" the Web to gather web resources. To scale to the ever-growing Web, multiple crawling agents may be deployed in a distributed fashion to retrieve web data cooperatively. A common approach is to divide the Web into many partitions, with an agent assigned to crawl within each one. If an agent obtains a web resource that is not from its partition, the resource is transferred to its rightful owner. This thesis proposes a novel approach to distributed web data gathering that partitions the Web by topic. The proposed approach employs multiple focused crawlers to retrieve pages on various topics; when a crawler retrieves a page belonging to another topic, it transfers the page to the appropriate crawler. This approach is known as topic-oriented collaborative web crawling. An implementation of the system was built and experimentally evaluated. To identify the topic of a web page, a topic classifier was incorporated into the crawling system; since the classifier categorizes only English pages, a language identifier was also introduced to distinguish English pages from non-English ones. The experimental results show that redundant retrieval was low and that a resource retrieved by an agent is six times more likely to be retained than in a system that uses a conventional hashing approach. These numbers are strong indications that topic-oriented collaborative web crawling is a viable approach to web data gathering.
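The hand-off logic at the heart of the scheme can be sketched as follows; the fetcher, language identifier, topic classifier, storage, link extractor, and inter-agent transfer are all assumed components passed in as parameters, not code from the thesis:

```python
# Sketch: routing step for topic-oriented collaborative crawling.
# fetch, is_english, classify_topic, store, extract_links, and send_page
# are assumed components standing in for the thesis's fetcher, language
# identifier, topic classifier, storage, link extractor, and transfer.
def crawl_step(my_topic, frontier, agents,
               fetch, is_english, classify_topic, store, extract_links, send_page):
    url = frontier.pop()
    page = fetch(url)
    if not is_english(page):               # the topic classifier handles English only
        return
    topic = classify_topic(page)
    if topic == my_topic:
        store(page)                        # page belongs to this agent's partition
        frontier.extend(extract_links(page))
    else:
        send_page(agents[topic], page)     # transfer to the rightful owner
```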