Global ETD Search

1	Cross-Lingual Text Categorization Lin, Yen-Ting 29 July 2004 (has links) With the emergence and proliferation of Internet services and e-commerce applications, a tremendous amount of information is accessible online, typically as textual documents. To facilitate subsequent access to and use of this information, the efficient and effective management (specifically, text categorization) of the ever-increasing volume of textual documents is essential to organizations and individuals. Existing text categorization techniques focus mainly on categorizing monolingual documents. However, with the globalization of business environments and advances in Internet technology, an organization or individual often retrieves and archives documents in different languages, thus creating the need for cross-lingual text categorization. Motivated by the significance of and need for such a cross-lingual text categorization technique, this thesis designs a technique with two different category assignment methods, namely, individual-based and cluster-based. The empirical evaluation results show that the cross-lingual text categorization technique performs well and that the cluster-based method outperforms the individual-based method. Document management Cross-lingual text categorization Text categorization Text mining
2	Cross-Lingual Text Categorization: A Training-corpus Translation-based Approach Hsu, Kai-hsiang 21 July 2005 (has links) Text categorization deals with the automatic learning of a text categorization model from a training set of preclassified documents on the basis of their contents and the assignment of unclassified documents to appropriate categories. Most existing text categorization techniques deal with monolingual documents (i.e., all documents are written in one language) during text categorization model learning and category assignment (or prediction). However, with the globalization of business environments and advances in Internet technology, an organization or individual often generates or acquires and subsequently archives documents in different languages, thus creating the need for cross-lingual text categorization (CLTC). Existing studies on CLTC focus on the prediction-corpus translation-based approach, which lacks a systematic mechanism for reducing translation noise and is therefore limited in its cross-lingual categorization effectiveness. Motivated by the need to provide more effective CLTC support, we design a training-corpus translation-based CLTC approach. Using the prediction-corpus translation-based approach as the performance benchmark, our empirical evaluation results show that our proposed CLTC approach achieves significantly better classification effectiveness than the benchmark approach does in both Chinese Text mining Document management Cross-lingual text categorization Text categorization
3	litsift: Automated Text Categorization in Bibliographic Search Faulstich, Lukas C., Stadler, Peter F., Thurner, Caroline, Witwer, Christina 07 January 2019 (has links) In bioinformatics there exist research topics that cannot be uniquely characterized by a set of key words because relevant key words are (i) also heavily used in other contexts and (ii) often omitted in relevant documents because the context is clear to the target audience. Information retrieval interfaces such as entrez/Pubmed produce either low precision or low recall in this case. To yield a high recall at a reasonable precision, the results of a broad information retrieval search have to be filtered to remove irrelevant documents. We use automated text categorization for this purpose. In this study we use the topic of conserved secondary RNA structures in viral genomes as a running example. Pubmed result sets for two virus groups, Picornaviridae and Flaviviridae, have been manually labeled by human experts. We evaluated various classifiers from the Weka toolkit together with different feature selection methods to assess whether classifiers trained on documents dedicated to one virus group can be successfully applied to filter literature on other virus groups. Our results indicate that in this domain a bibliographic search tool trained on a reference corpus may significantly reduce the amount of time needed for extensive literature searches.
4	Reengineering PhysNet in the uPortal framework Zhou, Ye 11 July 2003 (has links) A Digital Library (DL) is an electronic information storage system focused on meeting the information seeking needs of its constituents. As modern DLs often stay in synchronization with the latest progress of technologies in all fields, interoperability among DLs is often hard to achieve. With the advent of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and Open Digital Libraries (ODL), lightweight protocols show a promising future in promoting DL interoperability. Furthermore, a DL is envisaged as a network of independent components working collaboratively through simple standardized protocols. Prior work with ODL shows the feasibility of building componentized DLs with techniques that are a precursor to web services designs. In our study, we examine the feasibility of applying web services to DL design. DL services are modeled as a set of web services offering information dissemination through the Simple Object Access Protocol (SOAP). Additionally, a flexible DL user interface assembly framework is offered in order to build DLs with customizations and personalizations. Our hypothesis is proven and demonstrated in the PhysNet reengineering project. / Master of Science Digital Libraries OAI PhysNet uPortal Text Categorization
5	Técnicas de classificação textual utilizando grafos / Text classification techniques using graphs Silva, Allef Páblo Araújo da 15 March 2019 (has links) O grande volume de informação textual sendo gerado a todo momento torna necessário o aprimoramento constante de sistemas capazes de classificar textos em categorias específicas. Essa categorização visa, por exemplo, separar notícias indexadas por mecanismos de buscas, identificar a autoria de livros e cartas antigas ou detectar plágio em artigos científicos. As técnicas de classificação textual existentes, baseadas em conteúdo, apesar de conseguirem uma boa performance quantitativamente, ainda apresentam dificuldades em lidar com aspectos semânticos presentes nos textos escritos em língua natural. Neste sentido, abordagens alternativas vem sendo propostas, como as baseadas em redes complexas, que levam em consideração apenas o relacionamento entre as palavras. Neste estudo, aplicamos a modelagem de textos como redes complexas e utilizamos as métricas extraídas como atributos para classificação, utilizando um problema de reconhecimento de autoria para ilustrar a aplicação das técnicas descritas ao longo deste texto / The large volume of textual information being generated at all times makes it necessary to constantly improve systems capable of classifying texts into specific categories. This categorization aims, for example, to separate news items indexed by search engines, identify authorship of old books and letters, or detect plagiarism in scientific articles. Existing textual classification techniques, based on content, despite achieving good quantitative performance, still present difficulties in dealing with semantic aspects present in texts written in natural language. In this sense, alternative approaches have been proposed, such as those based on complex networks, which take into account only the relationship between words. 
In this study, we applied text modeling as graphs and extracted metrics typically used in the study of complex networks to be used as classifier attributes. To illustrate these techniques, a problem of authorship recognition in small texts was chosen as an example Classificação textual Complex networks Grafos Graphs Redes complexas Text categorization
6	Induction in Hierarchical Multi-label Domains with Focus on Text Categorization Dendamrongvit, Sareewan 02 May 2011 (has links) Induction of classifiers from sets of preclassified training examples is one of the most popular machine learning tasks. This dissertation focuses on the techniques needed in the field of automated text categorization. Here, each document can be labeled with more than one class, sometimes with many classes. Moreover, the classes are hierarchically organized, the mutual relations being typically expressed in terms of a generalization tree. Both aspects (multi-label classification and hierarchically organized classes) have so far received inadequate attention. Existing literature work largely assumes that it is enough to induce a separate binary classifier for each class, and the question of class hierarchy is rarely addressed. This, however, ignores some serious problems. For one thing, induction of thousands of classifiers from hundreds of thousands of examples described by tens of thousands of features (a common case in automated text categorization) incurs prohibitive computational costs---even a single binary classifier in domains of this kind often takes hours, even days, to induce. For another, the circumstance that the classes are hierarchically organized affects the way we view the classification performance of the induced classifiers. The presented work proposes a technique referred to by the acronym "H-kNN-plus." The technique combines support vector machines and nearest neighbor classifiers with the intention to capitalize on the strengths of both. As for performance evaluation, a variety of measures have been used to evaluate hierarchical classifiers, including the standard non-hierarchical criteria that assign the same weight to different types of error. The author proposes a performance measure that overcomes some of their weaknesses. The dissertation begins with a study of (non-hierarchical) multi-label classification. 
One of the reasons for the poor performance of earlier techniques is the class-imbalance problem---a small number of positive examples being outnumbered by a great many negative examples. Another difficulty is that each of the classes tends to be characterized by a different set of characteristic features. This means that most of the binary classifiers are induced from examples described by predominantly irrelevant features. Addressing these weaknesses by majority-class undersampling and feature selection, the proposed technique significantly improves the overall classification performance. Even more challenging is the issue of hierarchical classification. Here, the dissertation introduces a new induction mechanism, H-kNN-plus, and subjects it to extensive experiments with two real-world datasets. The results indicate its superiority, in these domains, over earlier work in terms of prediction performance as well as computational costs. Induction Text categorization Hierarchical classification Multi-label examples Imbalanced classes
7	Topic-Oriented Collaborative Web Crawling Chung, Chiasen January 2001 (has links) A web crawler is a program that "walks" the Web to gather web resources. In order to scale to the ever-increasing Web, multiple crawling agents may be deployed in a distributed fashion to retrieve web data co-operatively. A common approach is to divide the Web into many partitions with an agent assigned to crawl within each one. If an agent obtains a web resource that is not from its partition, the resource will be transferred to the rightful owner. This thesis proposes a novel approach to distributed web data gathering by partitioning the Web into topics. The proposed approach employs multiple focused crawlers to retrieve pages from various topics. When a crawler retrieves a page of another topic, it transfers the page to the appropriate crawler. This approach is known as topic-oriented collaborative web crawling. An implementation of the system was built and experimentally evaluated. In order to identify the topic of a web page, a topic classifier was incorporated into the crawling system. As the classifier categorizes only English pages, a language identifier was also introduced to distinguish English pages from non-English ones. From the experimental results, we found that redundant retrieval was low and that a resource retrieved by an agent is six times more likely to be retained than in a system that uses a conventional hashing approach. These numbers were viewed as strong indications that topic-oriented collaborative web crawling is a viable approach to web data gathering. Computer Science Web Crawling Distributed System Text Categorization
9	Semantic Relationship Annotation for Knowledge Documents in Knowledge Sharing Environments Pai, Yi-chung 29 July 2004 (has links) A typical online knowledge-sharing environment generates a vast number of formal knowledge elements or interactions that are generally available as textual documents. Thus, effective management of the ever-increasing volume of online knowledge documents is essential to organizational knowledge sharing. Reply-semantic relationships between knowledge documents may exist either explicitly or implicitly. Such reply-semantic relationships, once discovered or identified, would facilitate subsequent knowledge access by providing a novel and more semantic retrieval mechanism. In this study, we propose a preliminary taxonomy of reply-semantic relationships for documents organized in reply-replied structures and develop a SEmantic Enrichment between Knowledge documents (SEEK) technique for automatically annotating reply-semantic relationships between reply-pair documents. Based on content-based text categorization techniques and genre classification techniques, we propose and evaluate different feature-set models: combinations of keyword features, POS statistics features, and/or given/new information (GI/NI) features. Our empirical evaluation results show that the proposed SEEK technique achieves satisfactory classification accuracy. Furthermore, the use of keyword and GI/NI features by the proposed SEEK technique resulted in the best classification accuracy for the Answer/Comment classification task. On the other hand, the use of keyword features alone best differentiates Explanation and Instruction relationships. Genre Classification Text Categorization Reply-semantic Relationship Knowledge Sharing
10	Poly-Lingual Text Categorization Shih, Hui-Hua 09 August 2006 (has links) With the rapid emergence and proliferation of the Internet and the trend of globalization, a tremendous number of textual documents written in different languages are electronically accessible online. Efficiently and effectively managing these textual documents written in different languages is essential to organizations and individuals. Although poly-lingual text categorization (PLTC) can be approached as a set of independent monolingual classifiers, this naïve approach employs only the training documents of the same language to construct a monolingual classifier and fails to utilize the opportunity offered by poly-lingual training documents. Motivated by the significance of and need for such a poly-lingual text categorization technique, we propose a PLTC technique that takes into account all training documents of all languages when constructing a monolingual classifier for a specific language. Using the independent monolingual text categorization (MnTC) technique as our performance benchmark, our empirical evaluation results show that our proposed PLTC technique achieves higher classification accuracy than the benchmark technique does in both English and Chinese corpora. In addition, our empirical results suggest the robustness of the proposed PLTC technique with respect to the range of training sizes investigated. Text categorization Document management Text mining Poly-lingual text categorization
