11. Topic-Oriented Collaborative Web Crawling. Chung, Chiasen. January 2001 (has links)
A <i>web crawler</i> is a program that "walks" the Web to gather web resources. In order to scale to the ever-increasing Web, multiple crawling agents may be deployed in a distributed fashion to retrieve web data co-operatively. A common approach is to divide the Web into many partitions, with an agent assigned to crawl within each one. If an agent obtains a web resource that does not belong to its partition, the resource is transferred to its rightful owner. This thesis proposes a novel approach to distributed web data gathering that partitions the Web into topics. The proposed approach employs multiple focused crawlers to retrieve pages from various topics; when a crawler retrieves a page of another topic, it transfers the page to the appropriate crawler. This approach is known as <i>topic-oriented collaborative web crawling</i>. An implementation of the system was built and experimentally evaluated. In order to identify the topic of a web page, a topic classifier was incorporated into the crawling system. As the classifier categorizes only English pages, a language identifier was also introduced to distinguish English pages from non-English ones. From the experimental results, we found that redundant retrieval was low and that a resource retrieved by an agent was six times more likely to be retained than in a system using a conventional hashing approach. These numbers are strong indications that the <i>topic-oriented collaborative web crawling</i> system is a viable approach to web data gathering.
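To make the partitioning contrast concrete, the following Python sketch (not the thesis implementation) shows how a conventional URL-hashing scheme and the topic-oriented scheme would each decide which agent owns a retrieved page; `classify_topic` and `transfer` are hypothetical stand-ins for the topic classifier and the inter-agent hand-over mechanism.

```python
# Minimal sketch contrasting the two partitioning schemes described above.
# `classify_topic` and `transfer` are hypothetical stand-ins.
import hashlib

AGENTS = ["agent-0", "agent-1", "agent-2"]          # crawling agents
TOPICS = {"sports": 0, "finance": 1, "science": 2}  # topic -> agent index

def owner_by_hash(url: str) -> str:
    """Conventional scheme: partition the Web by hashing the URL."""
    digest = int(hashlib.md5(url.encode()).hexdigest(), 16)
    return AGENTS[digest % len(AGENTS)]

def owner_by_topic(page_text: str, classify_topic) -> str:
    """Topic-oriented scheme: the page's topic decides the owning agent."""
    topic = classify_topic(page_text)    # e.g. returns "finance"
    return AGENTS[TOPICS[topic]]

def dispatch(url, page_text, me, classify_topic, transfer):
    """Forward a retrieved page to its rightful owner if it is not ours."""
    owner = owner_by_topic(page_text, classify_topic)
    if owner != me:
        transfer(owner, url, page_text)  # hand over to the owning agent
    return owner
```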
12. Semantic Relationship Annotation for Knowledge Documents in Knowledge Sharing Environments. Pai, Yi-chung. 29 July 2004 (has links)
A typical online knowledge-sharing environment generates a vast amount of formal knowledge elements or interactions that are generally available as textual documents. Thus, effective management of the ever-increasing volume of online knowledge documents is essential to organizational knowledge sharing. Reply-semantic relationships between knowledge documents may exist either explicitly or implicitly. Such reply-semantic relationships, once discovered or identified, would facilitate subsequent knowledge access by providing a novel and more semantic retrieval mechanism. In this study, we propose a preliminary taxonomy of reply-semantic relationships for documents organized in reply-replied structures and develop a SEmantic Enrichment between Knowledge documents (SEEK) technique for automatically annotating reply-semantic relationships between reply-pair documents. Building on content-based text categorization and genre classification techniques, we propose and evaluate different feature-set models: combinations of keyword features, part-of-speech (POS) statistics features, and/or given/new information (GI/NI) features. Our empirical evaluation results show that the proposed SEEK technique achieves satisfactory classification accuracy. Furthermore, the use of keyword and GI/NI features by the proposed SEEK technique resulted in the best classification accuracy for the Answer/Comment classification task. On the other hand, the use of keyword features alone best differentiates Explanation and Instruction relationships.
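As an illustration of how such feature sets might be combined for a reply pair, a minimal Python sketch follows; the extractors are simplified stand-ins, not the SEEK implementation, and the GI/NI computation is an assumption based on the abstract's description.

```python
# Simplified feature extractors for a reply pair; not the SEEK code.
from collections import Counter

def keyword_features(reply: str, vocab: list) -> list:
    """Keyword features: term frequencies over a fixed vocabulary."""
    counts = Counter(reply.lower().split())
    return [counts[w] for w in vocab]

def given_new_features(original: str, reply: str) -> list:
    """Assumed GI/NI features: proportions of repeated vs. new words."""
    orig = set(original.lower().split())
    rep = set(reply.lower().split())
    given = len(orig & rep)   # "given": words carried over from the original
    new = len(rep - orig)     # "new": words introduced by the reply
    total = max(len(rep), 1)
    return [given / total, new / total]

def feature_vector(original, reply, vocab):
    # Combining feature sets, as in the keyword + GI/NI model the
    # abstract reports worked best for Answer/Comment classification.
    return keyword_features(reply, vocab) + given_new_features(original, reply)
```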
13. Text Categorization for E-Government Applications: The Case of City Mayor's Mailbox. Kuo, Chiung-Jung. 29 August 2006 (has links)
The central government and most local governments in Taiwan have adopted e-mail services that allow citizens to request services or express their opinions over the Internet. Traditionally, these requests and opinions must be manually classified and routed to the appropriate departments for service rendering. However, due to the ever-increasing number of requests and opinions received, this manual classification approach is time-consuming and has become impractical. In this study, we therefore apply text categorization techniques to automatically construct a classification mechanism in order to establish an efficient e-government service portal.
The purpose of this thesis is to investigate the effectiveness of different text categorization methods in supporting the automatic classification of service request/opinion e-mails sent to the Mayor's mailbox. Specifically, in each phase of text categorization learning, we adopt and evaluate two methods commonly employed in prior research. In the feature selection phase, both the maximal χ² statistic method and the weighted average χ² statistic method are evaluated. We consider the Binary and TF-IDF representation schemes in the document representation phase. Finally, we adopt the decision tree induction technique and the support vector machines (SVM) technique for inducing a text categorization model for our target e-government application. Our empirical evaluation results show that the text categorization method that employs the maximal χ² statistic method for feature selection, the Binary representation scheme, and support vector machines as the underlying induction algorithm can reach an accuracy rate of 77.28% and recall and precision rates of more than 77%. Such satisfactory classification effectiveness suggests that the text categorization approach can be employed to establish an effective and intelligent e-government service portal.
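A rough illustration of the winning configuration, assembled with scikit-learn, is sketched below. Note that scikit-learn's `chi2` scorer produces a single score per feature rather than implementing the maximal or weighted average χ² variants the thesis evaluates, so this is an approximation, not the thesis's code; the value of `k` is arbitrary.

```python
# Illustrative pipeline: binary bag-of-words + chi-square selection + SVM.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ("binary_bow", CountVectorizer(binary=True)),  # Binary representation
    ("chi2_top_k", SelectKBest(chi2, k=1000)),     # chi-square selection
    ("svm", LinearSVC()),                          # SVM induction
])

# emails: list of request/opinion texts; departments: their labels
# pipeline.fit(emails, departments)
# predicted = pipeline.predict(new_emails)
```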
14. Development of Information Extraction-based Event Detection Technique. Lee, Yen-Hsien. 30 July 2000 (has links)
Environmental scanning is an important process that acquires and uses information about events, trends, and relationships in an organization's external environment. It permits an organization to adapt to its environment and to develop effective responses to secure or improve its position in the future. An event detection technique that identifies the onset of new events from streams of news stories would facilitate an organization's environmental scanning process. However, traditional feature-based event detection techniques, which determine whether a news story contains an unseen event by comparing word similarity between the story and past news stories, incur some limitations (e.g., the features in a news document may not actually represent the event it describes). Thus, in this study, we developed an information extraction-based event detection (NEED) technique that combines information extraction and text categorization techniques to address the problems inherent in traditional feature-based event detection techniques. The empirical evaluation results showed that the NEED technique outperformed traditional feature-based event detection techniques in both miss rate and false alarm rate, and achieved an event association accuracy comparable to that of its counterpart.
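For context, the traditional feature-based baseline that NEED is contrasted against can be sketched as a single-pass similarity test: a story is declared a new event when its maximum cosine similarity to all past stories falls below a threshold. The sketch below is a generic illustration, with the 0.2 threshold chosen arbitrarily.

```python
# Generic feature-based new-event detection baseline (not the NEED method).
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def detect_new_event(story: str, past: list, threshold: float = 0.2) -> bool:
    """Flag the story as a new event if it resembles no past story."""
    vec = Counter(story.lower().split())
    is_new = all(cosine(vec, p) < threshold for p in past)
    past.append(vec)  # the story joins the history either way
    return is_new
```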
15. Using Text Mining Techniques for Automatically Classifying Public Opinion Documents. Chen, Kuan-hsien. 19 January 2009 (has links)
In a democratic society, the number of public opinion documents increases by the day, and there is a pressing need to classify these documents automatically. The traditional approach to document classification involves word segmentation and the use of stop words, corpora, and grammar analysis to retrieve the key terms of documents. However, with the emergence of new terms, traditional methods that rely on a dictionary or thesaurus may suffer lower accuracy. Therefore, this study proposes a new method that does not require the prior establishment of a dictionary or thesaurus, and that is applicable to documents written in any language and to documents containing unstructured text. Specifically, the classification method employs a genetic algorithm to achieve this goal.
In this method, each training document is represented by several chromosomes, and the characteristic terms of the document are determined from the gene values of these chromosomes. The fitness function, which the genetic algorithm requires to evaluate an evolved chromosome, considers the similarity to the chromosomes of documents of other types.
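A minimal sketch of this encoding follows; the binary-mask representation, Jaccard-based fitness, and mutation operator are illustrative assumptions, since the abstract does not give the exact formulas.

```python
# Assumed GA encoding: a chromosome is a binary mask over a document's
# terms; fitness penalizes overlap with other classes' characteristic terms.
import random

def selected_terms(terms: list, genes: list) -> set:
    """Terms whose gene value is 1 are the document's characteristic terms."""
    return {t for t, g in zip(terms, genes) if g == 1}

def fitness(genes, terms, other_class_term_sets):
    chosen = selected_terms(terms, genes)
    if not chosen:
        return 0.0
    # Jaccard overlap with other classes' term sets lowers fitness.
    overlap = sum(len(chosen & other) / len(chosen | other)
                  for other in other_class_term_sets)
    return 1.0 / (1.0 + overlap)

def mutate(genes, rate=0.05):
    """Flip each gene with a small probability."""
    return [1 - g if random.random() < rate else g for g in genes]
```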
This study used FAQ data from the Taipei City Mayor's e-mail box to evaluate the proposed method while varying the length of the documents. The results show that the proposed method achieves an average accuracy rate of 89%, an average precision rate of 47%, and an average recall rate of 45%. In addition, the F-measure can reach up to 0.7.
The results confirm that the number of training documents, the content of the training documents, the similarity between document types, and the length of the documents all affect the effectiveness of the proposed method.
16. The textcat Package for n-Gram Based Text Categorization in R. Feinerer, Ingo; Buchta, Christian; Geiger, Wilhelm; Rauch, Johannes; Mair, Patrick; Hornik, Kurt. 02 1900 (has links) (PDF)
Identifying the language used will typically be the first step in most natural language processing tasks. Among the wide variety of language identification methods discussed in the literature, the ones employing the Cavnar and Trenkle (1994) approach to text categorization based on character n-gram frequencies have been particularly successful. This paper presents the R extension package textcat for n-gram based text categorization, which implements both the Cavnar and Trenkle approach as well as a reduced n-gram approach designed to remove redundancies of the original approach. A multi-lingual corpus obtained from the Wikipedia pages available on a selection of topics is used to illustrate the functionality of the package and the performance of the provided language identification methods. (authors' abstract)
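Although textcat itself is an R package, the Cavnar and Trenkle method it implements can be sketched in a few lines of Python: build a ranked profile of a text's most frequent character n-grams and pick the language whose profile minimizes the "out-of-place" distance. The profile size and n-gram lengths below follow the commonly cited defaults (top 300 n-grams, n up to 5), not necessarily the package's settings.

```python
# Sketch of Cavnar-Trenkle n-gram language identification.
from collections import Counter

def ngram_profile(text: str, n_max: int = 5, top: int = 300) -> list:
    """Rank the most frequent character n-grams (spaces marked with '_')."""
    text = "_" + text.lower().replace(" ", "_") + "_"
    grams = Counter()
    for n in range(1, n_max + 1):
        grams.update(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(doc_profile: list, lang_profile: list) -> int:
    """Sum of rank displacements; missing n-grams get the maximum penalty."""
    rank = {g: i for i, g in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(abs(i - rank[g]) if g in rank else max_penalty
               for i, g in enumerate(doc_profile))

def identify(text: str, profiles: dict) -> str:
    """profiles: language name -> precomputed n-gram profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))
```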
17. Teksto turinio analizė dirbtinių neuronų tinklais / Textual analysis using artificial neural networks. Šatas, Arūnas. 11 June 2006 (has links)
The theme of this Master's project is the possibility of using artificial neural networks for textual analysis and the automatic categorization of textual documents in editorial programs. The task of the work was to analyze different methods of text classification using different neural networks (SOM, feed-forward, Learning Vector Quantization, etc.). Many researchers work on text classification with artificial neural networks, but there is little practical application of such research. In this work I tried to identify the possibilities and difficulties of applying text classification in practice. I found that the initial amount and quality of information is very important, and that not all neural networks are suitable for solving text categorization problems.
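As an illustration of the feed-forward variant among the methods listed, the following NumPy sketch trains a one-hidden-layer network on bag-of-words document vectors; it is a generic example under assumed hyperparameters, not the thesis's implementation.

```python
# Generic one-hidden-layer feed-forward text classifier on bag-of-words.
import numpy as np

rng = np.random.default_rng(0)

def train_ff(X, y, n_classes, hidden=32, lr=0.1, epochs=200):
    """X: (docs, vocab) term-count matrix; y: integer class labels."""
    W1 = rng.normal(0, 0.1, (X.shape[1], hidden))
    W2 = rng.normal(0, 0.1, (hidden, n_classes))
    Y = np.eye(n_classes)[y]                   # one-hot targets
    for _ in range(epochs):
        H = np.tanh(X @ W1)                    # hidden layer
        Z = H @ W2
        P = np.exp(Z - Z.max(1, keepdims=True))
        P /= P.sum(1, keepdims=True)           # softmax probabilities
        G = (P - Y) / len(X)                   # cross-entropy gradient
        W2 -= lr * H.T @ G
        W1 -= lr * X.T @ ((G @ W2.T) * (1 - H ** 2))
    return W1, W2

def predict(X, W1, W2):
    return (np.tanh(X @ W1) @ W2).argmax(1)
```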
18. ANALYZING AND CATEGORIZING FLOOD DISASTER-RELATED TWEETS FOR EMERGENCY RESPONSE / 危機対応を目的とした洪水災害関連ツイートの分析と分類. Shi, Yongxue. 25 March 2019 (has links)
Affiliated degree program: Inter-Graduate School Program for Sustainable Development and Survivable Societies / Kyoto University / 0048 / New system, doctoral course / Doctor of Engineering / Kou No. 21735 / Engineering Doctorate No. 4552 / 新制||工||1710 (Main Library) / Department of Civil and Earth Resources Engineering, Graduate School of Engineering, Kyoto University / (Chief examiner) Professor Tomoharu Hori, Professor Kaoru Takara, Associate Professor Takahiro Sayama, Professor Yasuto Tachikawa / Qualified under Article 4, Paragraph 1 of the Degree Regulations / Doctor of Philosophy (Engineering) / Kyoto University / DGAM
19. Improving Feature Selection Techniques for Machine Learning. Tan, Feng. 27 November 2007 (has links)
As a commonly used technique in data preprocessing for machine learning, feature selection identifies important features and removes irrelevant, redundant, or noisy features to reduce the dimensionality of the feature space. It improves the efficiency, accuracy, and comprehensibility of the models built by learning algorithms. Feature selection techniques have been widely employed in a variety of applications, such as genomic analysis, information retrieval, and text categorization. Researchers have introduced many feature selection algorithms with different selection criteria. However, it has been discovered that no single criterion is best for all applications. We proposed a hybrid feature selection framework based on genetic algorithms (GAs) that employs a target learning algorithm to evaluate features (a wrapper method); we call it the hybrid genetic feature selection (HGFS) framework. The advantages of this approach include the ability to accommodate multiple feature selection criteria and to find small subsets of features that perform well for the target algorithm. The experiments on genomic data demonstrate that ours is a robust and effective approach that can find subsets of features with higher classification accuracy and/or smaller size than each individual feature selection algorithm. A common characteristic of text categorization tasks is multi-label classification with a great number of features, which makes wrapper methods time-consuming and impractical. We therefore proposed a simple filter (non-wrapper) approach called the Relation Strength and Frequency Variance (RSFV) measure. The basic idea is that informative features are those that are highly correlated with the class and distributed most differently among all classes. The approach is compared with two well-known feature selection methods in experiments on two standard text corpora. The experiments show that RSFV generates equal or better performance than the others in many cases.
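The abstract does not give the RSFV formula, so the following Python sketch only captures the stated intuition, scoring a term by the variance of its relative frequency across classes; treat the scoring function itself as an assumption.

```python
# Assumed RSFV-like filter: terms whose frequency distribution differs
# most across classes receive the highest scores.
import statistics
from collections import Counter, defaultdict

def rsfv_like_scores(docs: list, labels: list) -> dict:
    class_counts = defaultdict(Counter)  # class -> term frequencies
    class_totals = Counter()             # class -> total token count
    for text, label in zip(docs, labels):
        tokens = text.lower().split()
        class_counts[label].update(tokens)
        class_totals[label] += len(tokens)
    vocab = set().union(*class_counts.values())
    scores = {}
    for term in vocab:
        freqs = [class_counts[c][term] / max(class_totals[c], 1)
                 for c in class_counts]
        scores[term] = statistics.pvariance(freqs)  # spread across classes
    return scores

# Keep the top-k terms as the selected feature set, e.g.:
# top_k = sorted(scores, key=scores.get, reverse=True)[:1000]
```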
20. On Travel Article Classification Based on Consumer Information Search Process Model. Hsiao, Yung-Lin. 27 July 2011 (has links)
The information overload problem grows ever more pressing with the explosion of information, and people need agents that help them filter information to meet their personal needs. In this work, we conduct research on article classification in the tourism domain so as to identify articles that meet users' information needs. We propose an information need orientation model for tourism, which consists of four goals: Initiation, Attraction, Accommodation, and Route planning. These goals can be characterized by 13 features, some of which can be enhanced by WordNet and Named Entity Recognition as supplementary techniques. To test the effectiveness of using the 13 features and the relevant methods for classification, we collected 15,797 articles from TripAdvisor.com, the world's largest travel site, and randomly selected 600 articles as training data, labeled by two labelers. The experimental results show that our approach generally achieves comparable or better performance than using purely lexical features, namely TF-IDF, for classification, while using fewer features.
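As a sketch of how one of these feature ideas might be realized, the following Python fragment expands seed keywords for each of the four goals with WordNet synonyms (via NLTK, which requires the WordNet corpus to be downloaded) and counts matches in an article; the seed words and the scoring are illustrative assumptions, not the thesis's 13 features.

```python
# Hypothetical WordNet-enhanced goal scoring; seed words are assumptions.
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

SEEDS = {
    "Initiation": ["plan", "trip"],
    "Attraction": ["museum", "beach"],
    "Accommodation": ["hotel", "hostel"],
    "Route planning": ["route", "itinerary"],
}

def expand(words):
    """Add WordNet synonyms of each seed keyword."""
    out = set(words)
    for w in words:
        for syn in wn.synsets(w):
            out.update(l.name().replace("_", " ") for l in syn.lemmas())
    return out

def goal_scores(article: str) -> dict:
    """Count synonym-expanded keyword hits per information-need goal."""
    tokens = article.lower().split()
    return {goal: sum(tokens.count(w) for w in expand(ws))
            for goal, ws in SEEDS.items()}
```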