Global ETD Search

1	An ID-Tree Index Strategy for Information Filtering in Web-Based Systems Wang, Yi-Siang 10 July 2006 (has links) With the booming development of WWW, many search engines have been developed to help users to find useful information from a great quantity of data. However, users may have different needs in different situations. Opposite to the Information Retrieval where users retrieve data actively, Information Filtering (IF) sends information from servers to passive users through broadcast mediums, rather than being searched by them. Therefore, each user has his (or her) profile stored in the database, where a profile records a set of interest items that can present his (or her) interests or habits. To efficiently store many user profiles in servers and filter irrelevant users, many signature-based index techniques are applied in IF systems. By using signatures, IF does not need to compare each item of profiles to filter out irrelevant ones. However, because signatures are incomplete information of profiles, it is very hard to answer the complex queries by using only the signatures. Therefore, a critical issue of the signature-based IF service is how to index the signatures of user profiles for an efficient filtering process. There are often two types of queries in the signature-based IF systems, the inexact filtering and the similarity search queries. In the inexact filtering, a query is an incoming document and it needs to find the profiles whose interest items are all included in the query. On the other hand, in the similarity search, a query is a user profile and it needs to find the users whose interest items are similar to the query user. In this thesis, we propose an ID-tree index strategy, which indexes signatures of user profiles by partitioning them into subgroups using a binary tree structure according to all of the different items among them. Basically, our ID-tree index strategy is a kind of the signature tree. In an ID-tree, each path from the root to a leaf node is the signature of the profile pointed by the leaf node. Because each profile is pointed only by one leaf node of the ID-tree, there will be no collision in the structure. In other words, there will be no two profiles assigned to the same signature. Moreover, only the different items among subgroups of profiles will be checked at one time to filter out irrelevant profiles for queries. Therefore, our strategy can answer the inexact filtering and the similarity search queries with less number of accessed profiles as compared to the previous strategies. Moreover, to build the index of signatures, it needs less time to batch a great deal of database profiles. From our simulation results, we show that our strategy can access less number of profiles to answer the queries than Chen's signature tree strategy for the inexact filtering and Aggarwal et al.'s SG-table strategy for the similarity search. Signature Inexact Filtering Information Filtering Data Partition Similarity Search
2	SVM-Based Negative Data Mining to Binary Classification Jiang, Fuhua 03 August 2006 (has links) The properties of training data set such as size, distribution and the number of attributes significantly contribute to the generalization error of a learning machine. A not well-distributed data set is prone to lead to a partial overfitting model. Two approaches proposed in this dissertation for the binary classification enhance useful data information by mining negative data. First, an error driven compensating hypothesis approach is based on Support Vector Machines (SVMs) with (1+k)-iteration learning, where the base learning hypothesis is iteratively compensated k times. This approach produces a new hypothesis on the new data set in which each label is a transformation of the label from the negative data set, further producing the positive and negative child data subsets in subsequent iterations. This procedure refines the base hypothesis by the k child hypotheses created in k iterations. A prediction method is also proposed to trace the relationship between negative subsets and testing data set by a vector similarity technique. Second, a statistical negative example learning approach based on theoretical analysis improves the performance of the base learning algorithm learner by creating one or two additional hypotheses audit and booster to mine the negative examples output from the learner. The learner employs a regular Support Vector Machine to classify main examples and recognize which examples are negative. The audit works on the negative training data created by learner to predict whether an instance is negative. However, the boosting learning booster is applied when audit does not have enough accuracy to judge learner correctly. Booster works on training data subsets with which learner and audit do not agree. The classifier for testing is the combination of learner, audit and booster. The classifier for testing a specific instance returns the learner's result if audit acknowledges learner's result or learner agrees with audit's judgment, otherwise returns the booster's result. The error of the classifier is decreased to O(e^2) comparing to the error O(e) of a base learning algorithm. Data partition Data classification Vector similarity Multiple passes learning Machine learning Bagging Boosting Support vector machines Data preparation Computer Sciences

Search results

An ID-Tree Index Strategy for Information Filtering in Web-Based Systems

SVM-Based Negative Data Mining to Binary Classification