471

Using biased support vector machine in image retrieval with self-organizing map.

January 2005 (has links)
Chan Chi Hang. / Thesis submitted in: August 2004. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2005. / Includes bibliographical references (leaves 105-114). / Abstracts in English and Chinese. / Abstract --- p.i / Acknowledgement --- p.iv / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Problem Statement --- p.3 / Chapter 1.2 --- Major Contributions --- p.5 / Chapter 1.3 --- Publication List --- p.6 / Chapter 1.4 --- Thesis Organization --- p.7 / Chapter 2 --- Background Survey --- p.9 / Chapter 2.1 --- Relevance Feedback Framework --- p.9 / Chapter 2.1.1 --- Relevance Feedback Types --- p.11 / Chapter 2.1.2 --- Data Distribution --- p.12 / Chapter 2.1.3 --- Training Set Size --- p.14 / Chapter 2.1.4 --- Inter-Query Learning and Intra-Query Learning --- p.15 / Chapter 2.2 --- History of Relevance Feedback Techniques --- p.16 / Chapter 2.3 --- Relevance Feedback Approaches --- p.19 / Chapter 2.3.1 --- Vector Space Model --- p.19 / Chapter 2.3.2 --- Ad-hoc Re-weighting --- p.26 / Chapter 2.3.3 --- Distance Optimization Approach --- p.29 / Chapter 2.3.4 --- Probabilistic Model --- p.33 / Chapter 2.3.5 --- Bayesian Approach --- p.39 / Chapter 2.3.6 --- Density Estimation Approach --- p.42 / Chapter 2.3.7 --- Support Vector Machine --- p.48 / Chapter 2.4 --- Presentation Set Selection --- p.52 / Chapter 2.4.1 --- Most-probable strategy --- p.52 / Chapter 2.4.2 --- Most-informative strategy --- p.52 / Chapter 3 --- Biased Support Vector Machine for Content-Based Image Retrieval --- p.57 / Chapter 3.1 --- Motivation --- p.57 / Chapter 3.2 --- Background --- p.58 / Chapter 3.2.1 --- Regular Support Vector Machine --- p.59 / Chapter 3.2.2 --- One-class Support Vector Machine --- p.61 / Chapter 3.3 --- Biased Support Vector Machine --- p.63 / Chapter 3.4 --- Interpretation of parameters in BSVM --- p.67 / Chapter 3.5 --- Soft Label Biased Support Vector Machine --- p.69 / Chapter 3.6 --- Interpretation of parameters in Soft Label BSVM --- p.73 / Chapter 3.7 --- Relevance Feedback Using Biased Support Vector Machine --- p.74 / Chapter 3.7.1 --- Advantages of BSVM in Relevance Feedback . . --- p.74 / Chapter 3.7.2 --- Relevance Feedback Algorithm By BSVM --- p.75 / Chapter 3.8 --- Experiments --- p.78 / Chapter 3.8.1 --- Synthetic Dataset --- p.80 / Chapter 3.8.2 --- Real-World Dataset --- p.81 / Chapter 3.8.3 --- Experimental Results --- p.83 / Chapter 3.9 --- Conclusion --- p.86 / Chapter 4 --- Self-Organizing Map-based Inter-Query Learning --- p.88 / Chapter 4.1 --- Motivation --- p.88 / Chapter 4.2 --- Algorithm --- p.89 / Chapter 4.2.1 --- Initialization and Replication of SOM --- p.89 / Chapter 4.2.2 --- SOM Training for Inter-Query Learning --- p.90 / Chapter 4.2.3 --- Incorporate with Intra-Query Learning --- p.92 / Chapter 4.3 --- Experiments --- p.93 / Chapter 4.3.1 --- Synthetic Dataset --- p.95 / Chapter 4.3.2 --- Real-World Dataset --- p.95 / Chapter 4.3.3 --- Experimental Results --- p.97 / Chapter 4.4 --- Conclusion --- p.98 / Chapter 5 --- Conclusion --- p.102 / Bibliography --- p.104
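
The record above lists only the chapter structure. As a rough illustration of the relevance-feedback setting the thesis studies, the Python sketch below approximates a "biased" classifier by giving the user's relevant examples a larger misclassification penalty in an ordinary soft-margin SVM (scikit-learn's SVC). This is not the BSVM formulation itself; the feature dimension, class weights, and data are invented for illustration.

    # Illustrative sketch only: approximating biased relevance feedback with an
    # asymmetrically weighted standard SVM, not the BSVM of the thesis.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    database = rng.normal(size=(1000, 32))        # hypothetical image feature vectors

    # User feedback from one round: a few relevant (+1) and irrelevant (-1) images.
    relevant = database[:5]
    irrelevant = database[5:15]
    X = np.vstack([relevant, irrelevant])
    y = np.array([1] * len(relevant) + [-1] * len(irrelevant))

    # Penalize mistakes on relevant examples more heavily (the "bias" toward positives).
    clf = SVC(kernel="rbf", gamma="scale", class_weight={1: 10.0, -1: 1.0})
    clf.fit(X, y)

    # Rank the whole database by signed distance to the decision boundary and
    # return the top-scoring images for the next feedback round.
    scores = clf.decision_function(database)
    top_k = np.argsort(-scores)[:20]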
472

Information discovery from semi-structured record sets on the Web.

January 2012 (has links)
Since its emergence in the 1990s, the World Wide Web has grown enormously in both depth and breadth, and Web applications have changed people's lives in unprecedented ways. This growth has created a vast and valuable information resource, yet the resource remains under-exploited because the heterogeneity of Web content makes automated information extraction difficult. Web information extraction is therefore a critical step in putting Web information to use. In general, a Web page describes either a single object or a group of similar objects. For example, a page about a digital camera describes the various features of that camera, whereas a department's faculty list presents the basic information of a group of professors. Accordingly, Web information extraction falls into two broad classes: extraction oriented to the details of a single object, and extraction oriented to records of a group of objects. This thesis focuses on the latter, namely extracting a set of semi-structured data records from a single Web page. / This thesis proposes two frameworks for extracting semi-structured data records. The first, RST, is built around a new search structure called the record segmentation tree. With the search strategies designed for it, the record segmentation tree can extract data records from a Web page effectively. In the record segmentation tree, the groups of DOM subtrees corresponding to candidate data records are generated dynamically during the search, which makes RST more flexible than existing methods such as MDR and DEPTA, where subtree groups are generated statically in a predefined manner that ignores the characteristics of the current record region. In addition, RST introduces a similarity measure based on HTML-token units, combining the advantages of the string-edit-distance method of MDR and the tree-edit-distance method of DEPTA. / Many existing methods for data record extraction, including the RST framework, rely on predefined hard criteria and exhaustively search for candidate record regions by traversing the DOM tree. They cannot cope well with the many pages that contain complicated record structures. This thesis therefore proposes a second framework, Skoga, composed of a DOM-structure-knowledge-driven model and a record segmentation tree model. Skoga performs a global analysis of the DOM structure and thus achieves more effective and robust record recognition. The DOM structure knowledge comprises background knowledge and statistical knowledge: the former describes logical relations that constrain the logical structure of the DOM, while the latter describes the characteristics of a DOM node or a group of DOM nodes and is represented by a set of carefully designed features. The feature weights are learned on a development data set by a parameter estimation algorithm based on the structured-output Support Vector Machine, which can handle the dependencies among DOM nodes. A divide-and-conquer optimization method is also proposed to search for the optimal record recognition of a page. / Finally, this thesis presents a framework that uses semi-structured data records to populate Wikipedia categories. The framework takes a few existing entities of a Wikipedia category as seeds and uses these seeds and the attributes in their infoboxes to discover, from the Web, more entities of the same category together with their attribute information. One characteristic of the framework is that it extracts new entities and attributes from semi-structured data records that are collected from the Web automatically. A semi-supervised learning model based on Conditional Random Fields is proposed to perform the target extraction with a limited number of labeled examples; the model defines a proximate record graph to guide the learning process so that a large amount of unlabeled data can be exploited for better extraction. / The World Wide Web has been extensively developed since its first appearance two decades ago. Various applications on the Web have unprecedentedly changed humans' life. Although the explosive growth and spread of the Web have resulted in a huge information repository, it is still under-utilized due to the difficulty in automated information extraction (IE) caused by the heterogeneity of Web content. Thus, Web IE is an essential task in the utilization of Web information. Typically, a Web page may describe either a single object or a group of similar objects. For example, the description page of a digital camera describes different aspects of the camera. On the contrary, the faculty list page of a department presents the information of a group of professors. Corresponding to the above two types, Web IE methods can be broadly categorized into two classes, namely, description details oriented extraction and object records oriented extraction. In this thesis, we focus on the latter task, namely semi-structured data record extraction from a single Web page. / In this thesis, we develop two frameworks to tackle the task of data record extraction. We first present a record segmentation search tree framework in which a new search structure, named Record Segmentation Tree (RST), is designed and several efficient search pruning strategies on the RST structure are proposed to identify the records in a given Web page. The subtree groups corresponding to possible data records are dynamically generated in the RST structure during the search process. Therefore, this framework is more flexible compared with existing methods such as MDR and DEPTA that have a static manner of generating subtree groups. Furthermore, instead of using string edit distance or tree edit distance, we propose a token-based edit distance which takes each DOM node as a basic unit in the cost calculation. / Many existing methods for data record extraction from Web pages, including the RST framework, contain pre-coded hard criteria and adopt an exhaustive search strategy for traversing the DOM tree. 
They fail to handle many challenging pages containing complicated data records and record regions. In this thesis, we also present another framework, Skoga, which can perform robust detection of different kinds of data records and record regions. Skoga, composed of a DOM structure knowledge driven detection model and a record segmentation search tree model, can conduct a global analysis on the DOM structure to achieve effective detection. The DOM structure knowledge consists of background knowledge as well as statistical knowledge capturing different characteristics of data records and record regions as exhibited in the DOM structure. Specifically, the background knowledge encodes some logical relations governing certain structural constraints in the DOM structure. The statistical knowledge is represented by some carefully designed features that capture different characteristics of a single node or a node group in the DOM. The feature weights are determined using a development data set via a parameter estimation algorithm based on the structured output Support Vector Machine model, which can tackle the inter-dependency among the labels on the nodes of the DOM structure. An optimization method based on the divide-and-conquer principle is developed, making use of the DOM structure knowledge to quantitatively infer the best record and region recognition. / Finally, we present a framework that can make use of the detected data records to automatically populate existing Wikipedia categories. This framework takes a few existing entities that are automatically collected from a particular Wikipedia category as seed input and explores their attribute infoboxes to obtain clues for the discovery of more entities for this category and the attribute content of the newly discovered entities. One characteristic of this framework is to conduct discovery and extraction from desirable semi-structured data record sets which are automatically collected from the Web. A semi-supervised learning model with Conditional Random Fields is developed to deal with the issues of extraction learning and the limited number of labeled examples derived from the seed entities. We make use of a proximate record graph to guide the semi-supervised learning process. The graph captures alignment similarity among data records. Then the semi-supervised learning process can leverage the benefit of the unlabeled data in the record set by controlling the label regularization under the guidance of the proximate record graph. / Bing, Lidong. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2012. / Includes bibliographical references (leaves 114-123). / Abstract also in Chinese. 
/ Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Web Era and Web IE --- p.1 / Chapter 1.2 --- Semi-structured Record and Region Detection --- p.3 / Chapter 1.2.1 --- Problem Setting --- p.3 / Chapter 1.2.2 --- Observations and Challenges --- p.5 / Chapter 1.2.3 --- Our Proposed First Framework - Record Segmentation Tree --- p.9 / Chapter 1.2.4 --- Our Proposed Second Framework - DOM Structure Knowledge Oriented Global Analysis --- p.10 / Chapter 1.3 --- Entity Expansion and Attribute Acquisition with Semi-structured Data Records --- p.13 / Chapter 1.3.1 --- Problem Setting --- p.13 / Chapter 1.3.2 --- Our Proposed Framework - Semi-supervised CRF Regularized by Proximate Graph --- p.15 / Chapter 1.4 --- Outline of the Thesis --- p.17 / Chapter 2 --- Literature Survey --- p.19 / Chapter 2.1 --- Semi-structured Record Extraction --- p.19 / Chapter 2.2 --- Entity Expansion and Attribute Acquisition --- p.23 / Chapter 3 --- Record Segmentation Tree (RST) Framework --- p.27 / Chapter 3.1 --- Overview --- p.27 / Chapter 3.2 --- Record Segmentation Tree --- p.29 / Chapter 3.2.1 --- Basic Record Segmentation Tree --- p.29 / Chapter 3.2.2 --- Slimmed Segmentation Tree --- p.30 / Chapter 3.2.3 --- Utilize RST in Record Extraction --- p.31 / Chapter 3.3 --- Search Pruning Strategies --- p.33 / Chapter 3.3.1 --- Threshold-Based Top k Search --- p.33 / Chapter 3.3.2 --- Complexity Analysis --- p.35 / Chapter 3.3.3 --- Composite Node Pruning --- p.37 / Chapter 3.3.4 --- More Challenging Record Region Discussion --- p.37 / Chapter 3.4 --- Similarity Measure --- p.41 / Chapter 3.4.1 --- Encoding Subtree with Tokens --- p.42 / Chapter 3.4.2 --- Tandem Repeat Detection and Distance-based Measure --- p.42 / Chapter 4 --- DOM Structure Knowledge Oriented Global Analysis (Skoga) Framework --- p.45 / Chapter 4.1 --- Overview --- p.45 / Chapter 4.2 --- Design of DOM Structure Knowledge --- p.49 / Chapter 4.2.1 --- Background Knowledge --- p.49 / Chapter 4.2.2 --- Statistical Knowledge --- p.51 / Chapter 4.3 --- Finding Optimal Label Assignment --- p.54 / Chapter 4.3.1 --- Inference for Bottom Subtrees --- p.55 / Chapter 4.3.2 --- Recursive Inference for Higher Subtree --- p.57 / Chapter 4.3.3 --- Backtracking for the Optimal Label Assignment --- p.59 / Chapter 4.3.4 --- Second Optimal Label Assignment --- p.60 / Chapter 4.4 --- Statistical Knowledge Acquisition --- p.62 / Chapter 4.4.1 --- Finding Feature Weights via Structured Output SVM Learning --- p.62 / Chapter 4.4.2 --- Region-oriented Loss --- p.63 / Chapter 4.4.3 --- Cost Function Optimization --- p.65 / Chapter 4.5 --- Record Segmentation and Reassembling --- p.66 / Chapter 5 --- Experimental Results of Data Record Extraction --- p.68 / Chapter 5.1 --- Evaluation Data Set --- p.68 / Chapter 5.2 --- Experimental Setup --- p.70 / Chapter 5.3 --- Experimental Results on TBDW --- p.73 / Chapter 5.4 --- Experimental Results on Hybrid Data Set with Nested Region --- p.76 / Chapter 5.5 --- Experimental Results on Hybrid Data Set with Intertwined Region --- p.78 / Chapter 5.6 --- Empirical Case Studies --- p.79 / Chapter 5.6.1 --- Case Study One --- p.80 / Chapter 5.6.2 --- Case Study Two --- p.83 / Chapter 6 --- Semi-supervised CRF Regularized by Proximate Graph --- p.85 / Chapter 6.1 --- Overview --- p.85 / Chapter 6.2 --- Semi-structured Data Record Set Collection --- p.88 / Chapter 6.3 --- Semi-supervised Learning Model for Extraction --- p.89 / Chapter 6.3.1 --- Proximate Record Graph Construction --- p.91 / Chapter 6.3.2 --- Semi-Markov CRF and 
Features --- p.94 / Chapter 6.3.3 --- Posterior Regularization --- p.95 / Chapter 6.3.4 --- Inference with Regularized Posterior --- p.97 / Chapter 6.3.5 --- Semi-supervised Training --- p.97 / Chapter 6.3.6 --- Result Ranking --- p.98 / Chapter 6.4 --- Derived Training Example Generation --- p.99 / Chapter 6.5 --- Experiments --- p.100 / Chapter 6.5.1 --- Experiment Setting --- p.100 / Chapter 6.5.2 --- Entity Expansion --- p.103 / Chapter 6.5.3 --- Attribute Extraction --- p.107 / Chapter 7 --- Conclusions and Future Work --- p.110 / Chapter 7.1 --- Conclusions --- p.110 / Chapter 7.2 --- Future Work --- p.112 / Bibliography --- p.113
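
The abstract above motivates a token-based edit distance in which each DOM node is one unit of cost. A minimal sketch of that idea, plain Levenshtein distance over tag-token sequences with invented example records (not the thesis' exact subtree encoding or cost model), is:

    # Token-based edit distance: each DOM tag is one token, so the cost model sits
    # between plain string edit distance and full tree edit distance.
    def token_edit_distance(a, b):
        """Levenshtein distance over token sequences (unit insert/delete/substitute cost)."""
        m, n = len(a), len(b)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dp[i][0] = i
        for j in range(n + 1):
            dp[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                               dp[i][j - 1] + 1,        # insert
                               dp[i - 1][j - 1] + cost) # substitute
        return dp[m][n]

    # Two candidate record subtrees flattened into tag-token sequences (invented).
    record_a = ["tr", "td", "img", "td", "a", "td"]
    record_b = ["tr", "td", "img", "td", "a", "td", "span"]
    print(token_edit_distance(record_a, record_b))  # 1 -> the records are near-duplicates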
473

Probabilistic models for information extraction: from cascaded approach to joint approach. / CUHK electronic theses & dissertations collection

January 2010 (has links)
Based on these observations and analysis, we propose a joint discriminative probabilistic framework to optimize all relevant subtasks simultaneously. This framework defines a joint probability distribution for both segmentations in sequence data and relations of segments in the form of an exponential family. This model allows tight interactions between segmentations and relations of segments, and it offers a natural way of handling IE tasks. Since exact parameter estimation and inference are prohibitively intractable, a structured variational inference algorithm is developed to perform parameter estimation approximately. For inference, we propose a strong bi-directional Metropolis-Hastings (MH) approach to find the MAP assignments for joint segmentations and relations, exploring mutual benefits in both directions so that segmentations can aid relations, and vice versa. / Information Extraction (IE) aims at identifying specific pieces of information (data) in an unstructured or semi-structured textual document and transforming unstructured information in a corpus of documents or Web pages into a structured database. There are several representative tasks in IE: named entity recognition (NER), which aims at identifying phrases that denote types of named entities; entity relation extraction, which aims at discovering the events or relations related to the entities; and coreference resolution, which aims at determining whether two extracted mentions of entities refer to the same object. IE is useful for a wide variety of applications. / The end-to-end performance of high-level IE systems for compound tasks is often hampered by the use of cascaded frameworks. The integrated model we proposed can alleviate some of these problems, but it is only loosely coupled. Parameter estimation is performed independently and it only allows information to flow in one direction. In this top-down integration model, the decision of the bottom sub-model could guide the decision of the upper sub-model, but not vice versa. Thus, deep interactions and dependencies between different tasks can hardly be captured well. / We have investigated and developed a cascaded framework in an attempt to consider entity extraction and qualitative domain knowledge based on undirected, discriminatively-trained probabilistic graphical models. This framework consists of two stages and combines statistical learning with first-order logic. As a pipeline model, the first stage is a base model and the second stage is used to validate and correct the errors made in the base model. We incorporated domain knowledge that can be well formulated into first-order logic to extract entity candidates from the base model. We have applied this framework and achieved encouraging results in Chinese NER on the People's Daily corpus. / We perform extensive experiments on three important IE tasks using real-world datasets, namely Chinese NER, entity identification and relationship extraction from Wikipedia's encyclopedic articles, and citation matching, to test our proposed models, including the bidirectional model, the integrated model, and the joint model. Experimental results show that our models significantly outperform current state-of-the-art probabilistic models, such as decoupled and joint models, illustrating the feasibility and promise of our proposed approaches. (Abstract shortened by UMI.) 
/ We present a general, strongly-coupled, and bidirectional architecture based on discriminatively trained factor graphs for information extraction, which consists of two components---segmentation and relation. First we introduce joint factors connecting variables of relevant subtasks to capture dependencies and interactions between them. We then propose a strong bidirectional Markov chain Monte Carlo (MCMC) sampling inference algorithm which allows information to flow in both directions to find the approximate maximum a posteriori (MAP) solution for all subtasks. Notably, our framework is considerably simpler to implement, and outperforms previous ones. / Yu, Xiaofeng. / Adviser: Zam Wai. / Source: Dissertation Abstracts International, Volume: 72-04, Section: B, page: . / Thesis (Ph.D.)--Chinese University of Hong Kong, 2010. / Includes bibliographical references (leaves 109-123). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. Ann Arbor, MI : ProQuest Information and Learning Company, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese.
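
The abstract describes the joint exponential-family model only in words. In generic notation (not the thesis' own symbols), such a joint distribution over a segmentation s and the relations r among its segments, given an observation x, can be written as

    p(s, r \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_k \lambda_k f_k(x, s) + \sum_l \mu_l g_l(x, s, r) \Big),
    \qquad Z(x) = \sum_{s', r'} \exp\Big( \sum_k \lambda_k f_k(x, s') + \sum_l \mu_l g_l(x, s', r') \Big)

where the f_k are segmentation factors, the g_l are joint factors coupling segments with their relations, and the sum over all (s', r') in Z(x) is what makes exact inference intractable and motivates the variational and MCMC approximations mentioned in the abstract.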
474

Information fusion for monolingual and cross-language spoken document retrieval. / CUHK electronic theses & dissertations collection / Digital dissertation consortium

January 2002 (has links)
Lo Wai-kit. / "October 2002." / Thesis (Ph.D.)--Chinese University of Hong Kong, 2002. / Includes bibliographical references (p. 170-184). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. Ann Arbor, MI : ProQuest Information and Learning Company, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Mode of access: World Wide Web. / Abstracts in English and Chinese.
475

Parameter free document stream classification. / CUHK electronic theses & dissertations collection

January 2006 (has links)
Extensive experiments are conducted to evaluate the effectiveness of PFreeBT and PNLH using a two-year stream of news stories and three benchmarks. The results show that the patterns of the bursty features and bursty topics identified by PFreeBT match our expectations, while PNLH demonstrates significant improvements over all of the existing heuristics. These favorable results indicate that both PFreeBT and PNLH are highly effective and feasible. / For the problem of bursty topic identification, PFreeBT adopts what we term a feature-pivot clustering approach. Given a document stream, PFreeBT first identifies a set of bursty features from it. The identification process is based on computing probability distributions. According to the patterns of the bursty features and two newly defined concepts (equivalent and map-to), a set of bursty topics can be extracted. / We formulate the problem of constructing a reliable classifier as a partially supervised classification problem, in which only a few training examples are labeled as positive (P) and all other training examples (U) remain unlabeled. Here, U is a mixture of negative examples (N) and some other positive examples (P'). Existing techniques that tackle this problem all focus on finding N from U; none of them attempts to extract P' from U. Indeed, it is difficult to do so, as the topics in U are diverse and the features there are sparse. In this dissertation, PNLH is proposed for extracting high-quality P' and N from U. / In this dissertation, two heuristics, PFreeBT and PNLH, are proposed to tackle the aforementioned problems. PFreeBT aims at identifying the bursty topics in a document stream, whereas PNLH aims at constructing a reliable classifier for a given bursty topic. It is worth noting that both heuristics are parameter-free. Users do not need to provide any parameter explicitly. All of the required variables can be computed automatically from the given document stream. / In this age of information overload, information becomes ever more pervasive. A new class of data-intensive applications has arisen in which data is best modeled as an open-ended stream; we call such data a data stream. A document stream is a variation of a data stream consisting of a sequence of chronologically ordered documents. A fundamental problem in mining document streams is to extract meaningful structure from them, so as to help organize their contents systematically. This dissertation studies two such problems: identifying the bursty topics in a document stream and constructing classifiers for the bursty topics. A bursty topic is a topic in the document stream to which a large number of documents are related during a bounded time interval. / Fung Pui Cheong Gabriel. / "August 2006." / Adviser: Jeffrey Xu Yu. / Source: Dissertation Abstracts International, Volume: 68-03, Section: B, page: 1720. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2006. / Includes bibliographical references (p. 122-130). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstracts in English and Chinese. 
/ School code: 1307.
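
The abstract above says bursty features are identified by comparing probability distributions but gives no formula. As a loose, hypothetical stand-in (not PFreeBT's actual parameter-free criterion), the sketch below flags days on which a term's document frequency deviates sharply from its background rate, using a binomial z-score on invented counts.

    # Rough sketch of spotting a bursty feature in a document stream: compare each day's
    # observed document frequency of a term against its overall rate with a binomial
    # z-score. PFreeBT's actual criterion differs; the counts here are invented.
    import numpy as np

    docs_per_day = np.array([200, 180, 220, 210, 190, 205, 215])   # stream volume per day
    term_docs_per_day = np.array([4, 3, 5, 60, 58, 6, 4])          # docs containing the term

    p = term_docs_per_day.sum() / docs_per_day.sum()               # background rate of the term
    expected = docs_per_day * p
    std = np.sqrt(docs_per_day * p * (1 - p))
    z = (term_docs_per_day - expected) / std

    bursty_days = np.where(z > 3.0)[0]                             # days 3 and 4 in this toy data
    print(bursty_days, np.round(z, 1))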
476

Statistical machine learning for data mining and collaborative multimedia retrieval. / CUHK electronic theses & dissertations collection

January 2006 (has links)
Another issue studied in the framework is Distance Metric Learning (DML). Learning distance metrics is critical to many machine learning tasks, especially when contextual information is available. To learn effective metrics from pairwise contextual constraints, two novel methods, Discriminative Component Analysis (DCA) and Kernel DCA, are proposed to learn both linear and nonlinear distance metrics. Empirical results on data clustering validate the advantages of the algorithms. / Based on this unified learning framework, a novel scheme is suggested for learning Unified Kernel Machines (UKM). The UKM scheme combines supervised kernel machine learning, unsupervised kernel design, semi-supervised kernel learning, and active learning in an effective fashion. A key component in the UKM scheme is to learn kernels from both labeled and unlabeled data. For this purpose, a new Spectral Kernel Learning (SKL) algorithm is proposed, which is related to a quadratic program. Empirical results show that the UKM technique is promising for classification tasks. / In addition to the above methodologies, this thesis also addresses some practical issues in applying machine learning techniques to real-world applications. For example, in a time-dependent data mining application, marginalized kernel techniques are suggested for formulating an effective domain-specific kernel aimed at web data mining tasks. / Last, the thesis investigates statistical machine learning techniques with applications to multimedia retrieval and addresses some practical issues, such as robustness to noise and scalability. To bridge the semantic gap in multimedia retrieval, a Collaborative Multimedia Retrieval (CMR) scheme is proposed to exploit historical log data of users' relevance feedback for improving retrieval tasks. Two types of learning tasks in the CMR scheme are identified, and two innovative algorithms are proposed to solve them effectively. / Statistical machine learning techniques have been widely applied in data mining and multimedia information retrieval. While traditional methods, such as supervised learning, unsupervised learning, and active learning, have been extensively studied separately, there are few comprehensive schemes that investigate these techniques in a unified approach. This thesis proposes a unified learning paradigm (ULP) framework that integrates several machine learning techniques, including supervised learning, unsupervised learning, semi-supervised learning, active learning, and metric learning, in a synergistic way to maximize the effectiveness of a learning task. / Within the unified learning framework, this thesis further explores two important challenging tasks. One is Batch Mode Active Learning (BMAL). In contrast to traditional approaches, the BMAL method searches for a batch of informative examples for labeling. To develop an effective algorithm, the BMAL task is formulated as a convex optimization problem, and a novel bound optimization algorithm is proposed to solve it efficiently for the global optimum. Extensive evaluations on text categorization tasks show that the BMAL algorithm is superior to traditional methods. / Hoi Chu Hong. / "September 2006." / Adviser: Michael R. Lyu. / Source: Dissertation Abstracts International, Volume: 68-03, Section: B, page: 1723. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2006. / Includes bibliographical references (p. 203-223). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstracts in English and Chinese. / School code: 1307.
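
The abstract describes batch mode active learning as a convex optimization problem solved by bound optimization; that algorithm is not reproduced here. As a simplified, hypothetical stand-in for the same idea (pick a batch that is both uncertain and diverse), the following greedy sketch scores unlabeled points by closeness to the decision boundary plus distance to the points already chosen.

    # Greedy stand-in for batch-mode active learning: pick unlabeled points that are
    # uncertain (near the boundary) yet mutually diverse. The thesis instead solves a
    # convex bound-optimization problem; this heuristic and the data are illustrative.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics.pairwise import euclidean_distances

    rng = np.random.default_rng(1)
    X_lab = rng.normal(size=(40, 10))
    y_lab = (X_lab[:, 0] > 0).astype(int)          # synthetic labeled seed set
    X_pool = rng.normal(size=(500, 10))            # unlabeled pool

    clf = LogisticRegression().fit(X_lab, y_lab)
    margin = np.abs(clf.decision_function(X_pool)) # small margin = uncertain

    batch, candidates = [], list(range(len(X_pool)))
    for _ in range(5):                             # select a batch of 5 points
        if batch:
            d = euclidean_distances(X_pool[candidates], X_pool[batch]).min(axis=1)
        else:
            d = np.ones(len(candidates))
        score = -margin[candidates] + 0.5 * d      # trade off uncertainty vs. diversity
        pick = candidates[int(np.argmax(score))]
        batch.append(pick)
        candidates.remove(pick)
    print(batch)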
477

Redundancy on content-based indexing.

January 1997 (has links)
by Cheung King Lum Kingly. / Thesis (M.Phil.)--Chinese University of Hong Kong, 1997. / Includes bibliographical references (leaves 108-110). / Abstract --- p.ii / Acknowledgement --- p.iii / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Motivation --- p.1 / Chapter 1.2 --- Problems in Content-Based Indexing --- p.2 / Chapter 1.3 --- Contributions --- p.3 / Chapter 1.4 --- Thesis Organization --- p.4 / Chapter 2 --- Content-Based Indexing Structures --- p.5 / Chapter 2.1 --- R-Tree --- p.6 / Chapter 2.2 --- R+-Tree --- p.8 / Chapter 2.3 --- R*-Tree --- p.11 / Chapter 3 --- Searching in Both R-Tree and R*-Tree --- p.15 / Chapter 3.1 --- Exact Search --- p.15 / Chapter 3.2 --- Nearest Neighbor Search --- p.19 / Chapter 3.2.1 --- Definition of Searching Metrics --- p.19 / Chapter 3.2.2 --- Pruning Heuristics --- p.21 / Chapter 3.2.3 --- Nearest Neighbor Search Algorithm --- p.24 / Chapter 3.2.4 --- Generalization to N-Nearest Neighbor Search --- p.25 / Chapter 4 --- An Improved Nearest Neighbor Search Algorithm for R-Tree --- p.29 / Chapter 4.1 --- Introduction --- p.29 / Chapter 4.2 --- New Pruning Heuristics --- p.31 / Chapter 4.3 --- An Improved Nearest Neighbor Search Algorithm --- p.34 / Chapter 4.4 --- Replacing Heuristics --- p.36 / Chapter 4.5 --- N-Nearest Neighbor Search --- p.41 / Chapter 4.6 --- Performance Evaluation --- p.45 / Chapter 5 --- Overlapping Nodes in R-Tree and R*-Tree --- p.53 / Chapter 5.1 --- Overlapping Nodes --- p.54 / Chapter 5.2 --- Problem Induced By Overlapping Nodes --- p.57 / Chapter 5.2.1 --- Backtracking --- p.57 / Chapter 5.2.2 --- Inefficient Exact Search --- p.57 / Chapter 5.2.3 --- Inefficient Nearest Neighbor Search --- p.60 / Chapter 6 --- Redundancy On R-Tree --- p.64 / Chapter 6.1 --- Motivation --- p.64 / Chapter 6.2 --- Adding Redundancy on Index Tree --- p.65 / Chapter 6.3 --- R-Tree with Redundancy --- p.66 / Chapter 6.3.1 --- Previous Models of R-Tree with Redundancy --- p.66 / Chapter 6.3.2 --- Redundant R-Tree --- p.70 / Chapter 6.3.3 --- Level List --- p.71 / Chapter 6.3.4 --- Inserting Redundancy to R-Tree --- p.72 / Chapter 6.3.5 --- Properties of Redundant R-Tree --- p.77 / Chapter 7 --- Searching in Redundant R-Tree --- p.82 / Chapter 7.1 --- Exact Search --- p.82 / Chapter 7.2 --- Nearest Neighbor Search --- p.86 / Chapter 7.3 --- Avoidance of Multiple Accesses --- p.89 / Chapter 8 --- Experiment --- p.90 / Chapter 8.1 --- Experimental Setup --- p.90 / Chapter 8.2 --- Exact Search --- p.91 / Chapter 8.2.1 --- Clustered Data --- p.91 / Chapter 8.2.2 --- Real Data --- p.93 / Chapter 8.3 --- Nearest Neighbor Search --- p.95 / Chapter 8.3.1 --- Clustered Data --- p.95 / Chapter 8.3.2 --- Uniform Data --- p.98 / Chapter 8.3.3 --- Real Data --- p.100 / Chapter 8.4 --- Discussion --- p.102 / Chapter 9 --- Conclusions and Future Research --- p.105 / Chapter 9.1 --- Conclusions --- p.105 / Chapter 9.2 --- Future Research --- p.106 / Bibliography --- p.108
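
The table of contents above refers to pruning heuristics for nearest-neighbor search on R-trees; the central one, MINDIST, can be sketched as follows. This toy example uses a flat list of leaf rectangles rather than a real R-tree, and it only illustrates the pruning idea, not the improved heuristics proposed in the thesis.

    # MINDIST pruning sketch: a subtree can be skipped when the minimum possible
    # distance from the query to its bounding rectangle already exceeds the best
    # distance found so far. The rectangles and points below are invented.
    import math

    def mindist(q, rect):
        """Minimum distance from point q to an axis-aligned rectangle ((xlo, ylo), (xhi, yhi))."""
        (xlo, ylo), (xhi, yhi) = rect
        dx = max(xlo - q[0], 0, q[0] - xhi)
        dy = max(ylo - q[1], 0, q[1] - yhi)
        return math.hypot(dx, dy)

    # Each "leaf node": (bounding rectangle, list of points it contains).
    nodes = [(((0, 0), (2, 2)), [(0.5, 0.5), (1.5, 1.8)]),
             (((5, 5), (8, 9)), [(6.0, 7.0), (7.5, 8.5)]),
             (((2, 6), (3, 7)), [(2.5, 6.5)])]

    def nearest(q):
        best, best_d = None, float("inf")
        # Visit nodes in order of MINDIST and prune those that cannot improve the answer.
        for rect, points in sorted(nodes, key=lambda n: mindist(q, n[0])):
            if mindist(q, rect) >= best_d:
                continue
            for p in points:
                d = math.hypot(p[0] - q[0], p[1] - q[1])
                if d < best_d:
                    best, best_d = p, d
        return best, best_d

    print(nearest((2.8, 5.0)))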
478

Fuzzy clustering for content-based indexing in multimedia databases.

January 2001 (has links)
Yue Ho-Yin. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2001. / Includes bibliographical references (leaves 129-137). / Abstracts in English and Chinese. / Abstract --- p.i / Acknowledgement --- p.iv / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Problem Definition --- p.7 / Chapter 1.2 --- Contributions --- p.8 / Chapter 1.3 --- Thesis Organization --- p.10 / Chapter 2 --- Literature Review --- p.11 / Chapter 2.1 --- "Content-based Retrieval, Background and Indexing Problem" --- p.11 / Chapter 2.1.1 --- Feature Extraction --- p.12 / Chapter 2.1.2 --- Nearest-neighbor Search --- p.13 / Chapter 2.1.3 --- Content-based Indexing Methods --- p.15 / Chapter 2.2 --- Indexing Problems --- p.25 / Chapter 2.3 --- Data Clustering Methods for Indexing --- p.26 / Chapter 2.3.1 --- Probabilistic Clustering --- p.27 / Chapter 2.3.2 --- Possibilistic Clustering --- p.34 / Chapter 3 --- Fuzzy Clustering Algorithms --- p.37 / Chapter 3.1 --- Fuzzy Competitive Clustering --- p.38 / Chapter 3.2 --- Sequential Fuzzy Competitive Clustering --- p.40 / Chapter 3.3 --- Experiments --- p.43 / Chapter 3.3.1 --- Experiment 1: Data set with different number of samples --- p.44 / Chapter 3.3.2 --- Experiment 2: Data set on different dimensionality --- p.46 / Chapter 3.3.3 --- Experiment 3: Data set with different number of natural clusters inside --- p.55 / Chapter 3.3.4 --- Experiment 4: Data set with different noise level --- p.56 / Chapter 3.3.5 --- Experiment 5: Clusters with different geometry size --- p.60 / Chapter 3.3.6 --- Experiment 6: Clusters with different number of data instances --- p.67 / Chapter 3.3.7 --- Experiment 7: Performance on real data set --- p.71 / Chapter 3.4 --- Discussion --- p.72 / Chapter 3.4.1 --- "Differences Between FCC, SFCC, and Others Clustering Algorithms" --- p.72 / Chapter 3.4.2 --- Variations on SFCC --- p.75 / Chapter 3.4.3 --- Why SFCC? --- p.75 / Chapter 4 --- Hierarchical Indexing based on Natural Clusters Information --- p.77 / Chapter 4.1 --- The Hierarchical Approach --- p.77 / Chapter 4.2 --- The Sequential Fuzzy Competitive Clustering Binary Tree (SFCC- b-tree) --- p.79 / Chapter 4.2.1 --- Data Structure of SFCC-b-tree --- p.80 / Chapter 4.2.2 --- Tree Building of SFCC-b-Tree --- p.82 / Chapter 4.2.3 --- Insertion of SFCC-b-tree --- p.83 / Chapter 4.2.4 --- Deletion of SFCC-b-Tree --- p.84 / Chapter 4.2.5 --- Searching in SFCC-b-Tree --- p.84 / Chapter 4.3 --- Experiments --- p.88 / Chapter 4.3.1 --- Experimental Setting --- p.88 / Chapter 4.3.2 --- Experiment 8: Test for different leaf node sizes --- p.90 / Chapter 4.3.3 --- Experiment 9: Test for different dimensionality --- p.97 / Chapter 4.3.4 --- Experiment 10: Test for different sizes of data sets --- p.104 / Chapter 4.3.5 --- Experiment 11: Test for different data distributions --- p.109 / Chapter 4.4 --- Summary --- p.113 / Chapter 5 --- A Case Study on SFCC-b-tree --- p.114 / Chapter 5.1 --- Introduction --- p.114 / Chapter 5.2 --- Data Collection --- p.115 / Chapter 5.3 --- Data Pre-processing --- p.116 / Chapter 5.4 --- Experimental Results --- p.119 / Chapter 5.5 --- Summary --- p.121 / Chapter 6 --- Conclusion --- p.122 / Chapter 6.1 --- An Efficiency Formula --- p.122 / Chapter 6.1.1 --- Motivation --- p.122 / Chapter 6.1.2 --- Regression Model --- p.123 / Chapter 6.1.3 --- Discussion --- p.124 / Chapter 6.2 --- Future Directions --- p.127 / Chapter 6.3 --- Conclusion --- p.128 / Bibliography --- p.129
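
The thesis' SFCC algorithm is only named in the table of contents above. As a generic, hypothetical illustration of fuzzy competitive clustering (not SFCC's own membership function or its rule for creating clusters on the fly), the sketch below pulls every prototype toward each incoming sample in proportion to a distance-based fuzzy membership.

    # Generic online fuzzy competitive learning sketch; SFCC adds its own membership
    # definition and cluster-creation rule, which are not reproduced here.
    import numpy as np

    rng = np.random.default_rng(2)
    # Three synthetic 2-D clusters streamed in random order (invented data).
    data = np.vstack([rng.normal(loc, 0.3, size=(100, 2)) for loc in [(0, 0), (3, 3), (0, 4)]])
    rng.shuffle(data)

    prototypes = data[:3].copy()        # crude initialization from the first samples
    lr = 0.05
    for x in data:
        d2 = ((prototypes - x) ** 2).sum(axis=1)
        u = np.exp(-(d2 - d2.min()))
        u /= u.sum()                    # fuzzy memberships: non-negative, sum to 1
        prototypes += lr * u[:, None] * (x - prototypes)

    print(np.round(prototypes, 2))      # prototypes move toward dense regions of the stream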
479

Peer clustering and firework query model in peer-to-peer networks.

January 2003 (has links)
Ng, Cheuk Hang. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2003. / Includes bibliographical references (leaves 89-95). / Abstracts in English and Chinese. / Abstract --- p.ii / Acknowledgement --- p.iv / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Problem Definition --- p.2 / Chapter 1.2 --- Main Contributions --- p.4 / Chapter 1.3 --- Thesis Organization --- p.5 / Chapter 2 --- Background --- p.6 / Chapter 2.1 --- Background of Peer-to-Peer --- p.6 / Chapter 2.2 --- Background of Content-Based Image Retrieval System --- p.9 / Chapter 2.3 --- Literature Review of Peer-to-Peer Application --- p.10 / Chapter 2.4 --- Literature Review of Discovery Mechanisms for Peer-to-Peer Applications --- p.13 / Chapter 2.4.1 --- Centralized Search --- p.13 / Chapter 2.4.2 --- Distributed Search - Flooding --- p.15 / Chapter 2.4.3 --- Distributed Search - Distributed Hash Table --- p.21 / Chapter 3 --- Peer Clustering and Firework Query Model --- p.25 / Chapter 3.1 --- Peer Clustering --- p.26 / Chapter 3.1.1 --- Peer Clustering - Simplified Version --- p.27 / Chapter 3.1.2 --- Peer Clustering - Single Cluster Version --- p.29 / Chapter 3.1.3 --- "Peer Clustering - Single Cluster, Multiple Layers of Connection Version" --- p.34 / Chapter 3.1.4 --- Peer Clustering - Multiple Clusters Version --- p.35 / Chapter 3.2 --- Firework Query Model Over Clustered Network --- p.38 / Chapter 4 --- Experiments and Results --- p.43 / Chapter 4.1 --- Simulation Model of Peer-to-Peer Network --- p.43 / Chapter 4.2 --- Performance Metrics --- p.45 / Chapter 4.3 --- Experiment Results --- p.47 / Chapter 4.3.1 --- Performances in different Number of Peers in P2P Network --- p.47 / Chapter 4.3.2 --- Performances in different TTL value of query packet in P2P Network --- p.52 / Chapter 4.3.3 --- "Performances in different data sets, synthetic data and real data" --- p.55 / Chapter 4.3.4 --- Performances in different number of local clusters of each peer in P2P Network --- p.58 / Chapter 4.4 --- Evaluation of different clustering algorithms --- p.64 / Chapter 5 --- Distributed COntent-based Visual Information Retrieval (DISCOVIR) --- p.67 / Chapter 5.1 --- Architecture of DISCOVIR and Functionality of DISCOVIR Components --- p.68 / Chapter 5.2 --- Flow of Operations --- p.72 / Chapter 5.2.1 --- Preprocessing (1) --- p.73 / Chapter 5.2.2 --- Connection Establishment (2) --- p.75 / Chapter 5.2.3 --- "Query Message Routing (3,4,5)" --- p.75 / Chapter 5.2.4 --- "Query Result Display (6,7)" --- p.78 / Chapter 5.3 --- Gnutella Message Modification --- p.78 / Chapter 5.4 --- DISCOVIR EVERYWHERE --- p.81 / Chapter 5.4.1 --- Design Goal of DISCOVIR Everywhere --- p.82 / Chapter 5.4.2 --- Architecture and System Components of DISCOVIR Everywhere --- p.83 / Chapter 5.4.3 --- Flow of Operations --- p.84 / Chapter 5.4.4 --- Advantages of DISCOVIR Everywhere over Prevalent Web-based Search Engine --- p.86 / Chapter 6 --- Conclusion --- p.87 / Bibliography --- p.89
480

Automatic construction and adaptation of wrappers for semi-structured web documents.

January 2003 (has links)
Wong Tak Lam. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2003. / Includes bibliographical references (leaves 88-94). / Abstracts in English and Chinese. / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Wrapper Induction for Semi-structured Web Documents --- p.1 / Chapter 1.2 --- Adapting Wrappers to Unseen Web Sites --- p.6 / Chapter 1.3 --- Thesis Contributions --- p.7 / Chapter 1.4 --- Thesis Organization --- p.8 / Chapter 2 --- Related Work --- p.10 / Chapter 2.1 --- Related Work on Wrapper Induction --- p.10 / Chapter 2.2 --- Related Work on Wrapper Adaptation --- p.16 / Chapter 3 --- Automatic Construction of Hierarchical Wrappers --- p.20 / Chapter 3.1 --- Hierarchical Record Structure Inference --- p.22 / Chapter 3.2 --- Extraction Rule Induction --- p.30 / Chapter 3.3 --- Applying Hierarchical Wrappers --- p.38 / Chapter 4 --- Experimental Results for Wrapper Induction --- p.40 / Chapter 5 --- Adaptation of Wrappers for Unseen Web Sites --- p.52 / Chapter 5.1 --- Problem Definition --- p.52 / Chapter 5.2 --- Overview of Wrapper Adaptation Framework --- p.55 / Chapter 5.3 --- Potential Training Example Candidate Identification --- p.58 / Chapter 5.3.1 --- Useful Text Fragments --- p.58 / Chapter 5.3.2 --- Training Example Generation from the Unseen Web Site --- p.60 / Chapter 5.3.3 --- Modified Nearest Neighbour Classification --- p.63 / Chapter 5.4 --- Machine Annotated Training Example Discovery and New Wrapper Learning --- p.64 / Chapter 5.4.1 --- Text Fragment Classification --- p.64 / Chapter 5.4.2 --- New Wrapper Learning --- p.69 / Chapter 6 --- Case Study and Experimental Results for Wrapper Adaptation --- p.71 / Chapter 6.1 --- Case Study on Wrapper Adaptation --- p.71 / Chapter 6.2 --- Experimental Results --- p.73 / Chapter 6.2.1 --- Book Domain --- p.74 / Chapter 6.2.2 --- Consumer Electronic Appliance Domain --- p.79 / Chapter 7 --- Conclusions and Future Work --- p.83 / Bibliography --- p.88 / Chapter A --- Detailed Performance of Wrapper Induction for Book Domain --- p.95 / Chapter B --- Detailed Performance of Wrapper Induction for Consumer Electronic Appliance Domain --- p.99
