1 |
Mining a shared concept space for domain adaptation in text mining. / CUHK electronic theses & dissertations collection. January 2011
In many text mining applications involving high-dimensional feature spaces, it is difficult to collect sufficient training data for different domains. One strategy to tackle this problem is to intelligently adapt a model trained on one domain with labeled data to another domain with only unlabeled data. This strategy is known as domain adaptation. However, existing domain adaptation approaches have two major limitations. The first is that they all split the domain adaptation framework into two separate steps: the first step attempts to minimize the domain gap, and the second step trains the predictive model based on the reweighted instances or the transformed feature representation. Such a transformed representation, however, may encode less information, hurting predictive performance. The second limitation is that they are restricted to using first-order statistics in a Reproducing Kernel Hilbert Space (RKHS) to measure the distribution difference between the source domain and the target domain. In this thesis, we focus on developing solutions for these two limitations, which hinder the progress of domain adaptation techniques. / We then propose an improved symmetric Stein's loss (SSL) function which combines the mean and covariance discrepancies into a unified Bregman matrix divergence, of which the Jensen-Shannon divergence between normal distributions is a particular case. Based on this distribution gap measure built on second-order statistics, we present another new domain adaptation method called Location and Scatter Matching. The goal is to find a good feature representation which can reduce the embedded distribution gap, measured by SSL, between the source domain and the target domain, while ensuring that the newly derived representation encodes sufficient discriminants with respect to the label information.
A standard machine learning algorithm, such as the Support Vector Machine (SVM), can then be adapted to train classifiers in the new feature subspace across domains. / We conduct a series of experiments on real-world datasets to demonstrate the performance of our proposed approaches compared with other competitive methods. The results show significant improvement over existing domain adaptation approaches. / We develop a novel model to learn a low-rank shared concept space with respect to two criteria simultaneously: the empirical loss in the source domain, and the embedded distribution gap between the source domain and the target domain. In addition, we can transfer the predictive power from the extracted common features to the characteristic features in the target domain via the feature graph Laplacian. Moreover, we can kernelize our proposed method in the Reproducing Kernel Hilbert Space (RKHS) so as to generalize our model by making use of powerful kernel functions. We theoretically analyze the expected error, evaluated by common convex loss functions in the target domain under the empirical risk minimization framework, showing that the error bound can be controlled by the expected loss in the source domain and the embedded distribution gap. / Chen, Bo. / Adviser: Wai Lam. / Source: Dissertation Abstracts International, Volume: 73-04, Section: B, page: . / Thesis (Ph.D.)--Chinese University of Hong Kong, 2011. / Includes bibliographical references (leaves 87-95). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [201-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese.
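A rough illustration of the second-order gap measure described in this abstract: the sketch below combines a mean discrepancy with a symmetric Stein's (LogDet-style) covariance discrepancy between two domains modeled as Gaussians. The function name and the exact functional form are assumptions for illustration, not the thesis' actual SSL definition.

```python
import numpy as np

def symmetric_stein_loss(mu_s, cov_s, mu_t, cov_t):
    """Hypothetical second-order distribution gap: a symmetric Stein-style
    divergence between two Gaussians, combining the covariance discrepancy
    with a symmetrised Mahalanobis distance between the means."""
    d = len(mu_s)
    inv_s, inv_t = np.linalg.inv(cov_s), np.linalg.inv(cov_t)
    # covariance part: 0.5 * tr(cov_s @ inv_t + cov_t @ inv_s) - d
    cov_term = 0.5 * np.trace(cov_s @ inv_t + cov_t @ inv_s) - d
    # mean part: symmetrised Mahalanobis distance between the means
    diff = mu_s - mu_t
    mean_term = 0.5 * diff @ (inv_s + inv_t) @ diff
    return cov_term + mean_term

# identical distributions give zero gap
mu, cov = np.zeros(3), np.eye(3)
print(symmetric_stein_loss(mu, cov, mu, cov))  # → 0.0
```

The measure vanishes if and only if both the means and the covariances agree, which is the sense in which second-order statistics extend the first-order (mean-only) RKHS criteria the abstract criticizes.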
|
2 |
Budget-limited data disambiguation. Yang, Xuan, 楊譞. January 2013
The problem of data ambiguity exists in a wide range of applications. In this thesis, we study "cost-aware" methods to alleviate the data ambiguity problems in uncertain databases and social-tagging data.
In database applications, ambiguous (or uncertain) data may originate from data integration and measurement error of devices. These ambiguous data are maintained by uncertain databases. In many situations, it is possible to "clean", or remove, ambiguities from these databases. For example, the GPS location of a user is inexact due to measurement error, but context information (e.g., what a user is doing) can be used to reduce the imprecision of the location value. In practice, a cleaning activity often involves a cost, may fail and may not remove all ambiguities. Moreover, the statistical information about how likely database entities can be cleaned may not be precisely known. We model the above aspects with the uncertain database cleaning problem, which requires us to make sensible decisions in selecting entities to clean in order to maximize the amount of ambiguous information removed under a limited budget. To solve this problem, we propose the Explore-Exploit (or EE) algorithm, which gathers valuable information during the cleaning process to determine how the remaining cleaning budget should be invested. We also study how to fine-tune the parameters of EE in order to achieve optimal cleaning effectiveness.
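The explore-exploit idea can be sketched as an epsilon-greedy-style loop: spend part of the budget probing entities to estimate how likely cleaning succeeds, then spend the rest on the entities with the best estimated success rates. All names and the split ratio below are illustrative assumptions, not the thesis' actual EE algorithm.

```python
import random

def ee_clean(entities, budget, explore_frac=0.3, seed=0):
    """Sketch of an explore-exploit cleaning loop. `entities` maps each
    entity to its hidden probability that one cleaning attempt succeeds;
    the loop estimates these probabilities online and returns how many
    ambiguities were removed within the budget."""
    rng = random.Random(seed)
    trials = {e: 0 for e in entities}
    wins = {e: 0 for e in entities}
    removed = 0
    explore_budget = int(budget * explore_frac)
    for step in range(budget):
        if step < explore_budget:            # explore: sample uniformly
            e = rng.choice(list(entities))
        else:                                # exploit: best estimated rate
            e = max(entities, key=lambda x: (wins[x] + 1) / (trials[x] + 2))
        trials[e] += 1
        if rng.random() < entities[e]:       # simulate the cleaning attempt
            wins[e] += 1
            removed += 1
    return removed
```

The `(wins + 1) / (trials + 2)` estimate is a smoothed success rate, so entities that have never been tried are not permanently ignored.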
Social tagging data capture web users' textual annotations, called tags, for resources (e.g., webpages and photos). Since tags are given by casual users, they often contain noise (e.g., misspelled words) and may not cover all the aspects of each resource. In this thesis, we design a metric to systematically measure the tagging quality of each resource based on the tags it has received. We propose an incentive-based tagging framework to improve tagging quality. The main idea is to award users some incentive for giving (relevant) tags to resources. The challenge is: how should we allocate incentives to a large set of resources so as to maximize the improvement of their tagging quality under a limited budget? To solve this problem, we propose a few efficient incentive allocation strategies. Experiments show that our best strategy provides resources with a close-to-optimal gain in tagging quality.
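One natural baseline for budgeted incentive allocation is a greedy strategy: repeatedly give one unit of incentive to the resource with the largest estimated marginal gain in tagging quality. The sketch below assumes each resource comes with a precomputed list of diminishing marginal gains; this stand-in quality model and the function name are illustrative, not the thesis' strategies.

```python
import heapq

def allocate_incentives(resources, budget):
    """Greedy sketch: spend the budget one unit at a time on whichever
    resource currently offers the largest marginal quality gain.
    `resources` maps a resource id to its list of marginal gains."""
    heap = [(-gains[0], rid, 0) for rid, gains in resources.items() if gains]
    heapq.heapify(heap)
    total_gain, spent = 0.0, 0
    while heap and spent < budget:
        neg_g, rid, k = heapq.heappop(heap)
        total_gain += -neg_g
        spent += 1
        gains = resources[rid]
        if k + 1 < len(gains):               # next unit for this resource
            heapq.heappush(heap, (-gains[k + 1], rid, k + 1))
    return total_gain
```

With diminishing (concave) gains, this greedy choice is the classic near-optimal strategy for budgeted allocation, which is one plausible reason the best strategy in the abstract gets close to the optimum.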
To summarize, we study the problem of budget-limited data disambiguation for uncertain databases and social tagging data: given a set of objects (entities from uncertain databases or web resources), how can we make sensible decisions about which object to "disambiguate" (i.e., perform a cleaning activity on the entity or ask a user to tag the resource), in order to maximize the amount of ambiguous information reduced under a limited budget? / published_or_final_version / Computer Science / Doctoral / Doctor of Philosophy
|
3 |
Entity discovery by exploiting contextual structures. / CUHK electronic theses & dissertations collection. January 2011
In text mining, being able to recognize and extract named entities, e.g. locations, persons, and organizations, is very useful in many applications. This task is usually referred to as named entity recognition (NER). This thesis presents a cascaded framework for extracting named entities from text documents. We automatically derive features on a set of documents from different feature templates. To avoid the high computational cost incurred by a single-phase approach, we divide the named entity extraction task into a segmentation task and a classification task, reducing the computational cost by an order of magnitude. / To handle cascaded errors that often occur in a sequence of tasks, we investigate and develop three models: a maximum-entropy margin-based (MEMB) model, an isomeric conditional random field (ICRF) model, and an online cascaded reranking (OCR) model. The MEMB model makes use of the concept of margin in maximizing log-likelihood: parameters are trained so as to maximize the "margin" between the decision boundary and the nearest training data points. The ICRF model makes use of the concept of joint training: instead of training each model independently, we design the segmentation and classification models so that they can be efficiently trained together under a soft constraint. The OCR model is developed by using an online training method to maximize a margin without considering any probability measures, which greatly reduces the training time. It reranks all of the possible outputs from a previous stage based on a total output score; the output with the highest total score is the final output. / We report experimental evaluations on the GENIA Corpus available from the BioNLP/NLPBA (2004) shared task and the Reuters Corpus available from the CoNLL-2003 shared task, which demonstrate the state-of-the-art performance achieved by the proposed models. / Chan, Shing Kit. / Advisers: Wai Lam; Kai Pui Lam.
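The reranking step at the end of the cascade can be pictured very simply: each candidate output carries a score from the segmentation stage and a score from the classification stage, and the final answer is the candidate with the highest total. The field names and flat scoring below are a toy illustration of that idea, not the OCR model itself.

```python
def rerank_cascade(candidates):
    """Toy sketch of cascaded reranking: pick the candidate whose
    combined segmentation + classification score is highest, and
    return its label sequence as the final output."""
    best = max(candidates, key=lambda c: c["seg_score"] + c["cls_score"])
    return best["labels"]

candidates = [
    {"labels": ["O", "B-LOC"], "seg_score": 1.0, "cls_score": 0.2},
    {"labels": ["B-PER", "O"], "seg_score": 0.8, "cls_score": 0.9},
]
print(rerank_cascade(candidates))  # → ['B-PER', 'O']
```

Scoring the total across both stages is what lets a slightly worse segmentation win when its classification is much more confident, which is exactly the cascaded-error problem the three models address.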
/ Source: Dissertation Abstracts International, Volume: 73-06, Section: B, page: . / Thesis (Ph.D.)--Chinese University of Hong Kong, 2011. / Includes bibliographical references (leaves 126-133). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [201-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese.
|
4 |
Algorithmic aspects of social network mining. / 社会网络挖掘的算法问题研究 / CUHK electronic theses & dissertations collection / Algorithmic aspects of social network mining. / She hui wang luo wa jue de suan fa wen ti yan jiu. January 2013
Li, Ronghua = 社会网络挖掘的算法问题研究 / 李荣华. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2013. / Includes bibliographical references (leaves 157-171). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese. / Li, Ronghua = She hui wang luo wa jue de suan fa wen ti yan jiu / Li Ronghua.
|
5 |
Incremental algorithms for multilinear principal component analysis of tensor objects. Cao, Zisheng, 曹子晟. January 2013
In recent years, massive data sets are generated in many areas of science and business, and are gathered by using advanced data acquisition techniques. New approaches are therefore required to facilitate effective data management and data analysis in this big data era, especially to analyze multidimensional data for real-time applications. This thesis aims at developing generic and effective algorithms for compressing and recovering online multidimensional data, and applying such algorithms in image processing and other related areas.
Since multidimensional data are usually represented by tensors, this research uses multilinear algebra as the mathematical foundation to facilitate development. After reviewing the techniques of singular value decomposition (SVD), principal component analysis (PCA) and tensor decomposition, this thesis derives an effective multilinear principal component analysis (MPCA) method to process such data by seeking optimal orthogonal basis functions that map the original tensor space to a tensor subspace with minimal reconstruction error. Two real examples, 3D data compression for positron emission tomography (PET) and offline fabric defect detection, are used to illustrate the tensor decomposition method and the derived MPCA method, respectively. Based on the derived MPCA method, this research develops an incremental MPCA (IMPCA) algorithm which aims at compressing and recovering online tensor objects.
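The core of an MPCA-style method is an alternating optimization: for each tensor mode, fix the projections of the other modes, unfold along the current mode, and take the leading eigenvectors of the mode-wise scatter matrix. The sketch below is a minimal illustration of that scheme for a batch of 3rd-order tensors; it follows the general MPCA idea rather than the thesis' exact derivation, and all names are assumptions.

```python
import numpy as np

def mpca(tensors, ranks, n_iters=5):
    """Minimal MPCA sketch: find per-mode orthonormal projection matrices
    mapping centered tensors (stacked on axis 0) to a smaller tensor
    subspace, by alternating eigendecompositions of mode-wise scatter
    matrices. Illustrative only."""
    mean = np.mean(tensors, axis=0)
    X = tensors - mean
    dims = X.shape[1:]
    # initialise each projection with a truncated identity
    U = [np.eye(d)[:, :r] for d, r in zip(dims, ranks)]
    for _ in range(n_iters):
        for m in range(len(dims)):
            # project every mode except m, then unfold along mode m
            Y = X
            for k in range(len(dims)):
                if k != m:
                    Y = np.moveaxis(
                        np.tensordot(Y, U[k], axes=([k + 1], [0])), -1, k + 1)
            Ym = np.moveaxis(Y, m + 1, 1).reshape(Y.shape[0], dims[m], -1)
            S = sum(A @ A.T for A in Ym)      # mode-m scatter matrix
            w, V = np.linalg.eigh(S)
            U[m] = V[:, ::-1][:, :ranks[m]]   # top eigenvectors
    return U, mean
```

Each sweep monotonically improves the captured variation, so a handful of iterations typically suffices in practice.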
To reduce the computational complexity of the IMPCA algorithm, this research investigates low-rank updates of singular values in the matrix and tensor domains. This leads to the development of a sequential low-rank update scheme, similar to the sequential Karhunen-Loeve (SKL) algorithm, for incremental matrix singular value decomposition; a sequential low-rank update scheme for incremental tensor decomposition; and a quick subspace tracking (QST) algorithm to further enhance the low-rank updates of singular values when the matrix is symmetric positive definite. Although QST is slightly inferior to the SKL algorithm in the accuracy of estimating eigenvectors and eigenvalues, it has lower computational complexity. Two fast incremental MPCA (IMPCA) algorithms are then developed by incorporating the SKL algorithm and the QST algorithm separately into the IMPCA algorithm. Results obtained from applying the developed IMPCA algorithms to detect anomalies in online multidimensional data in a number of numerical experiments, and to track and reconstruct the global surface temperature anomalies over the past several decades, clearly confirm the excellent performance of the algorithms.
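The SKL-style incremental SVD referenced above folds a new batch of columns into an existing rank-r factorization without recomputing the SVD from scratch: project the new data onto the current subspace, orthogonalize the residual, form a small core matrix, and re-truncate. The sketch below illustrates that standard scheme; it is not the thesis' exact SKL variant.

```python
import numpy as np

def incremental_svd(U, s, new_cols, rank):
    """Sequential low-rank SVD update sketch: given the left singular
    vectors U and singular values s of the data seen so far, absorb a
    new batch of column vectors and re-truncate to `rank`."""
    # residual of the new data outside the current subspace
    proj = U.T @ new_cols
    resid = new_cols - U @ proj
    Q, R = np.linalg.qr(resid)
    # small core matrix combining old singular values and the new batch
    k, b = len(s), new_cols.shape[1]
    core = np.zeros((k + Q.shape[1], k + b))
    core[:k, :k] = np.diag(s)
    core[:k, k:] = proj
    core[k:, k:] = R
    Uc, sc, _ = np.linalg.svd(core, full_matrices=False)
    U_new = np.hstack([U, Q]) @ Uc
    return U_new[:, :rank], sc[:rank]
```

The expensive SVD is done on the small (k+b)-sized core rather than the full data matrix, which is where the low-rank update saves its work.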
This research also applies the developed IMPCA algorithms to solve an online fabric defect inspection problem. Unlike existing pixel-wise detection schemes, the developed algorithms employ a scanning window to extract tensor objects from fabric images and to detect the occurrence of anomalies. The proposed method is unsupervised because no pre-training is needed. Two image processing techniques, selective local Gabor binary patterns (SLGBP) and multi-channel feature combination, are developed to accomplish the feature extraction of textile patterns and represent the features as tensor objects. Results of experiments conducted on a real textile dataset confirm that the developed algorithms are comparable to existing supervised methods in terms of accuracy and computational complexity. A cost-effective parallel implementation scheme is developed to solve the problem in real time. / published_or_final_version / Industrial and Manufacturing Systems Engineering / Doctoral / Doctor of Philosophy
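The scanning-window idea can be pictured as follows: slide a window over the fabric image, project each patch onto a subspace learned from defect-free texture, and flag windows whose reconstruction error is large. This is a simplified stand-in for the tensor-based pipeline in the abstract (it works on vectorised 2D patches, and all names are assumptions).

```python
import numpy as np

def window_anomaly_scores(image, basis, win=8, stride=8):
    """Sketch of scanning-window anomaly scoring: vectorise each
    win x win patch, project onto the subspace spanned by the
    orthonormal columns of `basis` (shape win*win x r), and score
    the patch by its reconstruction error."""
    scores = {}
    H, W = image.shape
    for i in range(0, H - win + 1, stride):
        for j in range(0, W - win + 1, stride):
            patch = image[i:i + win, j:j + win].ravel()
            recon = basis @ (basis.T @ patch)
            scores[(i, j)] = float(np.linalg.norm(patch - recon))
    return scores
```

Windows over regular texture reconstruct well and score near zero, while defective regions fall outside the learned subspace and stand out.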
|
6 |
Logic knowledge base refinement using unlabeled or limited labeled data. / CUHK electronic theses & dissertations collection. January 2010
In many text mining applications, knowledge bases incorporating expert knowledge are beneficial for intelligent decision making. Refining an existing knowledge base from a source domain to a different target domain solving the same task would greatly reduce the effort required for preparing labeled training data in constructing a new knowledge base. We investigate a new framework for refining a kind of logic knowledge base known as Markov Logic Networks (MLN). One characteristic of this adaptation problem is that, since the data distributions of the two domains are different, there should be a different tailor-made MLN for each domain. On the other hand, the two knowledge bases should share a certain amount of similarity due to the common goal. We investigate the refinement in two situations, namely, using unlabeled target domain data, and using a limited amount of labeled target domain data. / When manual annotation of a limited amount of target domain data is possible, we investigate how to actively select the data for annotation and develop two active learning approaches. The first is a pool-based active learning approach taking into account the differences between the source and the target domains. A theoretical analysis of the sampling bound of the approach is conducted to demonstrate that informative data can be actively selected. The second is an error-driven approach designed to provide estimated labels for the target domain, so that the quality of the logic formulae captured for the target domain can be improved. An error analysis of the cluster-based active learning approach is presented. We have conducted extensive experiments on two different text mining tasks, namely, pronoun resolution and segmentation of citation records, showing consistent improvements in both situations: using unlabeled target domain data, and using a limited amount of labeled target domain data.
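Pool-based active learning at its simplest ranks the unlabeled pool by model uncertainty and sends the least-confident examples to the annotator. The sketch below shows only this generic uncertainty-sampling core; the thesis' criterion additionally weighs source/target domain differences, which this stand-in omits.

```python
def select_for_annotation(pool, predict_proba, k):
    """Pool-based active-learning sketch: pick the k examples whose
    predicted positive-class probability is closest to 0.5, i.e. the
    examples the current model is least confident about."""
    by_uncertainty = sorted(pool, key=lambda x: abs(predict_proba(x) - 0.5))
    return by_uncertainty[:k]
```

With a calibrated classifier, these near-boundary examples are the ones whose labels most reduce uncertainty per unit of annotation effort.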
/ When there is no manual label given for the target domain data, we refine an existing MLN via two components. The first component is logic formula weight adaptation, which jointly maximizes the likelihood of the observations of the target domain unlabeled data and considers the differences between the two domains. Two approaches are designed to capture the differences between the two domains: one analyzes the distribution divergence between the two domains, and the other incorporates a penalized degree of difference. The second component is logic formula refinement, where logic formulae specific to the target domain are discovered to further capture the characteristics of the target domain. / Chan, Ki Cecia. / Adviser: Wai Lam. / Source: Dissertation Abstracts International, Volume: 73-02, Section: B, page: . / Thesis (Ph.D.)--Chinese University of Hong Kong, 2010. / Includes bibliographical references (leaves 120-128). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [201-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese.
|
7 |
Improving opinion mining with feature-opinion association and human computation. / 利用特徵意見結合及人類運算改進意見挖掘 / Li yong te zheng yi jian jie he ji ren lei yun suan gai jin yi jian wa jue. January 2009
Chan, Kam Tong. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2009. / Includes bibliographical references (leaves [101]-113). / Abstracts in English and Chinese. / Abstract --- p.i / Acknowledgement --- p.iv / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Major Topic --- p.1 / Chapter 1.1.1 --- Opinion Mining --- p.1 / Chapter 1.1.2 --- Human Computation --- p.2 / Chapter 1.2 --- Major Work and Contributions --- p.3 / Chapter 1.3 --- Thesis Outline --- p.4 / Chapter 2 --- Literature Review --- p.6 / Chapter 2.1 --- Opinion Mining --- p.6 / Chapter 2.1.1 --- Feature Extraction --- p.6 / Chapter 2.1.2 --- Sentiment Analysis --- p.9 / Chapter 2.2 --- Social Computing --- p.15 / Chapter 2.2.1 --- Social Bookmarking --- p.15 / Chapter 2.2.2 --- Social Games --- p.18 / Chapter 3 --- Feature-Opinion Association for Sentiment Analysis --- p.25 / Chapter 3.1 --- Motivation --- p.25 / Chapter 3.2 --- Problem Definition --- p.27 / Chapter 3.2.1 --- Definitions --- p.27 / Chapter 3.3 --- Closer look at the problem --- p.28 / Chapter 3.3.1 --- Discussion --- p.29 / Chapter 3.4 --- Proposed Approach --- p.29 / Chapter 3.4.1 --- Nearest Opinion Word (DIST) --- p.31 / Chapter 3.4.2 --- Co-Occurrence Frequency (COF) --- p.31 / Chapter 3.4.3 --- Co-Occurrence Ratio (COR) --- p.32 / Chapter 3.4.4 --- Likelihood-Ratio Test (LHR) --- p.32 / Chapter 3.4.5 --- Combined Method --- p.34 / Chapter 3.4.6 --- Feature-Opinion Association Algorithm --- p.35 / Chapter 3.4.7 --- Sentiment Lexicon Expansion --- p.36 / Chapter 3.5 --- Evaluation --- p.37 / Chapter 3.5.1 --- Corpus Data Set --- p.37 / Chapter 3.5.2 --- Test Data set --- p.37 / Chapter 3.5.3 --- Feature-Opinion Association Accuracy --- p.38 / Chapter 3.6 --- Summary --- p.45 / Chapter 4 --- Social Game for Opinion Mining --- p.46 / Chapter 4.1 --- Motivation --- p.46 / Chapter 4.2 --- Social Game Model --- p.47 / Chapter 4.2.1 --- Definitions --- p.48 / Chapter 4.2.2 --- Social Game Problem --- p.51 / Chapter 4.2.3 --- 
Social Game Flow --- p.51 / Chapter 4.2.4 --- Answer Extraction Procedure --- p.52 / Chapter 4.3 --- Social Game Properties --- p.53 / Chapter 4.3.1 --- Type of Information --- p.53 / Chapter 4.3.2 --- Game Structure --- p.55 / Chapter 4.3.3 --- Verification Method --- p.59 / Chapter 4.3.4 --- Game Mechanism --- p.60 / Chapter 4.3.5 --- Player Requirement --- p.62 / Chapter 4.4 --- Design Guideline --- p.63 / Chapter 4.5 --- Opinion Mining Game Design --- p.65 / Chapter 4.5.1 --- OpinionMatch --- p.65 / Chapter 4.5.2 --- FeatureGuess --- p.68 / Chapter 4.6 --- Summary --- p.71 / Chapter 5 --- Tag Sentiment Analysis for Social Bookmark Recommendation System --- p.72 / Chapter 5.1 --- Motivation --- p.72 / Chapter 5.2 --- Problem Statement --- p.74 / Chapter 5.2.1 --- Social Bookmarking Model --- p.74 / Chapter 5.2.2 --- Social Bookmark Recommendation (SBR) Problem --- p.75 / Chapter 5.3 --- Proposed Approach --- p.75 / Chapter 5.3.1 --- Social Bookmark Recommendation Framework --- p.75 / Chapter 5.3.2 --- Subjective Tag Detection (STD) --- p.77 / Chapter 5.3.3 --- Similarity Matrices --- p.80 / Chapter 5.3.4 --- User-Website matrix: --- p.81 / Chapter 5.3.5 --- User-Tag matrix --- p.81 / Chapter 5.3.6 --- Website-Tag matrix --- p.82 / Chapter 5.4 --- Pearson Correlation Coefficient --- p.82 / Chapter 5.5 --- Social Network-based User Similarity --- p.83 / Chapter 5.6 --- User-oriented Website Ranking --- p.85 / Chapter 5.7 --- Evaluation --- p.87 / Chapter 5.7.1 --- Bookmark Data --- p.87 / Chapter 5.7.2 --- Social Network --- p.87 / Chapter 5.7.3 --- Subjective Tag List --- p.87 / Chapter 5.7.4 --- Subjective Tag Detection --- p.88 / Chapter 5.7.5 --- Bookmark Recommendation Quality --- p.90 / Chapter 5.7.6 --- System Evaluation --- p.91 / Chapter 5.8 --- Summary --- p.93 / Chapter 6 --- Conclusion and Future Work --- p.94 / Chapter A --- List of Symbols and Notations --- p.97 / Chapter B --- List of Publications --- p.100 / Bibliography --- p.101
|
8 |
Unsupervised extraction and normalization of product attributes from web pages. January 2010
Xiong, Jiani. / "July 2010." / Thesis (M.Phil.)--Chinese University of Hong Kong, 2010. / Includes bibliographical references (p. 59-63). / Abstracts in English and Chinese. / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Background --- p.1 / Chapter 1.2 --- Motivation --- p.4 / Chapter 1.3 --- Our Approach --- p.8 / Chapter 1.4 --- Potential Applications --- p.12 / Chapter 1.5 --- Research Contributions --- p.13 / Chapter 1.6 --- Thesis Organization --- p.15 / Chapter 2 --- Literature Survey --- p.16 / Chapter 2.1 --- Supervised Extraction Approaches --- p.16 / Chapter 2.2 --- Unsupervised Extraction Approaches --- p.19 / Chapter 2.3 --- Attribute Normalization --- p.21 / Chapter 2.4 --- Integrated Approaches --- p.22 / Chapter 3 --- Problem Definition and Preliminaries --- p.24 / Chapter 3.1 --- Problem Definition --- p.24 / Chapter 3.2 --- Preliminaries --- p.27 / Chapter 3.2.1 --- Web Pre-processing --- p.27 / Chapter 3.2.2 --- Overview of Our Framework --- p.31 / Chapter 3.2.3 --- Background of Graphical Models --- p.32 / Chapter 4 --- Our Proposed Framework --- p.36 / Chapter 4.1 --- Our Proposed Graphical Model --- p.36 / Chapter 4.2 --- Inference --- p.41 / Chapter 4.3 --- Product Attribute Information Determination --- p.47 / Chapter 5 --- Experiments and Results --- p.49 / Chapter 6 --- Conclusion --- p.57 / Bibliography --- p.59 / Chapter A --- Dirichlet Process --- p.64 / Chapter B --- Hidden Markov Models --- p.68
|
9 |
Link-based similarity measurement techniques and applications. / 基於鏈接的相似度測量技術與應用 / CUHK electronic theses & dissertations collection / Ji yu lian jie de xiang si du ce liang ji shu yu ying yong. January 2011
Lin, Zhenjiang. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2011. / Includes bibliographical references (leaves 161-185). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese.
|
10 |
Parameter free document stream classification. / CUHK electronic theses & dissertations collection. January 2006
Extensive experiments are conducted to evaluate the effectiveness of PFreeBT and PNLH by using a two-year stream of news stories and three benchmarks. The results show that the patterns of the bursty features and the bursty topics identified by PFreeBT match our expectations, whereas PNLH demonstrates significant improvements over all of the existing heuristics. These favorable results indicate that both PFreeBT and PNLH are highly effective and feasible. / For the problem of bursty topic identification, PFreeBT adopts what we term a feature-pivot clustering approach. Given a document stream, PFreeBT first identifies a set of bursty features from it. The identification process is based on computing probability distributions. According to the patterns of the bursty features and two newly defined concepts (equivalent and map-to), a set of bursty topics can be extracted. / For the problem of constructing a reliable classifier, we formulate it as a partially supervised classification problem. In this classification problem, only a few training examples are labeled as positive (P). All other training examples (U) remain unlabeled. Here, U is mixed with the negative examples (N) and some other positive examples (P'). Existing techniques that tackle this problem all focus on finding N from U. None of them attempts to extract P' from U. In fact, it is difficult to succeed, as the topics in U are diverse and the features there are sparse. In this dissertation, PNLH is proposed for extracting high-quality P' and N from U. / In this dissertation, two heuristics, PFreeBT and PNLH, are proposed to tackle the aforementioned problems. PFreeBT aims at identifying the bursty topics in a document stream, whereas PNLH aims at constructing a reliable classifier for a given bursty topic. It is worth noting that both heuristics are parameter free: users do not need to provide any parameter explicitly.
All of the required variables can be computed based on the given document stream automatically. / In this information-overwhelming era, information becomes ever more pervasive. A new class of data-intensive applications arises in which data is best modeled as an open-ended stream. We call such data a data stream. A document stream is a variation of a data stream that consists of a sequence of chronologically ordered documents. A fundamental problem in mining document streams is to extract meaningful structure from them, so as to help us organize the contents systematically. In this dissertation, we focus on such a problem. Specifically, this dissertation studies two problems: to identify the bursty topics in a document stream and to construct classifiers for the bursty topics. A bursty topic is a topic in the document stream such that a large number of documents would be related to it during a bounded time interval. / Fung Pui Cheong Gabriel. / "August 2006." / Adviser: Jeffrey Xu Yu. / Source: Dissertation Abstracts International, Volume: 68-03, Section: B, page: 1720. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2006. / Includes bibliographical references (p. 122-130). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstracts in English and Chinese. / School code: 1307.
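The bursty-feature step described in this entry can be approximated with a simple frequency test: a word is flagged as bursty in a time window when its relative frequency there far exceeds its average rate across all windows. This is a deliberately simplified stand-in for PFreeBT's probabilistic identification, and the threshold is an assumed parameter (which PFreeBT itself avoids).

```python
from collections import Counter

def bursty_features(windows, threshold=3.0):
    """Sketch of bursty-feature identification over a document stream
    split into time windows (each a Counter of word counts): flag
    (window, word) pairs whose in-window rate exceeds `threshold`
    times the word's mean rate across windows."""
    words = set().union(*windows)
    rates = {w: [win[w] / (sum(win.values()) or 1) for win in windows]
             for w in words}
    bursts = []
    for word, rs in rates.items():
        mean_rate = sum(rs) / len(rs)
        for t, r in enumerate(rs):
            if mean_rate > 0 and r > threshold * mean_rate:
                bursts.append((t, word))
    return sorted(bursts)
```

Comparing against a word's own mean rate, rather than a global cutoff, keeps common words from being flagged just because they are frequent.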
|