431 |
Mining a shared concept space for domain adaptation in text mining. / CUHK electronic theses & dissertations collection / January 2011 (has links)
In many text mining applications involving a high-dimensional feature space, it is difficult to collect sufficient training data for different domains. One strategy for tackling this problem is to intelligently adapt a model trained on one domain with labeled data to another domain with only unlabeled data. This strategy is known as domain adaptation. However, existing domain adaptation approaches have two major limitations. The first limitation is that they all split the domain adaptation framework into two separate steps: the first step attempts to minimize the domain gap, and the second step trains the predictive model on the reweighted instances or the transformed feature representation. However, such a transformed representation may encode less information, affecting the predictive performance. The second limitation is that they are restricted to using first-order statistics in a Reproducing Kernel Hilbert Space (RKHS) to measure the distribution difference between the source domain and the target domain. In this thesis, we focus on developing solutions for these two limitations hindering the progress of domain adaptation techniques. / We then propose an improved symmetric Stein's loss (SSL) function which combines the mean and covariance discrepancy into a unified Bregman matrix divergence, of which the Jensen-Shannon divergence between normal distributions is a particular case. Based on this distribution gap measure built on second-order statistics, we present another new domain adaptation method called Location and Scatter Matching. The goal is to find a good feature representation which can reduce the embedded distribution gap, measured by SSL, between the source domain and the target domain while ensuring that the newly derived representation encodes sufficient discriminative information with respect to the label information.
Then a standard machine learning algorithm, such as the Support Vector Machine (SVM), can be adapted to train classifiers in the new feature subspace across domains. / We conduct a series of experiments on real-world datasets to demonstrate the performance of our proposed approaches compared with other competitive methods. The results show significant improvement over existing domain adaptation approaches. / We develop a novel model to learn a low-rank shared concept space with respect to two criteria simultaneously: the empirical loss in the source domain, and the embedded distribution gap between the source domain and the target domain. In addition, we can transfer the predictive power from the extracted common features to the characteristic features in the target domain via the feature graph Laplacian. Moreover, we can kernelize our proposed method in the Reproducing Kernel Hilbert Space (RKHS) so as to generalize our model by making use of powerful kernel functions. We theoretically analyze the expected error, evaluated by common convex loss functions in the target domain under the empirical risk minimization framework, showing that the error bound can be controlled by the expected loss in the source domain and the embedded distribution gap. / Chen, Bo. / Adviser: Wai Lam. / Source: Dissertation Abstracts International, Volume: 73-04, Section: B, page: . / Thesis (Ph.D.)--Chinese University of Hong Kong, 2011. / Includes bibliographical references (leaves 87-95). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [201-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese.
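The second-order gap measure sketched in this abstract can be illustrated numerically. The snippet below is only a plausible reading of a symmetric Stein-style divergence combining a location (mean) term and a scatter (covariance) term; the function name, the regularization, and the exact weighting are assumptions for illustration, not the thesis's actual SSL formulation.

```python
import numpy as np

def symmetric_stein_gap(Xs, Xt, eps=1e-6):
    """Second-order distribution gap between source and target samples.

    A sketch of a symmetric Stein-style divergence combining mean and
    covariance discrepancy; the exact weighting used in the thesis may differ.
    """
    d = Xs.shape[1]
    mu_s, mu_t = Xs.mean(axis=0), Xt.mean(axis=0)
    # Regularize covariances so the inverses below always exist.
    cov_s = np.cov(Xs, rowvar=False) + eps * np.eye(d)
    cov_t = np.cov(Xt, rowvar=False) + eps * np.eye(d)
    inv_s, inv_t = np.linalg.inv(cov_s), np.linalg.inv(cov_t)
    # Scatter term: zero iff the two covariances coincide.
    scatter = 0.5 * np.trace(cov_s @ inv_t + cov_t @ inv_s) - d
    # Location term: mean difference weighted by both precisions.
    diff = mu_s - mu_t
    location = 0.5 * (diff @ inv_s @ diff + diff @ inv_t @ diff)
    return scatter + location

rng = np.random.default_rng(0)
same = symmetric_stein_gap(rng.normal(size=(500, 3)), rng.normal(size=(500, 3)))
shifted = symmetric_stein_gap(rng.normal(size=(500, 3)),
                              rng.normal(loc=2.0, size=(500, 3)))
print(same < shifted)  # the mean-shifted target shows a much larger gap
```

The measure vanishes when source and target have identical first- and second-order statistics, which is the property Location and Scatter Matching exploits.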
|
432 |
A task allocation protocol for real-time financial data mining system. January 2003 (has links)
Lam Lui-fuk. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2003. / Includes bibliographical references (leaves 75-76). / Abstracts in English and Chinese. / ABSTRACT --- p.I / 摘要 --- p.II / ACKNOWLEDGEMENT --- p.III / TABLE OF CONTENTS --- p.IV / LIST OF FIGURES --- p.VIII / LIST OF ABBREVIATIONS --- p.X / Chapter CHAPTER 1 --- INTRODUCTION --- p.1 / Chapter 1.1 --- Introduction --- p.1 / Chapter 1.2. --- Motivation and Research Objective --- p.3 / Chapter 1.3. --- Organization of the Dissertation --- p.3 / Chapter CHAPTER 2 --- BACKGROUND STUDIES --- p.5 / Chapter 2.1 --- The Contract Net Protocol --- p.5 / Chapter 2.2 --- Two-tier software architectures --- p.8 / Chapter 2.3 --- Three-tier software architecture --- p.9 / Chapter CHAPTER 3 --- SYSTEM ARCHITECTURE --- p.12 / Chapter 3.1 --- Introduction --- p.12 / Chapter 3.2 --- System Architecture Overview --- p.12 / Chapter 3.2.1 --- Client Layer --- p.13 / Chapter 3.2.2 --- Middle Layer --- p.13 / Chapter 3.2.3 --- Back-end Layer --- p.14 / Chapter 3.3 --- Advantages of the System Architecture --- p.14 / Chapter 3.3.1 --- "Separate the presentation components, business logic and data storage" --- p.14 / Chapter 3.3.2 --- Provide a central-computing platform for user using different computing platforms --- p.15 / Chapter 3.3.3 --- Improve system capacity --- p.15 / Chapter 3.3.4 --- Enable distributed computing --- p.16 / Chapter CHAPTER 4. 
--- SOFTWARE ARCHITECTURE --- p.17 / Chapter 4.1 --- Introduction --- p.17 / Chapter 4.2 --- Descriptions of Middle Layer Server Side Software Components --- p.17 / Chapter 4.2.1 --- Data Cache --- p.18 / Chapter 4.2.2 --- Functions Library --- p.18 / Chapter 4.2.3 --- Communicator --- p.18 / Chapter 4.2.4 --- Planner Module --- p.19 / Chapter 4.2.5 --- Scheduler module --- p.19 / Chapter 4.2.6 --- Execution Module --- p.20 / Chapter 4.3 --- Overview the Execution of Service Request inside Server --- p.20 / Chapter 4.4 --- Descriptions of Client layer Software Components --- p.21 / Chapter 4.4.1 --- Graphical User Interface --- p.22 / Chapter 4.5 --- Overview of Task Execution in Advanced Client's Application --- p.23 / Chapter 4.6 --- The possible usages of task allocation protocol --- p.24 / Chapter 4.6.1 --- Chart Drawing --- p.25 / Chapter 4.6.2 --- Compute user-defined technical analysis indicator --- p.25 / Chapter 4.6.3 --- Unbalance loading --- p.26 / Chapter 4.6.4 --- Large number of small data mining V.S. small number of large data mining --- p.26 / Chapter 4.7 --- Summary --- p.27 / Chapter CHAPTER 5. --- THE CONTRACT NET PROTOCOL FOR TASK ALLOCATION --- p.28 / Chapter 5.1 --- Introduction --- p.28 / Chapter 5.2 --- The FIPA Contract Net Interaction Protocol --- p.28 / Chapter 5.2.1 --- Introduction to the FIPA Contract Net Interaction Protocol --- p.28 / Chapter 5.2.2 --- Strengths of the FIPA Contract Net Interaction Protocol for our system --- p.30 / Chapter 5.2.3 --- Weakness of the FIPA Contract Net Interaction Protocol for our system --- p.32 / Chapter 5.3 --- The Modified Contract Net Protocol --- p.33 / Chapter 5.4 --- The Implementation of the Modified Contract Net Protocol --- p.39 / Chapter 5.5 --- Summary --- p.46 / Chapter CHAPTER 6.
--- A CLIENT AS SERVER MODEL USING MCNP FOR TASK ALLOCATION --- p.48 / Chapter 6.1 --- Introduction --- p.48 / Chapter 6.2 --- The CASS System Model --- p.48 / Chapter 6.3 --- The analytical model of the CASS system --- p.51 / Chapter 6.4 --- Performance Analysis of the CASS System --- p.55 / Chapter 6.5 --- Performance Simulation --- p.62 / Chapter 6.6 --- An Extension of the Load-Balancing Algorithm for Non-Uniform Client's Service Time Distribution --- p.68 / Chapter 6.7 --- Summary --- p.69 / Chapter CHAPTER 7. --- CONCLUSION AND FUTURE WORK --- p.71 / Chapter 7.1 --- Conclusion --- p.71 / Chapter 7.2 --- Future Work --- p.73 / BIBLIOGRAPHY --- p.75
|
433 |
Efficient and effective outlier detection. January 2003 (has links)
by Chiu Lai Mei. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2003. / Includes bibliographical references (leaves 142-149). / Abstracts in English and Chinese. / Abstract --- p.ii / Acknowledgement --- p.vi / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Outlier Analysis --- p.2 / Chapter 1.2 --- Problem Statement --- p.4 / Chapter 1.2.1 --- Binary Property of Outlier --- p.4 / Chapter 1.2.2 --- Overlapping Clusters with Different Densities --- p.4 / Chapter 1.2.3 --- Large Datasets --- p.5 / Chapter 1.2.4 --- High Dimensional Datasets --- p.6 / Chapter 1.3 --- Contributions --- p.8 / Chapter 2 --- Related Work in Outlier Detection --- p.10 / Chapter 2.1 --- Outlier Detection --- p.11 / Chapter 2.1.1 --- Clustering-Based Methods --- p.11 / Chapter 2.1.2 --- Distance-Based Methods --- p.14 / Chapter 2.1.3 --- Density-Based Methods --- p.18 / Chapter 2.1.4 --- Deviation-Based Methods --- p.22 / Chapter 2.2 --- Breakthrough Outlier Notion: Degree of Outlier-ness --- p.25 / Chapter 2.2.1 --- LOF: Local Outlier Factor --- p.26 / Chapter 2.2.2 --- Definitions --- p.26 / Chapter 2.2.3 --- Properties --- p.29 / Chapter 2.2.4 --- Algorithm --- p.30 / Chapter 2.2.5 --- Time Complexity --- p.31 / Chapter 2.2.6 --- LOF of High Dimensional Data --- p.31 / Chapter 3 --- LOF': Formula with Intuitive Meaning --- p.33 / Chapter 3.1 --- Definition of LOF' --- p.33 / Chapter 3.2 --- Properties --- p.34 / Chapter 3.3 --- Time Complexity --- p.37 / Chapter 4 --- LOF'' for Detecting Small Groups of Outliers --- p.39 / Chapter 4.1 --- Definition of LOF'' --- p.40 / Chapter 4.2 --- Properties --- p.41 / Chapter 4.3 --- Time Complexity --- p.44 / Chapter 5 --- GridLOF for Pruning Reasonable Portions from Datasets --- p.46 / Chapter 5.1 --- GridLOF Algorithm --- p.47 / Chapter 5.2 --- Determine Values of Input Parameters --- p.51 / Chapter 5.2.1 --- Number of Intervals w --- p.51 / Chapter 5.2.2 --- Threshold Value σ --- p.52 / Chapter 5.3 --- Advantages --- p.53 / 
Chapter 5.4 --- Time Complexity --- p.55 / Chapter 6 --- SOF: Efficient Outlier Detection for High Dimensional Data --- p.57 / Chapter 6.1 --- Motivation --- p.57 / Chapter 6.2 --- Notations and Definitions --- p.59 / Chapter 6.3 --- SOF: Subspace Outlier Factor --- p.62 / Chapter 6.3.1 --- Formal Definition of SOF --- p.62 / Chapter 6.3.2 --- Properties of SOF --- p.67 / Chapter 6.4 --- SOF-Algorithm: the Overall Framework --- p.73 / Chapter 6.5 --- Identify Associated Subspaces of Clusters in SOF-Algorithm --- p.74 / Chapter 6.5.1 --- Technical Details in Phase I --- p.76 / Chapter 6.6 --- Technical Details in Phase II and Phase III --- p.88 / Chapter 6.6.1 --- Identify Outliers --- p.88 / Chapter 6.6.2 --- Subspace Quantization --- p.90 / Chapter 6.6.3 --- X-Tree Index Structure --- p.91 / Chapter 6.6.4 --- Compute GSOF and SOF --- p.95 / Chapter 6.6.5 --- Assign SO Values --- p.95 / Chapter 6.6.6 --- Multi-threads Programming --- p.96 / Chapter 6.7 --- Time Complexity --- p.97 / Chapter 6.8 --- Strength of SOF-Algorithm --- p.99 / Chapter 7 --- Experiments on LOF', LOF'' and GridLOF --- p.102 / Chapter 7.1 --- Datasets Used --- p.103 / Chapter 7.2 --- LOF' --- p.103 / Chapter 7.3 --- LOF'' --- p.109 / Chapter 7.4 --- GridLOF --- p.114 / Chapter 8 --- Empirical Results of SOF --- p.121 / Chapter 8.1 --- Synthetic Data Generation --- p.121 / Chapter 8.2 --- Experimental Setup --- p.124 / Chapter 8.3 --- Performance Measure --- p.124 / Chapter 8.3.1 --- Quality Measurement --- p.127 / Chapter 8.3.2 --- Scalability of SOF-Algorithm --- p.136 / Chapter 8.3.3 --- Effect of Parameters on SOF-Algorithm --- p.139 / Chapter 9 --- Conclusion --- p.140 / Bibliography --- p.142 / Publication --- p.149
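The LOF', LOF'' and GridLOF variants listed in this table of contents all build on the standard Local Outlier Factor. As background, a minimal brute-force LOF in the standard formulation looks roughly like the sketch below; the choice of `k` and the toy data are illustrative only, and the thesis's refined formulas are not reproduced here.

```python
import numpy as np

def lof(X, k=3):
    """Minimal Local Outlier Factor, following the standard definition.

    Scores near 1 mean a point sits in a region of density comparable to
    its neighbours; clearly larger values flag outliers.
    """
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # k nearest neighbours of each point (column 0 of the sort is the point itself).
    knn = np.argsort(D, axis=1)[:, 1:k + 1]
    k_dist = D[np.arange(n), knn[:, -1]]           # distance to the k-th neighbour
    # Reachability distance of p from o: max(k_dist(o), d(p, o)).
    reach = np.maximum(k_dist[knn], D[np.arange(n)[:, None], knn])
    lrd = 1.0 / reach.mean(axis=1)                 # local reachability density
    return lrd[knn].mean(axis=1) / lrd             # LOF score per point

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(30, 2)), [[8.0, 8.0]]])  # one far-away point
scores = lof(X)
print(scores[-1] > scores[:-1].max())  # the isolated point scores highest
```

This brute-force version is O(n²) in the distance matrix, which is exactly the cost that pruning schemes such as GridLOF aim to reduce.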
|
434 |
Writer identification using wavelet, contourlet and statistical models / He, Zhenyu 01 January 2006 (has links)
No description available.
|
435 |
Clustering of categorical and numerical data without knowing cluster number / Jia, Hong 01 January 2013 (has links)
No description available.
|
436 |
Analýza intranetu společnosti Sprinx Systems, a.s. a návrhy na jeho zlepšení. / Analysis of the intranet of Sprinx Systems, a.s. and suggestions for its improvement / Perná, Lucie January 2011 (has links)
This diploma thesis is devoted to a survey of intranet topics and an analysis of the current state of the corporate intranet of Sprinx Systems, a.s. An intranet can be useful for both small and large companies. A well-functioning intranet preserves a company's know-how and helps its users in their work. The first part of this diploma thesis surveys intranet topics, covering, for example, the historical development of intranets, their basic functions, creating an intranet plan, and planning intranet content. The second part analyses the current intranet, based on our own experience, structured interviews with managers, and a questionnaire survey, and closes with an evaluation and recommendations for improving the company's intranet in the future.
|
437 |
Un nouvel horizon pour la recommandation : intégration de la dimension spatiale dans l'aide à la décision / A new horizon for recommendation: integration of the spatial dimension into decision support / Chulyadyo, Rajani 19 October 2016 (has links)
De nos jours, il est très fréquent de représenter un système en termes de relations entre objets. Parmi les applications les plus courantes de telles données relationnelles, se situent les systèmes de recommandation (RS), qui traitent généralement des relations entre utilisateurs et items à recommander. Les modèles relationnels probabilistes (PRM) sont un bon choix pour la modélisation des dépendances probabilistes entre ces objets. Une tendance croissante dans les systèmes de recommandation est de rajouter une dimension spatiale à ces objets, que ce soient les utilisateurs, ou les items. Cette thèse porte sur l’intersection peu explorée de trois domaines connexes - modèles probabilistes relationnels (et comment apprendre les dépendances probabilistes entre attributs d’une base de données relationnelles), les données spatiales et les systèmes de recommandation. La première contribution de cette thèse porte sur le chevauchement des PRM et des systèmes de recommandation. Nous avons proposé un modèle de recommandation à base de PRM capable de faire des recommandations à partir des requêtes des utilisateurs, mais sans profils d’utilisateurs, traitant ainsi le problème du démarrage à froid. Notre deuxième contribution aborde le problème de l’intégration de l’information spatiale dans un PRM. / Nowadays it is very common to represent a system in terms of relationships between objects. One of the common applications of such relational data is Recommender System (RS), which usually deals with the relationships between users and items. Probabilistic Relational Models (PRMs) can be a good choice for modeling probabilistic dependencies between such objects. A growing trend in recommender systems is to add spatial dimensions to these objects, and make recommendations considering the location of users and/or items. 
This thesis deals with the (not much explored) intersection of three related fields: Probabilistic Relational Models (a method for learning probabilistic models from relational data), spatial data (often used in relational settings), and recommender systems (which deal with relational data). The first contribution of this thesis concerns the overlap of PRMs and recommender systems. We have proposed a PRM-based personalized recommender system that is capable of making recommendations from user queries in cold-start settings without user profiles. Our second contribution addresses the problem of integrating spatial information into a PRM.
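The cold-start idea described here, recommending from a user query alone with no user profile, can be caricatured without any PRM machinery. The toy scorer below is a stand-in for the thesis's probabilistic model; the item names, attributes, and overlap scoring are all invented for illustration.

```python
def recommend_from_query(query_terms, items, top_n=2):
    """Cold-start recommendation from a user query alone (no user profile).

    A toy stand-in for the PRM-based approach: items are scored simply by
    the overlap between query terms and item attributes.
    """
    scores = {name: len(set(query_terms) & set(attrs))
              for name, attrs in items.items()}
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, s in ranked if s > 0][:top_n]

# Hypothetical catalogue of items with descriptive attributes.
items = {
    "cafe_a": {"coffee", "wifi", "quiet"},
    "bar_b": {"beer", "music"},
    "cafe_c": {"coffee", "music"},
}
print(recommend_from_query({"coffee", "quiet"}, items))  # ['cafe_a', 'cafe_c']
```

A PRM would replace the overlap count with probabilistic dependencies learned from the relational schema, but the input/output contract (query in, ranked items out, no profile needed) is the same.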
|
438 |
Indexing Linked Data / Conicov, Andrei January 2012 (has links)
The fast evolution of the World Wide Web has made it possible to publish a huge number of linked documents, each representing a valuable piece of information. Linked Data is the term for a method of exposing and connecting such documents. Although this method is still in an experimental phase, it is already hard to process all the existing data sources, and the most obvious solution is to index them. The study addresses the question of how to design an index capable of operating with millions of such entries. It analyses existing projects and describes an index that may fulfil the requirements. The prototype implementation and the provided test results offer additional information about the index structure and its effectiveness.
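One common design for the kind of index discussed in this abstract is a set of permutation indexes over (subject, predicate, object) triples. The sketch below is a toy in-memory version under that assumption; the example URIs are invented, and real Linked Data stores use disk-backed, compressed structures rather than Python dictionaries.

```python
from collections import defaultdict

class TripleIndex:
    """A toy index over Linked Data triples (subject, predicate, object)."""

    def __init__(self):
        self.by_s = defaultdict(set)   # subject   -> triples
        self.by_p = defaultdict(set)   # predicate -> triples
        self.by_o = defaultdict(set)   # object    -> triples

    def add(self, s, p, o):
        triple = (s, p, o)
        for key, idx in ((s, self.by_s), (p, self.by_p), (o, self.by_o)):
            idx[key].add(triple)

    def lookup(self, s=None, p=None, o=None):
        # Intersect the posting sets of whichever terms are bound.
        sets = [idx[key] for key, idx in
                ((s, self.by_s), (p, self.by_p), (o, self.by_o))
                if key is not None]
        return set.intersection(*sets) if sets else set()

idx = TripleIndex()
idx.add("dbpedia:Prague", "rdf:type", "dbo:City")
idx.add("dbpedia:Prague", "dbo:country", "dbpedia:Czech_Republic")
idx.add("dbpedia:Brno", "rdf:type", "dbo:City")
print(len(idx.lookup(p="rdf:type", o="dbo:City")))  # 2
```

Scaling this pattern to millions of entries is precisely where the design questions raised in the study (storage layout, compression, update handling) come in.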
|
439 |
Measuring academic performance of students in Higher Education using data mining techniques / Alsuwaiket, Mohammed January 2018 (has links)
Educational Data Mining (EDM) is a developing discipline concerned with extending classical Data Mining (DM) methods and developing new methods for mining the data that originate from educational systems. It aims to use those methods to achieve a sound understanding of students and of the educational environment they need for better learning. These data are characterized by their large size and randomness, which can make it difficult for educators to extract knowledge from them. Additionally, knowledge extracted from data by counting the occurrence of certain events is not always reliable, since the counting process sometimes fails to take into consideration other factors and parameters that could affect the extracted knowledge. Student attendance in Higher Education has always been dealt with in a classical way: educators rely on counting occurrences of attendance or absence, building their knowledge about students and modules on this count. This method is neither credible nor does it necessarily provide a real indication of a student's performance. On the other hand, the choice of an effective student assessment method is an issue of interest in Higher Education. Various studies (Romero, et al., 2010) have shown that students tend to get higher marks when assessed through coursework-based assessment methods (either modules fully assessed through coursework, or a mixture of coursework and examinations) than when assessed by examination alone. A large number of EDM studies have pre-processed data through the conventional DM processes, including the data preparation process, but they use transcript data as it stands, without considering the weighting of examination and coursework results, which could affect prediction accuracy.
This thesis explores the above problems and tries to formulate the extracted knowledge in a way that guarantees accurate and credible results. Student attendance data, gathered from the educational system, were first cleaned in order to remove randomness and noise, then various attributes were studied so as to highlight the most significant ones affecting the real attendance of students. The next step was to derive an equation that measures Student Attendance's Credibility (SAC), considering the attributes chosen in the previous step. The reliability of the newly developed measure was then evaluated in order to examine its consistency. In terms of transcript data, this thesis proposes a different data preparation process, investigating more than 230,000 student records in order to prepare students' marks based on the assessment methods of enrolled modules. The data have been processed through different stages in order to extract a categorical factor through which students' module marks are refined during the data preparation process. The results of this work show that students' final marks should not be isolated from the nature of the enrolled modules' assessment methods; rather, these methods must be investigated thoroughly and considered during EDM's data pre-processing phases. More generally, it is concluded that educational data should not be prepared in the same way as existing data, due to differences in the sources of the data, their applications, and the types of errors in them. Therefore, an attribute, the Coursework Assessment Ratio (CAR), is proposed in order to take the different modules' assessment methods into account while preparing student transcript data. The effect of CAR and SAC on the prediction process, using data mining classification techniques such as Random Forest, Artificial Neural Networks and k-Nearest Neighbors, has been investigated.
The results were generated by applying the DM techniques to our data set and evaluated by measuring the statistical differences between the Classification Accuracy (CA) and Root Mean Square Error (RMSE) of all models. A comprehensive evaluation has been carried out on all the experimental results to compare the DM techniques, and it has been found that Random Forest (RF) has the highest CA and the lowest RMSE. The importance of SAC and CAR in increasing the prediction accuracy is demonstrated in Chapter 5. Finally, the results have been compared with previous studies that predicted students' final marks based on their marks at earlier stages of their study. The comparisons used similar data and attributes, first excluding average CAR and SAC and then including them, and measured the prediction accuracy in both cases. The aim of this comparison is to ensure that the new preparation stage positively affects the final results.
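The evaluation measures named in this abstract (CA and RMSE) and the proposed CAR attribute can be written down directly. The CAR definition below is inferred from the abstract's description (the share of a module's marks coming from coursework) and may differ from the thesis's exact formula; the weights and marks in the usage lines are invented.

```python
import math

def coursework_assessment_ratio(coursework_weight, exam_weight):
    """CAR sketch: the share of a module's final mark that comes from
    coursework. The thesis's exact definition may differ."""
    total = coursework_weight + exam_weight
    return coursework_weight / total if total else 0.0

def classification_accuracy(y_true, y_pred):
    """CA: fraction of correctly predicted labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Square Error for numeric predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
                     / len(y_true))

print(coursework_assessment_ratio(40, 60))                    # 0.4
print(classification_accuracy(["A", "B", "A"], ["A", "B", "C"]))
print(rmse([70, 55, 62], [65, 55, 70]))
```

Comparing models then amounts to preferring the one with higher CA and lower RMSE, which is the criterion under which Random Forest came out best.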
|
440 |
Conditional Differential Expression for Biomarker Discovery In High-throughput Cancer Data / Wang, Dao Sen 15 February 2019 (has links)
Biomarkers have important clinical uses as diagnostic, prognostic, and predictive tools for cancer therapy. However, translation of biomarkers claimed in the literature into clinical use has traditionally been poor. Importantly, clinical covariates have been shown to be important factors in biomarker discovery in small-scale studies. Yet traditional differential gene expression analysis for expression biomarkers ignores covariates, which are only accounted for later, if at all. We conjecture that covariate-sensitive biomarker identification should lead to the discovery of more robust and true biomarkers, as confounding effects are considered. Here we examine gene expression in more than 750 breast invasive ductal carcinoma cases from The Cancer Genome Atlas (TCGA-BRCA) in the form of RNA-Seq data. Specifically, we focus on differential gene expression with respect to understanding the biology of HER2, ER, and PR, the three key receptors in breast cancer. We explore methods of differential expression analysis, including non-parametric Mann-Whitney-Wilcoxon analysis, generalized linear models with covariates, and a novel categorical method for covariates. We tested the influence of common patient characteristics, such as age and race, and clinical covariates such as HER2, ER, and PR receptor statuses. More importantly, we show that inclusion of a correlated covariate (e.g. PR status as a covariate in ER analysis) substantially changes the list of differentially expressed genes, removing many likely false positives and revealing genes obscured by the covariate. Incorporation of relevant covariates in differential gene expression analysis holds strong biological importance with respect to biomarker discovery and may be the next step towards better translation of biomarkers to clinical use.
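The effect described here, a correlated covariate inflating apparent differential expression, can be shown with a toy categorical adjustment. Everything below (the expression values, the stratification by PR status, the averaging of within-stratum differences) is illustrative only and is not the paper's actual method.

```python
from statistics import mean

def mean_diff(values, group):
    """Pooled differential expression: mean of "+" group minus mean of "-"."""
    pos = [v for v, g in zip(values, group) if g == "+"]
    neg = [v for v, g in zip(values, group) if g == "-"]
    return mean(pos) - mean(neg)

def stratified_mean_diff(values, group, covariate):
    """Covariate-sensitive version: average the within-stratum differences.
    A simple categorical adjustment standing in for the thesis's method."""
    diffs = []
    for level in sorted(set(covariate)):
        vs = [v for v, c in zip(values, covariate) if c == level]
        gs = [g for g, c in zip(group, covariate) if c == level]
        diffs.append(mean_diff(vs, gs))
    return mean(diffs)

# Toy gene whose expression is driven by PR status, with PR correlated with ER.
expr = [9.0, 9.2, 5.1, 5.0, 4.9, 9.1]
er   = ["+", "+", "+", "-", "-", "-"]
pr   = ["+", "+", "-", "-", "-", "+"]

print(mean_diff(expr, er))               # pooled ER effect looks substantial
print(stratified_mean_diff(expr, er, pr))  # nearly vanishes within PR strata
```

The pooled comparison flags the gene as ER-associated, but conditioning on PR reveals that the signal comes from the covariate: exactly the kind of likely false positive the covariate-sensitive analysis removes.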
|