  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
301

Improvements on Trained Across Multiple Experiments (TAME), a New Method for Treatment Effect Detection

Patikorn, Thanaporn 08 May 2017 (has links)
One of my previous works introduced a new data mining technique for analyzing multiple experiments called TAME: Trained Across Multiple Experiments. TAME detects treatment effects of a randomized controlled experiment by utilizing data from outside the experiment of interest. TAME with linear regression showed promising results: in all simulated scenarios, TAME was at least as good as the standard method, ANOVA, and was significantly better than ANOVA in certain scenarios. In this work, I further investigated and improved TAME by altering how it assembles data and creates subject models. I found that mean-centering "prior" data and treating each experiment as equally important allow TAME to detect treatment effects better. In addition, I did not find Random Forest to be compatible with TAME.
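The abstract does not reproduce TAME's implementation, but its core ingredient, using mean-centered prior-experiment data as a covariate alongside a treatment indicator in a linear regression, can be sketched as follows. All names and the simulated data are illustrative, not the thesis's actual code:

```python
import numpy as np

def detect_effect(prior, outcome, treated):
    """Estimate a treatment effect with mean-centered prior data as a covariate.

    prior   : each subject's performance from earlier experiments
    outcome : post-treatment outcomes
    treated : 0/1 treatment-assignment indicator
    Returns the coefficient on the treatment indicator.
    """
    prior_centered = prior - prior.mean()          # mean-centering, as the thesis recommends
    X = np.column_stack([np.ones_like(prior),      # intercept
                         prior_centered,           # prior-data covariate
                         treated])                 # treatment indicator
    beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return beta[2]

# Toy randomized experiment with a true treatment effect of 2.0
rng = np.random.default_rng(0)
prior = rng.normal(50, 10, 200)
treated = rng.integers(0, 2, 200)
outcome = 0.8 * prior + 2.0 * treated + rng.normal(0, 1, 200)
effect = detect_effect(prior, outcome, treated)    # should land near 2.0
```

Adjusting for the prior covariate shrinks residual variance, which is why such a regression can detect effects that a plain ANOVA on the outcome alone would miss.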
302

Pivot-based Data Partitioning for Distributed k Nearest Neighbor Mining

Kuhlman, Caitlin Anne 20 January 2017 (has links)
This thesis addresses the need for a scalable distributed solution to k-nearest-neighbor (kNN) search, a fundamental data mining task. This unsupervised method poses particular challenges on shared-nothing distributed architectures, where global information about the dataset is not available to individual machines. The distance within which to search for neighbors is not known a priori, so a dynamic data partitioning strategy is required to guarantee that exact kNN results can be found autonomously on each machine. Pivot-based partitioning has been shown to facilitate the bounding of partitions; however, state-of-the-art methods suffer from prohibitive data duplication (upwards of 20x the size of the dataset). In this work, an innovative method for exact distributed kNN search called PkNN is presented. The key idea is to perform the computation over several rounds, leveraging pivot-based data partitioning at each stage. Aggressive data-driven bounds limit communication costs, and a number of optimizations are designed for efficient computation. An experimental study on large real-world data (over 1 billion points) compares PkNN to the state-of-the-art distributed solution, demonstrating that the benefits of the additional stages of computation heavily outweigh the added I/O overhead. PkNN achieves a data duplication rate close to 1, a significant speedup over previous solutions, and effective scaling in data cardinality and dimension. PkNN can also facilitate distributed solutions to other unsupervised learning methods that rely on kNN search as a critical building block; as one example, a distributed framework for the Local Outlier Factor (LOF) algorithm is given. Testing on large real-world and synthetic data with varying characteristics measures the scalability of PkNN and the distributed LOF framework in data size and dimensionality.
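The bounding idea behind pivot-based partitioning can be sketched on a single machine: assign each point to its nearest pivot, record each partition's radius, and use the triangle inequality to skip partitions that cannot contain closer neighbors. This is only an illustration of the general technique, not the distributed PkNN implementation, and all names are invented:

```python
import numpy as np

def pivot_partition(points, pivots):
    """Assign each point to its nearest pivot; return owners and partition radii."""
    d = np.linalg.norm(points[:, None, :] - pivots[None, :, :], axis=2)
    owner = d.argmin(axis=1)
    radius = np.array([d[owner == j, j].max() if (owner == j).any() else 0.0
                       for j in range(len(pivots))])
    return owner, radius

def knn_with_pruning(q, points, pivots, owner, radius, k):
    """Exact kNN: search the home partition first, prune others by a
    triangle-inequality bound, and scan only the partitions that survive."""
    d_qp = np.linalg.norm(pivots - q, axis=1)
    home = d_qp.argmin()
    cand = points[owner == home]
    best = np.sort(np.linalg.norm(cand - q, axis=1))[:k]
    kth = best[-1] if len(best) >= k else np.inf
    for j in range(len(pivots)):
        if j == home:
            continue
        # every x in partition j satisfies d(q, x) >= d(q, pivot_j) - radius_j
        if d_qp[j] - radius[j] >= kth:
            continue                      # partition j cannot improve the result
        dj = np.linalg.norm(points[owner == j] - q, axis=1)
        best = np.sort(np.concatenate([best, dj]))[:k]
        kth = best[-1]
    return best

rng = np.random.default_rng(1)
pts = rng.random((500, 2))
piv = pts[rng.choice(500, 4, replace=False)]
owner, radius = pivot_partition(pts, piv)
q = np.array([0.5, 0.5])
approx = knn_with_pruning(q, pts, piv, owner, radius, k=5)
exact = np.sort(np.linalg.norm(pts - q, axis=1))[:5]   # brute-force check
```

The pruned search returns exactly the brute-force distances; in a distributed setting the same bound lets a machine decide locally which remote partitions it must consult.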
303

When practice does not make perfect: Differentiating between productive and unproductive persistence

Almeda, Ma. Victoria Quintos January 2018 (has links)
Research has suggested that persistence in the face of challenges plays an important role in learning. However, recent work on wheel-spinning (a type of unproductive persistence in which students spend too much time struggling without achieving mastery of a skill) has shown that not all persistence is uniformly beneficial for learning. For this reason, Study 1 used educational data-mining techniques to determine key differences between the behaviors associated with productive persistence and wheel-spinning in ASSISTments, an online math learning platform. The results indicated that three features differentiate between these two modes of persistence: the number of hints requested in any problem, the number of bottom-out hints in the last eight problems, and the variation in the delay between solving problems of the same skill. These findings suggest that the number of hints can provide insight into which students are struggling, and that encouraging longer delays between problems likely helps reduce wheel-spinning. Using the same definition of productive persistence as Study 1, Study 2 investigated the relationship between productive persistence and grit using Duckworth and Quinn's (2009) Short Grit Scale. Correlational results showed that the two constructs were not significantly correlated, with implications for synthesizing the literature on student persistence across computer-based learning environments and traditional classrooms.
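The three differentiating features named above can be computed from a per-problem log in a few lines. The log data below is invented for illustration; the abstract does not show the actual ASSISTments feature extraction:

```python
from statistics import pstdev

# Hypothetical per-problem log for one student on one skill:
# (timestamp in seconds, hints requested, reached the bottom-out hint?)
log = [(0, 0, False), (60, 2, False), (150, 4, True),
       (300, 1, False), (420, 5, True), (600, 0, False)]

# Feature 1: hints requested in each problem
hints_per_problem = [h for _, h, _ in log]

# Feature 2: bottom-out hints in the last eight problems
bottom_out_last8 = sum(b for _, _, b in log[-8:])

# Feature 3: variation (population std. dev.) of delays between
# consecutive problems of the same skill
delays = [b - a for (a, _, _), (b, _, _) in zip(log, log[1:])]
delay_variation = pstdev(delays)
```

On this toy log the student requested two bottom-out hints and the between-problem delays vary with a standard deviation of about 42 seconds.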
304

Consistent data aggregate retrieval for sensor network systems.

January 2005 (has links)
Lee Lok Hang. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2005. / Includes bibliographical references (leaves 87-93). / Abstracts in English and Chinese.

Contents:
1. Introduction: sensors and sensor networks; sensor network deployment; motivations; contributions; thesis organization
2. Literature Review: data cube; data aggregation in sensor networks (hierarchical, gossip-based, and hierarchical gossip aggregation); the GAF algorithm; concurrency control (two-phase locking, timestamp ordering)
3. Building Distributed Data Cubes in Sensor Networks: aggregation operators; the distributed prefix sum (PS) data cube and the distributed local prefix sum (LPS) data cube (for each: notations, querying, construction, time bounds, fast aggregate queries on multiple regions, and simulation results), with a concluding PS vs. LPS comparison
4. Concurrency Control and Consistency in Sensor Networks: data inconsistency in sensor networks; traditional concurrency control protocols and sensor networks; consistent retrieval of data from distributed data cubes
5. Conclusions; references; appendix (publications)
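The prefix sum (PS) data cube named in the contents answers arbitrary rectangular aggregate queries in constant time. The thesis's distributed construction is not reproduced here, but the underlying PS idea can be sketched in a centralized, 2-D toy version (the grid values and function names are illustrative):

```python
def build_ps(grid):
    """Build a 2-D prefix-sum cube: ps[i][j] = sum of grid[0..i-1][0..j-1]."""
    rows, cols = len(grid), len(grid[0])
    ps = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            ps[i+1][j+1] = grid[i][j] + ps[i][j+1] + ps[i+1][j] - ps[i][j]
    return ps

def range_sum(ps, r1, c1, r2, c2):
    """SUM over the rectangle [r1..r2] x [c1..c2], inclusive, with four lookups."""
    return ps[r2+1][c2+1] - ps[r1][c2+1] - ps[r2+1][c1] + ps[r1][c1]

readings = [[1, 2, 3],
            [4, 5, 6],
            [7, 8, 9]]
ps = build_ps(readings)
total = range_sum(ps, 1, 1, 2, 2)   # 5 + 6 + 8 + 9 = 28
```

The distributed versions in the thesis spread such cells across sensor nodes so that a region query touches only a bounded number of them.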
305

The fuzzification of Choquet integral and its applications. / CUHK electronic theses & dissertations collection

January 2005 (has links)
Uncertainty is an essential feature of human problem solving and decision making, and uncertain information occurs frequently in business, scientific, and engineering disciplines. The explosive growth and diverse forms of uncertain information in stored data have generated an urgent requirement for new techniques and tools that can intelligently and automatically assist us in eliciting valuable knowledge from raw data.

This thesis is devoted to a comprehensive investigation of innovative data mining methodologies that merge the advantages of the nonlinear (Choquet) integral, in representing nonlinear relationships, with those of fuzzy set theory, in describing the uncertainty present in practical databases. It proposes two fuzzifications of the classical Choquet integral: the Defuzzified Choquet Integral with Fuzzy-valued Integrand (DCIFI) and the Fuzzified Choquet Integral with Fuzzy-valued Integrand (FCIFI). Both are generalizations of the Choquet integral in that they allow fuzzy-valued integrands; they differ in that the DCIFI's integration result is non-fuzzy while the FCIFI's is fuzzy. Owing to these different result forms, the DCIFI and the FCIFI have distinct theoretical analyses, implementation algorithms, and application scopes.

The DCIFI is defined through the Choquet extension of a signed fuzzy measure, and a numerical algorithm is implemented to compute its integration result. A DCIFI regression model is designed to handle regression problems involving heterogeneous fuzzy data, with a GA-based Double Optimization Algorithm (GDOA) retrieving the model's internal coefficients. In addition, a DCIFI projection classifier, capable of classifying heterogeneous fuzzy data efficiently and effectively, is established, with a GA-based Classifier-learning Algorithm (GACA) searching its relevant internal parameters. Both the regression model and the projection classifier are informative and powerful in dealing with heterogeneous fuzzy data sets with strong interaction; their performance is validated by a series of experiments on both synthetic and real data. (Abstract shortened by UMI.)

by Rong Yang. / "April 2005." / Advisers: Kwong-Sak Leung; Pheng-Ann Heng. / Source: Dissertation Abstracts International, Volume: 67-01, Section: B, page: 0371. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2005. / Includes bibliographical references (p. 187-199). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract in English and Chinese. / School code: 1307.
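For readers unfamiliar with the classical (non-fuzzy) Choquet integral that the DCIFI and FCIFI generalize, a minimal sketch of its discrete form follows. The measure values and attribute names are invented, and this computes only the crisp integral, not either fuzzification:

```python
def choquet(values, mu):
    """Discrete Choquet integral of non-negative `values` (dict attr -> value)
    with respect to a monotone measure `mu` (dict frozenset -> weight).

    Sort attributes by ascending value; each increment of the integrand is
    weighted by the measure of the set of attributes at or above that level.
    """
    attrs = sorted(values, key=values.get)
    total, prev = 0.0, 0.0
    for i, a in enumerate(attrs):
        upper = frozenset(attrs[i:])          # attributes with value >= values[a]
        total += (values[a] - prev) * mu[upper]
        prev = values[a]
    return total

# Hypothetical 2-attribute measure with interaction: mu({x,y}) < mu({x}) + mu({y})
mu = {frozenset(): 0.0,
      frozenset({'x'}): 0.6,
      frozenset({'y'}): 0.7,
      frozenset({'x', 'y'}): 1.0}
result = choquet({'x': 0.4, 'y': 0.9}, mu)    # 0.4*1.0 + 0.5*0.7 = 0.75
```

The non-additive measure is what lets the integral model interaction between attributes, which is the property both fuzzifications in the thesis preserve.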
306

Learning with unlabeled data. / CUHK electronic theses & dissertations collection

January 2009 (has links)
We consider the problem of learning from both labeled and unlabeled data through an analysis of the quality of the unlabeled data. Learning from both labeled and unlabeled data is usually regarded as semi-supervised learning, where the unlabeled and labeled data are assumed to be generated from the same distribution. When this assumption is not satisfied, new learning paradigms are needed to effectively exploit the information underneath the unlabeled data. This thesis consists of two parts: the first part analyzes the fundamental assumptions of semi-supervised learning and proposes a few efficient semi-supervised learning models; the second part discusses three learning frameworks for the case where the unlabeled data do not satisfy the conditions of semi-supervised learning.

In the first part, we deal with unlabeled data that are of good quality and follow the conditions of semi-supervised learning. Firstly, we present a novel method for the Transductive Support Vector Machine (TSVM) that relaxes the unknown labels to continuous variables and reduces the non-convex optimization problem to a convex semi-definite programming problem. In contrast to the previous relaxation method, which involves O(n^2) free parameters in the semi-definite matrix, our method reduces the number of free parameters to O(n), so that the optimization problem can be solved more efficiently; it also provides a tighter convex relaxation for the optimization problem in TSVM. Empirical studies on benchmark data sets demonstrate that the proposed method is more efficient than the previous semi-definite relaxation method and achieves promising classification results compared with state-of-the-art methods. Our second contribution is an extended level method for efficiently solving multiple kernel learning (MKL) problems. In particular, the level method overcomes the drawbacks of both the Semi-Infinite Linear Programming (SILP) method and the Subgradient Descent (SD) method for multiple kernel learning. Our experimental results show that the level method greatly reduces the computational time of MKL relative to both the SD and SILP methods. Thirdly, we discuss the connection between two fundamental assumptions in semi-supervised learning. More specifically, we show that the loss on the unlabeled data used by TSVM can essentially be viewed as an additional regularizer for the decision boundary, and that this regularizer is closely related to the one introduced by manifold regularization; both can be viewed within a unified regularization framework for semi-supervised learning.

In the second part, we discuss how to employ unlabeled data to build reliable classification systems in three scenarios: (1) only poorly related unlabeled data are available; (2) good-quality unlabeled data are mixed with irrelevant data, with no prior knowledge of their composition; and (3) no unlabeled data are available, but they can be obtained from the Internet for text categorization. We build a framework for each case. Firstly, we present the Supervised Self-taught Learning framework, which can actively transfer knowledge from weakly related unlabeled data; the proposed model selects those discriminative features or representations that are more appropriate for classification. Secondly, we propose a novel framework that can learn from a mixture of unlabeled data, where good-quality unlabeled data are mixed with irrelevant samples and no prior knowledge of which samples are relevant is needed; it is therefore significantly different from the recent frameworks of semi-supervised learning with universum and the Universum Support Vector Machine. As an important contribution, we formulate this new learning approach as a semi-definite programming problem, which can be solved in polynomial time. A series of experiments demonstrate that this framework has advantages over semi-supervised learning on both synthetic and real data in many respects. Finally, for the third scenario, we present a general framework for semi-supervised text categorization that collects unlabeled documents via Web search engines and utilizes them to improve the accuracy of supervised text categorization. Extensive experiments demonstrate that the framework significantly improves classification accuracy; specifically, the classification error is reduced by 30% averaged over the nine data sets when using Google as the search engine.

Xu, Zenglin. / Advisers: Irwin King; Michael R. Lyu. / Source: Dissertation Abstracts International, Volume: 70-09, Section: B. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2009. / Includes bibliographical references (leaves 158-179). / Abstracts in English and Chinese. / School code: 1307.
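The "unlabeled loss as a regularizer" observation from the first part can be illustrated with the symmetric hinge loss commonly used by TSVM. This is a sketch of the standard loss, with invented decision values, not the thesis's formulation:

```python
# The TSVM loss on an unlabeled point x is the symmetric hinge
# max(0, 1 - |f(x)|): it is large when the decision value f(x) is near 0,
# i.e. when the boundary cuts through the unlabeled data. This is why it
# acts as a regularizer pushing the boundary into low-density regions.
def tsvm_unlabeled_loss(decision_values):
    return sum(max(0.0, 1.0 - abs(v)) for v in decision_values)

# A boundary far from the unlabeled points (|f| >= 1) pays nothing...
loss_far = tsvm_unlabeled_loss([1.5, -2.0, 1.1])
# ...while a boundary cutting through them is penalized.
loss_near = tsvm_unlabeled_loss([0.1, -0.2, 0.0])
```

Manifold regularization penalizes boundaries that separate nearby unlabeled points from each other; both terms reward boundaries that respect the unlabeled data's structure, which is the unification the thesis develops.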
307

Efficient similarity search in time series data. / CUHK electronic theses & dissertations collection

January 2007 (has links)
Time series data is ubiquitous in the real world, and similarity search in time series data is of great importance to many applications. The problem consists of two major parts: how to define the similarity between time series and how to search for similar time series efficiently. As a similarity measure, the Euclidean distance is a good starting point; however, it has several limitations. First, it is sensitive to shifting and scaling transformations. Under a geometric model, we analyze this problem extensively and propose an angle-based similarity measure that is invariant to shifting and scaling; we then extend the conical index to support the proposed measure efficiently. Besides distortions in the amplitude axis, the Euclidean distance is also sensitive to distortion in the time axis. Dynamic Time Warping (DTW) distance is a similarity measure invariant to time distortion, but its high time complexity inhibits its application to large datasets. Indexing under DTW distance is a common solution to this problem, and lower-bound techniques play an important role in such indexing. We explain the existing lower-bound functions under a unified framework and propose a group of new lower-bound functions that are considerably tighter. Based on the proposed lower-bound functions, an efficient index structure under DTW distance is implemented. In spite of the great success of DTW, it is not well suited to the time scaling search problem, where the time distortion is too large; we modify the traditional DTW distance and propose the Segment-wise Time Warping (STW) distance to adapt to this problem. Finally, we devise an efficient search algorithm for online pattern detection in data streams under DTW distance.

Zhou, Mi. / "January 2007." / Adviser: Man Hon Wong. / Source: Dissertation Abstracts International, Volume: 68-09, Section: B, page: 6100. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2007. / Includes bibliographical references (p. 167-180). / Abstract in English and Chinese. / School code: 1307.
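Although the thesis proposes its own, tighter lower-bound functions, the general pattern of lower-bounding band-constrained DTW can be sketched with a Keogh-style envelope bound. The implementation and the test series below are illustrative, not taken from the thesis:

```python
def dtw_banded(a, b, r):
    """DTW distance between equal-length series under a Sakoe-Chiba band of
    half-width r, with squared-error step cost (O(n*r) cells evaluated)."""
    inf = float('inf')
    n = len(a)
    D = [[inf] * (n + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - r), min(n, i + r) + 1):
            cost = (a[i-1] - b[j-1]) ** 2
            D[i][j] = cost + min(D[i-1][j], D[i][j-1], D[i-1][j-1])
    return D[n][n] ** 0.5

def lb_keogh(q, c, r):
    """Cheap O(n*r) lower bound on dtw_banded(q, c, r): sum the distance from
    each point of c to the envelope of q over the same band."""
    total = 0.0
    for i, v in enumerate(c):
        window = q[max(0, i - r):i + r + 1]
        lo, hi = min(window), max(window)
        if v > hi:
            total += (v - hi) ** 2
        elif v < lo:
            total += (lo - v) ** 2
    return total ** 0.5

q = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0]
c = [0.5, 1.5, 2.5, 3.5, 2.5, 1.5]
lb = lb_keogh(q, c, 1)     # cheap bound, never exceeds the true distance
d = dtw_banded(q, c, 1)    # exact banded DTW
```

An index prunes a candidate whenever its lower bound already exceeds the best distance found so far, so tighter bounds (the thesis's contribution) directly translate into fewer full DTW computations.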
308

Materializing views in data warehouse: an efficient approach to OLAP.

January 2003 (has links)
Gou Gang. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2003. / Includes bibliographical references (leaves 83-87). / Abstracts in English and Chinese.

Contents:
1. Introduction: data warehouse and OLAP; the computational model (dependent lattice); materialized view selection under a disk-space constraint and under a maintenance-time constraint; main contributions
2. A* Search (view selection under a disk-space constraint): the weakness of greedy algorithms; the A*-algorithm (an estimation function, pruning feasible subtrees, approaching the optimal solution from two directions, the NIBS order for accelerating convergence, sliding techniques for eliminating redundant H-computation, examples); experimental results (analysis, computing for a series of disk-space constraints); conclusions
3. Randomized Search (view selection under a maintenance-time constraint): the non-monotonic property; a stochastic-ranking-based evolutionary algorithm (a basic evolutionary algorithm, the weakness of the rg-method, stochastic ranking as a novel constraint-handling technique, view selection using the algorithm); conclusions
4. Conclusions: thesis review; future work
Appendix: publications for this thesis; bibliography
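The greedy baseline whose weakness motivates the A* search above is, presumably, the classical benefit-per-unit-space greedy over the view lattice; the following is a sketch under that assumption, with an invented three-dimension lattice (a view w can be answered from v iff w's grouping attributes are a subset of v's):

```python
# Hypothetical lattice: grouping-attribute set -> view size in rows.
sizes = {frozenset('abc'): 100, frozenset('ab'): 50, frozenset('ac'): 80,
         frozenset('bc'): 30, frozenset('a'): 20, frozenset('b'): 10,
         frozenset('c'): 25, frozenset(): 1}

def answer_cost(view, materialized):
    """Cost of answering `view` = size of its smallest materialized ancestor."""
    return min(sizes[m] for m in materialized if view <= m)

def greedy_select(budget):
    """Repeatedly materialize the view with the best query-cost saving per
    unit of disk space, until the budget admits no further improvement."""
    chosen = {frozenset('abc')}           # the full cube is always materialized
    space = 0
    while True:
        best, best_gain = None, 0.0
        for v in sizes:
            if v in chosen or space + sizes[v] > budget:
                continue
            gain = sum(max(0, answer_cost(w, chosen) - sizes[v])
                       for w in sizes if w <= v)
            if gain / sizes[v] > best_gain:
                best, best_gain = v, gain / sizes[v]
        if best is None:
            return chosen
        chosen.add(best)
        space += sizes[best]

views = greedy_select(budget=60)
```

Because the greedy commits to locally best picks, it can miss combinations a global search would find, which is the gap the thesis's A* algorithm closes at acceptable cost.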
309

On feature selection, kernel learning and pairwise constraints for clustering analysis

Zeng, Hong 01 January 2009 (has links)
No description available.
310

A study of two problems in data mining: anomaly monitoring and privacy preservation.

January 2008 (has links)
Bu, Yingyi. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2008. / Includes bibliographical references (leaves 89-94). / Abstracts in English and Chinese.

Contents:
1. Introduction: anomaly monitoring; privacy preservation (motivation, contribution)
2. Anomaly Monitoring: problem statement; a preliminary solution (simple pruning); efficient monitoring by local clusters (incremental local clustering, batch monitoring by cluster join, cost analysis and optimization); piecewise index and query reschedule (piecewise VP-trees, candidate rescheduling, cost analysis); an upper bound lemma for the dynamic time warping distance; experimental evaluations (effectiveness, efficiency); related work
3. Privacy Preservation: problem definition; HD-composition (role-based partition, cohort-based partition, privacy guarantee, refinement, anonymization algorithm); experiments (failures of conventional generalizations, evaluations of HD-composition); related work
4. Conclusions
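The "simple pruning" preliminary solution named in Chapter 2 is, in spirit, an early-terminating distance-based outlier test; the following 1-D sketch illustrates that general idea only (the data, thresholds, and names are invented, not the thesis's algorithm):

```python
def is_anomaly(x, window, k, r):
    """Distance-based outlier test with early termination ("simple pruning"):
    stop scanning as soon as k neighbors within radius r are found.
    Note x counts itself when it is a member of `window`."""
    neighbors = 0
    for y in window:
        if abs(x - y) <= r:
            neighbors += 1
            if neighbors >= k:
                return False        # enough close neighbors: not an anomaly
    return True

window = [10.1, 10.3, 9.8, 10.0, 10.2, 25.0]
flags = [v for v in window if is_anomaly(v, window, k=3, r=1.0)]   # [25.0]
```

The early exit is what makes the scan cheap on normal points; the local-cluster and VP-tree techniques in Chapter 2 then avoid scanning most of the window at all.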
