1. CVIC: Cluster Validation Using Instance-Based Confidences
LeBaron, Dean M, 01 November 2015
As unlabeled data becomes increasingly available, the need for robust data mining techniques grows as well. Clustering is a common data mining tool that seeks to find groups of related patterns in data, called clusters. The cluster validation problem asks how well a given clustering fits the data set. We present CVIC (cluster validation using instance-based confidences), which assigns confidence scores to individual instances, in contrast to more traditional methods that focus on the clusters themselves. CVIC trains supervised learners to recreate the clustering and scores each instance using the learners' output, which corresponds to the confidence that the instance was clustered correctly. One consequence of individually validated instances is the ability to direct users to instances in a cluster that are potentially misclustered or correctly clustered. Instances with low confidences can be manually inspected or reclustered, and instances with high confidences can be automatically labeled. We compare CVIC to three competing methods for assigning confidence scores and show that CVIC assigns scores that yield higher average precision and recall for detecting misclustered and correctly clustered instances across five clustering algorithms on twenty data sets, including handwritten historical image data provided by Ancestry.com.
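A minimal sketch of the instance-confidence idea described above, not the authors' implementation: cluster the data, train a supervised learner to recreate the cluster assignments, and read each instance's predicted probability for its own cluster as its confidence. The learner, the out-of-fold scoring, and the synthetic data are all illustrative assumptions.

# Sketch only: assumes scikit-learn; CVIC's actual learners and scoring may differ.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Out-of-fold probabilities so each instance is scored by a learner that did
# not train on it (an assumption here; the abstract does not specify this).
proba = cross_val_predict(RandomForestClassifier(random_state=0),
                          X, cluster_labels, cv=5, method="predict_proba")
confidence = proba[np.arange(len(X)), cluster_labels]

low_conf = np.argsort(confidence)[:10]    # candidates for inspection or reclustering
high_conf = np.argsort(confidence)[-10:]  # candidates for automatic labeling

Low-confidence instances would go to a human for inspection or be reclustered; high-confidence ones could be labeled automatically, matching the workflow the abstract describes.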
2. Active learning in cost-sensitive environments
Liu, Alexander Yun-chung, 21 June 2010
Active learning techniques aim to reduce the amount of labeled data required for a supervised learner to achieve a certain level of performance. This can be very useful in domains where unlabeled data is easy to obtain but labeling data is costly. In this dissertation, I introduce methods for creating computationally efficient active learning techniques that handle different misclassification costs, different evaluation metrics, and different label acquisition costs. This is accomplished in part by developing techniques from utility-based data mining that are typically not studied in conjunction with active learning. I first address supervised learning problems where labeled data may be scarce, especially for one particular class. I revisit claims about resampling, a particularly popular approach to handling imbalanced data, and about cost-sensitive learning. The presented research shows that while resampling and cost-sensitive learning can be equivalent in some cases, the two approaches are not identical. This work motivates the need for active learners that can handle different misclassification costs. After presenting a cost-sensitive active learning algorithm, I show that it can be combined with a proposed framework for analyzing evaluation metrics to create an active learning approach that can optimize any evaluation metric expressible as a function of the terms of a confusion matrix. Finally, I address active learning under different utility costs for labeling different types of points, particularly when label acquisition costs are spatially driven.
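A hedged sketch of one way to make an uncertainty-style query rule cost-sensitive, in the spirit of the paragraph above; the dissertation's actual algorithm is not given in this abstract, and the cost matrix, model, and function name are illustrative assumptions:

# Sketch only: query the unlabeled point whose minimum expected
# misclassification cost under the current model is largest.
import numpy as np
from sklearn.linear_model import LogisticRegression

def query_most_costly(model, X_unlabeled, cost_matrix):
    # cost_matrix[i, j] = cost of predicting class j when the true class is i.
    proba = model.predict_proba(X_unlabeled)   # (n, k) class posteriors
    expected = proba @ cost_matrix             # expected cost of each possible prediction
    # The model would predict the min-expected-cost class; that residual
    # cost measures how risky it is to leave the point unlabeled.
    risk = expected.min(axis=1)
    return int(np.argmax(risk))

# Hypothetical pool where false negatives cost ten times false positives.
rng = np.random.default_rng(0)
X_lab, y_lab = rng.normal(size=(40, 2)), rng.integers(0, 2, 40)
X_pool = rng.normal(size=(200, 2))
C = np.array([[0.0, 1.0], [10.0, 0.0]])
model = LogisticRegression().fit(X_lab, y_lab)
print("next point to label:", query_most_costly(model, X_pool, C))

With a symmetric 0/1 cost matrix this reduces to ordinary uncertainty sampling; asymmetric costs push queries toward points where an expensive mistake is still plausible.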