361 |
DARM: distance-based association rule mining. Icev, Aleksandar. January 2003 (has links)
Thesis (M.S.)--Worcester Polytechnic Institute. / Keywords: spatial data mining; distance-based association rules; distance-based Apriori algorithm. Includes bibliographical references (p. 51-54).
|
362 |
An improved unsupervised modeling methodology for detecting fraud in vendor payment transactions / Rouillard, Gregory W. January 2003 (has links) (PDF)
Thesis (M.S. in Operations Research)--Naval Postgraduate School, June 2003. / Thesis advisor(s): Samuel E. Buttrey, Lyn R. Whitaker. Includes bibliographical references (p. 147-148). Also available online.
|
363 |
CUDIA : a probabilistic cross-level imputation framework using individual auxiliary information / Probabilistic cross-level imputation framework using individual auxiliary information. Park, Yubin, 17 February 2012 (has links)
In healthcare-related studies, individual patient or hospital data are often not publicly available due to privacy restrictions, legal issues or reporting norms. However, such measures may be provided at a higher, more aggregated level, such as state-level or county-level summaries, or averages over health zones such as Hospital Referral Regions (HRR) or Hospital Service Areas (HSA). Such levels constitute partitions over the underlying individual-level data, which may not match the groupings that would have been obtained if one clustered the data based on individual-level attributes. Moreover, treating aggregated values as representatives for the individuals can result in the ecological fallacy. How can one run data mining procedures on such data, where different variables are available at different levels of aggregation or granularity? In this thesis, we seek a better utilization of variably aggregated datasets, which are possibly assembled from different sources. We propose a novel "cross-level" imputation technique that models the generative process of such datasets using a Bayesian directed graphical model. The imputation is based on the underlying data distribution and is shown to be unbiased. The imputed values can be further utilized in subsequent predictive modeling, yielding improved accuracy. Experimental results using a simulated dataset and the Behavioral Risk Factor Surveillance System (BRFSS) dataset illustrate the generality and capabilities of the proposed framework. / text
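As a toy illustration of the cross-level setting (not the CUDIA model itself), the sketch below clusters individuals on their auxiliary features and then solves for cluster-level values that reproduce the reported region-level means; all variable names and numbers are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical toy data: 600 individuals in 6 reporting regions (e.g., HRRs).
# x holds individual-level auxiliary features; the outcome y is unobserved at
# the individual level, only its per-region mean is reported.
n, d, n_regions, n_clusters = 600, 5, 6, 3
x = rng.normal(size=(n, d))
region = rng.integers(0, n_regions, size=n)
true_y = x @ rng.normal(size=d) + 0.1 * rng.normal(size=n)   # held-out "truth"
region_mean = np.array([true_y[region == r].mean() for r in range(n_regions)])

# Step 1: cluster individuals using the auxiliary features only.
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(x)

# Step 2: estimate one value per cluster so that region-level mixtures of the
# cluster values reproduce the reported region means (least squares).
P = np.zeros((n_regions, n_clusters))   # P[r, c] = share of cluster c in region r
for r in range(n_regions):
    in_r = labels[region == r]
    P[r] = np.bincount(in_r, minlength=n_clusters) / len(in_r)
cluster_vals, *_ = np.linalg.lstsq(P, region_mean, rcond=None)

# Step 3: impute each individual's value with its cluster's estimate.
y_imputed = cluster_vals[labels]
print("correlation with the held-out truth:", np.corrcoef(y_imputed, true_y)[0, 1])
```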
|
364 |
Classification of encrypted cloud computing service traffic using data mining techniques. Qian, Cheng 27 February 2012 (has links)
In addition to wireless network providers' need for traffic classification, the need is increasingly common in the Cloud Computing environment. A data center hosting Cloud Computing services needs to apply priority policies and Service Level Agreement (SLA) rules at the edge of its network. Stringent requirements for user privacy protection and the trend of IPv6 adoption will contribute to significant growth in encrypted Cloud Computing traffic. This report presents experiments that apply data-mining-based Internet traffic classification methods to classify encrypted Cloud Computing service traffic. By combining TCP session-level attributes, client and host connection patterns, and Cloud Computing service Message Exchange Patterns (MEP), the best method identified in this report yields 89% overall accuracy. / text
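A minimal sketch of the general approach, assuming a hypothetical table of session-level features: train a standard classifier on flow statistics only, which is why the method still works when payloads are encrypted. The feature and class names below are placeholders, not the attributes used in the report.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical session-level feature table; column names are illustrative only.
rng = np.random.default_rng(1)
n = 2000
flows = pd.DataFrame({
    "mean_pkt_size":        rng.normal(800, 200, n),   # bytes
    "pkt_interarrival":     rng.exponential(0.05, n),  # seconds
    "bytes_client_to_srv":  rng.lognormal(8, 1, n),
    "bytes_srv_to_client":  rng.lognormal(9, 1, n),
    "push_flag_ratio":      rng.uniform(0, 1, n),
    "concurrent_conns":     rng.integers(1, 30, n),    # connection-pattern feature
})
labels = rng.choice(["storage_sync", "web_app", "streaming"], size=n)  # made-up classes

# The classifier never inspects payload bytes, only timing/size/connection
# behaviour, so encryption does not remove its inputs. Real labeled traffic
# would replace the synthetic data above.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, flows.values, labels, cv=5)
print("cross-validated accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))
```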
|
365 |
Mining statistical correlations with applications to software analysis. Davis, Jason Victor 12 October 2012 (has links)
Machine learning, data mining, and statistical methods work by representing real-world objects in terms of feature sets that best describe them. This thesis addresses problems related to inferring and analyzing correlations among such features. The contributions of this thesis are two-fold: we develop formulations and algorithms for addressing correlation mining problems, and we also provide novel applications of our methods to statistical software analysis domains. We consider problems related to analyzing correlations via unsupervised approaches, as well as algorithms that infer correlations using fully-supervised or semi-supervised information. In the context of correlation analysis, we propose the problem of correlation matrix clustering, which employs a k-means style algorithm to group sets of correlations in an unsupervised manner. Fundamental to this algorithm is a measure for comparing correlations called the log-determinant (LogDet) divergence, and a primary contribution of this thesis is interpreting and analyzing this measure in the context of information theory and statistics. Additionally, based on the LogDet divergence, we present a metric learning problem called Information-Theoretic Metric Learning, which uses semi-supervised or fully-supervised data to infer correlations for parametrization of a Mahalanobis distance metric. We also consider the problem of learning Mahalanobis correlation matrices in high dimensions, where the number of pairwise correlations can grow very large. In validating our correlation mining methods, we consider two in-depth, real-world statistical software analysis problems: software error reporting and unit test prioritization. In the context of Clarify, a system for improved software error reporting, we investigate two types of correlation mining applications: metric learning for nearest-neighbor software support, and decision trees for error classification. We show that our metric learning algorithms can learn program-specific similarity models for more accurate nearest-neighbor comparisons. In the context of decision tree learning, we address the problem of learning correlations with associated feature costs, in particular the overhead costs of software instrumentation. As our second application, we present a unit test ordering algorithm that uses clustering and nearest-neighbor algorithms, along with a metric learning component, to efficiently search and execute large unit test suites. / text
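For reference, the LogDet (Burg) matrix divergence and a Mahalanobis distance parametrized by a learned matrix can be written down directly; the sketch below is a generic illustration of these two building blocks, not the ITML optimization itself, and the matrices are made up.

```python
import numpy as np

def logdet_divergence(A, B):
    """LogDet (Burg) divergence D_ld(A, B) = tr(A B^-1) - log det(A B^-1) - n,
    defined for positive-definite matrices; zero iff A == B."""
    n = A.shape[0]
    M = A @ np.linalg.inv(B)
    _, logdet = np.linalg.slogdet(M)
    return np.trace(M) - logdet - n

def mahalanobis(x, y, M):
    """Distance parametrized by a (learned) positive-definite matrix M."""
    d = x - y
    return float(np.sqrt(d @ M @ d))

# Tiny usage example with made-up matrices.
rng = np.random.default_rng(0)
L = rng.normal(size=(4, 4))
A = L @ L.T + 4 * np.eye(4)   # a positive-definite "learned" matrix
B = np.eye(4)                 # the prior (identity -> Euclidean distance)
print("D_ld(A, I) =", logdet_divergence(A, B))
print("d_M(x, y)  =", mahalanobis(rng.normal(size=4), rng.normal(size=4), A))
```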
|
366 |
Reverse Top-k search using random walk with restart. Yu, Wei, 余韡 January 2013 (has links)
With the increasing popularity of social networking applications, large volumes of graph data are becoming available. Large graphs are also derived by structure extraction from relational, text, or scientific data (e.g., relational tuple networks, citation graphs, ontology networks, protein-protein interaction graphs). Node-to-node proximity is the key building block for many graph-based applications that search or analyze the data. Among various proximity measures, random walk with restart (RWR) is widely adopted because of its ability to consider the global structure of the whole network.
Although RWR-based similarity search has been well studied before, there is no prior work on reverse top-k proximity search in graphs based on RWR. We discuss the applicability of this query and show that directly applying existing methods for RWR-based similarity search to solve reverse top-k queries has very high computational and storage demands. To address this issue, we propose an indexing technique, paired with an online reverse top-k search algorithm.
In the indexing step, we compute from the graph G a graph index, which is based on a K × |V| matrix, containing in each column v the K largest approximate proximity values from v to any other node in G. K is application-dependent and represents the highest value of k in a practical reverse top-k query. At each column v of the index, the approximate values are lower bounds of the K largest proximity values from v to all other nodes.
Given the graph index and a reverse top-k query q (k ≤ K), we prove that the exact proximities from any node v to query q can be efficiently computed by applying the power method. By comparing these with the corresponding lower bounds taken from the k-th row of the graph index, we are able to determine which nodes are certainly not in the reverse top-k result of q. For some of the remaining nodes, we may also be able to determine that they are certainly in the reverse top-k result of q, based on derived upper bounds for the k-th largest proximity value from them. Finally, for any candidate that remains, we progressively refine its approximate proximities, until, based on its lower or upper bound, it can be determined not to be or to be in the result. The proximities refined during a reverse top-k query are used to update the graph index, making its values progressively more accurate for future queries.
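A minimal sketch of the RWR building block the query relies on — computing proximities by power iteration on a column-normalized transition matrix — is shown below; it does not implement the graph index or the pruning bounds, and the toy graph is made up.

```python
import numpy as np

def rwr_proximities(A, q, restart=0.15, tol=1e-9, max_iter=1000):
    """Random walk with restart from node q via power iteration.

    A: adjacency matrix (n x n). Returns the proximity vector r satisfying
    r = (1 - restart) * W r + restart * e_q, where W is the column-normalized
    transition matrix of A.
    """
    n = A.shape[0]
    W = A / np.maximum(A.sum(axis=0, keepdims=True), 1e-12)  # column-stochastic
    e_q = np.zeros(n)
    e_q[q] = 1.0
    r = e_q.copy()
    for _ in range(max_iter):
        r_next = (1 - restart) * (W @ r) + restart * e_q
        if np.abs(r_next - r).sum() < tol:
            break
        r = r_next
    return r

# Toy graph: proximity from every node v to a query node, as needed when
# checking candidates for a reverse top-k result.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
query = 3
prox_to_query = np.array([rwr_proximities(A, v)[query] for v in range(A.shape[0])])
print(prox_to_query)  # v is in the reverse top-k of `query` if `query` ranks
                      # among v's k highest proximities
```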
Our experimental evaluation shows that our technique is efficient and has manageable storage requirements even when applied on very large graphs. We also show the effectiveness of the reverse top-k search in the scenarios of spam detection and determining the popularity of authors. / published_or_final_version / Computer Science / Master / Master of Philosophy
|
367 |
Incremental algorithms for multilinear principal component analysis of tensor objects. Cao, Zisheng, 曹子晟 January 2013 (has links)
In recent years, massive data sets are generated in many areas of science and business, and are gathered by using advanced data acquisition techniques. New approaches are therefore required to facilitate effective data management and data analysis in this big data era, especially to analyze multidimensional data for real-time applications. This thesis aims at developing generic and effective algorithms for compressing and recovering online multidimensional data, and applying such algorithms in image processing and other related areas.
Since multidimensional data are usually represented by tensors, this research uses multilinear algebra as the mathematical foundation to facilitate development. After reviewing the techniques of singular value decomposition (SVD), principal component analysis (PCA) and tensor decomposition, this thesis derives an effective multilinear principal component analysis (MPCA) method to process such data by seeking optimal orthogonal basis functions that map the original tensor space to a tensor subspace with minimal reconstruction error. Two real examples, 3D data compression for positron emission tomography (PET) and offline fabric defect detection, are used to illustrate the tensor decomposition method and the derived MPCA method, respectively. Based on the derived MPCA method, this research develops an incremental MPCA (IMPCA) algorithm which targets compressing and recovering online tensor objects.
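A simplified, single-pass sketch of the MPCA idea (per-mode scatter matrices, leading eigenvectors, multilinear projection) is shown below; the method in the thesis iterates this procedure to a local optimum, which the sketch omits, and the toy tensors are synthetic.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding of a tensor into a matrix."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mpca_fit(samples, ranks):
    """Single-pass MPCA sketch: for each mode, take the leading eigenvectors of
    the mode-wise scatter matrix accumulated over all (centered) samples."""
    mean = sum(samples) / len(samples)
    centered = [X - mean for X in samples]
    U = []
    for mode, r in enumerate(ranks):
        S = sum(unfold(X, mode) @ unfold(X, mode).T for X in centered)
        _, eigvecs = np.linalg.eigh(S)
        U.append(eigvecs[:, -r:])          # top-r eigenvectors of this mode
    return U, mean

def project(X, U):
    """Project a tensor onto the learned multilinear subspace (mode products)."""
    for mode, Um in enumerate(U):
        X = np.moveaxis(np.tensordot(Um.T, X, axes=(1, mode)), 0, mode)
    return X

# Toy usage: compress 8x8x8 tensor samples to 3x3x3 cores.
rng = np.random.default_rng(0)
samples = [rng.normal(size=(8, 8, 8)) for _ in range(50)]
U, mean = mpca_fit(samples, ranks=(3, 3, 3))
core = project(samples[0] - mean, U)
print(core.shape)   # (3, 3, 3)
```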
To reduce the computational complexity of the IMPCA algorithm, this research investigates low-rank updates of singular values in the matrix and tensor domains, which leads to the development of a sequential low-rank update scheme similar to the sequential Karhunen-Loeve (SKL) algorithm for incremental matrix singular value decomposition, a sequential low-rank update scheme for incremental tensor decomposition, and a quick subspace tracking (QST) algorithm to further enhance the low-rank updates of singular values when the matrix is symmetric positive definite. Although QST is slightly inferior to the SKL algorithm in terms of accuracy in estimating eigenvectors and eigenvalues, it has lower computational complexity. Two fast incremental MPCA (IMPCA) algorithms are then developed by incorporating the SKL algorithm and the QST algorithm separately into the IMPCA algorithm. Results obtained from applying the developed IMPCA algorithms to detect anomalies in online multidimensional data in a number of numerical experiments, and to track and reconstruct global surface temperature anomalies over the past several decades, clearly confirm the excellent performance of the algorithms.
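A sketch of the sequential Karhunen-Loeve-style low-rank SVD update underlying such incremental schemes is shown below, assuming the standard QR-plus-small-SVD formulation; it tracks only the left subspace and singular values and omits forgetting factors and mean updates.

```python
import numpy as np

def skl_update(U, s, C, rank):
    """One SKL-style step: update a rank-`rank` SVD (U, s) of the data seen so
    far when a new block of columns C arrives."""
    # Split the new block into its component inside the current subspace and
    # the orthogonal residual.
    proj = U.T @ C
    residual = C - U @ proj
    Q, R = np.linalg.qr(residual)
    # Small augmented matrix whose SVD rotates the joint basis [U, Q].
    k = len(s)
    K = np.block([[np.diag(s), proj],
                  [np.zeros((R.shape[0], k)), R]])
    Uk, sk, _ = np.linalg.svd(K, full_matrices=False)
    U_new = np.hstack([U, Q]) @ Uk
    return U_new[:, :rank], sk[:rank]

# Toy usage: stream 10 blocks of 20 observations (columns) in R^50, keep rank 5.
rng = np.random.default_rng(0)
d, rank = 50, 5
first = rng.normal(size=(d, 20))
U, s, _ = np.linalg.svd(first, full_matrices=False)
U, s = U[:, :rank], s[:rank]
for _ in range(9):
    U, s = skl_update(U, s, rng.normal(size=(d, 20)), rank)
print(U.shape, s[:3])
```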
This research also applies the developed IMPCA algorithms to solve an online fabric defect inspection problem. Unlike existing pixel-wise detection schemes, the developed algorithms employ a scanning window to extract tensor objects from fabric images and detect the occurrence of anomalies. The proposed method is unsupervised because no pre-training is needed. Two image processing techniques, selective local Gabor binary patterns (SLGBP) and multi-channel feature combination, are developed to accomplish the feature extraction of textile patterns and represent the features as tensor objects. Results of experiments conducted on a real textile dataset confirm that the developed algorithms are comparable to existing supervised methods in terms of accuracy and computational complexity. A cost-effective parallel implementation scheme is developed to solve the problem in real time. / published_or_final_version / Industrial and Manufacturing Systems Engineering / Doctoral / Doctor of Philosophy
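As a rough illustration of the scanning-window idea (not the tensor-based IMPCA detector or the SLGBP features), the sketch below slides a window over a synthetic texture and flags patches with large reconstruction error in a PCA subspace fitted to the same image, so no pre-training is needed; the image and defect are simulated.

```python
import numpy as np

def extract_patches(img, win=16, stride=8):
    """Slide a window over the image and return the patches as row vectors."""
    patches = []
    for i in range(0, img.shape[0] - win + 1, stride):
        for j in range(0, img.shape[1] - win + 1, stride):
            patches.append(img[i:i + win, j:j + win].ravel())
    return np.array(patches)

def anomaly_scores(patches, n_components=10):
    """Score each patch by its reconstruction error in a PCA subspace fitted to
    the same image: defect-free texture dominates, so defective patches stand
    out without any pre-training (a simplified stand-in for the tensor-based
    detector described above)."""
    X = patches - patches.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[:n_components].T
    recon = X @ V @ V.T
    return np.linalg.norm(X - recon, axis=1)

# Toy usage: a periodic "fabric" texture with a small injected defect.
rng = np.random.default_rng(0)
x = np.arange(128)
img = np.sin(x[:, None] / 4.0) * np.sin(x[None, :] / 4.0) + 0.05 * rng.normal(size=(128, 128))
img[60:70, 60:70] += 1.5                       # simulated defect
scores = anomaly_scores(extract_patches(img))
print("highest-scoring patch index:", int(np.argmax(scores)), "of", len(scores))
```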
|
368 |
Relationship-based clustering and cluster ensembles for high-dimensional data mining. Strehl, Alexander 28 August 2008 (has links)
Not available / text
|
369 |
Text mining with information extraction. Nahm, Un Yong 28 August 2008 (has links)
Not available / text
|
370 |
Sequence classification and melody tracks selection. Tang, Fung, Michael, 鄧峰 January 2001 (has links)
published_or_final_version / abstract / toc / Computer Science and Information Systems / Master / Master of Philosophy
|