21 |
Exploratory Analysis of Human Sleep Data. Laxminarayan, Parameshvyas. 19 January 2004.
In this thesis we develop data mining techniques to analyze sleep irregularities in humans. We investigate the effects of several demographic, behavioral and emotional factors on sleep progression and on patients' susceptibility to sleep-related and other disorders. Mining is performed over subjective and objective data collected from patients visiting the UMass Medical Center and the Day Kimball Hospital for treatment. Subjective data are obtained from patient responses to questions posed in a sleep questionnaire. Objective data comprise observations and clinical measurements recorded by sleep technicians using a suite of instruments collectively called a polysomnogram. We create suitable filters to capture significant events within sleep epochs. We propose and employ a Window-based Association Rule Mining Algorithm to discover associations among sleep progression, pathology, demographics and other factors. This algorithm is a modified and extended version of the Set-and-Sequences Association Rule Mining Algorithm developed at WPI to support the mining of association rules from complex data types. We analyze both the medical and the statistical significance of the associations discovered by our algorithm. We also develop predictive classification models using logistic regression and compare the results with those obtained through association rule mining.
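A minimal sketch (not taken from the thesis) of the window-based idea described above: per-epoch event sets are unioned over a sliding window and association rules are counted over the resulting window transactions. The event labels, window length and thresholds are illustrative assumptions.

```python
# Hypothetical illustration of window-based association rule mining over sleep epochs.
from itertools import combinations
from collections import Counter

def windowed_rules(epochs, window=5, min_support=0.2, min_conf=0.7):
    """epochs: list of sets of events (e.g. {'apnea', 'arousal'}) per sleep epoch."""
    # Build one transaction per sliding window by unioning the events it covers.
    windows = [set().union(*epochs[i:i + window])
               for i in range(len(epochs) - window + 1)]
    n = len(windows)
    item_counts, pair_counts = Counter(), Counter()
    for w in windows:
        for item in w:
            item_counts[item] += 1
        for a, b in combinations(sorted(w), 2):
            pair_counts[(a, b)] += 1
    rules = []
    for (a, b), c in pair_counts.items():
        if c / n < min_support:
            continue
        for ante, cons in ((a, b), (b, a)):          # consider both rule directions
            conf = c / item_counts[ante]
            if conf >= min_conf:
                rules.append((ante, cons, c / n, conf))
    return rules

# Toy epochs with made-up event labels
epochs = [{'apnea'}, {'apnea', 'arousal'}, {'arousal'}, {'apnea', 'desaturation'},
          {'desaturation', 'arousal'}, {'apnea'}]
print(windowed_rules(epochs, window=3))
```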
|
22 |
Fuzzy Association Rule Mining From Spatio-temporal Data: An Analysis Of Meteorological Data In Turkey. Unal Calargun, Seda. 01 January 2008.
Data mining is the extraction of interesting, non-trivial, implicit, previously unknown and potentially useful information or patterns from data in large databases. Association rule mining is a data mining method that seeks to discover associations among transactions encoded within a database. Data mining on spatio-temporal data takes into consideration the dynamics of spatially extended systems for which large amounts of spatial data exist, given that all real-world spatial data exist in some temporal context. Fuzzy sets are needed when mining association rules from spatio-temporal databases because they handle numerical data better, softening the sharp boundaries of the data and thereby modeling the uncertainty embedded in its meaning. In this thesis, fuzzy association rule mining is performed on spatio-temporal data using data cubes and the Apriori algorithm, and a methodology is developed for fuzzy spatio-temporal data cube construction. Besides performance, the criteria interpretability, precision, utility, novelty, direct-to-the-point and visualization are defined as the metrics for comparing association rule mining techniques. The fuzzy association rule mining performed in this thesis with spatio-temporal data cubes and with the Apriori algorithm is compared using these metrics. Real meteorological data (precipitation and temperature) for Turkey recorded between 1970 and 2007 are analyzed with the data cube and the Apriori algorithm in order to generate fuzzy association rules.
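A minimal sketch (not from the thesis) of the fuzzification step the abstract refers to: numeric readings are mapped to linguistic terms via triangular membership functions, and the fuzzy support of a candidate rule is the average of the minimum memberships. The breakpoints and term names are assumptions for illustration.

```python
# Hypothetical fuzzification of meteorological readings and fuzzy support of an itemset.
def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzify(record):
    t, p = record['temp_c'], record['precip_mm']
    return {
        'temp_hot':   tri(t, 20, 30, 45),
        'temp_mild':  tri(t, 5, 15, 25),
        'rain_heavy': tri(p, 10, 30, 80),
        'rain_light': tri(p, 0, 5, 15),
    }

def fuzzy_support(records, items):
    """Fuzzy support = average over records of the minimum membership of the items."""
    memberships = [min(fuzzify(r)[i] for i in items) for r in records]
    return sum(memberships) / len(records)

records = [{'temp_c': 28, 'precip_mm': 2}, {'temp_c': 12, 'precip_mm': 25},
           {'temp_c': 33, 'precip_mm': 0}]
print(fuzzy_support(records, ['temp_hot', 'rain_light']))
```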
|
23 |
Proximity based association rules for spatial data mining in genomes. Saha, Surya. 08 August 2009.
Our knowledge discovery algorithm employs a combination of association rule mining and graph mining to identify frequent spatial proximity relationships in genomic data, where the data are viewed as a one-dimensional space. We apply mining techniques and metrics from association rule mining to identify frequently co-occurring features in genomes, followed by graph mining to extract sets of co-occurring features. Using a case study of ab initio repeat finding, we have shown that our algorithm, ProxMiner, can be successfully applied to identify weakly conserved patterns among features in genomic data. The application of pairwise spatial relationships increases the sensitivity of our algorithm, while the use of a confidence threshold based on false discovery rate reduces the noise in our results. Unlike available defragmentation algorithms, ProxMiner discovers associations among ab initio repeat families to identify larger, more complete repeat families. ProxMiner will increase the effectiveness of repeat discovery techniques for newly sequenced genomes, where ab initio repeat finders are only able to identify partial repeat families. In this dissertation, we provide two detailed examples of ProxMiner-discovered novel repeat families and one example of a known rice repeat family that has been extended by ProxMiner. These examples encompass some of the different types of repeat families that can be discovered by our algorithm. We have also discovered many other potentially interesting novel repeat families that can be further studied by biologists.
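A minimal sketch (not ProxMiner itself) of the pairwise spatial proximity counting that underlies this kind of approach: features are placed on a one-dimensional coordinate and pairs of distinct families occurring within a distance threshold are counted. Family names, positions and the gap threshold are illustrative.

```python
# Hypothetical counting of proximity-based co-occurrence of genomic features on one sequence.
from collections import Counter

def proximity_pairs(features, max_gap=500):
    """features: list of (family_name, start_position) tuples on one sequence."""
    feats = sorted(features, key=lambda f: f[1])
    pairs = Counter()
    for i, (fam_i, pos_i) in enumerate(feats):
        for fam_j, pos_j in feats[i + 1:]:
            if pos_j - pos_i > max_gap:
                break                      # positions are sorted, so we can stop early
            if fam_i != fam_j:
                pairs[tuple(sorted((fam_i, fam_j)))] += 1
    return pairs

features = [('repA', 100), ('repB', 350), ('repA', 2000), ('repC', 2300)]
print(proximity_pairs(features))           # Counter({('repA','repB'): 1, ('repA','repC'): 1})
```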
|
24 |
Itemset size-sensitive interestingness measures for association rule mining and link prediction. Aljandal, Waleed A. January 1900.
Doctor of Philosophy / Department of Computing and Information Sciences / William H. Hsu / Association rule learning is a data mining technique that can capture relationships between pairs of entities in different domains. The goal of this research is to discover factors from data that can improve the precision, recall, and accuracy of association rules found using interestingness measures and frequent itemset mining. Such factors can be calibrated using validation data and applied to rank candidate rules in domain-dependent tasks such as link existence prediction. In addition, I use interestingness measures themselves as numerical features to improve link existence prediction. The focus of this dissertation is on developing and testing an analytical framework for association rule interestingness measures, to make them sensitive to the relative size of itemsets. I survey existing interestingness measures and then introduce adaptive parametric models for normalizing and optimizing these measures, based on the size of itemsets containing a candidate pair of co-occurring entities. The central thesis of this work is that in certain domains, the link strength between entities is related to the rarity of their shared memberships (i.e., the size of itemsets in which they co-occur), and that a data-driven approach can capture such properties by normalizing the quantitative measures used to rank associations. To test this hypothesis under different levels of variability in itemset size, I develop several test bed domains, each containing an association rule mining task and a link existence prediction task. The definitions of itemset membership and link existence in each domain depend on its local semantics. My primary goals are: to capture quantitative aspects of these local semantics in normalization factors for association rule interestingness measures; to represent these factors as quantitative features for link existence prediction; to apply them to significantly improve precision and recall in several real-world domains; and to build an experimental framework for measuring this improvement, using information theory and classification-based validation.
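A minimal sketch (not the dissertation's actual parametric model) of what a size-sensitive normalization can look like: a standard measure such as confidence is damped as a function of the size of the itemset in which a pair co-occurs. The functional form and the alpha parameter are assumptions for illustration.

```python
# Hypothetical size-sensitive normalization of an interestingness measure.
import math

def confidence(support_ab, support_a):
    return support_ab / support_a

def size_normalized(measure, itemset_size, alpha=0.5):
    """Damp the raw measure for pairs found only inside very large itemsets,
    reflecting the idea that co-membership in small (rare) sets is stronger
    evidence of a link than co-membership in large sets."""
    return measure * (1.0 / (1.0 + alpha * math.log(itemset_size)))

raw = confidence(support_ab=0.02, support_a=0.05)    # 0.4
print(size_normalized(raw, itemset_size=3))           # small itemset: lightly damped
print(size_normalized(raw, itemset_size=200))         # large itemset: heavily damped
```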
|
25 |
Deriving classifiers with single and multi-label rules using new Associative Classification methods. Abdelhamid, Neda. January 2013.
Associative Classification (AC) in data mining is a rule-based approach that uses association rule techniques to construct accurate classification systems (classifiers). The majority of existing AC algorithms extract one class per rule and ignore other class labels even when they have large data representation. Extending current AC algorithms to find and extract multi-label rules is therefore a promising research direction, since new hidden knowledge is revealed for decision makers. Furthermore, the exponential growth of rules in AC is investigated in this thesis with the aim of minimising the number of candidate rules and thereby reducing the classifier size, so that the end user can easily exploit and maintain it. Moreover, both the rule ranking and test data classification steps are investigated in order to improve the predictive accuracy of AC algorithms. Overall, this thesis investigates different problems related to AC, not limited to the ones listed above, and the results are new AC algorithms that derive single and multi-label rules from data sets of different applications, together with comprehensive experimental results. Specifically, the first proposed algorithm, the Multi-class Associative Classifier (MAC), derives classifiers in which each rule is connected with a single class from a training data set; MAC enhances the rule discovery, rule ranking, rule filtering and test data classification steps of AC. The second proposed algorithm, the Multi-label Classifier based Associative Classification (MCAC), adds to MAC a novel rule discovery method that discovers multi-label rules from single-label data without learning from parts of the training data set. These rules denote vital information ignored by most current AC algorithms and benefit both the end user and the classifier's predictive accuracy. Lastly, the important web-threat problem of website phishing detection is investigated in depth, and a technical solution based on AC is introduced in Chapter 6. In particular, we were able to detect new types of knowledge and to enhance the detection rate with respect to error rate using the proposed algorithms on a large collected phishing data set. Thorough experimental tests utilising a large number of University of California Irvine (UCI) data sets and a variety of real application data collections related to website classification and trainer timetabling problems reveal that MAC and MCAC generate better-quality classifiers than other AC and rule-based algorithms with respect to various evaluation measures, i.e. error rate, Label-Weight, Any-Label, number of rules, etc. This is mainly due to the different improvements related to rule discovery, rule filtering, rule sorting, the classification step and, more importantly, the new type of knowledge associated with the proposed algorithms. Most chapters in this thesis have been disseminated in, or are under review at, journals and refereed conference proceedings.
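A minimal sketch (not MAC or MCAC) of the multi-label idea described above: for a given antecedent, every class label whose confidence passes a threshold is kept, rather than only the single most frequent one. As a simplification, the whole item set of each training row is used as the antecedent instead of enumerating frequent subsets; the data and threshold are toy values.

```python
# Hypothetical multi-label rule derivation from single-label training data.
from collections import Counter, defaultdict

def multi_label_rules(rows, min_conf=0.3):
    """rows: list of (frozenset_of_items, class_label) training examples."""
    by_itemset = defaultdict(Counter)
    for items, label in rows:
        by_itemset[items][label] += 1
    rules = {}
    for items, label_counts in by_itemset.items():
        total = sum(label_counts.values())
        labels = [(lab, cnt / total) for lab, cnt in label_counts.items()
                  if cnt / total >= min_conf]
        if labels:
            rules[items] = sorted(labels, key=lambda x: -x[1])
    return rules

rows = [(frozenset({'a', 'b'}), 'c1'), (frozenset({'a', 'b'}), 'c1'),
        (frozenset({'a', 'b'}), 'c2'), (frozenset({'c'}), 'c2')]
print(multi_label_rules(rows))   # {'a','b'} -> [('c1', ~0.67), ('c2', ~0.33)], {'c'} -> [('c2', 1.0)]
```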
|
26 |
Improving the Scalability of an Exact Approach for Frequent Item Set Hiding. LaMacchia, Carolyn. 01 January 2013.
Technological advances have led to the generation of large databases of organizational data recognized as an information-rich, strategic asset for internal analysis and sharing with trading partners. Data mining techniques can discover patterns in large databases, including relationships considered strategically relevant to the owner of the data. The frequent item set hiding problem is an area of active research to study approaches for hiding the sensitive knowledge patterns before disclosing the data outside the organization. Several methods address hiding sensitive item sets, including an exact approach that generates an extension to the original database that, when combined with the original database, limits the discovery of sensitive association rules without impacting other non-sensitive information. To generate the database extension, this method formulates a constraint optimization problem (COP). Solving the COP formulation is the dominant factor in the computational resource requirements of the exact approach. This dissertation developed heuristics that address the scalability of the exact hiding method. The heuristics are directed at improving the performance of the COP solver by reducing the size of the COP formulation without significantly affecting the quality of the solutions generated. The first heuristic decomposes the COP formulation into multiple smaller problem instances that are processed separately by the COP solver to generate partial extensions of the database. The smaller database extensions are then combined to form a database extension that is close to the database extension generated with the original, larger COP formulation. The second heuristic evaluates the revised border used to formulate the COP and reduces the number of variables and constraints by selectively substituting multiple item sets with composite variables. Solving the COP with fewer variables and constraints reduces the computational cost of the processing. Results of heuristic processing were compared with the existing exact approach based on the size of the database extension, the ability to hide sensitive data, and the impact on non-sensitive data.
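A back-of-the-envelope sketch (not the dissertation's COP formulation) of why extending the database can hide a sensitive item set: adding transactions that avoid the item set lowers its relative support. The exact approach additionally constrains the extension so non-sensitive item sets are unaffected, which this sketch ignores; the numbers are illustrative.

```python
# Hypothetical count of extension transactions needed to push a sensitive
# itemset's support below the mining threshold in the combined database.
import math

def extension_size(count_sensitive, n_original, min_support):
    """Smallest k with count_sensitive / (n_original + k) < min_support."""
    k = count_sensitive / min_support - n_original
    return max(0, math.floor(k) + 1)

# A sensitive itemset appearing in 40 of 1000 transactions, threshold 3%:
print(extension_size(40, 1000, 0.03))   # 334, since 40/1334 ~= 0.02999 < 0.03
```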
|
27 |
MapReduce network enabled algorithms for classification based on association rules. Hammoud, Suhel. January 2011.
There is growing evidence that integrating classification and association rule mining can produce more efficient and accurate classifiers than traditional techniques. This thesis introduces a new MapReduce-based association rule miner for extracting strong rules from large datasets; this miner is later used to develop a new large-scale classifier. A new MapReduce simulator was also developed to evaluate the scalability of the proposed algorithms on MapReduce clusters. The developed association rule miner inherits MapReduce's scalability to huge datasets and to thousands of processing nodes. For finding frequent itemsets, it uses a hybrid approach combining miners that use counting methods on horizontal datasets with miners that use set intersections on vertical-format datasets. The new miner generates the same rules that are usually generated by Apriori-like algorithms because it uses the same definitions of the confidence and support thresholds. In the last few years, a number of associative classification algorithms have been proposed, e.g. CPAR, CMAR, MCAR, MMAC and others. This thesis also introduces a new MapReduce classifier based on MapReduce association rule mining. This algorithm employs different approaches in its rule discovery, rule ranking, rule pruning, rule prediction and rule evaluation methods. The new classifier works on multi-class datasets and is able to produce multi-label predictions with probabilities for each predicted label. To evaluate the classifier, 20 different datasets from the UCI data collection were used. Results show that the proposed approach is an accurate and effective classification technique, highly competitive and scalable compared with other traditional and associative classification approaches. In addition, a MapReduce simulator was developed to measure the scalability of MapReduce-based applications easily and quickly, and to capture the behaviour of algorithms on cluster environments. This also allows optimizing the configuration of MapReduce clusters to obtain better execution times and hardware utilization.
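A minimal sketch (not the thesis's miner) of how itemset support counting maps onto the MapReduce model: the map step emits (candidate itemset, 1) pairs per transaction and the reduce step sums them. The shuffle/sort between the two phases, done by the framework on a real cluster, is simulated locally here; data and names are toy values.

```python
# Hypothetical map and reduce steps of a MapReduce-style itemset support counter.
from itertools import combinations, groupby

def map_phase(transaction, max_size=2):
    """Emit (candidate_itemset, 1) for every small itemset in one transaction."""
    items = sorted(set(transaction))
    for size in range(1, max_size + 1):
        for itemset in combinations(items, size):
            yield itemset, 1

def reduce_phase(key, values):
    yield key, sum(values)

transactions = [['milk', 'bread'], ['milk', 'eggs'], ['milk', 'bread', 'eggs']]
pairs = sorted(kv for t in transactions for kv in map_phase(t))        # simulate shuffle/sort
counts = dict(out for key, group in groupby(pairs, key=lambda kv: kv[0])
              for out in reduce_phase(key, (v for _, v in group)))
print(counts[('bread', 'milk')])   # 2
```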
|
28 |
A data mining framework for targeted category promotions. Reutterer, Thomas; Hornik, Kurt; March, Nicolas; Gruber, Kathrin. 06 1900.
This research presents a new approach to derive recommendations for segment-specific, targeted marketing campaigns on the product category level. The proposed methodological framework serves as a decision support tool for customer relationship managers or direct marketers to select attractive product categories for their target marketing efforts, such as segment-specific rewards in loyalty programs, cross-merchandising activities, targeted direct mailings, customized supplements in catalogues, or customized promotions. The proposed methodology requires customers' multi-category purchase histories as input data and proceeds in a stepwise manner. It combines various data compression techniques and integrates an optimization approach which suggests candidate product categories for segment-specific targeted marketing such that cross-category spillover effects for non-promoted categories are maximized. To demonstrate the empirical performance of our proposed procedure, we examine the transactions from a real-world loyalty program of a major grocery retailer. A simple scenario-based analysis using promotion responsiveness reported in previous empirical studies and prior experience by domain experts suggests that targeted promotions might boost profitability between 15% and 128% relative to an undifferentiated standard campaign.
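A minimal sketch (not the paper's optimization approach) of the selection idea: given estimated cross-category spillover effects for a segment, greedily pick the promotion categories whose spillover onto the remaining, non-promoted categories is largest. The spillover numbers and category names are purely illustrative.

```python
# Hypothetical greedy selection of promotion categories maximizing spillover.
def pick_promotions(spillover, k=1):
    """spillover[c][d]: expected extra spend in category d when c is promoted."""
    chosen = set()
    for _ in range(k):
        best = max((c for c in spillover if c not in chosen),
                   key=lambda c: sum(v for d, v in spillover[c].items()
                                     if d not in chosen and d != c))
        chosen.add(best)
    return chosen

spillover = {'coffee': {'milk': 0.8, 'cookies': 0.5},
             'milk':   {'coffee': 0.2, 'cereal': 0.4},
             'cereal': {'milk': 0.3}}
print(pick_promotions(spillover, k=1))   # {'coffee'}
```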
|
29 |
A Formal Concept Analysis Approach to Association Rule Mining: The QuICL Algorithms. Smith, David T. 01 January 2009.
Association rule mining (ARM) is the task of identifying meaningful implication rules exhibited in a data set. Most research has focused on extracting frequent item (FI) sets and has thus fallen short of the overall ARM objective. The FI miners fail to identify the upper covers that are needed to generate a set of association rules whose size can be exploited by an end user. An alternative to FI mining can be found in formal concept analysis (FCA), a branch of applied mathematics. FCA derives a concept lattice whose concepts identify closed FI sets and whose connections identify the upper covers. However, most FCA algorithms construct a complete lattice and therefore include item sets that are not frequent. An iceberg lattice, on the other hand, is a concept lattice whose concepts contain only FI sets. Only three algorithms to construct an iceberg lattice were found in the literature. Given that an iceberg concept lattice provides an analysis tool to succinctly identify association rules, this study investigated additional algorithms to construct an iceberg concept lattice. This report presents the development and analysis of the Quick Iceberg Concept Lattice (QuICL) algorithms. These algorithms provide incremental construction of an iceberg lattice. QuICL uses recursion instead of iteration to navigate the lattice and establish connections, thereby eliminating costly processing incurred by past algorithms. The QuICL algorithms were evaluated against leading FI miners and FCA construction algorithms using benchmarks cited in the literature. Results demonstrate that QuICL provides performance on the order of FI miners yet additionally derives the upper covers. QuICL, when combined with known algorithms to extract a basis of association rules from a lattice, offers a "best known" ARM solution. Beyond this, the QuICL algorithms have proved to be very efficient, providing an order-of-magnitude gain over other incremental lattice construction algorithms. For example, on the Mushroom data set, QuICL completes in less than 3 seconds, whereas past algorithms exceed 200 seconds; on T10I4D100k, QuICL completes in less than 120 seconds, whereas past algorithms approach 10,000 seconds. QuICL is shown to be the best known all-around incremental lattice construction algorithm. Runtime complexity is shown to be O(l·d·i), where l is the cardinality of the lattice, d is the average degree of the lattice, and i is a mean function on the frequent item extents.
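A minimal sketch (not QuICL) of the closed-itemset notion that FCA and iceberg lattices are built on: the closure of an itemset is the largest itemset shared by exactly the transactions that contain it. The data are toy values.

```python
# Hypothetical closure computation, the building block of closed frequent itemsets.
def closure(itemset, transactions):
    """Intersect all transactions that contain the itemset."""
    covering = [t for t in transactions if itemset <= t]
    if not covering:
        return set(itemset)
    closed = set(covering[0])
    for t in covering[1:]:
        closed &= t
    return closed

transactions = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'b', 'd'}]
print(closure({'a'}, transactions))   # {'a', 'b'} -- 'a' never occurs without 'b'
```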
|
30 |
Extração de tópicos baseado em agrupamento de regras de associação / Topic extraction based on association rule clustering. Santos, Fabiano Fernandes dos. 29 May 2015.
A structured representation of documents in a format appropriate for automatic knowledge extraction, without loss of relevant information with respect to the original unstructured format, is one of the most important steps in text mining, since the quality of the results obtained with automatic approaches for knowledge extraction from texts is strongly related to the quality of the attributes used to represent the collection of documents. The Vector Space Model (VSM) is a traditional model for obtaining a structured representation of documents. In this model, each document is represented as a vector of weights corresponding to the features of the document. The bag-of-words model is the most popular VSM approach because of its simplicity and general applicability. However, the bag-of-words model does not capture dependencies among terms and has high dimensionality. Several models for document representation have been proposed in the literature in order to capture the dependence among terms, notably models based on phrases or compound terms, the Generalized Vector Space Model (GVSM) and its extensions, non-probabilistic topic models such as Latent Semantic Analysis (LSA) and Non-negative Matrix Factorization (NMF), and probabilistic topic models such as Latent Dirichlet Allocation (LDA) and its extensions. The topic model representation is one of the most interesting approaches since it provides a structure that describes the collection of documents in a way that reveals its internal structure and interrelationships. This approach also provides a dimensionality reduction strategy aiming to build new dimensions that represent the main topics or subjects identified in the document collection. However, the efficient extraction of information about the relations among terms for document representation remains a major research challenge. Document representation models that explore correlated terms usually face the challenge of keeping a good balance among (i) the number of extracted features, (ii) the computational cost and (iii) the interpretability of the new features. This work therefore proposes the Latent Association Rule Cluster based Model (LARCM), a non-probabilistic topic model that explores association rule clustering to build a document representation with reduced dimensionality, in which each new dimension is extracted from information about the relations among terms. In the proposed approach, association rules are mined from each document to obtain correlated terms that form multi-word expressions; these relations among terms constitute the local context of term relations. A clustering process is then applied over all association rules to form the general context of term relations, and each resulting cluster of association rules becomes a topic, that is, a dimension of the representation. This work also proposes an evaluation methodology for selecting topic models that maximize both the results in the text classification task and the interpretability of the obtained topics.
The LARCM model was compared against the traditional LDA model and an LDA model using a representation that includes compound terms (bag-of-related-words). The experimental results indicate that LARCM produces a document representation that significantly improves the results in the text classification task while retaining good interpretability of the extracted topics. The LARCM model also performed very well when used to extract contextual information for context-aware recommender systems.
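A minimal sketch (not LARCM itself) of the clustering step described above: each per-document association rule is reduced to its set of terms, rules with overlapping term sets are grouped, and each group yields one topic. The grouping here is a naive single-link pass over Jaccard similarity, with toy rules and an assumed threshold.

```python
# Hypothetical grouping of association rules into topics by term overlap.
def jaccard(a, b):
    return len(a & b) / len(a | b)

def cluster_rules(rules, threshold=0.3):
    """rules: list of sets of terms (antecedent and consequent merged)."""
    clusters = []
    for rule in rules:
        for cluster in clusters:
            if any(jaccard(rule, member) >= threshold for member in cluster):
                cluster.append(rule)
                break
        else:
            clusters.append([rule])
    # Each topic is the union of the terms of the rules in one cluster.
    return [set().union(*cluster) for cluster in clusters]

rules = [{'neural', 'network'}, {'network', 'training'}, {'soccer', 'goal'}]
print(cluster_rules(rules))   # [{'neural','network','training'}, {'soccer','goal'}]
```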
|