1 |
An Efficient Union Approach to Mining Closed Large Itemsets in DNA Microarray DatasetsLee, Li-Wen 05 July 2006 (has links)
A DNA microarray is a very good tool to study the gene expression level in different situations. Mining association rules in DNA microarray datasets can help us know how genes affect each other, and what genes are usually co-expressed. Mining closed large itemsets can be useful for reducing the size of the mining result from the DNA microarray datasets, where a closed itemset is an itemset that there is no superset whose support value is the same as the support value of this itemset. Since the number of genes stored in columns is much larger than the number of samples stored in rows in a DNA microarray dataset, traditional mining methods which use column enumeration face a great challenge, where the column enumeration means that enumerating itemsets from the combinations of items stored in columns. Therefore, several row enumeration methods, e.g., RERII, have been proposed to solve this problem, where row enumeration means that enumerating itemsets from the combinations of items stored in rows. Although the RERII method saves more memory space and has better performance than the other row enumeration methods, it needs complex intersection operations at each node of the row enumeration tree to generate the closed itemsets. In this thesis, we propose a new method, UMiner, which is based on the union operations to mine the closed large itemsets in the DNA microarray datasets. Our approach is a row enumeration method. First, we add all tuples in the transposed table to a prefix tree, where a transposed table records the information about where an item appears in the original table. Next, we traverse this prefix tree to create a row-node table which records the information about a node and the related range of its child nodes in the prefix tree created from the transposed table. Then we generate the closed itemset by using the union operations on the itemsets in the item groups stored in the row-node table. Since we do not use the intersection operations to generate the closed itemset for each enumeration node, we can reduce the time complexity that is needed at each node of the row enumeration tree. Moreover, we develop four pruning techniques to reduce the number of candidate closed itemsets in the row enumeration tree. By replacing the complex intersection operations with the union operations and designing four pruning techniques to reduce the number of branches in the row enumeration tree, our method can find closed large itemsets very efficiently. In our performance study, we use three real datasets which are the clinical data on breast cancer, lung cancer, and AML-ALL. From the experiment results, we show that our UMiner method is always faster than the RERII method in all support values, no matter what the average length of the closed large itemsets is. Moreover, in our simulation result, we also show that the processing time of our method increases much more slowly than that of the RERII method as the average number of items in the rows of a dataset increases.
|
2 |
arules - A Computational Environment for Mining Association Rules and Frequent Item SetsHornik, Kurt, Grün, Bettina, Hahsler, Michael January 2005 (has links) (PDF)
Mining frequent itemsets and association rules is a popular and well researched approach
for discovering interesting relationships between variables in large databases. The
R package arules presented in this paper provides a basic infrastructure for creating and
manipulating input data sets and for analyzing the resulting itemsets and rules. The package
also includes interfaces to two fast mining algorithms, the popular C implementations
of Apriori and Eclat by Christian Borgelt. These algorithms can be used to mine frequent
itemsets, maximal frequent itemsets, closed frequent itemsets and association rules. (authors' abstract)
|
3 |
A computational environment for mining association rules and frequent item setsHahsler, Michael, Grün, Bettina, Hornik, Kurt January 2005 (has links) (PDF)
Mining frequent itemsets and association rules is a popular and well researched approach to discovering interesting relationships between variables in large databases. The R package arules presented in this paper provides a basic infrastructure for creating and manipulating input data sets and for analyzing the resulting itemsets and rules. The package also includes interfaces to two fast mining algorithms, the popular C implementations of Apriori and Eclat by Christian Borgelt. These algorithms can be used to mine frequent itemsets, maximal frequent itemsets, closed frequent itemsets and association rules. (author's abstract) / Series: Research Report Series / Department of Statistics and Mathematics
|
4 |
A Sliding-Window Approach to Mining Maximal Large Itemsets for Large DatabasesChang, Yuan-feng 28 July 2004 (has links)
Mining association rules, means a process of nontrivial extraction of implicit,
previously and potentially useful information from data in databases. Mining maximal
large itemsets is a further work of mining association rules, which aims to find
the set of all subsets of large (frequent) itemsets that could be representative of all large
itemsets. Previous algorithms to mining maximal large itemsets can be classified into two approaches: exhausted and
shortcut. The shortcut approach could generate smaller number of
candidate itemsets than the exhausted approach,
resulting in better performance in terms of time and storage space.
On the other hand, when updates to the transaction databases occur,
one possible approach is to re-run the mining algorithm on the whole
database. The other approach is incremental mining, which aims for efficient maintenance of discovered association rules
without re-running the mining algorithms. However,
previous algorithms for mining maximal large itemsets based on the shortcut approach
can not support incremental mining for mining maximal large itemsets.
While the algorithms for incremental mining, {it e.g.}, the SWF
algorithm, could not efficiently support mining maximal large
itemsets, since it is based on the exhausted approach.
Therefore, in this thesis, we focus on the design of an
algorithm which could provide good performance for both mining maximal itemsets and incremental mining.
Based on some observations, for example, ``{it if an itemset is large, all its
subsets must be large; therefore, those subsets need not to be examined
further}", we propose a Sliding-Window approach, the SWMax algorithm, for
efficiently mining maximal large itemsets and incremental mining. Our
SWMax algorithm is a two-passes partition-based approach. We will find all candidate
1-itemsets ($C_1$), candidate 3-itemsets ($C_3$), large 1-itemsets ($L_1$),
and large 3-itemsets ($L_3$) in the first pass.
We generate the virtual maximal large itemsets after the first pass. Then, we use $L_1$ to generate $C_2$, use $L_3$
to generate $C_4$, use $C_4$ to generate $C_5$, until there is no
$C_k$ generated. In the second pass, we use the virtual maximal large itemsets to
prune $C_k$, and decide the maximal large itemsets.
For incremental mining, we consider two cases: (1)
data insertion, (2) data deletion. Both in Case 1 and Case 2, if an itemset
with size equal to 1 is not large in the original database, it could not be found in the
updated database based on the SWF algorithm. That is, a missing case
could occur in the incremental mining process of the SWF
algorithm, because the SWF algorithm only keeps the $C_2$ information.
While our SWMax algorithm could support incremental mining
correctly, since $C_1$ and $C_3$ are maintained in our algorithm.
We generate some synthetic databases to simulate the real transaction
databases in our simulation. From our simulation, the
results show that our SWMax algorithm could generate fewer number of candidates
and needs less time than the SWF algorithm.
|
5 |
Modeling and Analysis of Regulatory Elements in Arabidopsis thaliana from Annotated Genomes and Gene Expression DataPati, Amrita 15 August 2005 (has links)
Modeling of cis-elements in the upstream regions of genes is a challenging computational problem. A set of regulatory motifs present in the promoters of a set of genes can be modeled by a biclique. Combinations of cis-elements play a vital role in ascertaining that the correct co-action of transcription factors binding to the gene promoter, results in appropriate gene expression in response to various stimuli. Geometrical and spatial constraints in transcription factor binding also impose restrictions on order and separation of cis-elements. Not all regulatory elements that coexist are biologically significant. If the set of genes in which a set of regulatory elements co-occur, are tightly correlated with respect to gene expression data over a set of treatments, the regulatory element combination can be biologically directed.
The system developed in this work, XcisClique, consists of a comprehensive infrastructure for annotated genome and gene expression data for Arabidopsis thaliana. XcisClique models cis-regulatory elements as regular expressions and detects maximal bicliques of genes and motifs, called itemsets. An itemset consists of a set of genes (called a geneset) and a set of motifs (called a motifset) such that every motif in the motifset occurs in the promoter of every gene in the geneset. XcisClique differs from existing tools of the same kind in that, it offers a common platform for the integration of sequence and gene expression data. Itemsets identified by XcisClique are not only evaluated for statistical over-representation in sequence data, but are also examined with respect to the expression patterns of the corresponding geneset. Thus, the results produced are biologically directed. XcisClique is also the only tool of its kind for Arabidopsis thaliana, and can also be used for other organisms in the presence of appropriate sequence, expression, and regulatory element data. The web-interface to a subset of functionalities, source code and supplemental material are available online at http://bioinformatics.cs.vt.edu/xcisclique. / Master of Science
|
6 |
Systèmes producteurs de confiance : ouverture de droit à des services par apprentissage dynamique du comportement des utilisateurs du système d'information / Design of a right-to-service system by dynamic learning of the information service users' behaviourDia, Diyé 17 March 2016 (has links)
Résumé indisponible. / Résumé indisponible.
|
7 |
Efficient Frequent Closed Itemset Algorithms With Applications To Stream Mining And ClassificationRanganath, B N 09 1900 (has links)
Data mining is an area to find valid, novel, potentially useful, and ultimately understandable abstractions in a data. Frequent itemset mining is one of the important data mining approaches to find those abstractions in the form of patterns. Frequent Closed itemsets provide complete and condensed information for non-redundant association rules generation. For many applications mining all the frequent itemsets is not necessary, and mining frequent Closed itemsets are adequate. Compared to frequent itemset mining, frequent Closed itemset mining generates less number of itemsets, and therefore improves the efficiency and effectiveness of these tasks.
Recently, much research has been done on Closed itemsets mining, but it is mainly for traditional databases where multiple scans are needed, and whenever new transactions arrive, additional scans must be performed on the updated transaction database; therefore, they are not suitable for data stream mining.
Mining frequent itemsets from data streams has many potential and broad applications. Some of the emerging applications of data streams that require association rule mining are network traffic monitoring and web click streams analysis. Different from data in traditional static databases, data streams typically arrive continuously in high speed with huge amount and changing data distribution. This raises new issues that need to be considered when developing association rule mining techniques for stream data.
Recent works on data stream mining based on sliding window method slide the window by one transaction at a time. But when the window size is large and support threshold is low, the existing methods consume significant time and lead to a large increase in user response time.
In our first work, we propose a novel algorithm Stream-Close based on sliding window model to mine frequent Closed itemsets from the data streams within the current sliding window. We enhance the scalabality of the algorithm by introducing several optimization techniques such as sliding the window by multiple transactions at a time and novel pruning techniques which lead to a considerable reduction in the number of candidate itemsets to be examined for closure checking. Our experimental studies show that the proposed algorithm scales well with large data sets.
Still the notion of frequent closed itemsets generates a huge number of closed itemsets in some applications. This drawback makes frequent closed itemsets mining infeasible in many applications since users cannot interpret the large volume of output (which sometimes will be greater than the data itself when support threshold is low) and may lead to an overhead to develop extra applications which post processes the output of original algorithm to reduce the size of the output.
Recent work on clustering of itemsets considers strictly either expression(consists of items present in itemset) or support of the itemsets or partially both to reduce the number of itemsets. But the drawback of the above approaches is that in some situations, number of itemsets does not reduce due to their restricted view of either considering expressions or support.
So we propose a new notion of frequent itemsets called clustered itemsets which considers both expressions and support of the itemsets in summarizing the output. We introduce a new distance measure w.r.t expressions and also prove the problem of mining clustered itemsets to be NP-hard.
In our second work, we propose a deterministic locality sensitive hashing based classifier using clustered itemsets. Locality sensitive hashing(LSH)is a technique for efficiently finding a nearest neighbour in high dimensional data sets. The idea of locality sensitive hashing is to hash the points using several hash functions to ensure that for each function the probability of collision is much higher for objects that are close to each other than those that are far apart. We propose a LSH based approximate nearest neighbour classification strategy. But the problem with LSH is, it randomly chooses hash functions and the estimation of a large number of hash functions could lead to an increase in query time. From Classification point of view, since LSH chooses randomly from a family of hash functions the buckets may contain points belonging to other classes which may affect classification accuracy. So, in order to overcome these problems we propose to use class association rules based hash functions which ensure that buckets corresponding to the class association rules contain points from the same class. But associative classification involves generation and examination of large number of candidate class association rules. So, we use the clustered itemsets which reduce the number of class association rules to be examined. We also establish formal connection between clustering parameter(delta used in the generation of clustered frequent itemsets) and discriminative measure such as Information gain. Our experimental studies show that the proposed method achieves an increase in accuracy over LSH based near neighbour classification strategy.
|
8 |
Získávání znalostí z textových dat / Knowledge Discovery in TextSmékal, Luděk January 2007 (has links)
This MSc Thesis handles with so-called data mining. Data mining is about obtaining some data or informations from databases, where these data or informations are not directly visible, but they are accessible by using special algorithms. This MSc Thesis mainly aims documents clasifying by selected method in scope of digital library. The selected method is based on sets of items called "itemsets method". This method extends Apriori algorithm application field originally designed for transaction databases processing and generation of sets of frequented items.
|
9 |
Εξόρυξη και διαχείριση κανόνων συσχέτισης με χρήση τεχνικών ανάκτησης πληροφορίαςΒαρσάμης, Θεόδωρος 11 June 2013 (has links)
Σε έναν κόσμο που κατακλύζεται από δεδομένα, καθίσταται αναγκαία η αποδοτική οργάνωσή τους και η μετέπειτα επεξεργασία τους, με σκοπό την εύρεση και την ανάκτηση πληροφορίας για λήψη αποφάσεων. Στα πλαίσια της προσπάθειας αυτής έχουν δημοσιευθεί διάφορες μελέτες που στοχεύουν στην ανεύρεση σχέσεων μεταξύ των δεδομένων, οι οποίες μπορούν να αναδείξουν άγνωστες μέχρι πρότινος εξαρτήσεις και να επιτρέψουν την πρόγνωση και την πρόβλεψη μελλοντικών αποτελεσμάτων και αποφάσεων.
Στην εργασία αυτή μελετάμε τους πιο διαδεδομένους αλγορίθμους εύρεσης κανόνων συσχετίσεων και ακολούθως προτείνουμε ένα σχήμα που χρησιμοποιεί ως βασική δομή για την ανάκτηση πληροφορίας από βάσεις δεδομένων συναλλαγών τα αντεστραμμένα αρχεία. Στόχος μας είναι η εύκολη παραγωγή κανόνων συσχέτισης αντικειμένων, βασιζόμενη στην αποδοτική αποθήκευση και ανάκτηση των Συχνών Συνόλων Αντικειμένων (Frequent Itemsets).
Αρχικά επικεντρωνόμαστε στον τρόπο εύρεσης και αποθήκευσης ενός ελάχιστου συνόλου συναλλαγών, εκμεταλλευόμενοι την πληροφορία που εμπεριέχουν τα Κλειστά Συχνά Σύνολα Αντικειμένων (Closed Frequent Itemsets) και τα Μέγιστα Συχνά Σύνολα Αντικειμένων (Maximum Frequent Itemsets). Στη συνέχεια, αξιοποιώντας την αποθηκευμένη πληροφορία στα MFI και με ελάχιστο υπολογιστικό κόστος, προτείνουμε τον αλγόριθμο MFI-drive που απαντάει σε ερωτήματα εύρεσης υπερσυνόλου και υποσυνόλου αντικειμένων, καθώς και συνόλων αντικειμένων με προκαθορισμένο βαθμό ομοιότητας σε σχέση με ένα δεδομένο σύνολο. / --
|
10 |
Une approche basée sur les motifs fermés pour résoudre le problème de clustering par consensus / A closed patterns-based approach to the consensus clustering problemAl-Najdi, Atheer 30 November 2016 (has links)
Le clustering est le processus de partitionnement d’un ensemble de données en groupes, de sorte que les instances du même groupe sont plus semblables les unes aux autres qu’avec celles de tout autre groupe. De nombreux algorithmes de clustering ont été proposés, mais aucun d’entre eux ne s’avère fournir une partitiondes données pertinente dans toutes les situations. Le clustering par consensus vise à améliorer le processus de regroupement en combinant différentes partitions obtenues à partir de divers algorithmes afin d’obtenir une solution de consensus de meilleure qualité. Dans ce travail, une nouvelle méthode de clustering par consensus, appelée MultiCons, est proposée. Cette méthode utilise la technique d’extraction des itemsets fréquents fermés dans le but de découvrir les similitudes entre les différentes solutions de clustering dits de base. Les similitudes identifiées sont représentées sous une forme de motifs de clustering, chacun définissant un accord entre un ensemble de clusters de bases sur le regroupement d’un ensemble d’instances. En traitant ces motifs par groupes, en fonction du nombre de clusters de base qui définissent le motif, la méthode MultiCons génère une solution de consensus pour chaque groupe, générant par conséquence plusieurs consensus candidats. Ces différentes solutions sont ensuite représentées dans une structure arborescente appelée arbre de consensus, ouConsTree. Cette représentation graphique facilite la compréhension du processus de construction des multiples consensus, ainsi que les relations entre les instances et les structures d’instances dans l’espace de données / Clustering is the process of partitioning a dataset into groups, so that the instances in the same group are more similar to each other than to instances in any other group. Many clustering algorithms were proposed, but none of them proved to provide good quality partition in all situations. Consensus clustering aims to enhance the clustering process by combining different partitions obtained from different algorithms to yield a better quality consensus solution. In this work, a new consensus clustering method, called MultiCons, is proposed. It uses the frequent closed itemset mining technique in order to discover the similarities between the different base clustering solutions. The identified similarities are presented in a form of clustering patterns, that each defines the agreement between a set of base clusters in grouping a set of instances. By dividing these patterns into groups based on the number of base clusters that define the pattern, MultiCons generates a consensussolution from each group, resulting in having multiple consensus candidates. These different solutions are presented in a tree-like structure, called ConsTree, that facilitates understanding the process of building the multiple consensuses, and also the relationships between the data instances and their structuring in the data space. Five consensus functions are proposed in this work in order to build a consensus solution from the clustering patterns. Approach 1 is to just merge any intersecting clustering patterns. Approach 2 can either merge or split intersecting patterns based on a proposed measure, called intersection ratio
|
Page generated in 0.0453 seconds