  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
11

The Application of Sequential Pattern Mining in Healthcare Workflow System and an Improved Mining Algorithm Based on Pattern-Growth Approach

Zhang, Qi 24 October 2013
No description available.
12

Efficient Temporal Synopsis of Social Media Streams

Abouelnagah, Younes January 2013
Search and summarization of streaming social media, such as Twitter, require the ongoing analysis of large volumes of data with dynamically changing characteristics. Tweets are short and repetitious, lacking context and structure, which makes it difficult to generate a coherent synopsis of events within a given time period. Although some established algorithms for frequent itemset analysis might provide an efficient foundation for synopsis generation, the unmodified application of standard methods produces a complex mass of rules, dominated by common language constructs and many trivial variations on topically related results. Moreover, these results are not necessarily specific to events within the time period of interest. To address these problems, we build upon the Linear time Closed itemset Mining (LCM) algorithm, which is particularly suited to the large and sparse vocabulary of tweets. LCM generates only closed itemsets, providing an immediate reduction in the number of trivial results. To reduce the impact of function words and common language constructs, we apply a filtering step that preserves these terms only when they may form part of a relevant collocation. To further reduce trivial results, we propose a novel strengthening of the closure condition of LCM to retain only those results that exceed a threshold of distinctiveness. Finally, we perform temporal ranking, based on information gain, to identify results that are particularly relevant to the time period of interest. We evaluate our work over a collection of tweets gathered in late 2012, exploring the efficiency and filtering characteristics of each processing step, both individually and collectively. Based on our experience, the resulting synopses from various time periods provide understandable and meaningful pictures of events within those periods, with potential application to tasks such as temporal summarization and query expansion for search.
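To make the notion of a closed itemset in the abstract above concrete, here is a minimal Python sketch. It is a brute-force enumeration written only to illustrate the closure definition (an itemset is closed when no proper superset has the same support); it is not the LCM algorithm and does not include the thesis's filtering, distinctiveness threshold, or temporal ranking. The function names and the toy tweets are assumptions made for illustration.

```python
from itertools import combinations

def support(itemset, transactions):
    """Count the transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def closed_frequent_itemsets(transactions, min_support):
    """Brute-force illustration of closedness; exponential, NOT LCM."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            s = support(candidate, transactions)
            if s >= min_support:
                frequent[candidate] = s
    # Closed: no proper superset has exactly the same support.
    return {
        itemset: s
        for itemset, s in frequent.items()
        if not any(itemset < other and s == s2 for other, s2 in frequent.items())
    }

# Hypothetical toy "tweets", each reduced to a set of terms.
tweets = [
    frozenset({"storm", "sandy", "nyc"}),
    frozenset({"storm", "sandy"}),
    frozenset({"storm", "sandy", "power"}),
    frozenset({"election", "vote"}),
]
closed = closed_frequent_itemsets(tweets, min_support=2)
```

In this toy input, "storm" and "sandy" always occur together, so the single-term itemsets are absorbed into the pair, which is the kind of reduction in trivial results the abstract attributes to mining closed itemsets.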
14

Scalable APRIORI-based frequent pattern discovery

Chester, Sean 28 April 2009
Frequent itemset mining, the task of finding sets of items that frequently occur together in a dataset, has been at the core of the field of data mining for the past sixteen years. In that time, the size of datasets has grown much faster than the ability of existing algorithms to handle them. Consequently, improvements are needed. In this thesis, we take the classic algorithm for the problem, A Priori, and improve it quite significantly by introducing what we call a vertical sort. We then use the large benchmark dataset, webdocs, from the FIMI 2004 conference to contrast our performance against several state-of-the-art implementations and demonstrate not only equal efficiency with lower memory usage at all support thresholds, but also the ability to mine support thresholds as yet unattempted in the literature. We also indicate how we believe this work can be extended to achieve yet more impressive results.
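For readers unfamiliar with the classic algorithm this abstract starts from, the following is a minimal sketch of textbook level-wise Apriori in Python. It does not implement the thesis's vertical sort, whose details are not given in the abstract; the function name, the absolute-count support threshold, and the toy baskets are assumptions chosen for illustration.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(current)
    k = 2
    while current:
        candidates = set()
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(frozenset(sub) in current
                   for sub in combinations(union, k - 1)):
                candidates.add(union)
        # Count step: one pass over the data per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        result.update(current)
        k += 1
    return result

# Hypothetical toy market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(baskets, min_support=3)
```

The prune step is where Apriori gets its name: a candidate is counted only if all of its smaller subsets are already known to be frequent, which keeps the number of candidates per level manageable.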
15

Discovering Neglected Conditions in Software by Mining Program Dependence Graphs

CHANG, RAY-YAUNG January 2009
No description available.
16

Novel frequent itemset hiding techniques and their evaluation / Σύγχρονες μέθοδοι τεχνικών απόκρυψης συχνών στοιχειοσυνόλων και αξιολόγησή τους

Καγκλής, Βασίλειος 20 May 2015
Advances in data collection and data storage technologies have led to the establishment of transactional databases among companies and organizations, as they allow enormous volumes of data to be stored efficiently. Most of the time, these vast amounts of data cannot be used as they are. Data processing must first take place to extract the useful knowledge. Once the useful knowledge is mined, it can be used in several ways, depending on the nature of the data. Quite often, companies and organizations are willing to share data for the sake of mutual benefit. However, these benefits come with several risks, as privacy problems might arise as a result of this sharing. Sensitive data, along with sensitive knowledge inferred from these data, must be protected from unintentional exposure to unauthorized parties. One form of the inferred knowledge is frequent patterns, which are discovered during the process of mining the frequent itemsets from transactional databases. The problem of protecting such patterns is known as the frequent itemset hiding problem. In this thesis, we review several techniques for protecting sensitive frequent patterns in the form of frequent itemsets. After presenting a wide variety of techniques in detail, we propose a novel approach to solving this problem. The proposed method combines heuristics with linear programming. We evaluate the proposed method on real datasets. For the evaluation, a number of performance metrics are presented. Finally, we compare the results of the newly proposed method with those of other state-of-the-art approaches.
17

Evolutionary algorithms and frequent itemset mining for analyzing epileptic oscillations

Smart, Otis Lkuwamy 28 March 2007
This research presents engineering tools that address a condition affecting many people worldwide: epilepsy. Over 60 million people are affected by epilepsy, a neurological disorder characterized by recurrent seizures that occur suddenly. Surgery and anti-epileptic drugs (AEDs) are common therapies for epilepsy patients. However, only persons with seizures that originate in an unambiguous, focal portion of the brain are candidates for surgery, while AEDs can lead to very adverse side effects. Although medical devices based on focal cooling, drug infusion, or electrical stimulation are viable alternatives for therapy, a reliable method to automatically pinpoint dysfunctional brain and direct these devices is needed. This research introduces a method to effectively localize epileptic networks, or connectivity between dysfunctional brain, to guide where to insert electrodes in the brain for therapeutic devices, surgery, or further investigation. The method uses an evolutionary algorithm (EA) and frequent itemset mining (FIM) to detect and cluster frequent concentrations of epileptic neuronal action potentials within human intracranial electroencephalogram (EEG) recordings. In an experiment applying the method to seven patients with neocortical epilepsy (a total of 35 seizures), the approach reliably identifies the seizure onset zone in six of the subjects (a total of 31 seizures). Hopefully, this research will lead to better control of seizures and an improved quality of life for the millions of persons affected by epilepsy.
18

Metody pro získávání asociačních pravidel z dat / Methods for Mining Association Rules from Data

Uhlíř, Martin January 2007
The aim of this thesis is to implement the Multipass-Apriori method for mining association rules from text data. After an introduction to the field of knowledge discovery, the specific aspects of text mining are discussed. Preprocessing is a very important part of the mining process; stemming and a stop-word dictionary are necessary in this case. The next part of the thesis deals with the meaning, usage, and generation of association rules. The main part describes the Multipass-Apriori method, which was implemented. Based on the tests performed, the best way of dividing the data into partitions and the best way of sorting the itemsets were determined. As part of the testing, the Multipass-Apriori method was compared with the Apriori method.
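As a small illustration of the association-rule generation step mentioned in the abstract above, the sketch below derives rules X -> Y from a table of itemset support counts using the standard confidence measure, confidence(X -> Y) = support(X ∪ Y) / support(X). It is not the Multipass-Apriori method; the function name, the support table, and the threshold are assumptions for illustration only.

```python
from itertools import combinations

def association_rules(supports, min_confidence):
    """Derive rules (X, Y, confidence) from a {frozenset: support_count} table."""
    rules = []
    for itemset, supp in supports.items():
        if len(itemset) < 2:
            continue
        # Try every non-empty proper subset as the antecedent X.
        for k in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), k):
                x = frozenset(antecedent)
                if x not in supports:
                    continue
                confidence = supp / supports[x]
                if confidence >= min_confidence:
                    rules.append((x, itemset - x, confidence))
    return rules

# Hypothetical support counts for stemmed terms in a small text collection.
supports = {
    frozenset({"data"}): 4,
    frozenset({"mining"}): 3,
    frozenset({"data", "mining"}): 3,
}
rules = association_rules(supports, min_confidence=0.8)
```

With these counts, "mining" implies "data" with confidence 3/3, while the reverse rule reaches only 3/4 and falls below the threshold.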
19

pcApriori: Scalable apriori for multiprocessor systems

Schlegel, Benjamin, Kiefer, Tim, Kissinger, Thomas, Lehner, Wolfgang 16 September 2022
Frequent-itemset mining is an important part of data mining. It is a computationally and memory-intensive task with a large number of scientific and statistical application areas. In many of them, the datasets can easily grow to tens or even several hundred gigabytes. Hence, efficient algorithms are required to process such amounts of data. In recent years, many efficient sequential mining algorithms have been proposed; these, however, cannot exploit current and future systems that provide large degrees of parallelism. In contrast, the number of parallel frequent-itemset mining algorithms is rather small, and most of them do not scale well as the number of threads increases substantially. In this paper, we present a highly scalable mining algorithm based on the well-known Apriori algorithm; it is optimized for processing very large datasets on multiprocessor systems. The key idea of pcApriori is to employ a modified producer-consumer processing scheme, which partitions the data during processing and distributes it to the available threads. We conduct many experiments on large datasets. pcApriori scales almost linearly on our test system comprising 32 cores.
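The producer-consumer idea described above can be sketched generically in Python as follows. This is an assumption-laden illustration of the general pattern (one producer partitions the transactions into chunks, worker threads count candidate supports per chunk), not the modified scheme used by pcApriori, whose details the abstract does not give. The function name, chunk size, and sentinel convention are hypothetical, and because of CPython's global interpreter lock this sketch shows the structure rather than a real speedup.

```python
import queue
import threading
from collections import Counter

def count_support_parallel(transactions, candidates, num_workers=4, chunk_size=2):
    """Count candidate supports with a producer feeding chunks to worker threads."""
    work = queue.Queue(maxsize=num_workers * 2)
    totals = Counter()
    lock = threading.Lock()

    def consumer():
        local = Counter()
        while True:
            chunk = work.get()
            if chunk is None:          # sentinel: no more work for this worker
                break
            for t in chunk:
                for c in candidates:
                    if c <= t:
                        local[c] += 1
        with lock:                     # merge once per worker to limit contention
            totals.update(local)

    workers = [threading.Thread(target=consumer) for _ in range(num_workers)]
    for w in workers:
        w.start()
    # Producer: partition the data and hand the chunks to the workers.
    for i in range(0, len(transactions), chunk_size):
        work.put(transactions[i:i + chunk_size])
    for _ in workers:
        work.put(None)                 # one sentinel per worker
    for w in workers:
        w.join()
    return dict(totals)

# Hypothetical toy data: each transaction is a set of single-letter items.
data = [frozenset("abc"), frozenset("ab"), frozenset("bc"),
        frozenset("a"), frozenset("abd")]
cands = [frozenset("ab"), frozenset("bc")]
supports = count_support_parallel(data, cands)
```

Each worker keeps a private counter and merges it into the shared total only once, which is one common way to keep threads from contending on shared state during the counting pass.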
20

Efficient Frequent Closed Itemset Algorithms With Applications To Stream Mining And Classification

Ranganath, B N 09 1900
Data mining aims to find valid, novel, potentially useful, and ultimately understandable abstractions in data. Frequent itemset mining is one of the important data mining approaches for finding those abstractions in the form of patterns. Frequent closed itemsets provide complete and condensed information for generating non-redundant association rules. For many applications, mining all the frequent itemsets is not necessary, and mining frequent closed itemsets is adequate. Compared to frequent itemset mining, frequent closed itemset mining generates fewer itemsets, and therefore improves the efficiency and effectiveness of these tasks. Recently, much research has been done on closed itemset mining, but mainly for traditional databases, where multiple scans are needed and additional scans must be performed on the updated transaction database whenever new transactions arrive; these methods are therefore not suitable for data stream mining. Mining frequent itemsets from data streams has many potential and broad applications. Emerging applications of data streams that require association rule mining include network traffic monitoring and web click stream analysis. Unlike data in traditional static databases, data streams typically arrive continuously, at high speed, in huge volumes, and with a changing data distribution. This raises new issues that need to be considered when developing association rule mining techniques for stream data. Recent work on data stream mining based on the sliding window method slides the window by one transaction at a time. But when the window size is large and the support threshold is low, the existing methods consume significant time and lead to a large increase in user response time. In our first work, we propose a novel algorithm, Stream-Close, based on the sliding window model, to mine frequent closed itemsets from data streams within the current sliding window. We enhance the scalability of the algorithm by introducing several optimization techniques, such as sliding the window by multiple transactions at a time and novel pruning techniques that considerably reduce the number of candidate itemsets to be examined for closure checking. Our experimental studies show that the proposed algorithm scales well with large data sets.
Still, the notion of frequent closed itemsets generates a huge number of closed itemsets in some applications. This drawback makes frequent closed itemset mining infeasible in many applications, since users cannot interpret the large volume of output (which is sometimes larger than the data itself when the support threshold is low), and it may create the overhead of developing extra applications that post-process the output of the original algorithm to reduce its size. Recent work on clustering of itemsets considers strictly either the expression (the items present in an itemset) or the support of the itemsets, or partially both, to reduce the number of itemsets. The drawback of these approaches is that in some situations the number of itemsets does not decrease, because of their restricted view of considering either expressions or support. We therefore propose a new notion of frequent itemsets, called clustered itemsets, which considers both the expressions and the support of the itemsets in summarizing the output. We introduce a new distance measure with respect to expressions and also prove that the problem of mining clustered itemsets is NP-hard.
In our second work, we propose a deterministic locality sensitive hashing based classifier using clustered itemsets. Locality sensitive hashing (LSH) is a technique for efficiently finding a nearest neighbour in high-dimensional data sets. The idea of LSH is to hash the points using several hash functions so that, for each function, the probability of collision is much higher for objects that are close to each other than for those that are far apart. We propose an LSH-based approximate nearest neighbour classification strategy. The problem with LSH is that it chooses hash functions randomly, and the estimation of a large number of hash functions could lead to an increase in query time. From the classification point of view, since LSH chooses randomly from a family of hash functions, the buckets may contain points belonging to other classes, which may affect classification accuracy. To overcome these problems, we propose to use hash functions based on class association rules, which ensure that buckets corresponding to the class association rules contain points from the same class. Associative classification, however, involves the generation and examination of a large number of candidate class association rules, so we use the clustered itemsets, which reduce the number of class association rules to be examined. We also establish a formal connection between the clustering parameter (delta, used in the generation of clustered frequent itemsets) and a discriminative measure such as information gain. Our experimental studies show that the proposed method achieves an increase in accuracy over the LSH-based near neighbour classification strategy.
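The sliding-window model described in this abstract, including the idea of sliding by several transactions at a time, can be illustrated with the minimal Python sketch below. It only maintains support counts for small itemsets over the most recent transactions; it is not the Stream-Close algorithm and performs no closure checking or pruning. The class name, the `max_len` cap on itemset size, and the toy batches are assumptions made for illustration.

```python
from collections import Counter, deque
from itertools import combinations

class SlidingWindowCounter:
    """Maintain itemset supports over the most recent `window_size` transactions."""

    def __init__(self, window_size, max_len=2):
        self.window = deque()
        self.window_size = window_size
        self.max_len = max_len          # cap on itemset size, to bound the work
        self.counts = Counter()

    def _subsets(self, transaction):
        items = sorted(transaction)
        for k in range(1, self.max_len + 1):
            for combo in combinations(items, k):
                yield frozenset(combo)

    def slide(self, batch):
        """Add a whole batch of transactions, then evict the oldest ones."""
        for t in batch:
            self.window.append(t)
            self.counts.update(self._subsets(t))
        while len(self.window) > self.window_size:
            old = self.window.popleft()
            self.counts.subtract(self._subsets(old))

    def frequent(self, min_support):
        return {s: c for s, c in self.counts.items() if c >= min_support}

# Hypothetical stream, delivered two transactions at a time.
sw = SlidingWindowCounter(window_size=3)
sw.slide([{"a", "b"}, {"a", "c"}])
sw.slide([{"a", "b"}, {"b", "c"}])
result = sw.frequent(min_support=2)
```

Updating the counts incrementally on arrival and eviction avoids rescanning the whole window after each slide, which is the cost that motivates batch sliding when the window is large.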
