1 |
New approaches to weighted frequent pattern mining. Yun, Unil, 25 April 2007.
Researchers have proposed frequent pattern mining algorithms that are more
efficient than their predecessors and that generate fewer but more important patterns. Many
techniques have been developed for frequent pattern mining, including depth-first/breadth-first
search, tree-based and other data structures, top-down/bottom-up traversal, and
vertical/horizontal data formats. Most frequent pattern mining algorithms use a support measure to
prune the combinatorial search space, but support-based pruning alone is not sufficient
given the characteristics of real datasets. Additionally, once a dataset has been mined,
the number of frequent patterns cannot be adjusted through user feedback, except by
changing the minimum support.
Alternative measures for mining frequent patterns have been suggested to address these
issues. One of the main limitations of the traditional approach for mining frequent
patterns is that all items are treated uniformly when, in reality, items have different
importance. For this reason, weighted frequent pattern mining algorithms have been
suggested that give different weights to items according to their significance. The main
focus in weighted frequent pattern mining concerns satisfying the downward closure
property. In this research, frequent pattern mining approaches with weight constraints are
suggested. Our main approach is to push weight constraints into the pattern growth
algorithm while maintaining the downward closure property. We develop WFIM
(Weighted Frequent Itemset Mining with a weight range and a minimum weight),
WLPMiner (Weighted frequent Pattern Mining with length decreasing constraints), WIP
(Weighted Interesting Pattern mining with a strong weight and/or support affinity),
WSpan (Weighted Sequential pattern mining with a weight range and a minimum
weight) and WIS (Weighted Interesting Sequential pattern mining with a similar level of
support and/or weight affinity).
Extensive performance analysis shows that the suggested approaches are
efficient and scalable for weighted frequent pattern mining.
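To make the weight-constraint idea concrete, here is a minimal sketch. It assumes a common formulation in which a pattern's weighted support is its support multiplied by the mean weight of its items, and it prunes with the anti-monotone upper bound support × maximum item weight, so the downward closure property is preserved. The transactions, weights, and threshold are illustrative, and the thesis's own algorithms are pattern-growth based rather than the naive level-wise enumeration used here.

```python
from itertools import combinations

# Illustrative transactions and item weights (not from the thesis).
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c", "d"}]
weights = {"a": 0.9, "b": 0.6, "c": 0.4, "d": 0.2}
min_wsup = 1.0  # minimum weighted-support threshold (illustrative)

def support(pattern):
    return sum(1 for t in transactions if pattern <= t)

def weighted_support(pattern):
    # One common definition: support times the average item weight.
    avg_w = sum(weights[i] for i in pattern) / len(pattern)
    return support(pattern) * avg_w

MAX_W = max(weights.values())

def is_promising(pattern):
    # Anti-monotone upper bound: support can only shrink as a pattern
    # grows, so support(pattern) * MAX_W bounds the weighted support of
    # every superset. Pruning with it preserves downward closure.
    return support(pattern) * MAX_W >= min_wsup

items = sorted(weights)
result = []
for size in range(1, len(items) + 1):
    level = [frozenset(c) for c in combinations(items, size)
             if is_promising(frozenset(c))]
    if not level:
        break  # no promising pattern of this size, hence none larger
    result += [set(p) for p in level if weighted_support(p) >= min_wsup]
print(result)
```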
|
2 |
Efficient frequent pattern mining from big data and its applications. Jiang, Fan, January 2016.
Frequent pattern mining is an important research area in data mining. Since its introduction, it has drawn the attention of many researchers, and consequently, many algorithms have been proposed. Popular algorithms include level-wise Apriori-based algorithms, tree-based algorithms, and hyperlinked-array-structure-based algorithms. While these algorithms are popular and beneficial due to some nice properties, they also suffer from drawbacks such as multiple database scans, recursive tree constructions, or multiple hyperlink adjustments. In the current era of big data, high volumes of a wide variety of valuable data of different veracities can easily be collected or generated at high velocity in various real-life applications. Among these 5Vs of big data, I focus on handling high volumes in my Ph.D. thesis. Specifically, I design and implement a new efficient frequent pattern mining algorithmic technique called B-mine, which overcomes some of the aforementioned drawbacks and achieves better performance than existing algorithms. I also extend my B-mine algorithm into a family of algorithms that can perform big data mining efficiently. Moreover, I design four different frameworks that apply this family of algorithms to the real-life application of social network mining. Evaluation results show the efficiency and practicality of all these algorithms.
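The abstract does not specify B-mine's internals; as background for the drawbacks it cites, here is a minimal sketch of the classic level-wise Apriori scheme, whose one-scan-per-level structure is the repeated-database-scan drawback mentioned above. The data and threshold are illustrative.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Classic level-wise Apriori: one full database scan per level."""
    items = {i for t in transactions for i in t}
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    while candidates:
        # Each level re-reads every transaction -- the repeated-scan
        # drawback noted in the abstract.
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates; downward
        # closure lets us keep only those whose k-subsets all survived.
        keys = list(level)
        candidates = list({a | b for a, b in combinations(keys, 2)
                           if len(a | b) == len(a) + 1
                           and all(frozenset(s) in level
                                   for s in combinations(a | b, len(a)))})
    return frequent

# Illustrative usage with toy data.
txns = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c"}]
print(apriori(txns, min_support=2))
```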
|
3 |
Using Association Analysis for Medical Diagnoses. Nunna, Shinjini, 01 January 2016.
To examine the application of association analysis to medical data for deriving medical diagnoses, we survey classical association analysis approaches and the current challenges faced by medical association analysis along with proposed solutions, and finally bring this knowledge together in a proposal for applying medical association analysis to the identification of food intolerance. Classical association analysis has been well studied since its introduction in the seminal paper on market basket research in the 1990s. While the theory itself is relatively simple, the brute-force approach is prohibitively expensive, so creative approaches using various data structures and strategies must be explored for efficiency. Medical association analysis is a burgeoning field with various focuses, including diagnosis systems and gene analysis. The field faces a number of challenges, stemming primarily from the complex, voluminous, and high-dimensional nature of medical data. We examine the challenges faced in the pre-processing, analysis, and post-processing phases, and the corresponding solutions. Additionally, we survey proposed measures for ensuring that the results of medical association analysis will hold up to medical diagnosis standards. Finally, we explore how medical association analysis can be used to identify food intolerances. The proposed analysis system is based on a current method of diagnosis used by medical professionals, and seeks to eliminate manual analysis while more efficiently and intelligently identifying interesting, less obvious patterns between patients' food consumption and symptoms in order to propose a food intolerance diagnosis.
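As a concrete illustration of why the brute-force approach is expensive: the sketch below enumerates every itemset and every antecedent/consequent split, computing support and confidence directly. The records and thresholds are hypothetical; practical miners replace this exhaustive enumeration with pruning and specialized data structures.

```python
from itertools import combinations

# Hypothetical patient records: foods eaten and symptoms observed.
records = [
    {"milk", "bread", "bloating"},
    {"milk", "eggs", "bloating"},
    {"bread", "eggs"},
    {"milk", "bloating"},
]

def support(itemset):
    return sum(1 for r in records if itemset <= r) / len(records)

# Brute force: enumerate every antecedent/consequent split of every
# itemset -- exponential in the number of items, which is why pruning
# and data structures such as FP-trees are needed in practice.
items = sorted({i for r in records for i in r})
for size in range(2, len(items) + 1):
    for itemset in map(frozenset, combinations(items, size)):
        s = support(itemset)
        if s < 0.5:          # minimum support (illustrative)
            continue
        for k in range(1, size):
            for lhs in map(frozenset, combinations(itemset, k)):
                conf = s / support(lhs)
                if conf >= 0.8:  # minimum confidence (illustrative)
                    print(set(lhs), "->", set(itemset - lhs),
                          f"support={s:.2f} confidence={conf:.2f}")
```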
|
4 |
Discovering Co-Location Patterns and Rules in Uncertain Spatial Datasets. Adilmagambetov, Aibek, date unknown.
No description available.
|
5 |
Frequent Pattern Mining among Weighted and Directed Graphs. Cederquist, Aaron, January 2009.
No description available.
|
6 |
A Classification System For The Problem Of Protein Subcellular Localization. Alay, Gokcen, 01 September 2007.
The focus of this study is predicting the subcellular localization of a protein. Subcellular localization
information is important for protein function annotation, which is a fundamental problem in computational
biology. For this problem, a classification system is built with two main parts: a predictor
based on a feature mapping technique that extracts biologically meaningful information from protein sequences,
and a client/server architecture for searching and predicting subcellular localizations. In the first part of the
thesis, we describe a feature mapping technique based on frequent patterns:
frequent patterns in a protein sequence dataset are identified using a search technique based on the Apriori
property, and the distribution of these patterns over a new sample is used as a feature vector for classification
(see the sketch below). The effect of a number of feature selection methods on classification performance is
investigated and the best one is applied. The method is assessed on the subcellular localization
prediction problem with four compartments (endoplasmic reticulum (ER) targeted, cytosolic, mitochondrial, and nuclear),
using the same dataset as P2SL. Our method improved the overall accuracy to 91.71%, compared with
P2SL's original 81.96%. In the second part of the thesis, a client/server architecture is designed and implemented
based on Simple Object Access Protocol (SOAP) technology, providing a user-friendly interface for accessing the
protein subcellular localization predictions. The client is a Cytoscape plug-in used for functional
enrichment of biological networks. Instead of using subcellular localization information in isolation,
this plug-in lets biologists analyze a set of genes/proteins from a system-level view.
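As an illustration of the feature mapping described above, here is a minimal sketch that assumes "the distribution of these patterns over a new sample" means length-normalized counts of each mined pattern in the sequence. The pattern set and sequence are hypothetical, and the thesis's actual Apriori-based pattern search is not shown.

```python
# Hypothetical frequent patterns mined from a protein sequence dataset
# (short amino-acid subsequences); in the thesis these come from an
# Apriori-style search, here they are hard-coded for illustration.
frequent_patterns = ["LL", "KK", "AGC", "ML"]

def count_occurrences(sequence: str, pattern: str) -> int:
    """Count (possibly overlapping) occurrences of pattern in sequence."""
    return sum(1 for i in range(len(sequence) - len(pattern) + 1)
               if sequence[i:i + len(pattern)] == pattern)

def feature_vector(sequence: str) -> list[float]:
    """Map a protein sequence to pattern counts, normalized by sequence
    length so sequences of different lengths are comparable."""
    counts = [count_occurrences(sequence, p) for p in frequent_patterns]
    return [c / len(sequence) for c in counts]

# The resulting vectors feed a standard classifier (one per compartment).
print(feature_vector("MLLKKAGCLLML"))
```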
|
7 |
Frequent pattern analysis for decision making in big data. Pragarauskaitė, Julija, 01 July 2013.
Huge amounts of digital information are stored in the world today, and the amount is increasing by quintillions of bytes every day. Approximate data mining algorithms are very important for dealing efficiently with such amounts of data, because computation speed is critical in many real-world applications, whereas exact data mining methods tend to be slow and are best employed where precise results are of the highest importance.
This thesis focuses on several data mining tasks related to the analysis of big data: frequent pattern mining and visual representation.
For mining frequent patterns in big data, three novel approximate methods are proposed and evaluated on real and artificial databases:
• Random Sampling Method (RSM) creates a random sample of the original database and classifies sequences as frequent or rare based on the analysis of the sample. A significant benefit is a theoretical estimate of the classification errors this method makes, obtained using standard statistical methods.
• Multiple Re-sampling Method (MRM) is an improved version of RSM that draws several random samples of the original database, decreasing the probability of incorrectly classifying sequences as frequent or rare.
• Markov Property Based Method (MPBM) relies upon the Markov property. MPBM reads the original database several times (as many as the order of the Markov process) and then calculates the empirical frequencies using the Markov property (see the sketch below).
For visual representation, online purchasing behaviour data were analyzed using... [to full text]
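To illustrate the Markov-property idea behind MPBM, here is a minimal sketch assuming a first-order model over single items: one pass collects unigram and bigram counts, and the probability of a longer sequence is then approximated as a product of empirical conditional probabilities. The sequence database, the use of unigram counts as denominators, and the printed patterns are all illustrative simplifications, not the thesis's exact procedure.

```python
from collections import Counter

# Illustrative database of sequences (e.g., click streams).
database = ["abcab", "abcbc", "bcabc", "cabca"]

# One pass over the database collects unigram and bigram counts.
unigrams, bigrams = Counter(), Counter()
for seq in database:
    unigrams.update(seq)
    bigrams.update(seq[i:i + 2] for i in range(len(seq) - 1))

total = sum(unigrams.values())

def markov_probability(pattern: str) -> float:
    """Approximate P(pattern) with a first-order Markov chain:
    P(s1..sn) ~ P(s1) * product of P(s_i | s_{i-1})."""
    p = unigrams[pattern[0]] / total
    for i in range(1, len(pattern)):
        prev, cur = pattern[i - 1], pattern[i]
        if p == 0.0 or unigrams[prev] == 0:
            return 0.0
        # Simplification: condition on unigram counts rather than on
        # counts of `prev` in non-final positions.
        p *= bigrams[prev + cur] / unigrams[prev]
    return p

# Estimate whether longer patterns are frequent without exact counting.
for pattern in ["abc", "cab", "aab"]:
    print(pattern, round(markov_probability(pattern), 4))
```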
|
8 |
New techniques for efficiently discovering frequent patterns. Jin, Ruoming, 01 August 2005.
No description available.
|
9 |
Efficient Temporal Synopsis of Social Media Streams. Abouelnagah, Younes, January 2013.
Search and summarization of streaming social media, such as Twitter, requires the ongoing analysis of large volumes of data with dynamically changing characteristics. Tweets are short and repetitious -- lacking context and structure -- making it difficult to generate a coherent synopsis of events within a given time period. Although some established algorithms for frequent itemset analysis might provide an efficient foundation for synopsis generation, the unmodified application of standard methods produces a complex mass of rules, dominated by common language constructs and many trivial variations on topically related results. Moreover, these results are not necessarily specific to events within the time period of interest. To address these problems, we build upon the Linear time Closed itemset Mining (LCM) algorithm, which is particularly suited to the large and sparse vocabulary of tweets. LCM generates only closed itemsets, providing an immediate reduction in the number of trivial results. To reduce the impact of function words and common language constructs, we apply a filtering step that preserves these terms only when they may form part of a relevant collocation. To further reduce trivial results, we propose a novel strengthening of the closure condition of LCM to retain only those results that exceed a threshold of distinctiveness. Finally, we perform temporal ranking, based on information gain, to identify results that are particularly relevant to the time period of interest. We evaluate our work over a collection of tweets gathered in late 2012, exploring the efficiency and filtering characteristics of each processing step, both individually and collectively. Based on our experience, the resulting synopses from various time periods provide understandable and meaningful pictures of events within those periods, with potential application to tasks such as temporal summarization and query expansion for search.
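The abstract does not define the strengthened closure condition precisely; the following naive sketch shows closed-itemset mining with an assumed distinctiveness margin, where an itemset is kept only if every strict superset's support falls below its own by at least the margin (a margin of zero recovers ordinary closedness). LCM itself achieves this far more efficiently; the tweets and thresholds are illustrative, and the margin rule is a stand-in for the thesis's actual condition.

```python
from itertools import combinations

# Illustrative "tweets" represented as term sets.
tweets = [{"storm", "power", "out"}, {"storm", "power"},
          {"storm", "out"}, {"storm", "power", "out", "rt"}]

def support(itemset):
    return sum(1 for t in tweets if itemset <= t)

vocab = sorted({w for t in tweets for w in t})
min_sup, margin = 2, 1  # illustrative thresholds

results = []
for size in range(1, len(vocab) + 1):
    for c in map(frozenset, combinations(vocab, size)):
        s = support(c)
        if s < min_sup:
            continue
        # Closed: no strict superset has the same support. The extra
        # margin is a stand-in for the distinctiveness threshold.
        if all(support(c | {w}) <= s - margin for w in vocab if w not in c):
            results.append((set(c), s))

for itemset, s in results:
    print(itemset, s)
```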
|