1 |
Αλγόριθμοι εξόρυξης δεδομένων για χειρισμό πολλαπλών υποστηρίξεων και αρνητικών συσχετίσεων / Data mining algorithms for handling multiple supports and negative associations
Γουρδουλής, Ιωάννης Πρόδρομος, 12 January 2009 (has links)
This thesis addresses the problem of knowledge mining from large databases. More specifically, it analyzes techniques for finding frequent itemsets and association rules. These techniques have wide application in everyday life. One example is the set of items a customer may buy in a department store (market-basket analysis): discovering associations among the items in a shopping basket can greatly help the management of the business rearrange the order in which products are placed on the shelves. The most popular algorithms for this problem, Apriori and FP-growth, use a single support threshold that the number of occurrences of an itemset in the database must exceed. This does not correspond to reality, because in everyday life there are items that by their nature appear quite rarely, so their support threshold must be set low enough for their few occurrences to become significant, while items that appear quite often can have a relatively high support threshold. For this reason, algorithms were developed in which, instead of a single support threshold, there are multiple minimum supports, one per item. Apriori has already been implemented in this setting; this thesis aims to apply FP-growth to multiple supports, with two alternative techniques. The two techniques differ in how they build the basic structure, the FP-tree. This tree contains the candidate itemsets, which we examine to see whether they appear often enough for their items to be considered associated. We then study techniques for finding negative association rules. These rules are useful in market analysis for discovering products that 'conflict' with or complement each other. They have the form X→¬Y and indicate the absence of some itemset Y as a consequence of the presence of the itemset X. In this area, too, an algorithm has been developed that applies Apriori to find the sets from which X and Y arise and introduces a correlation measure between the two itemsets, chosen to be the correlation coefficient. What we implemented is the same algorithm with a change of technique, using FP-growth instead of Apriori. Finally, we deal with the problem of pattern mining under the requirement that the order of the items be preserved (order preserving). A new mining technique was developed, based on the existing FP-growth algorithm but built on an alternative structure, the so-called FP-graph. In summary, this thesis develops the following:
1. An FP-growth algorithm for multiple supports in which the basic structure, the FP-tree, holds the items with the highest supports at its leaves.
2. An FP-growth algorithm for multiple supports in which the FP-tree holds the items with the lowest supports at its leaves.
3. An FP-growth algorithm for finding negative association rules.
4. An FP-growth algorithm for order-preserving pattern mining in which the basic structure is the FP-graph.
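To make the multiple-minimum-support idea above concrete, here is a minimal Python sketch (not the thesis implementation; the items, counts and MIS values are invented): each item carries its own threshold, and an itemset is treated as frequent when its count reaches the smallest threshold among its members, so a rare item can still surface.

```python
# A minimal sketch (not the thesis implementation) of multiple minimum supports:
# each item carries its own threshold (MIS), and an itemset is frequent when its
# count reaches the smallest MIS among its members. Items and values are invented.
from itertools import combinations

transactions = [
    {"bread", "milk"}, {"bread", "milk", "caviar"},
    {"bread", "milk"}, {"milk", "caviar"},
]
mis = {"bread": 3, "milk": 3, "caviar": 1}       # the rare item gets a lower threshold

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

items = sorted(mis, key=mis.get)                 # MIS-ascending order, as MIS miners use
frequent = []
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = frozenset(combo)
        if support(s) >= min(mis[i] for i in s): # multiple-support frequency test
            frequent.append((sorted(s), support(s)))

for s, c in frequent:
    print(s, c)
```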
|
2 |
Efficient mining of interesting emerging patterns and their effective use in classification
Fan, Hongjian, Unknown Date (has links) (PDF)
Knowledge Discovery in Databases (KDD), or Data Mining, is used to discover interesting or useful patterns and relationships in data, with an emphasis on large observational databases. Among the many types of information (knowledge) that can be discovered in data, patterns expressed in terms of features are popular because they can be understood and used directly by people. The recently proposed Emerging Pattern (EP) is one such type of knowledge pattern. Emerging Patterns are sets of items (conjunctions of attribute values) whose frequency changes significantly from one dataset to another. They are useful as a means of discovering distinctions inherently present amongst a collection of datasets and have been shown to be a powerful method for constructing accurate classifiers. (For complete abstract open document)
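A toy sketch of the emerging-pattern criterion described above (the datasets and the growth-rate threshold rho are invented; this follows the textbook definition, not the thesis code): the growth rate of an itemset is its support in one dataset divided by its support in the other, and the itemset is an EP when that rate reaches rho.

```python
# A toy sketch of the emerging-pattern criterion: an itemset is an EP from D1 to D2
# when its support ratio (growth rate) reaches rho. Datasets and rho are invented.
d1 = [{"a", "b"}, {"a"}, {"b", "c"}, {"c"}]
d2 = [{"a", "b"}, {"a", "b", "c"}, {"a", "b"}, {"b"}]

def supp(itemset, dataset):
    return sum(1 for t in dataset if itemset <= t) / len(dataset)

def growth_rate(itemset):
    s1, s2 = supp(itemset, d1), supp(itemset, d2)
    if s1 == 0:
        return float("inf") if s2 > 0 else 0.0
    return s2 / s1

rho = 2.0
candidate = frozenset({"a", "b"})
rate = growth_rate(candidate)
if rate >= rho:
    print(sorted(candidate), "is an emerging pattern with growth rate", rate)
```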
|
3 |
An Efficient and Incremental System to Mine Contiguous Frequent Sequences
El-Sayed, Maged F, 30 January 2004
Mining frequent patterns is an important component of many prediction systems. One common usage in web applications is the mining of users' access behavior for the purpose of predicting and hence pre-fetching the web pages that the user is likely to visit. Frequent sequence mining approaches in the literature are often based on the use of an Apriori-like candidate generation strategy, which typically requires numerous scans of the potentially huge sequence database. In this paper we instead introduce a more efficient strategy for discovering frequent patterns in sequence databases that requires only two scans of the database. The first scan obtains support counts for subsequences of length two. The second scan extracts potentially frequent sequences of any length and represents them as a compressed frequent sequences tree structure (FS-tree). Frequent sequence patterns are then mined from the FS-tree. Incremental and interactive mining functionalities are also facilitated by the FS-tree. As part of this work, we developed the FS-Miner, a system that discovers frequent sequences from web log files. The FS-Miner has the ability to adapt to changes in users' behavior over time, in the form of new input sequences, and to respond incrementally without the need to perform full re-computation. Our system also allows the user to change the input parameters (e.g., minimum support and desired pattern size) interactively without requiring full re-computation in most cases. We have tested our system using two different data sets, comparing it against two other algorithms from the literature. Our experimental results show that our system scales up linearly with the size of the input database. Furthermore, it exhibits excellent adaptability to support threshold decreases. We also show that the incremental update capability of the system provides significant performance advantages over full re-computation even for relatively large update sizes.
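The two-scan strategy can be sketched roughly as follows (a simplification, not the FS-Miner code; the real FS-tree keeps header links and supports incremental updates): scan one counts contiguous length-2 subsequences, and scan two threads each sequence into a prefix tree, restarting from the root wherever a linking pair fell below the threshold.

```python
# A simplified sketch of the two-scan idea (not the FS-Miner code; the real FS-tree
# keeps header links and supports incremental updates). Sequences are invented.
from collections import Counter, defaultdict

sequences = ["abcd", "abce", "abd", "bce"]       # e.g., page-visit sequences
min_sup = 2

# Scan 1: support counts for contiguous length-2 subsequences.
pair_count = Counter(s[i:i + 2] for s in sequences for i in range(len(s) - 1))
frequent_pair = {p for p, c in pair_count.items() if c >= min_sup}

# Scan 2: thread each sequence into a prefix tree; counts are kept per path here,
# a simplification of the compressed FS-tree.
def make_node():
    return defaultdict(make_node)

root = make_node()
path_count = Counter()
for s in sequences:
    node, path = root, ""
    for i, ch in enumerate(s):
        if i > 0 and s[i - 1:i + 1] not in frequent_pair:
            node, path = root, ""                # infrequent link: restart at the root
        node = node[ch]
        path += ch
        path_count[path] += 1

# Paths that reach min_sup are the frequent contiguous sequences.
print([p for p, c in path_count.items() if c >= min_sup and len(p) > 1])
```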
|
4 |
Mineração de padrões frequentes em séries temporais para apoio à tomada de decisão em agrometereologia / Mining frequent patterns in time series to support decision-making in agrometeorology
Chino, Daniel Yoshinobu Takada, 18 March 2014 (has links)
Dealing with large volumes of complex data is a challenging task that has motivated many researchers around the world. Time series are a type of complex data that is growing in importance due to the increasing demand for sensors for surveillance and monitoring. Thus, mining information from large volumes of time series to support decision making is a valuable activity nowadays. This Master's dissertation goes in this direction, as it proposes new algorithms and methods to mine and index time series. Its two main contributions are TrieMotif, a new algorithm to mine frequent patterns (motifs) from time series that employs a trie structure to eliminate unnecessary comparisons between subsequences, and the Telesto index structure, based on generalized suffix trees, for fast and efficient retrieval of time series from a time series database; both are presented and discussed in the context of agrometeorological and climatological data. The dissertation shows that the proposed algorithms are scalable, being suitable for big data, and that they always presented the best results when compared to the competitors.
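The trie idea can be illustrated with a minimal sketch (the discretization, window size and alphabet size are assumptions made for illustration; the published TrieMotif is more elaborate): discretize the series into symbols and slide a window over it, so identical subsequences share one trie path and are counted together.

```python
# A minimal sketch of trie-based motif counting in the spirit of TrieMotif (the
# published algorithm is more elaborate); the discretization, window size and
# alphabet size are assumptions made for illustration.
import numpy as np

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 8 * np.pi, 200)) + 0.05 * rng.standard_normal(200)
window, n_symbols = 8, 4

# Simple equal-width discretization (SAX-style approaches use breakpoints instead).
lo, hi = series.min(), series.max()
symbols = np.minimum(((series - lo) / (hi - lo) * n_symbols).astype(int), n_symbols - 1)

trie = {}
for i in range(len(symbols) - window + 1):
    node = trie
    for s in symbols[i:i + window]:              # equal subsequences share one path
        node = node.setdefault(int(s), {})
    node["count"] = node.get("count", 0) + 1     # per-word counter at the leaf

def walk(node, prefix, out):                     # collect (word, count) pairs
    for k, v in node.items():
        if k == "count":
            out.append((prefix, v))
        else:
            walk(v, prefix + (k,), out)

words = []
walk(trie, (), words)
print(max(words, key=lambda w: w[1]))            # the most repeated word: a motif candidate
```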
|
5 |
Determining multimedia streaming content / Bestämning av innehåll på multimedia-strömmar
Tano, Richard, January 2011
This Master's Thesis report was written by Umeå University Engineering Physics student Richard Tano during his thesis work at Ericsson Luleå. Monitoring network quality is of utmost importance to network providers. This can be done with models evaluating QoS (Quality of Service) and conforming to ITU-T Recommendations. When determining video stream quality, it is more meaningful to evaluate the QoE (Quality of Experience) to understand how the user perceives the quality. This is ranked in MOS (Mean Opinion Score) values. An important aspect of determining the QoE is the video content type, which is correlated with the coding complexity and MOS values of the video. In this work the possibilities to improve quality estimation models complying with ITU-T Study Group 12 (q.14) were investigated. Methods were evaluated and an algorithm was developed that applies time series analysis of packet statistics to estimate video streams' MOS scores. The methods used in the algorithm include a novel combination of frequent-pattern analysis and regression analysis. A model that incorporates the algorithm for usage from low to high bitrates was defined. The new model resulted in around 20% improved precision in MOS score estimation compared to the existing reference model. Furthermore, an algorithm using only regression statistics and modeling of related statistical parameters was developed. Its improvement in coding estimation was comparable with the earlier algorithm, but its efficiency increased considerably.
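The regression half of such a model can be sketched as follows (a toy example; the feature names, values and MOS scores are invented, and the frequent-pattern half is omitted): fit MOS from per-stream packet statistics by ordinary least squares.

```python
# A toy sketch of the regression half only: fit MOS from per-stream packet
# statistics with ordinary least squares. The feature names, values and MOS
# scores are invented; the thesis combines this with frequent-pattern analysis.
import numpy as np

# columns: mean packet size, bitrate variance, estimated frame rate
X = np.array([[900, 0.20, 25], [1200, 0.05, 30], [600, 0.40, 15],
              [1100, 0.10, 30], [700, 0.35, 20]], dtype=float)
mos = np.array([3.1, 4.2, 2.0, 4.0, 2.5])

A = np.hstack([X, np.ones((len(X), 1))])         # add an intercept column
coef, *_ = np.linalg.lstsq(A, mos, rcond=None)   # least-squares fit

new_stream = np.array([1000, 0.15, 28, 1.0])     # features of an unseen stream
print("estimated MOS:", new_stream @ coef)
```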
|
7 |
Interopérabilité des systèmes distribués produisant des flux de données sémantiques au profit de l'aide à la prise de décision / Interoperability of distributed systems producing semantic data stream for decision-making
Belghaouti, Fethi, 26 January 2017 (has links)
Internet is an infinite source of data coming from sources such as social networks or sensors (home automation, smart city, autonomous vehicle, etc.). These heterogeneous and increasingly large data can be managed through semantic web technologies, which propose to homogenize and link these data and reason above them, and through data flow management systems, which mainly address the problems related to volume, volatility and continuous querying. The alliance of these two disciplines has seen the growth of semantic data stream management systems, also called RSP (RDF Stream Processing) systems. The objective of this thesis is to allow these systems, via new approaches and low-cost algorithms, to remain operational, or even become more efficient, even for large input data volumes and/or with limited system resources.
To reach this goal, our thesis is mainly focused on the issue of processing semantic data streams in a context of computer systems with limited resources. It directly contributes to answering the following research questions: (i) How to represent a semantic data stream? And (ii) How to deal with input semantic data when their rates and/or volumes exceed the capabilities of the target system?
As a first contribution, we propose an analysis of the data in semantic data streams in order to consider a succession of star graphs instead of a succession of independent triples, thus preserving the links between the triples. By using this approach, we significantly improved the quality of responses of some well-known sampling algorithms for load shedding. The analysis of the continuous query allows this solution to be optimized by selecting the irrelevant data to be load-shed first. In the second contribution, we propose an algorithm for detecting frequent RDF graph patterns in semantic data streams, called FreGraPaD (Frequent RDF Graph Patterns Detection). It is a one-pass, memory-oriented, low-cost algorithm. It uses two main data structures: a bit vector to build and identify the RDF graph pattern, providing memory space optimization, and a hash table to store the patterns. The third contribution of our thesis is a deterministic load-shedding solution for RSP systems, called POL (Pattern Oriented Load-shedding for RDF Stream Processing systems). It applies very low-cost boolean operators to the binary patterns built from the data and the continuous query in order to determine and eject the irrelevant data upstream of the system. It guarantees a recall of 100%, reduces the system load and improves its response time. Finally, in the fourth contribution, we propose Patorc (Pattern Oriented Compression for RSP systems), an online compression tool for RDF streams. It is based on the frequent patterns present in RDF data streams, which it factorizes. It is a lossless compression solution that makes querying without decompression feasible. The solutions provided by this thesis allow the extension of existing RSP systems, making them able to scale in a Big Data context. They thus allow RSP systems to handle one or more semantic data streams arriving at different speeds, without losing response quality and while ensuring their availability even beyond their physical limitations. The conducted experiments, supported by the obtained results, show that extending existing systems with our solutions improves their performance: they illustrate the considerable decrease in response time and the increase of the input processing rate threshold, while optimizing the use of system resources.
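The bit-vector mechanism shared by FreGraPaD and POL can be sketched as follows (the bit positions, predicates and triples are invented for illustration, not taken from the thesis): each star graph maps to a bit vector over its predicates, frequent-pattern detection becomes hash-table counting of identical vectors, and load shedding keeps only stars whose vector covers the query's vector.

```python
# A minimal sketch of the bit-vector idea (bit positions, predicates and triples
# are invented; this is not the thesis code). A star graph's pattern is a bit
# vector over its predicates; counting identical vectors in a hash table detects
# frequent patterns, and a boolean test implements POL-style load shedding.
from collections import Counter

PREDICATES = ["type", "name", "knows", "age", "city"]    # fixed bit positions
BIT = {p: 1 << i for i, p in enumerate(PREDICATES)}

def star_pattern(star):
    """Bit vector of the predicates used by one star graph (one subject)."""
    bits = 0
    for _, pred, _ in star:
        bits |= BIT[pred]
    return bits

stream = [
    [("s1", "type", "Person"), ("s1", "name", "Ann"), ("s1", "knows", "s2")],
    [("s2", "type", "Person"), ("s2", "name", "Bob")],
    [("s3", "type", "Person"), ("s3", "name", "Eve"), ("s3", "knows", "s1")],
]

# Frequent-pattern detection: hash-table counts of identical bit vectors.
pattern_counts = Counter(star_pattern(star) for star in stream)

# Load shedding: keep a star only if it covers every predicate the query needs.
query_bits = BIT["type"] | BIT["knows"]
relevant = [s for s in stream if star_pattern(s) & query_bits == query_bits]
print(pattern_counts, "->", len(relevant), "relevant star graphs")
```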
|
8 |
Datenzentrierte Bestimmung von Assoziationsregeln in parallelen Datenbankarchitekturen / Data-centric determination of association rules in parallel database architectures
Legler, Thomas, 15 August 2009 (has links) (PDF)
This thesis addresses the everyday usability of modern mass-data processing, in particular the problem of association rule analysis. Available data volumes are growing rapidly, but their evaluation is difficult for untrained users, so companies forgo information that is in principle available. Association rules reveal dependencies between the elements of a dataset, for example between sold products. These rules can be annotated with interestingness measures that let the user recognize important relationships. Approaches are shown that make evaluating the data easier for the user, concerning both the robustness of the methods and the simple interpretation of the rules. The presented algorithms adapt themselves to the data being processed, which distinguishes them from other methods.
Association rule mining requires the extraction of frequent combinations (frequent itemsets). Ways are shown to adapt solution approaches to the properties of modern systems. As one approach, methods for computing the N most frequent combinations are described, which, unlike known approaches, are easily configurable. Moreover, modern systems often compute in a distributed fashion. Such clusters can process large data volumes in parallel but require the merging of local results. For distributed top-N extraction of frequent itemsets on realistic partitionings, approaches with different properties are presented.
Association rules are formed from the frequent combinations, and their post-processing should likewise be easy to carry out. Many measures have been proposed in the literature; depending on the requirements, each corresponds to some subjective assessment, but not necessarily the user's. It is therefore investigated how several interestingness measures can be combined into one global measure. This finds rules that appear important under multiple measures, and the user can narrow the search goal with these suggestions. A second approach groups rules via the frequencies of the rule elements, which form the basis of interestingness measures. The rules of such a group are therefore similar with respect to many interestingness measures and can be evaluated together, reducing the user's manual effort.
This thesis shows ways to extend association rule mining to a broad circle of users and to reach new users. Association rule mining is simplified to the point that it can be used as an easily usable data-analysis tool rather than as a specialist application.
The importance of data mining is widely acknowledged today. Mining for association rules and frequent patterns is a central activity in data mining. Three main strategies are available for such mining: APRIORI, FP-tree-based approaches like FP-GROWTH, and algorithms based on vertical data structures and depth-first mining strategies like ECLAT and CHARM.
Unfortunately, most of these algorithms are only moderately suitable for many “real-world” scenarios because their usability and the special characteristics of the data are two aspects of practical association rule mining that require further work.
All mining strategies for frequent patterns use a parameter called minimum support to define a minimum occurrence frequency for searched patterns. This parameter cuts down the number of patterns searched to improve the relevance of the results. In complex business scenarios, it can be difficult and expensive to define a suitable value for the minimum support because it depends strongly on the particular datasets. Users are often unable to set this parameter for unknown datasets, and unsuitable minimum-support values can extract millions of frequent patterns and generate enormous runtimes. For this reason, it is not feasible to permit ad-hoc data mining by unskilled users. Such users do not have the knowledge and time to define suitable parameters by trial-and-error procedures. Discussions with users of SAP software have revealed great interest in the results of association-rule mining techniques, but most of these users are unable or unwilling to set very technical parameters. Given such user constraints, several studies have addressed the problem of replacing the minimum-support parameter with more intuitive top-n strategies.
We have developed an adaptive mining algorithm to give untrained SAP users a tool to analyze their data easily without the need for elaborate data preparation and parameter determination. Previously implemented approaches to distributed frequent-pattern mining were expensive and time-consuming tasks for specialists. In contrast, we propose a method to accelerate and simplify the mining process by using top-n strategies and relaxing some requirements on the results, such as completeness. Unlike such data approximation techniques as sampling, our algorithm always returns exact frequency counts. The only drawback is that the result set may fail to include some of the patterns up to a specific frequency threshold.
Another aspect of real-world datasets is the fact that they are often partitioned for shared-nothing architectures, following business-specific parameters like location, fiscal year, or branch office. Users may also want to conduct mining operations spanning data from different partners, even if the local data from the respective partners cannot be integrated at a single location for data security reasons or due to their large volume.
Almost every data mining solution is constrained by the need to hide complexity. As far as possible, the solution should offer a simple user interface that hides technical aspects like data distribution and data preparation. Given that BW Accelerator users have such simplicity and distribution requirements, we have developed an adaptive mining algorithm to give unskilled users a tool to analyze their data easily, without the need for complex data preparation or consolidation.
For example, Business Intelligence scenarios often partition large data volumes by fiscal year to enable efficient optimizations for the data used in actual workloads. For most mining queries, more than one data partition is of interest, and therefore, distribution handling that leaves the data unaffected is necessary.
The algorithms presented in this paper have been developed to work with data stored in SAP BW. A salient feature of SAP BW Accelerator is that it is implemented as a distributed landscape that sits on top of a large number of shared-nothing blade servers. Its main task is to execute OLAP queries that require fast aggregation of many millions of rows of data. Therefore, the distribution of data over the dedicated storage is optimized for such workloads. Data mining scenarios use the same data from storage, but reporting takes precedence over data mining, and hence, the data cannot be redistributed without massive costs. Distribution by special data semantics or user-defined selections can produce many partitions and very different partition sizes. The handling of such real-world distributions for frequent-pattern mining is an important task, but it conflicts with the requirement of balanced partitions.
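The top-n idea discussed above can be sketched as follows (a toy enumeration over an invented transaction table; real implementations mine a tree or a vertical structure rather than enumerating combinations): the user asks for the N most frequent combinations, and the effective support threshold falls out of the data rather than being set in advance.

```python
# A toy sketch of the top-n idea: return the N most frequent combinations instead
# of asking the user for a minimum-support threshold. The transaction table is
# invented, and real implementations mine a tree or vertical structure rather
# than enumerating all pairs.
from collections import Counter
from itertools import combinations

transactions = [
    {"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"},
]
N = 3

counts = Counter()
for t in transactions:
    for k in (1, 2):                              # size-1 and size-2 combinations
        for combo in combinations(sorted(t), k):
            counts[combo] += 1

# The effective support threshold falls out of the data, not out of the user.
for combo, c in counts.most_common(N):
    print(combo, c)
```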
|
8 |
Získávání frekventovaných vzorů z proudu dat / Frequent Pattern Discovery in a Data Stream
Dvořák, Michal, January 2012 (has links)
Frequent-pattern mining from databases has been widely studied and applied. Unfortunately, these algorithms are not suitable for data stream processing. In frequent-pattern mining from data streams, it is important to manage not only the sets of items but also their history. There are several reasons for this: it is not just the history of frequent itemsets that matters, but also the history of potentially frequent sets that can become frequent later. This requires more memory and computational power. This thesis describes two algorithms: Lossy Counting and FP-stream. An effective implementation of these algorithms in C# is an integral part of this thesis. In addition, the two algorithms have been compared.
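Lossy Counting is a published streaming algorithm, so its standard formulation can be sketched directly (a toy stream and parameters; the thesis implementation is in C#): with error bound epsilon, per-item counts are pruned at bucket boundaries, and any item whose true frequency reaches the support threshold is guaranteed to be reported.

```python
# A compact sketch of standard Lossy Counting (toy stream and parameters; the
# thesis implementation is in C#): with error bound epsilon, per-item counts are
# pruned at bucket boundaries, and any item whose true frequency is at least
# support * n is guaranteed to be reported.
def lossy_counting(stream, support=0.2, epsilon=0.1):
    bucket_width = int(1 / epsilon)
    entries = {}                                  # item -> (count, max undercount)
    n = 0
    for n, item in enumerate(stream, start=1):
        bucket = (n - 1) // bucket_width + 1      # current bucket id
        count, delta = entries.get(item, (0, bucket - 1))
        entries[item] = (count + 1, delta)
        if n % bucket_width == 0:                 # prune at each bucket boundary
            entries = {i: (c, d) for i, (c, d) in entries.items() if c + d > bucket}
    # report items whose stored count clears (support - epsilon) * n
    return {i: c for i, (c, d) in entries.items() if c >= (support - epsilon) * n}

stream = list("abcabcabadaaeaab")
print(lossy_counting(stream))                     # a and b dominate the toy stream
```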
|