51

Aplikace metod strojového učení na dolování znalosti z dat / Application of Machine Learning Methods to Knowledge Mining from Data

Kraus, Jan January 2014 (has links)
The diploma thesis deals with data mining applied to large collections of textual data. Specifically, it focuses on sentiment analysis of users' subjective verbal assessments in natural language. The first part introduces the basic terms of machine learning and data mining as applied to large textual collections, followed by a description of text preprocessing methods and the principles of machine learning algorithms. In the practical part, experiments are designed and then executed using the SPSS Modeler tool. The experiments focus on identifying significant attributes and recognizing the relationships between them, with particular emphasis on thorough interpretation of the obtained results.
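A rough sense of the preprocess-then-classify workflow described above can be given in a few lines of Python. The sketch below is illustrative only: it substitutes scikit-learn for the SPSS Modeler tool used in the thesis, and the review texts and sentiment labels are invented.

```python
# Illustrative sketch only: the thesis itself uses SPSS Modeler, not scikit-learn.
# A comparable bag-of-words sentiment classifier on short review texts (toy data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

reviews = ["great product, works perfectly", "terrible quality, broke after a day",
           "very satisfied with the purchase", "waste of money, do not recommend"]
labels = ["pos", "neg", "pos", "neg"]  # hypothetical user sentiment labels

# Preprocessing (tokenization + term counts) feeds a simple probabilistic learner,
# mirroring the preprocess-then-model flow the abstract describes.
model = make_pipeline(CountVectorizer(lowercase=True, stop_words="english"),
                      MultinomialNB())
model.fit(reviews, labels)
print(model.predict(["poor build quality"]))  # likely 'neg' on this toy data
```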
52

Aplicação do processo de descoberta de conhecimento em dados do poder judiciário do estado do Rio Grande do Sul / Applying the Knowledge Discovery in Database (KDD) Process to Data of the Judiciary Power of Rio Grande do Sul

Schneider, Luís Felipe January 2003 (has links)
With the purpose of exploring the relations that exist among data, a field emerged for searching large sets of stored data for previously unknown, useful knowledge and information. This field was named Knowledge Discovery in Databases (KDD) and was formalized in 1989. KDD is a process composed of iterative and interactive stages or phases. This work is based on the CRISP-DM methodology. Regardless of the methodology used, the process has a phase that can be considered the core of KDD, "data mining" (or modeling, in CRISP-DM terms), with which the notion of a problem-type class is associated, as well as the techniques and algorithms that can be employed in a KDD application. The study highlights the association and clustering classes, the techniques associated with them, and the Apriori and K-means algorithms. All of this is carried out with the chosen data mining tool, Weka (Waikato Environment for Knowledge Analysis). The research plan centers on applying the KDD process to the core activity of the Judiciary Power, the judgment of court proceedings, looking for findings on how the procedural classification influences the incidence of proceedings, their duration, the types of sentences pronounced, and the presence of a hearing. It also explores defendants' profiles in criminal proceedings according to characteristics such as sex, marital status, education, profession, and race. Chapters 2 and 3 present the theoretical grounding of KDD and detail the CRISP-DM methodology; Chapter 4 covers the application carried out on the data of the Judiciary Power; and Chapter 5 presents the conclusions.
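To illustrate the association task that the abstract assigns to the Apriori algorithm, the following is a minimal level-wise Apriori sketch in plain Python. It is not the Weka implementation used in the thesis, and the transactions are invented.

```python
# Minimal level-wise Apriori sketch (frequent itemsets only); the transactions are made up.
from itertools import combinations

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    support = lambda items: sum(items <= t for t in transactions) / n

    # L1: frequent single items
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    all_frequent, k = {fs: support(fs) for fs in frequent}, 2

    while frequent:
        # candidate generation: join (k-1)-itemsets, then prune by the Apriori property
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = {c for c in candidates if support(c) >= min_support}
        all_frequent.update({fs: support(fs) for fs in frequent})
        k += 1
    return all_frequent

tx = [{"theft", "hearing", "conviction"}, {"theft", "conviction"},
      {"fraud", "hearing"}, {"theft", "hearing", "conviction"}]  # hypothetical transactions
for itemset, sup in apriori(tx, 0.5).items():
    print(set(itemset), round(sup, 2))
```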
54

Discovering Frequent Episodes With General Partial Orders

Achar, Avinash 12 1900 (has links) (PDF)
Pattern discovery, a popular paradigm in data mining, refers to a class of techniques that try to extract unknown or interesting patterns from data. The work carried out in this thesis concerns frequent episode mining, a popular framework within pattern discovery, with applications in alarm management, fault analysis, network reconstruction, etc. The data here take the form of a single long, time-ordered stream of events. The pattern of interest, the episode, is essentially a set of event types with a partial order on it. The task is to unearth all patterns (episodes here) whose frequency exceeds a user-defined threshold, irrespective of pattern size. Most current discovery algorithms employ a level-wise apriori-based method for mining, which adopts a breadth-first search of the space of all episodes. The episode literature has seen multiple ways of defining frequency, each definition having its own merits and demerits. The main reason different frequency definitions have been proposed is that, in general, counting all occurrences of a set of episodes is computationally very expensive. The first part of the thesis gives a unified view of all the apriori-based discovery algorithms for serial episodes (associated with a total order) under these various frequencies. Specifically, the various existing counting algorithms can be viewed as minor modifications of each other. We also provide some novel proofs of correctness for some of the serial episode counting schemes, which in turn can be generalized to episodes with general partial orders. Our unified view helps us derive quantitative relationships between the different frequencies. We also discuss the anti-monotonicity properties satisfied by the various frequencies, crucial information for the candidate generation step. The second part of the thesis proposes discovery algorithms for episodes with general partial orders, for which no algorithms currently exist in the literature. The proposed discovery algorithm is apriori-based and generalizes the existing serial and parallel (associated with a trivial order) episode algorithms. It is a level-wise procedure involving candidate generation and counting at each level. In the context of general partial orders, a major problem in apriori-based discovery is to have an efficient candidate generation scheme. We present a novel candidate generation algorithm for mining episodes with general partial orders. The counting algorithm for general partial order episodes draws ideas from the unified view of counting for serial episodes presented in the first part of the work. We formally show the correctness of the proposed candidate generation and counting steps for general partial orders. The proposed candidate generation algorithm is flexible enough to mine certain specialized classes of partial orders (satisfying what we call the maximal subepisode property), of which the serial and parallel classes of episodes are two specific instances. Our algorithm design initially restricts itself to the class of general partial order episodes called injective episodes, in which repeated event types are not allowed. We then generalize this to a larger class of episodes called chain episodes, where episodes can have some repeated event types.
The class of chain episodes contains all (including non-injective) serial and parallel episodes, and our method thus properly generalizes the existing methods for serial and parallel episode discovery. We also discuss some problems in extending our algorithms to episodes beyond the class of chain episodes. Further, we demonstrate that frequency alone is not a sufficient interestingness measure for episodes with unrestricted partial orders. To address this issue, we propose an additional measure called bidirectional evidence to assess interestingness, which, together with frequency, proves extremely effective in unearthing interesting patterns. In the frequent episode framework, the choice of thresholds is most often user-defined and arbitrary. To address this, the last part of the work deals with assessing the significance of partial order episodes in a statistical sense, based on ideas from classical hypothesis testing. We declare an episode to be significant if its observed frequency in the data stream is large enough to be very unlikely under a random i.i.d. model. The key step in the significance analysis involves computing the mean and variance of the time between successive occurrences of the pattern. This computation can be reformulated as solving for the mean and variance of the first-visit time to a particular state in an associated Markov chain. We use a generating function approach to solve for this mean and variance. Using this and a Gaussian approximation to the frequency random variable, we can calculate a frequency threshold for any partial order episode, beyond which we infer it to be significant. Our significance analysis for general partial order episodes generalizes the existing significance analysis of serial episode patterns. We demonstrate the effectiveness of our significance thresholds on synthetic data.
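As a concrete reference point for the counting step discussed above, the sketch below counts non-overlapped occurrences of a serial episode (a fully ordered pattern) in a single event stream. This covers only one of the several frequency definitions mentioned, and none of the partial-order or candidate-generation machinery of the thesis; the event stream is invented.

```python
# Illustrative sketch, not the thesis algorithm: counts non-overlapped occurrences
# of a serial episode in one event stream using a single restarting automaton.
def count_serial_episode(stream, episode):
    """stream: list of (time, event_type); episode: tuple of event types in order."""
    count, pos = 0, 0              # pos = index of the next episode event we are waiting for
    for _, event in stream:
        if event == episode[pos]:
            pos += 1
            if pos == len(episode):  # one complete occurrence finished
                count += 1
                pos = 0              # restart: counted occurrences do not overlap
    return count

events = [(1, "A"), (2, "B"), (3, "C"), (4, "A"), (5, "B"), (6, "C")]  # invented stream
print(count_serial_episode(events, ("A", "B", "C")))  # -> 2
```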
55

Algoritmus pro cílené doporučování produktů / Algorithm for Product Recommendation

Bodeček, Miroslav January 2011 (has links)
The goal of this project is to explore the problem of product recommendation in e-commerce, evaluate known techniques, and design, implement, and test a product recommendation system for an existing e-commerce site. The report introduces the problem, briefly surveys the current state of the field, and defines the requirements for a product recommendation module. The concept of data mining is introduced in general terms. The report then presents a detailed design corresponding to the defined requirements and summarizes the data gathered during the testing phase. It concludes with an evaluation and a discussion of the remaining goals of the thesis.
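The kind of recommendation logic such a module might build on can be sketched with a simple item-to-item co-purchase count; this is only an illustration under invented order data, not the design developed in the thesis.

```python
# Minimal item-to-item co-purchase sketch; the order data below are invented.
from collections import Counter, defaultdict

orders = [{"phone", "case", "charger"}, {"phone", "case"},
          {"laptop", "mouse"}, {"phone", "charger"}]

co_counts = defaultdict(Counter)          # item -> counts of items bought with it
for order in orders:
    for item in order:
        co_counts[item].update(order - {item})

def recommend(item, k=2):
    """Return the k items most frequently co-purchased with `item`."""
    return [other for other, _ in co_counts[item].most_common(k)]

print(recommend("phone"))   # e.g. ['case', 'charger'] on this toy data
```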
56

Získávání znalostí z webových logů / Knowledge Discovery from Web Logs

Vlk, Vladimír January 2013 (has links)
This master's thesis deals with the creation of an application whose goal is to preprocess web logs and find association rules in them. The first part deals with the concept of Web mining. The second part is devoted to Web usage mining and the notions related to it. The third part deals with the design of the application, and the fourth part describes its implementation. The last part covers experiments with the application and the interpretation of their results.
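One way to picture the preprocessing step mentioned above is the session reconstruction sketched below; the log format and the 30-minute inactivity timeout are assumptions, not details taken from the thesis.

```python
# Hedged sketch: group raw web-log hits into per-user sessions, producing page sets
# ("transactions") that an association-rule miner can consume. Log records invented.
from datetime import datetime, timedelta

log = [  # (ip, timestamp, url)
    ("10.0.0.1", "2013-03-01 10:00:05", "/index"),
    ("10.0.0.1", "2013-03-01 10:03:40", "/products"),
    ("10.0.0.1", "2013-03-01 11:20:00", "/index"),     # new session after a long gap
    ("10.0.0.2", "2013-03-01 10:01:10", "/contact"),
]

TIMEOUT = timedelta(minutes=30)           # assumed inactivity threshold
sessions, last_seen = {}, {}

for ip, ts, url in sorted(log, key=lambda r: (r[0], r[1])):
    t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    if ip not in last_seen or t - last_seen[ip] > TIMEOUT:
        sessions.setdefault(ip, []).append(set())      # open a new session for this ip
    sessions[ip][-1].add(url)
    last_seen[ip] = t

# Each session's page set becomes one transaction for association-rule mining.
transactions = [pages for user in sessions.values() for pages in user]
print(transactions)
```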
57

Developing the Cis-Regulatory Association Model (CRAM) to Identify Combinations of Transcription Factors in ChIP-Seq Data

Kennedy, Brian Alexander 17 December 2010 (has links)
No description available.
58

Frequent itemset mining on multiprocessor systems

Schlegel, Benjamin 08 May 2014 (has links) (PDF)
Frequent itemset mining is an important building block in many data mining applications such as market basket analysis, recommendation, web mining, fraud detection, and gene expression analysis. In many of them, the datasets being mined can easily grow to hundreds of gigabytes or even terabytes of data, so efficient algorithms are required to process such large amounts of data. In recent years many frequent-itemset mining algorithms have been proposed, which however (1) often have high memory requirements and (2) do not exploit the large degrees of parallelism provided by modern multiprocessor systems. The high memory requirements arise mainly from inefficient data structures that have only been shown to be sufficient for small datasets. For large datasets, however, the use of these data structures forces the algorithms to go out-of-core, i.e., to access secondary memory, which leads to serious performance degradation. Exploiting available parallelism is further required to mine large datasets because the serial performance of processors has almost stopped increasing. Algorithms should therefore exploit the large number of available threads as well as other kinds of parallelism (e.g., vector instruction sets) besides thread-level parallelism. In this work, we tackle the high memory requirements of frequent itemset mining in two ways: we (1) compress the datasets being mined, because they must be kept in main memory during several mining invocations, and (2) improve existing mining algorithms with memory-efficient data structures. For compressing the datasets, we employ efficient encodings that show good compression performance on a wide variety of realistic datasets, reducing dataset size by up to 6.4x. The encodings can also be applied directly while loading the dataset from disk or network. Since encoding and decoding are required repeatedly for loading and mining the datasets, we reduce their cost by providing parallel encodings that achieve high throughput for both tasks. For a memory-efficient representation of the mining algorithms' intermediate data, we propose compact data structures and even employ explicit compression. Both methods together reduce the size of the intermediate data by up to 25x. The smaller memory requirements avoid or delay expensive out-of-core computation when large datasets are mined. To cope with the high parallelism provided by current multiprocessor systems, we identify the performance hot spots and scalability issues of existing frequent-itemset mining algorithms. The hot spots, which form basic building blocks of these algorithms, cover (1) counting the frequency of fixed-length strings, (2) building prefix trees, (3) compressing integer values, and (4) intersecting lists of sorted integer values or bitmaps. For all of them, we discuss how to exploit available parallelism and provide scalable solutions. Furthermore, almost all components of the mining algorithms must be parallelized to keep the sequential fraction of the algorithms as small as possible. We integrate the parallelized building blocks and components into three well-known mining algorithms and further analyze the impact of certain existing optimizations. Even single-threaded, our algorithms are often up to an order of magnitude faster than existing highly optimized algorithms, and they scale almost linearly on a large 32-core multiprocessor system.
Although our optimizations are intended for frequent-itemset mining algorithms, they can be applied with only minor changes to algorithms that are used for mining of other types of itemsets.
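One of the building blocks named above, intersecting sorted lists of transaction ids, can be shown in its simplest scalar form; the SIMD and multithreaded variants developed in the thesis are far more elaborate, and the tid-lists here are invented.

```python
# Sketch of one building block: merge-style intersection of sorted tid-lists,
# as used in Eclat-style itemset mining. Scalar version only; data invented.
def intersect_sorted(a, b):
    """Intersect two sorted integer lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# tid-lists: transactions containing each item
tids = {"bread": [0, 1, 3, 4, 7], "butter": [1, 3, 4, 8], "milk": [0, 3, 7]}

# support of {bread, butter} = length of the intersected tid-list
both = intersect_sorted(tids["bread"], tids["butter"])
print(both, "support =", len(both))   # [1, 3, 4] support = 3
```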
60

應用記憶體內運算於多維度多顆粒度資料探勘之研究―以醫療服務創新為例 / A Research Into In-memory Computing In Multidimensional, Multi-granularity Data Mining ― With Healthcare Services Innovation

朱家棋, Chu, Chia Chi Unknown Date (has links)
Under the pressure of global population aging and continued population growth, demand for healthcare services keeps rising. In healthcare, data mining with association rules is commonly used to uncover knowledge hidden in large medical databases in order to support clinical decision making or innovative healthcare services. With new healthcare services and applications constantly appearing (such as electronic health records and mobile health), and with medical institutions required by government policy to retain large volumes of patient data over long periods, the healthcare domain faces the question of how to process big data effectively. Traditional association-rule algorithms, however, are severely limited in performance. Many studies have therefore parallelized association-rule mining in distributed environments using the Hadoop MapReduce framework, which does run substantially faster than single-node computation; in practice, though, MapReduce is not well suited to association-rule algorithms that require intensive iterative computation. This study uses the Spark in-memory computing framework to mine multidimensional, multi-granularity association rules in parallel on a distributed cluster. The experimental results can be summarized in three points. First, when the data are small, parallelization splits the workflow into Map and Reduce stages and therefore brings little benefit. Second, when the data are large, the parallel strategy differs dramatically from the single-machine version, with overall running times differing by a factor of up to 100; moreover, when the number of items exceeds 10,000, the single-machine version cannot run due to insufficient memory, while the parallel strategy still can. Third, although Spark is slightly slower than the single-machine version on small-scale processing, its running time is still shorter than Hadoop's by roughly a factor of four, and on large-scale processing Spark again outperforms the Hadoop version. For large-scale data, Spark is therefore the best solution in terms of both computational performance and scalability.
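The general pattern of in-memory association-rule mining on Spark can be sketched with PySpark's built-in FP-Growth; this stands in for, and does not reproduce, the multidimensional, multi-granularity miner implemented in the thesis, and the toy transactions are invented.

```python
# Hedged sketch: in-memory association-rule mining on a Spark cluster using the
# built-in FP-Growth of pyspark.ml as a stand-in; toy healthcare-style data invented.
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("assoc-rules-sketch").getOrCreate()

# Each row is one "transaction", e.g. the set of services used in one patient visit.
df = spark.createDataFrame([
    (0, ["checkup", "blood_test", "xray"]),
    (1, ["checkup", "blood_test"]),
    (2, ["checkup", "xray"]),
    (3, ["blood_test", "xray"]),
], ["id", "items"])

fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fp.fit(df)             # dataset stays in cluster memory during mining

model.freqItemsets.show()      # frequent itemsets with their counts
model.associationRules.show()  # rules with confidence and lift

spark.stop()
```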
