1 |
Interestingness Measures for Association Rules in a KDD Process : PostProcessing of Rules with ARQAT Tool
Huynh, Xuan-Hiep, 07 December 2006
This work takes place in the framework of Knowledge Discovery in Databases (KDD), often called "data mining". This domain is both a major research topic and an application field in companies. KDD aims at discovering previously unknown and useful knowledge in large databases. In the last decade, many studies have been published on association rules, which are frequently used in data mining. Association rules, which are implicative tendencies in data, have the advantage of being an unsupervised model. On the other hand, they often produce a large number of rules. As a consequence, a postprocessing task is required to help the user understand the results. One way to reduce the number of rules, whether to validate them or to select the most interesting ones, is to use interestingness measures adapted to both the user's goals and the dataset studied. Selecting the right interestingness measures is an open problem in KDD. Many measures have been proposed to extract knowledge from large databases, and many authors have introduced interestingness properties for selecting a suitable measure for a given application. Some measures are adequate for some applications but not for others. In this thesis, we propose to study the set of interestingness measures available in the literature, in order to evaluate their behavior according to the nature of the data and the preferences of the user. The final objective is to guide the user's choice towards the measures best adapted to his or her needs and, ultimately, to select the most interesting rules. For this purpose, we propose a new approach implemented in a new tool, ARQAT (Association Rule Quality Analysis Tool), in order to facilitate the analysis of the behavior of about 40 interestingness measures. In addition to elementary statistics, the tool allows a thorough analysis of the correlations between measures using correlation graphs based on the coefficients suggested by Pearson, Spearman and Kendall.
These graphs are also used to identify clusters of similar measures. Moreover, we propose a series of comparative studies on the correlations between interestingness measures on several datasets. We discovered a set of correlations that are not very sensitive to the nature of the data used, which we call stable correlations. Finally, 14 complementary graphical views, structured on 5 levels of analysis (ruleset analysis, correlation and clustering analysis, most interesting rules analysis, sensitivity analysis, and comparative analysis), are illustrated in order to show the interest of both the exploratory approach and the use of complementary views.
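As an illustration of the idea in this abstract, the following minimal sketch computes two common interestingness measures (confidence and lift) for a handful of rules and then the Pearson correlation between the two measures across those rules. It is not ARQAT's implementation: the transactions, the rule list, and every function name are hypothetical, and only Pearson's coefficient is shown, whereas the abstract also mentions Spearman's and Kendall's.

```python
# Toy transactions (hypothetical data, not taken from the thesis).
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "jam"},
    {"butter", "jam"},
]

def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in TRANSACTIONS) / len(TRANSACTIONS)

def confidence(antecedent, consequent):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence divided by the support of the consequent (1.0 means independence)."""
    return confidence(antecedent, consequent) / support(consequent)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long value lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    norm_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (norm_x * norm_y)

# Hypothetical rule set, written as (antecedent, consequent) pairs.
RULES = [
    ({"bread"}, {"milk"}),
    ({"milk"}, {"bread"}),
    ({"butter"}, {"bread"}),
    ({"jam"}, {"butter"}),
    ({"bread"}, {"butter"}),
]

conf_values = [confidence(a, b) for a, b in RULES]
lift_values = [lift(a, b) for a, b in RULES]

# One edge weight of a correlation graph: how similarly the two measures
# score the same rules.
r = pearson(conf_values, lift_values)
print(f"Pearson correlation between confidence and lift: {r:.3f}")
```

Repeating the last step for every pair of measures would give the weighted graph from which groups of strongly correlated measures could be read off.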
|
2 |
Aplicação do processo de descoberta de conhecimento em dados do poder judiciário do estado do Rio Grande do Sul / Applying the Knowledge Discovery in Database (KDD) Process to Data of the Judiciary Power of Rio Grande do Sul
Schneider, Luís Felipe, January 2003
To explore the relationships that exist among data, a field emerged devoted to the search for useful, previously unknown knowledge and information in large sets of stored data. This field was named Knowledge Discovery in Databases (KDD) and was formalized in 1989. KDD consists of a process made up of stages or phases, iterative and interactive in nature. This work is based on the CRISP-DM methodology. Regardless of the methodology used, the process has one phase that may be considered the core of KDD, "data mining" (or modeling, in CRISP-DM terms), with which the concept of problem type (task) is associated, as well as the techniques and algorithms that may be employed in a KDD application. This study highlights the association (affinity grouping) and clustering tasks, the techniques associated with them, and the Apriori and K-means algorithms. All of this is put into practice in the selected data mining tool, Weka (Waikato Environment for Knowledge Analysis).
The research plan focuses on applying the KDD process to the Judiciary Power with respect to its core activity, the judgment of court proceedings, seeking findings on how procedural classification influences the incidence of proceedings, the processing time, the types of sentences pronounced and the presence of a hearing. It also explores defendant profiles in criminal proceedings according to characteristics such as sex, marital status, education level, profession and race. Chapters 2 and 3 present the theoretical grounds of KDD, detailing the CRISP-DM methodology. Chapter 4 explores the application carried out on the data of the Judiciary Power and, lastly, Chapter 5 presents the conclusions.
|
5 |
Etude comportementale des mesures d'intérêt d'extraction de connaissances / Behavioral study of interestingness measures of knowledge extraction
Grissa, Dhouha, 02 December 2013
The search for interesting association rules is an important and active field in data mining. Since the algorithms used in knowledge discovery in databases (KDD) tend to generate a large number of rules, it is difficult for users to select the really interesting knowledge by themselves. To address this problem, automatic post-filtering of the rules is essential to significantly reduce their number. Hence, many interestingness measures have been proposed in the literature, among which the user is expected to choose the one most appropriate to his or her goals. As interestingness depends on both user preferences and data, interestingness measures have been classified into two categories: subjective measures (user-driven) and objective measures (data-driven). We focus on the study of objective measures. Nevertheless, there is a plethora of objective measures in the literature, which makes it harder for the user to choose an appropriate one. Thus, our goal is to help the user select objective measures by proposing groups of similar measures through categorization approaches.
The thesis develops two approaches: (1) a formal study based on the definition of a set of measure properties that lead to a good evaluation of the measures; (2) an experimental study of the behavior of various interestingness measures from a data analysis point of view. Regarding the first approach, we perform a thorough theoretical study of a large number of measures with respect to several formal properties. To do this, we first propose a formalization of these properties in order to remove any ambiguity about them. We then study, for various objective interestingness measures, the presence or absence of appropriate characteristic properties. This evaluation of the measures is the starting point for categorizing them. Different clustering methods have been applied: (i) non-overlapping methods (agglomerative hierarchical clustering, CAH, and k-means), which yield disjoint groups of measures; (ii) an overlapping method (Boolean factor analysis), which yields overlapping groups of measures.
Regarding the second approach, we propose an empirical study of the behavior of about sixty measures on datasets of different natures. We thus propose an experimental methodology in which we seek to identify groups of measures that empirically behave similarly. We then compare the two classification results, formal and empirical, in order to validate and enhance our first approach. The two approaches are complementary, with the aim of helping the user choose the interestingness measure best suited to his or her application.
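The property-based categorization described in this abstract can be pictured with a small sketch: each measure is represented by a binary vector saying which formal properties it satisfies, and measures are grouped by agglomerative hierarchical clustering. Everything here is an assumption made for illustration: the measure names, the property vectors, the Hamming distance, and the average-linkage criterion are hypothetical and are not taken from the thesis.

```python
def hamming(u, v):
    """Number of properties on which two measures disagree."""
    return sum(a != b for a, b in zip(u, v))

def agglomerate(profiles, n_clusters):
    """Average-linkage agglomerative clustering of property profiles.

    `profiles` maps a measure name to a tuple of 0/1 flags, one per property.
    Returns a list of disjoint sets of measure names.
    """
    clusters = [{name} for name in profiles]

    def linkage(c1, c2):
        # Mean pairwise Hamming distance between the members of two clusters.
        total = sum(hamming(profiles[a], profiles[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge it.
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters

# Hypothetical measures and hypothetical property flags (1 = property holds).
PROFILES = {
    "measure_A": (1, 1, 0, 0),
    "measure_B": (1, 1, 0, 1),
    "measure_C": (0, 0, 1, 1),
    "measure_D": (0, 0, 1, 0),
}

groups = agglomerate(PROFILES, n_clusters=2)
print(sorted(sorted(g) for g in groups))
```

Because each merge joins whole clusters, the resulting groups are disjoint, which corresponds to the non-overlapping family of methods; an overlapping method such as Boolean factor analysis would need a different algorithm.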
|
6 |
Metodika vývoje a nasazování Business Intelligence v malých a středních podnicích / Methodology of development and deployment of Business Intelligence solutions in Small and Medium Sized Enterprises
Rydzi, Daniel, January 2005
This dissertation deals with the development and implementation of Business Intelligence (BI) solutions for Small and Medium Sized Enterprises (SMEs) in the Czech Republic. It is the culmination of the author's efforts to date to complete a methodological model for developing this kind of application in SMEs using in-house skills and a minimum of external resources and costs. The thesis can be divided into five major parts. The first part, which describes the technologies used, is divided into two chapters. The first chapter describes the current state of the Business Intelligence concept and also contains an original taxonomy of Business Intelligence solutions. The second chapter describes two Knowledge Discovery in Databases (KDD) techniques that were used to build the BI solutions introduced in the case studies. The second part describes the area of Czech SMEs, the environment in which the thesis was written and to which it is meant to contribute. This environment is covered in one chapter that defines how SMEs differ from large corporations. The author also explains why he personally focuses on this area. The third major part presents the results of a survey conducted among Czech SMEs with the support of the Department of Information Technologies of the Faculty of Informatics and Statistics of the University of Economics in Prague. The survey had three objectives. The first was to map the readiness of Czech SMEs for BI solution development and deployment. The second was to determine the major problems and consequent decisions of Czech SMEs that could be supported by BI solutions. The third was to determine the top factors preventing SMEs from developing and deploying BI solutions. The fourth part is the core of the thesis. Its two chapters describe the original methodology for the development and deployment of BI solutions by SMEs, as well as the other methodologies that were studied.
The original methodology is partly based on the well-known CRISP-DM methodology. Finally, the last part describes a particular company that has become a testing ground for the author's theories and that supports his research. Further chapters introduce case studies of the development and deployment of BI solutions in this company, which were built using contemporary BI and KDD techniques in accordance with the original methodology. In that sense, these case studies verified the theoretical methodology in real use.
|
7 |
Extraction de connaissances pour la modélisation tri-dimensionnelle de l'interactome structural / Knowledge-based approaches for modelling the 3D structural interactome
Ghoorah, Anisah W., 22 November 2012
Understanding how the protein interactome works at a structural level could provide useful insights into the mechanisms of diseases. Comparative homology modelling and ab initio protein docking are two computational methods for modelling the three-dimensional (3D) structures of protein-protein interactions (PPIs). Previous studies have shown that both methods give significantly better predictions when they incorporate experimental PPI information. However, PPI information is often not available in an easily accessible form and so cannot be reused by 3D PPI modelling algorithms. Hence, there is currently a need for a reliable framework to facilitate the reuse of PPI data. This thesis presents a systematic knowledge-based approach for representing, describing and manipulating 3D interactions in order to study PPIs on a large scale and to facilitate knowledge-based modelling of protein-protein complexes.
The main contributions of this thesis are: (1) it describes the design and implementation of KBDOCK, an integrated database of non-redundant 3D hetero domain-domain interactions (DDIs); (2) it presents a novel method of describing and clustering DDIs according to the spatial orientations of the binding partners, thus introducing the notion of "domain family-level binding sites" (DFBSs); (3) it proposes a structural classification of DFBSs similar to the CATH classification of protein folds, and it presents a study of the secondary structure propensities of DFBSs and their interaction preferences; (4) it introduces a systematic case-based reasoning approach to model, on a large scale, the 3D structures of protein complexes from existing structural DDIs. All these contributions have been made publicly available through a web server (http://kbdock.loria.fr).
|