1 |
Random Relational Rules / Anderson, Grant. January 2008
In the field of machine learning, methods for learning from single-table data have received much more attention than those for learning from multi-table, or relational data, which are generally more computationally complex. However, a significant amount of the world's data is relational. This indicates a need for algorithms that can operate efficiently on relational data and exploit the larger body of work produced in the area of single-table techniques. This thesis presents algorithms for learning from relational data that mitigate, to some extent, the complexity normally associated with such learning. All algorithms in this thesis are based on the generation of random relational rules. The assumption is that random rules enable efficient and effective relational learning, and this thesis presents evidence that this is indeed the case. To this end, a system for generating random relational rules is described, and algorithms using these rules are evaluated. These algorithms include direct classification, classification by propositionalisation, clustering, semi-supervised learning and generating random forests. The experimental results show that these algorithms perform competitively with previously published results for the datasets used, while often exhibiting lower runtime than other tested systems. This demonstrates that sufficient information for classification and clustering is retained in the rule generation process and that learning with random rules is efficient. Further applications of random rules are investigated. Propositionalisation allows single-table algorithms for classification and clustering to be applied to the resulting data, reducing the amount of relational processing required. Further results show that techniques for utilising additional unlabeled training data improve accuracy of classification in the semi-supervised setting. 
The thesis also develops a novel algorithm for building random forests by making efficient use of random rules to generate trees and leaves in parallel.
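The idea of classification by propositionalisation with random rules can be illustrated with a minimal sketch. It is not the system described in the thesis: the toy molecule/atom data, the rule format (a conjunction of attribute tests on one related row) and the names `random_rule`, `covers` and `propositionalise` are all assumptions made for illustration.

```python
import random

# Toy relational data (hypothetical): each molecule owns a set of atoms.
ATOMS = {
    "m1": [{"element": "C", "charge": "neg"}, {"element": "O", "charge": "pos"}],
    "m2": [{"element": "C", "charge": "pos"}, {"element": "N", "charge": "neg"}],
    "m3": [{"element": "O", "charge": "neg"}],
}


def random_rule(rng, max_literals=2):
    """Draw a random conjunction of attribute tests on a single related row."""
    rows = [row for atoms in ATOMS.values() for row in atoms]
    attributes = sorted(rows[0])
    n_literals = rng.randint(1, min(max_literals, len(attributes)))
    chosen = rng.sample(attributes, n_literals)
    # Each literal takes its value from a randomly picked existing row.
    return {attr: rng.choice(rows)[attr] for attr in chosen}


def covers(rule, example_id):
    """A rule covers an example if some related row satisfies every literal."""
    return any(
        all(row[attr] == value for attr, value in rule.items())
        for row in ATOMS[example_id]
    )


def propositionalise(n_rules, seed=0):
    """Build one flat table: a row per example, a boolean column per rule."""
    rng = random.Random(seed)
    rules = [random_rule(rng) for _ in range(n_rules)]
    table = {ex: [covers(rule, ex) for rule in rules] for ex in ATOMS}
    return rules, table
```

The resulting boolean table can be handed to any single-table learner, which is the reduction in relational processing that the abstract refers to.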
|
2 |
Mineração multirrelacional de regras de associação em grandes bases de dados [Multi-relational association rule mining in large databases] / Oyama, Fernando Takeshi [UNESP]. 22 February 2010
The continuing advance and availability of computing resources make it feasible to store and handle large databases. Traditional data mining techniques allow patterns to be extracted provided that the data is stored in a single table. Multi-relational data mining is a more recent approach that searches for patterns across multiple tables and is therefore suited to relational databases. However, existing multi-relational association rule mining algorithms cannot complete the mining task on large volumes of data, because the memory required to finish processing exceeds the amount available. The goal of this work is to present a multi-relational algorithm for extracting association rules that targets large relational databases. To this end, the proposed algorithm, MR-RADIX, uses a structure called a Radix-tree that represents the database in memory in compressed form. In addition, the algorithm uses partitioning to subdivide the database so that each partition can be processed entirely in memory. The tests show that MR-RADIX performs better than other related algorithms and, unlike them, successfully mines association rules in large databases.
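The partitioning idea can be sketched as a two-pass, partition-based frequent itemset search. This is a generic single-table illustration, not MR-RADIX: the Radix-tree compression and the multi-relational handling are omitted, itemsets are limited to sizes 1 and 2, and the function names are assumptions. It relies on the fact that an itemset that is frequent in the whole database must be frequent in at least one partition.

```python
from collections import Counter
from itertools import combinations


def partitions(transactions, size):
    """Yield consecutive chunks small enough to be processed in memory."""
    for start in range(0, len(transactions), size):
        yield transactions[start:start + size]


def local_frequent(part, min_support):
    """Itemsets of size 1 and 2 that are frequent within one partition."""
    counts = Counter()
    for transaction in part:
        items = sorted(set(transaction))
        for k in (1, 2):
            counts.update(combinations(items, k))
    threshold = min_support * len(part)
    return {itemset for itemset, n in counts.items() if n >= threshold}


def partitioned_mining(transactions, min_support, size):
    # Pass 1: the union of locally frequent itemsets is the candidate set.
    candidates = set()
    for part in partitions(transactions, size):
        candidates |= local_frequent(part, min_support)

    # Pass 2: count every candidate over the whole database, chunk by chunk.
    counts = Counter()
    for part in partitions(transactions, size):
        for transaction in part:
            items = set(transaction)
            for candidate in candidates:
                if items.issuperset(candidate):
                    counts[candidate] += 1

    threshold = min_support * len(transactions)
    return {c: n for c, n in counts.items() if n >= threshold}
```

Only one partition plus the candidate set has to be resident at a time, which is what keeps the memory requirement bounded.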
|
3 |
Mineração multirrelacional de regras de associação em grandes bases de dados [Multi-relational association rule mining in large databases] / Oyama, Fernando Takeshi. January 2010
Advisor: Carlos Roberto Valêncio / Committee: Cristina Dutra de Aguiar Ciferri / Committee: Rogéria Cristiane Gratão de Souza / Abstract: The continuing advance and availability of computing resources make it feasible to store and handle large databases. Traditional data mining techniques allow patterns to be extracted provided that the data is stored in a single table. Multi-relational data mining is a more recent approach that searches for patterns across multiple tables and is therefore suited to relational databases. However, existing multi-relational association rule mining algorithms cannot complete the mining task on large volumes of data, because the memory required to finish processing exceeds the amount available. The goal of this work is to present a multi-relational algorithm for extracting association rules that targets large relational databases. To this end, the proposed algorithm, MR-RADIX, uses a structure called a Radix-tree that represents the database in memory in compressed form. In addition, the algorithm uses partitioning to subdivide the database so that each partition can be processed entirely in memory. The tests show that MR-RADIX performs better than other related algorithms and, unlike them, successfully mines association rules in large databases. / Master's degree
|
4 |
Learning probabilistic relational models: a novel approach / Aprendendo modelos probabilísticos relacionais: uma nova abordagem / Mormille, Luiz Henrique Barbosa. 17 August 2018
While most statistical learning methods are designed to work with data stored in a single table, many large datasets are stored in relational database systems. Probabilistic Relational Models (PRMs) extend Bayesian networks by introducing relations and individuals, thus making it possible to represent the information in a relational database. However, learning a PRM from relational data is more complex than learning a Bayesian network from "flat" data. The main difficulties in learning a PRM are establishing which dependency structures are legal, searching over possible structures, and scoring them. This thesis develops a novel approach to learning the structure of a PRM, describes a package in the R language that supports the learning framework, and applies it to a real, large-scale scenario: the city of Atibaia, in the state of São Paulo, Brazil. The research is based on a database combining three tables, each representing one class in the domain of study. The first table contains 27 attributes for 110,816 citizens of Atibaia. The second contains 9 attributes for 20,162 companies located in the city. The third has 8 attributes for 327 census sectors (small territorial units that make up the city of Atibaia). The proposed framework is applied to learn a PRM structure and its parameters from the database. The model is used to verify whether the social class of a person can be explained by the location where they live, their neighbors, and the companies nearby. Preliminary experiments were conducted and a paper was published at the 2017 Symposium on Knowledge Discovery, Mining and Learning (KDMiLe). The performance of the algorithm was further evaluated through extensive experimentation, and a broader study using Serasa Experian data was conducted. Finally, the R package that supports the method was refined and provided with proper documentation and a tutorial.
|
5 |
An ILP-based Concept Discovery System For Multi-relational Data Mining / Kavurucu, Yusuf. 01 July 2009
Multi-relational data mining has become popular due to the limitations of propositional problem definition in structured domains and the tendency to store data in relational databases. However, as patterns involve multiple relations, the search space of possible hypotheses becomes intractably complex. To cope with this problem, several relational knowledge discovery systems have been developed, employing various search strategies, heuristics and language pattern limitations.
In this thesis, Inductive Logic Programming (ILP) based concept discovery is studied, and two systems based on a hybrid methodology employing ILP and APRIORI, namely Confidence-based Concept Discovery and Concept Rule Induction System, are proposed. In both systems, the main aim is to relax the strong declarative biases and user-defined specifications. Moreover, this new method works directly on relational databases. In addition, the traditional definition of confidence from the relational database perspective is modified to express the Closed World Assumption in first-order logic. A new confidence-based pruning method based on the improved definition is applied in the APRIORI lattice. Moreover, a new hypothesis evaluation criterion is used to express the quality of patterns in the search space. In addition, in Concept Rule Induction System, the quality of the constructed rules is further improved by using an improved generalization method.
Finally, a set of experiments is conducted on real-world problems to compare the performance of the proposed method with that of similar systems in terms of support and confidence.
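Support and confidence for a relational rule can be illustrated with a small sketch. It uses a common textbook-style definition (covered target facts over all target facts, and covered target facts over all bindings of the rule body, with bindings outside the target relation treated as negatives under the Closed World Assumption). This is an assumption for illustration and not necessarily the modified definition proposed in the thesis; the family facts and function names are hypothetical.

```python
# Toy facts (hypothetical). PARENT holds (parent, child) pairs.
PARENT = {("ann", "mary"), ("ann", "tom"), ("tom", "eve"), ("tom", "ian")}
FEMALE = {"ann", "mary", "eve"}
DAUGHTER = {("mary", "ann"), ("eve", "tom")}  # target concept daughter(X, Y)


def body_bindings(require_female):
    """Head bindings (X, Y) that satisfy the rule body.

    require_female=False: daughter(X, Y) :- parent(Y, X)
    require_female=True:  daughter(X, Y) :- parent(Y, X), female(X)
    """
    return {
        (child, par)
        for (par, child) in PARENT
        if not require_female or child in FEMALE
    }


def support_and_confidence(bindings, target):
    """Support: share of target facts covered. Confidence: share of
    body bindings that are target facts (the rest count as negatives)."""
    covered = bindings & target
    support = len(covered) / len(target)
    confidence = len(covered) / len(bindings) if bindings else 0.0
    return support, confidence
```

Adding the `female(X)` literal leaves support unchanged and raises confidence from 0.5 to 1.0 on this toy data, which is the kind of signal a confidence-based pruning step can use.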
|
6 |
Data Mining For Rule Discovery In Relational Databases / Toprak, Serkan. 01 September 2004
Data is mostly stored in relational databases today. However, most data mining algorithms cannot work directly on data stored in relational databases. Instead, they require a preprocessing step that transforms the relational data into an algorithm-specific form. Moreover, several data mining algorithms provide solutions for single relations only. Therefore, valuable hidden knowledge involving multiple relations remains undiscovered. In this thesis, an implementation is developed for discovering multi-relational association rules in relational databases. The implementation is based on a framework that provides a representation of patterns in relational databases, refinement methods for patterns, and primitives for obtaining the record counts needed from the database to calculate measures for patterns. The framework exploits the meta-data of relational databases to prune the search space of patterns. The implementation extends the framework by employing the Apriori algorithm to prune the search space further and to discover relational recursive patterns. The Apriori algorithm is used to find large itemsets of tables, which are then used to refine patterns. It is modified by changing the support calculation method for itemsets. A method for determining recursive relations is described, and a solution is provided for handling recursive patterns using aliases. Additionally, continuous attributes of tables are discretized using equal-depth partitioning. The implementation is tested on the gene localization prediction task of KDD Cup 2001, and the results are compared to those of the winning approach.
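Equal-depth (equal-frequency) partitioning, mentioned above for discretizing continuous attributes, can be sketched as follows. This is a generic illustration rather than the thesis implementation; the function name is an assumption, and tied values may fall into different bins in this simple rank-based version.

```python
def equal_depth_bins(values, n_bins):
    """Return, for each value, the index of its equal-frequency bin.

    Values are ranked and the ranks are cut into n_bins groups of
    (nearly) equal size, so each bin holds about len(values) / n_bins items.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins
```

For example, `equal_depth_bins([5, 1, 9, 3, 7, 11], 3)` assigns the two smallest values to bin 0, the next two to bin 1 and the two largest to bin 2.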
|
7 |
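COMOVI: um framework para transformação de dados em aplicações de credit behavior scoring baseado no desenvolvimento dirigido por modelos [COMOVI: a framework for data transformation in credit behavior scoring applications based on model-driven development] / OLIVEIRA NETO, Rosalvo Ferreira de. 11 December 2015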
CAPES / The pre-processing stage in knowledge discovery projects is costly, generally taking
between 50 and 80% of total project time. In this stage, data in a relational database are transformed so that a data mining technique can be applied. This is a complex task that demands close interaction between database designers and experts in the application domain. The frameworks in the literature that aim to systematize the data transformation stage have significant limitations when applied to behavioral solutions such as Credit Behavior Scoring. The goal of these solutions is to help financial institutions decide whether to grant credit to consumers based on the credit risk of their requests. This work proposes a framework based on Model Driven Development to systematize this stage in Credit Behavior Scoring solutions. It is composed of a meta-model that maps the domain concepts and a set of transformation rules. The work has three main contributions: 1) improving the discriminant power of data mining techniques by constructing new input variables that embed new knowledge for the technique; 2) reducing data transformation time through automatic code generation; and 3) allowing artificial intelligence and statistics modelers to perform the data transformation without the help of database experts. To validate the proposed framework, two comparative studies were conducted. The first compared the performance of the main existing frameworks in the literature with that of the proposed framework on two databases: one from a well-known benchmark of an international competition organized by PKDD, and another obtained from one of the biggest retail companies in Brazil, which has its own private-label credit card. The RelAggs and Correlation-based Multiple View Validation frameworks were chosen as representatives of the propositional and relational data mining approaches, respectively. The comparison was carried out through 10-fold stratified cross-validation in order to define the confidence intervals. The results show that the proposed framework delivers performance equivalent or superior to that of existing frameworks, as measured by the area under the ROC curve, using a Multilayer Perceptron neural network, k-nearest neighbors and Random Forest as classifiers, at a 95% confidence level. The second comparative study verified the reduction in the time required for data transformation using the proposed framework. For this, seven teams of students from a Brazilian university measured the time taken by this stage with and without the proposed framework. The paired Wilcoxon signed-rank test showed that the proposed framework reduces data transformation time at a 95% confidence level.
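The construction of new input variables from relational data can be illustrated with a minimal aggregation-based propositionalisation sketch, in the general spirit of aggregation approaches such as RelAggs. It is not the COMOVI framework or the RelAggs implementation; the payments table, its column names and the `aggregate` function are hypothetical.

```python
from collections import defaultdict

# Hypothetical one-to-many table: several payment rows per customer.
PAYMENTS = [
    {"customer": "c1", "amount": 100.0, "days_late": 0},
    {"customer": "c1", "amount": 50.0, "days_late": 12},
    {"customer": "c2", "amount": 200.0, "days_late": 0},
]


def aggregate(rows, key, column):
    """Collapse a one-to-many table into one feature row per key value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[column])
    return {
        k: {
            f"{column}_count": len(values),
            f"{column}_sum": sum(values),
            f"{column}_mean": sum(values) / len(values),
            f"{column}_max": max(values),
        }
        for k, values in groups.items()
    }
```

Calling `aggregate(PAYMENTS, "customer", "days_late")` yields one behavioral feature row per customer, which can then be joined to the customer table and passed to a single-table classifier.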
|
8 |
Organisation et exploitation des connaissances sur les réseaux d'intéractions biomoléculaires pour l'étude de l'étiologie des maladies génétiques et la caractérisation des effets secondaires de principes actifs / Organization and exploitation of biological molecular networks for studying the etiology of genetic diseases and for characterizing drug side effects / Bresso, Emmanuel. 25 September 2013
Understanding human diseases and drug mechanisms today requires taking molecular interaction networks into account. Recent studies on biological systems are producing increasing amounts of data on the networks that govern cellular processes. However, the complexity and heterogeneity of these datasets make it difficult to exploit them for understanding atypical phenotypes or drug side effects. This thesis presents two knowledge-based integrative approaches that combine data management, graph visualization and data mining techniques in order to improve our understanding of phenotypes associated with genetic diseases or drug side effects. Data management relies on a generic data warehouse, NetworkDB, that integrates data on proteins and their properties. Customization of the NetworkDB model and regular updates are semi-automatic. Graph visualization techniques have been coupled with NetworkDB. This approach has facilitated access to biological network data in order to study the etiology of genetic diseases, including X-linked intellectual disability (XLID). Meaningful sub-networks of genes have thus been identified and characterized. Drug side-effect profiles have been extracted from NetworkDB and subsequently characterized by a relational learning procedure coupled with NetworkDB. The resulting rules indicate which properties of drugs and their targets (including membership of biological networks) preferentially associate with a particular side-effect profile.
|
9 |
Enhancing supervised learning with complex aggregate features and context sensitivity / Amélioration de l'apprentissage supervisé par l'utilisation d'agrégats complexes et la prise en compte du contexte / Charnay, Clément. 30 June 2016
In this thesis, we study model adaptation in supervised learning. Firstly, we adapt existing learning algorithms to the relational representation of data. Secondly, we adapt learned prediction models to context change. In the relational setting, data is modeled by multiple entities linked by relationships. We handle these relationships using complex aggregate features. We propose stochastic optimization heuristics to include complex aggregates in relational decision trees and Random Forests, and assess their predictive performance on real-world datasets. We adapt prediction models to two kinds of context change. Firstly, we propose an algorithm to tune thresholds on scoring models to adapt to a change of misclassification costs. Secondly, we reframe numerical attributes with affine transformations to adapt to a change of attribute distribution between a learning context and a deployment context. Finally, we extend these transformations to complex aggregates.
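Adapting a scoring model to a change of misclassification costs by re-tuning its decision threshold can be sketched as below. This is a plain exhaustive search over candidate thresholds on labeled validation scores, given as an assumption for illustration; it is not the algorithm proposed in the thesis, and the function name is hypothetical.

```python
def best_threshold(scores, labels, cost_fp, cost_fn):
    """Pick the score threshold with the lowest total misclassification cost.

    An example is predicted positive when its score is >= the threshold.
    labels are 1 (positive) or 0 (negative). Ties go to the lowest threshold.
    """
    # Every distinct score is a candidate; infinity means "predict all negative".
    candidates = sorted(set(scores)) + [float("inf")]

    def total_cost(threshold):
        false_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        false_neg = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        return cost_fp * false_pos + cost_fn * false_neg

    return min(candidates, key=total_cost)
```

With scores `[0.1, 0.4, 0.35, 0.8]` and labels `[0, 0, 1, 1]`, making false negatives five times as costly as false positives selects the threshold 0.35, while reversing the costs selects 0.8. The model itself is unchanged; only the operating point moves with the deployment context.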
|