Global ETD Search

Topological and domain Knowledge-based subgraph mining : application on protein 3D-structures

This thesis lies at the intersection of two proliferating research fields, namely data mining and bioinformatics. With the emergence of graph data in recent years, much effort has been devoted to mining frequent subgraphs from graph databases. Yet the number of discovered frequent subgraphs is usually exponential, mainly because of the combinatorial nature of graphs. Many frequent subgraphs are irrelevant because they are redundant or simply useless to the user. Moreover, their sheer number may hinder further exploration or even make it infeasible. Redundancy in frequent subgraphs is mainly caused by structural and/or semantic similarities, since most discovered subgraphs differ only slightly in structure and may convey similar or even identical meanings. In this thesis, we propose two approaches for selecting representative subgraphs among the frequent ones in order to remove redundancy. Each of the proposed approaches addresses a specific type of redundancy. The first approach focuses on semantic redundancy, where similarity between subgraphs is measured based on the similarity between their nodes' labels, using prior domain knowledge. The second approach focuses on structural redundancy, where subgraphs are represented by a set of user-defined topological descriptors, and similarity between subgraphs is measured based on the distance between their corresponding topological descriptions. The main application data of this thesis are protein 3D-structures. This choice is based on biological and computational reasons. From a biological perspective, proteins play crucial roles in almost every biological process. They are responsible for a variety of physiological functions. From a computational perspective, we are interested in mining complex data. Proteins are a perfect example of such data, as they are made of complex structures composed of interconnected amino acids, which are themselves composed of interconnected atoms.
Large numbers of protein structures are currently available in online databases, in computer-analyzable formats. Protein 3D-structures can be transformed into graphs where amino acids are the graph nodes and their connections are the graph edges. This enables the use of graph mining techniques to study them. The biological importance of proteins, their complexity, and their availability in computer-analyzable formats make them well-suited application data for this thesis.
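
The transformation of a protein 3D-structure into a graph, as described in the abstract, could be sketched roughly as follows. The abstract does not say how a "connection" between amino acids is defined, so this sketch assumes one common convention: two residues are linked when their representative coordinates lie within a distance cutoff. The function name `protein_to_graph`, the input layout, and the 7.0 cutoff value are illustrative assumptions, not taken from the thesis.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def protein_to_graph(residues, cutoff=7.0):
    """Build a labeled graph from a list of residues.

    residues: list of (amino_acid_label, (x, y, z)) tuples, one per residue.
    cutoff:   ASSUMED distance threshold for connecting two residues
              (hypothetical value; the abstract does not specify one).
    Returns (labels, edges): labels maps node index -> amino acid label,
    edges is a set of (i, j) index pairs with i < j.
    """
    labels = {i: aa for i, (aa, _) in enumerate(residues)}
    edges = set()
    # Compare every unordered pair of residues once.
    for (i, (_, p)), (j, (_, q)) in combinations(enumerate(residues), 2):
        if dist(p, q) <= cutoff:
            edges.add((i, j))
    return labels, edges


# Toy input with made-up coordinates (not real protein data).
toy = [
    ("ALA", (0.0, 0.0, 0.0)),
    ("GLY", (3.8, 0.0, 0.0)),
    ("SER", (7.6, 0.0, 0.0)),
    ("LEU", (30.0, 0.0, 0.0)),
]
labels, edges = protein_to_graph(toy)
```

In this toy input, consecutive residues end up connected while the distant fourth residue stays isolated. Graphs built this way are the kind of input a frequent subgraph miner would consume; the mining and representative-selection steps themselves are not shown here.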

[INFO:INFO_OH] Computer Science/Other

[SPI:OTHER] Engineering Sciences/Other

Feature selection

Pattern mining

Frequent subgraph

Representative unsubstituted subgraph

Topological representative subgraph

Protein structure

Identifier	oai:union.ndltd.org:CCSD/oai:tel.archives-ouvertes.fr:tel-00946989
Date	11 December 2013
Creators	Dhifli, Wajdi
Publisher	Université Blaise Pascal - Clermont-Ferrand II
Source Sets	CCSD theses-EN-ligne, France
Language	English
Detected Language	English
Type	PhD thesis
