Statistical Relational Learning for Proteomics: Function, Interactions and Evolution

In recent years, the field of Statistical Relational Learning (SRL) [1, 2] has
produced new, powerful learning methods that are explicitly designed to solve
complex problems, such as collective classification, multi-task learning and
structured output prediction, and that natively handle relational data, noise,
and partial information. Statistical-relational methods rely on some form of
First-Order Logic as a general, expressive formal language to encode both the
data instances and the relations or constraints between them. The latter encode
background knowledge on the problem domain, and are used to restrict or bias
the model search space according to the instructions of domain experts. The
new tools developed within SRL make it possible to revisit old computational
biology problems in a less ad hoc fashion, and to tackle novel, more complex ones.
Motivated by these developments, in this thesis we describe and discuss the
application of SRL to three important biological problems, highlighting the
advantages, discussing the trade-offs, and pointing out the open problems.

In particular, in Chapter 3 we show how to jointly improve the outputs
of multiple correlated predictors of protein features by means of a very
general probabilistic-logical consistency layer. The logical layer, based on
grounding-specific Markov Logic networks [3], enforces a set of weighted
first-order rules encoding biologically motivated constraints between the
predictions. The refiner then adjusts the raw predictions so that they violate
the constraints as little as possible. In contrast to canonical methods for the
prediction of protein features, which typically take predicted correlated
features as inputs to improve the output post facto, our method refines all
predictions jointly, with potential gains in overall consistency. To showcase
our method, we integrate three stand-alone predictors of correlated features,
namely subcellular localization (Loctree [4]), disulfide bonding state
(Disulfind [5]), and metal bonding state (MetalDetector [6]), in a way that
takes into account their respective strengths and weaknesses. The experimental
results show that the refiner can improve the performance of the underlying
predictors by removing rule violations. In addition, the proposed method is
fully general, and could in principle be applied to an array of heterogeneous
predictions without requiring any change to the underlying software.
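
The idea of refining raw predictions against weighted rules can be illustrated with a toy sketch. It is not the grounding-specific Markov Logic network used in the thesis: it simply enumerates all joint truth assignments and keeps the one that best trades off agreement with the raw predictor outputs against weighted rule violations. The feature names, probabilities, rules and weights below are all assumptions made up for illustration.

```python
import itertools
import math

# Raw, independent predictor outputs P(feature is true); illustrative numbers only.
raw = {"cytosolic": 0.9, "disulfide": 0.6, "metal": 0.3}

# Weighted rules as (weight, violated?) pairs; both rules are hypothetical.
rules = [
    (2.0, lambda a: a["cytosolic"] and a["disulfide"]),
    (1.0, lambda a: a["disulfide"] and a["metal"]),
]

def score(assignment, raw, rules):
    """Log-likelihood under the raw predictors minus the weights of violated rules."""
    total = 0.0
    for name, value in assignment.items():
        p = raw[name]
        total += math.log(p if value else 1.0 - p)
    for weight, violated in rules:
        if violated(assignment):
            total -= weight
    return total

def refine(raw, rules):
    """Return the joint assignment with the highest score (exhaustive search)."""
    names = sorted(raw)
    candidates = (
        dict(zip(names, values))
        for values in itertools.product([False, True], repeat=len(names))
    )
    return max(candidates, key=lambda a: score(a, raw, rules))

refined = refine(raw, rules)
print(refined)
```

With these made-up numbers, thresholding each predictor independently would label the protein both cytosolic and disulfide-bonded. The joint search instead drops the weaker disulfide prediction, because violating the first rule costs more than overriding that predictor. Exhaustive enumeration only works for a handful of features; it is used here purely to keep the sketch short.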

In Chapter 4 we consider the multi-level protein–protein interaction (PPI)
prediction problem. In general, PPIs can be seen as a hierarchical process
occurring at three related levels: proteins bind by means of specific domains,
which in turn form interfaces through patches of residues. Detailed knowledge
about which domains and residues are involved in a given interaction has
extensive applications to biology, including better understanding of the
binding process and more efficient drug/enzyme design. We cast the prediction
problem in terms of multi-task learning, with one task per level (proteins,
domains and residues), and propose a machine learning method that collectively
infers the binding state of all object pairs, at all levels, concurrently.
Our method is based on Semantic Based Regularization (SBR) [7], a flexible
and theoretically sound SRL framework that employs First-Order Logic
constraints to tie the learning tasks together. In contrast to most current PPI
prediction methods, which neither identify which regions of a protein actually
instantiate an interaction nor leverage the hierarchy of predictions, our
method resolves the prediction problem down to the residue level, enforces
consistent predictions between the hierarchy levels, and exploits the
hierarchical nature of the problem. We present numerical results showing
that our method substantially outperforms the baseline in several experimental
settings, indicating that our multi-level formulation can indeed lead
to better predictions.
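
One way to picture how logical constraints can tie the three levels together is to relax a rule such as "two proteins bind if and only if some pair of their domains binds" into a continuous penalty on prediction scores in [0, 1]. The sketch below is a simplified illustration under that assumption, not the SBR formulation from the thesis: the soft "exists" is taken to be a maximum, and all function names and scores are hypothetical.

```python
def soft_exists(scores):
    """Soft 'there exists': the largest child score (0 if there are no children)."""
    return max(scores) if scores else 0.0

def consistency_penalty(parent, children):
    """Soft 'parent binds <=> some child pair binds'; 0 when both sides agree."""
    return abs(parent - soft_exists(children))

def total_penalty(protein_score, domain_scores, residue_scores):
    """Sum the penalties between protein/domain and domain/residue levels.

    residue_scores[i] holds the residue-pair scores under domain pair i.
    """
    penalty = consistency_penalty(protein_score, domain_scores)
    for domain_score, residues in zip(domain_scores, residue_scores):
        penalty += consistency_penalty(domain_score, residues)
    return penalty

# Illustrative scores: one protein pair, two domain pairs, their residue pairs.
consistent = total_penalty(0.9, [0.9, 0.1], [[0.9, 0.2], [0.1]])
inconsistent = total_penalty(0.9, [0.1, 0.1], [[0.1], [0.1]])
print(consistent, inconsistent)
```

A penalty of this kind could be added to the per-level training losses so that predictions violating the hierarchy are discouraged; how the constraints are actually weighted and optimized is specified by the framework in [7], not by this sketch.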

Finally, in Chapter 5 we consider the problem of predicting drug-resistant
protein mutations through a combination of Inductive Logic Programming [8,
9] and Statistical Relational Learning. In particular, we focus on viral
proteins: viruses are typically characterized by high mutation rates, which
allow them to quickly develop drug-resistant mutations. Mining relevant rules
from mutation data can be extremely useful to understand the virus adaptation
mechanism and to design drugs that effectively counter potentially resistant
mutants. We propose a simple approach for mutant prediction where the input
consists of mutation data with drug-resistance information, either as sets
of mutations conferring resistance to a certain drug, or as sets of mutants with
information on their susceptibility to the drug. The algorithm learns a set
of relational rules characterizing drug resistance, and uses them to generate
a set of potentially resistant mutants. Learning a weighted combination of
rules makes it possible to assign each generated mutant a resistance score, as
predicted by the statistical relational model, and to select only the
highest-scoring ones. Promising results were obtained in generating resistant
mutations for both nucleoside and non-nucleoside HIV reverse transcriptase
inhibitors. The approach can be generalized quite easily to learning mutants
characterized by more complex rules correlating multiple mutations.
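
The scoring and selection step can be sketched as follows, assuming the rules have already been learned. Each rule is modeled as a predicate over a mutant (a set of `(position, residue)` pairs) with a weight; a mutant's score is the sum of the weights of the rules it satisfies, and only the top-scoring candidates are kept. The rules, weights, positions and residues below are placeholders, not results from the thesis.

```python
def score_mutant(mutant, weighted_rules):
    """Sum the weights of all rules the mutant satisfies."""
    return sum(weight for weight, rule in weighted_rules if rule(mutant))

def top_mutants(candidates, weighted_rules, k):
    """Rank candidate mutants by score and keep the k highest-scoring ones."""
    ranked = sorted(
        candidates,
        key=lambda m: score_mutant(m, weighted_rules),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical learned rules: (weight, predicate over a set of mutations).
weighted_rules = [
    (1.5, lambda m: any(pos == 10 for pos, _ in m)),   # mutation at position 10
    (0.7, lambda m: any(res == "V" for _, res in m)),  # mutation to residue V
]

# Hypothetical generated candidates, each a set of (position, new residue) pairs.
candidates = [
    frozenset({(10, "V")}),
    frozenset({(10, "A")}),
    frozenset({(20, "V")}),
    frozenset({(30, "L")}),
]

best = top_mutants(candidates, weighted_rules, k=2)
print(best)
```

In this simplified form the score is a plain weighted sum; a statistical relational model would typically turn such a sum into a probability, which does not change the ranking.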

Identifier: oai:union.ndltd.org:unitn.it/oai:iris.unitn.it:11572/367705
Date: January 2013
Creators: Teso, Stefano
Contributors: Teso, Stefano, Passerini, Andrea
Publisher: Università degli studi di Trento, place:TRENTO
Source Sets: Università di Trento
Language: English
Detected Language: English
Type: info:eu-repo/semantics/doctoralThesis
Rights: info:eu-repo/semantics/openAccess
Relation: firstpage:1, lastpage:123, numberofpages:123
