Global analysis of SNPs, proteins and protein-protein interactions: approaches for the prioritisation of candidate disease genes.
Abstract
Understanding the etiology of complex disease remains a challenge in biology. In recent
years there has been an explosion in biological data. This study investigates machine
learning and network analysis methods as tools to aid candidate disease gene prioritisation,
specifically relating to hypertension and cardiovascular disease.
This thesis comprises four sets of analyses. Firstly, non-synonymous single nucleotide
polymorphisms (nsSNPs) were analysed in terms of sequence- and structure-based properties
using a classifier to provide a model for predicting deleterious nsSNPs. The degree
of sequence conservation at the nsSNP position was found to be the single best attribute,
but other sequence and structural attributes in combination were also useful. Predictions
for nsSNPs within Ensembl have been made publicly available.
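As a rough illustration of what a per-position conservation attribute can look like, the sketch below scores one alignment column with a normalised Shannon entropy. The abstract does not say which conservation measure the thesis used, so the measure, the function name `column_conservation` and the toy columns are all assumptions made for illustration.

```python
from collections import Counter
from math import log2


def column_conservation(column):
    """Score one alignment column between 0 (maximally varied) and 1 (invariant).

    Assumption: conservation is measured as 1 minus the Shannon entropy of the
    residue frequencies, normalised by log2(20) for the 20 standard amino acids.
    Gap characters ('-') are ignored.
    """
    residues = [r for r in column.upper() if r != "-"]
    if not residues:
        return 0.0  # nothing aligned at this position
    counts = Counter(residues)
    total = len(residues)
    entropy = -sum((c / total) * log2(c / total) for c in counts.values())
    return 1.0 - entropy / log2(20)


# Hypothetical columns: the residues seen at one position across homologues.
invariant = column_conservation("LLLLLL")
mostly_conserved = column_conservation("LLLLLV")
variable = column_conservation("LIVMFA")
```

A score like this would be only one input among the sequence and structural attributes that the classifier combines.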
Secondly, the prediction of function for proteins that lack experimental data or
clear similarity to a sequence of known function was addressed. Protein
domain attributes based on physicochemical and predicted structural characteristics
of the sequence were used as input to classifiers for predicting membership of large and
diverse protein superfamilies from the SCOP database. An enrichment method was investigated
that involved adding domains to the training dataset that are currently absent
from SCOP. This analysis resulted in improved classifier accuracy: optimised classifiers
achieved 66.3% for single-domain proteins and 55.6% when including domains from
multi-domain proteins. The domains from superfamilies with low sequence similarity
share global sequence properties, enabling applications to be developed which complement
profile methods for detecting distant sequence relationships.
Thirdly, a topological analysis of the human protein interactome was performed. The
results were combined with functional annotation and sequence-based properties to build
models for predicting hypertension-associated proteins. The study found that predicted
hypertension-related proteins are not generally associated with network hubs and do
not exhibit high clustering coefficients. Despite this, they tend to be closer and better
connected to other hypertension proteins on the interaction network than would be expected
by chance. Classifiers that combined PPI network, amino acid sequence and functional
properties produced a range of precision and recall scores according to the applied
weights.
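The topological terms used above (hub degree, clustering coefficient, closeness between proteins) can be made concrete with a small sketch. It uses a made-up five-protein network and standard textbook definitions; the protein labels and helper names are hypothetical, and nothing here reproduces the data or the software used in the thesis.

```python
from collections import deque
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map from (a, b) interaction pairs."""
    adj = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def degree(adj, node):
    """Number of interaction partners; hubs are the high-degree nodes."""
    return len(adj[node])


def clustering_coefficient(adj, node):
    """Fraction of pairs of neighbours that also interact with each other."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def shortest_path_length(adj, source, target):
    """Breadth-first search distance in edges, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adj[node]:
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Toy interaction list (hypothetical protein labels).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D"), ("D", "E")]
adj = build_adjacency(edges)
```

Comparing the average `shortest_path_length` among a set of disease proteins with the same statistic for randomly drawn protein sets of equal size is one way to ask whether they are closer than expected by chance.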
Finally, interactome properties of proteins implicated in cardiovascular disease and
cancer were studied. The analysis quantified the influential (central) nature of each protein
and defined characteristics of functional modules and pathways in which the disease
proteins reside. Such proteins were found to be enriched two-fold among proteins that are influential
(p < 0.05) in the interactome. Additionally, they cluster in large, complex, highly
connected communities, acting as interfaces between multiple processes more often than
expected. An approach to prioritising disease candidates based on this analysis was proposed.
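A fold enrichment with an accompanying p-value, as reported above, can be computed along the following lines. The abstract does not state which statistical test was used, so the one-sided hypergeometric test here is an assumption, and the counts are invented purely to show the arithmetic.

```python
from math import comb


def fold_enrichment(k, n, K, N):
    """Observed fraction of disease proteins in the subset over the background fraction.

    k: disease proteins in the subset, n: subset size,
    K: disease proteins overall,      N: all proteins in the network.
    """
    return (k / n) / (K / N)


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n proteins are drawn without replacement from N, K of them marked."""
    total = comb(N, n)
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Invented counts: 1000 proteins, 100 disease proteins,
# 50 "influential" proteins of which 10 are disease proteins.
N, K, n, k = 1000, 100, 50, 10
enrichment = fold_enrichment(k, n, K, N)
p_value = hypergeom_upper_tail(k, n, K, N)
```

With these invented counts the subset contains disease proteins at twice the background rate; `p_value` gives the probability of an overlap at least that large under random sampling.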
Each analysis can provide some new insights into the effort to identify novel disease-related
proteins for cardiovascular disease.
Authors
Dobson, Richard James Butler