The positive training dataset comprises 232 protein sequences derived from UniProtKB/SwissProt
(release 2021_4) and spanning 37 organisms. Only experimentally validated proteins (evidence code ECO:0000269)
carrying the annotation “N-myristoyl glycine” are included.
For each protein, the Gly-starting octapeptide was extracted
and proteins with identical octapeptides were clustered together, choosing one representative per
cluster for a total of 232 examples.
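The extraction and clustering step can be sketched as follows. This is a minimal illustration, not the pipeline actually used: it assumes that the initiator Met is dropped so that the octapeptide starts at the Gly in position 2, and that the first protein seen in each cluster is taken as its representative. The function names and the toy sequences are hypothetical.

```python
def gly_octapeptide(sequence):
    """Return the Gly-starting octapeptide of a protein sequence, or None."""
    seq = sequence.strip().upper()
    # Assumption: the initiator Met, if present, is removed before extraction.
    if seq.startswith("M"):
        seq = seq[1:]
    # Keep only sequences that expose an N-terminal Gly and are long enough.
    if len(seq) < 8 or seq[0] != "G":
        return None
    return seq[:8]


def cluster_by_octapeptide(proteins):
    """Group protein IDs that share an identical Gly-starting octapeptide."""
    clusters = {}
    for protein_id, sequence in proteins.items():
        peptide = gly_octapeptide(sequence)
        if peptide is not None:
            clusters.setdefault(peptide, []).append(protein_id)
    return clusters


# Toy sequences, invented for illustration only.
toy = {
    "P1": "MGCVQCKDKEATKLT",
    "P2": "MGCVQCKDKAAAAAA",
    "P3": "MGNAAAAKKGSEQES",
    "P4": "MASNDYTQQATQSYG",
}
clusters = cluster_by_octapeptide(toy)
# Assumption: the first protein of each cluster is its representative.
representatives = {peptide: ids[0] for peptide, ids in clusters.items()}
```

Here P1 and P2 share the same octapeptide and collapse into one cluster, while P4 is discarded because it has no N-terminal Gly.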
A negative training dataset, comprising non-myristoylated proteins from Homo sapiens and
Arabidopsis thaliana, was derived from an in vitro study
described in the literature [1]. In that study, the authors measured the enzymatic activity of
N-myristoyltransferase (NMT) in the presence of diverse octapeptides using macro-arrays.
From the set of 1126 negative examples, i.e., proteins experimentally tested as not myristoylated,
we randomly selected 232 representative negative proteins to obtain a training dataset balancing
positive and negative examples.
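A balanced subsample of this kind can be drawn as in the sketch below. The seed value and the placeholder identifiers are assumptions made for the example; the source does not state how the random selection was seeded or implemented.

```python
import random


def sample_negatives(negatives, n, seed=0):
    """Draw n distinct negatives without replacement, reproducibly."""
    # Assumption: a fixed seed is used only to make the sketch repeatable.
    rng = random.Random(seed)
    return rng.sample(negatives, n)


# Placeholder identifiers standing in for the 1126 negative examples.
pool = [f"NEG_{i:04d}" for i in range(1126)]
selected = sample_negatives(pool, 232)
```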
Ten-fold cross-validation sets were generated such that similar octapeptides are collected within the same
subset. To measure similarity, we adopted the Hamming distance, requiring pairs of octapeptides whose distance is less than 4 residues
to be placed in the same subset. In this way, we avoid redundancy between training and testing sets.
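One way to satisfy this constraint is sketched below. The approach shown, which links peptides at distance below 4 into connected groups with a union-find structure and then assigns whole groups to the currently smallest fold, is an assumption made for illustration; the source states only the constraint, not the grouping or balancing procedure. All function names are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length peptides differ."""
    if len(a) != len(b):
        raise ValueError("peptides must have the same length")
    return sum(x != y for x, y in zip(a, b))


def similarity_groups(peptides, max_distance=4):
    """Connected groups of peptides linked by Hamming distance < max_distance."""
    parent = list(range(len(peptides)))

    def find(i):
        # Follow parent links to the group root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(peptides)), 2):
        if hamming(peptides[i], peptides[j]) < max_distance:
            parent[find(i)] = find(j)

    groups = {}
    for i, peptide in enumerate(peptides):
        groups.setdefault(find(i), []).append(peptide)
    return list(groups.values())


def make_folds(peptides, n_folds=10, max_distance=4):
    """Assign whole similarity groups to folds, largest groups first."""
    folds = [[] for _ in range(n_folds)]
    groups = sorted(similarity_groups(peptides, max_distance), key=len, reverse=True)
    for group in groups:
        # Assumption: greedy balancing, each group goes to the smallest fold.
        min(folds, key=len).extend(group)
    return folds


# Toy octapeptides, invented for illustration only.
toy_peptides = ["GCVQCKDK", "GCVQCKDR", "GCVQAAAR", "GNAAAAKK", "GSSKSKPK", "GLLSSKRQ"]
folds = make_folds(toy_peptides, n_folds=3)
```

Because groups are built transitively, two peptides can end up in the same fold even if their direct distance is 4 or more, as long as a chain of closer peptides connects them. No pair closer than the threshold is ever split across folds.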
Table 1. Summary statistics on the training dataset
| | Positive training set | Negative training set |
|---|---|---|
| # proteins | 232 | 232 |
| # species | 37 | 2 |
| | Download TSV | Download TSV |
Figure 1. Taxonomic distribution of the positive training set.
In the legend, numbers between square brackets refer to distinct species.
We derived a blind test set from the in vitro experiment previously reported [1] and in vivo
experiments on four parasites [2-5]. From [1], we selected octapeptides that are non-redundant
with respect to the positive training set used to train SVMyr. To measure similarity between octapeptides, we adopted
the Hamming distance, including in the blind set only octapeptides whose distance from every training peptide is greater than four residues.
To these examples, we added MYR proteins found in global profiling studies from Trypanosoma brucei [2], Trypanosoma cruzi [3],
Leishmania donovani [4], and Plasmodium falciparum [5]. We ended up with 88 positive examples.
To collect negative examples for the blind test, we used the remaining part of the in vitro negative dataset [1].
To this we added proteins with an annotated N-terminal acetyl glycine in SwissProt.
After removing redundancy with the same Hamming distance threshold, we ended up with 528 negative examples.
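The redundancy filter applied to both the positive and negative blind sets can be sketched as follows. This is an illustrative reading of the stated criterion (distance greater than four residues from every training peptide), not the original code; the function names and toy peptides are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length peptides differ."""
    if len(a) != len(b):
        raise ValueError("peptides must have the same length")
    return sum(x != y for x, y in zip(a, b))


def non_redundant(candidates, training, threshold=4):
    """Keep candidates whose distance from every training peptide exceeds threshold."""
    return [
        candidate
        for candidate in candidates
        if all(hamming(candidate, t) > threshold for t in training)
    ]


# Toy octapeptides, invented for illustration only.
training = ["GCVQCKDK"]
candidates = ["GCVQCKDR", "GCVQAAAA", "GNAAAAKK"]
blind = non_redundant(candidates, training)
```

In this toy case only `GNAAAAKK` survives: `GCVQCKDR` differs from the training peptide at one position and `GCVQAAAA` at exactly four, so neither exceeds the threshold.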
To test SVMyr on the task of detecting internal, post-translational myristoylation sites in Metazoa, we derived
from SwissProt an additional dataset including 4 proteins experimentally annotated with the feature
(3 human and one bovine). Furthermore, we included 11 human proteins reported as post-translationally
myristoylated in an in vivo experiment with induced apoptosis [6] and not included in UniProt.
Table 2. Summary statistics on the blind test dataset
| | Positive MYR set | Negative set | Post-translational MYR set |
|---|---|---|---|
| # proteins | 88 | 528 | 15 |
| # species | 6 | 10 | 2 |
| | Download TSV | Download TSV | Download TSV |
Figure 2. Taxonomic distribution of the blind test dataset (co-translational).
In the legend, numbers between square brackets refer to distinct species.
We performed full-proteome analysis on 8 different organisms (Table 3).
MYR predictions were compared with available experimental and electronic MYR annotations in UniProtKB, as well as with proteome-scale
experimental studies on the 8 organisms, when available [1-5]. Overall, SVMyr identified 931 new MYR substrates that are good candidates
for further validation (Table 3). These new substrates, along with functional annotations, can be downloaded here.
Table 3. Summary statistics on proteome-wide analysis
| Species | Proteome size | Gly-ome size | New MYR substrates | |
|---|---|---|---|---|
| | 79038 | 5243 | 183 | Download TSV |
| | 55341 | 3788 | 272 | Download TSV |
| | 39334 | 3457 | 68 | Download TSV |
| | 6050 | 288 | 21 | Download TSV |
| | 8587 | 463 | 63 | Download TSV |
| | 19242 | 867 | 181 | Download TSV |
| | 7960 | 412 | 99 | Download TSV |
| | 5384 | 232 | 44 | Download TSV |
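As a quick arithmetic check, the per-proteome counts in Table 3 can be summed to confirm the overall figure of 931 new MYR substrates quoted above. The snippet copies the numeric columns of the table in row order; the variable names are hypothetical, and the per-row percentage of the Gly-ome is an illustrative derived quantity, not a statistic reported in the source.

```python
# (proteome size, Gly-ome size, new MYR substrates), in Table 3 row order.
table3 = [
    (79038, 5243, 183),
    (55341, 3788, 272),
    (39334, 3457, 68),
    (6050, 288, 21),
    (8587, 463, 63),
    (19242, 867, 181),
    (7960, 412, 99),
    (5384, 232, 44),
]

total_new = sum(new for _, _, new in table3)
print(f"Total new MYR substrates: {total_new}")  # 931, matching the text

# Derived for illustration: new substrates as a percentage of each Gly-ome.
for proteome, glyome, new in table3:
    print(f"{proteome:>6} {glyome:>5} {new:>4} {100 * new / glyome:5.1f}%")
```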