The SVMyr training dataset

The positive training dataset comprises 232 protein sequences derived from UniProtKB/SwissProt (release 2021_4), covering 37 organisms. Only experimentally validated proteins (evidence code ECO:0000269) carrying the annotation “N-myristoyl glycine” are included. For each protein, the Gly-starting octapeptide was extracted, and proteins with identical octapeptides were clustered together. One representative was chosen per cluster, for a total of 232 examples.
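The octapeptide extraction and clustering step can be sketched as follows. This is a minimal illustration, not the original pipeline: the function names, the toy sequences, the removal of a leading initiator Met before taking the octapeptide, and the rule for picking a representative (first identifier in sorted order) are all assumptions.

```python
from collections import defaultdict


def gly_octapeptide(sequence):
    """Return the Gly-starting octapeptide of a sequence, or None.

    Assumption: a leading initiator Met, if present, is dropped first,
    so the octapeptide is the eight residues starting at the exposed Gly.
    """
    seq = sequence.upper()
    if seq.startswith("M"):
        seq = seq[1:]
    if len(seq) < 8 or seq[0] != "G":
        return None
    return seq[:8]


def cluster_by_octapeptide(proteins):
    """Group protein identifiers that share an identical octapeptide.

    `proteins` maps an identifier to its amino acid sequence.
    """
    clusters = defaultdict(list)
    for protein_id, sequence in proteins.items():
        octapeptide = gly_octapeptide(sequence)
        if octapeptide is not None:
            clusters[octapeptide].append(protein_id)
    return dict(clusters)


def pick_representatives(clusters):
    """Pick one identifier per cluster (assumed rule: first in sorted order)."""
    return {octa: sorted(ids)[0] for octa, ids in clusters.items()}


# Toy sequences with hypothetical identifiers, for illustration only.
toy_proteins = {
    "P2": "MGSSKSKPKDPSQRR",
    "P1": "MGSSKSKPKAAAAAA",
    "P3": "MGCVQCKDKEATKLT",
    "P4": "MASSKSKPKDPSQRR",  # no Gly after the initiator Met, so it is skipped
}
toy_clusters = cluster_by_octapeptide(toy_proteins)
toy_representatives = pick_representatives(toy_clusters)
```

With the toy input, P1 and P2 fall into the same cluster because their octapeptides are identical, so only one of them is kept as a training example.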
A negative training dataset, comprising non-myristoylated proteins from Homo sapiens and Arabidopsis thaliana, was derived from an in vitro study described in the literature [1]. In that study, the authors measured the enzymatic activity of N-myristoyltransferase (NMT) in the presence of diverse octapeptides using macro-arrays. From the set of 1126 negative examples, i.e., proteins experimentally tested as not myristoylated, we randomly selected 232 representative negative proteins to obtain a training dataset that balances positive and negative examples.
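A balanced random selection of this kind could look like the sketch below. The identifiers and the fixed seed are hypothetical; the source does not state how the random draw was performed or whether it was seeded.

```python
import random


def balanced_negative_sample(negatives, n_positive, seed=0):
    """Draw as many negatives as there are positives, without replacement.

    The seed is an assumption added here only to make the draw reproducible.
    """
    negatives = list(negatives)
    if n_positive > len(negatives):
        raise ValueError("not enough negative examples to balance the set")
    rng = random.Random(seed)
    return rng.sample(negatives, n_positive)


# Hypothetical identifiers standing in for the 1126 in vitro negatives.
all_negatives = [f"NEG{i:04d}" for i in range(1126)]
selected_negatives = balanced_negative_sample(all_negatives, n_positive=232)
```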
Ten-fold cross-validation sets were generated such that similar octapeptides are collected within the same subset. To measure similarity, we adopted the Hamming distance, requiring pairs of octapeptides whose distance is less than 4 residues to be placed in the same subset. In this way, we avoid redundancy between training and testing sets.
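One way to satisfy this constraint is sketched below: octapeptides closer than the threshold are linked, linked peptides are merged into groups (single linkage via union-find), and whole groups are then distributed over the folds. The grouping strategy and the greedy fold assignment are assumptions; the text above only specifies the Hamming distance criterion.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length peptides differ."""
    if len(a) != len(b):
        raise ValueError("peptides must have the same length")
    return sum(x != y for x, y in zip(a, b))


def similarity_groups(peptides, threshold=4):
    """Merge peptides into groups; any pair with distance < threshold is linked."""
    parent = list(range(len(peptides)))

    def find(i):
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(peptides)), 2):
        if hamming(peptides[i], peptides[j]) < threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, peptide in enumerate(peptides):
        groups.setdefault(find(i), []).append(peptide)
    return list(groups.values())


def assign_folds(groups, n_folds=10):
    """Greedy assignment: each group, largest first, goes to the smallest fold."""
    folds = [[] for _ in range(n_folds)]
    for group in sorted(groups, key=len, reverse=True):
        min(folds, key=len).extend(group)
    return folds


# Toy octapeptides, for illustration only.
toy_peptides = ["GSSKSKPK", "GSSKSKPR", "GAAKSKPK", "GCVQCKDK"]
toy_folds = assign_folds(similarity_groups(toy_peptides), n_folds=2)
```

Because groups are never split, any two peptides that end up in different folds are at a Hamming distance of at least 4 from each other.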

Download:

Table 1. Summary statistics on the training dataset

             | Positive training set | Negative training set
# proteins:  | 232                   | 232
# species:   | 37                    | 2
             | Download TSV          | Download TSV

Figure 1. Taxonomic distribution of the positive training set.

In the legend, numbers between square brackets refer to distinct species.

The SVMyr blind test datasets

We derived a blind test set from the in vitro experiment previously reported [1] and from in vivo experiments on four parasites [2-5]. From [1], we selected octapeptides that are non-redundant with respect to the positive training set used to train SVMyr. To measure similarity between octapeptides, we adopted the Hamming distance, including in the blind set only octapeptides at a distance higher than four residues from every peptide used in training. To these examples, we added MYR proteins found in global profiling studies of Trypanosoma brucei [2], Trypanosoma cruzi [3], Leishmania donovani [4], and Plasmodium falciparum [5]. We ended up with 88 positive examples.
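The redundancy filter described here can be sketched as below. The function names and toy peptides are hypothetical; the only rule taken from the text is that a candidate is kept when its Hamming distance to every training octapeptide is higher than four.

```python
def hamming(a, b):
    """Number of positions at which two equal-length peptides differ."""
    if len(a) != len(b):
        raise ValueError("peptides must have the same length")
    return sum(x != y for x, y in zip(a, b))


def non_redundant(candidates, training, min_distance=4):
    """Keep candidates whose distance to every training peptide exceeds min_distance."""
    return [
        candidate
        for candidate in candidates
        if all(hamming(candidate, peptide) > min_distance for peptide in training)
    ]


# Toy octapeptides, for illustration only.
toy_training = ["GSSKSKPK"]
toy_candidates = [
    "GSSKSKPR",  # distance 1 from the training peptide: rejected
    "GCVQCKPK",  # distance 4: rejected, since the distance must be higher than four
    "GCVQCKDK",  # distance 5: kept
]
toy_blind = non_redundant(toy_candidates, toy_training)
```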
To collect negative examples for the blind test, we used the remaining part of the in vitro negative dataset [1]. To this we added proteins with an annotated N-terminal acetyl glycine in SwissProt. After removing redundancy with the same Hamming distance threshold, we ended up with 528 negative examples.
To test SVMyr on the task of detecting internal, post-translational myristoylation sites in Metazoa, we derived from SwissProt an additional dataset including 4 proteins experimentally annotated with the feature (3 human and one bovine). Furthermore, we included 11 human proteins reported as post-translationally myristoylated in an in vivo experiment with induced apoptosis [6] and not included in UniProt.

Download:

Table 2. Summary statistics on the blind test dataset

             | Positive MYR set | Negative set | Post-translational MYR set
# proteins:  | 88               | 528          | 15
# species:   | 6                | 10           | 2
             | Download TSV     | Download TSV | Download TSV

Figure 2. Taxonomic distribution of the blind test dataset (co-translational).

In the legend, numbers between square brackets refer to distinct species.

Proteome analysis of co-translational MYR

We performed full-proteome analysis on 8 different organisms: H.sapiens, M.musculus, A.thaliana, S.cerevisiae, T.brucei, T.cruzi, L.donovani and P.falciparum. The respective reference proteomes were downloaded from UniProtKB.
MYR predictions were compared with available experimental and electronic MYR annotations in UniProtKB, as well as with proteome-scale experimental studies on the 8 organisms, when available [1-5]. Overall, SVMyr identified 931 new MYR substrates that are good candidates for further validation (Table 3). These new substrates, along with functional annotations, can be downloaded from the links in Table 3.

Download:

Table 3. Summary statistics on proteome-wide analysis

Species      | Proteome size | Gly-ome size | New MYR substrates |
H.sapiens    | 79038         | 5243         | 183                | Download TSV
M.musculus   | 55341         | 3788         | 272                | Download TSV
A.thaliana   | 39334         | 3457         | 68                 | Download TSV
S.cerevisiae | 6050          | 288          | 21                 | Download TSV
T.brucei     | 8587          | 463          | 63                 | Download TSV
T.cruzi      | 19242         | 867          | 181                | Download TSV
L.donovani   | 7960          | 412          | 99                 | Download TSV
P.falciparum | 5384          | 232          | 44                 | Download TSV
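
As a consistency check, the per-species counts in Table 3 can be summed and compared with the overall figure of 931 new MYR substrates. The short sketch below also derives the share of each Gly-ome made up of newly predicted substrates; that ratio is an illustrative quantity computed here, not a statistic reported by the authors.

```python
# (species, proteome size, Gly-ome size, new MYR substrates), copied from Table 3.
TABLE_3 = [
    ("H.sapiens", 79038, 5243, 183),
    ("M.musculus", 55341, 3788, 272),
    ("A.thaliana", 39334, 3457, 68),
    ("S.cerevisiae", 6050, 288, 21),
    ("T.brucei", 8587, 463, 63),
    ("T.cruzi", 19242, 867, 181),
    ("L.donovani", 7960, 412, 99),
    ("P.falciparum", 5384, 232, 44),
]


def total_new_substrates(rows):
    """Sum the new MYR substrates over all species."""
    return sum(new for _, _, _, new in rows)


def new_substrate_fraction(rows):
    """Fraction of each Gly-ome that consists of newly predicted substrates."""
    return {species: new / gly_ome for species, _, gly_ome, new in rows}


print(total_new_substrates(TABLE_3))  # 931, matching the total quoted in the text
for species, fraction in new_substrate_fraction(TABLE_3).items():
    print(f"{species}: {fraction:.1%} of the Gly-ome")
```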

References

[1] Castrec, B. et al. (2018). Structural and genomic decoding of human and plant myristoylomes reveals a definitive recognition pattern. Nat. Chem. Biol., 14, 671-679.
[2] Wright, M. H. et al. (2016). Global Profiling and Inhibition of Protein Lipidation in Vector and Host Stages of the Sleeping Sickness Parasite Trypanosoma brucei. ACS Infect. Dis., 2, 427-441.
[3] Roberts, A. J., & Fairlamb, A. H. (2016). The N-myristoylome of Trypanosoma cruzi. Sci. Rep., 6, Article 31078.
[4] Wright, M. H. et al. (2015). Global Analysis of Protein N-Myristoylation and Exploration of N-Myristoyltransferase as a Drug Target in the Neglected Human Pathogen Leishmania donovani. Chem. Biol., 22, 342-354.
[5] Schlott, A. C. et al. (2018). N-Myristoylation as a Drug Target in Malaria: Exploring the Role of N-Myristoyltransferase Substrates in the Inhibitor Mode of Action. ACS Infect. Dis., 4, 449-457.
[6] Thinon, E. et al. (2014). Global profiling of co- and post-translationally N-myristoylated proteomes in human cells. Nat. Commun., 5, Article 4919.