SVMyr - Bologna Biocomputing Group

The SVMyr training dataset

The positive training dataset comprises 232 protein sequences from 37 organisms, derived from UniProtKB/SwissProt (release 2021_4). Only proteins carrying the annotation “N-myristoyl glycine” with experimental evidence (evidence code ECO:0000269) are included. For each protein, the Gly-starting octapeptide was extracted. Proteins with identical octapeptides were clustered together and one representative per cluster was chosen, for a total of 232 examples.
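The extraction and de-duplication step described above can be sketched as follows. This is a minimal illustration, not the code used to build the dataset: the function names are hypothetical, and it assumes that the octapeptide spans residues 2-9 of the precursor sequence (that is, the initiator Met is removed) and that the first protein encountered represents its cluster.

```python
from typing import Dict, Optional


def gly_octapeptide(sequence: str) -> Optional[str]:
    """Return the Gly-starting octapeptide, or None if there is none.

    Assumption: the initiator Met is cleaved, so the octapeptide
    corresponds to residues 2-9 of the precursor sequence.
    """
    seq = sequence.strip().upper()
    if seq.startswith("M"):
        seq = seq[1:]
    if len(seq) < 8 or seq[0] != "G":
        return None
    return seq[:8]


def representatives(proteins: Dict[str, str]) -> Dict[str, str]:
    """Map each distinct octapeptide to one representative accession."""
    reps: Dict[str, str] = {}
    for accession, sequence in proteins.items():
        peptide = gly_octapeptide(sequence)
        # Proteins sharing an identical octapeptide fall in the same cluster;
        # the first one seen is kept as the representative (an assumption).
        if peptide is not None and peptide not in reps:
            reps[peptide] = accession
    return reps
```

Calling `representatives` on a dictionary of accession-to-sequence pairs returns one entry per distinct octapeptide, so its length gives the number of training examples.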
A negative training dataset, comprising non-myristoylated proteins from Homo sapiens and Arabidopsis thaliana, was derived from an in vitro study described in the literature [1]. In this study, the authors measured the enzymatic activity of N-myristoyltransferase (NMT) in the presence of diverse octapeptides using macro-arrays. From the set of 1126 negative examples, i.e., proteins experimentally tested as not myristoylated, we randomly selected 232 negative proteins to obtain a training dataset balanced between positive and negative examples.
Ten-fold cross-validation sets were generated such that similar octapeptides are collected within the same subset. To measure similarity, we adopted the Hamming distance, requiring pairs of octapeptides whose distance is less than 4 residues to be placed in the same subset. In this way, we avoid redundancy between training and testing sets.

Download:

Complete training dataset (Excel .xlsx)

Table 1. Summary statistics on the training dataset

	Positive training set	Negative training set
# proteins:	232	232
# species:	37	2
	Download TSV	Download TSV

Figure 1. Taxonomic distribution of the positive training set.

In the legend, numbers between square brackets refer to distinct species.

The SVMyr blind test datasets

We derived a blind test set from the in vitro experiment previously reported [1] and from in vivo experiments on four parasites [2-5]. From [1], we selected octapeptides that are non-redundant with respect to the positive training set used to train SVMyr. To measure similarity between octapeptides, we adopted the Hamming distance, including in the blind set only octapeptides with distance higher than four residues from every peptide used in training. To these examples, we added MYR proteins found in global profiling studies of Trypanosoma brucei [2], Trypanosoma cruzi [3], Leishmania donovani [4], and Plasmodium falciparum [5]. We ended up with 88 positive examples.
To collect negative examples for the blind test, we used the remaining part of the in vitro negative dataset [1]. To this we added proteins with an annotated N-terminal acetyl glycine in SwissProt. After removing redundancy with the same Hamming distance threshold, we ended up with 528 negative examples.
To test SVMyr on the task of detecting internal, post-translational myristoylation sites in Metazoa, we derived from SwissProt an additional dataset including 4 proteins experimentally annotated with the feature (3 human and one bovine). Furthermore, we included 11 human proteins reported as post-translationally myristoylated in an in vivo experiment with induced apoptosis [6] and not included in UniProt.

Download:

All blind test datasets (Excel .xlsx)

Table 2. Summary statistics on the blind test dataset

	Positive MYR set	Negative set	Post-translational MYR set
# proteins:	88	528	15
# species:	6	10	2
	Download TSV	Download TSV	Download TSV

Figure 2. Taxonomic distribution of the blind test dataset (co-translational).

In the legend, numbers between square brackets refer to distinct species.

Proteome analysis of co-translational MYR

We performed full-proteome analysis on 8 different organisms: H.sapiens, M.musculus, A.thaliana, S.cerevisiae, T.brucei, T.cruzi, L.donovani and P.falciparum. The respective reference proteomes were downloaded from UniProtKB.
MYR predictions were compared with available experimental and electronic MYR annotations in UniProtKB, as well as with proteome-scale experimental studies on the 8 organisms, when available [1-5]. Overall, SVMyr identified 931 new MYR substrates that are good candidates for further validation (Table 3). These new substrates, along with functional annotations, can be downloaded here.
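As an illustration of how candidate sequences might be collected from a reference proteome before prediction, the sketch below reads FASTA-formatted lines and keeps the proteins that start with Met-Gly. It assumes that the "Gly-ome" reported in Table 3 means proteins with Gly at position 2; the source does not define the term, and the function names are hypothetical.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into a mapping of identifier to sequence."""
    records: Dict[str, str] = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the identifier.
            header = line[1:].split()[0]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def gly_ome(proteome: Dict[str, str]) -> Dict[str, str]:
    """Return the Gly-starting octapeptide of each protein beginning with MG.

    Assumption: the Gly-ome is the set of proteins with Gly at position 2.
    """
    return {
        accession: sequence[1:9].upper()
        for accession, sequence in proteome.items()
        if len(sequence) >= 9 and sequence[:2].upper() == "MG"
    }
```

Passing an open FASTA file handle to `read_fasta` and the result to `gly_ome` yields the octapeptides that would be submitted to the predictor; the size of the returned mapping corresponds to the Gly-ome size under the stated assumption.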

Download:

New MYR substrates predicted in 8 species

Table 3. Summary statistics on proteome-wide analysis

Species	Proteome size	Gly-ome size	New MYR substrates
H.sapiens	79038	5243	183	Download TSV
M.musculus	55341	3788	272	Download TSV
A.thaliana	39334	3457	68	Download TSV
S.cerevisiae	6050	288	21	Download TSV
T.brucei	8587	463	63	Download TSV
T.cruzi	19242	867	181	Download TSV
L.donovani	7960	412	99	Download TSV
P.falciparum	5384	232	44	Download TSV

References

[1] Castrec, B. et al. (2018). Structural and genomic decoding of human and plant myristoylomes reveals a definitive recognition pattern. Nat. Chem. Biol., 14, 671-679.
[2] Wright, M. H. et al. (2016). Global Profiling and Inhibition of Protein Lipidation in Vector and Host Stages of the Sleeping Sickness Parasite Trypanosoma brucei. ACS Infect. Dis., 2, 427-441.
[3] Roberts, A. J., & Fairlamb, A. H. (2016). The N-myristoylome of Trypanosoma cruzi. Sci. Rep., 6, Article 31078.
[4] Wright, M. H. et al. (2015). Global Analysis of Protein N-Myristoylation and Exploration of N-Myristoyltransferase as a Drug Target in the Neglected Human Pathogen Leishmania donovani. Chem. Biol., 22, 342-354.
[5] Schlott, A. C. et al. (2018). N-Myristoylation as a Drug Target in Malaria: Exploring the Role of N-Myristoyltransferase Substrates in the Inhibitor Mode of Action. ACS Infect. Dis., 4, 449-457.
[6] Thinon, E. et al. (2014). Global profiling of co- and post-translationally N-myristoylated proteomes in human cells. Nat. Commun., 5, Article 4919.