Q. How are the GO Function Predictions generated for the protein sequences?
A.
To assign GO Function(s) to proteins computationally, we assume that
a protein domain, contained within a protein, is responsible for a particular
function.
For example, a DNA binding domain, within a protein, is responsible
for the GO Function DNA binding.
Using this premise, we created protein domain-GO Function rules associating
GO Function(s) with protein domains.
To create the protein domain-GO Function rules, we utilized protein
domains and a set of fly, yeast and mouse proteins that had manually annotated
GO Functions.
The protein domains used were defined by the ProDom domain database
and the Conserved Domain Database (CDD) at NCBI: CDD-pfam, CDD-smart and CDD-LOAD.
The yeast, fly and mouse proteins, which had manually annotated GO Functions,
were obtained from the Gene Ontology (GO) Consortium.
The following example illustrates how a protein domain was associated with
particular GO Function(s) to create a protein domain-GO Function rule:
The protein domain pfam00125 is a protein sequence described as Core histone H2A/H2B/H3/H4.
Through BLAST similarity searching, the protein domain is found in three
proteins from the yeast, fly and mouse set that have annotated GO Function(s).
| Name  | Identifier  | Annotated GO Function(s)         | p-value of similarity to pfam domain |
|-------|-------------|----------------------------------|--------------------------------------|
| H2AX  | P27661      | nucleic acid binding:DNA binding | 5 x 10^-19                           |
| HTB1  | YDR224C     | nucleic acid binding:DNA binding | 9 x 10^-22                           |
| His4r | FBgn0013981 | nucleic acid binding:DNA binding | 1 x 10^-7                            |
The annotated GO Function(s) of the three proteins are in good agreement,
so their intersection is used to generate the following GO Function rule for
the protein domain:
pfam00125 nucleic acid binding:DNA binding 1 x 10^-7
In the above example, if a protein sequence contains the domain with a BLAST
similarity (p-value) of 1 x 10^-7 or lower, the sequence is given the
predicted GO Function nucleic acid binding:DNA binding.
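
The rule derivation in the pfam00125 example can be sketched in Python as shown here. This is a minimal illustration, not the pipeline's actual code: the names `Hit`, `Rule` and `derive_rule` are hypothetical, "good agreement" is simplified to a strict set intersection, and the rule threshold is assumed to be the weakest (largest) p-value among the supporting proteins, which matches the 1 x 10^-7 value in the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """An annotated protein that matched the domain (hypothetical type)."""
    name: str
    identifier: str
    go_functions: frozenset
    p_value: float


@dataclass(frozen=True)
class Rule:
    """A protein domain-GO Function rule (hypothetical type)."""
    domain: str
    go_functions: frozenset
    p_threshold: float


def derive_rule(domain, hits):
    """Return a Rule, or None when no GO Function is shared by all hits."""
    if not hits:
        return None
    # Assumption: agreement is modeled as a strict intersection.
    shared = frozenset.intersection(*(h.go_functions for h in hits))
    if not shared:
        return None
    # Assumption: the threshold is the weakest p-value among the hits.
    return Rule(domain, shared, max(h.p_value for h in hits))


dna_binding = frozenset({"nucleic acid binding:DNA binding"})
hits = [
    Hit("H2AX", "P27661", dna_binding, 5e-19),
    Hit("HTB1", "YDR224C", dna_binding, 9e-22),
    Hit("His4r", "FBgn0013981", dna_binding, 1e-7),
]
rule = derive_rule("pfam00125", hits)
```

With the three proteins from the table, `rule` holds the domain pfam00125, the single GO Function nucleic acid binding:DNA binding, and a threshold of 1e-7.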
Many protein domain-GO Function rules were applied to translated DoTS
transcripts or the predicted protein sequences within PlasmoDB to generate
the GO Function predictions.
Not every protein contains a domain covered by a rule, so some proteins
receive no predicted GO Functions.
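
Applying the rules to a sequence can be sketched as follows. This is an illustration under stated assumptions rather than the actual implementation: `predict_go_functions` and `RULES` are hypothetical names, the domain hits for a sequence are assumed to be available as a mapping from domain identifier to BLAST p-value, and the only rule shown is the pfam00125 rule from the example.

```python
def predict_go_functions(domain_hits, rules):
    """Collect GO Functions from every rule whose p-value threshold is met.

    domain_hits: dict of domain id -> BLAST p-value for one protein sequence
    rules: dict of domain id -> (tuple of GO Functions, p-value threshold)
    """
    predicted = set()
    for domain, p_value in domain_hits.items():
        rule = rules.get(domain)
        if rule is None:
            continue  # no rule exists for this domain
        go_functions, threshold = rule
        if p_value <= threshold:
            predicted.update(go_functions)
    return predicted


# Hypothetical rule table holding only the rule from the example.
RULES = {"pfam00125": (("nucleic acid binding:DNA binding",), 1e-7)}

print(predict_go_functions({"pfam00125": 3e-12}, RULES))
# {'nucleic acid binding:DNA binding'}
print(predict_go_functions({"pfam00125": 1e-3}, RULES))
# set()
```

The second call returns an empty set because the hit is weaker than the rule's threshold; a sequence with no matching domain at all gives the same result.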