AnnotCompute

About AnnotCompute

AnnotCompute is a tool for identifying similar functional genomics experiments (mainly microarray experiments) based on standardized annotations that use MGED Ontology (MO) [1] terms.

AnnotCompute contains three components:

  1. Extractor: extracts annotations from the MAGE-TAB[2] files.
  2. Comparator: computes dissimilarity measures between all pairs of experiments using the extracted annotations.
  3. Query Processor: identifies similar experiments based on the dissimilarity matrix generated by the Comparator.

Currently we provide searches for similar experiments available in the ArrayExpress[3] public repository. The experiments in ArrayExpress are in MAGE-TAB format and make use of the MGED ontology (MO) as indicated below.

The following eight annotation components covering the biological intent and context of experiments were retrieved from the MAGE-TAB files:

  1. Experiment Name: free text
  2. Experiment Description: free text
  3. Experiment Design Types: MO terms
  4. Experiment Factor Types: MO terms
  5. Experiment Factor Values: free text, measurement data, or ontology terms
  6. Biomaterial Characteristics of Biosources: types are ontology terms; values are free text, measurement data, or ontology terms. This component includes the taxon, which is given as ontology terms
  7. Protocol Types: MO terms
  8. Protocol Descriptions: free text

Annotations were successfully extracted for 34391 experiments from a total of 34622 ArrayExpress experiments (as of January 2013).

We first evaluated how well each experiment was annotated by summing the scores of its annotation components. The scores assigned to the different components were as follows:

Higher scores indicate more richly annotated experiments.

Dissimilarity measures for pairs of experiments were computed based on the Kulczynski distance as described below.

For each annotation component, let A and B be the sets of annotation terms that the two experiments have for that component. We defined the component-wise distance between the two experiments as:

         Kulczynski Distance = 1 - ½ ( |A ∩ B| / |A| + |A ∩ B| / |B| )
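
As an illustration, the component-wise distance can be sketched in a few lines of Python. This is a minimal sketch, not the AnnotCompute implementation: it reads the denominators as the set sizes |A| and |B|, the function name is hypothetical, the handling of empty term sets is an assumption (the text does not cover that case), and the MO-style terms are example values only.

```python
def kulczynski_distance(a, b):
    """Kulczynski distance between two collections of annotation terms."""
    a, b = set(a), set(b)
    if not a or not b:
        # Assumption (not stated in the source): two empty sets are treated
        # as identical, and one empty set as maximally distant.
        return 0.0 if not a and not b else 1.0
    shared = len(a & b)
    return 1.0 - 0.5 * (shared / len(a) + shared / len(b))


# Illustrative term sets for one component (Experiment Design Types).
design_a = {"time_series_design", "compound_treatment_design"}
design_b = {
    "compound_treatment_design",
    "dose_response_design",
    "genetic_modification_design",
    "strain_or_line_design",
}

# One shared term: 1 - 0.5 * (1/2 + 1/4) = 0.625
distance = kulczynski_distance(design_a, design_b)
```

Identical term sets give a distance of 0 and disjoint term sets give a distance of 1, which matches the similarity range of 0 to 1 described for the tool.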

The dissimilarity between two experiments was defined as the weighted average of their component-wise distances. The similarity score between two experiments is defined as (1 - dissimilarity score). If two experiments are identical in all annotation components, their similarity score is 1. If the two experiments share no terms in any annotation component, their similarity score is 0.

We tried various combinations of 0/1 weights to include or exclude annotation components, optimizing the results against gold standards that were generated manually based on keyword searches. Including the annotation components Experiment Name, Experiment Design Types, Experiment Factor Types, Experiment Factor Values, and Biomaterial Characteristics of Biosources generally gave the best results, and this combination is used by the AnnotCompute tool to compute dissimilarity measures.
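
The weighted average with 0/1 weights can be sketched as follows. This is an illustrative sketch under stated assumptions, not the tool's code: the function names, the dictionary layout (component name mapped to a set of terms), and the treatment of empty components are all hypothetical, and the example uses only three components to keep the arithmetic easy to follow.

```python
def kulczynski_distance(a, b):
    """Component-wise Kulczynski distance between two term collections."""
    a, b = set(a), set(b)
    if not a or not b:
        # Assumption: both empty counts as identical, one empty as distant.
        return 0.0 if not a and not b else 1.0
    shared = len(a & b)
    return 1.0 - 0.5 * (shared / len(a) + shared / len(b))


def dissimilarity(exp1, exp2, weights):
    """Weighted average of component-wise distances.

    exp1, exp2: dicts mapping a component name to its terms (hypothetical layout).
    weights: dict mapping a component name to its weight (0 excludes it).
    """
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one component must have a non-zero weight")
    weighted = sum(
        w * kulczynski_distance(exp1.get(c, ()), exp2.get(c, ()))
        for c, w in weights.items()
    )
    return weighted / total


def similarity(exp1, exp2, weights):
    return 1.0 - dissimilarity(exp1, exp2, weights)


# Illustrative data: Protocol Types is excluded by its weight of 0.
weights = {
    "Experiment Design Types": 1,
    "Experiment Factor Types": 1,
    "Protocol Types": 0,
}
exp_a = {
    "Experiment Design Types": {"time_series_design"},
    "Experiment Factor Types": {"time"},
    "Protocol Types": {"labeling"},
}
exp_b = {
    "Experiment Design Types": {"time_series_design", "compound_treatment_design"},
    "Experiment Factor Types": {"compound"},
    "Protocol Types": {"labeling"},
}

# Design distance 0.25, factor distance 1.0, average 0.625, similarity 0.375.
score = similarity(exp_a, exp_b, weights)
```

With 0/1 weights the weighted average reduces to a plain mean over the included components, so excluded components have no influence on the score.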

We can provide an option for users to choose the weight assigned to each component if desired.

AnnotCompute provides two kinds of experiment searches: experiments of interest are identified either (i) by a set of keywords or (ii) by annotation-based similarity to a query experiment in MAGE-TAB format. The returned experiments are further categorized using hierarchical clustering (average agglomeration method) based on annotation dissimilarity. The number of clusters is defined by the user. AnnotCompute also suggests a cluster number according to the following formula:

         Cluster Number = max(min (square_root(Number of experiments), 10), 3)
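
The formula clamps the square root of the number of experiments to the range 3 to 10. A minimal Python sketch is shown below; the function name is hypothetical, and rounding the square root down to an integer is an assumption, since the text does not say how a fractional value is handled.

```python
import math


def suggested_cluster_number(n_experiments):
    """Suggested cluster number: sqrt(n) clamped to the range [3, 10]."""
    # Assumption: the square root is rounded down to a whole number.
    root = math.isqrt(n_experiments)
    return max(min(root, 10), 3)


small = suggested_cluster_number(4)     # sqrt is 2, raised to the minimum of 3
medium = suggested_cluster_number(49)   # sqrt is 7, used as is
large = suggested_cluster_number(400)   # sqrt is 20, capped at 10
```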

Find similar experiments using keywords

The user enters a set of keywords in the keywords input field. Multiple keywords should be separated by commas, and a keyword can contain spaces. Once the keywords are entered, AnnotCompute displays the number of experiments containing them next to the input field (this may take a few seconds). This value serves as guidance on how many clusters might make sense. The user then provides the number of clusters into which the search results should be categorized. The cluster number suggested by AnnotCompute is shown to the right of the input field.

The keywords can be searched in all the extracted annotation components or limited to one component. In addition, the search can require experiments to contain all of the given keywords or at least one of them.
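
The keyword options described above can be sketched as follows. This is a hedged illustration rather than the tool's actual matching logic: the function names and the dictionary layout are hypothetical, and case-insensitive substring matching is an assumption, since the text does not specify how keywords are compared with annotations.

```python
def parse_keywords(raw):
    """Split a comma-separated keyword string; keywords may contain spaces."""
    return [k.strip().lower() for k in raw.split(",") if k.strip()]


def matches(annotations, keywords, component=None, require_all=True):
    """Check one experiment against the keywords.

    annotations: dict mapping a component name to its text (hypothetical layout).
    component: restrict the search to one component, or None for all of them.
    require_all: True to require every keyword, False to accept any one.
    """
    if not keywords:
        return False
    if component is None:
        text = " ".join(annotations.values())
    else:
        text = annotations.get(component, "")
    text = text.lower()
    # Assumption: a keyword matches when it occurs as a substring.
    hits = [k in text for k in keywords]
    return all(hits) if require_all else any(hits)


# Illustrative experiment record.
experiment = {
    "Experiment Name": "Heart failure time course in mouse",
    "Experiment Description": "Expression profiling of cardiac tissue",
}
keywords = parse_keywords("heart failure, cardiac tissue , rat")
```

Here `keywords` is `["heart failure", "cardiac tissue", "rat"]`, so the record matches when any keyword is enough but not when all keywords are required.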

After the user clicks the "Find experiments" button, AnnotCompute returns the experiments containing the given keywords, grouped into the user-defined number of clusters. Each cluster is shown with a description and with the commonly used terms in each annotation component. The description is a list of prominent terms in the cluster, generated by the tf-idf method [4], which is commonly used in information retrieval and text mining. The commonly used terms are the top 3 terms that annotate at least two experiments in the cluster. The cluster description helps users evaluate whether a cluster is of interest to them.
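
Both kinds of cluster summary can be illustrated with a short Python sketch. The exact tf-idf variant used by AnnotCompute is not given in the text, so the weighting below (term frequency within a cluster times the logarithm of the inverse cluster frequency, with each cluster treated as one document) is an assumption, as are the function names and the toy data.

```python
import math
from collections import Counter


def commonly_used_terms(cluster, top_n=3, min_experiments=2):
    """Top terms that annotate at least `min_experiments` experiments."""
    counts = Counter(term for terms in cluster for term in set(terms))
    frequent = [(t, c) for t, c in counts.items() if c >= min_experiments]
    frequent.sort(key=lambda tc: (-tc[1], tc[0]))  # by count, then name
    return [t for t, _ in frequent[:top_n]]


def tfidf_description(clusters, index, top_n=3):
    """Rank the terms of one cluster by tf-idf, one document per cluster."""
    docs = [Counter(t for terms in c for t in terms) for c in clusters]
    doc = docs[index]
    total = sum(doc.values())
    scores = {}
    for term, count in doc.items():
        clusters_with_term = sum(1 for d in docs if term in d)
        scores[term] = (count / total) * math.log(len(docs) / clusters_with_term)
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


# Toy data: each cluster is a list of experiments, each a list of terms.
clusters = [
    [["heart", "mouse"], ["heart", "mouse"], ["heart", "rat"]],
    [["liver", "mouse"], ["liver", "mouse"]],
]
common = commonly_used_terms(clusters[0])
described = tfidf_description(clusters, 0, top_n=2)
```

In this toy example "mouse" is a commonly used term in the first cluster, but it scores zero under tf-idf because it appears in every cluster, which is why the two summaries complement each other.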

Clicking a returned cluster displays detailed information about the experiments in that cluster, and each experiment is linked directly to the corresponding experiment page on the ArrayExpress website.

Find similar experiments based on annotations

The user provides a query experiment either in valid MAGE-TAB format or as an ArrayExpress ID. A query experiment in MAGE-TAB format should consist of two files: an IDF file and an SDRF file. The SDRF filename should be given in the IDF file.
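
To illustrate how the SDRF filename is looked up, the sketch below reads a tab-delimited IDF and returns the values of its "SDRF File" row. It assumes the IDF follows the MAGE-TAB layout of one tag per row with values in the following columns; the function name and the sample content (including the file name `example.sdrf.txt`) are hypothetical, and this is not the AnnotCompute Extractor itself.

```python
import csv
import io


def sdrf_filenames(idf_text):
    """Return the SDRF file names listed in the 'SDRF File' row of an IDF."""
    reader = csv.reader(io.StringIO(idf_text), delimiter="\t")
    for row in reader:
        # IDF rows start with a tag; matching is case-insensitive here.
        if row and row[0].strip().lower() == "sdrf file":
            return [value.strip() for value in row[1:] if value.strip()]
    return []


# Hypothetical IDF content for illustration only.
sample_idf = (
    "Investigation Title\tExample time course study\n"
    "Experimental Design\ttime_series_design\n"
    "SDRF File\texample.sdrf.txt\n"
)
names = sdrf_filenames(sample_idf)
```

An empty result signals that the IDF does not name its SDRF file, in which case the query would not meet the requirement stated above.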

The user can optionally provide keywords. If more than 100 similar experiments are found, AnnotCompute lists only the 100 most similar.

The experiment ID, experiment name, and similarity score (to the query experiment) are shown for each returned experiment and each such experiment is linked to the corresponding experiment page at the ArrayExpress website.
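
The ranking step can be sketched as a sort on the similarity score followed by a cut-off at 100 results. The function name, the identifiers, and the scores below are invented for illustration; breaking ties by experiment ID is an assumption made only to keep the order deterministic.

```python
def top_similar(similarities, limit=100):
    """Rank experiments by similarity to the query and keep the best `limit`.

    similarities: dict mapping an experiment ID to its similarity score.
    """
    ranked = sorted(similarities.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Invented scores for 150 hypothetical experiments.
scores = {f"EXP-{i:03d}": 1.0 - i / 200 for i in range(150)}
results = top_similar(scores)
```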

The experiments similar to the query experiment, together with the query experiment itself, can be further categorized into a user-defined number of clusters. The suggested cluster number is set as the default value in the input field. After the user clicks "Cluster experiments", AnnotCompute categorizes the experiments using hierarchical clustering based on the annotation dissimilarity scores between pairs of experiments. The clusters are displayed in the same way as described in "Find similar experiments using keywords". The cluster that includes the query experiment is indicated.
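
Hierarchical clustering with average agglomeration can be sketched in pure Python as below. This is a naive illustration of the technique under stated assumptions, not the tool's implementation: the function name is hypothetical, the input is assumed to be a symmetric dissimilarity matrix given as a list of lists, and the toy matrix is invented.

```python
def average_linkage_clusters(dist, k):
    """Agglomerative clustering with average linkage.

    dist: symmetric n x n dissimilarity matrix (list of lists).
    k: number of clusters to return.
    Returns a list of clusters, each a sorted list of row indices.
    """
    k = max(k, 1)
    clusters = [[i] for i in range(len(dist))]

    def linkage(c1, c2):
        # Average of all pairwise dissimilarities between the two clusters.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]  # merge the closest pair
        del clusters[y]
    return [sorted(c) for c in clusters]


# Toy matrix: experiments 0 and 1 are close, as are experiments 2 and 3.
toy = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
groups = average_linkage_clusters(toy, 2)
```

The repeated search for the closest pair is slow for large inputs but adequate for a result list capped at about a hundred experiments; a library routine would be the usual choice in practice.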

References

  1. MGED ontology [http://mged.sourceforge.net/ontologies/MGEDontology.php]
  2. MAGE-TAB [http://www.mged.org/mage-tab/]
  3. ArrayExpress [http://www.ebi.ac.uk/microarray-as/ae]
  4. Sparck Jones, K. (1972) A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1), 11-2

AnnotCompute is supported by NHGRI grant R21 HG004521.