Methods

This page contains descriptions of the methods we used to generate the data in ErythronDB. Details about the cell fractionation protocol can be found on the Data and Resources page.


Gene Expression Profiles

Global gene expression in primitive, fetal definitive, and adult definitive erythroid cells was profiled using the Affymetrix Mouse 430_2 array. The affy package of the Bioconductor toolbox was used to normalize the raw data and to make MAS5 presence/absence calls for each probeset. Within each lineage, a probeset called present in two or more replicates of at least one condition (erythroid cell stage) was considered to be associated with a transcript expressed in our study system. Expressed probesets were then mapped to genes using Bioconductor and the mouse4302.db annotation library. When a gene mapped to multiple expressed probesets, the values were averaged.
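The filtering and averaging rules above can be illustrated with a minimal Python sketch. It is not the Bioconductor code used to build ErythronDB: the function names, the probeset IDs (ps_1, ps_2, ps_3), the gene name and the stage labels are all hypothetical, and the MAS5 calls are assumed to be already available as 'P' (present), 'M' (marginal) or 'A' (absent) for one lineage.

```python
from collections import defaultdict
from statistics import mean


def is_expressed(calls_by_stage, min_present=2):
    """True if the probeset is called present ('P') in at least
    `min_present` replicates of at least one cell stage."""
    return any(calls.count("P") >= min_present
               for calls in calls_by_stage.values())


def gene_level_values(probeset_values, probeset_to_gene, expressed):
    """Average, sample by sample, the values of all expressed probesets
    that map to the same gene."""
    grouped = defaultdict(list)
    for probeset, values in probeset_values.items():
        if probeset in expressed and probeset in probeset_to_gene:
            grouped[probeset_to_gene[probeset]].append(values)
    return {gene: [mean(column) for column in zip(*rows)]
            for gene, rows in grouped.items()}


# Hypothetical toy data for a single lineage (illustration only).
calls = {
    "ps_1": {"stage_1": ["P", "P", "A"], "stage_2": ["A", "A", "A"]},
    "ps_2": {"stage_1": ["P", "A", "A"], "stage_2": ["A", "P", "A"]},
    "ps_3": {"stage_1": ["A", "A", "A"], "stage_2": ["P", "P", "P"]},
}
values = {"ps_1": [8.0, 6.0], "ps_2": [3.0, 3.0], "ps_3": [10.0, 8.0]}
mapping = {"ps_1": "GeneA", "ps_2": "GeneA", "ps_3": "GeneA"}

expressed = {ps for ps, by_stage in calls.items() if is_expressed(by_stage)}
profiles = gene_level_values(values, mapping, expressed)
```

In this toy case ps_2 is present only once per stage, so it is excluded, and the gene-level values are the averages of ps_1 and ps_3.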

Average Expression Profiles: To summarize expected gene expression in each lineage during erythrocyte development, lineage-specific expression values were averaged across cell stage replicates.

On ErythronDB gene detail pages, expression profiles and probeset-level data are provided for genes that mapped to probesets found to be expressed in at least one erythroid lineage. Only probeset-level expression is available for genes that mapped to probesets not significantly expressed in the dataset. No data are available for genes that did not map to Affymetrix Probeset IDs.

On ErythronDB strategy pages (lists of results), lineage-specific expression profiles are provided only for genes that mapped to expressed probesets. All others are annotated as "not expressed" or "data unavailable", depending on whether the gene could be mapped to the array.

Genes not present or not expressed in the dataset can be removed from search results by applying the relevant filter for expression in all or a subset of the erythroid lineages.

 

Predicting Regulatory Relationships between Transcription Factors and Genes

Transcriptional regulation of gene expression is carried out in part by transcription factors (TFs) that recognize a suite of short, similar DNA sequences, or binding sites. A family of binding sites can be represented by a positional weight matrix (PWM), and the PWM can in turn be used to predict putative binding sites on the genome. We have made transcriptional regulation predictions using these matrices for several factors key to erythropoiesis. However, because binding sites are short and diverse relative to their length, many predicted sites are non-functional. Thus, we have also taken steps to increase the reliability of the predictions.

We assembled a collection of PWMs for each TF of interest. Positional weight matrices were selected from TRANSFAC 9.4 and JASPAR (2005/01/08). For some factors (e.g., Klf1), a PWM was not available, so published binding consensus sequences were used instead. For each PWM or sequence we determined a scoring threshold that controls the false positive rate to a reasonable maximum. We then predicted binding sites in conserved regions within 1000bp of the transcription start site (TSS) of EntrezGenes aligned to the genome. EntrezGene coordinates were obtained from the UCSC Genome Browser.

Putative binding sites were identified as sections of the mouse genome that had at least 70% similarity to the PWM. Similarity was determined using the Transcription Element Search System (TESS). If a predicted binding site falls within 1000bp of the TSS of an EntrezGene alignment to the genome, then the gene is considered to be potentially regulated by the corresponding transcription factor.
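As a rough illustration of the two criteria above (a similarity threshold and a distance window around the TSS), the following Python sketch scans a sequence with a toy matrix. It is not TESS and does not reproduce its scoring: here "similarity" is assumed to be the window score rescaled between the lowest and highest scores the matrix can produce, the matrix values are invented, and the restriction to conserved regions is not modeled. Sequences are assumed to contain only uppercase A, C, G and T.

```python
def pwm_bounds(pwm):
    """Lowest and highest total scores any sequence can get from the matrix."""
    lowest = sum(min(column.values()) for column in pwm)
    highest = sum(max(column.values()) for column in pwm)
    return lowest, highest


def scan(sequence, pwm, min_similarity=0.7):
    """Return (offset, similarity) for every window whose rescaled score
    is at or above `min_similarity` (an assumed definition of similarity)."""
    lowest, highest = pwm_bounds(pwm)
    width = len(pwm)
    hits = []
    for offset in range(len(sequence) - width + 1):
        window = sequence[offset:offset + width]
        score = sum(column[base] for column, base in zip(pwm, window))
        similarity = (score - lowest) / (highest - lowest)
        if similarity >= min_similarity:
            hits.append((offset, similarity))
    return hits


def near_tss(site_position, tss_position, max_distance=1000):
    """True if the site lies within `max_distance` bp of the TSS, on either side."""
    return abs(site_position - tss_position) <= max_distance


# Hypothetical 4-column matrix favouring the sequence GATA (illustration only).
toy_pwm = [
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]
hits = scan("CCGATACC", toy_pwm)
```

Only the window starting at offset 2 (GATA) passes the 0.7 threshold in this example; a gene would then be linked to the factor if near_tss is true for that site and the gene's TSS.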

 

Identifying Genes with Bone Marrow-Specific Expression

Surveys of gene expression across a variety of tissues allow us to assess the specificity of such expression. ErythronDB uses the H and Q metrics to identify tissue-specific expression of genes. We consider two types of specificity: overall and categorical. Genes whose expression distribution across tissues is non-uniform are considered to exhibit tissue-specific expression. Genes that have an expression pattern biased toward a particular tissue (or category, e.g., bone marrow) are specific to that category. Overall specificity is measured by H. Categorical specificity is measured by Q. In both cases, lower values of these metrics indicate more specific genes.

H is a measure of Shannon entropy. For a gene, g, Hg can range from 0 (a gene that is expressed in a single tissue) to log2(n) (a gene expressed evenly across all tissues), where n is the number of tissues being compared. Thus, the lower the value of Hg, the more specific the expression of the gene among the set of tissues. The higher the Hg, the more ubiquitous its expression pattern among the tissues being compared.

Qg,t is a measure of categorical (conditional) specificity for a gene, g, in a tissue of interest, t. It is calculated as the difference between the Shannon entropy, Hg, and the base-2 log of the probability that the gene is expressed in the tissue of interest. Qg,t emphasizes tissues that are secondary sites of expression when the gene is specific overall. Q is 0 for a gene that is expressed only in the tissue of interest, and is equal to 2 * log2(n) for a gene that is uniformly expressed across all n tissues.
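The two metrics can be checked against the limiting cases given above with a short Python sketch. It assumes that the probability of expression in a tissue is the gene's expression level in that tissue divided by its total expression across all tissues; the function names and the toy expression vectors are illustrative and are not taken from the ErythronDB pipeline.

```python
import math


def tissue_probabilities(expression):
    """Relative expression per tissue (assumed: level / total across tissues)."""
    total = sum(expression)
    return [level / total for level in expression]


def shannon_entropy(expression):
    """H for one gene, in bits: 0 for a single tissue, log2(n) if uniform."""
    return -sum(p * math.log2(p)
                for p in tissue_probabilities(expression) if p > 0)


def categorical_specificity(expression, tissue_index):
    """Q for one gene and one tissue: H minus log2 of the tissue probability."""
    p = tissue_probabilities(expression)[tissue_index]
    if p == 0:
        return math.inf  # gene not detected in the tissue of interest
    return shannon_entropy(expression) - math.log2(p)


uniform = [1.0, 1.0, 1.0, 1.0]      # evenly expressed across n = 4 tissues
single = [0.0, 0.0, 5.0, 0.0]       # expressed only in tissue index 2

h_uniform = shannon_entropy(uniform)             # log2(4)
q_uniform = categorical_specificity(uniform, 2)  # 2 * log2(4)
h_single = shannon_entropy(single)               # 0
q_single = categorical_specificity(single, 2)    # 0
```

With four tissues, the uniform gene reaches the maxima log2(4) and 2 * log2(4), while the single-tissue gene scores 0 on both metrics, matching the ranges described in the text.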

H-values for genes in ErythronDB range from 0.18 (specific) to 5.2 (non-specific).

Q-values for genes in ErythronDB range from 2.3 (specific) to 16.2 (non-specific).

The H and Q statistics are described in this paper:

Schug, J., Schuller, W.-P., Kappen, C., Salbaum, M.J., Bucan, M., Stoeckert, C.J., Promoter features related to tissue specificity as measured by Shannon entropy, Genome Biology 2005.

Data Sources

The tissue surveys used in ErythronDB to compute H values are those from the Gene Expression Atlas 2; in particular we used the survey of Mouse Normal Adult Tissues, Brain and Skin Tissues Collapsed. You can view the H-value and expression pattern in this tissue survey for any gene by browsing to its detail page (search on the gene symbol, EntrezGene, or MGI identifier) and following the link to GeneAtlas Expression (e.g., for Gata1). Tissue specificity in ErythronDB is based on the dataset listed as GeneAtlas GNF1M, gcrma in the visualization tool.

H and Q values were computed at the reporter (Affymetrix ProbeSet) level and, for a given gene, the minimum value over its reporters was taken.

 

Identifying Sentinel Contaminants

The presence of alpha-fetoprotein and transthyretin expression in some primitive erythroid samples derived from E9.5 and E10.5 yolk sac suggested the possibility of visceral endoderm cell contamination. Likewise, the presence of mast cell- and lymphoid-specific transcripts in definitive erythroid samples from the adult bone marrow suggested contamination of some samples with non-erythroid hematopoietic cells.

We identified a list of 59 additional transcripts potentially derived from non-erythroid cells contaminating the FACS-isolated cell populations. These sentinel contaminants were selected as genes known to exhibit tissue-specific expression in non-erythroid cells within the yolk sac (visceral endoderm), fetal liver (hepatocytes), or bone marrow (myeloid and lymphoid cells).

Pearson correlations were then used to identify additional transcripts with an expression fingerprint similar to that of the 59 sentinel contaminants across the 60 cell populations. Using a cut-off of r > 0.9, 264 probesets were identified as likely contaminants and excluded from all computational and functional annotation analyses accessible in ErythronDB.
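The correlation screen can be sketched in Python as follows. This is an illustration under stated assumptions, not the analysis code: the text does not say how correlations to the individual sentinels were combined, so the sketch assumes a probeset is flagged when it correlates above the cut-off with any single sentinel, and all identifiers and profile values are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def flag_contaminants(profiles, sentinels, cutoff=0.9):
    """Return non-sentinel probesets with r > cutoff to at least one sentinel
    (the 'any sentinel' rule is an assumption of this sketch)."""
    flagged = set()
    for name, profile in profiles.items():
        if name in sentinels:
            continue
        if any(pearson(profile, profiles[s]) > cutoff for s in sentinels):
            flagged.add(name)
    return flagged


# Hypothetical profiles across five cell populations (illustration only).
toy_profiles = {
    "sentinel_1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "ps_a": [2.0, 4.0, 6.0, 8.0, 10.5],   # tracks the sentinel closely
    "ps_b": [5.0, 1.0, 4.0, 2.0, 3.0],    # unrelated pattern
}
flagged = flag_contaminants(toy_profiles, {"sentinel_1"})
```

Here only ps_a follows the sentinel's pattern closely enough to be flagged; in the real dataset each profile would span the 60 cell populations.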

View the complete list of genes associated with these transcripts.

 

Pairwise Comparisons of Expression (Differential)

Lists of genes differentially expressed between pairs of conditions (between equivalent stages across lineages and between subsequent stages within lineages) were generated using the PaGE algorithm, a permutation-based method with False Discovery Rate (FDR) multiple testing correction.

Comparisons were made with the Perl version of PaGE with the option --level_confidence set to 0.8 for expression (which corresponds to an FDR of 20%) and with --tstat on (see the PaGE website for command options and their meaning). One of the conditions was chosen as the "reference_condition" and the other condition was compared to it (so "up" or "down" for a given condition refers to how it compares to the reference condition). The analyses were run at the reporter level, after the dataset was filtered to remove non-expressed or contaminant-associated reporters (see section above on Sentinel Contaminants).

 

Gene-Interaction Networks

Expression profiles for each gene were determined as described above (see the Gene Expression Profiles section), except that values for cell stage replicates were not averaged but were instead treated as independent samples.

Within each erythroid lineage, the pairwise Pearson correlation (r) was calculated between the profiles of all expressed genes. The significance of correlations was assessed using a t-statistic, and p-values <= 0.05 were considered significant. A co-expression network was generated by drawing edges between all genes found to be significantly correlated. This analysis resulted in three erythroid lineage-specific networks comprising ~38,000,000 relationships involving ~14,000 genes expressed in the ErythronDB dataset. Although this seems large, the inferred co-expression network is actually sparse, representing a selected subset (<20%) of all possible gene-interactions.
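The edge-building step can be sketched in Python as follows. The text specifies only that a t-statistic was used, so this sketch assumes the standard two-sided test for a Pearson correlation, t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom, and obtains the p-value by numerically integrating the Student's t density. Function names and the toy profiles are illustrative, and profiles are assumed to have at least three samples.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value for Student's t. With u = sqrt(df) * tan(theta) the
    central area becomes an integral of cos(theta)**(df - 1) over a bounded
    range, evaluated here with Simpson's rule (steps must be even)."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(math.pi) * math.gamma(df / 2))
    upper = math.atan(abs(t) / math.sqrt(df))
    h = upper / steps
    area = 1.0 + math.cos(upper) ** (df - 1)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * math.cos(i * h) ** (df - 1)
    return max(0.0, 1.0 - 2.0 * c * area * h / 3.0)


def correlation_p_value(r, n):
    """p-value for a correlation r observed over n samples (n >= 3)."""
    if abs(r) >= 1.0:
        return 0.0
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return two_sided_p(t, n - 2)


def coexpression_edges(profiles, alpha=0.05):
    """Return (gene_a, gene_b, r, p) for every pair with p <= alpha."""
    edges = []
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        p = correlation_p_value(r, len(profiles[a]))
        if p <= alpha:
            edges.append((a, b, r, p))
    return edges


# Hypothetical profiles over six samples (illustration only).
toy = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "g2": [2.1, 3.9, 6.2, 8.1, 9.8, 12.2],
    "g3": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
}
edges = coexpression_edges(toy)
```

In this toy example only g1 and g2 are joined by an edge. Testing every pair is quadratic in the number of genes, so an analysis at the scale described above would normally use a vectorized implementation rather than this pure-Python loop.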

As part of an ongoing effort, this network is supplemented by external annotations, including KEGG pathway interactions and published ChIP-Seq data, to increase its utility and the confidence in its predictions of gene-interactions.

Transcriptional Regulatory Network: A transcriptional regulatory network was inferred by identifying 1080 putative transcriptional regulators using the Gene Ontology annotations GO:0003700, GO:0006350 and GO:0006351. Within each erythroid lineage, these factors and their putative targets (significantly co-expressed genes) were isolated from the complete co-expression network. Putative factor-target relationships were further annotated based on computational predictions of transcription factor binding (see section on Predicting Regulatory Relationships).