COGRIM - Clustering of Genes into Regulons using Integrated Modeling
This tool is an R implementation of the Gibbs sampler for COGRIM. To
learn about R, please visit The R Project for Statistical Computing.
G. Chen, S. T. Jensen, C. Stoeckert, "Clustering of Genes into Regulons
using Integrated Modeling-COGRIM", Genome Biology, 2007 Jan 4;
8(1):R4 [PMID: 17204163]
[GB link]
Download
R program-cogrim.R and sample input files:
Gene ORF lists of the B+/C+, B-/C+ and B+/C- clusters for each
transcription factor in the yeast application
Background
The computational approaches used to identify regulatory modules and
networks have traditionally drawn on either expression data or sequence
features (ChIP binding data or binding motif data) of transcription
factors (TFs). Although those approaches have proven useful, their
power is inherently limited by the fact that each data resource
provides only partial information: expression data provide only
functional or indirect evidence, whereas binding data or binding motifs
provide only physical location information. Recent efforts to integrate
these data types have drawbacks, such as arbitrary parameter cutoffs or
heavy reliance on heuristics with little systematic modeling.
We present a Bayesian hierarchical model and Markov Chain Monte Carlo
implementation that integrates heterogeneous information, including
expression data and sequence features, in a principled and robust
fashion.
Our model, COGRIM, does not require the prior clustering of expression
data or many of the arbitrary parameter thresholds of previous methods.
Our applications represent both unicellular and mammalian organisms as
well as several scenarios of available data. We apply our model to S.
cerevisiae, where large amounts of ChIP binding data and gene
expression data are available. Our validation analyses show that
our predicted gene-TF interactions are very likely to be biologically
relevant. We also examine two transcription factors in
mammals: C/EBP-beta where TF binding site data, ChIP binding data and
expression data are all available, and SRF, where only TF binding site
data and gene expression data are available. In both of these
applications, we demonstrate the ability to predict gene-TF
interactions with reduced levels of false positives.
Our general
approach of Bayesian modeling for integrating heterogeneous biological
data to discover regulatory networks provides a framework for
overcoming the intrinsic limitations of available methods, and should
prove useful in applications to other organisms.
Workflow
About the Modeling - COGRIM
Based on the assumption that the expression levels of regulated genes
are affected by the expression levels of their regulators, the first
level of our model incorporates the gene expression data by specifying
the observed gene log-expression $g_{it}$ as a linear function of the
TF gene log-expression $f_{jt}$:

$$g_{it} = \alpha_i + \sum_{j=1}^{J} \beta_j C_{ij} f_{jt} + \epsilon_{it}$$

where
$C_{ij}$: indicator variable for whether or not gene $i$ is regulated
by TF $j$; $C_{ij} = 1$ if gene $i$ is regulated by TF $j$, and 0
otherwise
$g_{it}$: expression level of gene $i$ in state $t$, for
$i = 1, \ldots, N$ and $t = 1, \ldots, T$, where there are $N$ genes
and $T$ states (or tissues)
$f_{jt}$: expression level of TF $j$ in state $t$, for
$j = 1, \ldots, J$ and $t = 1, \ldots, T$, where there are $J$
transcription factors and $T$ states (or tissues)
$\alpha_i$: baseline expression for gene $i$ in the absence of known
TFs (i.e. $C_{ij} = 0$ for all $j$)
$\beta_j$: effect of TF $j$ on gene expression
$\epsilon_{it}$: residual error term
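As a small illustration of the first-level model, the following Python
sketch evaluates the linear predictor (the baseline plus the summed
effects of the TFs that regulate each gene) for a toy data set. It is
not taken from cogrim.R: the function name, the argument names alpha
and beta (standing for the baseline and TF-effect coefficients), and
all numeric values are assumptions made up for the example, and the
error term is left out.

```python
def expected_log_expression(alpha, beta, C, f):
    """Return mu[i][t] = alpha[i] + sum_j beta[j] * C[i][j] * f[j][t].

    alpha: baseline log-expression per gene (length N)
    beta:  effect per TF (length J)
    C:     N x J matrix of 0/1 regulation indicators
    f:     J x T matrix of TF log-expression values
    """
    n_tfs = len(beta)
    n_states = len(f[0])
    mu = []
    for i, baseline in enumerate(alpha):
        row = []
        for t in range(n_states):
            total = baseline
            for j in range(n_tfs):
                # TF j contributes only when C[i][j] == 1
                total += beta[j] * C[i][j] * f[j][t]
            row.append(total)
        mu.append(row)
    return mu


# Hypothetical toy data: 2 genes, 2 TFs, 3 states.
alpha = [1.0, 0.5]
beta = [2.0, -1.0]
C = [[1, 0],   # gene 0 is regulated by TF 0 only
     [1, 1]]   # gene 1 is regulated by both TFs
f = [[0.0, 1.0, 2.0],
     [1.0, 1.0, 0.0]]

mu = expected_log_expression(alpha, beta, C, f)
print(mu)
```

Setting every indicator for a gene to 0 returns that gene's baseline in
every state, which matches the interpretation of the baseline term.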
In this form, each TF $j$ has a single effect ($\beta_j$) on the
expression of gene $i$, which does not take into account the biological
reality that expression is often the result of synergistic or
antagonistic binding of multiple TFs. We acknowledge these
relationships by including interaction terms in our linear model:

$$g_{it} = \alpha_i + \sum_{j} \beta_j C_{ij} f_{jt} + \sum_{j<k} \gamma_{jk} C_{ij} C_{ik} f_{jt} f_{kt} + \epsilon_{it}$$

where we now have additional coefficients $\gamma_{jk}$ that can be
interpreted as the synergistic (or antagonistic) effect of both TFs $j$
and $k$ binding together to the same upstream region (in addition to
the effects of TF $j$ or $k$ binding in isolation).
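To make the role of the interaction coefficients concrete, here is a
minimal Python sketch of the linear predictor with pairwise terms. It
assumes that each interaction term is the product of the two indicators
and the two TF log-expression values, scaled by a pairwise coefficient.
The name gamma, its dictionary layout keyed by TF pairs (j, k) with
j < k, and the toy numbers are all hypothetical, and none of this is
taken from cogrim.R.

```python
def expected_log_expression_with_interactions(alpha, beta, gamma, C, f):
    """Linear predictor with main effects and pairwise TF interactions.

    gamma maps a TF pair (j, k), j < k, to its interaction coefficient.
    A pair contributes to gene i only when C[i][j] == C[i][k] == 1.
    """
    n_tfs = len(beta)
    n_states = len(f[0])
    mu = []
    for i, baseline in enumerate(alpha):
        row = []
        for t in range(n_states):
            total = baseline
            for j in range(n_tfs):
                total += beta[j] * C[i][j] * f[j][t]
            for (j, k), coef in gamma.items():
                total += coef * C[i][j] * C[i][k] * f[j][t] * f[k][t]
            row.append(total)
        mu.append(row)
    return mu


# Hypothetical toy data: 2 genes, 2 TFs, 3 states.
alpha = [1.0, 0.5]
beta = [2.0, -1.0]
gamma = {(0, 1): 0.5}  # assumed synergy between TF 0 and TF 1
C = [[1, 0],
     [1, 1]]
f = [[0.0, 1.0, 2.0],
     [1.0, 1.0, 0.0]]

mu_int = expected_log_expression_with_interactions(alpha, beta, gamma, C, f)
print(mu_int)
```

In this toy case only gene 1 has both indicators set, so only its
values can change, and only in states where both TFs have non-zero
log-expression.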