COGRIM - Clustering of Genes into Regulons using Integrated Modeling

This tool is an R implementation of the Gibbs sampler for COGRIM. To learn about R, please visit The R Project for Statistical Computing.

The R source code is available free under the general CBIL software agreement.
Please email questions to


G. Chen, S. T. Jensen, C. Stoeckert, "Clustering of Genes into Regulons using Integrated Modeling-COGRIM", Genome Biology, 2007, Jan. 4;8(1):R4  [PMID:17204163] [GB link]

R program-cogrim.R and sample input files:

Supplementary Material

The computational approaches used to identify regulatory modules and networks have traditionally drawn on either expression data or sequence features of transcription factors (TFs), such as ChIP binding data or binding motif data. Although these approaches have proven useful, their power is inherently limited because each data source provides only partial information: expression data provide only functional, indirect evidence, whereas binding data and binding motifs provide only physical location information. Recent efforts to integrate these data types have drawbacks, such as arbitrary parameter cutoffs or heuristic procedures with little systematic modeling.

We present a Bayesian hierarchical model and Markov chain Monte Carlo implementation that integrates heterogeneous information, including expression data and sequence features, in a principled and robust fashion. Our model, COGRIM, does not require prior clustering of expression data or many of the arbitrary parameter thresholds used by previous methods.

Our applications represent both unicellular and mammalian organisms as well as several scenarios of available data. We apply our model to S. cerevisiae, where large amounts of ChIP binding data and gene expression data are available. Our validation analyses show that our predicted gene-TF interactions are very likely to be biologically relevant. We also examine two transcription factors in mammals: C/EBP-beta, where TF binding site data, ChIP binding data and expression data are all available, and SRF, where only TF binding site data and gene expression data are available. In both of these applications, we demonstrate the ability to predict gene-TF interactions with reduced levels of false positives.

Our general approach of Bayesian modeling for integrating heterogeneous biological data to discover regulatory networks provides a framework for overcoming the intrinsic limitations of available methods, and should prove useful in applications to other organisms.


About the Modeling - COGRIM
Based on the assumption that the expression levels of regulated genes are affected by the expression levels of their regulators, the first level of our model incorporates the gene expression data by specifying the observed gene log-expression $g_{it}$ as a linear function of the TF gene log-expression $f_{jt}$:

$$ g_{it} = \alpha_i + \sum_{j=1}^{J} \beta_j \, C_{ij} \, f_{jt} + \epsilon_{it} $$

$C_{ij}$: indicator variable for whether or not gene i is regulated by TF j; $C_{ij} = 1$ if gene i is regulated by TF j and 0 otherwise
$g_{it}$: expression level of gene i in state t; i = 1, . . . , N; t = 1, . . . , T; where there are N genes and T states (or tissues)
$f_{jt}$: expression level of TF j in state t; j = 1, . . . , J; t = 1, . . . , T; where there are J transcription factors and T states (or tissues)
$\alpha_i$: baseline expression of gene i in the absence of known TFs (i.e. $C_{ij} = 0$ for all j)
$\beta_j$: effect of TF j on gene expression
$\epsilon_{it}$: residual error for gene i in state t
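The first-level model can be illustrated with a small numerical sketch. The Python below is not taken from cogrim.R; it only evaluates the mean of the linear model (baseline plus the summed effects of the TFs flagged by the indicator variables) for one gene. The function name, argument names, and toy values are all assumptions made for illustration, and the noise term is left out.

```python
def expected_log_expression(alpha_i, beta, c_i, f):
    """Mean log-expression of one gene across T states under the additive model.

    alpha_i : baseline log-expression of gene i (hypothetical name)
    beta    : list of J TF effects, one per TF
    c_i     : list of J 0/1 indicators (C_ij) for gene i
    f       : J x T nested list of TF log-expression values, f[j][t]
    """
    n_states = len(f[0])
    means = []
    for t in range(n_states):
        # Only TFs with indicator 1 contribute to the mean in state t.
        tf_part = sum(b * c * f_j[t] for b, c, f_j in zip(beta, c_i, f))
        means.append(alpha_i + tf_part)
    return means


# Toy values, invented purely for illustration: J = 3 TFs, T = 2 states.
beta = [2.0, -1.0, 0.5]
f = [[1.0, 0.0], [0.5, 2.0], [4.0, 4.0]]
c_i = [1, 1, 0]  # gene i is regulated by TFs 0 and 1, not by TF 2

means = expected_log_expression(0.5, beta, c_i, f)
print(means)
```

Setting every indicator to 0 reduces the result to the baseline in every state, which matches the definition of the baseline term as expression in the absence of known TFs.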

In this model, each TF j has a single effect ($\beta_j$) on the expression of gene i, which does not take into account the biological reality that expression is often the result of synergistic or antagonistic binding of multiple TFs. We account for these relationships by including interaction terms in our linear model:

$$ g_{it} = \alpha_i + \sum_{j=1}^{J} \beta_j \, C_{ij} \, f_{jt} + \sum_{j<k} \gamma_{jk} \, C_{ij} \, C_{ik} \, f_{jt} \, f_{kt} + \epsilon_{it} $$

where the additional coefficients $\gamma_{jk}$ can be interpreted as the synergistic (or antagonistic) effect of TFs j and k both binding to the same upstream region, in addition to the effects of TF j or TF k binding in isolation.
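As a rough illustration of how an interaction term behaves, the sketch below evaluates the mean log-expression of one gene in one state, with a pairwise term that is active only when both TFs in the pair are flagged as regulators of that gene. It is written in Python for illustration and is not the cogrim.R implementation; the function name, the dictionary layout for the pairwise coefficients, and the numbers are assumptions, and each unordered TF pair is counted once.

```python
from itertools import combinations


def expected_with_interactions(alpha_i, beta, gamma, c_i, f_t):
    """Mean log-expression of one gene in one state, with pairwise TF terms.

    alpha_i : baseline log-expression of gene i (hypothetical name)
    beta    : list of J single-TF effects
    gamma   : dict mapping a TF pair (j, k) with j < k to its interaction
              coefficient; pairs not listed are treated as 0
    c_i     : list of J 0/1 indicators for gene i
    f_t     : list of J TF log-expression values in this state
    """
    main = sum(b * c * x for b, c, x in zip(beta, c_i, f_t))
    # The pairwise term vanishes unless both indicators in the pair are 1.
    pair = sum(
        gamma.get((j, k), 0.0) * c_i[j] * c_i[k] * f_t[j] * f_t[k]
        for j, k in combinations(range(len(beta)), 2)
    )
    return alpha_i + main + pair


# Toy values, invented purely for illustration: J = 2 TFs, one state.
beta = [1.0, 2.0]
gamma = {(0, 1): 0.5}
f_t = [2.0, 3.0]

both = expected_with_interactions(1.0, beta, gamma, [1, 1], f_t)
only_first = expected_with_interactions(1.0, beta, gamma, [1, 0], f_t)
print(both, only_first)
```

With both indicators set, the pairwise coefficient adds to the two single-TF contributions; with only one set, the result falls back to the additive model.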