Jonathan Schug, Ph.D.


I am interested in the regulation of gene expression and in understanding and modeling biological systems. My current work includes machine learning of tissue-specific gene regulatory modules using a grammar formalism of my own design, bounded collection grammars (BCGs). The system integrates many of the approaches currently in use and extends them as well. Using grammars for this problem makes it easy to compose smaller CRM parts into a more complete model.

My earlier modeling work used hybrid systems to model bacterial quorum sensing. As part of the Biocomputation Group at UPenn, Calin Belta and I designed a visual interface for building hybrid models of gene regulation. This interest was sparked by studying process algebras and by my experience with real-time programming.

I hope to bring these two areas together to better understand how cells and organisms function and evolve.


You can find me at the IDOM Functional Genomics Core at Penn. You can find me here.

752A CRB
415 Curie Blvd.
Philadelphia, PA, 19104

email:jschug at
voice:215 898-0773
fax:215 898-5408

I used to work at the Computational Biology and Informatics Lab in the Penn Center for Bioinformatics at the School of Medicine at the University of Pennsylvania.


TessLA Genome Browser and Data Analysis System

TessLA is an extension of the tools in TESS and AnGEL to support the IDOM/FGC and to perform data analysis for HTS data. We are using it here.

More to follow...


AnGEL is a system for querying for patterns of sequence annotation. AnGEL extracts sequence and annotation from databases, webservers, DAS servers, local programs, or GFF files. It can also perform simple string matching against DNA or other sequence. The combinations and arrangements of annotations are described using a non-recursive context-free grammar with special additions for finding sets, multisets, and lists of features. The TESS website now uses AnGEL to let you search for CRMs genome-wide in a collection of organisms.

A description of the system has appeared in the proceedings of the DILS07 workshop. Source code will be available by request from me once the paper appears.


TESS is a web tool for predicting transcription factor binding sites in your DNA sequence. You can also query for information about the transcription factors. It has been on the web since 1996 and has done more than 500,000 TFBS searches. At last count it was cited by about 150 papers.


EPConDB is CBIL's contribution to the Beta Cell Biology Consortium (BCBC). It is a web site for pancreas genomics including special cDNA microarrays, promoter chip arrays, expression information, and a search to find pancreas CRMs in the promoters or introns of human or mouse genes using simple collection productions.


The Genomics Unified Schema (GUS) is a comprehensive relational schema and software infrastructure covering many aspects of modern genomics. I develop much of the application architecture. I am the schema master of the TESS section of GUS which stores information about transcription factors, binding sites, and CRM models.


The Transcription Analysis Interest Group (TAIG) is an informal group of people in PCBI (and beyond) that meets bi-weekly to present chalk talks of their research to get feedback, identify potential collaborations, or simply see what data and expertise are available locally. If you are interested in joining, you can subscribe here.


PlasmoDB is a web site containing information about the Plasmodium falciparum genome as well as mRNA and protein expression information. It has been expanded to include other plasmodia and has been incorporated into ApiDB. I was a member of the original team, but now just contribute a graph here and there.

BioSketch Pad

BioSketchPad is a visual front-end for building hybrid models of gene regulation that was developed as part of the Biocomputation Group at UPenn.

GO Function Annotation

This is an approach to predicting gene ontology function based on the domains a protein contains. By training on hand-curated assignments, we were able to associate protein domains with GO functions. This method takes advantage of the hierarchical structure of the gene ontology to make a more general prediction when there is conflict among the possible annotations. Sharon Diskin provided lots of help. We have a browser that displays the rules we learn for each protein domain.


XDC was a beamline control and data collection software package for X9 at NSLS.


University of Pennsylvania Ph.D. Computer and Information Science 2005
University of Pennsylvania M.S.E. Computer and Information Science 1994
University of Pennsylvania B.A. Mathematics 1980


University of Pennsylvania Functional Genomics Core of the Institute for Diabetes, Obesity, and Metabolism, School of Medicine, Philadelphia, PA 19104 2008 to present
developing and applying algorithms to analyze high-throughput sequencing and microarray data to understand a variety of biological phenomena.
University of Pennsylvania Computational Biology and Informatics Lab, School of Medicine, Philadelphia, PA 19104 1995 to 2008
developed algorithms and tools for understanding gene regulation, building large genomics databases and web sites
Computer Command and Control Corporation Philadelphia, PA 19103 1994 to 1995
developed software re-engineering tools and techniques
Biostructures Institute University City Science Center, Philadelphia, PA 19104 1984 to 1994
developed equipment control and data acquisition software for X-ray beamline X9 at the National Synchrotron Light Source


I graduated from the Department of Computer and Information Science in the School of Engineering at the University of Pennsylvania. My advisors were Max Mintz and Christian J. Stoeckert, Jr. My committee consisted of Maja Bucan, Klaus Kaestner, David Searls, and Lyle Ungar.

In my dissertation [PDF 4.1MB] I develop and apply a grammar formalism, bounded collection grammars, to the problem of modeling cis-regulatory modules (CRMs), which are partly responsible for controlling gene expression.

Here's the abstract.

Tissue-specific expression is one of the most obvious and important patterns of gene expression in complex eukaryotes. Every cell in an organism has the same set of genes, yet only a subset of the genes are expressed in a given cell type. This regulation is accomplished in large part by transcription factors (TF's) that bind to short degenerate genomic sequences called binding sites near the genes they regulate. TF's work in combination to provide precise regulation of gene expression. Understanding the combinatorics of TF regulation is still an open problem in post-genomic biology. In this dissertation we develop and apply a bounded collection grammar (BCG) formalism, similar to permutation grammars, and a machine-learning algorithm to model, search for, and learn the combinations and arrangements of TF's that regulate tissue-specific expression. Our machine-learning algorithm allows for the optimization of free parameters in a grammar such as spacing and scores to identify the best possible performance of a rule. This system provides a unique combination of modeling power and learning ability. To identify tissue-specific genes from tissue surveys of gene expression, we apply Shannon entropy Hg to quantify overall specificity, then develop and apply a new entropy-based metric Qg|t to quantify specificity to a particular tissue, t. We take a stepwise approach to promoter analysis by first studying specific and ubiquitous promoters in general to determine global characteristics. We then study the genes specific to a particular tissue in this global context. Our analysis of mouse and human promoters ranked by Hg identifies the TATA box and CpG island as the major determinants of tissue-specificity. We find there are functional correlates of the TATA/CpG class of a gene's promoter. We identified TF's enriched in liver promoters and studied their arrangements to refine and extend earlier results by identifying one known rule and many new rules. Finally, we performed sequence analysis of ChIP-chip experiments to identify the companion factors of the ChIP-chip target factor that help define the active sites in the direct target genes, demonstrating that our machine learning system can also contribute to the understanding of other regulatory events.
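The abstract names the two specificity measures but does not spell out their formulas. The Python sketch below is an illustration only, and it assumes the following definitions, which are not stated in the abstract: the relative expression of gene g in tissue t is p = w(g,t) / sum over tissues of w(g,t), Hg = -sum p log2 p, and Qg|t = Hg - log2 p. The function name and the example expression values are hypothetical.

```python
import math

def tissue_specificity(expression):
    """Return (Hg, {tissue: Qg|t}) for one gene.

    expression: dict mapping tissue name -> non-negative expression level.
    Assumed definitions (not given in the abstract):
      p_t  = w_t / sum(w)
      Hg   = -sum(p_t * log2(p_t))
      Qg|t = Hg - log2(p_t)
    """
    total = sum(expression.values())
    if total <= 0:
        raise ValueError("gene has no expression in any tissue")
    p = {t: w / total for t, w in expression.items()}
    # Tissues with zero expression contribute nothing to the entropy.
    h = sum(-pt * math.log2(pt) for pt in p.values() if pt > 0)
    # A tissue with zero expression gets an infinite (worst) Q score.
    q = {t: (h - math.log2(pt)) if pt > 0 else math.inf for t, pt in p.items()}
    return h, q

# Hypothetical expression levels, for illustration only.
ubiquitous = {"liver": 10, "brain": 10, "heart": 10, "kidney": 10}
liver_only = {"liver": 40, "brain": 0, "heart": 0, "kidney": 0}
print(tissue_specificity(ubiquitous))
print(tissue_specificity(liver_only))
```

Under these assumed definitions, low Hg means the gene is expressed in few tissues, and low Qg|t means it is both specific overall and concentrated in tissue t. A uniformly expressed gene has the maximum Hg of log2(number of tissues).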

Publications and Conference Proceedings

A • marks the papers for which I am a lead author or a major contributor.

High-throughput Sequencing

• Gao Y, Schug J, McKenna LB, Le Lay J, Kaestner KH, Greenbaum LE., Tissue-specific regulation of mouse MicroRNA genes in endoderm-derived tissues, Nucleic Acids Res. 2010 Sep 14. [Epub ahead of print]

McKenna LB, Schug J, Vourekas A, McKenna JB, Bramswig N, Friedman JR, Kaestner KH., MicroRNAs Control Intestinal Epithelial Differentiation, Architecture, and Barrier Function., Gastroenterology. 2010 Jul 23. [Epub ahead of print]

Zhao J, Schug J, Li M, Kaestner KH, Grant SF., Disease-associated loci are significantly over-represented among genes bound by transcription factor 7-like 2 (TCF7L2) in vivo., Diabetologia. 2010 Jul 17. [Epub ahead of print]

Gao N, Le Lay J, Qin W, Doliba N, Schug J, Fox AJ, Smirnova O, Matschinsky FM, Kaestner KH., Foxa1 and Foxa2 maintain the metabolic and secretory features of the mature beta-cell., Mol Endocrinol. 2010 Aug;24(8):1594-604. Epub 2010 Jun 9.

Steger DJ, Grant GR, Schupp M, Tomaru T, Lefterova MI, Schug J, Manduchi E, Stoeckert CJ Jr, Lazar MA., Propagation of adipogenic signals through an epigenomic transition state., Genes Dev. 2010 May 15;24(10):1035-44.

Govin J, Schug J, Krishnamoorthy T, Dorsey J, Khochbin S, Berger SL., Genome-wide mapping of histone H4 serine-1 phosphorylation during sporulation in Saccharomyces cerevisiae., Nucleic Acids Res. 2010 Aug;38(14):4599-606. Epub 2010 Apr 7.

• Bhandare R, Schug J, Le Lay J, Fox A, Smirnova O, Liu C, Naji A, Kaestner KH. Genome-wide analysis of histone modifications in human pancreatic islets., Genome Res. 2010 Apr;20(4):428-33. Epub 2010 Feb 24.

Rieck S, White P, Schug J, Fox AJ, Smirnova O, Gao N, Gupta RK, Wang ZV, Scherer PE, Keller MP, Attie AD, Kaestner KH., The transcriptional response of the islet to pregnancy in mice., Mol Endocrinol. 2009 Oct;23(10):1702-12. Epub 2009 Jul 2.

Tuteja G, White P, Schug J, Kaestner KH., Extracting transcription factor targets from ChIP-Seq data., Nucleic Acids Res. 2009 Sep;37(17):e113. Epub 2009 Jun 24.

Bochkis IM, Schug J, Rubins NE, Chopra AR, O'Malley BW, Kaestner KH., Foxa2-dependent hepatic gene regulatory networks depend on physiological state., Physiol Genomics. 2009 Jul 9;38(2):186-95. Epub 2009 May 5.

Schupp M, Cristancho AG, Lefterova MI, Hanniman EA, Briggs ER, Steger DJ, Qatanani M, Curtin JC, Schug J, Ochsner SA, McKenna NJ, Lazar MA., Re-expression of GATA2 cooperates with peroxisome proliferator-activated receptor-gamma depletion to revert the adipocyte phenotype., J Biol Chem. 2009 Apr 3;284(14):9458-64. Epub 2009 Jan 9.

Lefterova MI, Zhang Y, Steger DJ, Schupp M, Schug J, Cristancho A, Feng D, Zhuo D, Stoeckert CJ Jr, Liu XS, Lazar MA., PPARgamma and C/EBP factors orchestrate adipocyte biology via adjacent binding on a genome-wide scale., Genes Dev. 2008 Nov 1;22(21):2941-52.

Gene Regulation

• Schug, J., Mintz, M., Stoeckert, C.J., Data Integration and Pattern-Finding in Biological Sequence with TESS's Annotation Grammar and Extraction Language (AnGEL), DILS 2007, LNBI 4544 188-203, 2007. [PDF]

• Schug, J., Mintz, M., Stoeckert, C.J., Using Bounded Collection Grammars to Identify cis-Regulatory Modules, in prep.

• Schug, J., Schuller, W.-P., Kappen, C., Salbaum, M.J., Bucan, M., Stoeckert, C.J., Promoter features related to tissue specificity as measured by Shannon entropy, Genome Biology 2005. [PDF]

Phuc Le P, Friedman JR, Schug J, Brestelli JE, Parker JB, Bochkis IM, Kaestner KH., Glucocorticoid Receptor-Dependent Gene Regulatory Network, PLoS Genetics, Aug 5;1(2):e16, 2005. [PDF]

Friedman JR, Larris B, Le PP, Peiris TH, Arsenlis A, Schug J, Tobias JW, Kaestner KH, Greenbaum LE, Orthogonal analysis of C/EBPbeta targets in vivo during liver proliferation, Proc Natl Acad Sci U S A. 2004 Aug 31;101(35):12986-91. [PDF]

• Schug, J., Unit 2.6: Using TESS to Predict Transcription Factor Binding Sites in DNA Sequence, Current Protocols in Bioinformatics, John Wiley and Sons, 2003. [URL]

• Schug, J., Overton, G. Christian, Modeling Transcription Factor Binding Sites with Gibbs Sampling and Minimum Description Length Encoding, short-paper, Proceedings of ISMB 1997.

Transcription factors, proteins required for the regulation of gene expression, recognize and bind short stretches of DNA on the order of 4 to 10 bases in length. In general, each factor recognizes a family of "similar" sequences rather than a single unique sequence. Ultimately, the transcriptional state of a gene is determined by the cooperative interaction of several bound factors. We have developed a method using Gibbs Sampling and the Minimum Description Length principle for automatically and reliably creating weight matrix models of binding sites from a database (TRANSFAC) of known binding site sequences. Determining the relationship between sequence and binding affinity for a particular factor is an important first step in predicting whether a given uncharacterized sequence is part of a promoter site or other control region. Here we describe the foundation for the methods we will use to develop weight matrix models for transcription factor binding sites.
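To make the term "weight matrix model" concrete, here is a minimal Python sketch of building a log-odds weight matrix from already-aligned sites and scanning a sequence with it. It does not implement the Gibbs sampling or Minimum Description Length steps described above. The uniform background, the pseudocount of 1, the function names, and the toy sites are all assumptions made for illustration.

```python
import math

BASES = "ACGT"

def build_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds weight matrix from equal-length, aligned sites.

    Assumes uppercase A/C/G/T only and a uniform background frequency.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    matrix = []
    for i in range(length):
        column = [s[i] for s in sites]
        matrix.append({
            b: math.log2((column.count(b) + pseudocount) / total / background)
            for b in BASES
        })
    return matrix

def best_match(matrix, sequence):
    """Score every window of the sequence; return (score, start) of the best."""
    width = len(matrix)
    scored = [
        (sum(col[base] for col, base in zip(matrix, sequence[i:i + width])), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scored)

# Toy, hand-made sites (not taken from TRANSFAC).
sites = ["TATAAA", "TATATA", "TATAAT", "TTTAAA"]
matrix = build_weight_matrix(sites)
score, start = best_match(matrix, "GGCGTATAAAGGC")
print(start, round(score, 2))
```

In this toy example the best-scoring window starts at position 4 (0-based), where the sequence contains TATAAA. Each base that is over-represented in a column adds a positive score, and each base rarer than background subtracts from it.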

Hybrid Systems

• Rajeev Alur, Calin Belta, Franjo Ivancic, Vijay Kumar, Harvey Rubin, Jonathan Schug, Oleg Sokolsky, and Jonathan Webb, Visual Programming for Modeling and Simulation of Biomolecular Regulatory Networks, Lecture Notes In Computer Science; Vol. 2552, Proceedings of the 9th International Conference on High Performance Computing pages 702-712, 2002 [PDF]

• Alur, R., Belta, C., Kumar, V., Mintz, M., Pappas, G.J., Rubin, H., Schug, J., Modeling and Analyzing Biomolecular Networks, Computing in Science and Engineering, 4(1), 20-31. 2002. [PDF]

Alur, R., Belta, C., Ivancic, F., Kumar, V., Mintz, M., Pappas, G., Schug, J., Hybrid modelling and simulation of biomolecular networks, 4th International Workshop on Hybrid Systems: Computation and Control, Rome, Italy, March 2001. [PDF]

• Belta, C., Schug, J., Dang, T., Kumar, V., Pappas, G.J., Rubin, H., Dunlap, P.V., Stability and reachability analysis of a hybrid model of luminescence in the marine bacterium Vibrio fischeri, 40th IEEE CDC, Orlando, Florida, 2001. [PDF]


Gunasekera AM, Patankar S, Schug J, Eisen G, Kissinger J, Roos D, Wirth DF, Widespread distribution of antisense transcripts in the Plasmodium falciparum genome, Mol Biochem Parasitol., Jul;136(1):35-42, 2004.

Gunasekera AM, Patankar S, Schug J, Eisen G, Wirth DF, Drug-induced alterations in gene expression of the asexual blood forms of Plasmodium falciparum, Mol Microbiol., Nov;50(4):1229-39, 2003.

Kissinger JC, Brunk BP, Crabtree J, Fraunholz MJ, Gajria B, Milgram AJ, Pearson DS, Schug J, Bahl A, Diskin SJ, Ginsburg H, Grant GR, Gupta D, Labo P, Li L, Mailman MD, McWeeney SK, Whetzel P, Stoeckert CJ, Roos DS., The Plasmodium genome database., Nature, Oct 3;419(6906):490-2, 2002.

Bahl, A., Brunk, B., Coppel, R., Crabtree, J., Diskin, S., Fraunholz, M., Grant, G., Gupta, D., Huestis, R., Kissinger, J., Labo, P., Li, L., McWeeney, S., Milgram, A., Roos, D.R., Schug, J., Stoeckert, C.J. Jr., PlasmoDB: The Plasmodium falciparum Genome Resource, Nucleic Acids Research, 30(1), 87-90, 2002.

The Plasmodium Genome Database Collaborative, PlasmoDB: an integrative database of the Plasmodium falciparum genome. Tools for accessing and analyzing finished and unfinished sequence data, Nucleic Acids Research 29(1), 66-69, 2001.


• Schug, J., Diskin, S., Mazzarelli, J., Brunk, B., Stoeckert, C.J. Jr., Predicting Gene Ontology functions from ProDom and CDD protein domains, Genome Research, Apr;12(4):648-55, 2002. [PDF]

Crabtree, J., Wiltshire, T., Brunk, B., Zhao, S., Schug, J., Stoeckert, C.J., Jr., Bucan, M., High-resolution BAC-based map of the central portion of mouse chromosome 5, Genome Research, 11(10), 1746-57, 2001.

Davidson, S.B., Crabtree, J.C., Brunk, B., Schug, J., Tannen, V., Overton, G.C., Stoeckert, C.J., Jr., K2/Kleisli and GUS: Experiments in integrated access to genomic data sources, IBM Systems Journal 40(2), 512-31, 2001. [PDF]

Stoeckert, C., Pizarro, A., Manduchi, E., Gibson, M., Brunk, B., Crabtree, J., Schug, J., Shen-Orr, S., Overton, G.C., A relational schema for both array-based and SAGE gene expression experiments, Bioinformatics 17(4), 300-8, 2001.

G.C. Overton, C. Bailey, J. Crabtree, M. Gibson, S. Fischer and J. Schug, The GAIA Software Framework for Genome Annotation, Pacific Symposium on Biocomputing 3:291-302 1998.

Bailey, L. C., Jr., Fischer, S., Schug, J., Crabtree, J., Gibson M., Overton, G. C., GAIA: framework annotation of genomic sequence, Genome Research, 8(3), 234-250, 1998.

Ancient History

Rehmet, P, Schug, J., Zigman, F., Prywes, N.S., Integration of Software Reengineering with Domain/Application Engineering - Concept of Operation, Demonstration, and Pilot Projects, Proceedings of the Fifth Systems Reengineering Technology Workshop, February 7-9 1995, Monterey CA, Johns Hopkins University APL Research Center Report RSI-95-001.

Rosenbaum, G., Schug, J. Ultra-high precision coupling of angular motion for a constant exit-height monochromator, Review of Scientific Instruments, 60(7), 2130, 1988.

Posters and Presentations

A • marks the posters that I presented.

BCBC Investigator's Retreat 2006 - Learning transcriptional regulators of tissue-specific biological processes, Elisabetta Manduchi, Jonathan Schug, Christian J. Stoeckert Jr.

• CSHL Systems Biology Meeting - Bounded Collection Grammars and the Language of Gene Regulation, Jonathan Schug, Max Mintz & Christian J. Stoeckert, Jr. [PDF]

• BCBC Investigator's Retreat 2005 - Pancreas Pairs Using BCGs [PDF]

MGED8 - An Approach To Identify Regulatory Modules For Tissue-Specific Transcripts Sharing A Tissue-Specific Gene Ontology Biological Process, Elisabetta Manduchi, Jonathan Schug, Christian J. Stoeckert Jr., 2005

• BCBC Investigator's Retreat 2004 - Entropy? [PDF]

MGED7 - An Approach To Identify Regulatory Modules For Tissue-Specific Transcripts Sharing A Tissue-Specific Gene Ontology Biological Process, Elisabetta Manduchi, Jonathan Schug, Christian J. Stoeckert Jr., 2004

ISMB 2004 - Discovering regulatory modules by creating profiles for Gene Ontology Biological Processes based on tissue-specificity scores, E. Manduchi, J. Schug, C.J. Stoeckert Jr.

• BCBC Investigator's Retreat 2003 - [PDF]

ISMB 2003 - BCGs Jonathan Schug, C.J. Stoeckert Jr.