UPenn. Center for Bioinformatics Computational Biology and Informatics Laboratory


  Upcoming lab. meetings
 

  Past lab. meetings
 
Thursday, May 1, 2008 3:30 PM -- Deborah Pinney, CBIL -- Using the GUS Pipeline
Most data acquisition and analysis tasks can be broken down into individual operations and run in series, known as pipelines. The automatic flow of steps also improves the efficiency of the entire process, measured in total time to completion. We have been using a particularly simple form of pipeline based on a Pipeline API written by Steve Fischer. I will demonstrate the construction of pipelines and discuss variations of the pipelines currently in use.

Thursday, April 24, 2008 3:30 PM -- Elisabetta Manduchi, CBIL -- Meta-analyses of RAD studies
In the past few years we have been accumulating studies in our RAD repository, many of which are related to the EPConDB and BCBC projects. These studies are highly curated in terms of annotation and we have worked on analyzing many of them both at the low level (pre-processing) and at a higher level (differential expression analyses). Our most recent goal is to leverage this work and move from within-study analyses and presentations to across-studies ones. Moreover we also want to expand from analyses where the unit of interest is a gene (or transcript) to those where the unit of interest is a gene set. In this talk, I will describe some of the issues with meta-analyses and our recent approach to provide across-view studies of our data, both at the gene level and at the gene set level. I will also describe how we plan to expand our analyses beyond 2-condition expression comparisons, e.g., to time series analyses and will discuss a couple of algorithms to this end. Finally, if time permits, I will describe possible applications of the Genomica software to create module maps from a collection of studies and gene sets.

Thursday, April 17, 2008 3:30 PM -- Greg Grant, CBIL -- Microarray genomewide association studies: Population genetics meets genomics
The pursuit of the genetic basis for disease has turned heavily to SNP microarray based methods. But the ultimate success of this approach depends on many unknown factors. We will review the evolution of methods, the main contentious issues, and both optimistic and pessimistic perspectives, with a goal of determining the bioinformatics necessities of the field as it moves forward.

Thursday, April 3, 2008 3:30 PM -- Junmin Liu, CBIL -- From J2EE frameworks to CSS framework
I will talk about a couple of frameworks I used in the Genomics Beta Cell website and the RADQ codebase: AOP (aspect-oriented programming) for security and, potentially, query histories; JMock and StrutsTestCase for QA testing of pages and web flows; SiteMesh for JSP screen control; the Dojo library for Ajax effects; and the YAML CSS framework for creating modern and flexible layouts.

Thursday, March 20, 2008 3:30pm -- Charles Treatman, CBIL, Roos Lab -- OrthoMCL 2
I will be discussing the work Steve, Jerric, and I did to update OrthoMCL for the latest release, as well as work to be done for the future releases.

Thursday, March 13, 2008 3:30pm -- John Iodice, CBIL -- ApiDB at MAM 2008
Along with David Roos, I went to the Molecular Approaches to Malaria conference last month in Lorne, Victoria, Australia. We led a workshop on ApiDB. I'll talk about the questions and reactions we heard, and describe some of the presentations and posters I saw in the course of the three-day event.

Thursday, March 6, 2008 3:30pm -- Kobby Essien, CBIL -- Assessing the contribution of transcriptional regulation and coding sequence changes to phenotypic differences between malaria parasites
Malaria parasites share many conserved features but also have some striking differences including host cell preferences and cell cycle lengths. I will describe efforts to understand how differences between Plasmodium species may be influenced by transcriptional regulation and evolution of coding sequences.

Thursday, February 28, 2008 3:30pm -- Jonathan Schug, CBIL -- Chromatin Modifications Part II
Since last summer/fall I have been involved with 3 more projects relating to chromatin structure. In this lab meeting I'll review the software and db infrastructure I've developed for these projects, the projects themselves, as well as the main outlines of the grant proposal.

Thursday, December 20, 2007 3:30pm -- Greg Grant, CBIL -- Effective use of BLAST
BLAST is a widely used algorithm in bioinformatics, but there is considerable confusion over the meanings and proper use of the various parameters and statistical measures involved. We will review BLAST, BLAST statistics and BLAST output - with emphasis on the difference between NCBI and WU BLAST, and finally a look at how GUS handles BLAST results and how we might modify GUS to store BLAST results more effectively.

Thursday, December 13, 2007 3:30pm -- Elisabetta Manduchi, CBIL -- Computing with assay annotations
In this presentation I will return to a topic that I had discussed in a lab meeting over a year ago. Namely, leveraging the recent developments in terms of biological investigation ontologies, to compute based on an input consisting of annotations reflecting the biological intent/context of an experiment, as opposed to numerical values resulting from the experiment. The focus is on microarray experiments, since these are typically those for which curated annotation is more easily available in public databases. A microarray experiment is a collection of related assays (one or two-channel hybridizations). Last year I presented a preliminary study where computations were done at the "experiment level", i.e. where the input was a collection of experiments. Moreover I focused on experiments whose annotation we directly controlled (RAD). In this presentation I will instead work at the finer level of assays, i.e. where the input is a collection of assays (typically from a variety of experiments), and with data coming from public repositories (ArrayExpress and GEO). I will discuss possible measures of dissimilarity between assays (based on their annotation) and clustering based on these measures. I will illustrate what the theoretical and practical issues are by working with a specific use case. This is work in progress and any feedback will be appreciated.

Thursday, November 29, 2007 3:30pm -- Jerric Gao, CBIL, Roos Lab -- Status report of OrthoMCL DB2.0
Compared to the current OrthoMCL DB 1.0 (55 organisms, 627,098 proteins, 70,388 ortholog groups), the new GUS-based OrthoMCL DB 2.0 includes 88 organisms, 786,818 proteins, 79,695 ortholog groups, and 338,697,250 BLAST similarity hits. Such a huge volume of information poses a challenge for processing, storing, and querying the data efficiently. In this lab meeting, I will review the current progress of OrthoMCL DB 2.0, and then focus on the remaining work to bring up the new 2.0 website.

Thursday, November 1, 2007 3:30pm -- Debbie Pinney, CBIL -- Mouse Release 11 and Human Release 10 DoTS - progress report
I will report on the current state of the mouse and human DoTS builds. The mouse DTs have been completed and mapped to Entrez Gene Ids and to MGI Ids. Human ESTs and mRNAs have been clustered and the DTs are being assembled. I will review the procedures followed and will report on the final mouse DoTS statistics and future efforts to generate genes.

Thursday, October 25, 2007 3:30pm -- Shailesh Date, CBIL -- Reverse engineering Plasmodium falciparum
Over the past 3 years or so, we have been actively investigating the proteome and interactome of the malarial parasite Plasmodium falciparum, using techniques of network and cluster analysis. I will recap what we have accomplished with interactome modeling, and discuss the latest results coming out of the gene/protein family analyses. I will also describe how we are putting together experimental assays to measure the accuracy of our predictions, and how all the data obtained using different techniques could be used to generate information about pathways and systems in the parasite.

Thursday, October 18, 2007 3:30pm -- John Brestelli, CBIL -- Representing Multiple Sequence Alignments as Dots Assemblies
Mercator uses information from multiple genomes to identify an orthology map. Each orthologous segment is subsequently aligned using MAVID. I will discuss our tentative plan for storing these alignments in GUS.

Thursday, October 11, 2007 3:30pm -- Kobby Essien, CBIL -- Comparative genomics approaches for understanding differences between malaria parasites
Malaria parasites share many conserved features but also have some striking differences including host cell preferences and cell cycle lengths. I will describe my efforts to link differences in phenotype to transcriptional and coding sequence differences in these parasites.

Thursday, October 4, 2007 3:30pm -- John Iodice, CBIL -- Database Query Performance Tuning
To maintain acceptable web page load times, it is sometimes necessary to reduce the running time of database queries. Techniques for this include creating indexes, restructuring queries, building tables or materialized views that contain pre-computed partial solutions, and adding hints to queries. These efforts can be guided by the examination of query plans. We will discuss the application of these techniques in current ApiDB development.
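As a minimal sketch of how a query plan can guide tuning, the following uses SQLite (through Python's standard sqlite3 module) as a stand-in database; the `genes` table, its columns, and the index name are hypothetical and are not taken from the ApiDB schema. It compares the plan for the same query before and after an index is created.

```python
import sqlite3

# In-memory SQLite database used purely for illustration (an assumption;
# the talk concerns the ApiDB production database, not SQLite).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genes (gene_id TEXT, organism TEXT, product TEXT)")
conn.executemany(
    "INSERT INTO genes VALUES (?, ?, ?)",
    [(f"g{i}", "P. falciparum" if i % 2 else "T. gondii", f"product {i}")
     for i in range(1000)],
)

QUERY = "SELECT product FROM genes WHERE organism = ?"


def query_plan(connection, sql, params):
    """Return the human-readable plan steps for a query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The textual description is the last column of each plan row.
    return [row[-1] for row in rows]


# Without an index, the planner has to scan the whole table.
plan_before = query_plan(conn, QUERY, ("P. falciparum",))

# After indexing the filtered column, the planner can search the index.
conn.execute("CREATE INDEX idx_genes_organism ON genes (organism)")
plan_after = query_plan(conn, QUERY, ("P. falciparum",))

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan text varies between SQLite versions, so the useful comparison is whether the index name appears in the plan at all.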

Thursday, September 27, 2007 3:30pm -- Brian Brunk, CBIL -- Integrating and presenting SNP data for ApiDB
PlasmoDB and ToxoDB now have a wealth of SNP data from a variety of resequencing efforts. How we are currently dealing with this data and providing it to users will be presented. Ideas for future development will also be discussed.

Thursday, September 13, 2007 3:30pm -- Jonathan Schug, CBIL -- Chromatin Modifications and Yeast Sporulation
S. cerevisiae forms spores when facing starvation and other conditions. Spores are gametes, and aspects of gametogenesis are well conserved from yeast through flies to mammals. One common feature of gametogenesis is the set of chromatin modifications that control the compaction of the genome as well as mark expression levels of genes. I'll present the data analysis I've been performing on ChIP-chip data from Shelley Berger's lab (especially Thanuja Krishnamoorthy and Jerome Govin) that measures the amount of acetylation and phosphorylation of histone H4 during sporulation.

Thursday, August 30, 2007 3:30pm -- Steve Fischer, CBIL -- Workflow requirements
I will present a minimal set of requirements that an improved pipeline system would include. The requirements do not assume any underlying implementation. I will also present a sketch of a home-brew implementation to use as a basis of comparison against third-party implementations.

Thursday, August 16, 2007 3:30pm -- Jennifer Dommer, CBIL -- Trichomonas vaginalis: An Overview
The ApiDB team was recently asked to create web portals for two new genomes, G. lamblia and T. vaginalis. I will be providing an introduction to T. vaginalis by presenting some background on the organism, and an overview of the genome data that will be available for the September 1 release.

Thursday, August 9, 2007 3:30pm -- Junmin Liu, CBIL -- The new RADQuerier codebase using J2EE frameworks
Besides using some parts of WDK, the new RADQ imposes a rigid architecture design and clear-cut layers with the help of some J2EE frameworks. In addition, a number of other technologies are used, including iBATIS for O/R mapping; AOP (aspect-oriented programming) for security and query histories; and JUnit, JMock, and StrutsTestCase for QA and interface design. Also used are XFire and JSR181 for web services, DisplayTag and SiteMesh for JSP screens and web layouts, and the Dojo library for Ajax control. A walk-through of these and of the Spring Framework's IoC container, which wires all the pieces together, will be presented.

Thursday, May 24, 2007 3:30pm -- Joan Mazzarelli, CBIL -- EPConDB...past, present and future
EPConDB, a web resource of the Beta Cell Biology Consortium, provides an integrated view of pancreas- and islet-related gene annotation, expression and regulation data. We are now in the process of designing a new EPConDB web site. During this presentation, I will discuss the functionalities of the web site, the data and annotations used by the site, EPConDB and its implementation in the new WDK, and future site development plans.

Thursday, May 17, 2007 3:30pm -- Kobby Essien, CBIL -- Examining sequence-based features of binding sites
Properties such as conservation and clustering upstream of target genes are exhibited by transcription factor binding sites. My hope is to exploit combinations of these features to predict binding sites in Plasmodium. I will describe preliminary work looking at known binding sites in the context of such features in yeast and present some thoughts on a binding site prediction strategy that may be useful in Plasmodium.

Thursday, May 10, 2007 3:30pm -- Shailesh Date, CBIL -- We got the network! What's next?
Over the past 3 years, we have tackled a number of different projects that dealt with reconstruction of protein-protein interaction networks for different organisms, as well as development of new tools and techniques for comparative genomics and sequence analyses. In this next phase, we plan to use these tools and the generated data to answer important questions pertaining to the biology of parasites and higher eukaryotes. I will briefly describe our specific aims for future planned projects, and present a summary of completed tasks.

Thursday, May 3, 2007 3:30pm -- Gregory Grant, CBIL -- Hemodynamics and Atherosclerosis
Blood flow in arteries is characterized by regions of smooth laminar flow and regions of disturbed, turbulent flow. It has been observed that the endothelial cells most susceptible to atherosclerosis are those in the disturbed flow regions. To date, however, the underlying genetic basis for this difference is poorly understood. We have performed a series of microarray studies on porcine endothelial cells of populations with well-controlled diets, towards an understanding of the cellular mechanisms underlying this differential susceptibility. Some of the results of these studies will be described, with some attention to the computational issues.

Thursday, April 26, 2007 3:30pm -- Frank Innamorato, CBIL -- Code-based SQL Performance Enhancements
Embedded SQL statements form the interface we use to create and retrieve our relational data. The lab implements well-structured and efficient SQL in plugins and applications; however, there are some structural changes that could improve the overall performance. These improvements include reducing hash sizes, preparing statements, using cursors, batch commits, use of single and nested materialized views, altering join conditions, minimizing selected columns and column sizes, using temporary tables, index data retrieval, and index creation strategies.
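Two of the listed techniques, prepared (parameterized) statements and batch commits, can be sketched as follows. This is an illustrative example only: it uses SQLite through Python's standard sqlite3 module rather than the lab's database, and the `similarity` table, its columns, and the batch size are hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE similarity (query_id TEXT, subject_id TEXT, score REAL)"
)

# Hypothetical rows to load.
rows = [(f"q{i}", f"s{i}", float(i)) for i in range(1000)]

# One parameterized statement is reused for every row, so the SQL text is
# prepared once per batch instead of being rebuilt as a new string per row.
INSERT_SQL = (
    "INSERT INTO similarity (query_id, subject_id, score) VALUES (?, ?, ?)"
)

BATCH_SIZE = 250  # hypothetical value; tune for the real workload
commits = 0
for start in range(0, len(rows), BATCH_SIZE):
    batch = rows[start:start + BATCH_SIZE]
    conn.executemany(INSERT_SQL, batch)
    # Commit once per batch rather than once per row.
    conn.commit()
    commits += 1

row_count = conn.execute("SELECT COUNT(*) FROM similarity").fetchone()[0]
print(row_count, "rows loaded in", commits, "commits")
```

Committing per batch trades a little recovery granularity for far fewer transaction round trips; the right batch size depends on the database and the load.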

Thursday, April 19, 2007 3:30pm -- Elisabetta Manduchi, CBIL -- Analyzing ChIP-chip data
The use of ChIP-chip experiments to identify transcription factor binding sites is becoming increasingly widespread in general and in particular in the labs of our collaborators. Although the platform types (microarrays) utilized are the same as for gene expression experiments, there are differences in the biological assumptions that can be made and that are at the base of low- and high-level analysis methods. We have just started exploring ChIP-chip data and investigating some of the algorithms proposed for their analysis. In this presentation I will illustrate some of these approaches with a particular focus on individual-reporter based methods and with an application to FoxA2 adult (mouse) liver ChIP-chip data generated by the Kaestner lab.

Thursday, April 12, 2007 3:30pm -- Jerric Gao, CBIL -- New Features in WDK
There are several new features in WDK, such as sorting, column customization, dataset/bulk-upload, reporter, new UI controls; these features will be added into the upcoming release of PlasmoDB/ToxoDB. In this presentation, I will talk about the designs and implementations of these new WDK features, along with short tutorials on how to use them. We will also discuss more (possible) enhancements for WDK, such as query filter, cross-site login.

Thursday, April 5, 2007 3:30pm -- Debbie Pinney, CBIL -- Comparative Proteome Analysis of the Apicomplexa
The aims of this project include annotation of proteins whose functions are unknown, gaining insight into the evolution of the Apicomplexans, and the discovery of novel biological processes. Clustering based on sequence similarity was performed within single genomes and across multiple genomes. Single genome analyses indicate how the genomes differ in terms of gene duplication and can provide evidence of a hypothetical protein's membership in a gene family. Analysis across multiple genomes may allow the inference of a protein's functional identity via members of the same cluster with known function or via domains characterized through multiple alignment of the cluster's constituent sequences. In addition, multiple genome clustering is a rich source of information concerning the conservation of proteins and the phylogenetic relationship among members of the Apicomplexa. The initial work has concentrated on the Plasmodium species falciparum, vivax, yoelii, berghei, and chabaudi but will be expanded to include all available Apicomplexa genomes in the future.

Thursday, March 29, 2007 3:30pm -- John Iodice, CBIL -- Big Fat Materialized Views: Database objects to support new WDK functionality
Until now, WDK result sets were always retrieved in the order determined by the underlying query. New WDK code, which will be included in the upcoming ApiDB releases, lets the user change the ordering and columns of a result set dynamically. To make this possible, we have replaced the multiple queries that define the attributes of each WDK record (gene, sequence, SNP, etc.) with a single materialized view. We'll discuss the requirements for these new database objects, the WDK model changes involved, and performance issues we've faced, and we'll see the new web site functionality they enable.

Thursday, March 22, 2007 3:30pm -- John Brestelli, CBIL -- Extending MR_Ti: Array Data Analysis Tools and a User Friendly Interface
MR_Ti comprises a set of tools that allow for the reading and processing of metadata contained in MAGE documents. I will discuss the addition of array data analysis tools to this package. Because these tools are aimed at biologists, I will discuss options for presenting these tools and ways to simplify the generation of MAGE-TAB.

Thursday, March 15, 2007 3:30pm -- Junmin Liu, CBIL -- MR_Ti: MAGE RAD Translator's Importer Project, part 2
Documents in the MAGE standard format contain information about microarray experiments such as those performed to generate ChIP-chip or gene expression data. MR_Ti is a framework we developed to handle the varieties of MAGE documents in different formats including MAGE-ML, MAGE-Tab, and related non-MAGE documents from GEO, SOFT and MINiML. Based on this framework we wrote several tools: the loadMageDoc GUS plugin, mage2tab (mage docs to mage-tab converter), mage2graph (mage docs visualization tool), mage-checker (mage docs validation tool). I will discuss those tools, the issues we faced during the development, and plans for new tools as well.

Thursday, March 8, 2007 3:30pm -- Trish Whetzel, CBIL -- Update on the Ontology for Biomedical Investigations (OBI) project
The Ontology for Biomedical Investigations (OBI) project, formerly named FuGO, is being developed to support the annotation of biological and biomedical investigations. OBI is being developed as a candidate OBO Foundry ontology and therefore is designed to be interoperable with other OBO Foundry ontologies, e.g. PATO, SO, GO. This presentation will highlight development efforts of the OBI project to date, including the use of a formal upper level ontology, the implementation of metadata and the overall development process and timeline.

Thursday, March 1, 2007 3:30pm -- Gary Chen, CBIL -- Modelling global gene expression and beyond
Modeling global gene expression has emerged in recent years with the rich omic data generated from high-throughput technologies, such as expression, ChIP binding, and SNP data. These omic data reveal the working mechanisms within the cell from different perspectives and light the road to understanding living systems as a whole, but they also raise the challenge of modeling and interpreting them. In general, biological features can be characterized as known or unknown statistical variables, and the relations between them can be inferred based on assumed models. I will first use a transcriptional regulation case to illustrate the variable selection and Bayesian regression techniques involved, and then go further to show that these statistical methods also hold promise for genome-wide association studies (SNP: single nucleotide polymorphism; QTL: quantitative trait locus). Related developments and applications with our COGRIM will be illustrated.

Thursday, February 22, 2007 3:30pm -- Brian Brunk, CBIL -- Toward a vision for ApiDB
I'll describe my evolving vision for ApiDB and how I envision ApiDB developing as we move forward. I'm hoping for a meeting where we can all participate in helping to crystalize thinking in this area.

Thursday, February 15, 2007 3:30pm -- Jonathan Schug, CBIL -- The ENCODE Data Coordination Center RFA
The Encyclopedia of DNA Elements (ENCODE) is an NIH-funded (NHGRI) project to identify functional sequence elements in the human genome. (There is now a parallel effort in model organisms.) The ENCODE project is nearing the end of its pilot phase, which focused on technology development and feasibility by considering only selected regions that cover 1% of the human genome. Two new requests for applications (RFAs) were released in November 2006. The first, RFA-HG-07-030, is for further experimental work, but now aimed at covering the full genome as well as pilot studies on the original 1% regions. The second, RFA-HG-07-031, is for a data coordination center (DCC) that will track, display, and make available all of the data from the ENCODE project. We are preparing an application for the DCC RFA (due March 29). I will describe the RFA requirements in more detail as well as the system we are proposing to build, and I hope for feedback from the lab.

Thursday, February 8, 2007 3:30pm -- Praveen Chakravarthula, CBIL -- Migrating OrthoMCL to GUS
I will be talking about my recent project, migrating OrthoMCL from a stand-alone research project to a production system. Currently, OrthoMCL exists as a MySQL database and a collection of Perl scripts. We're collaborating with the Roos Lab to move OrthoMCL-DB to a more stable and maintainable GUS environment.

Thursday, January 25, 2007 3:30pm -- Steve Fischer, CBIL -- A Report on the GMOD UI/Middleware Conference
The GMOD (Generic Model Organism Database) group held a conference on UI and middleware issues that genomics databases confront. I will summarize the presentations from the conference, focusing mostly on the UI issues (as they are more relevant to CBIL/GUS/ApiDB).

Thursday, January 18, 2007 3:30pm -- Jennifer Dommer, CBIL -- Mapping Epitopes to Proteins: From GenBank to ApiDB
The Immune Epitope Database and Analysis Resource (IEDB) provides epitope data we would like to use for our ApiDB related projects. However, IEDB maps epitopes to GenBank sequences using the GenBank accession numbers, which PlasmoDB and ToxoDB do not store. Additionally, the epitopes from IEDB were frequently mapped to older gene models than those contained in our databases, as many epitopes were discovered before the genome annotation stabilized. The problem we now face is how to map the epitopes from IEDB onto the proteins in PlasmoDB and ToxoDB. I will present the options we have explored to this point, and the direction we think we may take in the future.

Thursday, December 21, 2006 3:30pm -- Junmin Liu, CBIL -- MR_Ti: MAGE Importer project
MR_Ti(mporter) is a plugin to load different MAGE documents into the GUS database, or translate them into other formats. I will talk about the current status of the project, the configuration xml file and how to use it.

Thursday, December 14, 2006 3:30pm -- Dave Barkan, UCSF (Sali lab) -- Predicting host-pathogen interactions by sequence similarity
I attempt to expand the knowledge of the Human-P. falciparum protein interactome by predicting interactions based on similarity of each interacting partner to proteins interacting in other species. Experimentally validated protein interactions were taken from EBI's IntAct database; proteins involved in these interactions were searched for in sequence profiles built against human and P. falciparum targets. Matches on both sides meeting an appropriate sequence identity threshold were filtered through different biological contexts to generate a final set of predictions. My highest-confidence filter resulted in 26 predictions which are targets for experimental validation.

Thursday, November 30, 2006 3:30pm -- Greg Grant, CBIL -- Algorithms for array CGH data analysis
Array CGH technology has allowed for fine-resolution mapping of the genomic aberrations that are characteristic of tumor cells. These aberrations can be driving factors in tumor genesis and progression, and as a result profiling them with array CGH has become very popular. We have been developing algorithms for the analysis of CGH data, from pre-processing to higher-level analysis, with particular emphasis on the search for concordant aberrations across multiple samples. This lab meeting will update the status of these projects, illustrate them with examples, and also look at some other questions being asked of aCGH data that involve analytical challenges.

Thursday, November 16, 2006 3:30pm -- John Brestelli, CBIL -- Test Driven Development with emphasis on GUS Plugins
Unit Testing is a cornerstone of Extreme Programming. XP claims that unit testing causes fewer bugs, creates more maintainable code, and allows for continuous integration. I will provide an introduction to Unit Testing with examples from Perl code and describe the concepts of "Test Driven Development" and "Test Driven Maintenance." I will focus on the Perl package "PerlUnit", which is based on JUnit for Java. I will then describe how to integrate automated tests into our GUS Plugins.
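The xUnit style described above can be sketched briefly. The talk itself uses PerlUnit; the example below substitutes Python's standard unittest module, which follows the same JUnit-derived pattern, and tests a hypothetical helper, `reverse_complement`, that is not part of GUS. In a test-driven workflow the test class would be written first and the function filled in until the tests pass.

```python
import unittest


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (hypothetical helper)."""
    if set(seq) - set("ACGTacgt"):
        raise ValueError("sequence contains non-ACGT characters")
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


class ReverseComplementTest(unittest.TestCase):
    """Each test method checks one small, independent behavior."""

    def test_simple_sequence(self):
        self.assertEqual(reverse_complement("ATGC"), "GCAT")

    def test_empty_sequence(self):
        self.assertEqual(reverse_complement(""), "")

    def test_invalid_base_is_rejected(self):
        with self.assertRaises(ValueError):
            reverse_complement("ATXG")


# Load and run the tests programmatically so the outcome can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ReverseComplementTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Running the suite automatically on every change is what makes such tests usable for continuous integration.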

Thursday, November 9, 2006 3:30pm -- Shailesh Date, CBIL -- The next steps: Towards using sequence analysis and comparative genomics to study conservation of systems and host-pathogen interactions
Over the past two years, we have used a number of computational methods to investigate the interactome of the malarial parasite Plasmodium falciparum. We are now ready to expand our analyses to include more apicomplexan parasites, and use comparative genomics and sequence analysis to investigate issues such as conservation of systems and host-pathogen interactions. I will present our plans to tackle these topics, and briefly describe the results of our motif discovery method. During the course of our investigations, we have also come to better understand the limitations of some of our computational methods. I will describe how we are trying to improve our in silico analysis techniques to make them more universally applicable.

Thursday, November 2, 2006 3:30pm -- Gary Chen, CBIL -- Recent developments in COGRIM
There has been substantial recent research into the integration of biological data sources for the discovery of regulatory networks. Different approaches taken have included heuristic algorithms, linear models and probabilistic models. I will go over some representative ones and discuss the pros and cons. We proposed several comparison strategies which may help interpret the effectiveness of different approaches and highlight the strengths of our COGRIM. Recent studies on PDX1 & NeuroD1 grammar will be used to demonstrate the model control and monitoring schema in COGRIM. This work includes contributions by Jonathan Schug.

Thursday, October 26, 2006 3:30pm -- Kobby Essien, CBIL -- Integrative Approaches for Mapping Regulatory Networks in Plasmodium falciparum
Little is known about gene regulation in Plasmodium falciparum. Few transcription factors are known, approximately 5 binding sites have been experimentally validated, and only 1 gene regulatory network has been identified. I will describe some of the binding sites I have associated with various putative transcription factors and talk about approaches I am considering to rigorously integrate diverse data types into a P. falciparum regulatory network.

Thursday, October 19, 2006 3:30pm -- John Iodice, CBIL -- A Text-Search Feature for the ApiDB Sites: a WDK Web-Service Case Study
The impending release 5.2 of PlasmoDB will include the ability to find genes by using a regular expression to search the text associated with them. This is implemented with the WDK Web-Service Framework, and is a new implementation for the WSF. I will talk about the steps needed to add a web service to a WDK site, and look at the functionality and the code of the text search. If time permits, I will also present plans for TheileriaDB, a genome resource for the apicomplexan parasite Theileria parva.

Thursday, October 12, 2006 3:30pm -- Jonathan Schug, CBIL -- The further adventures of the (NP)^2 regulon
Three genes, insulin, islet amyloid polypeptide, and glucokinase, are regulated in the adult beta-cell by the transcription factors NeuroD1, Pdx1, Nkx2.2 and Pax6. We ask whether any other genes are regulated by this combination of TFs and what limitations there may be on the arrangement of binding sites for a functional CRM based on these TFs. In this presentation we focus on the preliminary evaluation of putative regulons using COGRIM, annotation enrichment analysis, and mRNA expression clustering. This work includes contributions by Gary Chen and Elisabetta Manduchi.

Thursday, September 28, 2006 3:30pm -- Debbie Pinney, CBIL -- Mouse DoTS Release 11 - Progress report
DoTS release 11 is underway and, like releases 9 and 10, clustering and assembly of the consensus transcripts will be based on alignment of EST and mRNA sequences to the genome. However, we have used the latest mouse genome release and there will be several fundamental changes to the procedure. Rather than our usual incremental update to the assemblies, we have created a new instance of GUS (musbld) and are building the assemblies from the ground up. GenBank and RefSeq mRNAs and dbEST ESTs have been clustered based on BLAT alignment to the essentially complete NCBI Release 36 (UCSC mm8) mouse (Mus musculus) reference genome derived from C57BL/6J mice. Clustered mRNAs and ESTs will be assembled to produce consensus transcripts using CAP4 assembly software, allowing individual transcripts to be assembled into more than one DT. The final DoTS consensus transcripts, many of which represent alternative splice variants, will be extensively annotated and will be further clustered via genome alignment to create DoTS genes. Following the mouse build, human DoTS build 10 will be undertaken following the same procedure using updated transcript sequences and the March 2006 human reference genome (UCSC hg18 : NCBI Build 36.1).

Thursday, September 21, 2006 3:30pm -- Jerric Gao, CBIL -- WDK Revisited - A technical review of current and future WDK
WDK is designed to accelerate the creation of data mining websites. It has been used in PlasmoDB, ToxoDB, and CryptoDB, and has been evolving along with those websites. In this presentation, I will provide a technical overview of the architecture of WDK, and review the features and limitations of the current version. Then I will focus on some of the new features in WDK: the user login module, the permanent query history module, the customization/extension capability, the interaction with WSF (the Web Service Framework), the column configuration and sorting features in the summary page, etc. Finally, I will review the installation process of WDK-related projects.

Thursday, August 31, 2006 3:30pm -- Trish Whetzel, CBIL -- Pronto - a system for the automated submission of ontology terms
Pronto is a system for automated ontology term proposal. Pronto consists of a web service, a set of web forms and a database backend. The system allows the proposal of terms via the web service or via the forms and the curation is performed using the web forms. The motivation for building this system is the ability for users of annotation applications to automatically submit user defined terms for curation and potential inclusion in the ontology. This system has been initially developed for use with the MGED Ontology and the annotation of microarray data.

Thursday, August 24, 2006 3:30pm -- Ghislain Bidaut, CBIL -- Characterization of stem cells/progenitor cells in Human Prostate and Mouse Stomach Epithelium
Adult tissues are permanently regenerated by stem cells; however, the precise locations as well as the differentiation and self-renewal capabilities of stem cells/progenitor cells are still poorly understood in most organs. On the other hand, stem cell hierarchies have been clearly established in other systems (such as hematopoietic stem cells) and can help us characterize key progenitor populations in other tissues. To this end, we integrated various microarray datasets generated by the SCGAP Consortium in several organs in mouse and human. Using a controlled vocabulary taking into account the differentiation and self-renewal capabilities of the different stem cells/progenitor cells measured, we built an integrated dataset of gene expression in several tissues in human and mouse. To capture genes having a common expression trend among tissues, expression profiles were projected on model vectors. After filtering, projected data was learned by a single-layer artificial neural network, yielding lists of markers specific to stem cell differentiation stages. As a result, the system characterized human prostate progenitors and mouse stomach epithelium progenitors. The algorithm and the interpretation of results will be presented during the meeting.

Thursday, August 17, 2006 3:30pm -- Elisabetta Manduchi, CBIL -- Computing with MGED Ontology experiment annotations
The MGED Ontology (MO) and MAGE model provide the means to store in a structured fashion all the MIAME recommended annotation for microarray experiments. We have been employing these in our RNA Abundance Database (RAD; www.cbil.upenn.edu/RAD) to provide highly curated experiments and to allow for ontology-driven queries of the data. In our curated gene expression database, each experiment has a collection of MO terms associated with it describing the experimental design and factors (the experiment _intent_). Moreover, each assay in an experiment has a collection of MO terms associated with it to describe the biomaterials utilized and the types of treatments that these were subjected to (the experiment _context_). There is very valuable information in this kind of annotation that can provide the raw material for meta-analyses aimed at facilitating an investigator's task of selecting experiments or individual assays in a database that are closely related to his/her work and possibly at guiding some of the choices made when processing or analyzing these data. We have begun investigating the former. Twenty-four published experiments made available at the EPConDB web site (www.cbil.upenn.edu/EPConDB/studyQuery.php) were manually classified into 5 groups: pancreas development and growth; targets and roles of transcriptional regulators; differentiation of insulin producing cells; islet/beta cell stimulation/injury; tissue expression, surveys and comparisons. Jaccard and Kulczynski distance measures between pairs of experiments were defined based on their annotation for design types, factor types, taxonomy, biomaterial characteristics, and treatment types. These distances were then used to define within-group and between-groups similarity measures.
Preliminary results demonstrate that the within-group similarities for the manually-defined groups were generally higher than between-group similarities and that most experiments could be correctly classified (based solely on their annotation) in a leave-one-out test using a classifier in the spirit of nearest centroids. We will present these and additional analyses examining how choice of annotations can improve classification and how the annotations can suggest alternative classifications.
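The distance measures and the leave-one-out test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the annotation terms and group labels are invented, the Kulczynski distance is taken as one minus the second Kulczynski similarity coefficient, and "in the spirit of nearest centroids" is approximated by assigning each experiment to the group with the smallest mean distance. None of these details are confirmed by the abstract.

```python
from statistics import mean


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| for two sets of annotation terms."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty annotation sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def kulczynski_distance(a, b):
    """1 - 0.5 * (|A & B| / |A| + |A & B| / |B|) (assumed variant)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0 if a == b else 1.0
    shared = len(a & b)
    return 1.0 - 0.5 * (shared / len(a) + shared / len(b))


def leave_one_out_accuracy(annotations, labels, distance):
    """Hold out each experiment and assign it to the group whose remaining
    members are closest on average (a stand-in for nearest centroids)."""
    correct = 0
    for name, terms in annotations.items():
        by_group = {}
        for other, other_terms in annotations.items():
            if other != name:
                d = distance(terms, other_terms)
                by_group.setdefault(labels[other], []).append(d)
        predicted = min(by_group, key=lambda g: mean(by_group[g]))
        correct += predicted == labels[name]
    return correct / len(annotations)


# Hypothetical annotation terms; real MO terms would replace these.
annotations = {
    "exp1": {"development", "mouse", "pancreas"},
    "exp2": {"development", "mouse", "embryo"},
    "exp3": {"stimulus", "human", "islet"},
    "exp4": {"stimulus", "human", "beta_cell"},
}
labels = {"exp1": "dev", "exp2": "dev", "exp3": "stim", "exp4": "stim"}

for dist in (jaccard_distance, kulczynski_distance):
    print(dist.__name__, leave_one_out_accuracy(annotations, labels, dist))
```

Sets are used because only the presence or absence of an annotation term matters for both measures; weighting term categories differently (design types versus treatment types, say) would need a different representation.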

Thursday, July 13, 2006 3:30pm -- Jonathan Schug, CBIL -- Identifying the NeuroD1/Pdx1 Regulon
A significant focus of the EPConDB/BCBC project is to understand gene regulation not just in mature islets but in all stages of pancreatic development. Many of the important genes in the adult islet, e.g., insulin, have been well studied with many binding sites identified by 'promoter bashing' experiments. This work reveals that the regulation of these genes is very complex. Fortunately, the BCBC members are committed to performing a number of mRNA expression level and ChIP-chip TF-binding experiments which will be placed in EPConDB. The challenge for the EPConDB team is to use a variety of data resources to dissect a rich regulatory environment and make the results useful to the BCBC members. I'll present my work towards a prototype of a process to collect, correct, and apply data to identify the NeuroD1/PDX1 regulon.

Thursday, July 6, 2006 3:30pm -- Steve Fischer, CBIL -- Review of Genome Alignment Algorithms
I will discuss what I have informally learned recently about genome alignment and three particular approaches: MUMmer, BLASTZ and Mercator. Genome alignment is being used by the ApiDB project to align intra-genus species and also different strains of a given species.

Thursday, June 8, 2006 3:30pm -- John Iodice, CBIL -- Using GUS with PostgreSQL
The Genomics Unified Schema can use either Oracle or PostgreSQL as its database back-end. Many of the database projects and web sites at CBIL are based on the GUS, but none of them use PostgreSQL. To gain in-house experience in the GUS/PostgreSQL combination, we plan to use it to develop a small web site. We will discuss the development of PostgreSQL and its use (elsewhere) with GUS. We'll also see overviews of two possible applications: a genome database for the apicomplexan bovine parasite Theileria parva and a replacement for the ApiDots site.

Thursday, June 15, 2006 3:30pm -- Bindu Gajria, CBIL/ Roos Lab -- ToxoDB, a new ball game
ToxoDB, a database resource for Toxoplasma gondii, began as a simple website that I built before taking on the PlasmoDB responsibilities (in 2002). It went through some changes and updates, mainly by Martin, to keep it on par, to some extent, with features that the PlasmoDB site was providing. Lately the ToxoDB project, as a part of ApiDB, has undergone a complete transformation. The Toxoplasma data is now in GUS, and the new site makes use of the new WDK. The upcoming release of this site, in many ways, represents success in using the ApiCommon repository and setup to create another sister site relatively quickly. I aim to highlight the key steps taken on this journey.

Thursday, June 1, 2006 3:30pm -- Greg Grant, CBIL -- Measuring and Analyzing Genomic Aberrations: High-Resolution Methods
Genomic aberrations are characteristic of cancer genomes, and some congenital diseases. The application of microarray technology has given the investigator very powerful tools over the last few years to measure these aberrations at high resolution. We will first survey the landscape of methods available for measuring gain, loss, and LOH events (we will look at six different platforms). Then we will describe our statistical focus, which is to identify aberrations that are concordant across a class of samples, and therefore might be involved in the carcinogenesis and progression of that tumor type (for example MYCN amplification indicates poor outcome in Neuroblastoma). To this end, we will describe our multiple-sample analysis package (MSA). MSA performs class analysis looking for statistically significant regions of concordant aberration across a class of samples. MSA incorporates pre-processing and higher level statistics to achieve rigorous results and clear visual depictions of the concordant aberrations.

Thursday, May 25, 2006 3:30pm -- Shailesh Date, CBIL -- Trying old tricks on new genomes
For some time now, we have been developing and extending various in silico approaches to study the genome of Plasmodium falciparum. I will be talking about the application of these approaches to genomes of other apicomplexans besides P. falciparum, such as Plasmodium vivax and Toxoplasma gondii, and higher order eukaryotes, such as Homo sapiens. Based on suggestions and requests from biologists working on the T. gondii genome, I have developed some new web-based tools that allow researchers to interact with the results of these experiments, which often take the form of complex data sets. I will also be talking about the limitations of these approaches when applied to large proteomes and proteins with complex domain organizations. In addition, I will briefly describe at least two other interesting projects that are currently underway in the lab, dealing with data representation and data mining.

Thursday, May 11, 2006 3:30pm -- Joan Mazzarelli, CBIL -- EPConDB, a web resource for gene expression and regulation related to beta cell development and function
EPConDB, a resource of the Beta Cell Biology Consortium, provides information on genes expressed in cells of the pancreas. EPConDB provides access to over twenty-five datasets from pancreatic expression studies. The datasets were derived using different microarray platforms, including the human and mouse PancChips, described on the EPConDB website, and Affymetrix gene chips. Additional datasets include RT-PCR and MPSS expression studies. For some studies, lists of differentially expressed genes, derived from the analysis of the microarray data sets, are displayed on the site. These gene lists can be queried for genes that have either up-regulated or down-regulated expression using a fold change. EPConDB can also be used to examine individual genes and their expression profiles, obtained from available expression studies. A mouse PromoterChip, described on the EPConDB website, is available for use in chromatin IP studies. EPConDB is linked to T1DBase.org, which provides tools, including GBrowse, that display the mouse promoter chip elements for each gene and predicted transcription factor binding sites. EPConDB RNA pages are also linked to Betacell.org antibody pages, describing these available protein reagents. More recently, a pancreatic transcription factor binding site query has been added to EPConDB, providing lists of predicted transcription factor target genes. These genes can be queried, based upon the available expression studies, for their expression in pancreatic tissues.

Thursday, May 4, 2006 3:30pm -- Kobby Essien, CBIL -- Searching for known core cis-elements in Plasmodium vivax
Despite their differences from other eukaryotes, domains associated with certain core transcription factors have been identified in Plasmodium species. I will describe some of my efforts to identify putative binding sites for these factors in P. vivax and some interesting cytosine-rich regions located in the process.

Thursday, April 20, 2006 3:30pm -- John Brestelli, CBIL -- Data integration in RAD
The RAD Schema was designed to capture MIAME compliant microarray expression data. I will discuss how this is stored. Also, I will describe other types of data which we are putting there and how these are being made to fit into RAD. I will briefly mention where we are getting this data and also mention some uses on EPConDB.

Thursday, April 13, 2006 3:30pm -- Debbie Pinney, CBIL -- DoTS release 11 and the future of Allgenes.org
Mouse and human DoTS was a pioneering index of transcripts that continues to be useful for several projects being conducted here at CBIL and in collaboration with other groups. DoTS consensus transcripts (DTs) have been built incrementally, primarily by the clustering and assembly of new input ESTs and mRNA with existing DTs based on their BLAST similarity. Starting with release 9, new input sequences were assembled with existing DTs based on BLAT alignment to the genome. The transition to genome-alignment-based clustering coincided with condensation of the set of consensus transcripts, which is an apparent improvement. However, several problems remain, including unresolved chimeric transcripts, undistinguished orthologs, and inclusion of obsolete and inappropriate input sequences. The recent pre-release of essentially finished mouse genome sequence and the release of FANTOM3 transcript sequences have triggered a new mouse DoTS build. I will discuss some of the alternative approaches to update the mouse consensus transcripts and to improve their quality. The most conservative approach considered is to cluster and assemble all current GenBank, dbEST, and RefSeq input sequences based on BLAT alignment to the genome. The resulting DTs would be mapped to release 10 DTs to preserve ids. If the conservative approach is taken, there are several details involving this cluster and assembly process that should be discussed. Alternatively, DTs could be built using published clustering and assembly algorithms developed elsewhere and some of these approaches will be presented. Finally, we could obtain consensus transcripts from external sources, such as TIGR or ECgene, and add our own annotation. My purpose in giving this lab meeting is to decide on the general approach and to discuss the most important details of that approach as time allows.

Thursday, April 6, 2006 3:30pm -- Elisabetta Manduchi, CBIL -- Working with scoring data for gene-condition pairs
Since the original Gene Expression Atlas 1, several other (human and mouse) gene expression tissue surveys have become publicly available. Each such survey can be used to generate gene-tissue specificity scores (e.g. Schug's Q-scores). This presents us with yet another rich resource to mine. I have been looking at two kinds of questions: (1) how to use these surveys to generate tissue-specificity profiles for gene sets, e.g. defined by GO Biological Processes (GO BPs) and to identify tissue-specific GO BPs; (2) how to exploit the availability of multiple surveys, e.g. to assess robustness of the scores used, or to guide in the choice of appropriate cutoffs, etc. I will illustrate my most recent thoughts on the above. For (1) I have been revising my original approach; there are various subtle issues. For (2), I have just started and currently have only some "embryonic" ideas. I intend to use this lab meeting as a sounding board and will appreciate feedback. Note that, although I'll focus on tissue specificity and GO BPs, the methods are applicable more generally to any gene-condition scoring dataset (where scores are "absolute"). (If (1) takes longer than expected and generates enough discussion, I'll postpone (2) to some other time.)

Thursday, March 30, 2006 3:30pm -- Jonathan Schug, Gary Chen, CBIL -- Highlights from the 2006 Cold Spring Harbor Systems Biology meeting
Jonathan and Gary will present highlights from the recent CSHL Systems Biology meeting. There is a Wiki page, CshlSystemsBiology, that contains our notes from the conference. Read it first and bring questions if you want.

Thursday, March 16, 2006 3:30pm -- Trish Whetzel, CBIL -- Building a Functional Genomics Investigation Ontology - A Lesson in Herding Cats, Chapter 1
The development of the Functional Genomics Investigation Ontology (FuGO) is a collaborative, international effort which will provide a resource for annotating functional genomics investigations, including the study design, protocols and instrumentation used, the data generated and the types of analyses performed on the data. This talk will present the results from the first FuGO workshop and the policies that have been developed.

Thursday, February 23, 2006 3:30pm -- Ghislain Bidaut, CBIL -- Transcriptional Comparison of Stem Cell Tissues
I will present the latest developments on the stem cell tissues comparison. We integrated the data generated by SCGAP members and are now searching for gene expression signatures, i.e. groups of genes that are actively linked to hematopoiesis and stem cell differentiation in several organs. The data is projected in a space defined by model vectors that correspond to over-expression in a given stem cell differentiation state (Totipotency, Multipotency, Progenitors, Lineage-Committed Progenitors, or Differentiated cells), allowing us to cluster genes expressed in three or more tissues, including Hematopoietic Stem Cells, Liver and Bone progenitors. Details of the methodology and clusters found will be presented.

Thursday, February 16, 2006 3:30pm -- Junmin Liu, CBIL -- FuGE, MAGEv2 and the ontology: conceptual data modeling and ontology engineering
I will briefly introduce the FuGE model, how MAGE v2 extends it, and the relationship between MAGE v1 and the MGED Ontology. I will also talk about the similarities and differences between data modeling and ontology engineering, and about my experiences with reusing external ontologies in the MGED Ontology through namespace importing and version control for ontology development.

Thursday, February 9, 2006 3:30pm -- Aaron Mackey, Roos lab/ CBIL -- CDAT: character data and trees
CDAT is a Tangram-based, user-customizable relational schema and Perl software library for the storage and analysis of phylogenetic datasets (Character Data And Trees) maintained within the context of multiple annotated genomes. With such large scale phylogenomic information stored in CDAT, users can (programmatically) construct evolutionary metaanalyses of intron gain/loss events, horizontal gene transfer, positive selection, protein domain evolution, gene fusions, etc. I will spend some time describing the CDAT architecture, and how it provides a very simple OO-Perl interface and query language, while also allowing sophisticated "raw" SQL-based data access. This work will hopefully provide some fresh thinking about phylogenetic schema alternatives as well as user-oriented object-modeling, either of which might motivate future GUS development.

Thursday, February 2, 2006 3:30pm -- Sarah Cohen Boulakia, Davidson / Database Group -- BioGuide: Supporting the scientist during the selection of sources and tools
The life sciences are continuously evolving: the number and size of new sources providing specialized information in biological sciences have grown significantly in the last few years, as has the number of tools required to carry out bioinformatics tasks. As a consequence, scientists are increasingly confronted with the problem of selecting appropriate sources and tools. Following a thorough analysis of scientists' needs during the querying process, we found that biologists express preferences concerning the sources to be queried and the tools to be used. Interviews also showed that the querying process itself -- the strategy followed -- differs between scientists. In response to these findings, we have designed BioGuide, a user-centric framework that helps scientists choose sources and tools according to their preferences and strategy, by specifying queries through a user-friendly visual interface. Recently, we have developed a module to enable the use of BioGuide on top of the EBI SRS platform in order to automatically obtain instances of data.

Thursday, January 26, 2006 3:30pm -- Gary Chen, CBIL -- Learning conserved tissue specific regulatory rules
Tissue-specific transcriptional regulation drives the differentiation of cells and tissues during development, and maintains correct physiological functions. We report here a comparative genomics approach to address the problem of conserved tissue-specific regulation for pancreas and muscle. Grammar learning is applied to report over-represented TF solos and pairs. Our analysis suggests that multiple combinatorial rules exist to specify transcription in these tissues.

Thursday, December 15, 2005 3:30pm -- Shailesh Date, CBIL -- Functional Genomics of Apicomplexa
Using Bayesian techniques, we have combined different functional genomics data to create a functional interaction map of the malarial parasite Plasmodium falciparum. This project, titled 'plasmoMAP', aims to provide users with functional information about the proteome, as well as provide a model of the parasite interactome. Details can be accessed via the plasmoMAP website (http://cbil.upenn.edu/plasmoMAP). I will be describing the latest plasmoMAP developments, along with details of investigation into gene families of P. falciparum.

Thursday, December 1, 2005 3:00pm -- Junmin Liu, CBIL -- SOFG Anatomy Entry List (SAEL) web service
I will talk about the SAEL web service architecture, CBIL use cases and implementations for it.

Thursday, November 10, 2005 3:30pm -- Gary Chen, CBIL -- Identifying functional regulatory networks by integrating heterogeneous biological data
The computational approaches that are used to identify regulatory modules and networks have traditionally used information either from expression data or from sequence features (ChIP binding data or binding motif data) of transcription factors (TFs). Although those approaches have proven useful, their power is inherently limited by the fact that each data resource provides only partial information: expression data provides only functional or indirect evidence, whereas binding data or binding motifs only provide physical location information. Recent efforts at integrating these data types have drawbacks, such as arbitrary parameter cutoffs or heavy reliance on heuristics with little systematic modeling. We present a Bayesian hierarchical model and Markov Chain Monte Carlo implementation that integrates heterogeneous information, including expression data and sequence features, in a principled and robust fashion. Applied to genome-wide ChIP binding data and approximately 500 expression experiments on S. cerevisiae, our model, COGRIM, successfully captures essential regulatory activities. I will review COGRIM briefly and report our ongoing progress on integrating ChIP binding data, TFBS scan data and expression data all together.

Thursday, November 3, 2005 3:30pm -- Kobby Essien, CBIL -- Identification of transcriptional regulators and cis-elements in apicomplexa via comparative genomics
Little is known about gene regulation in apicomplexa. In an attempt to shed light on how these parasites control their gene expression I will discuss my efforts to identify regulators and putative cis-elements in the phylum by integrating different data types.

Thursday, October 27, 2005 3:30pm -- Joan Mazzarelli, CBIL -- Novel genes identified by manual annotation and microarray expression analysis in the pancreas.
The mouse PancChip, a microarray developed for studying endocrine pancreatic development and diabetes, represents over 13,000 cDNAs. After computationally assigning the cDNAs on the array to known genes, manual curation of the remaining sequences identified 211 novel transcripts. In microarray experiments, we found that 196 of these transcripts were expressed in total pancreas and/or pancreatic islets. Of 50 randomly selected clones from these 196 transcripts, 92% were confirmed as expressed by qPCR. We evaluated the coding potential of the novel transcripts and found that 74% of the clones had low coding potential. Since the transcripts may be partial mRNAs, we examined their translated proteins for transmembrane or signal peptide domains, and found that about 40 proteins had one of these predicted domains. Interestingly, when we investigated the novel transcripts for their overlap with non-coding microRNAs, we found that one of the novel transcripts overlapped a known microRNA gene.

Thursday, October 20, 2005 4:00pm -- Trish Whetzel, CBIL -- Ontologies for functional genomics
With the advent of high-throughput genomics, scientific communities have developed data standards, including object models and ontologies, to manage the wealth of data from these experiments. The first of these groups, the MGED Society, has developed a checklist for reporting requirements (MIAME), an object model (MAGE-OM) and supporting ontology (MGED Ontology). Since then, many other communities, such as the Proteomics Standards Initiative (PSI), Reporting Structures for Biological Investigations (RSBI), and the Metabolomics Society, have been working to develop data standards for their communities. In an effort to combine the universal aspects of these different technological and biological domains, a Functional Genomics Ontology will be developed to provide the semantic glue between these disparate data types.

Thursday, October 13, 2005 3:30pm -- Ghislain Bidaut, CBIL -- Tissue specificity in stem cells
I will present data analysis on the SCGAP data. Understanding and elucidating stem cell differentiation and self-renewal mechanisms is an essential step towards their application. To this end, we created a compendium dataset from gene expression data measured by the SCGAP (Stem Cell Genome Anatomy Projects - http://www.scgap.org) consortium participants. The data spans various stem cell populations (hematopoietic SCs, liver, prostate, bladder, kidney and bone, in mouse, human, and zebrafish). For the comparison of gene expression in different tissues in a global fashion, all experiments were normalized in terms of expression call and annotations: stem cell populations were annotated with a controlled vocabulary describing stem cell differentiation stages (multipotent, totipotent, progenitors, lineage-committed progenitors, differentiated cells). Finally, a heat map was generated for visualization and clustering, allowing the study of the compendium using standard microarray analysis algorithms. To overcome the problem of heterogeneous data, we measured the enrichment of KEGG pathways and created a map of significantly enriched cellular processes across all the populations. Preliminary results show conserved modules shared by all stem cell populations as well as modules specific to a restricted set of tissues.

Thursday, October 6, 2005 3:30pm -- Regina Gorski, CBIL / Kaestner Lab -- Derivation of antibodies to cell surface antigens of the endocrine pancreas
One of the current obstacles in monitoring beta-cell mass and beta-cell regeneration is the inability to accurately identify beta-cells and their progenitors. Antibodies specific to different stages of beta-cell maturation would be useful for imaging and sorting of progenitor cells and their descendants and would also be useful for monitoring beta-cell mass in vivo. Due to the lack of available antibodies specific to cell surface proteins in beta-cells, we have used a bioinformatics approach to determine possible beta-cell surface antigens. We exploited the Endocrine Pancreas Consortium mouse and human cDNA libraries and the DoTS database to identify potential beta-cell surface antigens. Subsequently, we used genetic immunization to develop an antibody to interferon induced transmembrane protein 2 (Ifitm2), which we show is expressed in the beta-cell lineage of the endocrine pancreas.

Thursday, September 29, 2005 3:30pm -- Elisabetta Manduchi, CBIL -- Tissue-GO Biological Process Productions: assessments, revisions, issues
This is a follow-up to the work presented in my May 5 lab meeting. I will discuss some of the steps in that approach that I have been modifying and re-assessing, with a particular focus on the issue of how to best utilize human-mouse conservation in the learning process.

Thursday, September 22, 2005 3:30pm -- Steve Fischer, CBIL -- Understanding the Plugin API
The Plugin API is the set of methods available from the Plugin.pm superclass. I will discuss the API, strategies for using it, and its rationale.

Thursday, September 8, 2005 3:30pm -- Thomas Gan, CBIL -- A discussion of data integration strategies in WDK
Building ApiDB-related websites has presented WDK with data integration challenges. We have implemented boolean questions and DBMS-level database federation, and are currently working on query history plus boolean expressions. Here I would like to take a step back and review current, industry-standard data integration strategies. I will briefly review the three E's (ETL, EII, EAI), SOA, and semantic web technologies. Hopefully this will be a brainstorming session that inspires a vision for the direction of WDK as an even more powerful data integration tool ready for the next generation web.

Thursday, September 1, 2005 3:30pm -- Greg Grant & Mitch Guttman, CBIL -- The processing and analysis of microarray copy-number aberration (CNA)
Cancer cells tend to exhibit large scale genomic alterations, such as deletions of any possible length - sometimes as long as an entire chromosome - or duplication of genetic material sometimes leading to many copies of the same sequence. These alterations are known or believed to be driving factors in many tumors. Microarrays offer new methods of interrogating such changes on a whole genome scale at fairly high resolution. Just as with gene expression microarray data, data processing issues are complicated. Once the alterations are sufficiently well mapped, the goal becomes determining which alterations are common to particular phenotypes. We are developing software for both pre-processing and downstream analysis of CNA data. Both have been implemented as "user friendly" GUIs in Java. We will describe the biological problem, the data, the pre-processing issues, and the statistical problem. The software will be demonstrated on real data.

Thursday, August 18, 2005 3:30pm -- John Iodice, CBIL -- Old Wine in New Skins: PlasmoDB Development Plans
Years of work on the part of CBIL developers and others have provided PlasmoDB with a rich complement of features and data. Our challenge now is to bring the latest generation of our Web development tools to bear on this large body of legacy data. Issues include porting to the new WDK, redesigning the PlasmoDB user interface, moving to GUS 3.5, creating a build pipeline, removing legacy data, incorporating new annotation for existing species and adding new ones, and putting the whole site under Subversion source control.

Thursday, August 11, 2005 3:30pm -- Jennifer Dommer, CBIL -- Updating GUS, TESS Queue, and RAD plug-ins
As a work-study student at CBIL since early this year, I have been fortunate enough to work on several interesting and challenging projects. I will be taking the opportunity at Thursday's lab meeting to provide an overview of my work on three of these projects that I feel will be of interest to the lab: the updating of GUS to 3.5, the creation of a TESS Queue, and my current work creating plugins for the RAD group. The focus of the presentation will be on introducing the lab to the TESS Queue.

Thursday, August 4, 2005 3:30pm -- Jonathan Schug, CBIL -- AUC P-values & Cross Entropy Optimization
We've been using the Kolmogorov-Smirnov distribution to assess the statistical significance of ROC curves. We are interested in improving upon this method, in understanding the effect of over-learning in complex models, as well as gaining insight into the effects of sample size and sequence length. Serendipitously, I stumbled on a book, 'The Cross-Entropy Method', in the bookstore, which describes a clever technique for solving discrete or continuous optimization problems using importance sampling. We apply Cross-Entropy to more efficiently implement permutation-based assessments of AUC p-values.
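For reference, the baseline that the Cross-Entropy approach aims to speed up, a plain permutation test for the AUC, can be sketched as follows. This is a brute-force illustration only and does not implement the Cross-Entropy method; the function names, the number of permutations, and the add-one p-value estimate are assumptions made for this example.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve: the fraction of (positive, negative)
    pairs in which the positive scores higher; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def permutation_p_value(scores, labels, n_permutations=2000, seed=0):
    """One-sided p-value: how often a random relabelling of the samples
    reaches an AUC at least as large as the observed one."""
    rng = random.Random(seed)
    observed = auc(scores, labels)
    shuffled = list(labels)
    at_least = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if auc(scores, shuffled) >= observed:
            at_least += 1
    # The add-one correction keeps the estimate strictly above zero.
    return observed, (at_least + 1) / (n_permutations + 1)


# Toy data: five positives that all outscore five negatives.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
observed_auc, p_value = permutation_p_value(scores, labels)
print(observed_auc, p_value)
```

The cost of this baseline grows with the number of permutations needed to resolve small p-values, which is the kind of rare-event estimate that importance sampling is designed to make cheaper.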

Thursday, July 28, 2005 3:30pm -- Mike Saffitz, CBIL -- CBILBLD 3.5 upgrade overview
Plans for moving the current GUS database to the GUS 3.5 schema will be discussed. Included will be an introduction to the administrative toolkit in GUS 3.5.

Thursday, July 21, 2005 3:30pm -- Aaron Mackey, Roos Lab -- Using WDK, GBrowse, and GUS 3.5 to Build a ToxoDB site
In 16 hours of coding, 8 people working in 4 groups were able to use the new WDK, the GUS adapter for GBrowse from CryptoDB, and a fresh copy of GUS 3.5 preloaded with SAGE tags to set up a working ToxoDB web site that is the seed for replacing the production ToxoDB web site.

Thursday, May 26, 2005 3:30pm -- Regina Gorski, CBIL/Kaestner -- Genome Wide Location Analysis
Gene expression is controlled by the recognition of specific sequences by transcriptional regulatory proteins. Genome wide location analysis is a method used to find sites bound by transcriptional regulators with the goal of elucidating transcriptional regulatory networks. There are three popular approaches to genome wide location analysis. We've focused on one method, ChIP-chip, where we've combined chromatin immunoprecipitation with a DNA microarray we've developed containing portions of promoter regions. We are currently expanding the microarray to include various regulatory elements.

Thursday, May 12, 2005 3:30pm -- Greg Grant, CBIL -- Comparison of six amplification protocols for Affymetrix array sample


Thursday, May 19, 2005 3:30pm -- Kobby Essien, CBIL -- Identifying Putative Transcriptional Regulators in the Phylum Apicomplexa
Little is known about transcriptional control in parasitic Apicomplexans. I will discuss computational efforts to identify Apicomplexan regulators based on the presence of known DNA-binding domains and orthology of proteins to regulators from other organisms.

Thursday, May 5, 2005 3:30pm -- Elisabetta Manduchi, CBIL -- An approach to identify regulatory modules for tissue-specific transcripts sharing a tissue-specific Gene Ontology Biological Process.
The regulation of eukaryotic genes remains poorly understood. We have utilized recent human and mouse tissue survey microarray datasets to discover putative regulatory modules characterizing genes preferentially expressed in a given tissue and also sharing a biological process that is highly correlated with that tissue. Here a regulatory module may be described by expressions of the form: {binding sites for transcription factors A, B and C within 300 base pairs of each other} etc. For a given tissue survey, we utilize a Shannon entropy based method to attach to each gene representative a score for each of the surveyed tissues. This score reflects both the gene's overall tissue specificity (i.e. how much its expression pattern differs from ubiquitous uniform expression) and its categorical specificity, i.e. its specificity to that particular tissue. We then utilize these scores to identify Gene Ontology (GO) Biological Processes that are significantly specific for a given tissue. For a given tissue and a given such GO biological process, we construct a suitable positive training set and negative training set of promoter sequences, which are then input into a grammar-based approach to identify discriminating modules. We have refined this approach by exploiting human/mouse consistency to establish a final set of potentially relevant grammar productions (each representing a regulatory module), which we subsequently use to build a classifier using random forests.

Thursday, April 14, 2005 3:30pm -- Thomas Gan, CBIL -- Master of the beasts, servant of the blindmen
A discussion of the controller and view components of GUS Web Development Kit. A look into the past, the present, and the phuture.

Thursday, April 21, 2005 3:30pm -- Shailesh Date, CBIL -- Computational analyses of the Plasmodium falciparum genome - Part II
We have integrated experimental and computational functional genomics datasets with a Bayesian framework to reconstruct the functional interaction network of the P. falciparum genome. This interaction map provides functional information for 68% of the genome at varying levels of confidence, including more than 2000 genes which are as yet uncharacterized. The map can be directly used to estimate gene function, and understand relationships between genes on a genome-wide scale. Additionally, we have endeavored to identify conserved functional linkages between apicomplexans by superimposing the map on genomes of 3 other apicomplexan parasites.

Thursday, March 31, 2005 3:30pm -- Gary Chen, CBIL -- A Summary on 2005 CSHL Systems Biology: Global Regulation of Gene Expression
Topics presented include: Computational Approaches to Identifying Cis-Regulatory Elements; Advances in Detection of Transcription Factor/DNA-Interactions; Transcriptional and Posttranscriptional Network Modeling; Comparative Genomics of Global Gene Regulation.

Thursday, March 17, 2005 3:30pm -- Junmin Liu, CBIL -- RAD StudyAnnotator refactoring
The RAD StudyAnnotator 1.0 is a PHP package providing a user interface to annotate microarray experiments in RAD. Refactoring SA 1.0 is a result of observing some programming philosophies: "modularization", "data comes up, dependence goes down" and MVC. I will present the refactoring tasks and the rationales behind them. I will also briefly discuss the future direction of RAD SA, including seamlessly integrating the RAD SA and RAD Querier using portal technology. A brief survey of portal engines like Liferay, Jetspeed and PHP-Nuke will be presented.

Thursday, March 3, 2005 3:30pm -- Jonathan Schug, CBIL -- Transcription Factors in the Liver-Specific Promoter
Here we apply Q to identify liver-specific genes then analyze their promoters for enriched TFs. We consider genes with and without CpG islands, as well as two ranges of specificity. In the second phase, enriched TFs are combined to identify combinations and arrangements that are enriched. We consider the effects of the number and order of TF binding sites.

Thursday, March 10, 2005 3:30pm -- Aaron Mackey, Roos Lab -- GLEAN: Improved eukaryotic gene prediction by statistical consensus of gene evidence
Computational prediction of eukaryotic gene structure continues to be a challenge. Genome sequencing projects thus require human curation to yield a high-quality structural annotation; this effort is based on a consensus obtained from examining multiple sources of gene evidence, including alternative gene models, homologous sequence alignments, and any available functional genomic resources. We have developed an algorithm, GLEAN, for use as a first-pass structural annotation tool. GLEAN uses latent class analysis (LCA) to estimate false positive (FP) and false negative (FN) rates for identification of start, stop, donor, and acceptor sites by gene model predictions, homologous alignments, and any other evidence that might provide support for particular sites (such as SAGE tags, partial proteomic alignments, and microarray expression data). Unlike similar methods, however, LCA does not require independent training to estimate these parameters; rather, the observed pattern and extent of (dis)agreement between sources of evidence is a reflection of the unknown involvement of each site, allowing maximum likelihood (ML) estimates of FP and FN rates to be acquired directly. The posterior probability of a site's involvement in a gene is thus based on the sources of evidence that support the site and the ML-estimated FP/FN rates determined for each source. GLEAN uses a dynamic programming algorithm to generate consensus gene models made up of sites that maximize the overall probability of site usage, while concomitantly providing gene-, exon-, and site-specific probabilistic confidence scores. Performance of GLEAN using various sources of evidence from the Toxoplasma gondii, Plasmodium falciparum, and Drosophila melanogaster genome projects demonstrates that GLEAN predictions in these genomes more closely reflect the human-curated structural annotation than any other source of gene predictions. 
Thus, GLEAN predictions will likely be useful for the initial, automated structural annotation of tomorrow's unannotated genomes, particularly those with limited (or nonexistent) training sets.

Thursday, February 24, 2005 3:30pm -- Joan Mazzarelli, CBIL -- Novel genes expressed in the mouse pancreas discovered by manual and computational analysis
The mouse pancreas-enriched microarray or mouse PancChip provides a tool for studying pancreatic development and diabetes research. In designing the PancChip, cDNA clones from mouse pancreas libraries were picked to represent unique mouse transcripts as identified by computational analysis using DoTS. Manual and computational annotation of the transcripts has identified novel transcripts. Using the PancChip, we show that these novel transcripts are expressed in purified islets and the pancreas.

Thursday, February 10, 2005 3:30pm -- Mike Saffitz, CBIL -- GUS 3.5
A status report will be given on the upcoming release of GUS 3.5. Discussion will include an overview of changes and their impact, lessons learned from the GUS community, and the release process for GUS 3.5.

Thursday, February 3, 2005 3:30pm -- Steve Fischer, CBIL -- A proposal for the GUS 4.0 Schema - work in progress
I will present the proposal generated by the GUS Schema Party Collaborative for modifications to the GUS Schema, oriented towards a 4.0 release. So far the effort has covered approximately half of the DoTS schema. The proposal includes a categorization of tables, and modifications to the schema including renaming tables, combining tables and introducing some new schema design patterns and corresponding tables.

Thursday, January 27, 2005 3:30pm -- Bindu Gajria, CBIL -- PlasmoDB as a production community resource
PlasmoDB started largely as an academic web site tying together a variety of resources relating to Plasmodium falciparum. Now, PlasmoDB is established as the central resource of genomics and functional genomics datasets for several Plasmodium species. As a world-wide community depends on PlasmoDB, its development and maintenance have required significant project management to effectively meet that community's needs. The path to making PlasmoDB a true production resource will be presented.

Thursday, January 6, 2005 3:30pm -- John Iodice, CBIL -- The ApiDoTS project
The ApiDoTS project estimates the transcriptomes of apicomplexan parasites by clustering and assembling ESTs and mRNA sequences from the dbEST and INV divisions of GenBank. The resulting database is published on a website which was created with CBIL's Web Development Kit (Classic). I will describe the history of the project, summarize its current state, and outline plans for release 5, currently in progress, which will include data for 13 species of 8 genera.

Thursday, December 16, 2004 3:30pm -- Dave Barkan, CBIL -- The OrthoMCL Pipeline
I will report on efforts to streamline Li Li's Ph.D. research on using the MCL algorithm to cluster orthologous sequences from multiple species. Details will include integrating the project into the GUS Pipeline API, collaboration with the Roos lab to run the process on different platforms, and an in-depth look at CBIL's new architecture for managing data from external sources, including its trial by fire to load genomes from 80 species into GUS. The fact that Li is now working at a department to which I just submitted my application for graduate school is purely coincidental.

Thursday, December 9, 2004 3:30pm -- Jonathan Schug, CBIL -- Tissue Specificity - Who, What, Where, Why and When: Episode 3 in Further Adventures with Entropy
We continue the goal of studying tissue-specific genes by investigating what they are doing, where they are going, and why they are tissue-specific. First, some highlights from a recently submitted paper about the who, what and where. Second, results about the (lack of) importance of binding sites in the core promoter for the control of tissue specificity (why). Finally, a look ahead to what we can find in the first 1 kb of promoter of liver, pancreas, and muscle genes.

Thursday, December 2, 2004 3:30pm -- Aaron Mackey, Dept. of Biology (Roos Lab) -- Structural and Functional Annotation of the Toxoplasma gondii Genome
The genome of the Apicomplexan T. gondii has been shotgun-sequenced to approx. 10X coverage; WGS scaffolds have been super-assembled into chromosomes via classical genetic mapping, and refined computationally with BAC-end mapping. Various gene structure prediction algorithms have been trained and used for structural annotation of the genome; homologues found within EST assembly and protein databases have been aligned with the genome for both structural and functional annotation. Results from a SAGE tag sequencing project are used to inform the structural annotation. Because our analyses and data sources are currently in flux, we chose to implement the GMOD project's GBrowse application (and accompanying dbGFF relational database schema) for lightweight integration, visualization and navigation of these diverse datasets. I will discuss the motivation and use of GBrowse (and associated aspects of the BioPerl toolkit) in the context of our lab's ongoing efforts to rapidly finish a first-draft annotation of the T. gondii genome.

Thursday, November 11, 2004 3:30pm -- Y. Thomas Gan, CBIL -- Annotating genomes with DoTS
The sequencing of many genomes is complete or nearly complete, while their annotation lags far behind. DoTS, a human and mouse transcript index that integrates a rich collection of expression and functional information, is an excellent resource for genome annotation. In addition to the "dots build" pipeline that clusters and assembles EST/mRNAs into DoTS transcripts (DTs), we have built a "dots gene" pipeline to annotate human and mouse genomes using DTs. I will review the current status of the pipeline, and discuss its applications in the mouse chr5 annotation project (a collaboration with Dr. Bucan's lab).

Thursday, October 28, 2004 3:30pm -- Gary Chen, CBIL -- Assessing tissue-specific transcriptional regulatory modules and networks
The development of multi-cellular organisms is, to a large extent, dictated by a carefully choreographed progression of domain- and tissue-specific gene expression. There is great interest in understanding the transcriptional program of tissue differentiation and development given the importance of tissue function and disease. I will illustrate Bayesian models, MCMC algorithms, and other heuristic ideas that have been used to predict regulatory modules and networks with various data resources. In particular, I will focus on our recent progress on (1) utilizing comparative genomics techniques to identify gene candidates potentially regulated by specific transcription factors and their modules, and (2) developing a statistical model using the Gibbs sampling method to combine all available information from experiments (e.g. expression, ChIP-chip data) and genomic sequences to identify co-regulated genes and their regulatory modules. The muscle-specific SRF-CArG box (SRF: serum response factor) will be presented as a case study.

Thursday, October 21, 2004 3:30pm -- Angel Pizarro, CBIL -- GUS Proteomics
A look at the current state of standards for proteomics data and how these fit in with GUS.

Thursday, September 16, 2004 3:30pm -- Shailesh Date, CBIL -- Annotating uncharacterized genes in the Plasmodium falciparum genome
A number of genes in the many completely sequenced genomes have yet to be characterized. In certain genomes, like that of the malarial parasite P. falciparum, a striking 60% or more of the genes are missing confident annotations. Absence of information about such large numbers of genes not only hinders our ability to understand the biology of the organism in greater detail, but also prevents us from exploring the various biochemical pathways and cellular systems for new targets of possible pharmaceutical importance. To address this issue, we applied various functional genomics methods, both experimental and computational, to the P. falciparum genome, and subsequently integrated the results from the individual methods, within a Bayesian framework, to reconstruct a high confidence interaction map of the genome. Using this map, we were able to explore functional assignments of more than 600 genes, about 100 of which are as yet uncharacterized.

Thursday, September 2, 2004 3:30pm -- Hongxian He, CBIL -- Temporal Profile of Differential Gene Expression Following Glutamate Exposure in Cultured Hippocampal Neurons
Glutamate is the most common excitatory neurotransmitter in the brain, involved in physiologic processes of learning and memory. It is known that high levels of glutamate exposure can be toxic to neurons, but less is known about how gene expression changes after a modest exposure to glutamate that does not produce significant cell death. The understanding of genetic changes produced by a non-toxic level of glutamate signaling may shed light on the processes of physiologic learning and memory. In this study, we are examining in detail the differential gene expression of cultured hippocampal neurons in the early and intermediate time periods following exposure to 10mM glutamate/10mM glycine. The ANOVA approach and clustering method were used to identify temporal changes in the gene expression. The results showed that glutamate exposure in the 10mM range for 30 minutes produces significant gene expression changes in cultured hippocampal neurons. Several interesting temporal profiles have also been identified. This study provides the important background data on which a detailed analysis of gene expression events following mechanical stretch injury in cultured primary neurons can be based.

Thursday, August 26, 2004 3:30pm -- Greg Grant (with Sharon Diskin and Tom Eck), CBIL -- Statistical significance testing for aCGH data.
Array Comparative Genomic Hybridization data (aCGH) is microarray data designed to measure genomic alterations, such as large scale deletions or duplications, which are common for example in tumor cells. Clones spotted on the array are genomic sequences which (approximately) tile the genome. The data is geometric in nature and presents a difficult problem for significance testing. We have developed a method, using permutation p-values, for locating significant regions of alteration. We will describe the algorithm, the implementation, and illustrate the methods on neuroblastoma cell line data.

Thursday, August 19, 2004 3:15pm -- Elisabetta Manduchi, CBIL -- Discovering regulatory modules by creating profiles for Gene Ontology Biological Processes based on tissue-specificity scores.
I'll be describing recent work, done in collaboration with Jonathan Schug and Chris Stoeckert, utilizing recent human and mouse tissue surveys to investigate tissue-specificity of gene sets. For a given tissue survey, we first attach to each gene representative (i.e. each spot on a microarray) a score for each of the surveyed tissues. This score is based on Shannon entropy and reflects both the gene's overall tissue specificity and its categorical specificity. We then utilize the rankings of genes in each tissue according to this score to define suitable tissue-specificity profiles for gene sets of interest. These gene sets could be, for example, sets of genes with given GO biological process annotations or any other a priori defined gene sets. These profiles can be used to select sets consisting of genes sharing a relevant biological property (e.g. a biological process) and showing specificity for a given tissue as a whole. The genes in such sets might share regulatory modules. Using a grammar-based approach, we analyze the upstream regions of such genes to identify putative regulatory modules.

Thursday, July 15, 2004 3:30pm -- Hongxian He, CBIL -- Using Order Statistics to Identify Functionally Relevant Gene Neighborhoods in Hematopoietic Stem Cell Regulation
In order to study the molecular mechanism underlying the regulation of hematopoietic stem cells (HSCs), we have identified pairs of genes that are co-expressed across the hematopoietic developmental hierarchy (HSCs, LCPs: lineage-committed progenitors, MBCs: mature blood cells) in each of the four human tissues (BM: adult bone marrow, FL: fetal liver, CB: umbilical cord blood, PB: mobilized peripheral blood) and whose co-expressions are also preserved across all four tissues, using microarray data on the Affymetrix HG-U95 set. The preservation of co-expression implies that these genes are functionally related in the process of HSC regulation. This work is analogous to that of Stuart et al. (Science, 302: 249-255, 2003) where a gene co-expression network was constructed by identifying genetic modules conserved across different tissues. We computed the significance of the co-expression and correlations of a pair of genes via the technique of order statistics, resulting in an interaction p-value for every pair. This p-value reflects how strongly a pair of genes is functionally associated in the regulation of HSCs. We connected any pair of genes whose Bonferroni corrected interaction p-value is below 0.001 to form a network. To assess the confidence of these interactions (edges), we further utilized the property of high clustering coefficient of small-world networks. For each edge, we computed the clustering coefficient as the cumulative hypergeometric probability of observing at or above the number of mutual neighbors given the neighborhood sizes of the two vertices around the edge and the total number of genes in the network. We then combined the evidence from co-expression (interaction p-value) and local network topology (high clustering coefficient) in a Bayesian probabilistic framework and used the clustering and EASE analyses to identify functionally enriched gene neighborhoods.
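
The per-edge hypergeometric tail probability described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function and parameter names are hypothetical, and whether the two endpoints of the edge are counted in the population size is left to the caller.

```python
from math import comb

def hypergeom_tail(mutual, size_u, size_v, total):
    """P(X >= mutual) for X ~ Hypergeometric(total, size_u, size_v).

    mutual : observed number of neighbors shared by the two vertices
    size_u : neighborhood size of the first vertex
    size_v : neighborhood size of the second vertex
    total  : number of genes the neighbors are drawn from (assumed >= size_u)
    """
    upper = min(size_u, size_v)
    denom = comb(total, size_v)
    # Sum the hypergeometric probabilities from the observed overlap upward.
    numer = sum(comb(size_u, k) * comb(total - size_u, size_v - k)
                for k in range(mutual, upper + 1))
    return numer / denom
```

A small value means the two genes share more neighbors than expected by chance, which is the sense in which an edge sits in a tightly clustered neighborhood.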

Thursday, July 8, 2004 3:30pm -- Mike Saffitz, CBIL -- CBIL and GUS and 10g, Oh My!
An all-encompassing look at CBIL, GUS, and Oracle 10g. Topics include: a review of the 4th OLSGU meeting, exciting 10g features, results of the Workspace Manager tests, some tips on working with Oracle, and a look forwards to GUS 3.5 and beyond.

Thursday, July 1, 2004 3:30pm -- Greg Grant and Junmin Liu, CBIL -- Introducing PaGE 5.0 - an update to our tool for microarray gene expression analysis and a software engineering lesson learned during the development process
PaGE is an algorithm for associating patterns to genes across multiple conditions. PaGE bases its patterns on the False Discovery Rate (FDR) and as such can also be used as a straight differential expression analysis algorithm. We will describe the differential expression problem as it relates to microarray data, the FDR methodology, our permutation approaches, and we will demonstrate our new Java and Perl implementations of the application.

Thursday, June 24, 2004 3:30pm -- Regina Gorski, CBIL/ Kaestner Lab -- Analysis of mammalian transcriptional networks using orthogonal datasets
Recently, transcription factor location analysis has been performed suggesting that the transcriptional regulators HNF1alpha, HNF4alpha, and HNF6 function as master regulators of hepatocyte and pancreatic islet transcription. To evaluate this circuitry, we performed expression profiling on islets deficient for HNF4alpha. Strikingly, the expression of most HNF4alpha targets identified by location analysis was not significantly altered in the absence of HNF4alpha in vivo. Conversely, the majority of genes dependent on HNF4alpha were not bound by the factor in the proximal promoter. Our results illustrate the importance of complementing genome-wide location analysis with expression profiling to elucidate the complex regulatory networks in metazoan organisms.

Thursday, June 10, 2004 3:30pm -- Mike Saffitz, CBIL -- The Genomics Unified Schema and Application Framework
Preview of talk to be given at the 4th Oracle Life Sciences Users Group Meeting in Reston, VA (see http://otn.oracle.com/industries/life_sciences/olsug/olsug_june2004.html)

Thursday, May 27, 2004 3:30pm -- Kobby Essien, CBIL -- Towards the computational identification of putative cis- and trans- regulatory factors in Plasmodium falciparum.
Plasmodium falciparum, the most deadly human malaria parasite, exhibits cyclic gene expression patterns during its life cycle, suggesting complex regulatory control. However, only 14 P. falciparum transcription factors have been identified, and the genes they exert their influence on remain unknown. I will present my efforts to use gene orthology, sequence conservation and motif finding to locate putative transcription factors, cis-regulatory elements and regulons in P. falciparum.

Thursday, June 3, 2004 3:30pm -- Ela Hunt, University of Glasgow -- Data integration for functional genomics
SyntenyVista visualization software provides integrated access to synteny data, microarray probe mappings, QTL data for the rat, human and mouse, and to overviews of microarray experiments. Sequence-level browsing will be integrated via the incorporation of multiple-alignment and sequence level viewers. In sequence indexing we are testing an on-disk index to the rat genome. The aim is to enable faster exhaustive microarray probe mapping, as well as phage display dataset analysis. In data integration we are developing a data mining approach for the integration of GO, Ensembl, PubMed, OMIM, Affy, RGD, MGI and other data which is already in XML format. We will use data mining to find redundancies in data and generate a set of mappings between data sources in order to remove redundancies and to merge the data. Finally, we are developing databases which will enable the comparison of microarray and proteomics results, and integrate data generation with the day-to-day running of the proteomics facility. [Dr. Hunt is a guest speaker researching databases and visualization supporting functional genomics techniques.]

Thursday, May 13, 2004 3:30pm -- Debbie Pinney, CBIL -- Hot Spots for Gene-Trap Insertion?
Reverse genetics starts with genes and mutates them in order to study the resulting phenotype. One reverse genetic approach, gene-trap mutagenesis, relies on insertional mutagenesis of genes in Embryonic Stem cells with selectable vector sequences. Gene-trap mutagenesis can be done on a large scale and the disrupted genes identified by the production of cDNA sequence tags originating from primers corresponding to the gene-trap vectors. There are over a quarter of a million mouse gene-trap sequence tags deposited into GenBank and subsequently integrated into GUS. These sequence tags have been aligned to mouse genomic sequences with BLAT and the alignments stored in the database. Preliminary analysis of the BLAT results suggests that there is a non-random distribution of alignments and therefore of gene-trap vector insertion. It has been proposed that there are, in fact, hot spots for integration. I will summarize the BLAT alignment results and discuss future plans to test the "hot spot" proposal and to investigate the signals for enhanced integration.

Thursday, April 22, 2004 3:30pm -- Shailesh Date, CBIL -- Exploring and exploiting functional linkages between proteins
Physiological effects in cells are brought about by various proteins acting individually, or in concert with each other. Proteins that interact may be linked physically, i.e., they may physically interact, such as subunits in a protein complex, or they may be linked functionally, such as proteins that are part of the same biochemical pathway. Functional linkages between proteins can be thought of as defining linear steps in a pathway, or defining the edges of a protein-protein interaction network. In the recent past, a number of new, computational functional genomics methods, for elucidating functional linkages have been proposed, and have produced biologically meaningful results within the bounds of statistical confidence. We will briefly explore two such methods, protein phylogenetic profiles, and Rosetta stone linkages, which are proving to be useful in understanding the biology of organisms, and how they can be applied to problems such as function annotation of genes and their products, and creating biologically meaningful clusters of proteins, which may represent pathways, networks and cellular systems. We will also discuss the possible integration of these and other computational methods with the existing GUS framework, or its subset frameworks.

Thursday, April 1, 2004 3:30pm -- Steve Fischer, CBIL -- The GUS Web Development Kit (WDK) Data Model
We are underway with a re-design of the GUS WDK. The WDK-Classic was written by Jonathan Crabtree and implemented using Java Servlets. The new version will use a Model-View-Controller design. It will use XML to specify the Model and will use JavaServer Pages/Struts for the View and Controller. In this presentation I will discuss the WDK's Model, both how to use it and its implementation.

Thursday, March 25, 2004 3:30pm -- Gary Chen, CBIL -- Report on CSHL Systems Biology meeting.
Gary will give a review of the Cold Spring Harbor meeting on Systems Biology: Genomic Approaches to Transcriptional Regulation.

Thursday, March 11, 2004 3:30pm -- Trish Whetzel, CBIL -- Report on PSB and MAM
A summary of the recent Pacific Symposium on Biocomputing will be presented. Session topics include Alternative Splicing, Computational Tools for Complex Trait Gene Mapping, BioMedical Ontologies, Joint Learning from Multiple Types of Genomic Data, Informatics Approaches in Structural Genomics, and Computational and Symbolic Systems Biology. Highlights from the Molecular Approaches to Malaria meeting will also be presented, with a focus on sources of data to potentially include in PlasmoDB.

Thursday, March 4, 2004 3:30pm -- Jonathan Schug and friends, CBIL -- Starting DoTS on the genome
Currently we align DoTS to the genome at the end of the build. Plans are being made now to use the genome to perform the initial clustering of ESTs and mRNAs to start the DoTS build avoiding the self-BLAST. Also to be considered is whether we should go further than clustering and use the genome for assembly as well.

Thursday, February 12, 2004 3:30pm -- Jonathan Schug, CBIL -- Tissue specific genes - II
How to pick 'em, what they look like, and where they can be found. I will review our findings about the characteristics of promoters of tissue specific genes in preparation for finishing a paper.

Thursday, January 29, 2004 3:30pm -- Dave Barkan, CBIL -- GUS Java Object Layer: Reloaded
I will give an update on the progress of the GUS Java object layer. Highlights will include taking advantage of Java's strong data typing to make code easier to understand and guard against user error, a discussion of the differences between the Java and Perl object layers, and some interesting programming problems and their (attempted) solutions. Guaranteed to stir debate or your money back!

Thursday, January 15, 2004 3:30pm -- Mike Saffitz, CBIL -- The Simple Object Access Protocol (SOAP)
The Simple Object Access Protocol, or SOAP, defines a cross-platform, lightweight interface for the exchange of information in a distributed environment. Often used as the underlying protocol in web services, SOAP allows interfaces to be exposed for remote procedure calls with little to no modification. During Thursday's lab meeting, SOAP will be presented, along with demonstrations and examples of how it can be used within the lab, and how the SCGAP consortium is using it to provide a central search tool which queries the seven member groups for gene expression data.

Thursday, December 18, 2003 3:30pm -- Angel Pizarro, CBIL -- MAGE-OM: Enter the belly of the beast
MAGE-OM v.1 was a collaborative effort to create a flexible framework for software, and academic and industry groups to share microarray data. The modeling effort stretched over a year and resulted in what many think is an object model that is too complex for its own good. It is a classic example of over-engineering, making the common case as hard to deal with as the least common case. I'll go into a current example dealing with the most basic encoding of array data, and the more complex case of BioMaterials. MAGE v2 is now under construction and plans to extend itself to new high throughput functional genomics data domains, such as proteomics. Will the powers-that-be do a complete refactoring to make the code base simpler? Or will the gremlins of object proliferation win the day? Stay tuned ...

Thursday, December 11, 2003 3:30pm -- Gregory Grant, CBIL -- Association Tests in Bioinformatics
There has been increasing use of association tests in bioinformatics (e.g. Fisher's exact test for a 2x2 table or chi-square tests for nxn tables). A typical example is to test whether high expression, or differential expression, is associated with functional category. The requirements for being able to apply such tests are delicate and subtle, and it is not always obvious that they are being violated. We will discuss the general theory, a range of applications, an available software package for bioinformatics applications (EASE), the various ways to go wrong with the assumptions, and the care that must be taken in drawing conclusions.
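For readers unfamiliar with the 2x2 case, here is a minimal sketch of a two-sided Fisher's exact test using only the standard library. The function name and the example counts are hypothetical; the test assumes the row and column totals are fixed and the genes are independent, which is exactly the kind of assumption the talk warns can be violated.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Example reading of the table (hypothetical): rows are "highly expressed"
    vs "not", columns are "in functional category" vs "not".
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # given fixed row and column totals.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum over every table that is no more probable than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


p_value = fisher_exact_2x2(3, 1, 1, 3)
```

With counts this small the p-value is large, which illustrates why a handful of genes in a category rarely supports a strong conclusion.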

Thursday, November 13, 2003 3:30pm -- Jonathan Schug, CBIL -- Assessing tissue-specificity of genes using Shannon entropy
"What genes are specific to all kinds of islet cells?" Questions like the one above are common in studies focussed on a single organ, tissue, or cell type. It is often important to identify genes that are specific or restricted to the system of interest to identify common regulatory motifs or to use as candidates for genes with clinical or developmental importance. With the generation of detailed comprehensive measurements of the transcriptome of several major organisms, we now have the ability to identify such genes for many tissues of interest. In this work we apply Shannon entropy (H) to rank genes according to the amount of specificity they show. We then define another statistic (Q) that allows us to rank genes according to the amount of restriction they show to a tissue of interest. The Q statistic formalizes two notions about tissue specificity. First, the more wide-spread and uniform a gene's expression is, the less restricted it is. Second, the higher a gene's expression is in a tissue the more relevant the tissue is to the gene. We apply our statistic to microarray-based data (Su et al. 2002) and EST-based data (Gan et al., in prep), and it can be applied to in situ hybridization data as well. Our statistics can also be used with a hierarchical description of anatomy to answer questions like the one above. We discuss the distribution of H values for all genes in human and mouse to see how many specific genes there are and demonstrate the utility of Q by using it to identify properties of HNF-1 binding sites in liver-restricted genes and to cluster organs and tissues by correlation of Q scores for the specific genes.
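The abstract gives no formulas, so the following is only a sketch under stated assumptions: H is taken to be the Shannon entropy (in bits) of a gene's relative expression across tissues, and Q for a tissue t is assumed to have the form H - log2(p_t), a form chosen here because it satisfies both notions described above (lower for less uniform expression, lower for higher expression in t). Function names and expression values are hypothetical.

```python
from math import log2


def tissue_entropy(expression):
    """Return (H, p) for one gene.

    `expression` maps tissue name -> non-negative expression level.
    p is the relative expression per tissue; H is its Shannon entropy in bits.
    """
    total = sum(expression.values())
    if total <= 0:
        raise ValueError("gene has no expression in any tissue")
    p = {tissue: level / total for tissue, level in expression.items()}
    h = -sum(q * log2(q) for q in p.values() if q > 0)
    return h, p


def q_statistic(expression, tissue):
    """Assumed form Q = H - log2(p_tissue); lower means more restricted."""
    h, p = tissue_entropy(expression)
    if p[tissue] == 0:
        return float("inf")  # gene is absent from the tissue of interest
    return h - log2(p[tissue])


# Hypothetical expression levels for two genes across four tissues.
housekeeping = {"liver": 10, "islet": 10, "brain": 10, "muscle": 10}
liver_only = {"liver": 40, "islet": 0, "brain": 0, "muscle": 0}

h_uniform, _ = tissue_entropy(housekeeping)
h_specific, _ = tissue_entropy(liver_only)
```

Under these assumptions, uniformly expressed genes get the maximum H (log2 of the number of tissues) and a gene expressed in a single tissue gets H = 0, so sorting by H or Q in ascending order puts the most specific genes first.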

Thursday, November 20, 2003 3:30pm -- Gary (Guang) Chen, CBIL -- Pattern analysis of islet gene promoters
"What TF binding motifs (TFBMs) are specific to islet genes?" "What TFBM modules are specific to islet genes?" These are the follow-up questions to "What genes are specific to all kinds of islet cells?" in Jonathan Schug's presentation. As in many other sequence based strategies in computational biology, one of the key problems is the discovery of sequence patterns, in this case, the cis-regulatory motifs or TFBMs. Most pattern recognition algorithms will usually report hundreds of redundant patterns, which need to be clustered in order to provide clear and useful information. I will present my efforts to build a pilot method & software pipeline, from pattern discovery (Teiresias) to clustering (K-median, MCL), with the goal of identifying known TFBMs and discovering new ones.

Thursday, October 30, 2003 3:30pm -- No lab meeting today, -- Eugen Buehler's Thesis Defense "STATISTICAL MODELS FOR THE ANALYSIS OF HETEROGENEOUS BIOLOGICAL DATA SETS" 3-5PM Levine 307


Thursday, November 6, 2003 3:30pm -- Shailesh Date, University of Texas, Austin -- Large scale protein function prediction and systematic discovery of novel cellular systems.
A large number of proteins from the many completely sequenced genomes carry no functional annotation. Even in well-studied genomes like that of E. coli, proteins that comprise more than 30% of the proteome are considered novel, mostly due to the absence of sequence homology with other characterized proteins. Finding function for these unique proteins is being touted as one of the biggest challenges facing us today. A number of high-throughput, functional genomics methods have been recently put forth, that attempt protein annotation independent of sequence homology constraints. One successful computational genetics approach involves investigating functional interactions between proteins by measuring the similarity between their phylogenetic profiles. Phylogenetic profiles are, in essence, a description of the presence or absence of the given protein in a set of reference genomes. Proteins with similar profiles are more likely to be members of the same cellular system or pathway, and are therefore functionally linked to each other. Mutual information, an information theoretic measure, was adopted to quantitatively define profile similarity. Profiles of all proteins from a given genome were then compared with each other, leading to the reconstruction of inter-protein functional linkage maps on a genome-wide scale. Unknown proteins were then assigned function based on their links with other known proteins, and their position in the map. Further examination of the maps revealed groups or clusters composed almost entirely of proteins with no known function. Such protein clusters potentially represent novel cellular systems or pathways. A publicly available, web-accessible database of protein profiles from 89 completely sequenced genomes, and a set of computational tools were created, that allow the generation of a phylogenetic profile from any given amino acid sequence. 
This profile can then be compared with profiles of other proteins, and function information extrapolated. Gene neighbors of matching candidate profiles, and their Rosetta stone links can also be investigated (http://bioinformatics.icmb.utexas.edu/plex).
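A minimal sketch of the profile comparison described above, assuming each phylogenetic profile is a list of 0/1 values (absent/present) over the same ordered set of reference genomes and that similarity is the mutual information, in bits, estimated from the observed counts. The function name and the example profiles are hypothetical, and this is not the implementation behind the database mentioned in the abstract.

```python
from collections import Counter
from math import log2


def mutual_information(profile_a, profile_b):
    """Mutual information (bits) between two presence/absence profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same reference genomes")
    n = len(profile_a)
    joint = Counter(zip(profile_a, profile_b))
    count_a = Counter(profile_a)
    count_b = Counter(profile_b)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts.
        mi += (c / n) * log2(c * n / (count_a[x] * count_b[y]))
    return mi


# Hypothetical profiles over four reference genomes.
protein_1 = [1, 0, 1, 0]
protein_2 = [1, 0, 1, 0]   # same pattern as protein_1
protein_3 = [1, 1, 0, 0]   # unrelated pattern

mi_linked = mutual_information(protein_1, protein_2)
mi_unlinked = mutual_information(protein_1, protein_3)
```

Pairs with high mutual information would be candidates for a functional link; with real data the number of reference genomes is much larger and a significance threshold is needed.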

Thursday, October 23, 2003 3:30pm -- No lab meeting today, -- Bioinformatics retreat tomorrow


Thursday, October 16, 2003 3:30pm -- Y. Thomas Gan, CBIL -- Summary of DoTS Paper


Thursday, October 9, 2003 3:30pm -- Junmin Liu, CBIL -- RAD Querier and the RAD web site


Thursday, September 25, 2003 3:30pm -- Matt Mailman, CBIL --
Characterization of elements regulating tissue-specific alternative splicing

Thursday, September 18, 2003 3:30pm -- Regina Gorski, Kaestner Lab. -- Cell surface antigens.
Identifying cell surface antigens for pancreatic cells.

Thursday, August 14, 2003 3:30pm -- Pizarro, CBIL -- EPConDB 3.1
An overview of the new and old features in EPConDB 3.1.

Thursday, August 7, 2003 3:30pm -- Crabtree, CBIL -- PlasmoDB 4.1
An overview of the new and old features in PlasmoDB 4.1.

Thursday, July 31, 2003 3:30pm -- Dr. Cesare Furlanello, ITC-IRST (Trento, Italy) -- The control of the selection bias in the predictive modeling of array data.
Guest speaker from ITC-IRST (Trento, Italy) describing joint work with M. Serafini, S. Merler, G. Jurman.

Thursday, July 24, 2003 3:30pm -- Manduchi and Grant, CBIL -- ISMB 2003
A report on ISMB 2003 (Brisbane, Australia)

Thursday, July 10, 2003 3:30pm -- Crabtree, CBIL -- Code review #2
Code Review #2: CBIL Annotator Interface II

Thursday, June 19, 2003 3:30pm -- Pizarro, CBIL -- Code review #1
Code Review #1: Angel's code goes through the wringer

Thursday, June 12, 2003 3:30pm -- Y. Thomas Gan, CBIL -- CSH Meeting Reviews
Reviews of two recent CSH meetings.

Thursday, June 5, 2003 N/A -- N/A, -- No lab. meeting - party instead


Thursday, May 29, 2003 3:30pm -- Barkan and Fischer, CBIL -- GO function prediction
Recent work on making the GO function prediction algorithm function in GUS 3.0

Thursday, May 15, 2003 N/A -- N/A, -- No lab. meeting


Thursday, May 22, 2003 N/A -- N/A, -- No lab. meeting


Thursday, May 8, 2003 3:30pm -- Elisabetta Manduchi, CBIL -- RAD: current status
An update on the current status of the schema, data loading pipeline, and the new release of the annotation web-forms.

Thursday, May 1, 2003 3:30pm -- Pizarro and Crabtree, CBIL -- MAGE & GUS


Thursday, April 24, 2003 3:30pm -- Andy Jones, University of Glasgow -- Data standards for proteomics.
Towards a data standard for 2-dimensional gel electrophoresis.

Thursday, April 17, 2003 3:30pm -- Joan Mazzarelli, CBIL -- PancChip Annotation
A review of the effort to manually annotate the genes present on PancChip 4, a pancreas-specific microarray chip.

Thursday, April 10, 2003 3:30pm -- Deborah Pinney, CBIL -- DoTS Update
An update on the operations used to build and annotate DoTS Transcripts (DTs).

Thursday, April 3, 2003 N/A -- N/A, -- No lab. meeting


Thursday, March 27, 2003 3:30pm -- Jonathan Schug, CBIL -- Report on CSH meeting


Thursday, March 6, 2003 3:30pm -- Hongxian He, CBIL -- The SCGAP Hematopoietic Stem Cell Research Project
An introduction to the SCGAP Hematopoietic Stem Cell Research Project at Princeton and Penn. The presentation will also include a summary of the microarray analysis of the preliminary INS-1 study by Klaus Kaestner.

Thursday, February 20, 2003 3:30pm -- Junmin Liu, CBIL -- The website of RAD (RNA Abundance Database)
A progress report on the developing RAD web interface, driven by the MAGE ontology.

Thursday, January 30, 2003 3:30pm -- Yongchang Gan, CBIL -- DoTS Genes
An update on the latest human and mouse genes based on genomic alignments of DoTS, including discussions of various evaluations of the result, a comparison with the similarity based genes, and a human-mouse comparison.

Thursday, January 23, 2003 3:30pm -- Trish Whetzel, CBIL -- The new MGED Ontology
The results of the second MGED Ontology get-together will be presented.

Thursday, January 16, 2003 3:30pm -- Dr. Jonathan (Yoni) Nissanov, MCP Hahnemann -- Cryoplane microscopy and QTL mapping.
Getting from phenotype to genotype with images of mouse anatomy.

Thursday, January 9, 2003 3:30pm -- Jonathan Schug, CBIL -- Learning Collection Grammars; Can't Johnny Read Why
A collection of tidbits about learning collection grammars presented in no particular order, but covering the CRE story, the beginnings of the liver story, and some background.

Thursday, December 19, 2002 3:30pm -- Dave Barkan, CBIL -- Annotate This: Annotator's Interface II
The next version of the CBIL Annotator's Interface will be a stand-alone, Java-based application with support for annotators working in an "offline" mode, i.e. without a persistent connection to the database. We'll give a summary of the work done so far, including progress on both the GUS Java object layer and also the tool's all-new graphical user interface.

Thursday, December 5, 2002 3:30pm -- Greg Grant, CBIL -- Quality Control in Microarray Data Management, and a report on CAMDA02
As the field of microarray analysis matures, issues relating to quality control in data management have become more and more difficult to ignore. This year's CAMDA conference highlighted this issue perfectly in an unintended, but serendipitous, way. We will review the conference with particular attention to the data quality control issues which arose, and then we will expand on the discussion of quality control more generally.

Thursday, November 14, 2002 3:00pm -- Chen, CBIL -- Clone correlation analysis based on seqlets and Bio-Dictionary
Seqlets are the patterns discovered by processing a given input comprising many sequences. The seqlet that we are addressing is analogous to the basic unit of natural language. Seqlets can be thought of as building blocks of protein molecules that are a necessary condition for membership in a functional or family equivalence class. Our results showed that seqlets are meaningful for structural and functional similarity, especially in small protein family groups. It might be useful to explore seqlet-function associations as GO did with protein domains.

Thursday, November 7, 2002 3:30pm -- Andrew Selden, PCBI -- Sysadmin update.
An overview of new and upcoming developments related to the computing support infrastructure at CBIL and the Center for Bioinformatics. Topics will include the new centralized fileserver (with support for file sharing in both Unix/Linux and also Windows), planning downtime for rewiring the DMZ network, and other exciting developments.

Thursday, October 31, 2002 3:30pm -- Matt Mailman, CBIL -- Identification of alternative splicing regulatory elements, characterization of alternatively spliced genes, and correlation of the role of polymorphic regulatory elements with phenotype
TBA

Thursday, October 24, 2002 3:30pm -- Steve Fischer, CBIL -- CVS Reorganization
CVS Reorganization

Thursday, October 17, 2002 N/A -- N/A, -- No lab. meeting
No lab. meeting

Thursday, October 10, 2002 3:30pm -- Joan Mazzarelli, CBIL -- Annotating Mouse Chromosome 5
A report on our effort to manually annotate selected regions of the proximal portion of mouse chromosome 5. The goals of this effort include: confirming the presence of known genes in the region; validating novel gene predictions; and, ultimately, determining how many protein coding genes are located in these regions and on mouse chromosome 5 as a whole.

Thursday, September 26, 2002 3:30pm -- Nic Henke, CIS -- Using Liniac & Clubmask
A workshop on running batch mode jobs on the Liniac compute cluster.

Thursday, September 19, 2002 N/A -- N/A, -- No lab. meeting
No lab. meeting

Thursday, September 12, 2002 12:15pm -- Gan & Crabtree, CBIL -- From BLAT alignments to genes
We have been using Jim Kent's BLAT (BLAST-Like Alignment Tool) to align DoTS consensus sequences to the human and mouse genomes. We have used various criteria to attempt to identify those alignments deemed most likely to represent the transcripts of (protein coding) genes. We have also developed a heuristic algorithm for merging adjacent alignments into putative "genes", with the goal of generating a reference gene index for each of the organisms in question. We will present our results, including some comparisons with BLAT alignments of TIGR THCs and RefSeq mRNAs and with the published literature on one or two regions of interest.
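The abstract does not describe the merging heuristic itself. The sketch below shows one simple possibility, assuming alignments are (chromosome, strand, start, end) tuples and that two alignments on the same chromosome and strand are merged when the gap between them is at most max_gap bases. The function name, the tuple layout, and the 1000-base threshold are all hypothetical.

```python
def merge_alignments(alignments, max_gap=1000):
    """Merge nearby alignments into putative gene intervals.

    alignments: iterable of (chrom, strand, start, end) tuples.
    max_gap: largest gap (in bases) bridged between adjacent alignments;
             an arbitrary illustrative value, not taken from the talk.
    """
    genes = []
    # Sorting groups alignments by chromosome and strand, then by position.
    for chrom, strand, start, end in sorted(alignments):
        last = genes[-1] if genes else None
        if (
            last is not None
            and last[0] == chrom
            and last[1] == strand
            and start - last[3] <= max_gap
        ):
            # Close enough to the previous interval: extend it.
            last[3] = max(last[3], end)
        else:
            genes.append([chrom, strand, start, end])
    return [tuple(g) for g in genes]


# Hypothetical alignments of assembled transcripts to one chromosome.
putative_genes = merge_alignments(
    [
        ("chr1", "+", 100, 500),
        ("chr1", "+", 900, 1500),
        ("chr1", "+", 10000, 11000),
        ("chr1", "-", 1200, 1600),
    ]
)
```

A real heuristic would likely also weigh alignment quality, splice signals, and overlap with known genes before merging.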

Thursday, September 5, 2002 12:15pm -- Liu, Manduchi & Whetzel, CBIL -- RAD forms
Helping to FORM a good impression of RAD.

Thursday, August 29, 2002 N/A -- N/A, -- No lab. meeting
No lab. meeting

Thursday, August 22, 2002 1:00pm -- Fischer, Li, Schug, CBIL -- ISMB recap.
A report on interesting research from this year's ISMB meeting in Edmonton, Canada.

Fischer:
  • Alexander Kel and Edgar Wingender's tutorial "In silico analysis of gene regulatory sequences. Towards target gene identification"
  • Splicing Graphs and EST assembly problem (S. Heber, et al, UCSD)
  • Efficiently detecting polymorphisms during the fragment assembly process (D.Fasulo, et al, Celera)
Li:
  • 146B BAG: A graph theoretic sequence clustering algorithm
  • 178B Detecting the Domain Structure of Proteins from Sequence Information
  • 121B Analysis of Grass Gene Families.


Thursday, August 15, 2002 1:00pm -- Crabtree, CBIL -- Oracle/SQL tutorial
As part of this extensive tutorial on Oracle databases and SQL, you will learn how to:
  • Master the black art of query optimization!
  • Discover which of your co-workers has locked the tables you want to modify!
  • Win friends and influence people with recursive queries!


Thursday, July 11, 2002 1:00pm -- Gan, CBIL -- Computing human-mouse orthologs in DoTS
Finding orthologous genes using the human and mouse DoTS EST assemblies; includes a comparison with the TIGR Eukaryotic Gene Orthologs (the database formerly known as "TOGA").

Thursday, July 4, 2002 N/A -- N/A, -- No lab. meeting
No lab. meeting

Thursday, June 27, 2002 1:00pm -- Manduchi, CBIL -- Microarray Data Analysis
A brief tutorial on approaches to analyzing microarray data

Thursday, June 20, 2002 1:00pm -- Marie-Adele Rajandream, Sanger PSU -- GeneDB
The GeneDB project aims to develop and maintain curated database resources for Schizosaccharomyces pombe, Leishmania major and Trypanosoma brucei.

Thursday, June 13, 2002 1:00pm -- Fischer, CBIL -- Liniac controller
A generic solution to distributing jobs on a compute cluster

Thursday, June 6, 2002 1:00pm -- Pizarro, CBIL -- RAD3
RAD3 -- CBIL's ongoing efforts with the RNA Abundance Database

Thursday, May 23, 2002 1:00pm -- Kondrakhin, CBIL -- TBA


Thursday, May 16, 2002 1:00pm -- Crabtree, CBIL -- CSH Recap.
A review of last week's Cold Spring Harbor Meeting on Genome Sequencing and Biology.

Thursday, April 25, 2002 1:00pm -- Pinney, CBIL -- Integration of RNA and Protein in the GUS Database


Thursday, April 11, 2002 1:00pm -- Fischer, CBIL -- Automated sequence analysis
A Java API for processing sequence analysis protocols

Thursday, March 28, 2002 1:00pm -- Grant, CBIL -- Microarrays
Studies in microarray data analysis

Thursday, March 14, 2002 1:00pm -- Hongxian He, -- TBA
Guest speaker

Thursday, March 7, 2002 1:00pm -- Manduchi, McWeeney, Stoeckert, CBIL -- MGED4 recap.
A recap. of some of the talks and events at the 4th Microarray Gene Expression Database meeting held in Boston in February 2002.

Thursday, February 28, 2002 1:00pm -- Diskin, CBIL -- Annotator Interface - the next generation
Motivation, requirements and future plans for new annotation tool

Thursday, February 21, 2002 1:00pm -- Brunk, CBIL -- Moving from connected components to graphs for clustering BLAST results.
I will present a new graph based algorithm I'm working on to generate clusters of sequences given a set of blast results and compare the results of this algorithm to the one currently in use for DoTS.
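For context, connected-components clustering (the approach named in the talk title) can be sketched as follows, assuming each BLAST result has already been filtered and reduced to a pair of sequence identifiers. The function name and the identifiers are hypothetical, and this is not the DoTS code.

```python
def connected_component_clusters(hits):
    """Cluster sequences so that any two linked by a chain of hits share a cluster.

    hits: iterable of (query_id, subject_id) pairs.
    Returns a sorted list of sorted clusters.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    clusters = {}
    for seq in list(parent):
        clusters.setdefault(find(seq), set()).add(seq)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical hits: A-B and B-C put A and C together without a direct hit.
clusters = connected_component_clusters([("A", "B"), ("B", "C"), ("D", "E")])
```

The example shows the usual weakness of this approach: a single bridging sequence (B) is enough to join A and C, which is one reason to look at graph structure beyond connectivity.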

Thursday, February 7, 2002 1:00pm -- Stoeckert, Mazzarelli, Crabtree, CBIL -- DOE Meeting Recap.
A recap. of some of the talks at this year's DOE Contractor-Grantee meeting.

Thursday, January 24, 2002 12:45pm -- Li, CBIL -- Identifying orthologous groups in eukaryotes
Clustering orthologs by comparing model eukaryotic genomes within GUS: current framework and future plans

Thursday, January 17, 2002 12:45pm -- McWeeney, CBIL -- PSB recap.
Summary of proceedings of Pacific Symposium on Biocomputing 2002 held in Kaua'i. Presentation will include slides and interpretive hula.

Thursday, January 10, 2002 12:45pm -- Stoeckert, CBIL -- MGED Ontology
Representing the biological context of a microarray experiment and making a database out of it.

Thursday, January 3, 2002 1:00pm -- Schug, CBIL -- TESS-in-GUS
An update on progress and plans for TESS-in-GUS covering grammars, schemas, and statistics.

Wednesday, December 19, 2001 12:00pm -- Le, -- Modeling Pancreatic Beta Cell Development
Building a gene network for endocrine pancreas development using laser capture, microdissection, microarrays, and modeling.

Thursday, December 13, 2001 N/A -- N/A, -- No lab. meeting
No lab. meeting

Thursday, November 29, 2001 1:00pm -- Manduchi, -- Report on Affymetrix workshop
Elisabetta will report on the Affymetrix workshop that she and Shannon attended recently.

Friday, November 23, 2001 N/A -- N/A, -- No lab. meeting
Happy Thanksgiving!

Thursday, November 15, 2001 1:00pm -- Mazzarelli, -- State of the Annotation Address - Part II
State of the Annotation Address - Part II

Thursday, November 1, 2001 1:00pm -- Kondrakhin, -- PROM_REC
An update on the latest in promoter recognition technology, PROM_REC. Note the new 1:00 p.m. time slot for lab. meeting.

Thursday, October 25, 2001 N/A -- N/A, -- No lab. meeting


Thursday,