Setting Up and Running RUM

RUM is an alignment, junction calling, and feature quantification pipeline specifically designed for Illumina RNA-Seq data.

    RUM can also be used effectively for DNA sequencing (e.g. ChIP-Seq) and microarray probe mapping.

    RUM also has a strand specific mode.

    RUM is highly configurable; however, it does not require fussing over options, because the defaults generally give good results.

Publication

Comparative Analysis of RNA-Seq Alignment Algorithms and the RNA-Seq Unified Mapper (RUM). Gregory R. Grant, Michael H. Farkas, Angel Pizarro, Nicholas Lahens, Jonathan Schug, Brian Brunk, Christian J. Stoeckert Jr, John B. Hogenesch and Eric A. Pierce.

Restrictions

RUM is freely available to academics and non-profit organizations. However, since RUM uses BLAT, users from industry must first obtain a licence for BLAT from the Kent Informatics website.

System Requirements

RUM should work anywhere you have most of the standard Unix command-line tools, Perl, and can get the blat, bowtie and mdust binaries to execute; however we haven't tested it on every platform. There is a self-install script, described below, for the systems we have tested.

Unless you have a relatively small genome, you will probably need a 64 bit machine. For the human or mouse genome this is definitely necessary. For a lane of 20 million 100 bp paired-end reads, expect to use about 100-200 GB of disk space.

Installing RUM

Self-Install Script

RUM has a self-install script that should work for most 64 bit Linux and Mac platforms. Simply download the rum_install.pl script and run it, providing a single command line argument: the directory where you want to install the RUM pipeline and indexes (you probably want to make an install directory called "rum" somewhere on your system, to install to). If you move things around after installing, make sure to update the paths in the config file for your organism. After RUM installs itself, it will prompt you to install some indexes. You can install indexes one at a time by entering a number of an index to install. Enter 'q' or hit Control-C to quit.

Note: If you are going to make your own indexes, or are just updating the scripts, then you can skip the index installation step by hitting 'q' when prompted for an index to install. You can always install more indexes later by running rum_indexes, which is in the bin directory. Please run rum_indexes -h to learn how to use it.

From Git

You can also install RUM by simply cloning the git repository. If you use this method, you will need to install indexes using rum_indexes.

Current Version: v1.11, released March 3, 2012
RUM release history
 

Indexes

Once you have RUM installed you can use the rum_indexes tool to list, install, and remove indexes. Please run rum_indexes -h to learn how to use it. At the moment the following indexes are available:

  1. Homo sapiens (build hg19) (human)
  2. Homo sapiens (build hg18) (human)
  3. Mus musculus (build mm9) (mouse)
  4. Danio rerio (build danRer7) (zebrafish)
  5. Drosophila melanogaster (build dm3) (fruit fly)
  6. Anopheles gambiae (build anoGam1) (mosquito)
  7. Caenorhabditis elegans (build c36) (nematode worm)
  8. Saccharomyces cerevisiae (build sacCer3) (yeast)
  9. Rattus norvegicus (build m4) (rat)
  10. Sus scrofa (build susScr2) (pig)
  11. Canis lupus familiaris (build canFam2) (dog)
  12. Pan troglodytes (build panTro2) (chimpanzee)
  13. Pongo pygmaeus abelii (build ponAbe2) (orangutan)
  14. Macaca mulatta (build rheMac2) (rhesus monkey)
  15. Gallus gallus (build galGal3) (chicken)
  16. Plasmodium falciparum (build 06-2010) (malaria parasite)
  17. Arabidopsis thaliana (build TAIR10) (arabidopsis)

We will be expanding this list regularly. If you require a different organism, instructions are given below to build your own custom indexes. Or write to us; we may be able to provide it.

Platform-Specific Notes

64-bit Linux, Mac OS X >= 10.5

Make sure wget and gunzip are installed. You should be able to install either with the rum_install.pl script or by cloning the git repository.

The Amazon Cloud

Check out a 64 bit Linux machine. Choose a powerful one, preferably a High-Memory Quadruple Extra Large (m2.4xlarge: 8 cores, 68.4 GB RAM). We recommend making a spot request with a maximum bid at least as high as the full instance price. You save a lot of money that way, with only a small risk of losing the machine before the run finishes. We usually keep machines for months this way (and even if the machine is lost, the disk is not). Make sure wget and gunzip are installed. Attach a 1GB Volume to the machine and use the following script: http://itmat.rum.s3.amazonaws.com/ruminstalllinux64.pl Run that script with one argument, the directory where RUM is to be installed. Install to the 1GB attached volume, because the default volume will probably be too small.

Other Systems

RUM should work anywhere you have Unix and Perl and can get the blat, bowtie and mdust binaries to execute. You will probably need to run under the bash shell. On other systems you can use either the Linux or Mac installers and just replace those three binaries with ones that execute on your system (the mdust source is available here). Note: we haven't tested it on every platform.

Running RUM

RUM is an alignment pipeline that maps reads in three phases. First it maps against the genome using Bowtie, then it maps against a transcriptome database using Bowtie, then it maps against the genome using Blat. The information from the three mappings is merged into one mapping. This leverages the advantages of both genome and transcriptome mapping as well as combining the speed of Bowtie with the sensitivity and flexibility of Blat.

Coverage plots are generated, normalized intensities for genes, introns and exons are generated, and files describing the junctions are generated. Files are also generated that have the alignment for each read, one per line, in RUM and SAM format. These output files are described in more detail below.

The RUM workflow

We assume you installed to a directory named "rum" and your data is in a directory called "data/Lane1". The names of these directories are not important, but if you have multiple lanes, each must go in its own directory. For single-end data, execute the command shown below. The reads files can be fasta or fastq.

> perl rum/bin/RUM_runner.pl rum/conf/rum.config_ORGANISM data/Lane1/reads.txt data/Lane1 1 Lane1
For paired-end data, run as follows:
> perl rum/bin/RUM_runner.pl rum/conf/rum.config_ORGANISM data/Lane1/forwardreads.txt,,,data/Lane1/reversereads.txt data/Lane1 1 Lane1
Change "ORGANISM" to your specific organism. This will run it in one piece. To parallelize, you must be on a machine with multiple cores or have access to cluster nodes via 'qsub'. If so, you can raise the fourth parameter above one.
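
The two invocations above differ only in how the reads argument is written. As a minimal sketch (Python is used purely for illustration; the helper name build_rum_command is hypothetical and not part of RUM), the argument list could be assembled as follows, assuming the positional order shown above: config file, reads, output directory, fourth parameter, name.

```python
from typing import List, Optional


def build_rum_command(config: str, forward_reads: str, out_dir: str,
                      chunks: int = 1, name: str = "Lane1",
                      reverse_reads: Optional[str] = None,
                      runner: str = "rum/bin/RUM_runner.pl") -> List[str]:
    """Assemble the argument list for RUM_runner.pl (hypothetical helper)."""
    if chunks < 1:
        raise ValueError("the fourth parameter must be at least 1")
    # Paired-end reads files are joined with three commas, as in the command above.
    if reverse_reads is None:
        reads = forward_reads
    else:
        reads = f"{forward_reads},,,{reverse_reads}"
    return ["perl", runner, config, reads, out_dir, str(chunks), name]


if __name__ == "__main__":
    cmd = build_rum_command("rum/conf/rum.config_ORGANISM",
                            "data/Lane1/forwardreads.txt", "data/Lane1",
                            reverse_reads="data/Lane1/reversereads.txt")
    print(" ".join(cmd))
```

The returned list could be passed to subprocess.run once RUM is installed; the sketch itself only builds the arguments and does not run anything.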

Note on qsub: If you are using qsub and you have installed RUM somewhere other than your home directory, then you will probably need to specify everything with full paths, including in the rum.config file.

Modify accordingly for the other lanes and make sure to change it in all places. Remember to run each lane in its own directory or the temporary and intermediate files will collide.
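
One way to keep lanes from colliding is to generate the command for each lane from a single table. The sketch below is only an illustration: lane_commands is a hypothetical helper, and it assumes each lane's reads file already sits in that lane's own directory, which is then reused as the output directory as in the single-end command above.

```python
import posixpath
from typing import Dict, List


def lane_commands(config: str, lanes: Dict[str, str], chunks: int = 1) -> List[str]:
    """Return one RUM_runner.pl command line per lane (hypothetical helper).

    `lanes` maps a lane name to its reads file. The directory holding the
    reads file is used as that lane's output directory.
    """
    commands = []
    seen_dirs = set()
    for name, reads in sorted(lanes.items()):
        out_dir = posixpath.dirname(reads) or "."
        # Two lanes writing to one directory would overwrite each other's files.
        if out_dir in seen_dirs:
            raise ValueError(f"{name} would share directory {out_dir} with another lane")
        seen_dirs.add(out_dir)
        commands.append(
            f"perl rum/bin/RUM_runner.pl {config} {reads} {out_dir} {chunks} {name}"
        )
    return commands


if __name__ == "__main__":
    for line in lane_commands("rum/conf/rum.config_ORGANISM",
                              {"Lane1": "data/Lane1/reads.txt",
                               "Lane2": "data/Lane2/reads.txt"}):
        print(line)
```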

Note: You can get the general usage and other options by running with no parameters as follows:
> perl rum/bin/RUM_runner.pl

Unless you have processors or nodes left over, you should wait until one lane has completely finished running before starting another. Each lane can take many hours.

The final "Lane1" argument, which follows the fourth parameter, is a name that you might want to change to be more descriptive. It must consist only of letters, numbers, dashes, underscores and periods.
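
The naming rule can be checked before a long run is launched. The following is a minimal sketch, not RUM's own validation code; the function name is_valid_run_name is hypothetical, and the sketch assumes "letters" and "numbers" mean ASCII letters and digits.

```python
import re

# Letters, digits, dashes, underscores and periods only, as stated above.
_VALID_NAME = re.compile(r"[A-Za-z0-9._-]+")


def is_valid_run_name(name: str) -> bool:
    """Return True if `name` uses only the characters allowed for a run name."""
    return _VALID_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("Lane1", "liver_rep-2.v1", "lane 1", "a/b"):
        print(candidate, is_valid_run_name(candidate))
```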

The Output Files

A number of files of interest will be created; they are described below. To understand these files, note that there are three kinds of reads: unique mappers, non-unique mappers, and reads that do not map.

  1. feature_quantifications_NAME : This file gives the counts and normalized counts for all features (transcripts, exons, introns). A 'min' and a 'max' value are given: the 'min' value is based on the unique mappers only, and the 'max' value is based on all mappers. The normalized value is the number of fragments mapping to the feature, multiplied by 10^9 and divided by both the length of the feature and the number of reads that aligned (the so-called 'RPKM' value). So as long as differential expression is reasonably well balanced between two samples, it should be meaningful to compare the RPKM values across them, even if there are different numbers of reads for each sample. Normalized intensities between different features of the same sample are also comparable, even if the features have different lengths.

  2. RUM.cov and RUM_NU.cov : These are called coverage files - strictly speaking they are bedgraph files. They give the depth of coverage at every location. The files use zero-based start, one-based end coordinates so that they can be uploaded directly to the UCSC genome browser. UCSC accepts compressed files (zip and gzip), so you should probably compress all files before uploading, as they will upload faster. Note that files might be too big to upload even when compressed - in this case you should use the "BigWig" format so that the file stays on your server and the browser only downloads what it needs in real time.

  3. RUM_Unique : the unique mappers
  4. RUM_NU : the non-unique mappers

    The above two files each give one alignment per line. Forward and reverse reads are merged into one line if their alignments overlap, and their ID is given as a regular integer. Otherwise they are given on separate lines, with the forward read ID indicated with an 'a' and the reverse read ID with a 'b'. The forward read always comes first, even if it maps downstream of the reverse. Each line has four fields:

     i) the sequence number
     ii) chromosome
     iii) spans of the alignment in genome coordinates
     iv) the sequence of the alignment
         - all sequence is plus strand
         - sequence has a colon ":" where there is a junction
         - sequence has a +XXX+ if XXX is an insertion, e.g. +AG+ means AG
           inserted in the sequenced genome w.r.t. the reference
    

  5. RUM.sam : all alignments in SAM format. This one file has all the information contained in RUM_Unique and RUM_NU and the original reads files (including quality scores if those are provided). The following tags are used:

    Only the "N" is a variable in the above, the entry between colons is the "type": "i" means it's an "integer" and the "A" just means it's a printable character, in this case "T" for "true" or "F" for "false".

  6. junctions_all.bed, junctions_all.rum, junctions_high-quality.bed : These files give information on junctions. The bed files can be directly uploaded to the UCSC browser.

  7. inferred_internal_exons.bed, novel_inferred_internal_exons_quantifications_NAME : These files give information on novel (unannotated) exons that are not in your gene model file but were inferred to exist from the data.

  8. mapping_stats.txt, a file that gives a breakdown of what percentage of the reads mapped, and how many reads mapped to each chromosome

  9. rum.log_master, a file that records how the parameters of the pipeline were set

  10. rum.error-log, a file that records any errors the pipeline might have thrown. Always check this file for every run. It is updated as the run proceeds, so you should keep an eye on it while the job is working.
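
The conventions described above for RUM_Unique and RUM_NU (an 'a' or 'b' on the read ID, a colon at each junction, +XXX+ around an insertion) can be illustrated with a short sketch. Both function names are hypothetical, the example sequence is made up, and the sketch assumes the 'a' or 'b' is the last character of the ID; it does not attempt to parse whole lines, since the field separator is not specified here.

```python
import re
from typing import List, Tuple


def read_kind(read_id: str) -> str:
    """Classify a read ID by its trailing letter (assumed to be the last character)."""
    if read_id.endswith("a"):
        return "forward"
    if read_id.endswith("b"):
        return "reverse"
    # No 'a' or 'b': the ID is a plain integer, as for merged forward/reverse reads.
    return "merged"


def summarize_sequence(seq: str) -> Tuple[int, List[str]]:
    """Return (number of junctions, inserted sequences) for an alignment sequence."""
    # Each +XXX+ block marks XXX as inserted relative to the reference.
    insertions = re.findall(r"\+([^+]+)\+", seq)
    # Each colon marks one junction.
    junctions = seq.count(":")
    return junctions, insertions


if __name__ == "__main__":
    print(read_kind("1234a"))
    print(summarize_sequence("ACGT:GGCA+AG+TT"))
```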

Once you have multiple lanes mapped, there is a script called 'featurequant2geneprofiles.pl' in the scripts directory that will create one spreadsheet of the normalized intensities with rows=genes and columns=samples. Run it without parameters to get the usage:
> perl rum/bin/featurequant2geneprofiles.pl
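
The normalized intensities in these files are the RPKM values described under feature_quantifications_NAME. As a worked sketch of that arithmetic, assuming the standard RPKM definition with a scaling factor of 10^9 (the function name rpkm and the example numbers are illustrative, not taken from RUM):

```python
def rpkm(fragments: int, feature_length: int, aligned_reads: int) -> float:
    """Fragments per kilobase of feature per million aligned reads.

    fragments      -- number of fragments mapping to the feature
    feature_length -- length of the feature in bases
    aligned_reads  -- number of reads that aligned in the sample
    """
    if feature_length <= 0 or aligned_reads <= 0:
        raise ValueError("feature_length and aligned_reads must be positive")
    return fragments * 1e9 / (feature_length * aligned_reads)


if __name__ == "__main__":
    # 500 fragments on a 2000-base feature, 20 million aligned reads.
    print(rpkm(500, 2000, 20_000_000))
```

Because the value is scaled by both feature length and the number of aligned reads, doubling the sequencing depth while doubling the fragment count leaves the result unchanged, which is why values can be compared across samples with different read counts.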

Making Your Own Indexes

We have supplied indexes for the most popular organisms, which you can install when you run the installation script. But you might want to create your own. It's easiest if your organism is available on the UCSC genome browser or at ENSEMBL, but even if not, it is not difficult to make your own. In any case, just download this tar ball and follow the instructions contained within. This assumes you know how to get around the Unix command line.


Contact

Please email Gregory Grant with any questions/comments: ggrant@pcbi.upenn.edu


Credits

Gregory R. Grant (1,2,4), Michael Farkas (3), Angel Pizarro (2), Nicholas Lahens (5), Jonathan Schug, Brian Brunk (1), Christian J. Stoeckert Jr (1,4), John B. Hogenesch (1,2,5) and Eric A. Pierce (3)

  1. Penn Center for Bioinformatics, University of Pennsylvania School of Medicine, Philadelphia, PA 19104
  2. Institute for Translational Medicine and Therapeutics, University of Pennsylvania School of Medicine, Philadelphia, PA 19104
  3. F.M. Kirby Center for Molecular Ophthalmology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104
  4. Department of Genetics, University of Pennsylvania School of Medicine, Philadelphia, PA 19104
  5. Department of Pharmacology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104