Setting Up and Running RUM
RUM is an alignment, junction calling, and feature quantification pipeline specifically designed for Illumina RNA-Seq data. RUM can also be used effectively for DNA sequencing (e.g. ChIP-Seq) and microarray probe mapping.
RUM also has a strand-specific mode.
RUM is highly configurable, but it does not require fussing over options; the defaults generally give good results.
Comparative Analysis of RNA-Seq Alignment Algorithms and the RNA-Seq Unified Mapper (RUM). Gregory R. Grant, Michael H. Farkas, Angel Pizarro, Nicholas Lahens, Jonathan Schug, Brian Brunk, Christian J. Stoeckert Jr, John B. Hogenesch and Eric A. Pierce.
RUM is freely available to academics and non-profit organizations. However, since RUM uses BLAT, users from industry must first obtain a licence for BLAT from the Kent Informatics Website.
RUM should work anywhere you have most of the standard Unix command-line tools, Perl, and can get the blat, bowtie and mdust binaries to execute; however we haven't tested it on every platform. There is a self-install script, described below, for the systems we have tested.
Unless your genome is relatively small, you will probably need a 64 bit machine. For the human or mouse genome a 64 bit machine is definitely necessary. For a lane of 20 million paired-end 100 bp reads, expect to use about 100-200 GB of disk space.
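The requirements above can be checked before starting a run. The following is a minimal sketch, not part of RUM: the function names is_64bit, free_gb and has_space are made up for illustration, and the list of 64 bit architecture names is an assumption that may need extending for your platform.

```shell
# Succeed if `uname -m` reports a common 64 bit architecture.
# The list of names is an assumption and is not exhaustive.
is_64bit() {
    case "$(uname -m)" in
        x86_64|amd64|aarch64|arm64) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the free space, in whole gigabytes, on the filesystem holding directory $1.
free_gb() {
    # -P prints one POSIX-format line per filesystem; -k reports 1024-byte blocks.
    # Column 4 of the second line is the available space.
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / (1024 * 1024) }'
}

# Succeed if directory $1 has at least $2 GB free.
has_space() {
    [ "$(free_gb "$1")" -ge "$2" ]
}

# Example: 200 GB is the upper end of the estimate quoted above.
is_64bit || echo "warning: this does not look like a 64 bit machine" >&2
has_space . 200 || echo "warning: less than 200 GB free in $(pwd)" >&2
```

The warnings are advisory only; a small genome or a smaller lane may run comfortably with less.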
RUM has a self-install script that should work for most 64 bit Linux and Mac platforms. Simply download the rum_install.pl script and run it, providing a single command line argument: the directory where you want to install the RUM pipeline and indexes (you probably want to make an install directory called "rum" somewhere on your system, to install to). If you move things around after installing, make sure to update the paths in the config file for your organism. After RUM installs itself, it will prompt you to install some indexes. You can install indexes one at a time by entering a number of an index to install. Enter 'q' or hit Control-C to quit.
Note: If you are going to make your own indexes, or are just updating the scripts, you can skip the index installation step by hitting 'q' when prompted for an index to install. You can always install more indexes later by running rum_indexes, which is in the bin directory. Please run rum_indexes -h to learn how to use it.
You can also install RUM by simply cloning the git repository. If you use this method, you will need to install indexes using
Current Version: v1.11, released March 3, 2012
RUM release history
Once you have RUM installed you can use the rum_indexes tool to list, install, and remove indexes. Please run rum_indexes -h to learn how to use it. At the moment the following indexes are available:
We will be expanding this list regularly. If you require a different organism, instructions are given below to build your own custom indexes. Alternatively, write to us; we may be able to provide it.
Make sure wget and gunzip are installed. You should be able to install RUM either with the rum_install.pl script or by cloning the git repository.
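One way to confirm the prerequisites before installing is sketched below. The helper name missing_tools is hypothetical and not part of RUM; perl is included in the check because RUM itself requires Perl.

```shell
# Print the name of each listed command that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools wget gunzip perl)
if [ -n "$missing" ]; then
    # $missing is left unquoted so the names print on one line.
    echo "Please install before continuing:" $missing
else
    echo "wget, gunzip and perl are all available"
fi
```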
Check out a 64 bit Linux machine, and make it a powerful one, preferably a High-Memory Quadruple Extra Large (m2.4xlarge: 8 cores, 68.4 GB RAM). We recommend doing it as a spot request, with a max bid at least as high as the full instance price. You save a lot of money that way, with just a small risk of losing the machine before it finishes. We usually maintain machines for months like this (and even if the machine is lost, the disk is not lost). Make sure wget and gunzip are installed. Attach a 1GB Volume to the machine and use the following script: http://itmat.rum.s3.amazonaws.com/ruminstalllinux64.pl Simply run that script with one argument, the directory where RUM is to be installed. Install it to the 1GB attached volume; the default volume will probably be too small.
RUM should work anywhere you have Unix and Perl and can get the blat, bowtie and mdust binaries to execute. You will probably need to run under the bash shell. On other systems you can use either the Linux or Mac installers and just replace those three binaries with ones that execute on your system (the mdust source is available here). Note: we haven't tested it on every platform.
We assume you installed to a directory named "rum" and your data is in a directory called "data/Lane1". The names of these directories are not important, but if you have multiple lanes, each must go in its own directory. For single-end data execute the command shown below. The reads files can be fasta or fastq.
> perl rum/bin/RUM_runner.pl rum/conf/rum.config_ORGANISM data/Lane1/reads.txt data/Lane1 1 Lane1
For paired-end run as follows
> perl rum/bin/RUM_runner.pl rum/conf/rum.config_ORGANISM data/Lane1/forwardreads.txt,,,data/Lane1/reversereads.txt data/Lane1 1 Lane1
Change "ORGANISM" to your specific organism. This will run it in one piece. To parallelize, you must be on a machine with multiple cores or access to cluster nodes via 'qsub'. If so, then you can raise the fourth parameter to be larger than one.
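The difference between the single-end and paired-end invocations is only in the reads argument. The sketch below prints the command line for either case without running it; rum_command is a hypothetical helper, not something shipped with RUM, and the argument order simply mirrors the two commands above.

```shell
# Print (without running) a RUM_runner.pl command line.
# Arguments: organism, output dir, fourth parameter, run name,
#            forward reads file, optional reverse reads file.
rum_command() {
    organism=$1; outdir=$2; pieces=$3; name=$4; fwd=$5; rev=${6:-}
    reads=$fwd
    # Paired-end files are joined with three commas, as in the command above.
    if [ -n "$rev" ]; then
        reads="$fwd,,,$rev"
    fi
    echo "perl rum/bin/RUM_runner.pl rum/conf/rum.config_$organism $reads $outdir $pieces $name"
}

# Single-end, run in one piece.
rum_command ORGANISM data/Lane1 1 Lane1 data/Lane1/reads.txt
# Paired-end, with the fourth parameter raised to 4 for a multi-core machine.
rum_command ORGANISM data/Lane1 4 Lane1 data/Lane1/forwardreads.txt data/Lane1/reversereads.txt
```

Printing the command first makes it easy to inspect before pasting it into a terminal or a qsub script.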
Note on qsub: If you are using qsub and you have installed RUM somewhere other than your home directory, then you will probably need to specify everything with full paths, including in the rum.config file.
Modify accordingly for the other lanes and make sure to change it in all places. Remember to run each lane in its own directory, or the temporary and intermediate files will collide.
Note: You can get the general usage and other options by running with no parameters as follows:
> perl rum/bin/RUM_runner.pl
Unless you have processors or nodes left over, you should wait until one lane has completely finished running before starting another; each lane can take many hours.
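Running several lanes one after another can be scripted. The sketch below only prints the commands; lane_commands is a hypothetical helper, and it assumes single-end data with each lane's reads stored as reads.txt inside its own directory, which will not match every layout.

```shell
# Print one RUM_runner.pl command per lane directory found under $1.
# The directory name is reused as the run name.
lane_commands() {
    for dir in "$1"/*/; do
        [ -d "$dir" ] || continue   # skip the unexpanded pattern when no lanes exist
        dir=${dir%/}                # drop the trailing slash
        lane=${dir##*/}             # last path component, e.g. Lane1
        echo "perl rum/bin/RUM_runner.pl rum/conf/rum.config_ORGANISM $dir/reads.txt $dir 1 $lane"
    done
}

lane_commands data
```

Replacing the echo with the command itself would run the lanes strictly in sequence, because a shell for loop waits for each command to finish before starting the next.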
The "Lane1" argument right after the fourth parameter (the 1 in the commands above) is a name that you might want to change to be more descriptive; however, it must consist only of letters, numbers, dashes, underscores, and periods.
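The naming rule can be checked before launching a long run. This is a minimal sketch; valid_run_name is a made-up helper, and rejecting the empty string is an assumption added here, not something the text states.

```shell
# Succeed only if $1 is non-empty and made up solely of letters, digits,
# dashes, underscores and periods.
valid_run_name() {
    case "$1" in
        ''|*[!A-Za-z0-9._-]*) return 1 ;;   # empty, or contains a disallowed character
        *) return 0 ;;
    esac
}

for name in Lane1 liver_rep-2.v1 "bad name" "x/y"; do
    if valid_run_name "$name"; then
        echo "ok:       $name"
    else
        echo "rejected: $name"
    fi
done
```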
|The Output Files|
A number of files of interest will be created, those are described below. In order to understand these files note that there are three kinds of reads:
The above two files each give one alignment per line. Forward and reverse reads are merged into one line if their alignments overlap, and their ID is given as a regular integer. Otherwise they are given on separate lines, with the forward read ID indicated with an 'a' and the reverse read ID with a 'b'. The forward read always comes first, even if it maps downstream of the reverse. Each line has four fields:
i) the sequence number
ii) chromosome
iii) spans of the alignment in genome coordinates
iv) the sequence of the alignment
  - all sequence is plus strand
  - sequence has a colon ":" where there is a junction
  - sequence has a +XXX+ if XXX is an insertion, e.g. +AG+ means AG inserted in the sequenced genome w.r.t. the reference
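The four-field layout lends itself to simple command-line summaries. The sketch below is illustrative only: summarise_alignments is a hypothetical helper, it assumes the fields are tab-separated (the text does not say which delimiter is used), and the two input lines, including their span notation, are invented for the example and are not real RUM output.

```shell
# Summarise alignment lines read from standard input.
# Assumes the four fields are separated by tabs.
summarise_alignments() {
    awk -F '\t' '{
        id = $1; chr = $2; seq = $4

        # IDs ending in "a" or "b" are unmerged forward or reverse reads;
        # a plain integer ID means the two reads were merged into one line.
        kind = "merged"
        if (id ~ /a$/) kind = "forward"
        else if (id ~ /b$/) kind = "reverse"

        # gsub returns the number of matches: each ":" marks one junction.
        junctions = gsub(/:/, ":", seq)

        # Each +XXX+ marks one insertion relative to the reference.
        insertions = gsub(/[+][A-Za-z]+[+]/, "&", seq)

        printf "%s\t%s\t%s\tjunctions=%d\tinsertions=%d\n", id, kind, chr, junctions, insertions
    }'
}

# Invented example lines.
printf '12\tchr1\t100-104, 200-202\tACGTA:CGT\n7a\tchr2\t50-55\tAC+AG+GTAC\n' | summarise_alignments
```

For the two invented lines this reports one junction and no insertions for read 12, and no junctions and one insertion for read 7a.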
Only the "N" is a variable in the above, the entry between colons is the "type": "i" means it's an "integer" and the "A" just means it's a printable character, in this case "T" for "true" or "F" for "false".
Once you have multiple lanes mapped, there is a script called 'featurequant2geneprofiles.pl'
in the scripts directory that will create one spreadsheet of the normalized intensities
with rows=genes and columns=samples. Run it without parameters to get the usage:
> perl rum/bin/featurequant2geneprofiles.pl
|Making Your Own Indexes|
We have supplied indexes for the most popular organisms, which you can install when you run the installation script. But you might want to create your own. It's easiest if your organism is available on the UCSC genome browser or at ENSEMBL, but even if not, it is not difficult to make your own. In any case, just download this tar ball and follow the instructions contained within. This assumes you know how to get around the Unix command line.
Please email Gregory Grant with any questions/comments: email@example.com
Gregory R. Grant1,2,4, Michael Farkas3, Angel Pizarro2, Nicholas Lahens5, Jonathan Schug, Brian Brunk1, Christian J. Stoeckert Jr1,4, John B. Hogenesch1,2,5 and Eric A. Pierce3