Table of Contents:

* Installation
* Usage
* Parameter Description
* File Format
* Notes
* Theoretical Details

INSTALLATION:
------------

Download STAR_1.0.tar and unpack it. STAR is a command-line program. On Mac you must use the Terminal; on Windows, use either the Command Prompt or a Unix emulator such as Cygwin.

In the directory where the files were unpacked, try running the following test:

> java -jar STAR.jar 10 1 true .5 10 true test_data1.txt false false test_out 1 true 10 5 0

This should output two files, test_out_spans and test_out_wig, which should look like this:

http://cbil.upenn.edu/STAR/test_out_spans
http://cbil.upenn.edu/STAR/test_out_wig

Simply type java -jar STAR.jar with no parameters to get the full usage and parameter description. This is also copied below in this document.

USAGE:
------

java -jar STAR.jar

PARAMETER DESCRIPTIONS:
---------------------

* extension length is the total fragment length minus the read length.

* control is one of:
  1) 'false', or
  2) the name of the control file, or
  3) the name of a file of regions to mask; in this case the file name must end in '_mask'.

* remove_identical_reads is 'false' if no, 'true' if yes.

* repeats_file is 'false' if you don't want to mask repeats; otherwise it is the name of a file of repeats in the (tab delimited) format: chr start end (download here: cbil.upenn.edu/STAR/repeats).

* start_coordinate is 0 or 1, depending on whether the first position in each chromosome is denoted by 0 or 1.

* right-end-inclusion indicates whether the read end is included: 'false' for not included and 'true' for included.

* 'normalize to this many reads' is an integer. Reads will be thrown out at random until this many reads remain; set it to zero to keep all reads.

FILE FORMAT:
------------

The sample (and control) file(s) should have at least four tab delimited columns: chr name, start position, end position, strand.
Strand can be any column after the end position column. The program determines automatically which column holds the strand, and intervening columns are ignored (if it cannot determine the strand column, it will prompt you to enter it). Files can have header lines; they will be ignored.

NOTES:
------

* NOTE 1: The program will not work correctly if the genome length is greater than 21.4 Gb.

* NOTE 2: You will probably need to increase the default memory. After 'java' put the option -Xmx2000m to raise it to 2000 Mb, or change 2000 to whatever is necessary. If the program hangs for a long time but doesn't crash, it might run faster with more memory. With 5 million reads we find it necessary to use 2000 Mb. The current limit is somewhere between 10 and 20 million reads; a version handling more reads will be released.

* NOTE 3: 10 permutations should be plenty, because the program is not computing p-values but is estimating other quantities for which fewer permutations are needed.

* NOTE 4: To output only the wig file, use an output file name that ends in _wig.

* NOTE 5: The program outputs two files: one with all significant locations collapsed into spans, and a wig file which gives the max count at any location over all windows containing that location. To output only the spans file, use an output file name that ends in _spans.

THEORETICAL DETAILS:
-------------------

We describe here the algorithm to identify regions enriched for a histone modification based on ChIP-Seq evidence. This algorithm, named Statistical Test for the Accumulation of Reads (STAR), is implemented in Java and is freely available as open source at http://www.cbil.upenn.edu/STAR. To get a usage guide, execute the java program with no parameters.

In what follows, we indicate by "sample" the data from the condition whose histone modifications are being studied. We indicate by "control" the optional data coming from input which was not immunoprecipitated.
When controls are available, STAR offers two options to utilize them. The first consists in initially analyzing the controls with STAR (using approach 1 described below) to determine significant peaks, i.e. regions of bias. Those regions are then masked out of the sample, and the sample is then analyzed using approach 1 described below. The second consists in utilizing the control, instead of permutations, to determine the significant peaks in the sample, as described in approach 2 below.

We observed that the Illumina Genome Analyzer sequence reads from non-precipitated controls were not uniformly distributed, but showed a consistent pattern of bias. Using STAR, we determined those regions that are consistently significant across some percentage of controls, and we then proceeded with approach 1 below, masking out these regions.

In both scenarios STAR also performs repeat masking if a repeats file is provided (we have made human and mouse repeat files formatted for STAR available at cbil.upenn.edu/STAR/repeats).

1. PERMUTATION APPROACH.

Under this option, STAR takes as input a tab delimited file for the sample in BED format giving the genomic locations and read lengths of all mapped reads. It also takes, as optional input, a file listing regions of control bias to be masked out prior to subsequent analyses. After this optional masking, STAR determines whether local accumulations of reads are statistically significant by using randomized data as controls. The details of this procedure follow.

The resolution of the data is set by choosing a window size L and a displacement of size D. For each chromosome the first window is placed at the start of the chromosome, positions 1 through L. Each next window is obtained from the previous one by moving it D bases to the right. The second window is therefore at positions D+1 through D+L. If D is greater than or equal to L, the windows will be disjoint. For a given window W of L bases, let N(W) be the number of reads overlapping W.
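
The windowing scheme above can be sketched as follows. This is a minimal illustration in Python, not STAR's actual Java implementation. The function name window_counts, the representation of reads as (start, end) pairs with 1-based inclusive coordinates, and the choice to drop a final window that would run past the end of the chromosome are all assumptions made for the example.

```python
def window_counts(reads, chrom_length, L, D):
    """Return N(W) for every window W on one chromosome.

    reads        -- list of (start, end) pairs, 1-based inclusive (assumed)
    chrom_length -- length of the chromosome in bases
    L            -- window size
    D            -- displacement between consecutive windows

    Windows cover positions 1..L, D+1..D+L, 2D+1..2D+L, and so on.
    A window that would extend past chrom_length is dropped; that
    boundary rule is an assumption, not taken from STAR.
    """
    counts = []
    start = 1
    while start + L - 1 <= chrom_length:
        end = start + L - 1
        # A read overlaps the window if the two intervals intersect.
        n = sum(1 for (s, e) in reads if s <= end and e >= start)
        counts.append(n)
        start += D
    return counts


if __name__ == "__main__":
    toy_reads = [(1, 5), (4, 8), (20, 25)]
    print(window_counts(toy_reads, chrom_length=30, L=10, D=5))
```

The loop is quadratic in the worst case (every window scans every read), which is fine for a toy example; a real implementation would sort the reads or bin them by position.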
For each positive integer n, let R(n) be the number of windows W of length L such that N(W) > n. A permutation of the reads is performed by placing them back on the genome, each with an equal probability of being placed at any location. A number k of such permutations are performed, denoted p1,...,pk. For each i=1,...,k and positive integer n, we compute the number of windows of length L in permutation pi for which N(W) > n, and we denote by A(n) the average of these k numbers.

For each n, we use F(n) = min{1, A(n)/R(n)} to estimate the proportion of false positive windows obtained by calling significant all windows in the sample for which N(W) > n. A(n) estimates the number of false positives assuming no regions are modified. This is a conservative (over) estimate, in that its expected value is an upper bound for the number of false positive windows with N(W) > n.

Each window W is given a score S(W) = F(N(W)). A minimum n0 is chosen so that F(n0) < alpha for a chosen error rate alpha. Peaks are then called by merging into spans all overlapping windows in the sample with N(W) > n0. Each span is reported with a score given by the average S(W) across all windows W contributing to the span. This score should decrease as the strength of evidence for the region to have been correctly determined increases.

2. CONTROL METHOD.

Under this option, STAR takes as input two tab delimited files, one for the sample and one for the control, in BED format giving the genomic locations and read lengths of all mapped reads. First, significant peaks are determined in the control using approach 1 and are masked out from both the sample and the control. Then the method proceeds analogously to (1), except that instead of using permutations to compute A(n), the latter is defined as the number of windows of length L in the control for which N(W) > n.
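
The thresholding step shared by both approaches can be sketched as follows. This is a minimal illustration in Python, not STAR's actual Java code. It takes per-window counts N(W) as given: one list for the sample and one or more "null" lists, which would come from the k permutations in approach 1 or from the single control in approach 2. The function names fdr_curve, choose_n0 and merge_spans are hypothetical. Two further assumptions are made because the text does not settle them: F(n) is set to 1 when R(n) = 0, and n is scanned from 1 up to the largest sample count. Span scores are left out.

```python
def count_above(counts, n):
    """Number of windows whose read count is strictly greater than n."""
    return sum(1 for c in counts if c > n)


def fdr_curve(sample_counts, null_counts_list):
    """Return {n: F(n)} with F(n) = min(1, A(n) / R(n)).

    R(n) is computed from the sample; A(n) is the average over the null
    count lists (k permutations, or one control for approach 2).
    """
    F = {}
    for n in range(1, max(sample_counts) + 1):
        R = count_above(sample_counts, n)
        A = sum(count_above(nc, n) for nc in null_counts_list) / len(null_counts_list)
        # Assumption: when no sample window exceeds n the ratio is
        # undefined, so F(n) is treated as 1 here.
        F[n] = min(1.0, A / R) if R > 0 else 1.0
    return F


def choose_n0(F, alpha):
    """Smallest n with F(n) < alpha, or None if there is none."""
    for n in sorted(F):
        if F[n] < alpha:
            return n
    return None


def merge_spans(sample_counts, n0, L, D):
    """Merge overlapping windows with N(W) > n0 into (start, end) spans.

    Window i (0-based) covers positions i*D + 1 through i*D + L.
    """
    spans = []
    for i, c in enumerate(sample_counts):
        if c <= n0:
            continue
        start, end = i * D + 1, i * D + L
        if spans and start <= spans[-1][1]:
            spans[-1] = (spans[-1][0], end)   # overlaps the previous span
        else:
            spans.append((start, end))
    return spans


if __name__ == "__main__":
    sample = [0, 1, 5, 6, 1, 0, 7, 0]
    nulls = [[0, 1, 1, 2, 0, 1, 0, 1], [1, 0, 2, 1, 1, 0, 1, 0]]
    F = fdr_curve(sample, nulls)
    n0 = choose_n0(F, alpha=0.2)
    print(n0, merge_spans(sample, n0, L=10, D=5))
```

In this toy data one window per null list exceeds a count of 1, so A(1) = 1 and R(1) = 3; no null window exceeds 2, so the threshold lands at n0 = 2 for alpha = 0.2.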