PaGE 5.0 Documentation

This document provides basic user documentation for the Java version of PaGE 5.0. It assumes you have already installed PaGE and know how to start it.

See the technical manual for the complete details of the algorithm and an expanded discussion of all issues. This document is meant to be a relatively quick introduction to the software. See also the examples for a walk through the usage on actual data.


Introduction

PaGE is a tool for analyzing microarray data that is used to:

The patterns are derived from comparisons to a reference group, so if there are n groups, the patterns have length n-1. For example, suppose there are four groups, with 3, 2, 3, and 4 replicates, respectively. The data might look like this:
 
id   c0r1    c0r2    c0r3    c1r1    c1r2    c2r1    c2r2    c2r3    c3r1    c3r2    c3r3    c3r4
G1   0.907   0.964   1.075   2.410   1.897   2.556   3.238   2.923   0.993   0.972   0.983   1.071
G2   1.136   1.114   1.069   7.311   6.197   1.114   0.874   1.192   3.310   3.399   3.860   4.077
G3   10.288  10.008  12.023  25.507  24.694  4.459   6.234   4.234   11.332  12.323  8.243   14.230
G4   5.234   6.493   2.330   4.570   8.498   4.349   8.323   6.384   21.937  25.788  18.847  14.324
Etc...
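
The column names in the example encode the condition and the replicate (c0r1 is condition 0, replicate 1). As a rough illustration only, and not part of PaGE itself, the following Python sketch groups such column names by condition. The function name group_columns and the regular expression are assumptions made for this example.

```python
import re
from collections import defaultdict

# Assumed naming scheme, taken from the example header: c<condition>r<replicate>.
HEADER_PATTERN = re.compile(r"^c(\d+)r(\d+)$")

def group_columns(header):
    """Map each condition number to the column positions of its replicates."""
    groups = defaultdict(list)
    for position, name in enumerate(header):
        match = HEADER_PATTERN.match(name)
        if match is None:
            continue  # skips non-data columns such as "id"
        groups[int(match.group(1))].append(position)
    return dict(groups)

header = ["id", "c0r1", "c0r2", "c0r3", "c1r1", "c1r2",
          "c2r1", "c2r2", "c2r3", "c3r1", "c3r2", "c3r3", "c3r4"]

groups = group_columns(header)
replicate_counts = {cond: len(cols) for cond, cols in groups.items()}
pattern_length = len(groups) - 1  # one position per non-reference group

print(replicate_counts)  # {0: 3, 1: 2, 2: 3, 3: 4}
print(pattern_length)    # 3
```

With four groups the pattern length is three, matching the n-1 rule described above.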

The patterns attached to these genes might look like this.
id   c1      c2      c3
G1   2       3       0
G2   3       0       1
G3   2       -1      0
G4   0       0       5

Positive integers represent upregulation and negative integers represent downregulation. Higher positive numbers represent greater differential expression; however, they are not meant to represent actual fold-change. A zero means there was insufficient evidence to make a differential expression call at the desired confidence level. The difference between levels 1, 2, 3, ... of upregulation, and -1, -2, -3, ... of downregulation, will be explained below.
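
To make the reading of a pattern concrete, here is a small Python sketch that turns a pattern into one call per non-reference group. It is only an illustration: the function name describe_pattern is hypothetical, and it assumes the reference group is c0, as in the example above.

```python
def describe_pattern(pattern):
    """Return one readable call per non-reference group (c1, c2, ...)."""
    calls = []
    for group, level in enumerate(pattern, start=1):
        if level > 0:
            calls.append(f"c{group}: up (level {level})")
        elif level < 0:
            calls.append(f"c{group}: down (level {-level})")
        else:
            # Zero means no call could be made at the chosen confidence.
            calls.append(f"c{group}: no call")
    return calls

# Patterns copied from the example table above.
patterns = {"G1": (2, 3, 0), "G2": (3, 0, 1), "G3": (2, -1, 0), "G4": (0, 0, 5)}

for gene, pattern in patterns.items():
    print(gene, "; ".join(describe_pattern(pattern)))
```

For G3, for instance, this reports upregulation at level 2 in c1, downregulation at level 1 in c2, and no call in c3.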

The statistical confidence measures used in PaGE are False Discovery Rate (FDR) measures. What is controlled, therefore, is the expected proportion of false predictions in the set of all predictions. Note that this differs fundamentally from a p-value based multiple testing approach, which would control the probability of making any false predictions at all. The FDR has largely replaced the p-value in microarray differential expression analysis, since the p-value approach is generally considered too conservative.

As such, it is not unreasonable to use an FDR of .5, while a p-value of .5 would be completely unreasonable. For example, if there are 10,000 elements on an array and only 100 are differentially expressed, then it will be virtually impossible to find them by PCR verification alone, since a gene picked at random has only a 1 in 100 chance of verifying. However, if we can find a set of 200 genes with an FDR of .5, then on average about one out of every two genes in this set will verify.
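
The arithmetic behind this example can be checked with a few lines of Python. This is only a sketch of the reasoning, using the numbers from the paragraph above. The helper name expected_true is hypothetical, and it treats the FDR as the expected fraction of false predictions in the reported set.

```python
def expected_true(set_size, fdr):
    """Expected number of correct predictions in a set reported at a given FDR."""
    return set_size * (1.0 - fdr)

total_elements = 10_000   # elements on the array
truly_changed = 100       # differentially expressed elements
selected = 200            # size of the reported gene set
fdr = 0.5

# Chance that a randomly picked element would verify.
random_hit_rate = truly_changed / total_elements

# Expected verification rate inside the reported set.
expected_verified = expected_true(selected, fdr)
selected_hit_rate = expected_verified / selected

print(f"random pick:  {random_hit_rate:.2%} expected to verify")
print(f"reported set: {selected_hit_rate:.2%} expected to verify "
      f"({expected_verified:.0f} of {selected})")
```

Even at an FDR of .5, the reported set is far richer in true positives than the array as a whole, which is why such a setting can still be useful.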

An FDR of .05 or .01 is often lower than necessary, and many genes will be missed. Usually you will want to choose the FDR based on the size of the result set. If at first you get too few or too many genes, you can raise or lower the FDR until you find a reasonable number (assuming there are any differentially expressed genes to be found in the data at all).

Unfortunately, microarray analysis is not a push-button exercise: every data set is unique and requires special considerations. Differential expression analysis is best performed interactively, with an algorithm flexible enough to allow looking at the data from different angles. See the technical manual for a more detailed discussion of this.

Microarray data come in many formats, and there are many ways to design a microarray experiment when looking for differential expression. PaGE therefore has numerous options for specifying exactly what kind of data and study design you are entering.

Input Data

Study Design

PaGE takes as input microarray data from two or more conditions. There must be at least two replicates per condition. Possible designs are:

Data and File Format

Running PaGE

You follow the menu items from left to right.

Setting the Analysis Type

Start by choosing File > new analysis type from the menu.

You will be asked whether it is 1-channel or 2-channel data. Note that Affymetrix data is considered 1-channel.

If you choose 2-channel you will be asked whether it is a reference design or a direct comparison design.

For both 1-channel and 2-channel data, the algorithm needs to know whether you have already log-transformed the data.

In the case of 1-channel and 2-channel direct comparison designs, you will be asked if the data are paired.

After answering these questions the analysis type has been fully defined. You next have to input the data.
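
The sequence of questions described above can be summarized as a small decision procedure. The Python sketch below is only an illustration of that logic, not PaGE code. The function name questions_asked and the labels "reference" and "direct" are assumptions made for this example.

```python
def questions_asked(channels, design=None):
    """List the questions needed to define an analysis type.

    channels: 1 or 2
    design:   "reference" or "direct" (only meaningful for 2-channel data)
    """
    if channels not in (1, 2):
        raise ValueError("channels must be 1 or 2")
    if channels == 2 and design not in ("reference", "direct"):
        raise ValueError("2-channel data needs design 'reference' or 'direct'")

    questions = ["1-channel or 2-channel?"]
    if channels == 2:
        questions.append("reference or direct comparison design?")
    questions.append("already log-transformed?")
    # Pairing is asked for 1-channel data and 2-channel direct comparisons only.
    if channels == 1 or design == "direct":
        questions.append("paired?")
    return questions

for args in [(1,), (2, "reference"), (2, "direct")]:
    print(args, "->", questions_asked(*args))
```

Note that a 2-channel reference design is the only case in which the pairing question is not asked.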

Inputting the data

Choose menu item Data > Open Data file.

You should see a file browser that will allow you to find the data file on your disk.

Including gene information and links in the output

Choose menu item Data > Open id2info file.

This allows you to enter a file of gene descriptions which will be included in the results file.

Choose menu item Data > Open id2url file.

This allows you to enter a file of URLs, to be included in the results file as links.

Executing the algorithm

You may choose to set some of the configuration options with the menu choice Options > Configuration (generally you will leave them at their defaults).

To execute the algorithm, choose menu item Run PaGE, and then choose one of the two statistics (generally you will want to start with the t-statistic).

You will now be asked to give the level confidence and the reference group.

The level confidence is the most important parameter to adjust.

If very few genes are differentially expressed, or if the data are very noisy, then a relatively low level confidence might be necessary to find them. A confidence is very different from a p-value, so even a confidence as low as .5 might be useful.

Conversely, if there is a large number of differentially expressed genes, on the order of thousands, then you will generally want to set the level confidence higher to see just the most confident predictions. In this case you might wish to raise the level confidence as high as .95, or even .99 in extreme cases.

If appropriate, you will be asked whether to run the algorithm on the logged or the unlogged data.

The Results Files

The results are first output to a window that displays them as text. You can save them either in tab-delimited text format or in HTML format. The HTML format has links so that you can easily view the intensities for any array element in the report.