GPAP Help and Frequently Asked Questions
GenePix AutoProcessor (GPAP) is an automated, customizable application that corrects, filters and normalizes raw microarray
data and identifies differentially expressed genes. Primary microarray data are captured using GenePix
to generate a GenePix Results (GPR) file. GPR files obtained from analyzing
biological or technical replicates of the same treatment or time point can
be processed using GPAP. User-defined inputs specify the stringency of data filtering
and correction, the definition of outlier
spots used when calculating the average across arrays,
and the method of normalization. The output includes a descriptive statistics summary, diagnostic plots
(before and after processing) and a gene summary
report across arrays for evaluating the filtering and normalization. In
addition, each gene is ranked for significance of differential expression by t-statistic, p-value and B-statistic. GPAP utilizes the R statistical language with the Bioconductor and LIMMA
packages.
File names should start with a letter, be fewer than 20 characters long and
be "UNIX friendly" (avoiding non-alphabetic characters in the file name).
Further considerations for your files and file names are listed
below.
If you have replicates evenly printed on your slide, such that replicate
features are separated by an equal number of features on the array
and therefore by an equal number of rows in the GPR files,
GPAP can average these replicates (after filtering and normalization) and
provide averaged results and overall B-statistics for each gene. This
works when the replicates are printed in different replicate arrays.
For example, suppose your microarray was printed using 16 pins (4x4 pins), each sub-array
or block contains 18x18 spots (324 features per block) and you have printed
duplicate spots in replicate arrays. The replicates would then be equally spaced,
with 16x18x18 = 5184 spots between each replicate, and 5184 rows would exist between each
replicate in the GPR files.
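The spacing arithmetic above can be sketched in a few lines of Python. This is an illustration only, not GPAP's own code (GPAP runs in R); the function names and the use of None to mark a filtered spot are assumptions made for the example.

```python
def replicate_spacing(pin_rows, pin_cols, spot_rows, spot_cols):
    """Rows between a feature and its replicate when each replicate array
    holds pin_rows * pin_cols blocks of spot_rows * spot_cols spots."""
    return pin_rows * pin_cols * spot_rows * spot_cols


def average_replicates(log_ratios, spacing):
    """Average every feature with its replicates located `spacing` rows apart.

    `log_ratios` holds one value per GPR row; None marks a filtered spot
    (hypothetical convention for this sketch).
    """
    averaged = []
    for first_row in range(spacing):
        # Collect this feature's value from every replicate array.
        values = [v for v in log_ratios[first_row::spacing] if v is not None]
        averaged.append(sum(values) / len(values) if values else None)
    return averaged


spacing = replicate_spacing(4, 4, 18, 18)  # 16 pins, 18x18 spots per block
print(spacing)  # 5184
```

With a spacing of 5184, row 0 is averaged with row 5184, row 1 with row 5185, and so on.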
GenePix Pro results containing signal intensity data are saved as .gpr files, which are in a tab-delimited
plain text file format.
You will need to upload 2-6 GPR files of biological and/or technical replicates
of a single treatment or
time point. You should not upload GPR files derived from different
treatments or different time points because the results will be averaged
together. For example, if you have three biological replicates at 1 hour,
upload these three GPR files at one time. You
must have at least two biological or technical replicates; you cannot
process a single GPR file.
GPAP and the underlying R statistical language with the Bioconductor and LIMMA packages are very sensitive to format, consistency and use of non-alphanumeric characters. Errors frequently occur when the files uploaded do not meet the format required by the R statistical language. When problems occur while attempting to process data you should consider the following known problems to help identify errors and correct your input files for processing.
Raw microarray data extracted from image analysis contain many data points
that are very low in overall intensity, saturated in both channels, noisy or
otherwise of poor or questionable quality. Based on user-defined baseline and
threshold values, such poor-quality features are filtered out and removed
from further processing and analysis, including normalization and replicate
averaging.
Normalization adjusts and balances individual signal
intensities to reduce technical or systematic differences not caused by the
treatment, allowing more meaningful interpretation of the biological
effect of the treatment. Normalization is used to correct for
systematic variation, NOT biological variation. Systematic or technical
variations include differences in dye incorporation rates, RNA loading, RNA
purity and quality, laser age and power, emission
characteristics and stability of the fluors, and a multitude of unrecognized
variables.
The baseline and threshold values are used to exclude or filter out low
intensity, poor-quality data before analysis. The baseline value is a number (such as 200 or 500) that you set to
filter out low-intensity spots. A spot whose signal intensities in both
channels are below the user-defined baseline value (before background subtraction)
is filtered out (filtering), leaving those features with signal greater than the
baseline value in at least one channel. If a spot's intensity is below the
user-defined baseline value in only one channel (before background
subtraction), that intensity is raised to the baseline value (scaling). The threshold value (T-value) is also a number (such as
1, 2 or 3)
that you can set to filter out "noisy" spots. Based on the T-value, GPAP
calculates a dynamic threshold for each spot (Feature Background Intensity + [T-value *
Background Standard Deviation]), and spots with intensities below this threshold
in both channels (before background subtraction) are filtered out.
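The filtering and scaling rules above can be sketched as follows. This is a minimal Python illustration, not GPAP's implementation: the function and parameter names are hypothetical, and applying the baseline check before the threshold check is an assumption, since the order is not stated here.

```python
def filter_spot(f_cy5, f_cy3, b_cy5, b_cy3, b_sd_cy5, b_sd_cy3,
                baseline=200, t_value=None):
    """Return the (cy5, cy3) intensities to keep, or None if the spot is filtered out.

    f_* are raw feature intensities (before background subtraction),
    b_* are feature background intensities and b_sd_* their standard deviations.
    """
    # Baseline filtering: drop the spot if both channels are below the baseline.
    if f_cy5 < baseline and f_cy3 < baseline:
        return None

    # Threshold filtering (optional): per-spot, per-channel dynamic threshold.
    if t_value is not None:
        thr_cy5 = b_cy5 + t_value * b_sd_cy5
        thr_cy3 = b_cy3 + t_value * b_sd_cy3
        if f_cy5 < thr_cy5 and f_cy3 < thr_cy3:
            return None

    # Scaling: a single channel below the baseline is raised to the baseline.
    return max(f_cy5, baseline), max(f_cy3, baseline)


print(filter_spot(150, 180, 100, 100, 10, 10))   # None (both below baseline)
print(filter_spot(150, 900, 100, 100, 10, 10))   # (200, 900) (one channel scaled)
```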
You should set these two criteria to filter and scale the raw intensity data
in GPR files appropriately. We recommend you begin with the default
values (a baseline value of 200 and no threshold value) along with an
appropriate normalization method and view the resulting
"before and after" diagnostic plots to evaluate the impact of these settings. Currently, the
baseline cannot be set to a value below 200 and we suggest that you use at least
the minimum baseline value. Hybridizations with extremely high background
and noise will require more stringent filtering than the default.
GPAP will down-weight spots in GPR files whose Flag value is less than 0; that is, spots
flagged in GenePix Pro as "bad", "not found" or "absent" (-100, -50 and -75,
respectively). Features flagged to "Include in Normalization" will
not be down-weighted unless they are also flagged as "bad", "not found" or "absent".
If you have applied a normalization method using GenePix Pro, you can choose "No
Normalization" from the normalization drop down box. If your data
are not already normalized, we recommend a non-linear normalization method which
is based on feature intensity (Loess - Global intensity dependent normalization)
and/or spatial grouping (Loess - Within-print-tip-group intensity dependent
normalization). These non-linear normalization methods are widely accepted
and applicable to most microarray data. Additional normalization methods
(Linear Global Median, etc) are
available if needed or warranted. Evaluation of the resulting
"before and after"
diagnostic plots will help determine if your data are better suited for
one normalization method over another. For example, the distribution of
log-ratios as seen in the Box Plots for individual arrays
may reveal a spatial effect indicating "Loess - Within-print-tip-group intensity
dependent normalization" is warranted.
(NOTE: Remember, normalization is used to correct for systematic variation
NOT biological variation. Sub-genomics arrays with few probes, probes selected with
bias for participation in your treatment (subtracted cDNA libraries, SSH
libraries, selected oligo subsets, etc) and/or treatments creating global changes
(e.g., starvation, heatshock, etc) may alter the expression of so many features
on your array that these normalization methods could obscure true changes in
expression. Reliable normalization controls (such as heterologous
genes and companion RNA spikes) are required in these cases and normalization to
those controls should be done in GenePix Pro with no additional normalization
applied in GPAP.)
An outlier is defined as a value far from most others in a set of data.
An outlier that is many standard deviations from the mean will have dramatic
impact on the average. Outliers should be identified and cast out of the
data set to obtain a more meaningful average. However, identifying a
large number of outliers for a given gene can indicate true variability of the
mRNA abundance - an equally valuable measurement. The outlier definition is a number (such as
1, 2 or 3) used by GPAP to identify and remove outliers when calculating the final average Log2 ratios within and across arrays'
replicates for each gene, and is included in the Gene Summary Report. If the outlier definition is
set at "2", any Log2 ratio that lies more than 2 standard deviations from the mean is considered
an outlier and is removed, and the average is recalculated. If the outlier definition is
set at "3", any Log2 ratio that lies more than 3 standard deviations from the mean is considered
an outlier and is removed, and the average is recalculated.
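The outlier rule can be illustrated with a short Python sketch. It assumes a single pass (outliers are identified once, then the average is recalculated) and the sample standard deviation; neither detail is confirmed here, and the function name is hypothetical.

```python
import statistics


def average_without_outliers(log_ratios, outlier_definition=2):
    """Return (average, kept values, removed outliers) for one gene."""
    if len(log_ratios) < 2:
        # A standard deviation needs at least two values; nothing to remove.
        average = statistics.mean(log_ratios) if log_ratios else None
        return average, list(log_ratios), []

    mean = statistics.mean(log_ratios)
    sd = statistics.stdev(log_ratios)  # sample SD (N - 1 denominator)
    limit = outlier_definition * sd

    kept = [x for x in log_ratios if abs(x - mean) <= limit]
    outliers = [x for x in log_ratios if abs(x - mean) > limit]

    # Recalculate the average from the remaining values.
    average = statistics.mean(kept) if kept else None
    return average, kept, outliers


ratios = [1.0] * 9 + [10.0]
print(average_without_outliers(ratios, 2)[2])  # [10.0] is removed at 2 SD
print(average_without_outliers(ratios, 3)[2])  # [] nothing is removed at 3 SD
```

In this example the value 10.0 lies about 2.8 standard deviations from the mean, so it is removed with a definition of 2 but retained with a definition of 3.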
A box plot graphically represents several descriptive statistics of a
given data set and usually consists of a box containing a central line
and two tails. The upper and lower boundaries of the box show the
location of the 75th and 25th percentiles, respectively. The
box therefore encloses the central 50% of the data, and the central line
in the box shows the position of the median. The lines extending from the
box display the spread of the data. Replicate arrays should be similar in
range; otherwise they should be discarded. Box plots are provided for all of
the data points in the GPRs and for each print-tip-group within individual GPRs
and are drawn for the raw data and the processed (filtered and normalized) data.
Box plots for individual arrays are helpful in diagnosing spatial effects and
observing the impact of "Loess - Within-print-tip-group intensity dependent
normalization". Thus, box plots can
be useful for visually comparing different normalization methods.
Box plots are produced using the R statistical language
with the Bioconductor package.
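As a rough illustration of the statistics a box plot displays, the Python sketch below computes the quartiles, median and range of a set of log ratios. It is not the R code GPAP uses; the function name is hypothetical, and the tails are simplified to the minimum and maximum, whereas plotting software often limits them and draws extreme points separately.

```python
import statistics


def box_plot_summary(values):
    """Five-number summary: the numbers a simple box plot is drawn from."""
    # n=4 splits the data into quartiles: 25th, 50th (median), 75th percentiles.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}


print(box_plot_summary([1, 2, 3, 4, 5]))
# {'min': 1, 'q1': 2.0, 'median': 3.0, 'q3': 4.0, 'max': 5}
```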
The scatter plot is one of the simplest methods used to visualize overall
mRNA expression levels within a single hybridization. The M-A scatter
plot is a convenient way to observe the distribution of intensity values and log
ratios. M-A scatter plots
are provided for each array and for separate blocks of each array. The
colored lines appearing within the scatter plot represent the average for each
print-tip-group and are drawn for the raw data and the processed (filtered and
normalized) data. The M-A scatter plot is a plot of the log2 ratio, M = log2(cy5/cy3),
vs. the log2 intensity, A = 1/2(log2(cy5*cy3)). If a gene has equal
expression values in both the control and experiment, its
log2 ratio will be zero. For a typical hybridization experiment, most
genes will have equal expression values in both control and experiment and we
expect the majority of points to be grouped around the horizontal line Y=0.
Without normalization, the majority of points may be clustered around a
horizontal line greater or less than Y=0, indicating the need for normalization.
Notice that unrealistic and often dramatic scattering of data points is often
observed prior to processing as "A", the log2 intensity (1/2(log2(cy5*cy3))),
approaches zero, emphasizing the need to filter out low intensity,
poor-quality data before analysis and interpretation. The baseline and/or
threshold values should be stringent enough to remove the scattering near the
lower intensities and increase the reliability of the reported ratios.
Scatter plots are produced using the R statistical language with the Bioconductor
package.
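The two quantities plotted in an M-A scatter plot can be computed as in the Python sketch below. The function names are hypothetical and the inputs are assumed to be positive, already filtered channel intensities.

```python
import math


def m_value(cy5, cy3):
    """Log2 ratio: M = log2(cy5 / cy3)."""
    return math.log2(cy5 / cy3)


def a_value(cy5, cy3):
    """Log2 intensity: A = 1/2 * log2(cy5 * cy3)."""
    return 0.5 * math.log2(cy5 * cy3)


print(m_value(1024, 256), a_value(1024, 256))  # 2.0 9.0
print(m_value(500, 500))                       # 0.0 (equal expression)
```

A spot with equal signal in both channels gives M = 0 and therefore falls on the horizontal line Y=0.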
The Q-Q (Quantile-Quantile) plot provides a visual comparison of two
populations and is a plot of the sampled t-statistic vs. a theoretical
t-statistic. The Q-Q plot can indicate the degree to which a sample diverges from
a normal distribution.
Points which deviate markedly from a linear relationship to the theoretical
t-statistic could be considered candidate genes exhibiting differential
expression. The Q-Q plots allow you to visualize the magnitude of
differential gene expression within the sample tested based on Student's
t-test. However, the ordinary Student's t-test is not ideally suited for
microarray data because a large t-statistic can be driven by an unrealistically
small standard deviation. The Q-Q plots are produced using the R statistical language with the Bioconductor
package.
The density plots (density vs
log2
intensity (1/2(log2(cy5*cy3)))) are of
single-channel log-intensity densities and illustrate the distribution of
single-channel intensities. Normally the distribution of intensities
should appear roughly bell-shaped; however depending on the choice of genes and
the experiment conducted, the distribution of intensities may appear
double-peaked or skewed to one side. Density plots can be useful for
visually comparing different normalization methods.
Density plots are produced using the R statistical language with the
LIMMA
package.
The B-statistic is based on an empirical Bayes approach to rank genes and
determine whether a gene is significantly differentially expressed.
Classically, inference of significant changes in gene expression was based
on a fixed value, an absolute 2-fold or greater change (Log2 ratio >1 or <-1).
However, this is an arbitrary threshold which can lead to false positive and
false negative inferences and does not account for more subtle variations with
biological significance. More recent approaches for determining
significant changes in expression rely heavily on adequate biological replicate
hybridizations and the calculation of a suitable statistic, such as a moderated
t-statistic or a B-statistic, which ranks each gene to indicate whether it
has significantly changed in expression. The ordinary t-statistic is
not ideal because a large t-statistic can be driven by an unrealistically small
standard deviation. An added advantage of the t-statistic is that it brings the
standard deviation, number of replicates and sample variance into
the assessment of the averaged Log ratio. The B-statistic is the log-odds that a gene is differentially expressed.
For example, if B=1.5, the odds of differential expression are exp(1.5)=4.48 and
the probability that the gene is differentially expressed is 4.48/(4.48+1)=0.82, so there
is an 82% chance that the gene is differentially expressed. The B-statistic is automatically
adjusted for multiple testing by assuming that 1% of the genes are expected to be differentially
expressed. In GPAP, the genes are ranked or scored
according to the B-statistic and selection of a cut-off value is then
determined by the user/investigator. If the B-statistic is 0 (zero),
there is a 50% chance the measured Log2 ratio is random and not significant. The introduction of additional biological replicates (with high correlation coefficients) tends to produce higher B-statistic values.
The higher the B-statistic, the more significant the result. The B-statistic,
t-statistic and P-value (probability) are generated using the R statistical
language with Bioconductor and the LIMMA package.
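The conversion from a B-statistic to a probability, as in the B=1.5 example, can be written as a short Python sketch. The function name is hypothetical; the natural exponential is used because the example above uses exp().

```python
import math


def b_to_probability(b):
    """Convert a B-statistic (log-odds) to a probability of differential expression."""
    odds = math.exp(b)          # odds of differential expression
    return odds / (1.0 + odds)  # probability = odds / (odds + 1)


print(round(b_to_probability(1.5), 2))  # 0.82
print(b_to_probability(0))              # 0.5
```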
M is the log-transformed ratio, typically calculated as log2(cy5/cy3) or
log2(treatment/control), and is used instead of intensity ratios so that
up-regulated and down-regulated values are on the same scale and comparable.
A two-fold change is represented by a log2 ratio of 1.0 (up-regulation) or -1.0
(down-regulation). A three-fold change in gene expression is represented
by a Log2 Ratio of 1.58 or -1.58. This value is not averaged in the "M
and A value for individual spots" report nor in the diagnostic plots.
However, the M value is averaged in the "Gene Summary (averaged)" and
"B-Statistics Ranking" reports for the same spot between slides and within
slides if replicates are evenly printed on the array.
A is typically referred to as the intensity or log2intensity and is
calculated as 1/2(log2(cy5*cy3)). This value is useful to observe the
distribution of signal intensities and to recognize if a spot produced
brilliant or weak signal. Note that the signal intensities are the result
of transcript abundance AND probe abundance, spot quality, hybridization
kinetics, cross-hybridization events, etc.; thus a weak signal cannot be
interpreted as low mRNA levels because the spot may simply be a bad probe.
This value is not averaged in the "M and A value for individual spots" report nor in
the diagnostic plots. However, the A value is averaged in the "Gene
Summary (averaged)" and "B-Statistics Ranking" reports between slides and
within slides if replicates are evenly printed on the array.
"t" is a moderated t-statistic and is a ratio of Log2-expression
level to its standard error. The moderated t-statistic has the same interpretation as an ordinary Student's
t-statistic except that the standard errors have been moderated across genes, effectively borrowing
information from the ensemble of genes to aid with inference about each individual gene.
The moderated t-statistic and associated p-value do not require a prior guess for the number of
differentially expressed genes.
The P-value is obtained from the moderated t-statistic after FDR adjustment using
Benjamini and Hochberg's method to control the false discovery rate. If you select all the genes
with a p-value less than a given value, say 0.05, as differentially expressed, then the expected
proportion of false discoveries in the selected group should be less than that value, in this case less than 5%.
Among the three statistics (moderated t, associated p-value and B-statistic), we usually base our gene
selection on the p-value. The p-value
represents an area under a probability curve and is compared against
a significance level. The significance level is defined by the user and is
normally 0.05 or less, and any P-values below that mark are considered
"significant". P-values do not simply provide you with a "Yes" or "No"
answer; they provide a sense of the strength of the evidence. The lower
the p-value, the stronger the evidence.
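For readers unfamiliar with the Benjamini and Hochberg adjustment, the following Python sketch shows the standard step-up calculation on a list of raw p-values. It is a generic illustration of the method, not GPAP's code (GPAP obtains its adjusted values in R), and the function name is hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
print([round(p, 4) for p in bh_adjust(raw)])  # [0.02, 0.04, 0.04, 0.02]
```

Selecting every gene whose adjusted value is below 0.05 then corresponds to an expected false discovery proportion below 5% among the selected genes.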
SD is the standard deviation of the log2 ratios for each gene. SD
is one of several indices of variability used to characterize the dispersion
among the measures in a given population. The standard deviation is the
square root of the variance and is calculated as sqrt[ sum((x - u)^2) /
(N-1) ], where x is the log2 ratio of an individual spot, u is the mean log2 ratio
and N is the total number of spots. Variance is a measure of how spread
out a distribution is. The standard deviation is a measure of how
dispersed the measured values are from the mean.
CV or coefficient of variation is a statistic used to describe the amount of
variation within a set of measurements and is calculated as the (SD of Log2
Ratios)/(Mean of Log2 Ratios).
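The SD and CV of a gene's log2 ratios can be computed as in this Python sketch. The function name is hypothetical, and returning None for the CV when the mean is exactly zero is a choice made for the example, since the ratio is undefined in that case.

```python
import statistics


def sd_and_cv(log_ratios):
    """Return (SD, CV) of the log2 ratios for one gene."""
    mean = statistics.mean(log_ratios)
    sd = statistics.stdev(log_ratios)  # sqrt(sum((x - u)^2) / (N - 1))
    # CV = SD / mean; undefined when the mean log2 ratio is zero.
    cv = sd / mean if mean != 0 else None
    return sd, cv


print(sd_and_cv([1.0, 2.0, 3.0]))  # (1.0, 0.5)
```

Because log2 ratios of unchanged genes average close to zero, the CV can become very large for such genes even when the SD is small.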
Weight is used in normalization. During normalization, good spots with high intensity in both
channels are given full weight (weight equal to 1) and bad spots, which are flagged in GenePix or have low intensity
in both channels, are given low weight (weight equal to 0.1).
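A minimal Python sketch of this weighting rule follows. Several details are assumptions made for illustration: "low intensity" is taken to mean below the baseline value, "flagged" is taken to mean a GenePix Flag less than 0, and a spot that is low in only one channel is given full weight. The function name is hypothetical.

```python
def spot_weight(flag, f_cy5, f_cy3, baseline=200):
    """Return the normalization weight for one spot (1.0 or 0.1)."""
    flagged_bad = flag < 0  # assumption: negative GenePix flags mark bad spots
    low_in_both = f_cy5 < baseline and f_cy3 < baseline  # assumption: baseline cutoff
    return 0.1 if (flagged_bad or low_in_both) else 1.0


print(spot_weight(0, 5000, 4000))     # 1.0
print(spot_weight(-100, 5000, 4000))  # 0.1
print(spot_weight(0, 100, 150))       # 0.1
```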
The Fold Change is calculated by the following formula:
Fold change=2^(signal Log2 ratio) (signal log2 ratio>=0)
Fold change=(-1)*2^(-1*signal Log2 ratio) (signal log2 ratio<0)
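The two-branch formula above translates directly into code. The Python sketch below uses a hypothetical function name and is shown only to make the sign convention concrete.

```python
def fold_change(log2_ratio):
    """Signed fold change from a log2 ratio, following the formula above."""
    if log2_ratio >= 0:
        return 2.0 ** log2_ratio
    # Negative ratios are reported as negative fold changes (down-regulation).
    return -(2.0 ** (-log2_ratio))


print(fold_change(1.0))   # 2.0
print(fold_change(-1.0))  # -2.0
print(fold_change(0.0))   # 1.0
```

With this convention a log2 ratio of 1.58 gives a fold change of about 3, and a log2 ratio of -1.58 gives about -3.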
GPAP uses the Spearman rank correlation coefficient. The Spearman rank correlation coefficient is a non-parametric,
distribution-free method. It is used when the distribution of the data makes the Pearson method misleading. The Pearson method
is the default and the only method for calculating correlation coefficients in Excel.
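To show how the two methods differ, the Python sketch below computes the Spearman coefficient as the Pearson coefficient of the ranks, with tied values given their average rank. This is the textbook definition written out for illustration, with hypothetical function names; it is not the routine GPAP calls in R.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # monotonic but not linear in x
print(round(spearman(x, y), 3), round(pearson(x, y), 3))  # 1.0 0.981
```

The example relationship is perfectly monotonic, so the Spearman coefficient is 1 while the Pearson coefficient is lower because the relationship is not linear.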
The number of "Spots Used" is the total number of spots (Log2 ratios) used for each gene to calculate the average Log2 ratio after removing data that were flagged or filtered by the baseline, threshold and outlier definitions. The number of outliers removed is listed in the adjacent column, and the "missing spots" are those flagged or filtered out by the baseline and threshold definitions. These filtered data points also appear on the "M and A value for individual spots" report as missing data and are scored as "NA".
When data and settings are submitted to GPAP for processing, each request is
assigned an identification number such as 1068085560514 called the Job ID.
This number appears briefly during processing and on the first line of the
Statistics Summary. You can use this number to retrieve the results at a
later date (up to one month). To revisit old data, click on "Request
Results" from the left menu, enter the saved Job ID into the box and click the
"Get Results" button.
GenePix Pro Auto-Processor (GPAP) was developed by Hua Weng and Patricia Ayoubi
in the OSU Microarray Core Facility and Bioinformatics Group in the
Department of Biochemistry and Molecular Biology at Oklahoma State
University (Stillwater, OK 74074, USA). You can contact the developers by
email: Hua Weng,
hweng@biochem.okstate.edu or Patricia Ayoubi,
ayoubi@okstate.edu .
"Pre-processing and normalization of data was accomplished using R-project statistical environment (http://www.r-project.org) and Bioconductor (http://www.bioconductor.org) through the GenePix AutoProcessor (GPAP) website (http://darwin.biochem.okstate.edu/gpap , Weng and Ayoubi, 2004)."
Paper:
Hua Weng and Patricia Ayoubi. 2004. GPAP (GenePix Pro
Auto-Processor) for online preprocessing, normalization and statistical analysis
of primary microarray data. In preparation.
Web site:
The GPAP web site
[http://darwin.biochem.okstate.edu/gpap]