Ajeet Ryatt
Bio 480 – Individual Project

Applying ANOVA statistics to microarray analysis

Introduction

Analysis of Variance, or ANOVA, is a statistical technique used to determine whether the means of several groups of data differ significantly from one another or whether the observed differences are due to chance. This process is very similar to a t-test, except that ANOVA can be applied to more than two sets of data at once. Consider the following table:

Number of Groups     Number of T-Tests
3                    3
4                    6
5                    10
6                    15
7                    21
8                    28

Table 1. Comparison of number of data groups versus number of t-tests that must be done. Number of groups refers to how many individual data sets there are. For example, in microarray analysis, each slide would be considered a separate group. If using slides with duplicate hybridization locations, like the 3DNA slides used in lab, each half can be counted as a discrete group.

Since a t-test compares only two groups of data at a time, every pair of groups needs its own test, and it becomes very cumbersome to do multiple t-tests for microarray analysis. Most scientific journals will require 6-10 microarrays as data in order to get published, so the researcher may have to do at least 15 t-tests to establish the statistical significance of their values. ANOVA can be applied to many sets of data at once, so it is easier to work with when there are many data groups.
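The counts in Table 1 are the number of distinct pairs that can be formed from the groups. A minimal Python sketch of that count follows; the function name pairwise_t_tests is only illustrative.

```python
from math import comb


def pairwise_t_tests(groups: int) -> int:
    """Number of distinct pairs of groups, i.e. t-tests needed to compare every pair."""
    return comb(groups, 2)


# Print the group counts used in Table 1 next to the number of pairwise tests.
for k in range(3, 9):
    print(k, pairwise_t_tests(k))
```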

An ANOVA test will help to normalize microarray data between groups in order to produce groups that are as similar as possible to the mean of the experiment, so the variance, or effect, can be measured.

There are several assumptions that come with doing an analysis of variance test. These include:

1) Independence of cases – each case must be its own experiment. This requirement is fulfilled in microarray analysis because each slide (or half slide, when using slides with duplicates) is its own group of data.
2) Normality – data need to have an approximately normal distribution. With a large, log-transformed data set, like a microarray, this condition is usually approximately satisfied.
3) Equality of variance – the groups must have equal variance. Variance is the square of the standard deviation, (standard deviation)².

Since the assumptions are met, the ANOVA procedure can now be applied to the microarray data.

 

Procedure – This procedure is based on analyzing 3 Δzms2 slides (6 arrays, as each has a duplicate). Slides 725, 726 and 728 were used. Please note that due to the massive size of the microarray data, most figures only portray the first few lines of it. The same analysis can and should be used on the entire array, but it is beyond the scope of this paper to show it in full.

1. Organize data to be used for the statistical tests. The first step in microarray analysis with ANOVA is to get a “fixed” value for each spot. A fixed value is obtained as follows:

a)    Subtract the background from the intensity in each channel. This removes the background fluorescence from the data. Discard all spots for which intensity minus background is negative, as they will skew the results in the next steps.

b)    Divide the background-corrected channel 2 intensity by the background-corrected channel 1 intensity. This is represented by (Channel 2 Intensity - Channel 2 Background)/(Channel 1 Intensity – Channel 1 Background).

c)    This creates a ratio of values instead of absolute values. From here to the end, the microarray is read as a relative measurement of under- or over-expression, and not an absolute measurement.

d)    Take the log2 of each value. This helps to visualize data because it will keep values to small numbers, where negative represents under-expression and positive represents over-expression.
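Steps a) through d) can be sketched in Python as follows. The function name fixed_value and the sample numbers are hypothetical. Treating a corrected value of exactly zero as unusable is an added assumption, made because neither the ratio nor the log2 can be computed from it.

```python
import math
from typing import Optional


def fixed_value(ch1_intensity: float, ch1_background: float,
                ch2_intensity: float, ch2_background: float) -> Optional[float]:
    """Return log2 of the background-corrected channel 2 / channel 1 ratio.

    Returns None when either corrected intensity is not positive,
    which corresponds to discarding the spot.
    """
    ch1 = ch1_intensity - ch1_background
    ch2 = ch2_intensity - ch2_background
    if ch1 <= 0 or ch2 <= 0:
        return None
    return math.log2(ch2 / ch1)


# Hypothetical spot: channel 2 is four times brighter than channel 1 after correction.
print(fixed_value(600, 100, 2100, 100))
# Hypothetical spot with channel 1 below its background: discarded.
print(fixed_value(100, 150, 500, 100))
```

A positive result indicates over-expression in channel 2 relative to channel 1, and a negative result indicates under-expression.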

 


Figure 1. Table of fixed data values for ANOVA analysis from slides 725, 726 and 728. Columns A-L contain original fluorescence values from ScanAlyze. Columns M-R contain values of data that have background fluorescence subtracted from overall intensity. Columns S-U contain “fixed” values of fluorescence.

 

2. Create a plot to help visualize the data. This plot has little analytical value, but can help determine a hypothesis for the following steps.


Figure 2. Scatter plot representing gene expression ratios (Channel 2 Intensity - Channel 2 Background)/(Channel 1 Intensity – Channel 1 Background). Blue represents slide 725, red represents slide 726, green represents slide 728.

 

3. Create a null and an alternative hypothesis. Microarrays are mostly exploratory and not true evidence of over- and under-expression. Therefore, the hypotheses will reflect this. In order to accept the alternative hypothesis, the null hypothesis must be rejected.

H0: Δzms2 expression will not differ significantly from WT expression.
HA: Δzms2 expression will differ significantly from WT expression at the 0.05 level of significance.

 

4. Now the ANOVA will be computed. ANOVA is based on the sum of squares of the data. In order to find this, first the mean of each row and the mean of each column must be taken:


Table 2. Simplified example of row mean and column mean. This table only shows the first seven spots of microarray data. Actual data from microarray slides contains 13,377 rows of data.

The above table is simplified to only show 7 rows of data. The mean is taken for each row, which represents the average of the ratio of expression for each microarray slide. The mean is also taken for each column, which represents the overall expression on each array. All values are in log2.

 

5. Next, the total sum of squares must be calculated. Total sum of squares represents the squared deviation of each observation from the grand mean, where the grand mean is the average of every observation, not limited by which slide it is. For example, the grand mean of the above table would be:

(1.68 + 2.28 + [-1.34] + 1.79 + [-0.19] + [-2.97] + 1.19 + 0.71 + [-1.497] + [-0.16] + [-0.18] + [-3.56] + [-0.16] + 0.47 + 7.10 + 4.00 + [-0.05] + [-1.48] + [-1.10]) = 5.09 (sum of each observation)

5.09 / 19 = 0.26 (sum of observations, divided by 19 observations)

0.26 is the grand mean of this example data. However, for the microarray slides 725, 726 and 728, the grand mean is -0.75048. The total of all the observations is -30,115 and there are 40,128 observations: (-30,115)/40,128 = -0.75.

 

6. Next, we must find the deviation of each observation from the grand mean and square it. Using the simplified data from above, that gives the following table:


Table 3. Simplified version of total sum of squares calculation for the first seven spots of each array. 0.26 is the grand mean, or mean of each individual observation from each array. 107.70 is the total sum of squares, which represents the sum of the square of the deviation of each observation from the grand mean.

 

For the actual data from slides 725, 726 and 728, the following table is created:


Table 4. Data for slides 725, 726 and 728. Columns 2-4 are the expression values in log2 format. Columns 6-8 are the deviation of each value from the grand mean (-0.75). Columns 9-11 are the deviations squared.

The Total Sum of Squares of this data is 135,263.78.
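The grand mean and total sum of squares from steps 5 and 6 can be sketched in Python as follows. The 3 x 3 table is made up for illustration and is not taken from the slides. Rows stand for spots and columns for arrays.

```python
from typing import List


def grand_mean(table: List[List[float]]) -> float:
    """Average of every observation, regardless of row or column."""
    values = [x for row in table for x in row]
    return sum(values) / len(values)


def total_sum_of_squares(table: List[List[float]]) -> float:
    """Sum of the squared deviation of each observation from the grand mean."""
    mean = grand_mean(table)
    return sum((x - mean) ** 2 for row in table for x in row)


# Hypothetical log2 ratios: 3 spots (rows) on 3 arrays (columns).
example = [[1.0, 2.0, 3.0],
           [3.0, 5.0, 4.0],
           [2.0, 2.0, 5.0]]
print(grand_mean(example))
print(total_sum_of_squares(example))
```

For this made-up table, the nine values sum to 27, so the grand mean is 3 and the squared deviations add up to 16.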

 

7. The next step is to find the deviation of each row mean from the grand mean. The row mean was found in step 4. Using the example data, the following table is produced:


Table 5. Summary of the deviations of the row means from the grand mean. Column 4 contains the deviation squared. The sum of the squared row deviations is 33.00.

 

8. Next the same is done for the columns. The column mean was also found in step 4.


Table 6. Summary of the deviations of the column means from the grand mean. Column 4 contains the deviation squared. The sum of the squared column deviations is 7.54.

 

9. The Error Sum of Squares is calculated next. The Error Sum of Squares is the variation that remains once the row sum of squares (found in step 7) and the column sum of squares (found in step 8) are removed from the total sum of squares (found in step 6). The Error Sum of Squares is found by the following equation:

Error Sum of Squares = Total Sum of Squares – Row Sum of Squares – Column Sum of Squares

Error Sum of Squares = 107.70 – 33.00 – 7.54

Error Sum of Squares = 67.16

Please note that the above values do not reflect the actual values from the entire microarray slides. They are based on the example values taken from the first 7 spots of each array as shown in Tables 3, 5 and 6.
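Steps 6 through 9 can be sketched together in Python as follows. This sketch uses the usual textbook decomposition for a table with one observation per cell. In that decomposition, each squared row-mean deviation is multiplied by the number of columns, and each squared column-mean deviation by the number of rows. That weighting is an assumption here, since the tables above do not show whether it was applied. The data are made up for illustration.

```python
from typing import Dict, List


def anova_sums_of_squares(table: List[List[float]]) -> Dict[str, float]:
    """Split the total sum of squares of a rows x columns table
    (one observation per cell) into row, column and error parts."""
    n_rows = len(table)
    n_cols = len(table[0])
    grand_mean = sum(sum(row) for row in table) / (n_rows * n_cols)

    row_means = [sum(row) / n_cols for row in table]
    col_means = [sum(row[j] for row in table) / n_rows for j in range(n_cols)]

    total_ss = sum((x - grand_mean) ** 2 for row in table for x in row)
    # Assumption: each squared deviation of a mean is weighted by the
    # number of observations that were averaged to produce that mean.
    row_ss = n_cols * sum((m - grand_mean) ** 2 for m in row_means)
    col_ss = n_rows * sum((m - grand_mean) ** 2 for m in col_means)
    # Error Sum of Squares = Total - Row - Column, as in step 9.
    error_ss = total_ss - row_ss - col_ss
    return {"total": total_ss, "rows": row_ss,
            "columns": col_ss, "error": error_ss}


# Hypothetical log2 ratios: 3 spots (rows) on 3 arrays (columns).
example = [[1.0, 2.0, 3.0],
           [3.0, 5.0, 4.0],
           [2.0, 2.0, 5.0]]
print(anova_sums_of_squares(example))
```

For this made-up table the total of 16 splits into 6 for rows, 6 for columns and 4 for error.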

 

10. Next, the mean squares of the rows, columns and error values must be determined. The mean squares are used together with the degrees of freedom to determine which values are significant. To get each mean square, the Sum of Squares is divided by (n-1), where n is the number of rows or columns. The error mean square incorporates both the columns and rows, so the product (# of rows – 1)(# of columns – 1) is used as the denominator.

 

Using the 7 spots as an example:

·         Row Mean Square = Row Sum of Squares / (# of rows – 1)
                = 33.00 / (7 – 1)
Row Mean Square = 5.5

·         Column Mean Square = Column Sum of Squares / (# of columns – 1)
                   = 7.54 / (3 – 1)
Column Mean Square = 3.77

·         Error Mean Square = Error Sum of Squares / [(# of rows – 1)(# of columns – 1)]
                  = 67.16 / [(7 – 1)(3 – 1)]
Error Mean Square = 5.59

Degrees of freedom are equal to the value of the denominator in each equation. In this case, there are 6 degrees of freedom in the rows, 2 degrees of freedom in the columns, and 12 degrees of freedom in the error.
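The mean squares and degrees of freedom for the 7-spot example can be reproduced with a short Python sketch. The function name mean_squares is only illustrative. The sums of squares are the example values quoted above.

```python
from typing import Dict, Tuple


def mean_squares(row_ss: float, col_ss: float, error_ss: float,
                 n_rows: int, n_cols: int) -> Dict[str, Tuple[float, int]]:
    """Return (mean square, degrees of freedom) for rows, columns and error."""
    df_rows = n_rows - 1
    df_cols = n_cols - 1
    df_error = df_rows * df_cols
    return {
        "rows": (row_ss / df_rows, df_rows),
        "columns": (col_ss / df_cols, df_cols),
        "error": (error_ss / df_error, df_error),
    }


# Sums of squares from the 7-spot, 3-array example in the text.
result = mean_squares(33.00, 7.54, 67.16, n_rows=7, n_cols=3)
for source, (ms, df) in result.items():
    print(f"{source}: mean square = {ms:.3f}, degrees of freedom = {df}")
```

To three decimals, 67.16 / 12 is 5.597.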

 

11. The final step in ANOVA analysis is to compute the F-value. The F-value is analogous to the t-statistic of a t-test in that it must be compared against a table of critical values to determine whether a result is significant. An example of a table of critical F-values is shown below:


Table 7. Summary of degrees of freedom and critical F values. The non-bold values represent .05 level of significance and bolded values represent .01 level of significance. This table and others can be found at
http://faculty.vassar.edu/lowry/apx_d.html.

 

An F-value is determined from the experimental results by finding the ratio of the row or column mean squares to the error mean square. A table of results must be constructed from the sums of squares and the means of squares taken from the preceding steps. For example:

F-value (rows) = Row Mean of Squares / Error Mean of Squares
               = 5.5 / 5.59
               = 0.98

F-value (columns) = Column Mean of Squares / Error Mean of Squares
                  = 3.77 / 5.59
                  = 0.67

The following table is constructed from the abbreviated results from the first 7 spots of each array:


Table 8. Summary of ANOVA results. F-value is the ratio of row or column mean of squares to error mean of squares.

 

When considering the F-value table (Table 7), both the degrees of freedom of the numerator and denominator must be accounted for. For example, the calculation for rows has 6 degrees of freedom in the numerator and 12 degrees of freedom in the denominator. Therefore, the critical F-value for rows is 3.00 at the 0.05 level of significance. The calculation for columns has 2 degrees of freedom in the numerator and 12 in the denominator, so the critical F-value for columns is 3.89. (These critical values are not shown in the table due to size limitations; please follow the link to see the whole table.)

The experimental F-value must be greater than the critical F-value in the table in order to reject the null hypothesis. In this case, both experimental F-values in Table 8 are lower than the critical F-values given by Table 7. Therefore, we fail to reject the null hypothesis and cannot accept the alternative hypothesis.

Based on the analysis of the first 7 spots of each array, the null hypothesis is retained:

H0: Δzms2 expression will not differ significantly from WT expression.
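The comparison in step 11 can be sketched in Python as follows. The mean squares and the 0.05-level critical values are the ones quoted in the text for the 7-spot example. The function name f_test is only illustrative.

```python
from typing import Tuple


def f_test(mean_square: float, error_mean_square: float,
           critical_f: float) -> Tuple[float, bool]:
    """Return the F-value and whether it exceeds the critical F-value."""
    f_value = mean_square / error_mean_square
    return f_value, f_value > critical_f


# Error mean square from the 7-spot example in the text.
error_ms = 5.59
# (mean square, critical F at the 0.05 level) as quoted in the text.
tests = {"rows": (5.5, 3.00), "columns": (3.77, 3.89)}

for source, (ms, critical) in tests.items():
    f_value, reject = f_test(ms, error_ms, critical)
    decision = "reject H0" if reject else "fail to reject H0"
    print(f"{source}: F = {f_value:.2f}, critical F = {critical:.2f}, {decision}")
```

Both F-values fall below their critical values, so neither test rejects the null hypothesis.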

 

Conclusion

Analysis of Variance, or ANOVA, is a helpful tool for quickly comparing microarray slides to one another. It will not give exact values for gene expression, but it will aid in discovering how much variance there is between all the spots on each array. If some spots are very highly fluorescent while others are dim, the F-value generated through these calculations will be very low and it will be difficult to accept the alternative hypothesis. An ANOVA test will help to verify that consistent data are obtained from a microarray and that any variance is actually due to gene expression and not an inherent problem with the microarray or the molecular techniques used to prepare it. Once an acceptable F-value is obtained from microarrays, a researcher can go back and determine which genes in each row are actually differentially expressed in different types of cells.