Introduction

    Upon group analysis of the microarray results for zms2 arrays (green dye) versus zms2 arrays (red dye), I thought it would be good idea to attempt the design of a generic protocol for further analysis for comparing gene expression of microarrays using the statistical analysis program R. Using said analysis will allow future groups to not only compare variations in gene expression between two sets of arrays but will also provide a way to assess the reliability of any results by use of a t-test. For this particular analysis, expression values from zms2 arrays (green dye) will be paired with zms2 arrays (red dye) in a t-test to assess variations in expression values between arrays. Theoretically there should be few differences between the arrays expression values since they are essentially the same array only with dye labeling reversed for each sample, however due to effects of dye bias and inherent variability between arrays it is not unlikely that differences in expression between arrays will be observed for at least a small set of genes.

Protocol: Statistical Analysis of Microarray Data using R

 Important Notes

When working with R it’s important to understand some basics of programming commands. For example in the command x<-read.table(“file name”). File name is the name and location of the file you wish to enter in R, such as log transformed data from a microarray that has had all flagged values removed and data organized so that it may be compared properly. The command, <-read.table, enters the file into R and the data set is being named “x”. Note that x is an arbitrary name and a name can be more specific such as “zms2data”, but since multiple commands will be needed for this analysis, short descriptive names should be used in order to stay organized and concise.

When preparing your data set for analysis remember you would preferably need 3 sets of array data for each slide type that you will be comparing, however any number can be used for comparison (e.g. 2X zms2 mutant with red dye and 2X zms2 mutant with green dye). These sets of data should be paired so that the gene name as it appears in the yeast genome data base is aligned with its value for each array and that all values are log transformed prior to being entered in R. The following is an example of the data set prior to entry in R using 2 slides for each sample.

 

Gene name       zms2 red top    zms2 red bot    zms2 green top zms2 green bot

YBL008W       -2.221447496  -1.39209994    0.964376089   3.761389067

YCR059C       0.788456367   -0.461492036  3.920897457   3.203417638

YDR151C       -0.754192936  0.34758352     0.568447415   3.144948338

YIR038C         -1.068083455  -0.403048072  3.037871329   2.669608016

           

      Its important to remember that the more samples (number of arrays of a specific experiment type) the more reliable the analytical results can be. With your data properly prepared it is easiest to save the data without column titles in a notepad document on the desktop and then continue with the following commands for data analysis. The statistical Analysis program R can be downloaded free by following the next few steps.

Go to the website http://www.r-project.org/

Click CRAN under the Download heading

Scroll to the first choice under the United States List

Choose Windows or Mac based on the computer you are using, (keep in mind this protocol has only been used in windows)

Choose base

Select the newest version of R (e.g. R-2.8.0-win32.exe) and Choose Run

Follow Installation Procedure and you should have no problem downloading R

 

Entering Data sets from Log base 2 transformed data sets (zms2 red and green dye experiments.) Note: This analysis applies to the use of two slides per array type.

 

xzms2<-read.table("C://Users/Charlie/Desktop/zms2analysis.txt") This enters data into R to initiate analysis

n<-length(xzms2 [,2]) This command creates a column n which has the same number of data points as one column of your data to make sure you have entered in all rows of data

z0<- xzms2 [,1]; z1<- xzms2 [,2]; z2<- xzms2 [,3]; z3<- xzms2 [,4]; z4<- xzms2 [,5]

zz<- cbind(z0,z1,z2,z3,z4,)

These sets of commands identified each array with a specific column in R and then bound them together so that they can be normalized and analyzed statistically

Normalizing zms2 red Data

qwz<-(1/2)*(sort(z1)+sort(z2))

qw1<- qwz[rank(z1)]; qw2<- qwz[rank(z2)]

These commands cause the data between the arrays of the first type to be normalized together (e.g. zms2 red array 1, 2, and 3 normalized together)

Normalizing zms2 green Data

qmz<-(1/2)*(sort(z3)+sort(z4))

qm1<- qmz[rank(z3)]; qm2<- qmz[rank(z4)]

These commands cause the data between the arrays of the first type to be normalized together (e.g. zms2 green array 1, 2, and 3 normalized together)

Combining Both Sets of Data and Displaying Boxplot

zz<- cbind(qw1,qw2,qm1,qm2)

boxplot(qw1,qw2,qm1,qm2, main="Quantile Normalized Box Plots")

These commands bind the two normalized data sets together and then allow for a view of the boxplot results to verify that the (Figure 1 in results)

Normalizing Data Together and Displaying Boxplot

 qnmz<- (1/4)*(sort(qw1)+sort(qw2) + sort(qm1) + sort(qm2))

 w1<- qnmz[rank(qw1)]; w2<- qnmz[rank(qw2)]; mm1<- qnmz[rank(qm1)]; mm2<- qnmz[rank(qm2)]

zzn<- cbind(w1,w2,mm1,mm2)

 boxplot(w1,w2,mm1,mm2, main="Quantile Normalized (zms2 Green and zms2 Red) Box Plots")

By performing this set of commands you hope to normalize all array sets together optimizing normality and then view data in a boxplot to confirm that they all have the same mean, quartile, and range values (Figure 2 in results)

Shapiro Test

The Shapiro test will evaluate whether the data is truly normalized for all the gene values and by these commands we will separate the group that satisfies conditions for normality for further analysis in PCA and T-test, while the other group which fails the test will be analyzed using the Wilcoxon Test.

m<-length(w1)

      stat1<- 1:m;  pv1<- 1:m

for(i in 1:m)

+ {

+ stat1[i] <- shapiro.test(zzn[i,])$ statistic

+ pv1[i] <- shapiro.test(zzn[i,])$ p.value

+ }

yn <- data.frame(z0,stat1,pv1)

yn1 <- yn[order(pv1, decreasing =F),]

yn2 <- length(which(yn1[,3] <= .05)) Removes data that fails test conditions

yn3<- yn1[1:yn2,]

yn3[1:5,] Shows first 5 rows of genes that pass Shapiro test (not necessary)

Wilcoxon Test on Data which failed Shapiro Test conditions

The Wilcoxon Test is a non-parametric test useful for analyzing data that doesn’t satisfy conditions for normality. Even though the data in this set does not satisfy conditions for normality it is important to note that comparisons can still be made among expression levels even though this test is not as statistically “strong” as the t-test.

gsn <-as.numeric(rownames(yn3))

zzn1 <- zzn[gsn,]

m<- length(zzn1[,1])

stat<- 1:m

pv<- 1:m

for(i in 1:m)

+ {stat[i] <- wilcox.test(zzn1[i,], mu=0)$ statistic

+ pv[i] <- wilcox.test(zzn1[i,], mu=0)$ p.value

+ }

 

gn<-yn3[,1]

yn4<-data.frame(gsn, gn, stat, pv)

attach(yn4)

yn5 <- yn4[order(pv, decreasing =F),] Displays Data in order of decreasing pv

wn <- length(which(yn5[,4] <= .05)) Number of genes that are significantly different using Wilcoxon (Mann-Whitney) non-parametric test. If data is significantly different, this implies that the arrays, which are comparing the same zms2 mutant, are showing different expression levels/trends for the same genes, indicating that factors such as the dye used may be altering how the expression is read.

T-test conducted on Data which passed Normality Test

By inputting these commands a t-test will be conducted on all the data that passed the Shapiro test. The null hypothesis for this test is that the genes between the arrays are significantly different. The genes with a p-value less than 0.05 at the end of this test will satisfy this condition and we will accept the null hypothesis.

ynn2 <- length(which(yn1[,3] > .05)) Verification of total number of genes (yn2+ynn2=n)

m1<- yn2+1

ynn4<- yn1[m1:n,]

length(ynn4[,2])

gsn <-as.numeric(rownames(ynn4))

zzt <- zzn[gsn,]

m<- length(zzt[,1])

stat<-1:m

pv<-1:m

for(i in 1:m)

+ {

+ stat[i] <- t.test(zzt[i,], mu=0)$ statistic

+ pv[i] <- t.test(zzt[i,], mu=0)$ p.value

+ }

gn<-ynn4[,1]

yn6<-data.frame(gsn, gn, stat, pv)

attach(yn6)

yn7 <- yn6[order(pv, decreasing =F),] Displays Data in order of increasing p-value

tn <- length(which(yn7[,4] <= .05))

tn

yn7 This will display all the data from the t-test in increasing order of p-value. One can then view this list to determine whether genes are up-regulated or down-regulated, compared between the arrays and if these results are significant based on their p-value.

Bonferroni

Normally a method of p-value correction such as the Bonferroni correction method should be used to correct alpha value inflation due to multiple t-tests performed. However, when such a large amount of data is being analyzed the correction method often fails. I only included the programs for this correction method because it theoretically can be used if one is only comparing a small set of genes in arrays (~10).

pvbf<-1:m

bfyn <- transform(yn7,pvbf =pv*m)

pvbf<-bfyn[,5]

{if (pvbf[i] >1) (pvbf[i] <-1)}

pvbf[1:5]

bfyn[1:5] Allows you to see first 5 rows of Bonferroni corrected data sets, as you will probably notice when comparing large amounts of data the p-values become so inflated that almost none of the genes will have p-values less than 0.05.

Principle Component Analysis

This principle component analysis will allow us to view the correlation between sets of genes and the correlation between arrays of the same type (e.g. arrays with zms2 dyed red).

gs<-as.numeric(rownames(bfyn))

gs1<-gs[1:50]

vpc <- zzt[gs1,]

pc.cr<-princomp(vpc,cor=T,scores=T)

summary(pc.cr) This command produces a summary of the PCA results, most importantly under which component the majority of the variation lies. Typically you want ~80 of your variation to be in the first two components. Keep in mind that this PCA deals with the array types as variables so we are evaluating the variation between types of arrays.

wpc1<-pc.cr$loadings

wpc1

biplot(pc.cr,main="Biplot with samples as variables",xlabs=as.character(1:50),ylabs=c(rep("W",2),rep("M",2)),lwd=2,cex=.7)This command displays the results of the PCA in biplot form in order to better visualize variation between the arrays. (Figure 3 in results)

p1 <- prcomp(t(vpc), scale = T); summary(p1)

wpc <- p1$rotation

biplot(p1,main="Biplot with genes as variables",ylabs=as.character(1:50),xlabs=c(rep("W",2),rep("M",2)),lwd=2,cex=.7) Here we visualize the PCA of the genes as variables. In Figure 4 in the results section we see the first 50 genes represented as arrows or vectors. The angle between each vector is proportional to the variation between them. The length of the vector is representative of the magnitude of its expression value. This same relationship is found in the PCA concerning variation between the arrays that was conducted earlier.

Note: As stated above this analysis and protocol involves comparison of duplicates of two sets of arrays. Since future analyses should use at minimum triplicates of two sets of arrays for comparison I have provided a generic protocol that can be used to conduct such an analysis in the link below.

Triplicate Analysis

********************************************************************

 

Results For Statistical Analysis Comparing Expression Difference Between Arrays

∆zms2 (green dye) vs ∆zms2 (red dye)

 

Figure 1: Boxplot of Normalized data sets for 2 arrays of zms2 comparison with red dye (lanes 1 and 2) and 2 arrays of zms2 comparison with green dye(lanes 3 and 4).

Figure 2: Boxplot of normalized array data amont all arrays. Note that all arrays have the same mean, range, and quartile values as they should after normalization.

    Shapiro revealed 9 genes which failed normality, these were discarded. A wilcoxon could be performed to assess statistical significance but non-parametric test is a very weak analysis especially for 9 genes. All other genes in data set pass conditions for normality and therefore can be analyzed using a t-test.

Table 1: T-test results after analysis of normalized array data that satisfied the Shapiro test.  Genes that satisfied conditions for significance, p-value<0.05 are listed. The corresponding stat value reveals over or under expression based on sign and magnitude. For example,  gene YML052W in the first row is underexpressed in the second set of arrays (red dye for zms2) vs the first set of arrays (zms2 with green dye).

           Gene name         stat                 pv

49  178 YML052W -8.467267975 0.003458219

122  42 YLR003C  7.642511745 0.004651809

185 199 YNL021W -6.227940336 0.008346970

81  215 YMR307W -4.867351903 0.016566961

75  210 YLL028W -4.790085788 0.017305196

250 223 YBR170C -4.498043700 0.020514374

55  189 YLR233C -4.406386213 0.021678919

239 167 YGL193C -4.178493052 0.024969026

25  229 YDR037W -4.019448544 0.027654928

186 165 YOR249C -3.933197301 0.029267959

191 166 YPR044C -3.758070278 0.032933556

179 127 YOR289W -3.703088720 0.034204720

201 219 YDR013W -3.607771802 0.036561054

99  209 YNL130C -3.572268086 0.037491597

121 149 YKL001C -3.441048385 0.041204467

38  118 YLR155C -3.419183965 0.041867952

140 173 YLR070C -3.400031679 0.042460335

84  160 YDL053C -3.303649892 0.045608507

77  110 YGL114W -3.280727909 0.046400417

61   16 YMR303C  3.234103186 0.048065693

111 193 YLL015W -3.196344331 0.049470057

 Table 2: This table contains results of a t-test coupled with the bonferroni correction to address inflation of alpha value.  Unfortunately, as one can see the pvbf value which is the p-value after the bonferroni correction leaves none of the genes being statistically significant. This is most likely due to the sheer magnitude of the data being analyzed which makes the bonferroni ineffective for correction of alpha value inflation.

            Gene name         stat                  pv                  pvbf

49  178 YML052W -8.467267975 0.003458219   0.8645547

122  42 YLR003C  7.642511745 0.004651809   1.1629522

185 199 YNL021W -6.227940336 0.008346970   2.0867426

81  215 YMR307W -4.867351903 0.016566961   4.1417402

75  210 YLL028W -4.790085788 0.017305196   4.3262991

250 223 YBR170C -4.498043700 0.020514374   5.1285934

55  189 YLR233C -4.406386213 0.021678919   5.4197298

239 167 YGL193C -4.178493052 0.024969026   6.2422565

25  229 YDR037W -4.019448544 0.027654928   6.9137319

186 165 YOR249C -3.933197301 0.029267959   7.3169896

191 166 YPR044C -3.758070278 0.032933556   8.2333889

179 127 YOR289W -3.703088720 0.034204720   8.5511800

201 219 YDR013W -3.607771802 0.036561054   9.1402636

99  209 YNL130C -3.572268086 0.037491597   9.3728993

121 149 YKL001C -3.441048385 0.041204467  10.3011168

38  118 YLR155C -3.419183965 0.041867952  10.4669880

140 173 YLR070C -3.400031679 0.042460335  10.6150837

84  160 YDL053C -3.303649892 0.045608507  11.4021267

77  110 YGL114W -3.280727909 0.046400417  11.6001042

61   16 YMR303C  3.234103186 0.048065693  12.0164233

111 193 YLL015W -3.196344331 0.049470057  12.3675141

Figure 3: A Biplot constructed from the top 50 significant genes from the t-test results. In this Biplot the vectors represent each array and the angle between the arrays is proportional to the variance between them. Each of the zms2 arrays (green dye) are represented by a W while the zms2 red dye are represented by a M. The following information is important when conducting more statistical analysis. Ultimately, from a biological aspect one should be looking at the grouping of numbers which represent each gene and note that these genes should be investigated further for any correlation in regulation or function.

 

Importance of components:

                          Comp.1    Comp.2    Comp.3    Comp.4

Standard deviation     1.5729133 0.7723975 0.7023461 0.6603452

Proportion of Variance 0.6185141 0.1491495 0.1233225 0.1090139

Cumulative Proportion  0.6185141 0.7676636 0.8909861 1.0000000

 

Loadings:

    Comp.1 Comp.2 Comp.3 Comp.4

w1  -0.479  0.751 -0.307 -0.334

w2  -0.518  0.171  0.480  0.687

mm1 -0.495 -0.493 -0.680  0.225

mm2 -0.508 -0.404  0.462 -0.604

               Comp.1 Comp.2 Comp.3 Comp.4

SS loadings      1.00   1.00   1.00   1.00

Proportion Var   0.25   0.25   0.25   0.25

Cumulative Var   0.25   0.50   0.75   1.00

                        PC1   PC2   PC3      PC4

Standard deviation     4.36 4.066 3.804 1.32e-15

Proportion of Variance 0.38 0.331 0.289 0.00e+00

Cumulative Proportion  0.38 0.711 1.000 1.00e+00

Figure 4: A Biplot constructed from the top 50 significant genes from the t-test results. In this Biplot the vectors represent each gene and the angle between the genes is proportional to the variance between them. The magnitude of each vector (i.e. length of the vector) is also proportional to their expression in the arrays. Again biologically one should be looking for specific grouping to conduct further analysis for example the cluster of genes 17, 18, 50, 40, 49, and 3. Each of these numbers represents the row of the gene in the original gene list. So one can simply match up the row number with the gene name and looking for trend in function/regulation.

 Discussion

    After conducting this analysis we can see that the normalization procedure was very effective since all but nine genes passed conditions for normality. Results of the t-test prior to bonferroni correction show that of the entire set of genes analyzed approximately 20 were found to be of statistical significance, meaning that the expression values according to the test conditions can be deemed reliable and therefore have different expressions between the arrays. I attempted the bonferroni correction on this data set in order to assess its effectiveness for correcting alpha inflation which often occurs when conducting a t-test on the massive amount of data accompanied with microarrays. As I have often found before this correction method was ineffective and therefore I recommend exploration of a program none as SAM, which is supposed to be designed for statistical analysis of data such as that obtained from microarrays. The first PCA dealing with variations between arrays showed that microarrays conducted with zms2 labeled with green dye had a large degree of variation unlike the arrays conducted with zms2 labeled with red dye. The grouping of genes in the second PCA allows for future comparison by any curious students, unfortunately, I ran out of time before being able to make any significant connections between genes of similar/different variations in expression.