Making Statistical Analysis Easy with SPSS: An Example Using the Zms1/Zms2 Genes of Saccharomyces cerevisiae
· What is SPSS?
The Statistical Package for the Social Sciences (SPSS) is a basic program used in JMU’s Introduction to Statistics class – MATH 220. Any student who intends to graduate with a Bachelor of Science in biology or biotechnology is required to take MATH 220, or an equivalent. This class covers descriptive statistics, frequency distributions, sampling, estimation, hypothesis testing, regression, correlation and an introduction to statistical analysis, all with the assistance of the SPSS program. This knowledge can then be transferred to the analysis of microarrays.
Microarrays can show which genes are up and down regulated, but statistical tests are needed to determine whether these results are reliable. It is recommended to have more than 4 samples for increased accuracy, which we did not. Generally the values of particular note are the outliers. It is important to investigate them, as an outlier could be an incorrect value due to a mechanical or human error, or an indication of up or down regulation of gene expression. Without statistical analysis it is impossible to tell these two cases apart.
· Standardization
– Standardization is commonly known as the Z-score in statistics (Agresti).
$$z = \frac{x - \mu}{\sigma}$$
x: observed score
µ: mean value
σ: population standard deviation
s: sample standard deviation, the square root of the sum of squared deviations divided by (sample size - 1)
$$s^2 = \frac{\sum (x - \mu)^2}{n - 1}$$
n: sample size
Standardizing allows for a comparison of observations that come from different normal distributions, as seen in microarray slides. Generally 68% of values lie within µ ± σ, 95% within µ ± 2σ, and 99.7% within µ ± 3σ (Figure 1) (Agresti). By standardizing log2 values, outliers are prominently displayed, as values greater than 3 or less than -3 together represent only about 0.3% of all values. Generally an outlier is an extremely high or low value that does not follow the general pattern of the rest of the data set. Outliers are typically detrimental to a data set, but in the case of microarrays they are critical, as it is these outliers that indicate a change in expression (Hayden).

Figure 1. Empirical rule displayed with standard deviation intervals. Percentages indicate the percentage of values which fall in the selected area, while the values below the x axis indicate the µ ± σ, or the mean ± number of standard deviations (depending on direction on x axis).
In the case of microarrays, an outlier suggests that a gene is likely up or down regulated, depending on the sign of its value. By choosing standardized values beyond ±3.7, we selected genes that were up or down regulated by at least 3.7 standard deviations from the mean of approximately 0, which indicates no change in expression. The fold increase or decrease can then be calculated from the log2X values that correspond to these outliers. The importance of the identified outliers then calls for further investigation through additional statistical analysis, such as scatter plots and box plots.
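The standardization and cutoff described above can also be sketched outside SPSS. The short Python example below is only an illustration: the gene names and log2 values are hypothetical, and the function names (`z_scores`, `flag_outliers`, `fold_change`) are our own, not part of SPSS or ScanAnalyze.

```python
import math

def z_scores(values):
    """Standardize values: z = (x - mean) / standard deviation."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation (divides by n - 1), as in the s formula above.
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return [(x - mean) / s for x in values]

def flag_outliers(names, values, cutoff=3.7):
    """Return (name, z) pairs whose standardized value lies beyond +/- cutoff."""
    zs = z_scores(values)
    return [(name, z) for name, z in zip(names, zs) if abs(z) > cutoff]

def fold_change(log2_ratio):
    """Convert a log2 ratio back to a fold change (log2 of 1 means 2-fold)."""
    return 2 ** log2_ratio

# Hypothetical log2 ratios: 30 genes near zero and one strongly negative gene.
log2_values = [0.2, -0.2] * 15 + [-3.0]
gene_names = ["GENE_%d" % (i + 1) for i in range(len(log2_values))]

flagged = flag_outliers(gene_names, log2_values)
for name, z in flagged:
    print(name, "z =", round(z, 2))
```

A negative log2 ratio gives a fold change below 1; for example, `fold_change(-1.0)` is 0.5, a two-fold decrease.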
· Scatter Plots

Table 1. Example of what data entry looks like in SPSS to create a scatter plot. Use the intensity data obtained from ScanAnalyze for both channels 1 and 2, and each column represents a single gene.
To create a scatterplot: Graphs > Chart Builder > Choose Scatter/Dot > Simple Scatter > Drag CH1I and CH2I from Table 1 onto the axis drop zones (it does not matter which is which).

Figure 2. Scatterplot of the intensity of CH1I (green dye) and CH2I (red dye) displaying the relationship between the two quantitative variables for further investigation.
Correlation coefficient - How To Find
SPSS – insert 2 rows – one for each CH1I and CH2I.
Analyze > Correlate > Bivariate
Insert both CH1I and CH2I as variables

Table 2. Correlations calculated by SPSS with a 2-tailed 0.01 significance level, showing a correlation coefficient (Pearson Correlation) of 0.641. The correlation coefficient indicates the strength and direction of the relationship, and ranges from -1 to 1 (-1 ≤ r ≤ 1). Generally an absolute value of 0-0.5 indicates a weak relationship, 0.5-0.8 a moderate relationship, and 0.8-1 a strong relationship (Agresti). The sign indicates whether the relationship is positive or negative. The data should also be graphed to confirm that the correlation coefficient describes the relationship accurately (Figure 2).
The scatterplot and correlation coefficient of our values indicate a moderate positive relationship between the intensities of the two dyes in a sample of 1672. This moderate relationship can be seen in the scatterplot (Figure 2), whose shape is only loosely defined, with a slight positive slope. Ideally the plot would be linear with a correlation coefficient of 1, which would indicate that the green and red dyes had corresponding and equal intensities. This scatterplot indicates that the green dye had a greater intensity than the red, which may have affected the results. Although the scatterplot does little to highlight outliers beyond their visible position, it can act as an indicator of which samples have a higher likelihood of outliers.
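For readers who want to see what the Pearson correlation coefficient reported by SPSS measures, the following is a minimal Python sketch. The intensity values are hypothetical (our 1672 values are not reproduced here), and the helper names `pearson_r` and `describe_strength` are our own. The strength bands follow the ranges quoted above from Agresti.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def describe_strength(r):
    """Label r using the weak / moderate / strong bands given in the text."""
    size = abs(r)
    if size >= 0.8:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    else:
        strength = "weak"
    direction = "positive" if r >= 0 else "negative"
    return strength + " " + direction

# Hypothetical channel intensities that are perfectly proportional.
ch1_intensity = [100.0, 200.0, 300.0, 400.0, 500.0]
ch2_intensity = [210.0, 410.0, 610.0, 810.0, 1010.0]

r = pearson_r(ch1_intensity, ch2_intensity)
print("r =", round(r, 3), "->", describe_strength(r))
```

Applied to the value in Table 2, `describe_strength(0.641)` returns "moderate positive", which matches the interpretation given above.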
· Boxplots

Figure 3. Visual description of the parts of a boxplot. Median: the middle value of the sample, drawn as the line inside the box; for symmetric data it coincides with the mean. Q1: the lower quartile (25%) – falls within one standard deviation of the sample mean. Q3: the upper quartile (75%) – falls within one standard deviation of the sample mean. **: Largest non-outlier value. Interquartile Range (IQR) = Q3 - Q1: the difference between the third and first quartiles. Mild outliers: values more than 1.5 × IQR below Q1 or above Q3. Extreme outliers: values more than 3 × IQR below Q1 or above Q3. (Agresti)
A box plot displays the above information visually. Skew is shown by the relative lengths of the whiskers. A box plot can also show whether outliers are skewing the mean and quartile values.

Figure 4. Boxplots of standardized ZMS1/2 Red top and bottom values for grids 11 & 14 in groups 5 & 9. Because all values are standardized, the medians and quartiles are approximately the same. Q1 is approximately -1 and Q3 is approximately 1 on all plots. The IQR is therefore approximately 2, meaning that values more than 3 (1.5 × IQR) beyond the quartiles, roughly beyond ±4, are mild outliers, and values more than 6 (3 × IQR) beyond the quartiles, roughly beyond ±7, are extreme outliers.
Figure 4 indicates that 50% of the values lie between -1 and 1, meaning that there is little or no change in expression for 50% of the genes. The box plots also show that the majority of the outliers were present on group 5’s top and bottom slides.
Boxplots can also be used to display skew, which can be useful in determining whether a gene is actually up or down regulated. Because we only had 4 samples for the genes with the zms1/2 knockout, inaccuracies will be seen in our data, but statistical relevance increases with more samples. In the analysis of up and down regulation in our group’s data, we used the criterion of a standardized value beyond ±3.7, with the general trend seen in at least 3 out of 4 slides. From this the log2X values for that gene can be examined, and the fold increase can be determined. The boxplots then show where the median lies and whether there is skew.
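The outlier limits of a boxplot are measured outward from the quartiles, which is easy to get wrong by hand. As a rough sketch, the Python example below computes the limits from the approximate quartiles read off Figure 4 (Q1 of about -1, Q3 of about 1). The function names `outlier_fences` and `classify` are our own and are not SPSS terms.

```python
def outlier_fences(q1, q3):
    """Return the IQR and the mild / extreme outlier limits for given quartiles."""
    iqr = q3 - q1
    return {
        "iqr": iqr,
        # Mild outliers lie more than 1.5 x IQR below Q1 or above Q3.
        "mild": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        # Extreme outliers lie more than 3 x IQR below Q1 or above Q3.
        "extreme": (q1 - 3.0 * iqr, q3 + 3.0 * iqr),
    }

def classify(value, q1, q3):
    """Classify a single value against the limits from outlier_fences."""
    fences = outlier_fences(q1, q3)
    low_extreme, high_extreme = fences["extreme"]
    low_mild, high_mild = fences["mild"]
    if value < low_extreme or value > high_extreme:
        return "extreme outlier"
    if value < low_mild or value > high_mild:
        return "mild outlier"
    return "not an outlier"

# Approximate quartiles of the standardized values in Figure 4.
fences = outlier_fences(-1.0, 1.0)
print(fences)
```

With these quartiles the IQR is 2, so the mild limits fall at -4 and 4 and the extreme limits at -7 and 7. Under this rule a standardized value of -3.7 would not yet be drawn as an outlier on such a plot, which is one reason the ±3.7 cutoff and the boxplot are best read together.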
To create a boxplot of individual genes in SPSS:
Enter each gene individually

Table 3. Log2X values of genes up or down regulated by ±3.7 standard deviations, as entered in SPSS for use in the boxplot and t-test. YER091C is highlighted as a gene of interest on which to perform statistical tests. This gene was down regulated in 3 out of 4 samples, with one value being at least 3.7 standard deviations below the population mean.
Graphs > Chart Builder > Boxplot > Simple 2-D Boxplot
Categorical variable to x-axis – left blank
Scale variable to y-axis – the gene intensity ratios

Figure 5. Box plot of the regulation of expression of gene YER091C. YER091C was selected as it was down regulated in 3 out of 4 samples, with one value being at least 3.7 standard deviations below the population mean.
For example, YER091C was determined to be down regulated, as seen across 3 out of 4 values, with scores of -3.2801, -3.7655, 1.2265 and -2.3219. The box plot visually shows the down regulation, with the majority of the values being negative. A skew is also seen, with the median lying closer to Q1 (refer to Figure 3). Since the median is closer to Q1, more values lie towards this end, and the value at the upper whisker may be an error. This provides further evidence that YER091C truly is down regulated. The value of this statistical analysis increases as more data are provided.
· Paired t-test
Done with the individual genes
Analyze > Descriptive Statistics > Explore

Table 4. Descriptive statistics of gene YER091C, showing a 95% confidence interval for the mean of (-5.624618, 1.554087).
This provides the confidence interval for each gene subset. The shape of the distribution should also be checked for skew before this is done. We are looking for reasonable proof that there is down regulation below -5.624618, meaning that the alternative hypothesis is Ha: µ < -5.624618.
To look specifically at the One-Sample T Test
Analyze > Compare Means > One-Sample T Test
With a test value of 3.7

Table 5. Results of the one-sample t-test using a test value of 3.7.
The sample average is 1.90, which does not support our Ha, and our p value is .104/2 = .052 > 0.05. This indicates that although our gene appeared to be down regulated, statistically it may not be. To show this properly more samples would be needed, as there was not enough data to rely entirely on a t-test.
These steps could be continued with the rest of the samples in order to determine whether the change in expression is statistically significant.
Literature Cited:
Agresti, Alan, and Franklin, Christine. Statistics: The Art and Science of Learning from Data. Pearson Prentice Hall, 2007.
Hayden, Robert. "A Dataset that is 44% Outliers." Journal of Statistics Education 13 (2005).