Analysis of Variance (ANOVA) Designs

ANOVA - Comparing More Than Two Group Means

As we've seen, t tests compare the means of two independent groups or two related groups of scores. Many research designs involve more than just two groups. One approach for comparing more than two means is to test each pair of means with a t test. Not only does this approach involve many tests, but it also compounds the Type I error rate. For example, if we have three groups, Morning, Afternoon, and Evening, whose scores we want to compare, we could compare 1) Morning with Afternoon, 2) Morning with Evening, and 3) Afternoon with Evening. This approach would involve three separate tests, and it would inflate the overall error rate: with α = .05 for each test, the chance of at least one Type I error across the three tests rises to roughly 14%, nearly three times that of a single test. Instead of this approach, statisticians use an overall test, called an ANOVA.
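The inflation of the error rate can be checked with a short calculation. This is a minimal sketch that assumes the three tests are independent, which pairwise t tests on the same data are not exactly, so the result is an approximation; the function name is only illustrative.

```python
from math import comb

def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one Type I error across n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests

n_groups = 3
n_pairs = comb(n_groups, 2)                  # number of pairwise comparisons
rate = familywise_error_rate(0.05, n_pairs)
print(n_pairs, round(rate, 4))               # 3 0.1426
```

With more groups the number of pairs grows quickly (for example, 5 groups give 10 pairs), which is why a single omnibus test is preferred.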

ANOVA stands for ANalysis Of VAriance and is described as an omnibus test because it tests all differences between the separate groups at once. The variances being analyzed are the deviations from the mean scores, which we used to calculate the standard deviation (itself a measure of variation, of course). Review the calculation for the variance and standard deviation before proceeding. Understanding that calculation will help you understand the following technique.

There are many types of ANOVAs, which depend on the specifics of the research design. The first type we will consider is called simple or oneway ANOVA. A oneway ANOVA is used to compare the means of three or more groups. Think of it as an extension of an independent t test, or think of the independent t test as a special case of an ANOVA for two groups.

The heart of an ANOVA test is something called the sum of squares. This is where recalling the calculation of the variance and standard deviation is helpful. Remember that we determined deviations from the mean for each score in a distribution. This process resulted in positive and negative deviations, which, due to the property of the mean, all sum to 0 always. In order to calculate an average deviation, we squared and then summed the deviations. This is the sum of squares (SS) - think of it as the sum of squared deviations to remember where the squares originated.

In this section, we will be considering three different types of sums of squares.

The total sum of squares represents the total amount of variation in the combined distribution of all raw scores.
The between groups sum of squares represents the variation that is related to group membership, that is, the variance that is explained by the characteristics that define the separate groups.
The within groups sum of squares represents the variation within the separate groups, which is unexplained variance and is also called error.

Explained variance plus unexplained variance equals total variance, or stated another way, the between groups sum of squares plus the within groups sum of squares equals the total sum of squares. In a t test, the between groups sum of squares is represented by the difference between the two group means and the within groups sum of squares is represented by the variances of the two groups.

The test statistic for an ANOVA is called the F ratio, which compares the between groups sum of squares with the within groups sum of squares. The between groups sum of squares is divided by the degrees of freedom associated with the number of group means (e.g., if there are 3 groups, then there are 2 degrees of freedom) to obtain the mean sum of squares between groups. Likewise, the within groups sum of squares is divided by the degrees of freedom associated with the group sample size (n). For example, when comparing three groups of 10 scores, there are 27 degrees of freedom, which is 9 degrees of freedom from each group. Dividing the within groups sum of squares by its associated degrees of freedom gives the mean sum of squares within groups.

The F ratio is the mean between groups sum of squares divided by the mean within groups sum of squares. This test statistic is used to determine statistical significance of the result. SPSS and Excel report the associated p-value directly; Appendix D in the text lists some example F values for specific α levels and degrees of freedom. Notice that there are two degrees of freedom parameters in the table - one for the numerator of the F ratio and another for the denominator.

An effect size for an ANOVA is called eta squared, η², and is calculated by dividing the between groups sum of squares by the total sum of squares to obtain the percentage of variance that is explained by group membership. Similar to the coefficient of determination, the amount of unexplained variance is 1 - the amount of explained variance.

The assumptions for an ANOVA are extensions of the assumptions for a t test, namely 1) independent observations, 2) normally distributed data, and 3) homogeneity of variance among all groups. There is a form of Levene's test to assess whether the variances are similar.

Let's step through an example. Suppose that you are teaching three sections of the same course, designated Morning, Afternoon, and Evening. For the sake of this example, each section has 10 students. You collected data from your students using an instrument that measures their level of activity in the class. Here are the data:

Section	Morning	Afternoon	Evening
	7	4	8
	6	5	7
	8	6	8
	6	5	8
	7	4	6
	6	5	7
	8	6	6
	6	5	9
	5	5	7
	7	4	8
Sum	66	49	74
Mean	6.60	4.90	7.40
SD	0.97	0.74	0.97
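The summary rows of the table can be reproduced with Python's standard library. This is a small sketch; the dictionary layout is just one convenient way to hold the three sections, and the SD is the sample standard deviation (dividing by n - 1).

```python
from statistics import mean, stdev

scores = {
    "Morning":   [7, 6, 8, 6, 7, 6, 8, 6, 5, 7],
    "Afternoon": [4, 5, 6, 5, 4, 5, 6, 5, 5, 4],
    "Evening":   [8, 7, 8, 8, 6, 7, 6, 9, 7, 8],
}

# (sum, mean, sample SD) for each section
summary = {
    section: (sum(values), mean(values), stdev(values))
    for section, values in scores.items()
}

for section, (total, avg, sd) in summary.items():
    print(f"{section:<10} sum={total} mean={avg:.2f} SD={sd:.2f}")
```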

1. Set up the hypothesis
The null hypothesis is that the population means are equal. Symbolically, H₀: μ_Morning = μ_Afternoon = μ_Evening
The alternative (research) hypothesis is never directional for an ANOVA. In this case, the alternative hypothesis is that the three population means are not all equal, that is, at least one mean differs from the others.
Symbolically, H₁: μ_Morning, μ_Afternoon, and μ_Evening are not all equal
As we will see, this form of the alternative hypothesis will not identify the source of specific differences if they are found to exist.

2. Establish the alpha level at α = .05. The test is always non-directional for an ANOVA, based on the alternative hypothesis.

3. Select the appropriate test statistic (see the decision tree inside the back cover or http://www.benbaab.com/salkind/dt1.html).
The appropriate test to use is a oneway ANOVA. The test statistic will be F.

4. Check the test's assumptions and then compute the test statistic based on the sample data (obtained value).
Independence is a result of the data collection process.
Normality is checked by inspecting the histograms and skewness ratios.
Homogeneity of variances is checked with Levene's test (see the SPSS output below).
The test statistic is calculated and reported by SPSS or Excel. This is how the F ratio is determined:

First, a grand mean is calculated. All 30 scores shown above are added and the sum is divided by 30 to obtain a grand mean of 6.30.

The between groups sum of squares is the sum of the squared differences of each group mean from the grand mean multiplied by the sample size. [Note: ^2 represents squaring and * represents multiplication]

Between groups sum of squares = 10 * [(6.60-6.30)^2 + (4.90-6.30)^2 + (7.40-6.30)^2] , which equals 10 * [0.09 + 1.96 + 1.21] or 32.60

The within groups sum of squares is the sum of each score's squared deviation from its group mean. In other words, we subtract 6.60, 4.90, or 7.40 from each score listed above, square that result, and then add up all of the squares. Here are the squared deviations from the group means:

Section	Morning	Afternoon	Evening
	0.16	0.81	0.36
	0.36	0.01	0.16
	1.96	1.21	0.36
	0.36	0.01	0.36
	0.16	0.81	1.96
	0.36	0.01	0.16
	1.96	1.21	1.96
	0.36	0.01	2.56
	2.56	0.01	0.16
	0.16	0.81	0.36
Sum	8.40	4.90	8.40

Within groups sum of squares = 8.40 + 4.90 + 8.40, which equals 21.70
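The three sums of squares can be recomputed from the raw scores to confirm that the between groups and within groups values add up to the total. This is a minimal sketch; the helper name group_mean is only illustrative.

```python
scores = {
    "Morning":   [7, 6, 8, 6, 7, 6, 8, 6, 5, 7],
    "Afternoon": [4, 5, 6, 5, 4, 5, 6, 5, 5, 4],
    "Evening":   [8, 7, 8, 8, 6, 7, 6, 9, 7, 8],
}

def group_mean(values):
    return sum(values) / len(values)

all_scores = [x for values in scores.values() for x in values]
grand_mean = sum(all_scores) / len(all_scores)

# Between groups: squared distance of each group mean from the grand mean,
# weighted by the group's sample size.
ss_between = sum(len(v) * (group_mean(v) - grand_mean) ** 2 for v in scores.values())

# Within groups: squared distance of each score from its own group mean.
ss_within = sum((x - group_mean(v)) ** 2 for v in scores.values() for x in v)

# Total: squared distance of each score from the grand mean.
ss_total = sum((x - grand_mean) ** 2 for x in all_scores)

print(round(grand_mean, 2), round(ss_between, 2), round(ss_within, 2), round(ss_total, 2))
# 6.3 32.6 21.7 54.3
```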

Next, the two mean sums of squares are calculated by dividing by the degrees of freedom. Because there are three groups, the between groups degrees of freedom is 3-1 or 2. Because each group has 10 observations, the within groups degrees of freedom is 3 * (10 -1) or 3*9 which is 27.

The mean between groups sum of squares is 32.60/2 or 16.30

The mean within groups sum of squares is 21.70/27 or .803704

The F ratio is 16.30/.803704 or 20.28
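The same arithmetic, from sums of squares to the F ratio, can be written out as a short script. The variable names are illustrative; the numbers are the ones computed above.

```python
ss_between, ss_within = 32.60, 21.70
n_groups, n_per_group = 3, 10

df_between = n_groups - 1                    # 3 - 1 = 2
df_within = n_groups * (n_per_group - 1)     # 3 * 9 = 27

ms_between = ss_between / df_between         # mean sum of squares between groups
ms_within = ss_within / df_within            # mean sum of squares within groups
f_ratio = ms_between / ms_within

print(df_between, df_within, round(ms_between, 2), round(ms_within, 6), round(f_ratio, 2))
# 2 27 16.3 0.803704 20.28
```

Note that the division uses the unrounded mean within groups sum of squares; dividing by the rounded value .80 would give a slightly different result.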

5. Skip to #8. Determine the critical value for the test statistic.

6. Skip to #8. Compare the obtained value with the critical value.

7. Skip to #8. Either reject or retain the null hypothesis based on the following:
        If obtained value > critical value, then reject the null hypothesis - evidence supports the research hypothesis.
        If obtained value <= critical value, then retain the null hypothesis - evidence does not support the research hypothesis.
8. Alternative to #5-7 - for use with SPSS output:
    Compare the reported p-value (Sig.) with the preset alpha level.
    If p-value < alpha level, then reject the null hypothesis - evidence supports the research hypothesis. There is a small chance of committing a Type I error.
    If p-value >= alpha level, then retain the null hypothesis - evidence does not support the research hypothesis. The chance of committing a Type I error is too large.
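The decision rule in #8 can be sketched in code. SPSS and Excel compute the p-value for you; the sketch below relies on a closed form for the upper tail of the F distribution that holds only when the numerator degrees of freedom equal 2, as in this three-group example. The function name is an assumption made for illustration, and a statistics library (for example, the F distribution in SciPy) would be the usual choice for other designs.

```python
def f_pvalue_two_numerator_df(f_ratio: float, df_within: int) -> float:
    """Upper-tail p-value of F(2, df_within); valid ONLY for numerator df = 2."""
    return (1 + 2 * f_ratio / df_within) ** (-df_within / 2)

alpha = 0.05
p_value = f_pvalue_two_numerator_df(20.28, 27)

# Step 8: compare the p-value with the preset alpha level.
decision = "reject" if p_value < alpha else "retain"
print(f"p = {p_value:.6f}: {decision} the null hypothesis")
```

For these data the p-value is far below .05, so the null hypothesis is rejected.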

In the first table of the SPSS output below, find the group means and the grand mean described above. The second table contains the results of Levene's test, which tests the equality of the group variances. The third table contains the ANOVA results, which include the three sums of squares, the two mean sums of squares, the F ratio, and the observed p-value. Because the p-value is less than .05, we reject the null hypothesis and conclude that the three group means are not all equal - but how are they different? For this we need to conduct another test, called pairwise comparisons. There are numerous versions of this test. We'll use the one that the author recommends, the Bonferroni test. But before that, look over the Excel results, generated using the ANOVA: Single Factor option on the Data Analysis dialog box, which is accessed from the Data Analysis command on the Tools menu (see the earlier note about installing these tools).

Here are the Excel ANOVA results:

Now for the pairwise comparisons. The Bonferroni test identifies the specific source of the differences found by the overall ANOVA test.

This test reports that the mean score of the Afternoon section (4.9) is different from both the Morning (6.6) and Evening (7.4) sections, but that the difference between Morning and Evening sections' scores is not statistically significant. Here is a picture of the pattern of mean scores.
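The logic of the Bonferroni comparisons can be approximated by hand. The sketch below assumes each pair is compared with a t statistic built from the pooled mean within groups sum of squares (27 degrees of freedom) and that each two-tailed p-value is multiplied by the number of comparisons. The p-value is obtained by numerically integrating the t density, so the figures are approximate and may differ slightly from the SPSS output.

```python
from itertools import combinations
from math import gamma, pi, sqrt

means = {"Morning": 6.60, "Afternoon": 4.90, "Evening": 7.40}
n_per_group = 10
ms_within, df_within = 21.70 / 27, 27

def t_density(t, df):
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Approximate two-tailed p-value using Simpson's rule on [0, |t|]."""
    upper = abs(t)
    h = upper / steps                      # steps must be even for Simpson's rule
    area = t_density(0, df) + t_density(upper, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_density(i * h, df)
    area *= h / 3
    return max(0.0, 1 - 2 * area)

se_diff = sqrt(ms_within * 2 / n_per_group)   # standard error of a mean difference
pairs = list(combinations(means, 2))
adjusted = {}
for a, b in pairs:
    t = (means[a] - means[b]) / se_diff
    # Bonferroni adjustment: multiply by the number of comparisons, cap at 1.
    adjusted[(a, b)] = min(1.0, two_tailed_p(t, df_within) * len(pairs))
    print(f"{a} vs {b}: t = {t:.2f}, adjusted p = {adjusted[(a, b)]:.4f}")
```

Under these assumptions, the two comparisons involving the Afternoon section fall below .05 and the Morning versus Evening comparison does not, which matches the pattern described above.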

The effect size is computed from the ANOVA table as the percentage of explained variance, denoted by η², which is 32.6/54.3 or about 60.0%, representing a large effect. The determination of the relative strength of an effect depends on the field of study - general guidelines should be weighed against findings of other researchers.
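The effect size calculation is a single division, shown here with the sums of squares from the ANOVA table; the variable names are illustrative.

```python
ss_between, ss_within = 32.60, 21.70
ss_total = ss_between + ss_within          # 54.30

eta_squared = ss_between / ss_total        # proportion of explained variance
unexplained = 1 - eta_squared              # proportion left unexplained

print(f"{eta_squared:.3f} {unexplained:.3f}")   # 0.600 0.400
```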

For practice understanding the oneway ANOVA and explained and unexplained variance, visit http://www.kingsborough.edu/academicDepartments/math/faculty/rsturm/anova/Anova0126.html or http://www.ruf.rice.edu/~lane/stat_sim/one_way/index.html

Factorial ANOVA - Comparing Group Means Based on Combinations of Independent Variables

We'll take the application of ANOVA just one step further. Suppose that you not only want to compare sections of a course but you also want to compare the level of activity of majors and non-majors in the course. Now, instead of just three groups, you have six groups - three sections and two types of students within each section. Again to make the example more straightforward, we'll assume equal numbers of majors and non-majors within the sections. That is, there are exactly five majors and five non-majors in each of the three sections. This uniformity is not a requirement, but it does make the results easier to understand.

Before conducting the comparison of the six group means, let's introduce a few new terms. First of all, this type of ANOVA is called a factorial ANOVA, and in particular for the example just described, a two factor or two-way ANOVA. A factor is an independent variable that is used to categorize the observations.

When we compare the means, there are two types of effects that we can observe. They are called main effects and interaction effects. Main effects are due to the factors themselves. In this example, there is a main effect due to Section and another main effect due to Type of Student. Interaction effects are due to the combination of the two factors. For example, if we found that majors are more actively learning in the morning section and non-majors are more actively learning in the afternoon section, there would be an interaction effect. Interactions are designated by a combination of factors, such as Section * Type.

In the ANOVA table, the result of the factorial structure is a separation of the between groups sum of squares. Because there are more categories of students, we have more ways to determine where the differences might arise. In the following example, which uses the same data that we used earlier, you'll notice that the overall sum of squares is the same.
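The separation of the between groups sum of squares into two main effects and an interaction can be sketched for a balanced design. The scores below are HYPOTHETICAL, invented only for illustration (they are not the course data), and the function and key names are assumptions of this sketch.

```python
from statistics import mean

# HYPOTHETICAL data: 2 sections x 2 types of student, 3 scores per cell.
cells = {
    ("Morning", "Major"):     [7, 8, 9],
    ("Morning", "Non-major"): [5, 6, 7],
    ("Evening", "Major"):     [8, 9, 10],
    ("Evening", "Non-major"): [6, 6, 9],
}

def two_way_sums_of_squares(cells):
    """Sums of squares for a balanced two-factor design (equal cell sizes)."""
    n = len(next(iter(cells.values())))            # scores per cell
    sections = sorted({s for s, _ in cells})
    types = sorted({t for _, t in cells})
    all_scores = [x for v in cells.values() for x in v]
    grand = mean(all_scores)

    def level_mean(position, level):
        # Mean of every score whose cell key has `level` at `position`.
        return mean([x for key, v in cells.items() if key[position] == level for x in v])

    ss_section = n * len(types) * sum((level_mean(0, s) - grand) ** 2 for s in sections)
    ss_type = n * len(sections) * sum((level_mean(1, t) - grand) ** 2 for t in types)
    ss_cells = n * sum((mean(v) - grand) ** 2 for v in cells.values())
    return {
        "Section": ss_section,
        "Type": ss_type,
        "Section * Type": ss_cells - ss_section - ss_type,   # interaction
        "Error": sum((x - mean(v)) ** 2 for v in cells.values() for x in v),
        "Total": sum((x - grand) ** 2 for x in all_scores),
    }

ss = two_way_sums_of_squares(cells)
for source, value in ss.items():
    print(f"{source:<15} {value:.2f}")
```

The Section, Type, and Section * Type rows together make up what a oneway ANOVA on the four cells would report as the between groups sum of squares, and the four components add up to the total.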

The various F ratios and effect sizes are calculated in the same manner as they were in the oneway ANOVA. The assumptions are the same as the oneway ANOVA as well - there are just more subgroups to check.

Here are the data seen earlier, but now divided by Type of Student as well.

Here is the Excel output from the ANOVA: Two-Factor with Replication command. Note when using this command, the column and row containing the headings are required in the range. The values in the Total column were reported in the previous oneway ANOVA.

Here is the ANOVA table with the sum of squares for the two main effects and for the interaction.

Notice that the differences due to the Section are statistically significant, which is what we found in the oneway ANOVA, and that the differences due to Type of Student are statistically significant, but the interaction effect is not statistically significant. Try computing the effect sizes for the statistically significant effects. What percentage of variance is left unexplained?

Here is the same analysis from SPSS. First, here are the descriptive statistics about the groups as well as Levene's test for the equality of variances.

Then here are the results of the ANOVA. Notice that SPSS includes additional information in the table.

For the purposes of our example, we just need to focus on the rows labeled Section, Type, Section * Type, Error, and Corrected Total. Compare these results to the Excel output displayed above.

Here is a picture that illustrates the pattern of means.

When the two lines are roughly parallel, there is no interaction effect; lines that cross or clearly converge suggest one. In this graph, we can see that no matter which section they were in, the mean scores of Majors exceeded those of Non-Majors. This pattern represents a main effect. Also, by estimating the midpoints between the two mean scores for each section, we can see that Morning and Evening mean scores are higher than the Afternoon mean score.

For practice understanding the two-way ANOVAs and main effects and interactions, visit http://www.kingsborough.edu/academicDepartments/math/faculty/rsturm/anova/Anova0126.html and choose the Two-Way model.

Inferential Correlation and Regression

Correlation Revisited

In a previous module, correlation was introduced as a descriptive statistic that describes the nature of a relationship between two variables. There are various types of correlations depending on the specific characteristics of the two variables being compared. One commonly used correlation is the Pearson product-moment correlation coefficient, which measures the extent of a linear relationship between two interval or ratio-level variables. The Pearson correlation, represented by r, ranges from -1 to +1. The magnitude of r indicates the degree to which the pattern of paired points represents a line. The sign of r (- or +) indicates the direction of the slope of the line that represents the relationship - a positive r indicates a direct relationship and a negative r indicates an indirect relationship. When r = 0, there is no linear relationship between the two variables. The characteristics of relationships that are represented by a correlation must be checked by inspecting a scatterplot of pairs of points.

For example, here is a scatterplot of 2006 and 2007 API scores for San Francisco elementary schools.

Here is a correlation matrix of various characteristics of the San Francisco elementary schools. Meals represents the percentage of students who are eligible for free or reduced-price lunch. P_EL represents the percentage of English language learners in the schools. Not-HSG through Grad_Sch represent percentages of parent education levels. Each cell in the matrix includes the Pearson r, significance level (p-value), and N, which represents the number of schools.

Inferential Correlation

As shown in the matrix above, correlation can be used in an inferential test. The second number in each cell of the matrix is the level of statistical significance (p-value) associated with the inferential test of the correlation value.

The assumptions for conducting an inferential test of correlation rely on the concept of a conditional distribution. A conditional distribution can be thought of as either horizontal or vertical bands of a scatterplot. The conditional distribution of Y given X is the distribution of Y values for any given X value. The actual assumptions for the test are the following:

Normality - both variables must be normally distributed, and each conditional distribution must be normally distributed as well.
Homoscedasticity - the standard deviations for each conditional distribution are equal.
Linearity - the means of each conditional distribution lie on a straight line.

These assumptions can be checked by inspecting the scatterplot, identifying outliers, and analyzing skewness ratios.

The null hypothesis for the inferential test is:

H₀: ρ = 0,

that is, the correlation between the two variables in the population is equal to 0, or stated another way, the two variables are not correlated.

The alternative hypothesis is one of the following:

H₁: ρ ≠ 0 (the non-directional alternative hypothesis)
H₁: ρ < 0 (the correlation is indirect)
H₁: ρ > 0 (the correlation is direct)

The distribution associated with the inferential test is Student's t distribution, which is based on a function of r and N, namely

t = r √(N - 2) / √(1 - r²)

The associated degrees of freedom for the test are N - 2. If the p-value obtained is less than the preset value of α (usually .05), then the null hypothesis of no correlation between the variables is rejected in favor of the alternative hypothesis.
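The standard form of this test statistic is t = r √(N - 2) / √(1 - r²) with N - 2 degrees of freedom. The sketch below computes it; the function name is illustrative, and the values r = .5 and N = 27 are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
from math import sqrt

def correlation_t(r: float, n: int):
    """Return the t statistic and degrees of freedom for testing rho = 0."""
    df = n - 2
    t = r * sqrt(df) / sqrt(1 - r ** 2)
    return t, df

# Hypothetical example values, not taken from the correlation matrix.
t, df = correlation_t(0.5, 27)
print(round(t, 3), df)   # 2.887 25
```

Because N appears in the numerator, the same r produces a larger t in a larger sample, which is why even weak correlations can reach statistical significance when N is large.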

In the matrix shown above, the correlation between api06 and api07 of r = .938 is statistically significant at the .05 level, and the correlation between api07 and some_col (percentage of parents who have some college education but not a degree) of r = -.220 is not statistically significant. For correlations that are found to be statistically significant, an effect size of r² can be computed to report the amount of shared variance between the two variables. This effect size measure is called the coefficient of determination. In the example of api06 and api07, .938² = .8798, which indicates that 88.0% of the variance is shared between the two sets of API scores. Caution must be applied when statistically significant correlations are found for large samples. A statistically significant correlation does not necessarily indicate a meaningful result.

Regression Revisited

When a statistically and practically significant correlation is found between two variables, it may be appropriate and worthwhile to consider predicting one of the variables based on the other variable. The variable used in the prediction is called the predictor or independent variable. The variable being predicted is called the criterion or dependent variable. The process used in the prediction is called simple linear regression and uses the following linear equation:

Y' = bX + a

where Y' is the predicted variable, b is the slope of the line, X is the independent variable, and a is the y-intercept (the point at which the line crosses the vertical axis). The equation parameters, a and b, can be derived as detailed in the text, or can be determined using the correlation between X and Y and the fact that the line passes through the point of means (the mean of X, the mean of Y). The magnitude (absolute value) of the correlation coefficient determines the accuracy of the prediction. This accuracy is based on the difference between the observed Y value and the predicted Y' value (Y - Y'), which is called a deviation, discrepancy, or residual. An overall measure of the accuracy of the prediction is also given by the standard error of estimate, which is the square root of the sum of the squared deviations divided by the sample size (regression software such as SPSS divides by N - 2 instead).
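One way to obtain a and b from the correlation is b = r * (SD of Y / SD of X) and a = (mean of Y) - b * (mean of X), which forces the line through the point of means. The sketch below applies this to a small HYPOTHETICAL data set invented for illustration. Its standard error of estimate divides by N - 2, an assumption of this sketch that follows common regression output rather than dividing by N.

```python
from math import sqrt
from statistics import mean, stdev

def fit_line(xs, ys):
    """Return intercept a, slope b, and Pearson r for Y' = bX + a."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    cross = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    r = cross / ((len(xs) - 1) * sx * sy)
    b = r * sy / sx
    a = my - b * mx            # line passes through (mean of X, mean of Y)
    return a, b, r

def standard_error_of_estimate(xs, ys, a, b):
    residuals = [y - (b * x + a) for x, y in zip(xs, ys)]   # Y - Y'
    return sqrt(sum(e ** 2 for e in residuals) / (len(xs) - 2))

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b, r = fit_line(xs, ys)
see = standard_error_of_estimate(xs, ys, a, b)
print(round(a, 2), round(b, 2), round(r, 3), round(see, 3))   # 2.2 0.6 0.775 0.894
```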

Explanation of SPSS's Regression Output

Here is an example of a regression analysis in which 2007 API scores for San Francisco elementary schools are predicted using the 2006 API scores.

Note that the Model Summary table displays R, R², and the standard error of estimate, which was mentioned earlier. R is the correlation between Y' and Y, which, for simple linear regression with one predictor variable, is equal to the absolute value of r. The adjusted R² corrects for the effects of small sample sizes, if any.

The ANOVA table shown below reports the sums of squares for the predicted variable and the error (residuals/deviations/discrepancies). It is important to note that the decimal places in the sum of squares column are not aligned - use caution when analyzing these values. The interpretation of this table is the following. First of all, the large value of F and small significance (p-value) indicate that the fit of the regression model to the data is good - the null hypothesis of no relationship between the data and the regression model is rejected. The second important source of information is the sum of squares column. The amount of explained variance is given by the ratio of the Regression sum of squares to the Total sum of squares, or 439291.6 divided by 499042.2, which equals .88 or 88%. This coincides with the value of R² listed in the previous table. Both of these measures are effect sizes for the regression model. Consequently, the remaining unexplained variance, which is given by the ratio of the Residual sum of squares to the Total sum of squares, is 12%.
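The explained and unexplained proportions can be verified from the two sums of squares quoted above. In this sketch the Residual sum of squares is derived by subtraction rather than read from the table.

```python
ss_regression = 439291.6
ss_total = 499042.2
ss_residual = ss_total - ss_regression     # derived by subtraction

r_squared = ss_regression / ss_total       # explained variance
unexplained = ss_residual / ss_total       # unexplained variance

print(round(ss_residual, 1), round(r_squared, 2), round(unexplained, 2))
# 59750.6 0.88 0.12
```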

The Coefficients table provides the regression equation parameters, a and b, as well as two t tests about those parameters. In the model, the regression equation is

Y' = .962 X + 34.988

The first t test assesses whether the y-intercept, a, is different from 0. In this case, based on the standard error, the value of the y-intercept is not statistically significantly different from 0. This result may have meaning in particular settings. In this example, it has no real importance. The second t test assesses whether the slope is different from 0. This test is more important than the previous one because if the slope is likely to be 0, then so is the correlation and, consequently, the prediction would not be very accurate. In this case, the slope of .962 is statistically significantly different from 0 because the p-value associated with t = 22.194 is less than .0005.

Analyzing Discrepancy, Leverage, and Influence of Observations

Once a regression equation with adequate prediction accuracy has been determined, more specific analysis of individual points that were used to develop the regression model can be performed. Three characteristics will be introduced: Discrepancy (also known as deviation or residual), Leverage, and Influence.

Discrepancy (deviation or residual) is the vertical distance between an observed point and its predicted value, which is the point directly above or below it on the regression line. Observed points that fall far from the regression line have high discrepancy. Observed points that fall on the regression line have no discrepancy.

Leverage is the horizontal distance that a point falls from the center (mean) of the predictor (independent) variable. Observed points far from the mean for the predictor variable (i.e., the extreme scores for X) have high leverage. Those near the mean of X have low leverage.

The combination of discrepancy and leverage determines a point's influence. Those points with both high discrepancy and high leverage have a greater degree of influence on the parameters for the regression equation. Eliminating the points with high influence may improve the accuracy of the regression model. These highly influential points may indicate unique characteristics that merit more careful investigation.

Here are some examples of plots of the regression line, discrepancy, leverage, and influence for the 2006 and 2007 API scores for San Francisco elementary schools.

In the scatterplot of the two API scores below, the regression line has been superimposed to represent the predicted Y' scores. Notice that most of the observed points fall fairly close to the line. Some of the lower API 2007 scores fail to match the pattern.

The discrepancies are the vertical distances between the observed points and the line. Notice that most observed points are within a distance of 50 score points from the line. One observation is nearly 150 points below the line.

As mentioned earlier, leverage is the horizontal distance from the mean of X (API06). The mean API score for these schools in 2006 was 774 points. The points with the most leverage are those scores well below and above 774.

The combination of discrepancy and leverage indicates the influence a point has on the regression model. Note here that the two points with the most influence are the circled ones. Closer investigation of the data reveals that Cesar Chavez Elementary went from an API score of 735 in 2006 to an API score of 596 in 2007, when the school's predicted score was 742. John Muir Elementary went from an API score of 615 in 2006 to an API score of 573 in 2007, when the school's predicted score was 627. The observation for Chavez Elementary has very high discrepancy and relatively low leverage, because its 2006 API score was near the mean of 774 points. The observation for Muir Elementary has moderate discrepancy but high leverage, because its 2006 API score was among the lowest. Inspecting the discrepancy plot will show two other schools whose predicted scores were off by over 50 points, but because their 2006 API scores are closer to the mean than Muir's was, their influence on the regression model is less.
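The predicted scores and discrepancies for the two schools can be reproduced from the regression equation Y' = .962 X + 34.988. In this sketch, leverage is represented simply as the horizontal distance from the 2006 mean of 774, following the description in this module rather than the formal leverage statistic reported by regression software; the variable names are illustrative.

```python
SLOPE, INTERCEPT = 0.962, 34.988
MEAN_API06 = 774

# (2006 API score, 2007 API score)
schools = {
    "Cesar Chavez Elementary": (735, 596),
    "John Muir Elementary": (615, 573),
}

report = {}
for name, (api06, api07) in schools.items():
    predicted = SLOPE * api06 + INTERCEPT      # Y'
    discrepancy = api07 - predicted            # Y - Y'
    distance_from_mean = api06 - MEAN_API06    # horizontal distance from mean of X
    report[name] = (predicted, discrepancy, distance_from_mean)
    print(f"{name}: predicted {predicted:.0f}, "
          f"discrepancy {discrepancy:.0f}, distance from mean {distance_from_mean}")
```

Chavez shows the larger discrepancy (about 146 points below its predicted score) at a modest distance from the mean, while Muir shows a smaller discrepancy (about 54 points) at a much greater distance from the mean.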