Dependent t Tests and ANOVA Designs

Review of Hypothesis Testing and Independent t Tests

In the previous module, we compared the means from two different (i.e., independent) groups to see if they were statistically significantly different and if so, whether that difference was educationally or practically significant.

The appropriate statistical test to use was the independent t test, but before conducting this test certain assumptions need to be checked. These assumptions include independence of the two groups, normality of the two distributions or large enough sample sizes (e.g., around 30 each), and homogeneity of variance (i.e., approximately equal variances, and therefore standard deviations, in the two groups).

By either calculating the test statistic, t, manually and looking up its associated probability in a table, or by using Excel or SPSS, we eventually obtain a p-value that is compared with the preset tolerance for committing a Type I error, usually α = .05, which is the level of statistical significance. If the p-value is less than α, the observed difference is unlikely to be due to chance; if the p-value is equal to or greater than α, then the risk of committing a Type I error is too great, and the two means are deemed not to differ (or not to differ enough).

Another method for determining statistical significance (for two-tailed tests only) is to use a confidence interval for the difference between the means. If the interval does not contain 0, the means are determined to be different; if the interval contains 0, the two means are not significantly different.

The last step of the independent t test process is to calculate the effect size for differences that have been determined to be statistically significant. Effect sizes represent practical or educational significance and measure the magnitude of the difference between the means. Instead of the raw difference, a standardized difference is computed, called Cohen's d. These effect sizes are standardized so that they can be compared across multiple studies.
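As a quick reference for this review material, here is a minimal sketch of an independent t test and Cohen's d in Python using numpy and scipy. The two groups of scores are invented for illustration and are not from the text.

# A minimal sketch of the independent t test and Cohen's d; the two groups
# of scores below are made up for illustration.
import numpy as np
from scipy import stats

group1 = np.array([72, 85, 78, 90, 66, 81, 77, 88, 74, 83])
group2 = np.array([70, 75, 68, 82, 64, 73, 71, 79, 69, 76])

# Independent t test (equal_var=True reflects the homogeneity-of-variance assumption)
t, p = stats.ttest_ind(group1, group2, equal_var=True)

# Cohen's d using the pooled standard deviation
n1, n2 = len(group1), len(group2)
pooled_sd = np.sqrt(((n1 - 1) * group1.var(ddof=1) + (n2 - 1) * group2.var(ddof=1)) / (n1 + n2 - 2))
d = (group1.mean() - group2.mean()) / pooled_sd

print(f"t = {t:.2f}, p = {p:.4f}, Cohen's d = {d:.2f}")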

Dependent t Tests - Comparing Two Means for Related Groups

In the previous discussion, one of the assumptions was that the two groups are unrelated or independent. What happens if they are related? What if there is some type of link between pairs of observations? The existence of this link allows us to calculate the difference between the paired observations and test whether the mean difference is equal to 0. This process should sound similar to the independent t test. Before conducting one of these tests, let's consider a little background information.

First of all, what are related groups? Consider a study that involves twins. Obviously, twins can share many characteristics. Let's say we are interested in comparing the GPAs of the twins. Instead of dividing the sets of twins into two groups and then conducting an independent t test, we pair the twins and conduct a dependent t test, which involves the difference between their two GPAs. If this difference is found to be statistically significantly different from 0, then we can calculate an effect size to measure the magnitude of the difference.

The previous scenario may seem a bit contrived. Who conducts twin studies in education? These types of studies exist, but they don't represent the most frequent use of the dependent t test in education. Instead of related groups, consider the comparison of related scores. Many education studies compare related scores. The most common comparison is between pretest scores and posttest scores. This comparison is used to assess the effect of an instructional intervention, for example. We might compare last year's score with this year's score, or the score at the beginning of the semester with that at the end. There are many versions of this type of scenario. The link between the pairs of scores is the fact that each pair belongs to a single individual. Each individual's posttest score can be compared with a previously observed pretest score. In fact, a difference score is calculated from the pair of scores and tested statistically.

Assumptions for dependent t tests

The main assumption for the dependent t test is that the difference scores are normally distributed, or that there is a sufficiently large sample size. Notice that we no longer have an assumption about the homogeneity of the variances, because we are comparing each score with its pair. This is a benefit of the dependent t test.
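One way to check this normality assumption, loosely following the histogram/skewness-ratio approach used in step 4 of the worked example below, is sketched here in Python. The pretest and posttest scores are hypothetical, and the standard-error formula is the one commonly reported by SPSS.

# A rough sketch of checking normality of the difference scores via the
# skewness ratio (skewness divided by its standard error). Hypothetical data.
import numpy as np
from scipy import stats

pretest  = np.array([60, 65, 70, 58, 72, 68, 75, 62, 66, 71])
posttest = np.array([66, 70, 72, 63, 78, 70, 80, 65, 72, 77])
diff = posttest - pretest

n = len(diff)
skewness = stats.skew(diff, bias=False)                                # sample skewness
se_skew = np.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))     # SE of skewness (formula used by SPSS)
ratio = skewness / se_skew

# A skewness ratio between roughly -2 and +2 is usually taken as acceptable.
print(f"skewness = {skewness:.2f}, SE = {se_skew:.2f}, ratio = {ratio:.2f}")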

Effect size for dependent t tests

If a statistically significant result is found, an effect size can be calculated, similar to the process with independent t tests. Here, instead of Mean1 - Mean2, the numerator is D, the mean of the individual difference scores. As in the previous situation, there are a number of ways to choose which standard deviation (sd) to use for the denominator. One commonly accepted choice is the pretest standard deviation, because it most closely represents the standard deviation of the population prior to any intervention.

In the following example, students' scores on a midterm and final will be compared. The data are listed below. Let's review the steps for comparing the scores.

Here are the steps:
  1. State the null and research hypotheses.
    Null hypothesis:  There is no difference between midterm scores and final scores
    Research hypothesis: Final scores are higher than midterm scores
  2. Establish the alpha level.
    We'll set a one-tailed alpha level at α = .05
  3. Select the appropriate test statistic (see the decision tree on page 166 or https://usffiles.usfca.edu/FacStaff/baab/www/lessons/DecisionTree.html).
    The appropriate test to use is a dependent t test. The test statistic will be t.
  4. Check the test's assumptions and compute the test statistic based on the sample data (obtained value).
    The assumption of normality is met by checking the histogram or the skewness ratio for the differences.
  5. Skip to #8. Determine the critical value for the test statistic.
    The critical value for the test statistic could be looked up in a table of t values, but SPSS gives the p-value associated with the observed t  value instead.
  6. Skip to #8. Compare the obtained value with the critical value.
    Again, because SPSS reports the p-value - Sig. (2-tailed) - associated with the observed t value, we can directly compare that p-value (.000) to the predetermined alpha level of .05. Note that SPSS reports the two-tailed p-value; the appropriate p-value for a one-tailed test is one half of the two-tailed p-value. The obtained p-value is less than .0005 for a two-tailed test, so the corresponding p-value for the one-tailed test is less than .00025.
  7. Skip to #8. Either reject or retain the null hypothesis based on the following.
    1. If obtained value > critical value, then reject the null hypothesis - evidence supports the research hypothesis.
    2. If obtained value <= critical value, then retain the null hypothesis - evidence does not support the research hypothesis.
  8. Alternative to #5-7 - for use with SPSS output:
    Compare the reported p-value (Sig.) with the preset alpha level.
    If p-value < alpha level, then reject the null hypothesis - evidence supports the research hypothesis. There is a small chance of committing a Type I error.
    If p-value >= alpha level, then retain the null hypothesis - evidence does not support the research hypothesis. The chance of committing a Type I error is too large.

  9. Here is the output from SPSS:




To conduct the same analysis in Excel, you need to first ensure that the Analysis ToolPak is installed for your version of Excel. Specific instructions for installing the ToolPak may vary for different versions of Excel. Check the appropriate Help information by searching for "Analysis ToolPak." When the Analysis ToolPak is installed, a Data Analysis option appears on the Tools menu. Choosing t-Test: Paired Two Sample for Means from the Data Analysis menu and completing the corresponding dialog box by filling in the data ranges produces the following output:

In either the SPSS analysis or the Excel analysis, we see that the mean difference is statistically significantly different from 0, which indicates that the final scores and the midterm scores are not the same. By dividing the observed mean difference by the midterm standard deviation, we obtain an effect size of approximately .39, a small-to-medium effect by conventional guidelines.
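For readers who prefer a scripting approach, here is a minimal sketch of the same kind of analysis in Python with scipy. Because the original midterm and final scores are not reproduced here, the values below are hypothetical stand-ins, so the numbers will not match the SPSS or Excel output.

# A minimal sketch of a dependent (paired) t test with a pretest-SD effect size.
# The midterm/final scores are hypothetical.
import numpy as np
from scipy import stats

midterm = np.array([74, 68, 82, 77, 90, 65, 71, 84, 79, 73, 88, 69])
final   = np.array([79, 72, 84, 80, 93, 70, 73, 88, 82, 75, 91, 74])

# scipy, like SPSS, reports a two-tailed p-value; halve it for the one-tailed hypothesis
t, p_two_tailed = stats.ttest_rel(final, midterm)
p_one_tailed = p_two_tailed / 2

# Effect size: mean difference divided by the midterm (pretest) standard deviation
d = (final - midterm).mean() / midterm.std(ddof=1)

print(f"t = {t:.2f}, one-tailed p = {p_one_tailed:.4f}, d = {d:.2f}")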


ANOVA - Comparing More Than Two Group Means

As we've seen, t tests compare the means of two independent groups or two related groups or scores. Many research designs involve more than just two groups. One approach for comparing more than two means is to test each pair of means with a t test. Not only does this approach involve many tests, but it also compounds the Type I error rate. For example, if we have three groups, Morning, Afternoon, and Evening, whose scores we want to compare, we could compare 1) Morning with Afternoon, 2) Morning with Evening, and 3) Afternoon with Evening. This approach would involve three separate tests, and it would inflate the overall (family-wise) chance of a Type I error to nearly three times that of a single test. Instead of this approach, statisticians use an overall test, called an ANOVA.
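A quick calculation shows how the error rate compounds. This small sketch assumes each of the three tests is conducted at α = .05 and, for simplicity, treats the tests as independent.

# How the Type I error rate compounds across three pairwise t tests at alpha = .05
alpha = 0.05
n_tests = 3

# Probability of at least one false rejection across the three tests
familywise = 1 - (1 - alpha) ** n_tests
print(f"family-wise error rate = {familywise:.3f}")   # about .143, nearly 3 times .05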

ANOVA stands for ANalysis Of VAriance and is described as an omnibus test because it tests all differences between the separate groups at once. The variances being analyzed are the deviations from the mean scores, which we used to calculate the standard deviation (itself a measure of variation, of course). Review the calculation for the variance and standard deviation before proceeding on. Understanding that calculation will help to understand the following technique.

There are many types of ANOVAs, which depend on the specifics of the research design. The first type we will consider is called simple or oneway ANOVA. A oneway ANOVA is used to compare the means of three or more groups. Think of it as an extension of an independent t test, or think of the independent t test as a special case of an ANOVA for two groups.

The heart of an ANOVA test is something called the sum of squares. This is where recalling the calculation of the variance and standard deviation is helpful. Remember that we determined deviations from the mean for each score in a distribution. This process resulted in positive and negative deviations, which, by a property of the mean, always sum to 0. In order to calculate an average deviation, we squared and then summed the deviations. This is the sum of squares (SS) - think of it as the sum of squared deviations to remember where the squares originated.

In this section, we will be considering three different types of sums of squares.
Explained variance plus unexplained variance equals total variance, or stated another way, the between groups sum of squares plus the within groups sum of squares equals the total sum of squares. In a t test, the between groups sum of squares is represented by the difference between the two group means and the within groups sum of squares is represented by the variances of the two groups.

The test statistic for an ANOVA is called the F ratio, which compares the between groups sum of squares with the within groups sum of squares. The between groups sum of squares is divided by the degrees of freedom associated with the number of group means (e.g., if there are 3 groups, then there are 2 degrees of freedom) to obtain the mean sum of squares between groups. Likewise, the within groups sum of squares is divided by the degrees of freedom associated with the group sample sizes (n - 1 for each group). For example, when comparing three groups of 10 scores each, there are 27 degrees of freedom within groups, 9 from each group. Dividing the within groups sum of squares by its associated degrees of freedom gives the mean sum of squares within groups.

The F ratio is the mean between groups sum of squares divided by the mean within groups sum of squares.
This test statistic is used to determine statistical significance of the result. SPSS and Excel report the associated p-value directly; Table B.3 in the text lists some example F values for specific α levels and degrees of freedom. Notice that there are two degrees of freedom parameters in the table - one for the numerator of the F ratio and another for the denominator.

An effect size for an ANOVA is called eta squared, η², and is calculated by dividing the between groups sum of squares by the total sum of squares to obtain the percentage of variance that is explained by group membership. Similar to the coefficient of determination, the amount of unexplained variance is 1 minus the amount of explained variance.

The assumptions for an ANOVA are extensions of the assumptions for a t test, namely 1) independent observations, 2) normally distributed data, and 3) homogeneity of variance among all groups. There is a form of Levene's test to assess whether the variances are similar.

Let's step through an example. Suppose that you are teaching three sections of the same course, designated Morning, Afternoon, and Evening. For the sake of this example, each section has 10 students. You collected data from your students using an instrument that measures their level of activity in the class. Here are the data:

Section   Morning   Afternoon   Evening
          7         4           8
          6         5           7
          8         6           8
          6         5           8
          7         4           6
          6         5           7
          8         6           6
          6         5           9
          5         5           7
          7         4           8
Sum       66        49          74
Mean      6.60      4.90        7.40
SD        0.97      0.74        0.97

1. Set up the hypothesis
The null hypothesis is that the population means are equal. Symbolically, H0: μMorning = μAfternoon = μEvening

The alternative (research) hypothesis is never directional for an ANOVA. In this case, the alternative hypothesis is that the three means are not all equal.
Symbolically, H1: μMorning ≠ μAfternoon ≠ μEvening

As we will see, this form of the alternative hypothesis will not identify the source of specific differences if they are found to exist.

2. Establish the alpha level at α = .05. In keeping with the form of the alternative hypothesis, the test is always non-directional for an ANOVA.
 
3. Select the appropriate test statistic (see the decision tree on page 166 or https://usffiles.usfca.edu/FacStaff/baab/www/lessons/DecisionTree.html).
The appropriate test to use is a oneway ANOVA. The test statistic will be F.

4. Check the test's assumptions and compute the test statistic based on the sample data (obtained value).
Independence is a result of the data collection process.
Normality is checked by inspecting the histograms and skewness ratios.
Homogeneity of variances is checked with Levene's test (see the SPSS output below).
The test statistic is calculated and reported by SPSS or Excel. This is how the F ratio is determined:

First, a grand mean is calculated. All 30 scores shown above are added and the sum is divided by 30 to obtain a grand mean of 6.30.

The between groups sum of squares is the sum of the squared differences of each group mean from the grand mean multiplied by the sample size. [Note: ^2 represents squaring and * represents multiplication]

Between groups sum of squares =  10 * [(6.60-6.30)^2 + (4.90-6.30)^2 + (7.40-6.30)^2] , which equals 10 * [0.09 + 1.96 + 1.21] or 32.60

The within groups sum of squares is the sum of each score's squared deviation from its group mean. In other words, we subtract 6.60, 4.90, or 7.40 from each score listed above, square that result, and then add up all of the squares. Here are the squared deviations from the group means:

Section   Morning   Afternoon   Evening
          0.16      0.81        0.36
          0.36      0.01        0.16
          1.96      1.21        0.36
          0.36      0.01        0.36
          0.16      0.81        1.96
          0.36      0.01        0.16
          1.96      1.21        1.96
          0.36      0.01        2.56
          2.56      0.01        0.16
          0.16      0.81        0.36
Sum       8.40      4.90        8.40

Within groups sum of squares = 8.40 + 4.90 + 8.40, which equals 21.70

Next, the two mean sums of squares are calculated by dividing by the degrees of freedom. Because there are three groups, the between groups degrees of freedom is 3-1 or 2. Because each group has 10 observations, the within groups degrees of freedom is 3 * (10 -1) or 3*9 which is 27.

The mean between groups sum of squares is 32.60/2 or 16.30

The mean within groups sum of squares is 21.70/27 or .803704

The F ratio is 16.30/.803704, or approximately 20.28. (A short Python sketch that reproduces this arithmetic appears after step 8 below.)

5. Skip to #8. Determine the critical value for the test statistic.

6. Skip to #8. Compare the obtained value with the critical value.

7. Skip to #8. Either reject or retain the null hypothesis based on the following:
        If obtained value > critical value, then reject the null hypothesis - evidence supports the research hypothesis.
        If obtained value <= critical value, then retain the null hypothesis - evidence does not support the research hypothesis.
8. Alternative to #5-7 - for use with SPSS output:
    Compare the reported p-value (Sig.) with the preset alpha level.
    If p-value < alpha level, then reject the null hypothesis - evidence supports the research hypothesis. There is a small chance of committing a Type I error.
    If p-value >= alpha level, then retain the null hypothesis - evidence does not support the research hypothesis. The chance of committing a Type I error is too large.
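As noted in step 4, here is a short Python sketch that reproduces the oneway ANOVA arithmetic from the three sections' scores; scipy's f_oneway is included only as a cross-check of the hand calculation.

# Reproducing the oneway ANOVA arithmetic for the three sections.
import numpy as np
from scipy import stats

morning   = np.array([7, 6, 8, 6, 7, 6, 8, 6, 5, 7])
afternoon = np.array([4, 5, 6, 5, 4, 5, 6, 5, 5, 4])
evening   = np.array([8, 7, 8, 8, 6, 7, 6, 9, 7, 8])
groups = [morning, afternoon, evening]

grand_mean = np.concatenate(groups).mean()                                # 6.30
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)   # 32.60
ss_within  = sum(((g - g.mean()) ** 2).sum() for g in groups)             # 21.70

df_between = len(groups) - 1                                              # 2
df_within  = sum(len(g) - 1 for g in groups)                              # 27
ms_between = ss_between / df_between                                      # 16.30
ms_within  = ss_within / df_within                                        # about 0.804
F = ms_between / ms_within                                                # about 20.28

eta_squared = ss_between / (ss_between + ss_within)                       # about 0.60

print(f"F = {F:.2f}, eta squared = {eta_squared:.2f}")
print(stats.f_oneway(morning, afternoon, evening))                        # same F, with its p-value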

In the first table of the SPSS output below, find the group means and the grand mean described above. The second table contains the results of Levene's test, which assesses the equality of the group variances. The third table contains the ANOVA results, which include the three sums of squares, the two mean sums of squares, the F ratio, and the observed p-value. Because the p-value is less than .05, we reject the null hypothesis and conclude that the three group means are not all equal - but how are they different? For this we need to conduct another test, called pairwise comparisons. There are numerous versions of this test. We'll use the one that the author recommends, the Bonferroni test. But before that, look over the Excel results, generated using the ANOVA: Single Factor option on the Data Analysis dialog box, which is accessed from the Data Analysis command on the Tools menu (see the earlier note about installing these tools).



Here are the Excel ANOVA results:



Now for the pairwise comparisons. The Bonferroni test identifies the specific source of the differences found by the overall ANOVA test.
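Before looking at the SPSS Bonferroni output, here is a rough sketch of the same idea in Python: an independent t test for each pair of sections, with α divided by the number of comparisons. Note that SPSS's Bonferroni procedure uses the pooled within-groups error term from the ANOVA, so its p-values will differ slightly from this simplified version.

# Simplified Bonferroni-style pairwise comparisons of the three sections.
from itertools import combinations
import numpy as np
from scipy import stats

sections = {
    "Morning":   np.array([7, 6, 8, 6, 7, 6, 8, 6, 5, 7]),
    "Afternoon": np.array([4, 5, 6, 5, 4, 5, 6, 5, 5, 4]),
    "Evening":   np.array([8, 7, 8, 8, 6, 7, 6, 9, 7, 8]),
}

alpha = 0.05
pairs = list(combinations(sections, 2))
bonferroni_alpha = alpha / len(pairs)          # .05 divided by 3 comparisons

for name1, name2 in pairs:
    t, p = stats.ttest_ind(sections[name1], sections[name2])
    verdict = "different" if p < bonferroni_alpha else "not significantly different"
    print(f"{name1} vs {name2}: t = {t:.2f}, p = {p:.4f} -> {verdict}")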


This test reports that the mean score of the Afternoon section (4.9) is different from both the Morning (6.6) and Evening (7.4) sections, but that the difference between Morning and Evening sections' scores is not statistically significant. Here is a picture of the pattern of mean scores.



The effect size is computed from the ANOVA table as the percentage of explained variance, denoted by η², which is 32.6/54.3 or about 60.0%, representing a large effect. The determination of the relative strength of an effect depends on the field of study - general guidelines should be weighed against findings of other researchers.

For practice understanding the oneway ANOVA and explained and unexplained variance, visit http://www.kingsborough.edu/academicDepartments/math/faculty/rsturm/anova/Anova0126.html or http://www.ruf.rice.edu/~lane/stat_sim/one_way/index.html


Factorial ANOVA - Comparing Group Means Based on Combinations of Independent Variables

We'll take the application of ANOVA just one step further. Suppose that you not only want to compare sections of a course but you also want to compare the level of activity of majors and non-majors in the course. Now, instead of just three groups, you have six groups - three sections and two types of students within each section. Again to make the example more straightforward, we'll assume equal numbers of majors and non-majors within the sections. That is, there are exactly five majors and five non-majors in each of the three sections. This uniformity is not a requirement, but it does make the results easier to understand.

Before conducting the comparison of the six group means, let's introduce a few new terms. First of all, this type of ANOVA is called a factorial ANOVA, and in particular for the example just described, a two factor or two-way ANOVA. A factor is an independent variable that is used to categorize the observations.

When we compare the means, there are two types of effects that we can observe. They are called main effects and interaction effects. Main effects are due to the factors themselves. In this example, there is a main effect due to Section and another main effect due to Type of Student. Interaction effects are due to the combination of the two factors. For example, if we found that majors are more actively learning in the morning section and non-majors are more actively learning in the afternoon section, there would be an interaction effect. Interactions are designated by a combination of factors, such as Section * Type.

In the ANOVA table, the result of the factorial structure is a separation of the between groups sum of squares. Because there are more categories of students, we have more ways to determine where the differences might arise. In the following example, which uses the same data that we used earlier, you'll notice that the overall sum of squares is the same.

The various F ratios and effect sizes are calculated in the same manner as they were in the oneway ANOVA. The assumptions are the same as the oneway ANOVA as well - there are just more subgroups to check.

Here are the data seen earlier, but now divided by Type of Student as well.


Here is the Excel output from the ANOVA: Two-Factor with Replication command. Note that when using this command, the row and column containing the headings must be included in the input range. The values in the Total column were reported in the previous oneway ANOVA.



Here is the ANOVA table with the sum of squares for the two main effects and for the interaction.



Notice that the differences due to the Section are statistically significant, which is what we found in the oneway ANOVA, and that the differences due to Type of Student are statistically significant, but the interaction effect is not statistically significant. Try computing the effect sizes for the statistically significant effects. What percentage of variance is left unexplained?
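For completeness, here is a sketch of the factorial ANOVA in Python using statsmodels. The Section scores are the same as above, but the Major/Non-Major assignment within each section is hypothetical (the actual split appears only in the output tables), so the Type and interaction results will not match the output shown.

# A two-way (factorial) ANOVA sketch with main effects and the interaction.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.DataFrame({
    "score":   [7, 6, 8, 6, 7, 6, 8, 6, 5, 7,      # Morning
                4, 5, 6, 5, 4, 5, 6, 5, 5, 4,      # Afternoon
                8, 7, 8, 8, 6, 7, 6, 9, 7, 8],     # Evening
    "section": ["Morning"] * 10 + ["Afternoon"] * 10 + ["Evening"] * 10,
    "type":    (["Major"] * 5 + ["NonMajor"] * 5) * 3,   # hypothetical split
})

# Model with both main effects and the Section * Type interaction
model = ols("score ~ C(section) * C(type)", data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)   # sums of squares, F ratios, and p-values for each effect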

Here is the same analysis from SPSS. First, here are the descriptive statistics about the groups as well as Levene's test for the equality of variances.



Then here are the results of the ANOVA. Notice that SPSS includes additional information in the table.


For the purposes of our example, we just need to focus on the rows labeled Section, Type, Section * Type, Error, and Corrected Total. Compare these results to the Excel output displayed above.

Here is a picture that illustrates the pattern of means.


When the two lines are roughly parallel, as they are here, there is no interaction effect. In this graph, we can see that no matter which section they were in, the mean scores of Majors exceeded those of Non-Majors. This pattern represents a main effect. Also, by estimating the midpoints between the two mean scores for each section, we can see that the Morning and Evening mean scores are higher than the Afternoon mean score.

For practice understanding the two-way ANOVAs and main effects and interactions, visit http://www.kingsborough.edu/academicDepartments/math/faculty/rsturm/anova/Anova0126.html and choose the Two-Way model.