Dependent t Tests and ANOVA Designs
Review of Hypothesis Testing and Independent t Tests
In
the previous module, we compared the means from two different (i.e.,
independent) groups to see if they were statistically significantly
different and if so, whether that difference was educationally or
practically significant.
The appropriate
statistical test to use was the independent t test, but before
conducting this test, certain assumptions need to be checked. These
assumptions include independence of the two groups, normality of the
two distributions or large enough sample sizes (e.g., around 30 each),
and homogeneity of variance (i.e., equality of the two groups'
variances).
By either calculating the test
statistic, t,
manually and looking up its associated probability in a table, or by
using Excel or SPSS, eventually a p-value is obtained that is compared
with the preset tolerance for committing a Type I error,
usually α = .05, which is the level of statistical significance. If the
p-value is less than α, the observed difference is unlikely to be due
to chance; if the p-value is equal to or greater than α, the risk of
committing a Type I error is too great, and the null hypothesis that
the two means are equal is retained.
Another method for determining statistical significance, though only
for two-tailed tests, is to use confidence intervals. If the interval
does not contain 0, the means are determined to be different; if the
interval contains 0, the two means are not different.
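To make the confidence-interval logic concrete, here is a minimal
Python sketch (not part of the original lesson) that builds a 95%
interval for the difference between two independent means; the samples
are hypothetical, and the pooled standard error assumes homogeneity of
variance:

    import numpy as np
    from scipy import stats

    # Hypothetical samples for two independent groups
    group1 = np.array([7, 6, 8, 6, 7, 6, 8, 6, 5, 7], dtype=float)
    group2 = np.array([4, 5, 6, 5, 4, 5, 6, 5, 5, 4], dtype=float)

    n1, n2 = len(group1), len(group2)
    diff = group1.mean() - group2.mean()

    # Pooled variance and standard error (assumes equal variances)
    sp2 = ((n1 - 1) * group1.var(ddof=1) +
           (n2 - 1) * group2.var(ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))

    # Two-tailed critical t value for alpha = .05
    tcrit = stats.t.ppf(0.975, df=n1 + n2 - 2)
    lo, hi = diff - tcrit * se, diff + tcrit * se

    # If 0 lies outside (lo, hi), the two-tailed test is significant
    print(f"95% CI for the mean difference: ({lo:.2f}, {hi:.2f})")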
The
last step of the independent t
test process is to calculate the effect size for differences that have
been determined to be statistically significant. Effect sizes
represent practical or educational significance and measure the
magnitude of the difference between the means. Instead of the raw
difference, a standardized difference is computed, called Cohen's d. These effect
sizes are standardized so that they can be compared across multiple
studies.
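As an illustration (not from the lesson), Cohen's d with a pooled
standard deviation can be computed in Python as follows; the pooled-SD
formula is one common choice among several, and the data passed in are
hypothetical:

    import numpy as np

    def cohens_d(x, y):
        """Standardized mean difference using the pooled standard deviation."""
        nx, ny = len(x), len(y)
        pooled_var = ((nx - 1) * np.var(x, ddof=1) +
                      (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
        return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

    # By Cohen's guidelines, d of about .2 is small, .5 medium, .8 large
    print(cohens_d([7, 6, 8, 6, 7, 6, 8, 6, 5, 7],
                   [4, 5, 6, 5, 4, 5, 6, 5, 5, 4]))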
Dependent t Tests - Comparing Two Means for Related Groups
In the previous discussion, one
of the assumptions was that the two groups are unrelated or
independent. What happens if the groups are related? What if there is
some type of link between pairs of observations? The existence of this
link allows us to calculate the difference between the paired
observations and test if the mean difference is equal to 0. This
process should sound similar to the independent t test. Before
conducting one of these tests, let's consider a little background
information.
First of all, what are related groups?
Consider a study that involves twins. Obviously, twins can share many
characteristics. Let's say we are interested in comparing the GPAs of
the twins. Instead of dividing the sets of twins into two groups and
then conducting an independent t
test, we pair the twins and conduct a dependent t test, which
involves the difference between their two GPAs. If this difference is
found to be statistically significantly different from 0, then we can
calculate an effect size to measure the magnitude of the difference.
The
previous scenario may seem a bit contrived. Who conducts twin studies
in education? These types of studies exist, but they don't represent
the most frequent use of the dependent t test in
education. Instead of related groups, consider the comparison
of related scores. Many education studies compare related scores. The
most common comparison is between pretest scores and posttest scores.
This comparison is used to assess the effect of an instructional
intervention, for example. We might compare last year's score with this
year's score, or the score at the beginning of the semester with that
at the end. There are many versions of this type of scenario. The link
between the pairs of scores is the fact that each pair belongs to a
single individual. Each individual's posttest score can be compared
with a previously observed pretest score. In fact, a difference score
is calculated from the pair of scores and tested statistically.
Assumptions for dependent t tests
The
main assumption for the dependent t
test is that the difference scores are normally distributed, or that
there is a sufficiently large sample size. Notice that we no longer
have an assumption about the homogeneity of the variances, because we
are comparing each score with its pair. This is a benefit of the
dependent t
test.
Effect size for dependent t tests
If
a statistically significant result is found, an effect size can be
calculated, similar to the process with independent t tests. Here,
instead of Mean1 - Mean2, the numerator is D, the mean of the
individual difference scores. As in the independent case, there are
several ways to
choose which standard deviation (sd) to use. One commonly accepted
choice is to use the pretest standard deviation, because it most
closely represents the standard deviation of the population prior to
any intervention.
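As a hypothetical worked equation (these numbers are invented for
illustration, not taken from the lesson's data): if the mean of the
difference scores is D = 4 points and the pretest standard deviation is
10 points, the effect size is d = 4/10 = .40.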
In the
following example, students' scores on a midterm and final will be
compared. The data are listed below. Let's review the steps for
comparing the scores.
Here
are the steps:
- State
the null and research hypotheses.
Null hypothesis: There
is no difference between midterm scores and final scores
Research hypothesis: Final scores
are higher than midterm scores
- Establish
the alpha level.
We'll
set a one-tailed alpha level at α = .05
- Select
the appropriate test statistic (see the decision tree on page 166
or https://usffiles.usfca.edu/FacStaff/baab/www/lessons/DecisionTree.html).
The appropriate test to use is a
dependent t
test. The test statistic
will be t.
- Check
the test's assumptions and compute
the test statistic based on the sample data (obtained value).
The assumption of normality is
checked by inspecting the histogram or the skewness ratio of the difference scores.
- Skip
to #8. Determine
the critical value for the test statistic.
The
critical value for the test statistic could be looked up in a table of
t values, but SPSS gives the p-value associated with the observed t value instead.
- Skip
to #8. Compare
the obtained value with the critical value.
Again, because SPSS reports the
p-value - Sig. (2-tailed) - associated with the observed t value, we can directly compare
the associated p-value (.000) to the predetermined alpha level of .05.
Note that SPSS reports the two-tailed p-value; the appropriate p-value
for a one-tailed test is one half of the p-value for a two-tailed test.
The obtained p-value is less than .0005 for a two-tailed test, so the
corresponding p-value for the one-tailed test is less than .00025.
- Skip to #8. Either
reject or retain the null hypothesis based on the following.
- If obtained
value > critical value, then reject the null hypothesis -
evidence supports the research hypothesis.
- If
obtained value <= critical value, then retain the null
hypothesis - evidence does not support the research hypothesis.
- Alternative
to #5-7 - for use with SPSS output:
Compare the reported
p-value (Sig.) with the preset alpha level.
If
p-value < alpha level, then reject the null hypothesis -
evidence
supports the research hypothesis. There is a small chance of committing
a Type I error.
If p-value >= alpha level, then retain
the null
hypothesis - evidence does not support the research hypothesis. The
chance of committing a Type I error is too large.
Here
is the output from SPSS:
To conduct the
same analysis in Excel, you first need to ensure that
the Analysis ToolPak is installed for your version of Excel.
Specific instructions for installing the ToolPak vary across
versions of Excel; check the appropriate Help information by searching
for "Analysis ToolPak." When the Analysis ToolPak is installed, a Data
Analysis option appears on the Tools menu. Choosing t-Test: Paired Two
Sample for Means from the Data Analysis menu and completing the
corresponding dialog box by filling in the data ranges produces the
following output:
In either the SPSS
analysis or the Excel analysis, we see that the mean difference is
statistically significantly different from
0, which indicates that the final scores and the midterm scores are not
the same. By dividing the observed mean difference by the midterm
standard deviation, we obtain a medium effect size of approximately .39.
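For readers who want to replicate this kind of analysis in Python,
here is a minimal sketch using SciPy's paired t test; the midterm and
final arrays are hypothetical stand-ins (the lesson's actual data
appear only in the screenshots), so the printed values will not match
the output above:

    import numpy as np
    from scipy import stats

    # Hypothetical midterm/final scores for the same 10 students
    midterm = np.array([78, 82, 69, 75, 88, 72, 80, 77, 85, 70], dtype=float)
    final   = np.array([84, 85, 74, 77, 92, 75, 86, 80, 90, 76], dtype=float)

    # Dependent (paired) t test on the difference scores
    t_stat, p_two_tailed = stats.ttest_rel(final, midterm)
    p_one_tailed = p_two_tailed / 2   # research hypothesis: final > midterm

    # Effect size: mean difference divided by the pretest (midterm) SD
    d = (final - midterm).mean() / midterm.std(ddof=1)
    print(t_stat, p_one_tailed, round(d, 2))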
ANOVA - Comparing More Than Two Group Means
As we've seen, t tests
compare the means of two independent groups or two related groups or
scores. Many research designs involve more than just two groups. One
approach for comparing more than two means is to test each pair of
means with a t
test. Not only does this approach involve many tests, but it also
compounds the Type I error rate. For example, if we have three groups,
Morning, Afternoon, and Evening, whose scores we want to compare, we
could compare 1) Morning with Afternoon, 2) Morning with Evening, and
3) Afternoon with Evening. This approach would involve three separate
tests, and it would result in an overall error rate of roughly three
times that of a single test. Instead of this approach, statisticians use an
overall test, called an ANOVA.
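A quick sketch of that inflation in Python: with α = .05 per test and
three independent tests, the chance of at least one false rejection is
about .14, nearly three times the single-test rate:

    # Familywise Type I error rate for three independent tests at alpha = .05
    alpha = 0.05
    familywise = 1 - (1 - alpha) ** 3
    print(round(familywise, 3))   # 0.143, roughly 3 * 0.05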
ANOVA stands for ANalysis Of VAriance and
is described as an omnibus test because it tests all differences
between the separate groups at once. The variances being analyzed are
the deviations from the mean scores, which we used to calculate the
standard deviation (itself a measure of variation, of course). Review
the calculation of the variance and standard deviation before
proceeding; understanding that calculation will help you understand
the following technique.
There are many types of
ANOVAs, which depend on the specifics of the research design. The first
type we will consider is called simple or oneway ANOVA. A oneway ANOVA is used to compare the
means of three or more groups. Think of it as an extension
of an independent t
test, or think of the independent t
test as a special case of an ANOVA for two groups.
The
heart of an ANOVA test is something called the sum of squares. This is where
recalling the calculation of the variance and standard deviation is
helpful. Remember that we determined deviations from the mean for each
score in a distribution. This process resulted in positive and negative
deviations, which, by a property of the mean, always sum to 0.
To calculate an average deviation, we squared and then
summed the deviations. This is the sum of squares (SS) - think of it as
the sum of squared deviations to remember where the squares originated.
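A tiny Python sketch (with hypothetical scores) demonstrates both
facts: the raw deviations sum to 0, so we square them before summing
to get the sum of squares:

    import numpy as np

    scores = np.array([7, 6, 8, 6, 7, 6, 8, 6, 5, 7], dtype=float)  # hypothetical
    deviations = scores - scores.mean()
    print(deviations.sum())        # 0, up to floating-point rounding
    ss = (deviations ** 2).sum()   # the sum of squared deviations (SS)
    print(ss)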
In this section, we will be considering three
different types of sums of squares.
- The total sum of squares
represents the total amount of variation in the combined distribution
of all raw scores.
- The between groups sum of squares
represents the variation that is related to group membership. The
between groups sum of squares represents variance that is explained by the
characteristics that define the separate groups.
- The within groups sum of squares
represents the variation within the separate groups, which is unexplained variance and is
also called error.
Explained
variance plus unexplained
variance equals total variance, or stated another
way, the between groups
sum of
squares plus the within groups sum of squares equals the
total sum of squares. In a t
test, the
between groups sum of squares is represented by the difference between
the two group means and the within groups sum of squares is represented
by the variances of the two groups.
The test
statistic for an ANOVA is called the F
ratio, which compares the between groups sum of squares
with the within groups sum of squares. The between groups sum
of squares is divided by the degrees of freedom associated with the
number of group means (e.g., if there are 3 groups, then there are 2
degrees of freedom) to obtain the
mean sum of squares between groups. Likewise, the within
groups sum of squares is divided by the degrees of freedom associated
with the group sample size (n). For example, when comparing three
groups of 10 scores, there are 27 degrees of freedom, which is 9
degrees of freedom from each group. Dividing the within groups sum of
squares by its associated degrees of freedom gives the mean sum of squares within groups.
The F ratio is
the mean between groups sum of squares divided by the
mean within groups sum of squares. This test statistic is
used to determine statistical significance of the result. SPSS and
Excel report the associated p-value directly; Table B.3 in the text
lists some example F values for specific α levels and degrees
of freedom. Notice that there are two degrees of freedom parameters in
the table - one for the numerator of the F ratio and another for the
denominator.
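As a small sketch of that arithmetic in Python (the sums of squares
here are borrowed from the worked example later in this lesson; three
groups of 10 scores each):

    # Mean sums of squares and the F ratio for k groups of n scores each
    ss_between, ss_within = 32.60, 21.70   # from the example below
    k, n = 3, 10
    ms_between = ss_between / (k - 1)        # df between = k - 1 = 2
    ms_within = ss_within / (k * (n - 1))    # df within = k(n - 1) = 27
    F = ms_between / ms_within
    print(round(F, 2))                       # 20.28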
An effect
size for an ANOVA is called eta squared, η²,
and is calculated by dividing the between
groups
sum of squares by the total
sum of squares to obtain the percentage of variance that is explained
by group membership. Similar to the coefficient of
determination, the amount of unexplained variance is 1 - the amount of
explained variance.
The assumptions for an ANOVA are
extensions of the assumptions for a t
test, namely 1)
independent observations, 2) normally distributed data, and 3)
homogeneity of variance among all groups. There is a form of Levene's
test to assess whether the variances are similar.
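As a sketch in Python, Levene's test is available in SciPy; the arrays
below are the Morning/Afternoon/Evening scores from the example that
follows, and center='mean' matches the classic mean-centered form of
the test (SciPy's default centers on the median):

    from scipy import stats

    morning   = [7, 6, 8, 6, 7, 6, 8, 6, 5, 7]
    afternoon = [4, 5, 6, 5, 4, 5, 6, 5, 5, 4]
    evening   = [8, 7, 8, 8, 6, 7, 6, 9, 7, 8]

    stat, p = stats.levene(morning, afternoon, evening, center='mean')
    # p >= .05 suggests the equal-variance assumption is reasonable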
Let's
step through an example. Suppose that you are teaching three sections
of the same course, designated Morning, Afternoon, and Evening. For the
sake of this example, each section has 10 students. You collected data
from your students using an instrument that measures their level of
activity in the class. Here are the data:
Section | Morning | Afternoon | Evening
        |    7    |     4     |    8
        |    6    |     5     |    7
        |    8    |     6     |    8
        |    6    |     5     |    8
        |    7    |     4     |    6
        |    6    |     5     |    7
        |    8    |     6     |    6
        |    6    |     5     |    9
        |    5    |     5     |    7
        |    7    |     4     |    8
Sum     |   66    |    49     |   74
Mean    |  6.60   |   4.90    |  7.40
SD      |  0.97   |   0.74    |  0.97
1. Set up the hypotheses
The null hypothesis is that the population means
are equal. Symbolically, H0: μ_Morning = μ_Afternoon = μ_Evening
The alternative (research) hypothesis is never directional for an
ANOVA. In this case, the alternative hypothesis is that the three
means are not all equal.
Symbolically, H1: μ_Morning ≠ μ_Afternoon ≠ μ_Evening
As we will see, this form
of the alternative hypothesis will not identify the source of specific
differences if they are found to exist.
2. Establish
the alpha level at α = .05. The test is always nondirectional for an
ANOVA, based on the form of the alternative hypothesis.
3. Select the
appropriate test statistic (see the decision tree on page 166
or https://usffiles.usfca.edu/FacStaff/baab/www/lessons/DecisionTree.html).
The appropriate test to use is a
oneway ANOVA.
The test statistic will be F.
4.
Check
the test's assumptions and compute
the test statistic based on the sample data (obtained value).
Independence is a result of the
data collection process.
Normality is checked by inspecting
the histograms and skewness ratios.
Homogeneity of variances
is checked with Levene's test (see the SPSS output below).
The
test statistic is calculated and reported by SPSS or Excel. This is how
the F ratio is determined:
First, a grand mean is calculated.
All
30 scores shown above are added and the sum is divided by 30 to obtain
a grand mean of 6.30.
The
between groups sum of squares is the sum of the squared differences of
each group mean from the grand mean multiplied by the sample size.
[Note: ^2 represents squaring and * represents multiplication]
Between
groups sum of squares = 10 * [(6.60-6.30)^2 + (4.90-6.30)^2 +
(7.40-6.30)^2] , which equals 10 * [0.09 + 1.96 + 1.21] or 32.60
The
within groups sum of squares is the sum of each score's squared
deviation from its group mean. In other words, we subtract 6.60, 4.90,
or 7.40 from
each score listed above, square that result, and then add up all of the
squares. Here are the squared deviations from the group means:
Section | Morning | Afternoon | Evening
        |  0.16   |   0.81    |  0.36
        |  0.36   |   0.01    |  0.16
        |  1.96   |   1.21    |  0.36
        |  0.36   |   0.01    |  0.36
        |  0.16   |   0.81    |  1.96
        |  0.36   |   0.01    |  0.16
        |  1.96   |   1.21    |  1.96
        |  0.36   |   0.01    |  2.56
        |  2.56   |   0.01    |  0.16
        |  0.16   |   0.81    |  0.36
Sum     |  8.40   |   4.90    |  8.40
Within
groups sum of squares = 8.40 + 4.90 + 8.40, which equals 21.70
Next,
the two mean sums of squares are calculated by dividing by the degrees
of freedom. Because there are three groups, the between groups degrees
of freedom is 3 - 1, or 2. Because each group has 10 observations, the
within groups degrees of freedom is 3 * (10 - 1) or 3 * 9, which is 27.
The
mean between groups sum of squares is 32.60/2 or 16.30
The
mean within groups sum of squares is 21.70/27 or .803704
The
F ratio is 16.30/0.8037, or about 20.28
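Here is a sketch (not part of the original lesson) that reproduces
this hand calculation in Python and checks it against SciPy's built-in
oneway ANOVA; it also verifies that the between and within sums of
squares add up to the total sum of squares:

    import numpy as np
    from scipy import stats

    morning   = np.array([7, 6, 8, 6, 7, 6, 8, 6, 5, 7], dtype=float)
    afternoon = np.array([4, 5, 6, 5, 4, 5, 6, 5, 5, 4], dtype=float)
    evening   = np.array([8, 7, 8, 8, 6, 7, 6, 9, 7, 8], dtype=float)
    groups = [morning, afternoon, evening]
    all_scores = np.concatenate(groups)

    grand_mean = all_scores.mean()                                           # 6.30
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)  # 32.60
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)             # 21.70
    ss_total = ((all_scores - grand_mean) ** 2).sum()                        # 54.30
    assert abs(ss_between + ss_within - ss_total) < 1e-9   # partition of variance

    F = (ss_between / 2) / (ss_within / 27)   # df = 2 between, 27 within
    print(round(F, 2))                         # 20.28

    # SciPy computes the same F ratio and reports the p-value directly
    F_check, p_value = stats.f_oneway(morning, afternoon, evening)

    # Effect size: explained variance as a share of total variance
    eta_squared = ss_between / ss_total        # about .60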
5.
Skip
to #8. Determine
the critical value for the test statistic.
6. Skip
to #8. Compare
the obtained value with the critical value.
7. Skip
to #8. Either
reject or retain the null hypothesis based on the following:
If obtained
value > critical value, then reject the null hypothesis -
evidence supports the research hypothesis.
If
obtained value <= critical value, then retain the null
hypothesis - evidence does not support the research hypothesis.
8.
Alternative
to #5-7 - for use with SPSS output:
Compare the reported
p-value (Sig.) with the preset alpha level.
If
p-value < alpha level, then reject the null hypothesis -
evidence
supports the research hypothesis. There is a small chance of committing
a Type I error.
If p-value
>= alpha level, then retain
the null
hypothesis - evidence does not support the research hypothesis. The
chance of committing a Type I error is too large.
In
the first table of the SPSS output below, find the group means and the
grand mean described above. The second table contains the results of
Levene's test, which assesses the equality of the group
variances. The third table contains the ANOVA results, which include
the three sums of squares, the two mean sums of squares, the F ratio,
and the observed p-value. Because the p-value is less than .05, we
reject the null hypothesis and conclude that the three group means are
different - but how are they different? For this we need to conduct
another test, called pairwise comparisons. There are numerous versions
of this test. We'll use the one that the author recommends, the
Bonferroni test. But before that, look over the Excel results,
generated using the ANOVA: Single Factor option on the Data Analysis
dialog box, which is accessed from the Data Analysis command on the
Tools menu (see the earlier note about installing these tools).
Here are the Excel ANOVA results:
Now
for the pairwise comparisons. The Bonferroni test identifies the
specific source of the differences found by the overall ANOVA
test.
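As a rough sketch of the idea in Python (a simplification, not SPSS's
exact procedure: it runs ordinary pairwise t tests against a
Bonferroni-adjusted alpha, whereas SPSS adjusts using the
within-groups error term):

    from itertools import combinations
    from scipy import stats

    sections = {
        "Morning":   [7, 6, 8, 6, 7, 6, 8, 6, 5, 7],
        "Afternoon": [4, 5, 6, 5, 4, 5, 6, 5, 5, 4],
        "Evening":   [8, 7, 8, 8, 6, 7, 6, 9, 7, 8],
    }

    pairs = list(combinations(sections, 2))
    alpha_adjusted = 0.05 / len(pairs)   # .05 / 3 comparisons

    for a, b in pairs:
        t, p = stats.ttest_ind(sections[a], sections[b])
        verdict = "different" if p < alpha_adjusted else "not significantly different"
        print(f"{a} vs {b}: p = {p:.4f} -> {verdict}")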
SPSS's Bonferroni output
reports that the mean score of the Afternoon section
(4.9) is
different from both the Morning (6.6) and Evening (7.4) sections, but
that the difference between Morning and Evening sections' scores is not
statistically significant. Here is a picture of the pattern of mean
scores.
The
effect size is
computed from the ANOVA table as the percentage of explained variance,
denoted by η², which is 32.6/54.3, or about 60.0% - a large effect by
common guidelines (the remaining 40% of the variance is unexplained).
The determination of the relative strength of an effect depends on the
field of study - general guidelines should be weighed against findings
of other researchers.
For
practice understanding the oneway ANOVA and explained and unexplained
variance, visit
http://www.kingsborough.edu/academicDepartments/math/faculty/rsturm/anova/Anova0126.html
or http://www.ruf.rice.edu/~lane/stat_sim/one_way/index.html
Factorial ANOVA - Comparing Group Means Based on Combinations of Independent Variables
We'll
take the application of ANOVA just one step further. Suppose that you
not only want to compare sections of a course but you also want to
compare the level of activity of majors and non-majors in the course.
Now, instead of just three groups, you have six groups - three sections
and two types of students within each section. Again, to make the
example more straightforward, we'll assume equal numbers of majors and
non-majors within the sections. That is, there are exactly five majors
and five non-majors in each of the three sections. This uniformity is
not a requirement, but it does make the results easier to understand.
Before
conducting the comparison of the six group means, let's introduce a few
new terms. First of all, this type of ANOVA is called a factorial ANOVA, and in
particular for the example just described, a two factor or two-way
ANOVA. A factor
is an independent variable that is used to categorize the observations.
When
we compare the means, there are two types of effects that we can
observe. They are called main effects and interaction effects. Main effects
are due to the factors themselves. In this example, there is a main
effect due to Section and another main effect due to Type of
Student. Interaction
effects are
due to the combination of the two factors. For example, if we found
that majors are more actively learning in the morning section and
non-majors are more actively learning in the afternoon section, there
would be an interaction effect. Interactions are designated by a
combination of factors, such as Section * Type.
In
the ANOVA
table, the result of the factorial structure is a separation of the
between groups sum of squares. Because there are more categories of
students, we have more ways to determine where the differences might
arise. In the following example, which uses the same data that we used
earlier, you'll notice that the overall sum of squares is the same.
The
various F ratios and effect sizes are calculated in the same
manner as they were in the oneway ANOVA. The assumptions are
the
same as the oneway ANOVA as well - there are just more subgroups to
check.
Here are the data seen earlier, but
now divided by Type of Student as well.
Here
is the Excel output from the ANOVA: Two-Factor with Replication
command. Note that when using this command, the column and row containing
the headings are required in the range. The values in the Total column
were reported in the previous oneway ANOVA.
Here is the ANOVA
table with the sum of squares for the two main effects and for the
interaction.
Notice
that the differences due to the Section are statistically significant,
which is what we found in the oneway ANOVA, and that the differences
due to Type of Student are statistically significant, but the
interaction effect is not statistically significant. Try computing the
effect sizes for the statistically significant effects. What percentage
of variance is left unexplained?
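A Python version of the factorial analysis can be sketched with
statsmodels; because the lesson shows the major/non-major split only
in its screenshots, the Type assignment below is invented for
illustration, so its F ratios will not match the output above:

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    # Same 30 scores as before; the Major/NonMajor split is hypothetical
    data = pd.DataFrame({
        "score":   [7, 6, 8, 6, 7, 6, 8, 6, 5, 7,
                    4, 5, 6, 5, 4, 5, 6, 5, 5, 4,
                    8, 7, 8, 8, 6, 7, 6, 9, 7, 8],
        "section": ["Morning"] * 10 + ["Afternoon"] * 10 + ["Evening"] * 10,
        "type":    (["Major"] * 5 + ["NonMajor"] * 5) * 3,
    })

    # Two main effects (section, type) plus their interaction (section:type)
    model = ols("score ~ C(section) * C(type)", data=data).fit()
    print(sm.stats.anova_lm(model, typ=2))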
Here
is the same analysis from SPSS. First, here are the descriptive
statistics about
the groups as well as Levene's test for the equality of variances.
Then here are the results of
the ANOVA. Notice that SPSS includes additional information in the
table.
For
the purposes of our example, we just need to focus on the rows labeled
Section, Type, Section * Type, Error, and Corrected Total. Compare
these results to the Excel output displayed above.
Here
is a picture that illustrates the pattern of means.
When
the two lines are roughly parallel, there is no interaction effect.
In this graph, we can see that no matter which section they were in,
the mean scores of Majors exceeded those of Non-Majors. This pattern
represents a main effect. Also, by estimating the midpoints between the
two mean scores for each section, we can see that Morning and Evening
mean scores are higher than the Afternoon mean score.
For
practice understanding the two-way ANOVAs and main effects and
interactions, visit
http://www.kingsborough.edu/academicDepartments/math/faculty/rsturm/anova/Anova0126.html
and choose the Two-Way model.