Hypothesis Testing, Statistical Significance, and Independent t Tests

Hypothesis Testing and Statistical Significance

When a hypothesis is tested by collecting data and comparing statistics from a sample with a predetermined value from a theoretical distribution, like the normal distribution, a researcher makes a decision about whether the null hypothesis should be retained or whether the null hypothesis should be rejected in favor of the research hypothesis. If the null hypothesis is rejected, then the researcher often describes the results as being significant. In describing the importance of the results of the research study, however, there are two types of significance involved - statistical significance and practical/educational significance. Rejecting a null hypothesis results in statistical significance, but not necessarily practical significance.

A statistically significant result is one that is likely to be due to a systematic (i.e., identifiable) difference or relationship, not one that is likely to occur due to chance. No matter how carefully designed the research project is, there is always the possibility that the result is due to something other than the hypothesized factor. The need to control all possible alternative explanations of the observed phenomenon cannot be emphasized enough. Alternative explanations can stem from an unrepresentative sample, some other type of validity threat, or an unknown, confounding factor. The ideal situation is one in which all other possible explanations are ruled out so that the only viable explanation is the research hypothesis.

The level that marks statistical significance (called alpha and designated with the Greek letter α) is completely under the control of the researcher. Norms for different fields exist. For example, α=.05 is generally used in educational research. But what does α=.05 actually mean? The level of statistical significance is the level of risk that the researcher is willing to accept that the decision to reject the null hypothesis may be wrong by misattributing a difference to the hypothesized factor when no difference actually exists. In other words, the level of statistical significance is the level of risk associated with rejecting a true null hypothesis. Selecting α=.05 indicates that the researcher is willing to risk rejecting a true null hypothesis 5 times out of 100, or 1 time out of 20. Referring back to the normal curve, α=.05 divides the area under the curve into two sections - one section where the null hypothesis is retained and another section where the null hypothesis is rejected. Rejecting a true null hypothesis is called committing a Type I error.
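To see what "5 times out of 100" looks like, here is a minimal simulation sketch. It is an illustration under stated assumptions, not part of the lesson's SPSS workflow: both groups are drawn from the same normal population (so the null hypothesis is true by construction), the population standard deviation is treated as known, and a z statistic stands in for the t statistic. The function name and its parameters are made up for this example.

```python
import random
from statistics import NormalDist, mean

def type_i_error_rate(alpha=0.05, n=30, trials=4000, seed=1):
    """Fraction of trials in which a TRUE null hypothesis is rejected."""
    rng = random.Random(seed)
    # Two-tailed critical z value; about 1.96 when alpha = .05.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        # Both groups come from the same population, so the null is true.
        group_a = [rng.gauss(0, 1) for _ in range(n)]
        group_b = [rng.gauss(0, 1) for _ in range(n)]
        # z statistic for a difference in means with known SD = 1:
        # the standard error of the difference is sqrt(1/n + 1/n).
        z = (mean(group_a) - mean(group_b)) / (2 / n) ** 0.5
        if abs(z) > z_crit:
            rejections += 1  # a Type I error
    return rejections / trials

rate = type_i_error_rate()
print(f"Rejected a true null hypothesis in {rate:.1%} of trials")
```

The printed rate should land close to the chosen alpha level. Changing alpha to .01 in the call should pull the rate down accordingly.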

Another type of error that can be made is retaining a false null hypothesis. This is called Type II error. There is also a probability level associated with this type of error, called beta and designated with the Greek letter, β. Associated with β is a probability known as the power of the test, which equals 1 - β. Like the chance of committing a Type I error, the chance of committing a Type II error is also under the control of the researcher. Unlike the Type I error level, which is set directly by the researcher, the Type II error level is determined by a combination of parameters, including the α level, sample size, and anticipated size of the results. Visit this site (http://wise.cgu.edu/power/power_applet.html) or this site (http://www.intuitor.com/statistics/T1T2Errors.html) to explore how these elements are related.
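The relationship among α, sample size, effect size, and power can be sketched with a normal approximation. This is a rough illustration only: it assumes two equal-sized groups, a two-tailed test, and a standardized effect size (the difference in means divided by the standard deviation), and it uses the normal curve rather than the t distribution, so the numbers are approximate. The function name approx_power is hypothetical.

```python
from statistics import NormalDist

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power (1 - beta) of a two-tailed, two-group comparison."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the z statistic when the true standardized
    # difference between the groups equals effect_size.
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability that the statistic lands in either rejection region.
    upper_tail = 1 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail

for n in (20, 50, 100):
    power = approx_power(effect_size=0.5, n_per_group=n)
    print(f"n per group = {n:3d}: power = {power:.2f}, beta = {1 - power:.2f}")
```

Running the loop shows power rising (and β falling) as the sample size grows. Lowering alpha or shrinking the effect size in the call has the opposite effect.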

Here is a table to help you understand Type I and Type II errors. See page 159 for another version of the same table. The decision you will make as a researcher is whether to reject or retain the null hypothesis based on the evidence that you've collected from the sample. This decision is similar, in theory, to the decision a juror makes about the guilt or innocence of a person on trial based on the evidence presented in the case.

Decision (action)	Null is true (not guilty)	Null is false (guilty)
Reject the null hypothesis (convict)	Type I error (convict the innocent) level of statistical significance, α	Correct decision (convict the guilty) power of the test, 1 - β
Retain the null hypothesis (acquit)	Correct decision (acquit the innocent)	Type II error (acquit the guilty) chance of Type II error, β

Remember that whether the null hypothesis is true or false is part of the true state of nature (e.g., the characteristics of the population), which cannot be discovered directly. The decision, or action, is the choice made by the researcher (or the juror) based on the collected evidence. If an error was made, you know which type it must be from your decision: rejecting the null hypothesis can only lead to a Type I error, and retaining it can only lead to a Type II error. The catch is that you can never know without a doubt whether an error, or a correct decision, was made.

Which error is more serious? Does the seriousness of the error depend on the consequences of the decision/action taken? How does this relate to conducting research in an educational setting?

Steps for conducting a test of statistical significance

1. State the null and research hypotheses.
2. Establish the level of statistical significance (alpha level, level of risk for committing a Type I error).
3. Select the appropriate test statistic (see the decision tree on page 166 or https://usffiles.usfca.edu/FacStaff/baab/www/lessons/DecisionTree.html).
4. Check the test's assumptions and then compute the test statistic based on the sample data (obtained value).
5. Determine the critical value for the test statistic.
6. Compare the obtained value with the critical value.
7. Either reject or retain the null hypothesis based on the following.

If obtained value > critical value, then reject the null hypothesis - evidence supports the research hypothesis.
If obtained value <= critical value, then retain the null hypothesis - evidence does not support the research hypothesis.

Comparing the Means of Two Groups - Independent t Tests

Many research projects in education include a comparison of two groups. Examples include comparing two instructional methods, comparing achievement between girls and boys, comparing performance levels of English language learners and non-English language learners, or comparing outcomes for students from high SES schools with those from low SES schools. These types of comparisons involve stating a null hypothesis - there is no difference between the groups - and an alternative (research) hypothesis - there is a difference, which can be non-directional or directional. An example of a directional hypothesis is that girls will score higher than boys on a listening comprehension test. An example of a non-directional hypothesis is that girls will score differently (i.e., either higher or lower) than boys on the test.

Before comparing the two means, we need to determine if the two means are comparable. If you will recall, when the standard deviation was introduced earlier, it was described as a quality-control measure for the mean. As you know, the standard deviation indicates the amount of spread around the mean. Larger standard deviations indicate more spread; smaller standard deviations indicate less spread. Likewise, larger standard deviations indicate a less representative mean and smaller standard deviations indicate a more representative mean. So, before directly comparing the means, we need to compare the standard deviations. Are the two standard deviations similar enough? As we will see shortly, SPSS includes this step in its output labeled as Levene's Test, which tests a prerequisite of the statistical comparison of the means called the assumption of homogeneity of variance (think of equality of spread). Other assumptions that must be met are independence between the two groups and having an underlying normally distributed population. Group independence is a result of the research design. The normality assumption can be checked by inspecting the histogram of each group for symmetry, and moderate violations of it are tolerable if there are enough (usually more than 30) participants in each group.

Let's see what testing two means looks like in practice by conducting an independent t test with the 2006 API data for San Francisco elementary schools. In this test, we are comparing API scores for elementary schools with a significant number of English learners (EL) with API scores for elementary schools without a significant number of ELs. When using SPSS to conduct a t test, the steps are easier than those listed in the text. The result is equivalent, but instead of relying on a table of critical values (Table B.2 in the text), SPSS incorporates the values from the table into the statistical output. Instead of comparing the observed t value with the critical t value from the table, you just need to compare the observed p-value with the predetermined alpha level. Here are the steps:

1. State the null and research hypotheses.
   Null hypothesis: There is no difference between mean API scores for EL elementary schools and non-EL elementary schools.
   Research hypothesis: There is a difference between mean API scores for EL elementary schools and non-EL elementary schools.
2. Establish the alpha level.
   We'll set a two-tailed alpha level at α = .05.
3. Select the appropriate test statistic (see the decision tree on page 166 or https://usffiles.usfca.edu/FacStaff/baab/www/lessons/DecisionTree.html).
   The appropriate test to use is a t test. The test statistic will be t.
4. Check the test's assumptions and then compute the test statistic based on the sample data (obtained value).
   The assumption of independence is met because we are comparing different schools.
   The assumption of normality is checked by inspecting the histograms for each group.
   The assumption of homogeneity of variance is tested with Levene's test below. Notice that the reported significance level (Sig.) is .86. Because .86 is greater than .05, the interpretation is that the two standard deviations (83.454 and 96.063) are similar enough for the t test, which allows us to compare the two means (782.81 and 735.55).
5. Skip to #8. Determine the critical value for the test statistic.
   The critical value for the test statistic could be looked up in a table of t values, but SPSS gives the p-value associated with the observed t value instead.
6. Skip to #8. Compare the obtained value with the critical value.
   Again, because SPSS reports the p-value - Sig. (2-tailed) - associated with the observed t value, we can directly compare the associated p-value (.046) to the predetermined alpha level of .05.
7. Skip to #8. Either reject or retain the null hypothesis based on the following.

   If obtained value > critical value, then reject the null hypothesis - evidence supports the research hypothesis.
   If obtained value <= critical value, then retain the null hypothesis - evidence does not support the research hypothesis.

8. Alternative to #5-7 - for use with SPSS output:
   Compare the reported p-value (Sig.) with the preset alpha level.
   If p-value < alpha level, then reject the null hypothesis - evidence supports the research hypothesis. There is a small chance of committing a Type I error.
   If p-value >= alpha level, then retain the null hypothesis - evidence does not support the research hypothesis. The chance of committing a Type I error is too large.
   In this example, .046 < .05, so we reject the null hypothesis and conclude that there is a statistically significant difference between the API scores of the two groups of schools.
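The two decision routes (critical value in #5-7, p-value in #8) can be sketched side by side. This is only an illustration using numbers quoted in this lesson: the mean difference (47.263) and Std. Error Difference (23.227) from the SPSS output, the p-value of .046, and the critical t value of 1.9966 that the lesson quotes in its confidence interval discussion. The function names are made up for this sketch, and the absolute value is an assumption that the test is two-tailed.

```python
def decide_by_critical_value(obtained, critical):
    """#5-7: compare the obtained value with the critical value.

    The absolute value is used because the test here is two-tailed.
    """
    return "reject" if abs(obtained) > critical else "retain"

def decide_by_p_value(p_value, alpha=0.05):
    """#8: compare the reported p-value with the preset alpha level."""
    return "reject" if p_value < alpha else "retain"

# Values reported in this lesson.
mean_difference = 47.263
std_error_difference = 23.227
critical_t = 1.9966
reported_p = 0.046

# The obtained t is the observed difference in standard-error units.
obtained_t = mean_difference / std_error_difference
print(f"obtained t = {obtained_t:.3f}")
print("critical-value route:", decide_by_critical_value(obtained_t, critical_t))
print("p-value route:", decide_by_p_value(reported_p))
```

Both routes lead to the same decision, which is why SPSS users can skip the table lookup and read the Sig. (2-tailed) column directly.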

Confidence Intervals and the Standard Error of the Mean or Difference
Hypothesis testing involves a point estimation and results in a decision about rejecting or retaining the null hypothesis. An equivalent alternative to this approach involves an interval instead of a point. The interval is called a confidence interval and has a researcher-determined percentage associated with it. The percentage is the level of confidence that the true difference is within the interval. This is not a probability that the true difference is within the interval - the true difference either is or isn't within the interval and because we'll never be absolutely certain what the true difference is, we'll never know if it is within the interval. The 95% indicates that if you repeated this test many times, 95% of the intervals would contain the true difference and 5% of the intervals would not.
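The "repeat the test many times" idea can be tried out with a small simulation. The sketch below is illustrative only: the true difference, standard deviation, and sample size are invented values, the standard deviation is treated as known, and a z-based interval stands in for the t-based interval that SPSS reports.

```python
import random
from statistics import NormalDist, mean

def coverage_rate(true_diff=5.0, sd=10.0, n=40, level=0.95, trials=3000, seed=2):
    """Share of simulated confidence intervals that contain the true difference."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    std_error = sd * (2 / n) ** 0.5  # known-SD standard error of the difference
    hits = 0
    for _ in range(trials):
        group_a = [rng.gauss(true_diff, sd) for _ in range(n)]
        group_b = [rng.gauss(0.0, sd) for _ in range(n)]
        observed = mean(group_a) - mean(group_b)
        low, high = observed - z * std_error, observed + z * std_error
        if low <= true_diff <= high:
            hits += 1  # this interval captured the true difference
    return hits / trials

coverage = coverage_rate()
print(f"{coverage:.1%} of the intervals contained the true difference")
```

Each simulated study produces a different interval, and roughly 95% of them should capture the true difference. Any single interval either contains it or does not.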

Note in the table above that the 95% confidence interval for the t test is reported as ranging from a low of .889 to a high of 93.636. These numbers indicate that the difference between the two mean API scores could plausibly be as little as .889 or as large as 93.636. Notice that 0 lies outside the interval (0 is less than .889), so a difference of 0 is not a plausible value at this level of confidence, which agrees with the decision to reject the null hypothesis.

You might wonder what these interval endpoints (.889 and 93.636) are based on. First, the center of these numbers is 47.263, which is the observed difference between the two mean API scores. Try subtracting these numbers (93.636 - .889 = 92.747), dividing the result by 2 (92.747 / 2 = 46.3735), and then adding that result to .889 (46.3735 + .889 = 47.2625, which rounds to 47.263 -- the observed difference). So, the midpoint of the confidence interval is the observed difference between the two means that we are comparing.

How is the length of the interval determined? The answer to this question is based on a theoretical probability distribution, similar to the standard normal distribution, called Student's t distribution. Recall that we found z scores associated with certain probabilities. We can find t values associated with probabilities as well. For this comparison, the t value associated with 95% is just under 2 - actually 1.9966. So, we should construct an interval that is approximately two "standard deviations" above and below the observed difference of 47.263.

What is the standard deviation of the difference? This number is shown in the table above - it is labeled Std. Error Difference. Technically, it is based on a weighted average of the two sample standard deviations (the pooled standard deviation), adjusted for the two sample sizes. More about this in the next paragraph. To determine the endpoints, start with the observed difference (47.263), subtract from it 1.9966 * Std. Error (47.263 - (1.9966 * 23.227) = .889) -- this is the left endpoint of the confidence interval, and finally repeat the same calculation only add instead of subtract (47.263 + (1.9966 * 23.227) = 93.636) -- this is the right endpoint of the confidence interval. If this comparison of two mean API scores were repeated many times, about 95% of the intervals built this way would contain the true difference; for this sample, the interval runs from .889 to 93.636 points.
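The endpoint arithmetic above can be checked with a few lines of code. This is a small sketch that uses only the three numbers quoted in this lesson (observed difference 47.263, Std. Error Difference 23.227, t value 1.9966). The function name is made up for the example.

```python
def confidence_interval(observed_difference, std_error, t_value):
    """Return (low, high) as observed difference -/+ t_value * std_error."""
    margin = t_value * std_error
    return observed_difference - margin, observed_difference + margin

low, high = confidence_interval(47.263, 23.227, 1.9966)
midpoint = (low + high) / 2

print(f"95% CI: {low:.3f} to {high:.3f}")
print(f"midpoint: {midpoint:.3f}")
print("interval contains 0" if low <= 0 <= high else "interval does not contain 0")
```

The computed endpoints agree with the reported .889 and 93.636 to within rounding of the inputs, and the midpoint recovers the observed difference.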

More about the standard error. Notice in the first table above, after the standard deviations, there is a column labeled Std. Error Mean. The numbers in this column are calculated by dividing the standard deviation by the square root of the sample size. For example, 12.046 = 83.454 / 6.928, where 6.928 is the square root of 48. What does this number represent? Recall the simulation shown in the last module of drawing samples from a population and then calculating and plotting the mean of each sample. The result of this process is a distribution of sample means. The standard deviation of the distribution of sample means is the standard error of the mean, which is denoted SEM. Think of it the same way that you do the standard deviation for a regular sample - the only difference is that instead of a sample of observations, we are referring to a sample of means derived from multiple samples, all of the same size. You might revisit this site (http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html) and generate some more distributions of sample means.
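Here is a short sketch of both ideas in this paragraph: the SEM formula applied to the numbers from the table (83.454 and n = 48), and a small simulation of a distribution of sample means. The simulated population (mean 50, standard deviation 10, samples of size 25) is invented purely for illustration.

```python
import math
import random
from statistics import mean, stdev

def standard_error_of_mean(sd, n):
    """SEM = standard deviation divided by the square root of the sample size."""
    return sd / math.sqrt(n)

# Values from the lesson's first table.
reported_sem = standard_error_of_mean(83.454, 48)
print(f"SEM from the lesson's values: {reported_sem:.3f}")

# Simulation with a hypothetical population: draw many samples of the same
# size, record each sample's mean, then look at the spread of those means.
rng = random.Random(3)
population_sd, sample_size = 10.0, 25
sample_means = [
    mean([rng.gauss(50.0, population_sd) for _ in range(sample_size)])
    for _ in range(2000)
]
simulated_sem = stdev(sample_means)
formula_sem = standard_error_of_mean(population_sd, sample_size)
print(f"SD of the simulated sample means: {simulated_sem:.2f}")
print(f"SEM predicted by the formula:     {formula_sem:.2f}")
```

The standard deviation of the simulated means should come out close to the value the formula predicts, which is the sense in which the SEM is "the standard deviation of the distribution of sample means."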

One limitation of the confidence intervals reported here is that they correspond only to two-tailed tests. For a one-tailed test, where the alternative hypothesis is directional, you need to compare the observed p-value to the predetermined significance (alpha) level. To learn more about confidence intervals, see http://www.ruf.rice.edu/~lane/stat_sim/conf_interval/index.html.

Effect Size
All this work and we really only determined that the observed difference between the two mean API scores is probably not due to chance. In other words, we found a statistically significant difference between the two mean API scores -- meaning that 782.81 is statistically significantly larger than 735.55. But, we have not determined how important this difference should be for educators. Is this 47-point difference really meaningful? To answer that question, we need to calculate another statistic, called Cohen's d, which is a measure of effect size. Based on Jacob Cohen's work, the following strengths of effect sizes have been determined for educational research:

small effect	.00 to .20
medium effect	.20 to .50
large effect	.50 and higher

Cohen's d is calculated using the following formula:

d = (Mean₁ - Mean₂) / Standard Deviation

It doesn't really matter which mean is Mean₁ and which is Mean₂; call the larger mean Mean₁ to work with positive effect sizes. Which standard deviation to use is a matter of debate. In some situations, you should use the control group's standard deviation. In other cases, you should use a weighted average of the two sample standard deviations. This weighted average is called the pooled standard deviation - you can see an example formula on page 181. [While you are on page 181, please note that there is an error on page 180 at the bottom -- the word numerator should be denominator.]
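As a rough sketch of the calculation, the code below divides the difference between the two means by a standard deviation. Because the choice of standard deviation is debated, the pooling used here is an assumption: it is the square root of the average of the two squared standard deviations, which ignores the group sizes. It is not necessarily the formula on page 181 or the one used by any particular calculator. The means and standard deviations are the ones reported earlier in this lesson, and the function names are made up.

```python
import math

def cohens_d(mean_1, mean_2, sd):
    """Difference between two means expressed in standard deviation units."""
    return (mean_1 - mean_2) / sd

def unweighted_pooled_sd(sd_1, sd_2):
    """Root mean square of two SDs (an assumed, equal-weight pooling)."""
    return math.sqrt((sd_1 ** 2 + sd_2 ** 2) / 2)

# Means and standard deviations reported in this lesson.
sd = unweighted_pooled_sd(83.454, 96.063)
d = cohens_d(782.81, 735.55, sd)

print(f"pooled SD (unweighted): {sd:.2f}")
print(f"Cohen's d: {d:.2f}")
```

Swapping in a different standard deviation (for example, one group's own) changes d somewhat, which is why it helps to report which one you used.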

To calculate the effect size, you can also use the effect size calculator mentioned in the text at http://web.uccs.edu/lbecker/Psy590/escalc3.htm. Doing so gives the following result.

So, the effect size is approximately one-half of a standard deviation, which is a large effect based on Salkind's criteria. If we compute the percentile associated with a z score of .5, we would get about the 70th percentile, which can be interpreted to indicate that the mean API score of the EL elementary schools is higher than 70% of the non-EL schools.
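The percentile interpretation can be reproduced with the standard normal curve. This is a minimal sketch: it treats the effect size as a z score, as the paragraph above does, and the cut-offs in effect_label follow the small/medium/large table in this lesson. Assigning the boundary values .20 and .50 to the higher category is an assumption, since the table's ranges touch at those points.

```python
from statistics import NormalDist

def effect_label(d):
    """Label an effect size using the cut-offs listed in this lesson."""
    size = abs(d)
    if size < 0.20:
        return "small"
    if size < 0.50:
        return "medium"
    return "large"

def percentile_for(d):
    """Percentile of the standard normal curve at z = d."""
    return NormalDist().cdf(d) * 100

d = 0.5  # approximately the effect size discussed above
print(f"d = {d}: {effect_label(d)} effect")
print(f"percentile for z = {d}: {percentile_for(d):.1f}")
```

For z = .5 the percentile comes out just under 70, matching the "about the 70th percentile" figure.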

Excel has a TTEST function that has the following syntax: =TTEST(array1,array2,tails,type).
Here is an example TTEST formula for comparing two sets of scores, provided in cells A2:A21 and B2:B31, respectively.
=TTEST(A2:A21,B2:B31,2,3)
The 2 indicates a two-tailed test and the 3 indicates a test that doesn't assume that the sample SDs are equal. See Excel's help information for other options for these parameters.
The result of the formula is a p-value, which can be compared to the significance level (α) to decide whether to reject or retain the null hypothesis. Here is what the spreadsheet looks like.
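For readers without Excel, the same kind of test (two-tailed, not assuming equal SDs, often called Welch's t test) can be sketched in plain Python. Treat this as an illustration, not a replacement for SPSS or Excel: the two score lists are invented, the function names are made up, and the p-value is approximated by numerically integrating Student's t density with Simpson's rule rather than by calling a statistics package.

```python
import math
from statistics import mean, variance

def t_two_tailed_p(t, df, steps=2000):
    """Approximate two-tailed p-value from Student's t density (Simpson's rule)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    # Log of the normalizing constant of the t density.
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def density(x):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

    h = upper / steps  # steps must be even for Simpson's rule
    total = density(0.0) + density(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    central_area = 2 * (h / 3) * total  # area between -|t| and +|t|
    return max(0.0, 1 - central_area)

def welch_t_test(sample_1, sample_2):
    """Return (t, df, two-tailed p) without assuming equal variances."""
    n1, n2 = len(sample_1), len(sample_2)
    v1, v2 = variance(sample_1) / n1, variance(sample_2) / n2
    t = (mean(sample_1) - mean(sample_2)) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df, t_two_tailed_p(t, df)

# Hypothetical scores, for illustration only.
scores_a = [72, 85, 90, 66, 78, 81, 94, 70, 88, 76]
scores_b = [65, 70, 82, 60, 74, 69, 77, 63, 71, 68, 75, 66]

t, df, p = welch_t_test(scores_a, scores_b)
alpha = 0.05
print(f"t = {t:.3f}, df = {df:.1f}, p = {p:.4f}")
print("reject the null hypothesis" if p < alpha else "retain the null hypothesis")
```

As with the Excel formula, the final step is the comparison of the p-value with the preset alpha level.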