Hypothesis Testing, Sampling, Types of Errors, Statistical Significance, Effect Size, and Power

Descriptive and Inferential Statistics

The study and use of statistics is roughly divided into two categories: descriptive statistics and inferential statistics. As previous sections have demonstrated, descriptive statistics summarize, organize, and illustrate distributions of numerical observations. Among typically employed descriptive statistics are measures of central tendency (e.g., mean, median, mode), measures of dispersion (e.g., range, interquartile range, variance, standard deviation), and bivariate measures (e.g., correlation coefficients). Within the category of descriptive statistics are the visual displays of distributions, including frequency tables, histograms, pie charts, bar charts, and scatterplots.

So what is different about inferential statistics? Instead of simply attempting to summarize and organize a distribution of data, inferential statistics are used to extend results generated from a subgroup of observations to a larger context - in other words, the purpose of inferential statistics is to generalize from samples to populations. This section will describe the fundamental concepts of the process of using inferential statistics.

A research hypothesis is the focal point of statistical inference. A research hypothesis makes a prediction, based on a theoretical foundation, that identifies an anticipated outcome - for example, higher reading scores due to a new instructional technique. The research hypothesis is generated from a research question, which is usually broader. An example of a research question might ask whether there is a difference between a new way of teaching and a traditional method. Research questions, in turn, are narrower investigations of a more general research purpose, which seeks to address an important problem in the field. So, you can think of a research hypothesis as the end of a series of progressively narrower, more focused statements about a research topic.

A research hypothesis aids the research study by providing a very specific and precise prediction to test. The test of the hypothesis involves sampling and then collecting data. The nature of the sample affects the ability of the researcher to generalize the results. A full introduction to sampling techniques is beyond the scope of this introduction, but the important aspect of sampling for statistical purposes involves the concept of representation. A quantitative researcher intends to apply her results to a larger group than the one she studied. The larger group is called the population. In order to generalize, the group involved in the study, called the sample, must resemble the population on variables that are important to the research.

The simplest way to sample is randomly - the equivalent of pulling names out of a hat. Random sampling is usually costly in terms of time, money, and logistics, so a compromise is often made for practical reasons. The compromise, and the act of sampling itself, introduce error into the process; in other words, the smaller sample may differ from the population in some important way. A researcher cannot know whether this has actually happened, though. Instead, the researcher uses statistics to describe the sample's characteristics and then infers from those sample characteristics what the properties of the larger population are. For example, testing a new reading method on a properly selected subset of all third-grade students allows a researcher to make statements about the value of the reading program for all third-grade students.

Numbers pertaining to the sample are called statistics, and numbers pertaining to the population are called parameters. Statisticians use different symbols to distinguish the two sets of numbers. Latin symbols (e.g., X̄, s, r) are used to represent sample statistics, and Greek symbols (e.g., μ, σ, ρ) [named mu, sigma, and rho] are used to represent population parameters. Here are several relationships to remember:

Symbols for the mean: X̄ (sample mean) is used to estimate μ (population mean)
Symbols for the standard deviation: s (sample standard deviation) is used to estimate σ (population standard deviation)
Symbols for the Pearson correlation coefficient: r (sample correlation) is used to estimate ρ (population correlation)
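
The relationship between statistics and parameters can be illustrated with a short Python sketch that uses only the standard library. Everything in it is an assumption made for illustration: the population is simulated (hypothetical practice hours and reading scores for 10,000 students), which is the only reason its parameters can be computed at all. In real research the parameters are unknown and only the sample statistics are available.

```python
import math
import random
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: practice hours and reading scores (invented values).
hours = [rng.gauss(5, 2) for _ in range(10_000)]
scores = [70 + 3 * h + rng.gauss(0, 8) for h in hours]

# Population parameters (computable here only because the population is simulated).
mu = statistics.fmean(scores)
sigma = statistics.pstdev(scores)   # population formula, divides by N
rho = pearson_r(hours, scores)

# Simple random sample of 100 students: the "names out of a hat" approach.
sample_ids = rng.sample(range(len(scores)), k=100)
sample_hours = [hours[i] for i in sample_ids]
sample_scores = [scores[i] for i in sample_ids]

# Sample statistics, used as estimates of the parameters above.
x_bar = statistics.fmean(sample_scores)
s = statistics.stdev(sample_scores)  # sample formula, divides by n - 1
r = pearson_r(sample_hours, sample_scores)

print(f"mean:        sample {x_bar:.2f}   population {mu:.2f}")
print(f"std. dev.:   sample {s:.2f}   population {sigma:.2f}")
print(f"correlation: sample {r:.2f}   population {rho:.2f}")
```

Rerunning the sketch with a different seed or a smaller sample shows how far the estimates can drift from the parameters because of sampling alone.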

As a researcher who is interested in improving the state of the field, you might propose to start your investigation by assuming your research hypothesis is indeed correct and looking for evidence to the contrary. This approach might result in a less than objective view of the research study. Instead, researchers start with what is called a null hypothesis, which represents the status quo - for example, that there are no differences between a new method and a traditional one. The null hypothesis makes a statement about the population that is not directly testable. The statement is untestable in practice because of time, money, and logistical constraints; it is not untestable in theory. If it were logistically possible, the new method could be tested on the entire population and the need to make an inference would not exist.

Here are some examples of null hypotheses:

Comparison of treatment (t) and control (c) group means
    H₀: μ_t = μ_c

Comparison of pre-test (pre) and post-test (post) means
    H₀: μ_pre = μ_post

Test of linear relationship between age (a) and experience (e)
    H₀: ρ_ae = 0

Notice that null hypotheses assume equality among groups, between variables, or over time. Researchers then collect data to see if there is evidence to the contrary. In this way, the null hypothesis provides a benchmark for assessing whether observed differences are due to chance or some other factor/variable (i.e., systematic differences).

Here are a few examples of research hypotheses, also called alternative hypotheses. These are directly testable because they refer to the sample statistics using Latin symbols, and not the population parameters. Notice that these research hypotheses contradict the equality represented by the null hypotheses. Research hypotheses can be directional (i.e., predicting that one statistic is greater than or less than the other) or they can be non-directional (i.e., predicting that the two statistics are different but not specifying how they differ).

Directional hypothesis that treatment (t) group mean is greater than control (c) group mean
    H₁: X̄_t > X̄_c

Non-directional hypothesis that pre-test (pre) and post-test (post) means are different (i.e., not equal)
    H₁: X̄_pre ≠ X̄_post

Directional hypothesis that a direct (i.e., positive) linear relationship between age (a) and experience (e) exists
    H₁: r_ae > 0

Well-written hypotheses should have the following characteristics. They should:

Be declarative statements making specific predictions.
Identify a specific expected relationship.
Have a firm theory or literature base.
Be concise and to the point.
Be testable - allowing for the collection of data measuring variables in a systematic, unambiguous way.

Hypothesis Testing and Statistical Significance

After posing null and alternative hypotheses, researchers collect data and then test whether the data supports or refutes the hypothesis. When a hypothesis is tested by collecting data and comparing statistics from a sample with a predetermined value from a theoretical distribution, like the normal distribution, a researcher makes a decision about whether the null hypothesis should be retained or whether the null hypothesis should be rejected in favor of the research hypothesis. If the null hypothesis is rejected, then the researcher often describes the results as being significant. In describing the importance of the results of the research study, however, there are two types of significance involved - statistical significance and practical/educational significance. Rejecting a null hypothesis results in statistical significance, but not necessarily practical significance.

A statistically significant result is one that is likely to be due to a systematic (i.e., identifiable) difference or relationship, not one that is likely to occur due to chance. No matter how carefully designed the research project is, there is always the possibility that the result is due to something other than the hypothesized factor. The need to control all possible alternative explanations of the observed phenomenon cannot be emphasized enough. Alternative explanations can stem from an unrepresentative sample, some other type of validity threat, or an unknown, confounding factor. The ideal situation is one in which all other possible explanations are ruled out so that the only viable explanation is the research hypothesis.

The level that marks statistical significance (called alpha and designated with the Greek letter α) is completely under the control of the researcher. Norms for different fields exist. For example, α=.05 is generally used in educational research. But what does α=.05 actually mean? The level of statistical significance is the level of risk that the researcher is willing to accept that the decision to reject the null hypothesis may be wrong by mis-attributing a difference to the hypothesized factor when no difference actually exists. In other words, the level of statistical significance is the level of risk associated with rejecting a true null hypothesis. Selecting α=.05 indicates that the researcher is willing to risk being wrong in the decision to reject the null hypothesis 5 times out of 100, or 1 time out of 20. Referring back to the normal curve, α=.05 divides the area under the curve into two sections - one section where the null hypothesis is retained and another section where the null hypothesis is rejected. Rejecting a true null hypothesis is called committing a Type I error.
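
One way to see what "5 times out of 100" means is to simulate many studies in which the null hypothesis is actually true and count how often it gets rejected anyway. The Python sketch below does this under stated assumptions: a two-tailed z-test, a population standard deviation that is treated as known, and invented values for the population mean, the standard deviation, and the sample size.

```python
import math
import random
from statistics import NormalDist, fmean

ALPHA = 0.05
N = 25          # sample size per simulated study (hypothetical)
MU_0 = 100.0    # population mean under the null hypothesis (hypothetical)
SIGMA = 15.0    # population standard deviation, assumed known

# Two-tailed critical value: the null is rejected when |z| exceeds it.
z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)

def study_rejects_null(rng):
    """Simulate one study in which the null hypothesis is actually true."""
    sample = [rng.gauss(MU_0, SIGMA) for _ in range(N)]
    z = (fmean(sample) - MU_0) / (SIGMA / math.sqrt(N))
    return abs(z) > z_crit

rng = random.Random(1)  # fixed seed so the sketch is repeatable
n_studies = 20_000
type_i_errors = sum(study_rejects_null(rng) for _ in range(n_studies))
type_i_rate = type_i_errors / n_studies

print(f"critical z = {z_crit:.3f}")
print(f"proportion of true null hypotheses rejected = {type_i_rate:.3f}")
```

Because every simulated rejection here is a Type I error, the printed proportion should land close to the chosen α. Changing ALPHA to .01 or .10 moves the critical value and the error rate together.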

Another type of error that can be made is retaining a false null hypothesis. This is called a Type II error. The probability of this type of error is called beta and designated with the Greek letter β. Associated with β is a probability known as the power of the test, which equals 1 - β; power is the probability of correctly rejecting a false null hypothesis. Like the chance of committing a Type I error, the chance of committing a Type II error is also under the control of the researcher. Unlike the Type I error level, which is set directly by the researcher, the Type II error level is determined by a combination of parameters, including the α level, sample size, and anticipated size of the results. Visit this site (http://wise.cgu.edu/power/power_applet.html) or this site (http://www.intuitor.com/statistics/T1T2Errors.html) to explore how these elements are related.
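
The way α, sample size, and the anticipated size of the result combine to determine power can be sketched for one simple case. The function below is an illustration only: it assumes a one-tailed, one-sample z-test with a known population standard deviation, and the name power_one_tailed_z and all input values are hypothetical. Other tests need other formulas.

```python
import math
from statistics import NormalDist

def power_one_tailed_z(effect_size, n, alpha=0.05):
    """Power (1 - beta) of a one-tailed, one-sample z-test.

    effect_size is the true mean difference in standard deviation units.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)
    # Under the alternative, the z statistic is centered at effect_size * sqrt(n),
    # so power is the area of that shifted curve beyond the critical value.
    return 1 - std_normal.cdf(z_crit - effect_size * math.sqrt(n))

# Power and beta for an anticipated medium-sized difference at several sample sizes.
for n in (10, 25, 50, 100):
    power = power_one_tailed_z(effect_size=0.5, n=n)
    print(f"n = {n:3d}: power = {power:.3f}, beta = {1 - power:.3f}")
```

Holding everything else constant, power rises when the sample gets larger, when the anticipated difference gets larger, or when α is made less strict. When the true difference is zero, the rejection probability is simply α.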

Here is a table to help you understand Type I and Type II errors. See page 149 for another version of the same table. The decision you will make as a researcher is whether to reject or retain the null hypothesis based on the evidence that you've collected from the sample. This decision is similar, in theory, to the decision a juror makes about the guilt or innocence of a person on trial based on the evidence presented in the case.

Decision (action)	Null is true (not guilty)	Null is false (guilty)
Reject the null hypothesis (convict)	Type I error (convict the innocent); probability = α, the level of statistical significance	Correct decision (convict the guilty); probability = 1 - β, the power of the test
Retain the null hypothesis (acquit)	Correct decision (acquit the innocent)	Type II error (acquit the guilty); probability = β, the chance of a Type II error

Remember that whether the null hypothesis is true or false is the true state of nature (e.g., the characteristics of the population), which cannot be discovered directly. The decision, or action, is the choice made by the researcher (or the juror) based on the collected evidence. If an error was made, you will know which type it was, because the null hypothesis was either rejected (a possible Type I error) or retained (a possible Type II error). The catch is that you can never know without a doubt whether an error, or a correct decision, was made.
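
The four cells of the decision table can also be written as a small lookup, which some readers find easier to trace than the table itself. This is only an illustrative sketch, and the function name classify_outcome is hypothetical. Note that the first argument, the true state of nature, is exactly the piece of information a researcher never has.

```python
def classify_outcome(null_is_true, reject_null):
    """Return the cell of the decision table for one study."""
    if reject_null:
        # Rejecting is wrong only when the null hypothesis is actually true.
        return "Type I error" if null_is_true else "correct decision (power)"
    # Retaining is wrong only when the null hypothesis is actually false.
    return "correct decision" if null_is_true else "Type II error"

# Walk through all four combinations of truth and decision.
for null_is_true in (True, False):
    for reject_null in (True, False):
        state = "null is true" if null_is_true else "null is false"
        action = "reject" if reject_null else "retain"
        print(f"{state:13s} | {action:6s} | {classify_outcome(null_is_true, reject_null)}")
```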

Which error is more serious? Does the seriousness of the error depend on the consequences of the decision/action taken? How does this relate to conducting research in an educational setting?

Steps for conducting a test of statistical significance

State the null and research hypotheses.
Establish the level of statistical significance (alpha level, level of risk for committing a Type I error).
Select the appropriate test statistic (see the flowchart inside the back cover or visit here for a similar, computer-based form of the flowchart: https://usffiles.usfca.edu/FacStaff/baab/www/lessons/DecisionTree.html).
Check the test's assumptions and then compute the test statistic based on the sample data (obtained value).
Determine the critical value for the test statistic.
Compare the obtained value with the critical value.
Either reject or retain the null hypothesis based on the following.

If obtained value > critical value, then reject the null hypothesis - evidence supports the research hypothesis.
If obtained value <= critical value, then retain the null hypothesis - evidence does not support the research hypothesis.
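
The steps above can be traced in a short Python sketch. It is a simplified illustration, not a general recipe: it assumes a directional one-sample z-test with a known population standard deviation, and the null value, the standard deviation, and the sixteen sample scores are all invented for the example. In practice the appropriate test statistic depends on the design, which is what the flowchart in step 3 is for.

```python
import math
from statistics import NormalDist, fmean

# Step 1: state the hypotheses (directional).
#   Null: the population mean equals 100.
#   Research: the mean is greater than 100.
MU_0 = 100.0
SIGMA = 15.0   # population standard deviation, assumed known for a z-test

# Step 2: establish the level of statistical significance.
ALPHA = 0.05

# Hypothetical sample scores, invented purely for illustration.
sample = [112, 98, 105, 121, 109, 94, 117, 103,
          110, 126, 99, 108, 115, 101, 119, 107]

# Steps 3 and 4: select the test statistic (z) and compute the obtained value.
standard_error = SIGMA / math.sqrt(len(sample))
obtained = (fmean(sample) - MU_0) / standard_error

# Step 5: determine the critical value for a one-tailed test at ALPHA.
critical = NormalDist().inv_cdf(1 - ALPHA)

# Steps 6 and 7: compare the two values and make the decision.
if obtained > critical:
    decision = "reject the null hypothesis"
else:
    decision = "retain the null hypothesis"

print(f"obtained value = {obtained:.3f}, critical value = {critical:.3f}")
print(f"decision: {decision}")
```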

It often helps to see a process unfold through an animation. To view what happens during the sampling process, visit this site (http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html), read the instructions, and experiment with drawing different sized samples from different types of population distributions.
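
If the simulation site is unavailable, a rough version of the same experiment can be run in Python. The sketch below is an assumption-laden stand-in, not a reproduction of that site: it builds a hypothetical, strongly skewed population, repeatedly draws random samples of different sizes, and reports how much the sample means vary from one sample to the next.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical, strongly skewed population (exponential, mean near 10).
population = [rng.expovariate(1 / 10) for _ in range(50_000)]
mu = statistics.fmean(population)

def sample_means(n, draws=2_000):
    """Means of `draws` random samples, each of size n, from the population."""
    return [statistics.fmean(rng.sample(population, n)) for _ in range(draws)]

spread_by_n = {}
for n in (5, 25, 100):
    means = sample_means(n)
    spread_by_n[n] = statistics.stdev(means)
    print(f"n = {n:3d}: average of sample means = {statistics.fmean(means):.2f}, "
          f"spread of sample means = {spread_by_n[n]:.2f}")

print(f"population mean = {mu:.2f}")
```

The pattern to look for is that the sample means cluster around the population mean at every sample size, while their spread shrinks as the samples get larger.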

Effect Size

Determining whether to reject or retain a null hypothesis is only part of the goal. If your hypothesis test suggested that you should reject the null hypothesis, then you found a statistically significant result. But is that result really important? This question is answered by calculating another statistic, called Cohen's d, which is a measure of effect size. See Module 32 for examples of different types of effect size calculations. Based on Jacob Cohen's work, the following strengths of effect sizes have been determined for educational research:

small effect	.00 to .20
medium effect	.20 to .50
large effect	.50 and higher

One type of effect size is Cohen's d, which is calculated using the following formula:

    d = (Mean₁ - Mean₂) / SD

It doesn't really matter which mean is Mean₁ and which is Mean₂; call the larger mean Mean₁ to work with positive effect sizes. Which standard deviation (SD) to use is a matter of debate. In some situations, you should use the control group's standard deviation. In other cases, you should use a weighted average of the two sample standard deviations. This weighted average is called the pooled standard deviation - you can see an example formula on page 365.
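
As a rough sketch, Cohen's d (the difference between the two means divided by a standard deviation) can be computed in a few lines of Python. Several things here are assumptions rather than the textbook's own method: the pooled standard deviation uses one common formula that weights each group's variance by its degrees of freedom and may differ from the version on page 365, the function names are hypothetical, the group values are invented, and the cut-offs in describe_effect resolve the shared boundaries in the table above by assigning .20 to "medium" and .50 to "large".

```python
import math

def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation, weighting each variance by n - 1 (one common form)."""
    return math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))

def cohens_d(mean1, mean2, sd):
    """Difference between two means in standard deviation units."""
    return (mean1 - mean2) / sd

def describe_effect(d):
    """Label an effect size using the ranges in the table above (boundaries assumed)."""
    size = abs(d)
    if size < 0.20:
        return "small"
    if size < 0.50:
        return "medium"
    return "large"

# Hypothetical groups, invented for illustration.
treatment_mean, treatment_sd, treatment_n = 85.0, 10.0, 30
control_mean, control_sd, control_n = 80.0, 10.0, 30

# Option 1: standardize by the control group's standard deviation.
d_control = cohens_d(treatment_mean, control_mean, control_sd)

# Option 2: standardize by the pooled standard deviation.
sd_pooled = pooled_sd(treatment_sd, treatment_n, control_sd, control_n)
d_pooled = cohens_d(treatment_mean, control_mean, sd_pooled)

print(f"d using control SD = {d_control:.2f} ({describe_effect(d_control)})")
print(f"d using pooled SD  = {d_pooled:.2f} ({describe_effect(d_pooled)})")
```

The two options give the same answer in this example only because both invented groups have the same standard deviation; when the groups differ in spread, the choice of denominator changes d.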

To calculate the effect size, you can also use the effect size calculator at http://web.uccs.edu/lbecker/Psy590/escalc3.htm. If you have the following two means and standard deviations, the online effect size calculator gives the following result, d = .53, which would generally be considered a large effect.