Chi-Square Tests and Other Nonparametric (Distribution-Free) Tests

Parameters Revisited

When the concept of sampling was introduced in this course, two groups were identified - the population and a sample from the population. In the first part of the course, descriptive statistics were derived to illustrate the characteristics of a distribution of numbers. In the second part, statistics were derived in order to make inferences about a population based on observations from a sample. In each of the inferential tests that we've discussed, assumptions were made about the nature of the sample and the underlying distribution, but very little has been explained about what to do if these assumptions are not met. In this module, several new and alternative statistical tests will be presented for circumstances in which the assumptions for the previous tests (e.g., t tests and ANOVAs) have not been met.

Recall that a parameter is a number that refers to a population, and a statistic is a number that refers to a sample. The inferential statistics that we've studied in previous modules have been used to estimate a parameter based on a statistic. All of the dependent variables that we've considered so far have been interval or ratio-level variables. What can you do if you need to test differences or relationships among nominal or ordinal variables? In this module, we will study inferential tests that do not involve the actual numbers that were observed (e.g., test scores). Instead, these tests use either the counts (i.e., frequencies) of observations or the ranks of scores in place of the scores themselves.

For example, in one type of test, we may be interested in measuring a pattern of responses against those previously observed. These two patterns are described by frequency counts, which can be compared. In another example, we'll see how an alternative to a Pearson correlation coefficient can be used with ranks of scores to describe a relationship between two variables.

Tests Involving Nominal Variables

Chi-Square Goodness of Fit Test

Suppose that you manage an academic support service for high school students. The district leadership has been skeptical about adopting your model because they claim that your students (we'll call them tutored students) do not represent the general population of students in the district high schools. To challenge their claim, you need to compare your students with the general characteristics of students in the district. You could conduct a number of independent samples t tests on performance measures (e.g., CAHSEE scores), but how do you compare characteristics, such as gender, SES, ethnicity, and mobility?

Let's start with gender. The district high school student gender composition is 65% female and 35% male. You have been helping 14 female students and 10 male students. How similar or different are these two groups regarding gender composition? To answer this question, you should conduct a chi-square goodness of fit test (or one-sample chi-square test). The specifics for this test are as follows. First, you are interested in a nominal variable (i.e., gender). Second, you have both a current group and a comparison group. In this example, the current group is the 24 students using your service (14 girls, 10 boys), and the comparison group is the total district high school student population, which you know to be 65% female and 35% male. Third, based on these two groups, you compare the two sets of frequencies. Two very important concepts are involved: observed frequency and expected frequency. Observed frequency is the count of observations in each category for the current group - in this case, the 24 students. Expected frequency is the count that each category would have if the current group followed the pattern of the comparison group - in this case, all high school students. The comparison of observed frequencies and expected frequencies is called chi-square analysis. This type of analysis is often displayed using a cross-tabulation (or crosstab) table. Here is the start of the table for this analysis.

	Female	Male	Total
Tutored Students	14	10	24
High School Students

The numbers in the row for Tutored Students are the observed frequencies. What are the expected frequencies? Using the percentages 65% and 35% we can construct expected frequencies to use for comparison. For the girls, 65% of 24 is 15.6 and for the boys 35% of 24 is 8.4. These numbers are used in the row for High School Students.

	Female	Male	Total
Tutored Students	14	10	24
High School Students	15.6	8.4	24

Obviously, the frequencies are different, but how important is the difference? To answer this question, we use a familiar technique - we find the difference between the Observed and Expected frequencies (called a residual), square each of those differences, divide each squared difference by the Expected frequency, and then add up the results. This technique should sound vaguely similar to the calculation of the variance. Residual is another term for deviation.

For this example, the residuals are shown below:

	Female	Male
Residuals	-1.6	1.6

Squaring these residuals and dividing by 15.6 and 8.4, respectively, results in the following:

	Female	Male	Total
Squared Residuals/ Expected Frequency	.164	.305	.469

The value in the Total column is called Χ² (chi-square, pronounced ki as in kite). This is the test statistic for the comparison of the observed and expected frequencies. As with other test statistics, we compare the obtained value with the critical value to determine whether to reject or retain the null hypothesis.

For a chi-square test, the null hypothesis is that the two sets of frequencies (i.e., observed and expected) are equal. The alternative hypothesis is that they are unequal.

The closer the obtained chi-square is to zero, the more similar the two sets of frequencies are - or, stated another way, the better the observed data fit the expected pattern. This interpretation is where the term "goodness of fit" originates. Similar to previous comparisons of obtained and critical values of the test statistic, we can use a table (see Appendix B.4), SPSS, or Excel to determine whether to reject or retain the null hypothesis. The chi-square distribution is similar to the t and F distributions in that it takes different forms based on the degrees of freedom associated with the test. To view the shape of the chi-square distributions for different degrees of freedom, visit this site: http://highered.mcgraw-hill.com/sites/dl/free/0072868244/124727/CSDistribution.html. For a goodness-of-fit test, the degrees of freedom is the number of columns minus 1 (C-1). The Excel formula for obtaining the associated probability for a chi-square value of .469 (derived above) with 1 degree of freedom is given by the following: =CHIDIST(0.469,1). The SPSS output looks like this:

Because the resulting p-value of .494 is greater than .05, we fail to reject the null hypothesis and conclude that the gender pattern of the tutored students does not differ significantly from the district pattern.
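The arithmetic for this goodness-of-fit example can be checked with a short script. What follows is a minimal sketch in Python rather than SPSS or Excel; the variable names are illustrative, and the shortcut used for the p-value (the complementary error function) is valid only when there is exactly 1 degree of freedom.

```python
import math

# Observed counts for the tutored students (from the crosstab table).
observed = {"Female": 14, "Male": 10}
# District-wide proportions used to build the expected counts.
district_share = {"Female": 0.65, "Male": 0.35}

n = sum(observed.values())  # 24 students in total
expected = {g: district_share[g] * n for g in observed}

# Residual = observed - expected; each cell contributes residual^2 / expected.
contributions = {
    g: (observed[g] - expected[g]) ** 2 / expected[g] for g in observed
}
chi_square = sum(contributions.values())
df = len(observed) - 1  # number of columns minus 1

# With exactly 1 degree of freedom, the upper-tail chi-square probability
# equals erfc(sqrt(chi_square / 2)); other df values need a different formula.
p_value = math.erfc(math.sqrt(chi_square / 2))

print({g: round(v, 1) for g, v in expected.items()})
print({g: round(v, 3) for g, v in contributions.items()})
print(round(chi_square, 3), df, round(p_value, 3))
```

Run as written, the script should reproduce the expected frequencies of 15.6 and 8.4, the chi-square value of .469, and a p-value close to the .494 used above.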

You can use this tool (http://www.fourmilab.ch/rpkp/experiments/analysis/chiCalc.html) to compute the p-value for the chi-square statistic you've calculated and its associated degrees of freedom. Scroll down to the section entitled Calculate probability from X² and d and then enter the computed X² amount and the appropriate degrees of freedom.

There are two assumptions that must be met in order to conduct a chi-square test:

The categories for the observations cannot overlap - each observation is in only one category.
Each category needs to have an expected frequency of at least 5. This assumption is difficult to meet when there are many categories and few observations.

The result of the chi-square goodness-of-fit analysis indicates that the percentage of each gender in the tutored students group is not significantly different from the percentage in the full group of district high school students.

Chi-Square Test of Independence

The goodness-of-fit test compares data from one sample with some other established pattern. What if you needed to compare data from two samples or compare frequencies based on a pair of nominal variables?

Suppose that you are studying the instructional styles for different types of courses. You are interested in whether art, humanities, math, and science courses utilize the same predominant instructional style. You might think of this test as a type of correlation. We are interested in whether instructional style is related to course type or, alternatively, whether the two variables are independent. This interpretation is where the term "chi-square test of independence" originates. The null hypothesis that is tested is that the two variables are independent (not related). The alternative hypothesis is that the two variables are related (not independent).

Based on observations, 150 courses were classified as lecture/discussion, collaborative/cooperative, hands-on/interactive, or individualized/self-paced. Shown in the graph to the right are the observed frequencies for each combination of course type and instructional style. Below that, the observed frequency, expected frequency, and residual for each combination are identified in a crosstab table.

Notice in the crosstab table above that the observed and expected row and column totals are all equal. The expected frequencies just redistribute the counts among the cells. For example, 30 out of the 150 courses are Art courses, which represents 20% of the courses. If the pattern of instructional styles is independent of the type of course, then 20% of the total Lecture/Discussion courses should be Art courses; or stated another way, of the 46 Lecture/Discussion courses, there should be 46 * .20 = 9.2 Art courses. This is the expected frequency for the first cell, which represents Lecture/Discussion Art courses. The remaining expected frequencies are calculated similarly.

After the expected frequencies are calculated, they are subtracted from the observed frequencies to obtain the residuals. To derive the chi-square value of 20.739, shown in the lower table, the residuals are each squared, then divided by each cell's expected frequency, and then added up. You might consider entering these numbers into a grid in Excel and manually repeating the calculation to reinforce the derivation of the chi-square statistic.

Once the chi-square statistic is computed, its associated probability (the significance level, or p-value) is derived using a table, Excel, or SPSS. The degrees of freedom for the test are equal to the number of rows minus 1 times the number of columns minus 1, or (R-1)*(C-1). The SPSS output shown to the right reports the Asymptotic Significance in the right-most column. The value shown is .014, which is less than .05, so we reject the null hypothesis. The two variables are related, not independent, meaning that the pattern of instructional styles is not the same for all course types.
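The numbers quoted in this example can be checked without SPSS. The sketch below uses only the values stated in the text (150 courses, 30 Art courses, 46 Lecture/Discussion courses, and a chi-square of 20.739), because the full crosstab table is not reproduced here. The helper name chi2_sf is hypothetical; it relies on a standard recurrence for the chi-square upper-tail probability that works for whole-number degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square value for a whole-number df."""
    y = x / 2.0
    if df % 2 == 0:
        # Even df: start from exp(-y), which is the df = 2 case.
        total, a = math.exp(-y), 1.0
    else:
        # Odd df: start from erfc(sqrt(y)), which is the df = 1 case.
        total, a = math.erfc(math.sqrt(y)), 0.5
    # Each pass through the loop adds 2 degrees of freedom.
    while a < df / 2.0:
        total += math.exp(-y) * y ** a / math.gamma(a + 1.0)
        a += 1.0
    return total

total_courses = 150
art_total = 30        # marginal total for Art courses
lecture_total = 46    # marginal total for Lecture/Discussion courses

# Expected count under independence: the product of the two marginal
# totals divided by the grand total (the same as 46 * .20).
expected_art_lecture = art_total * lecture_total / total_courses

course_types, styles = 4, 4  # four course types, four instructional styles
df = (course_types - 1) * (styles - 1)
p_value = chi2_sf(20.739, df)

print(round(expected_art_lecture, 1), df, round(p_value, 3))
```

With 9 degrees of freedom, the computed p-value should round to the .014 reported in the SPSS output.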

Create your own example for a crosstab table (also called a contingency table) by thinking of two nominal-level variables, determining the number of categories for each of the two variables, and then entering counts for each combination of categories. Use this interactive applet to calculate the expected frequencies, chi-square statistic, and p-value: http://www.physics.csbsju.edu/stats/contingency_NROW_NCOLUMN_form.html. For practice generating and interpreting chi-square statistics for different types of distributions, visit: http://www.ruf.rice.edu/~lane/stat_sim/chisq_theor/index.html.

Tests Involving Ordinal Variables

Spearman Rho Rank-Order Correlation

In a previous module, we described the characteristics of a linear relationship between two variables using the Pearson product-moment correlation coefficient, r. One of the assumptions for calculating a Pearson correlation is that the variables are measured at the interval or ratio level. In calculating a Pearson correlation, caution must be taken when there are outliers, especially with small sample sizes. An alternative to Pearson's r is called Spearman rank-order correlation, which is designated with the Greek letter ρ (rho). The relationship between these two correlation measures is fairly straightforward. Suppose you have two sets of quiz scores for a fairly small class. Due to a few extreme scores, you determine that a Pearson correlation isn't appropriate - instead you'll calculate a Spearman correlation. Here is the procedure (SPSS does all of this for you).

Here are the quiz scores:

Quiz 1	Quiz 2
1	4
22	17
3	6
4	8
11	10
8	12
9	7
10	15
15	21
16	22
17	28
18	11
5	34
24	18
27	40

Instead of the actual scores, the Spearman rank-order correlation uses the ranks of the scores. Ranks are derived by sorting the scores from lowest to highest and then assigning the lowest score a 1, the next lowest a 2, and so on. Here are the ranks:

Quiz 1	Quiz 2	Rank Q1	Rank Q2
1	4	1	1
22	17	13	9
3	6	2	2
4	8	3	4
11	10	8	5
8	12	5	7
9	7	6	3
10	15	7	8
15	21	9	11
16	22	10	12
17	28	11	13
18	11	12	6
5	34	4	14
24	18	14	10
27	40	15	15

The Spearman rank-order correlation is based on the squared differences between the two ranks. For example, the difference between the first pair of ranks is 1-1=0 and the difference between the second pair of ranks is 13-9=4. These differences are squared and then summed, and then standardized based on the sample size. The resulting number, ρ, is interpreted in the same way as the Pearson r - the sign of ρ identifies a direct (positive) or indirect (negative) relationship, and the size of ρ indicates the strength of the relationship.
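The ranking and summing steps can be sketched in a few lines of Python. This sketch assumes that the standardization mentioned above is the usual shortcut formula for untied ranks, ρ = 1 - 6Σd²/(n(n²-1)), where d is the difference between a pair of ranks. The text does not state the formula, so treat that as an assumption, and note that the function name is illustrative.

```python
quiz1 = [1, 22, 3, 4, 11, 8, 9, 10, 15, 16, 17, 18, 5, 24, 27]
quiz2 = [4, 17, 6, 8, 10, 12, 7, 15, 21, 22, 28, 11, 34, 18, 40]

def ranks_without_ties(scores):
    """Rank scores from 1 (lowest) upward; assumes every score is distinct."""
    position = {score: i + 1 for i, score in enumerate(sorted(scores))}
    return [position[score] for score in scores]

rank_q1 = ranks_without_ties(quiz1)
rank_q2 = ranks_without_ties(quiz2)

n = len(quiz1)
# Square the difference between each pair of ranks, then add them up.
sum_d_squared = sum((a - b) ** 2 for a, b in zip(rank_q1, rank_q2))
# Shortcut formula for Spearman's rho when there are no tied ranks.
rho = 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))

print(rank_q1)
print(rank_q2)
print(sum_d_squared, round(rho, 3))
```

The ranks produced should match the Rank Q1 and Rank Q2 columns in the table above. If a data set contains tied scores, this shortcut no longer applies, and the ranks would need to be averaged across ties first.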

The SPSS output for the data listed above is:

You can practice computing Spearman correlations for sets of numbers at this site: http://www.fon.hum.uva.nl/Service/Statistics/RankCorrelation_coefficient.html.

Mann-Whitney U Test (independent t test alternative)

Recall that an independent samples t test compares the means of two unrelated samples (e.g., majors and non-majors) for an interval or ratio-level variable. The nonparametric alternative to an independent t test is a Mann-Whitney U test, which compares the ranks of observed scores instead of the actual scores. This process is similar to the use of ranks in calculating the Spearman correlation. Here is an example.

Suppose that you want to compare the ages of males and females in a small sample of homeless shelter volunteers. Here is an illustration of their ages:

Instead of comparing the average ages for males and females, the Mann-Whitney U test combines all of the ages into one group, sorts them, and then assigns a rank to each age. These ranks are then added up for the two groups of numbers. The test determines whether the two sums of ranks are equal. The null hypothesis is that they are equal and the alternative hypothesis is that they are not equal. Here is the Mann-Whitney output from SPSS:

Notice in the Ranks table that the sum of the ranks for the males is larger than that of the females. This difference matches the pattern shown in the graph above. The question is whether this difference between ages is statistically significant. The p-value (i.e., significance level) associated with the Mann-Whitney U of 85 is .768, which is greater than .05, so we should not reject the null hypothesis. Even though the two patterns of ages are different, they are not different enough to be regarded as statistically significant.
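Because the volunteers' ages appear only in the graph, the sketch below uses a small set of hypothetical ages, invented purely to illustrate the ranking procedure; it will not reproduce the U of 85 reported above. It shows how the combined ranks, the two rank sums, and U can be computed in Python. The function and variable names are illustrative.

```python
def average_ranks(values):
    """Rank values from 1 upward, giving tied values the mean of their ranks."""
    ordered = sorted(values)
    first_position = {}
    count = {}
    for i, v in enumerate(ordered, start=1):
        first_position.setdefault(v, i)
        count[v] = count.get(v, 0) + 1
    # Tied values occupy positions first .. first + count - 1.
    return [first_position[v] + (count[v] - 1) / 2 for v in values]

# Hypothetical ages, invented for illustration only (not the module's data).
male_ages = [22, 25, 31, 40, 45]
female_ages = [19, 23, 28, 30, 35]

# Combine both groups, rank every age, then split the ranks back out.
combined = male_ages + female_ages
ranks = average_ranks(combined)
n_m, n_f = len(male_ages), len(female_ages)
rank_sum_m = sum(ranks[:n_m])
rank_sum_f = sum(ranks[n_m:])

# U for each group: its rank sum minus the smallest rank sum it could have.
u_m = rank_sum_m - n_m * (n_m + 1) / 2
u_f = rank_sum_f - n_f * (n_f + 1) / 2
u = min(u_m, u_f)  # the smaller value is the one commonly reported

print(rank_sum_m, rank_sum_f, u_m, u_f, u)
```

A useful check on any hand calculation is that the two U values always add up to the product of the two group sizes.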

The Mann-Whitney U test is used instead of the independent t test when assumptions about the samples are violated. Usually these violations happen with small sample sizes.

You can practice obtaining p-values for different values of U and different sample sizes here: http://www.socr.ucla.edu/Applets.dir/U_Test.html.

Sign Test (dependent t test alternative)

In this module, we will consider two alternatives to the dependent t test, which is used for two related samples. The first alternative is called the sign test. Conceptually, this test is very simple. Suppose there are 24 pairs of related scores. For each pair, one score is subtracted from the other score. This subtraction results in a difference. Instead of the value of the difference, only the sign is noted. These signs can be positive (+), negative (-), or tied (0). If the sets of scores are not different, we would expect to see roughly equal numbers of positive and negative signs. For example, in the 24 pairs of scores, there may be 12 pluses (+) and 12 minuses (-). If we find a large imbalance between pluses and minuses, then the two sets of scores are likely to be different. Here are some related scores:

Total GPA	Major GPA	Sign
2.7	2.6	+
2.8	3.0	-
3.1	3.2	-
3.2	3.4	-
2.8	2.9	-
3.5	3.4	+
3.8	3.9	-
2.7	2.9	-
3.0	3.3	-
3.6	3.8	-
3.6	3.4	+
3.6	3.5	+
3.1	3.2	-
2.9	3.1	-
2.9	2.7	+
2.7	2.6	+
3.7	3.8	-
4.0	4.0	0
3.3	3.2	+
3.5	3.3	+
2.9	3.0	-
3.7	3.8	-
3.7	3.6	+
3.5	3.5	0

Here is what the sign test output from SPSS looks like:

In this test, total GPA is being compared with major GPA for 24 students. There are 13 pairs where major GPA is higher than total GPA and 9 where the opposite is true, along with 2 cases where major and total GPA are the same. The null hypothesis for this test is that the number of pluses and minuses is the same, and the alternative is that the number of pluses and minuses is different. The likelihood of obtaining a split at least as uneven as 13 minuses and 9 pluses is based on a binomial distribution, which is closely approximated by a normal distribution for large sample sizes. In this case, the p-value is .523, which is greater than .05, so we fail to reject the null hypothesis. The numbers of positive and negative differences between total GPA and major GPA are not significantly different, which means that there is no significant difference between total and major GPA.
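The sign counts and the binomial probability can be verified with a short Python sketch. It assumes an exact two-tailed binomial test with ties dropped, which is one common way to carry out the sign test; the variable names are illustrative.

```python
import math

# (total GPA, major GPA) pairs from the table above.
gpa_pairs = [
    (2.7, 2.6), (2.8, 3.0), (3.1, 3.2), (3.2, 3.4), (2.8, 2.9), (3.5, 3.4),
    (3.8, 3.9), (2.7, 2.9), (3.0, 3.3), (3.6, 3.8), (3.6, 3.4), (3.6, 3.5),
    (3.1, 3.2), (2.9, 3.1), (2.9, 2.7), (2.7, 2.6), (3.7, 3.8), (4.0, 4.0),
    (3.3, 3.2), (3.5, 3.3), (2.9, 3.0), (3.7, 3.8), (3.7, 3.6), (3.5, 3.5),
]

# Only the sign of (total GPA - major GPA) is used, not its size.
pluses = sum(1 for total, major in gpa_pairs if total > major)
minuses = sum(1 for total, major in gpa_pairs if total < major)
ties = len(gpa_pairs) - pluses - minuses

# Ties are dropped; under the null hypothesis each remaining sign
# is equally likely to be a plus or a minus.
n = pluses + minuses
k = min(pluses, minuses)
one_tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
p_value = min(1.0, 2 * one_tail)  # two-tailed probability

print(pluses, minuses, ties, round(p_value, 3))
```

The counts should come out as 9 pluses, 13 minuses, and 2 ties, and the two-tailed probability should round to the .523 reported above.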

Wilcoxon Signed Ranks Test (dependent t test alternative)

The sign test is called a crude test because the sizes of the differences are not considered - only whether the differences are positive or negative. A refinement of the sign test is the Wilcoxon signed ranks test. Using the same data as in the sign test example, the two GPAs are subtracted, and the absolute values of the differences are sorted and ranked. Then, the ranks for the positive differences and the ranks for the negative differences are added up and compared. If the two sums of ranks are close to equal, then the null hypothesis is retained and the two sets of GPAs are not considered to be significantly different.

Based on the p-value of .284 from the Test Statistics table, we fail to reject the null hypothesis and conclude that the two sums of ranks are equal (or close enough), which indicates that the two GPAs are similar. One way to interpret the Wilcoxon signed ranks test is as a dependent t test comparing the medians instead of the means.
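The signed ranks calculation can be sketched in Python using the GPA pairs from the sign test table. The sketch assumes a normal approximation with a correction for tied ranks and no continuity correction; whether SPSS uses exactly this approximation is an assumption here, although the result should land close to the reported value. The variable names are illustrative.

```python
import math

# (total GPA, major GPA) pairs from the sign test table.
gpa_pairs = [
    (2.7, 2.6), (2.8, 3.0), (3.1, 3.2), (3.2, 3.4), (2.8, 2.9), (3.5, 3.4),
    (3.8, 3.9), (2.7, 2.9), (3.0, 3.3), (3.6, 3.8), (3.6, 3.4), (3.6, 3.5),
    (3.1, 3.2), (2.9, 3.1), (2.9, 2.7), (2.7, 2.6), (3.7, 3.8), (4.0, 4.0),
    (3.3, 3.2), (3.5, 3.3), (2.9, 3.0), (3.7, 3.8), (3.7, 3.6), (3.5, 3.5),
]

# Work in whole tenths so that equal differences compare as exactly equal.
diffs = [round((total - major) * 10) for total, major in gpa_pairs]
diffs = [d for d in diffs if d != 0]  # pairs with no difference are dropped
n = len(diffs)

# Rank the absolute differences, giving tied values the mean of their ranks.
ordered = sorted(abs(d) for d in diffs)

def mean_rank(value):
    first = ordered.index(value) + 1
    return first + (ordered.count(value) - 1) / 2

rank_sum_pos = sum(mean_rank(abs(d)) for d in diffs if d > 0)
rank_sum_neg = sum(mean_rank(abs(d)) for d in diffs if d < 0)

# Normal approximation with a correction for tied ranks
# (no continuity correction).
mean_w = n * (n + 1) / 4
tie_term = sum(c ** 3 - c for c in (ordered.count(v) for v in set(ordered))) / 48
sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24 - tie_term)
z = (min(rank_sum_pos, rank_sum_neg) - mean_w) / sd_w
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-tailed probability

print(rank_sum_pos, rank_sum_neg, round(z, 3), round(p_value, 3))
```

Under these assumptions the two-tailed probability should come out close to the .284 shown in the Test Statistics table.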

Kruskal-Wallis Test (ANOVA alternative)

The final nonparametric test that we'll cover in this module is an alternative to the ANOVA test that we used to compare three or more group means. Similar to the other nonparametric tests, instead of using the actual data values, the data are sorted and ranked, and then the ranks are used for the comparison. Extending the previous example, suppose we want to compare the total GPAs for three different ages of students - younger than 21, 21-25, and older than 25. The first step is to sort the GPAs as one group and assign each one a rank. Then, the ranks from each of the groups are added and divided by the number in the group to obtain three average ranks. These three average ranks are compared using a form of a chi-square test. If the average ranks are equal, there is no difference in total GPA associated with age. If the average ranks are not equal, then age accounts for some of the difference in total GPA. Here is the SPSS output for a Kruskal-Wallis test.

Based on the p-value of .295, we fail to reject the null hypothesis of no differences between the total GPAs for the three groups. Age is not associated with statistically significant differences in total GPA.

Here is a second example in which total GPA is being compared for students who work different amounts during a school week. Here again, the total GPAs for all 24 students are sorted and ranked from 1 for the lowest to 24 for the highest GPA. Then, the ranks for each group of students, based on their Work Level, are added and divided by the number in the group to obtain the average rank. These average ranks are then compared using a chi-square statistic. In this case, based on the p-value of .001, the null hypothesis is rejected, and the differences between the three average ranks are determined to be too large to be due to chance.
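The ranking steps behind the Kruskal-Wallis test can be sketched in Python. The group memberships for the two examples above are not listed in the text, so the GPAs below are hypothetical values invented for illustration, and the output will not match either SPSS result. The sketch assumes the usual H statistic for untied data, and it uses the fact that with exactly 2 degrees of freedom the chi-square upper-tail probability is exp(-H/2).

```python
import math

# Hypothetical total GPAs for three groups, invented for illustration only;
# the module's actual group data are not listed in the text.
groups = {
    "younger than 21": [2.7, 3.1, 3.5, 3.8],
    "21-25": [2.8, 3.2, 3.6, 2.9],
    "older than 25": [3.0, 3.3, 3.7, 4.0],
}

# Sort all GPAs as one group and rank them (no ties in this made-up data).
all_gpas = sorted(g for values in groups.values() for g in values)
rank_of = {g: i + 1 for i, g in enumerate(all_gpas)}
total_n = len(all_gpas)

# Add up the ranks in each group, then divide by the group size.
rank_sums = {
    name: sum(rank_of[g] for g in values) for name, values in groups.items()
}
mean_ranks = {name: rank_sums[name] / len(groups[name]) for name in groups}

# Kruskal-Wallis H (no tie correction is needed when all values are distinct).
h = 12 / (total_n * (total_n + 1)) * sum(
    rank_sums[name] ** 2 / len(groups[name]) for name in groups
) - 3 * (total_n + 1)
df = len(groups) - 1  # number of groups minus 1

# With exactly 2 degrees of freedom, the chi-square upper tail is exp(-H / 2).
p_value = math.exp(-h / 2)

print(mean_ranks)
print(round(h, 3), df, round(p_value, 3))
```

With groups this small, the chi-square approximation is rough, so the sketch is best read as an illustration of the ranking procedure rather than as a substitute for the SPSS output.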