Recall that a parameter is a number that refers to a population, and a statistic is a number that refers to a sample. The inferential statistics that we've studied in previous modules have been used to estimate a parameter based on a statistic. All of the dependent variables that we've considered so far have been interval or ratio-level variables. What can you do if you need to test differences or relationships among nominal or ordinal variables? In this module, we will study inferential tests that do not involve the actual numbers that were observed (e.g., test scores). Instead, these tests use either the counts (i.e., frequencies) of observations or the ranks of scores in place of the scores themselves.

For example, in one type of test, we may be interested in measuring a pattern of responses against those previously observed. These two patterns are described by frequency counts, which can be compared. In another example, we'll see how an alternative to a Pearson correlation coefficient can be used with ranks of scores to describe a relationship between two variables.

Let's start with gender. The district high school student population is 65% female and 35% male. You have been helping 14 female students and 10 male students. How similar or different are these two groups in gender composition? To answer this question, you should conduct a chi-square goodness of fit test (or one-sample chi-square test). The specifics for this test are as follows. First, you are interested in a nominal variable (i.e., gender). Second, you have both a current group and a comparison group. In this example, the current group is the 24 students using your service (14 girls, 10 boys), and the comparison group is the total district high school student population, which you know to be 65% female and 35% male. Third, based on these two groups, you compare the two sets of frequencies. Two very important concepts are involved: observed frequency and expected frequency. Observed frequency is the count of observations in each category for the current group - in this case, the 24 students. Expected frequency is the count you would expect in each category if the current group followed the same pattern as the comparison group - in this case, all high school students. The comparison of observed frequencies and expected frequencies is called chi-square analysis. This type of analysis is often displayed using a cross-tabulation (or crosstab) table. Here is the start of the table for this analysis.

| | Female | Male | Total |
|---|---|---|---|
| Tutored Students | 14 | 10 | 24 |
| High School Students | | | |

The numbers in the row for Tutored Students are the observed frequencies. What are the expected frequencies? Using the percentages 65% and 35% we can construct expected frequencies to use for comparison. For the girls, 65% of 24 is 15.6 and for the boys 35% of 24 is 8.4. These numbers are used in the row for High School Students.

| | Female | Male | Total |
|---|---|---|---|
| Tutored Students | 14 | 10 | 24 |
| High School Students | 15.6 | 8.4 | 24 |
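The expected frequencies in the second row can be reproduced with a few lines of Python. This is only a sketch of the arithmetic described above; the variable names are illustrative choices, not part of the module.

```python
# District proportions and the tutored-group size come from the example above.
district_proportions = {"Female": 0.65, "Male": 0.35}
n_tutored = 14 + 10  # 24 tutored students

# Expected count in each category = district proportion * group size.
# Rounding to one decimal hides floating-point noise (e.g., 15.600000000000001).
expected = {
    gender: round(share * n_tutored, 1)
    for gender, share in district_proportions.items()
}

print(expected)  # {'Female': 15.6, 'Male': 8.4}
```

Note that the expected frequencies always add up to the same total as the observed frequencies (here, 24).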

Obviously, the frequencies are different, but how important is the difference? To answer this question, we use a familiar technique - we find the difference between the Observed and Expected frequencies (called a residual), square each of those differences, divide each squared difference by the Expected frequency, and then add up the results. This technique should sound vaguely similar to the calculation of the variance. Residual is another word for deviation.

For this example, the residuals are shown below:

| | Female | Male |
|---|---|---|
| Residuals | -1.6 | 1.6 |

Squaring these residuals and dividing by 15.6 and 8.4, respectively, results in the following:

| | Female | Male | Total |
|---|---|---|---|
| Squared Residuals / Expected Frequency | .164 | .305 | .469 |

The value in the Total column is called χ² (chi-square).
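As a check on the hand calculation, the same steps (residual, square, divide by the expected frequency, sum) can be sketched in Python. The function name `chi_square` is a hypothetical helper chosen for this illustration.

```python
observed = [14, 10]     # tutored students: female, male
expected = [15.6, 8.4]  # 65% and 35% of 24


def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


stat = chi_square(observed, expected)
print(round(stat, 3))  # 0.469
```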

For a chi-square test, the null hypothesis is that the two sets of frequencies (i.e., observed and expected) are equal. The alternative hypothesis is that they are unequal.

The closer the obtained chi-square is to zero, the more similar the two sets of frequencies are - or, stated another way, the better the observed data fit the expected pattern. This interpretation is where the term "goodness of fit" originates. Similar to previous comparisons of obtained and critical values of the test statistic, we can use a table (see Appendix B.4), SPSS, or Excel to determine whether to reject or retain the null hypothesis. The chi-square distribution is similar to the t and F distributions in that it takes different forms based on the degrees of freedom associated with the test. To view the shape of the chi-square distributions for different degrees of freedom, visit this site: http://highered.mcgraw-hill.com/sites/dl/free/0072868244/124727/CSDistribution.html. For a goodness-of-fit test, the degrees of freedom is the number of columns minus 1 (C-1). The Excel formula for obtaining the associated probability for a chi-square value of .469 (derived above) with 1 degree of freedom is given by the following: =CHIDIST(0.469,1). The SPSS output looks like this:

Because the p-value of .494 is greater than .05, we fail to reject the null hypothesis and conclude that the gender pattern of the tutored students does not differ significantly from that of the district.
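If neither Excel nor SPSS is at hand, the p-value for a chi-square with 1 degree of freedom can be approximated in Python. The sketch below relies on the fact that a chi-square with 1 degree of freedom is the square of a standard normal score, so it applies only to that case; the function name is a hypothetical one chosen for this illustration.

```python
import math


def chi_square_p_value_df1(chi_square):
    """Upper-tail probability of a chi-square value with 1 degree of freedom.

    With 1 df, chi-square is the square of a standard normal score, so the
    tail area reduces to the complementary error function.
    """
    return math.erfc(math.sqrt(chi_square / 2))


p = chi_square_p_value_df1(0.469)
print(round(p, 2))  # 0.49
```

The result is in line with the .494 reported above; small differences in the third decimal place come from rounding the chi-square statistic to .469 before looking up its probability.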

You can use this tool (http://www.fourmilab.ch/rpkp/experiments/analysis/chiCalc.html) to compute the p-value for the chi-square statistic you've calculated and its associated degrees of freedom. Scroll down to the section entitled Calculate probability from X² and d and then enter the computed X² amount and the appropriate degrees of freedom.

There are two assumptions that must be met in order to conduct a chi-square test:

- The categories for the observations cannot overlap - each observation is in only one category.
- Each category needs to have an expected frequency of at least 5. This assumption is difficult to meet when there are many categories and few observations.

Suppose that you are studying the instructional styles for different types of courses. You are interested in whether art, humanities, math, and science courses utilize the same predominant instructional style. You might think of this test as a type of correlation. We are interested in whether instructional style is related to course type or, alternatively, whether the two variables are independent. This interpretation is where the term "chi-square test of independence" originates. The null hypothesis that is tested is that the two variables are independent (not related). The alternative hypothesis is that the two variables are related (not independent).

Based on observations, 150 courses were classified as lecture/discussion, collaborative/cooperative, hands-on/interactive, or individualized/self-paced. Shown in the graph to the right are the observed frequencies for each combination of course type and instructional style. Below that, the observed frequency, expected frequency, and residual for each combination are identified in a crosstab table.

Notice in the crosstab table above that the observed and expected row and column totals are all equal. The expected frequencies just redistribute the counts among the cells. For example, 30 out of the 150 courses are Art courses, which represents 20% of the courses. If the pattern of instructional styles is independent of the type of course, then 20% of the total Lecture/Discussion courses should be Art courses; or stated another way, of the 46 Lecture/Discussion courses, there should be 46 * .20 = 9.2 Art courses. This is the expected frequency for the first cell, which represents Lecture/Discussion Art courses. The remaining expected frequencies are calculated similarly. After the expected frequencies are calculated, they are subtracted from the observed frequencies to obtain the residuals. To derive the chi-square value of 20.739, shown in the lower table, the residuals are each squared, then divided by each cell's expected frequency, and then added up. You might consider entering these numbers into a grid in Excel and manually repeating the calculation to reinforce the derivation of the chi-square statistic. Once the chi-square statistic is computed, its associated probability (the significance level or p-value) is derived using a table, Excel, or SPSS. The degrees of freedom for the test are equal to the number of rows minus 1 times the number of columns minus 1, or (R-1)*(C-1); with four course types and four instructional styles, that is 3*3 = 9. The SPSS output shown to the right reports the Asymptotic Significance in the right-most column. The value shown is .014, which is less than .05, so we can reject the null hypothesis. The two variables are related, not independent, meaning that the pattern of instructional styles is not the same for all course types.
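The expected-frequency rule for a crosstab table (row total times column total, divided by the grand total) can be sketched in Python as follows. The function names are hypothetical, and the 2 x 2 table at the end is made-up data used only to show the calculation; it is not the course data from this example.

```python
def expected_frequencies(observed):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_independence(observed):
    """Return the chi-square statistic and its degrees of freedom, (R-1)*(C-1)."""
    expected = expected_frequencies(observed)
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df


# The one cell spelled out in the text: 46 Lecture/Discussion courses,
# 30 Art courses, 150 courses overall.
print(46 * 30 / 150)  # 9.2

# Hypothetical 2 x 2 counts, for illustration only (NOT the course data).
toy_table = [[10, 20], [30, 40]]
stat, df = chi_square_independence(toy_table)
print(round(stat, 3), df)  # 0.794 1
```

Multiplying the row total by the column total and dividing by the grand total is the same as taking 20% of 46, just written so that it works for every cell.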

Create your own example for a crosstab table (also called a contingency table) by thinking of two nominal-level variables, determining the number of categories for each of the two variables, and then entering counts for each combination of categories. Use this interactive applet to calculate the expected frequencies, chi-square statistic, and p-value: http://www.physics.csbsju.edu/stats/contingency_NROW_NCOLUMN_form.html. For practice generating and interpreting chi-square statistics for different types of distributions, visit: http://www.ruf.rice.edu/~lane/stat_sim/chisq_theor/index.html.

Here are the quiz scores:

| Quiz 1 | Quiz 2 |
|---|---|
| 1 | 4 |
| 22 | 17 |
| 3 | 6 |
| 4 | 8 |
| 11 | 10 |
| 8 | 12 |
| 9 | 7 |
| 10 | 15 |
| 15 | 21 |
| 16 | 22 |
| 17 | 28 |
| 18 | 11 |
| 5 | 34 |
| 24 | 18 |
| 27 | 40 |

Instead of the actual scores, the Spearman rank-order correlation uses the ranks of the scores. Ranks are derived by sorting the scores from lowest to highest and then assigning the lowest score a 1, the next lowest a 2, and so on. Here are the ranks:

| Quiz 1 | Quiz 2 | Rank Q1 | Rank Q2 |
|---|---|---|---|
| 1 | 4 | 1 | 1 |
| 22 | 17 | 13 | 9 |
| 3 | 6 | 2 | 2 |
| 4 | 8 | 3 | 4 |
| 11 | 10 | 8 | 5 |
| 8 | 12 | 5 | 7 |
| 9 | 7 | 6 | 3 |
| 10 | 15 | 7 | 8 |
| 15 | 21 | 9 | 11 |
| 16 | 22 | 10 | 12 |
| 17 | 28 | 11 | 13 |
| 18 | 11 | 12 | 6 |
| 5 | 34 | 4 | 14 |
| 24 | 18 | 14 | 10 |
| 27 | 40 | 15 | 15 |

The Spearman rank-order correlation is based on the squared differences between the two ranks in each pair. For example, the difference for the first pair of ranks is 1-1=0 and the difference for the second pair is 13-9=4. These differences are squared, summed, and then standardized based on the sample size. The resulting number, ρ, is interpreted just like the Pearson r - the sign of ρ identifies a direct (positive) or inverse (negative) relationship and the size of ρ indicates the strength of the relationship.
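The ranking and the squared-difference calculation can be sketched in Python using the quiz scores listed above. The module does not spell out the standardization step, so this sketch assumes the common shortcut formula, ρ = 1 - 6Σd² / (n(n² - 1)), which applies when there are no tied scores (as is the case here). The helper names are illustrative.

```python
quiz1 = [1, 22, 3, 4, 11, 8, 9, 10, 15, 16, 17, 18, 5, 24, 27]
quiz2 = [4, 17, 6, 8, 10, 12, 7, 15, 21, 22, 28, 11, 34, 18, 40]


def ranks(scores):
    """Rank scores from 1 (lowest) upward; assumes no tied scores."""
    position = {score: i + 1 for i, score in enumerate(sorted(scores))}
    return [position[score] for score in scores]


def spearman_rho(x, y):
    """Spearman rank-order correlation via the no-ties shortcut formula."""
    n = len(x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


print(ranks(quiz1)[:2], ranks(quiz2)[:2])    # [1, 13] [1, 9]
print(round(spearman_rho(quiz1, quiz2), 3))  # 0.636
```

For these 15 pairs the squared rank differences sum to 204, which gives ρ of about .64, a moderately strong direct relationship.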

The SPSS output for the data listed above is:

You can practice computing Spearman correlations for sets of numbers at this site: http://www.fon.hum.uva.nl/Service/Statistics/RankCorrelation_coefficient.html.

Suppose that you want to compare the ages of males and females in a small sample of homeless shelter volunteers. Here is an illustration of their ages:

Instead of comparing the average ages for males and females, the Mann-Whitney U test combines all of the ages into one group, sorts them, and then assigns a rank to each age. These ranks are then added up for the two groups of numbers. The test determines whether the two sums of ranks are equal. The null hypothesis is that they are equal and the alternative hypothesis is that they are not equal. Here is the Mann-Whitney output from SPSS:

Notice in the Ranks table that the sum of the ranks for the males is larger than that of the females. This difference matches the pattern shown in the graph above. The question is whether this difference between ages is statistically significant. The p-value (i.e., significance level) associated with the Mann-Whitney U of 85 is .768, which is greater than .05, so we should not reject the null hypothesis. Even though the two patterns of ages are different, they are not different enough to be regarded as statistically significant.

The Mann-Whitney U test is used instead of the independent t test when the t test's assumptions about the samples (for example, normally distributed scores) are violated. These violations are most common with small sample sizes.
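The combine, sort, rank, and sum procedure described above can be sketched in Python. Because the volunteers' ages are not listed in this module, the ages below are hypothetical values chosen only to illustrate the steps, so the result will not match the U of 85 in the SPSS output. The function names are also illustrative, and tied ages are assumed to share the average of the ranks they occupy.

```python
def midranks(values):
    """Rank values from 1 upward, giving tied values the average of their ranks."""
    ordered = sorted(values)
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        rank_of[ordered[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    return [rank_of[v] for v in values]


def mann_whitney_u(group_a, group_b):
    """Return the two rank sums and the U statistic (the smaller of U_a, U_b)."""
    combined_ranks = midranks(group_a + group_b)
    n_a, n_b = len(group_a), len(group_b)
    sum_a = sum(combined_ranks[:n_a])
    sum_b = sum(combined_ranks[n_a:])
    u_a = sum_a - n_a * (n_a + 1) / 2
    u_b = sum_b - n_b * (n_b + 1) / 2
    return sum_a, sum_b, min(u_a, u_b)


# Hypothetical ages, for illustration only (the module's data are not listed).
males = [23, 35, 41, 52]
females = [19, 27, 35, 30]
print(mann_whitney_u(males, females))  # (22.5, 13.5, 3.5)
```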

You can practice obtaining p-values for different values of U and different sample sizes here: http://www.socr.ucla.edu/Applets.dir/U_Test.html.

| Total GPA | Major GPA | Sign |
|---|---|---|
| 2.7 | 2.6 | + |
| 2.8 | 3 | - |
| 3.1 | 3.2 | - |
| 3.2 | 3.4 | - |
| 2.8 | 2.9 | - |
| 3.5 | 3.4 | + |
| 3.8 | 3.9 | - |
| 2.7 | 2.9 | - |
| 3 | 3.3 | - |
| 3.6 | 3.8 | - |
| 3.6 | 3.4 | + |
| 3.6 | 3.5 | + |
| 3.1 | 3.2 | - |
| 2.9 | 3.1 | - |
| 2.9 | 2.7 | + |
| 2.7 | 2.6 | + |
| 3.7 | 3.8 | - |
| 4 | 4 | 0 |
| 3.3 | 3.2 | + |
| 3.5 | 3.3 | + |
| 2.9 | 3 | - |
| 3.7 | 3.8 | - |
| 3.7 | 3.6 | + |
| 3.5 | 3.5 | 0 |

Here is what the sign test output from SPSS looks like:

In this test, total GPA is being compared with major GPA for 24 students. There are 13 pairs where major GPA is higher than total GPA and 9 where the opposite is true, along with 2 cases where major and total GPA are the same. The null hypothesis for this test is that pluses and minuses are equally likely, and the alternative is that they are not. The likelihood of obtaining 13 minuses and 9 pluses is based on a binomial distribution, which is closely approximated by a normal distribution for large sample sizes. In this case, the p-value is .523, which is greater than .05, so we fail to reject the null hypothesis and conclude that the numbers of positive and negative differences between total GPA and major GPA do not differ significantly, which means that there is no significant difference between total and major GPA.
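The binomial probability behind the sign test can be sketched in Python. This version computes an exact two-sided p-value from the counts of pluses and minuses, with the two tied pairs left out; the function name is a hypothetical one chosen for this illustration.

```python
from math import comb


def sign_test_p_value(n_plus, n_minus):
    """Exact two-sided sign test p-value; ties are dropped before counting."""
    n = n_plus + n_minus
    k = min(n_plus, n_minus)
    # Probability of k or fewer of the rarer sign when each sign has chance 1/2.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(round(sign_test_p_value(9, 13), 3))  # 0.523
```

With 9 pluses and 13 minuses out of 22 non-tied pairs, the result agrees with the .523 discussed above.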

Based on the p-value of .284 from the Test Statistics table, we fail to reject the null hypothesis and conclude that the two sums of ranks are equal (or close enough), which indicates that the two GPAs are similar. One way to interpret the Wilcoxon signed ranks test is as a dependent t test comparing the medians instead of the means.
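The signed-rank calculation can be sketched in Python using the GPA pairs from the table above. The sketch drops the two zero differences, ranks the absolute differences (tied values share an average rank), sums the ranks separately for positive and negative differences, and then uses a normal approximation with a tie correction. That approximation is an assumption about how the reported p-value was obtained, and the function name is illustrative.

```python
import math

total_gpa = [2.7, 2.8, 3.1, 3.2, 2.8, 3.5, 3.8, 2.7, 3.0, 3.6, 3.6, 3.6,
             3.1, 2.9, 2.9, 2.7, 3.7, 4.0, 3.3, 3.5, 2.9, 3.7, 3.7, 3.5]
major_gpa = [2.6, 3.0, 3.2, 3.4, 2.9, 3.4, 3.9, 2.9, 3.3, 3.8, 3.4, 3.5,
             3.2, 3.1, 2.7, 2.6, 3.8, 4.0, 3.2, 3.3, 3.0, 3.8, 3.6, 3.5]


def wilcoxon_signed_rank(x, y, decimals=1):
    """Rank sums and a two-sided normal-approximation p-value (tie-corrected)."""
    # Round to avoid floating-point noise, then drop zero differences.
    diffs = [round(a - b, decimals) for a, b in zip(x, y)]
    diffs = [d for d in diffs if d != 0]
    n = len(diffs)
    abs_sorted = sorted(abs(d) for d in diffs)
    # Tied absolute differences share the average of the ranks they occupy.
    rank_of, tie_term, i = {}, 0, 0
    while i < n:
        j = i
        while j < n and abs_sorted[j] == abs_sorted[i]:
            j += 1
        t = j - i
        rank_of[abs_sorted[i]] = (i + 1 + j) / 2
        tie_term += t ** 3 - t
        i = j
    w_plus = sum(rank_of[abs(d)] for d in diffs if d > 0)
    w_minus = sum(rank_of[abs(d)] for d in diffs if d < 0)
    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (min(w_plus, w_minus) - mean) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail area
    return w_plus, w_minus, p


w_plus, w_minus, p = wilcoxon_signed_rank(total_gpa, major_gpa)
print(w_plus, w_minus, round(p, 3))  # 94.5 158.5 0.284
```

Under these assumptions the sketch reproduces the p-value of .284 quoted above. Unlike the sign test, this test uses the size of each difference as well as its direction.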

Based on the p-value of .295, we fail to reject the null hypothesis of no differences between the total GPAs for the three groups. Age is not associated with statistically significant differences in total GPA.

Here is a second example in which total GPA is being compared for students who work different amounts during a school week. Here again, the total GPAs for all 24 students are sorted and ranked from 1 for the lowest to 24 for the highest GPA. Then, the ranks for each group of students, based on their Work Level, are added and divided by the number in the group to obtain the average rank. These average ranks are then compared using a chi-square statistic. In this case, based on the p-value of .001, the null hypothesis is rejected, and the differences between the three average ranks are determined to be too large to be due to chance.