Frequency Distributions and Graphical Displays

Frequency Distributions

Single-number summaries of distributions, such as the mean, median, or mode, have definite limitations. Even when combined with measures of dispersion, such as the range, interquartile range, or standard deviation, a complete understanding of the entire distribution cannot be achieved. Perhaps there is a better way to communicate the nature of an entire distribution? Following the saying about words and pictures, pictures (i.e., graphs and charts) can portray a great deal of information about numbers, too. An intermediate step between having just a list of raw, unordered data and producing an informative graph will be to construct a table called a frequency distribution. A frequency distribution provides an ordered tally of the observed values and a count of the occurrences of each value. The observed values can be tallied individually or they can be grouped and then tallied. Let's look at a simple example.

Scores: 7, 5, 8, 9, 7, 6, 9, 8, 7, 4

Score	Frequency
4	1
5	1
6	1
7	3
8	2
9	2
sum	10

Quite a bit of information can be discerned from the table. First, the minimum score was 4, which was observed once. Second, the maximum score was 9, which was observed twice. Third, the most frequently occurring score (i.e., the mode) was a 7, which was observed three times. The total number of scores was 10 - represented by the sum in the last row. Let's add two more columns to the frequency distribution. The two new columns represent two types of percentages.

Score	Frequency	Percent	Cumulative Percent
4	1	10	10
5	1	10	20
6	1	10	30
7	3	30	60
8	2	20	80
9	2	20	100
sum	10	100

Recall that a percentage is calculated by dividing a number representing a part of a group by a number representing the whole group, and then multiplying by 100. [Percent means per 100.] In this example, the whole is the entire collection of scores - all 10 of them. The parts are the frequency counts in each row. So, the calculation for the percentage of scores of 5 is the following: 1 divided by 10 times 100, which equals 10%; for the percentage of 7s, the calculation is 3 divided by 10 times 100, which equals 30%. Calculating percentages makes the frequencies more easily comparable. The last column is a running total of the Percent column. Why might this be useful information? If a 7 is a passing score, then a quick glance at the Cumulative Percent column shows that 30% of the scores were 6 or lower. A more positive outlook would be to report that 70% (100-30) of the scores were 7 or higher.
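As a minimal sketch, the frequency, percent, and cumulative percent columns can be reproduced in Python using only the standard library. The function name frequency_table is chosen here for illustration and is not part of any statistics package.

```python
from collections import Counter

scores = [7, 5, 8, 9, 7, 6, 9, 8, 7, 4]


def frequency_table(values):
    """Return rows of (value, frequency, percent, cumulative percent)."""
    counts = Counter(values)      # tally of each observed value
    total = len(values)           # the "whole" used for the percentages
    rows = []
    running_count = 0
    for value in sorted(counts):
        running_count += counts[value]
        percent = counts[value] * 100 / total
        cumulative_percent = running_count * 100 / total
        rows.append((value, counts[value], percent, cumulative_percent))
    return rows


table = frequency_table(scores)
print("Score\tFrequency\tPercent\tCumulative Percent")
for value, frequency, percent, cumulative_percent in table:
    print(f"{value}\t{frequency}\t{percent:.0f}\t{cumulative_percent:.0f}")
```

The cumulative percent is computed from a running count rather than by adding the rounded percentages, which avoids accumulating rounding error in larger tables.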

The previous example is very much simplified in order to introduce the meaning of the columns of a frequency distribution. In practice, frequency distributions involve more values and, quite often, researchers will group the values in order to make the frequency distribution more understandable. Here is another, more involved, example. Suppose you've conducted a survey in which you asked your colleagues to report their age (and they actually did). The data that you received are the following:

24, 53, 44, 30, 34, 36, 59, 61, 30, 48,
22, 65, 31, 33, 40, 37, 27, 29, 25, 23,
39, 30, 40, 55, 50, 38, 32, 28, 31, 30,
62, 41, 38, 66, 29, 31, 28, 24, 39, 64,
57, 53, 38, 32, 30, 25, 28, 44, 37, 50

Glancing at this list of numbers doesn't help to understand the ages of the survey participants. Let's construct a grouped frequency distribution by trying to implement the following suggested guidelines. The official term for a group is a class interval, and the term, bin, is used frequently as well.

Use a class interval that includes 2, 5, 10, or 20 values
Aim for 10-20 class intervals (range / number of class intervals = class interval width)
Tally observed values into class intervals - the result gives the frequency count for each class interval (the number of observed values within each class interval)

For the group of 50 ages listed above, the maximum age is 66 and the minimum is 22; so, the range is 66-22 or 44. If we divide 44 by 5, we get approximately 9 class intervals. A typical grouping of ages is into 5-year intervals (e.g., 20-24, 25-29). To sufficiently cover all of the ages, the lowest category should be 20-24 and the highest should be 65-69, which gives 10 class intervals. In the frequency distribution table below, notice that none of the class intervals overlap and that there are no gaps between intervals. [We're using the standard convention that you are 29 up until your 30th birthday - there are other ways to measure age, especially with younger people.]

Age	Frequency	Percent	Cumulative Percent
20-24	4	8	8
25-29	8	16	24
30-34	12	24	48
35-39	8	16	64
40-44	5	10	74
45-49	1	2	76
50-54	4	8	84
55-59	3	6	90
60-64	3	6	96
65-69	2	4	100
sum	50	100

This table might help you understand the ages a bit better. For example, the largest age group, which is called the modal class interval, is the range of ages from 30-34, representing 24% of the survey participants. Based on the grouped frequencies, we can't tell that the actual mode is 30, which occurs 5 times. For that information, you need to inspect the individual values, not the grouped values. From the grouped frequencies, you can also tell that 24% of participants are under 30 years of age and that 24% (100-76) are 50 or older.
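The grouped table can be sketched in Python as well. This is an illustration under stated assumptions, not a prescribed method: the helper name grouped_frequencies and its start and width parameters are hypothetical, and the intervals are assumed to be integer ranges such as 20-24.

```python
from collections import Counter

ages = [
    24, 53, 44, 30, 34, 36, 59, 61, 30, 48,
    22, 65, 31, 33, 40, 37, 27, 29, 25, 23,
    39, 30, 40, 55, 50, 38, 32, 28, 31, 30,
    62, 41, 38, 66, 29, 31, 28, 24, 39, 64,
    57, 53, 38, 32, 30, 25, 28, 44, 37, 50,
]


def grouped_frequencies(values, start, width):
    """Tally integer values into class intervals of equal width.

    Returns rows of (label, frequency, percent, cumulative percent).
    """
    # Each value maps to the index of the class interval that contains it.
    bins = Counter((v - start) // width for v in values)
    total = len(values)
    rows = []
    running_count = 0
    for index in range(max(bins) + 1):
        low = start + index * width
        count = bins.get(index, 0)   # empty intervals are kept, with count 0
        running_count += count
        label = f"{low}-{low + width - 1}"
        rows.append((label, count, count * 100 / total, running_count * 100 / total))
    return rows


age_table = grouped_frequencies(ages, start=20, width=5)
modal_class = max(age_table, key=lambda row: row[1])[0]
mode_age, mode_count = Counter(ages).most_common(1)[0]

for label, count, percent, cumulative_percent in age_table:
    print(f"{label}\t{count}\t{percent:.0f}\t{cumulative_percent:.0f}")
print("modal class interval:", modal_class)
print("mode of the raw ages:", mode_age, "occurring", mode_count, "times")
```

Note that the modal class interval comes from the grouped rows, while the mode itself has to be computed from the raw ages, which mirrors the distinction made above.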

It may be somewhat difficult to tell just by looking at the table, but as we will see later, this distribution has relatively more smaller values than larger ones. This situation represents what statisticians call a skew in the data. The skewness statistic measures the lack of symmetry of a distribution. As a rule of thumb, if the mean is larger than the median, there is a positive skew, and if the mean is smaller than the median, there is a negative skew. For a perfectly symmetric distribution, the mean and median are equal and the skewness is 0.

There is one more descriptive statistic for a distribution - the kurtosis, which measures the peakedness of a distribution. A positive kurtosis represents a peaked distribution, a negative kurtosis represents a flat distribution, and a kurtosis of 0 represents a bell shape. We won't spend much time studying the kurtosis or the skewness. There are very thorough explanations of them here: http://jalt.org/test/bro_1.htm. There is one application of skewness that is worthy of a short note. Before that, review this output from SPSS about the full (not grouped) distribution of ages that was listed previously.

table of descriptive statistics for age example

With the exception of the three standard error statistics (i.e., Std. Error of Mean, Skewness, and Kurtosis), all of the other statistics should be familiar to you by now. [Ignore the Missing entry - Missing represents blanks in the data - there weren't any.] Can you identify the three measures of central tendency and the three measures of spread? Note that the interquartile range is not reported, but it can be easily derived by subtracting 29.75 from 48.50. Why? Also, note that the number 36.50 occurs twice and has two different labels. What is the relationship between the median and the 50th percentile?

So, what do the skewness and kurtosis indicate? First, because the kurtosis statistic is negative (-.497), this distribution is flatter than a standard, bell-shaped curve. Second, because the skewness is positive (also note that the mean is larger than the median), this distribution is not symmetric. In fact, lack of symmetry is a warning signal about the accuracy of the mean as a measure of the center. Statisticians compute a skewness ratio and determine that the distribution is severely skewed if the ratio is 2 or higher or -2 or lower. SPSS does not calculate the skewness ratio, but it does give you the two numbers you need to compute it yourself. The skewness ratio equals the skewness divided by the standard error of the skewness. In this example, the skewness ratio is .784/.337 or 2.33, which indicates a severely, positively skewed distribution because the ratio is larger than 2. In this case, the mean should only be used with caution when describing the distribution.
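The skewness ratio can be checked with a short Python sketch. It assumes the adjusted sample skewness formula and the matching standard error formula that are commonly used by statistical packages; that SPSS uses exactly these formulas is an assumption here, although they reproduce the values quoted above. The function names are illustrative only.

```python
from math import sqrt
from statistics import mean, median, stdev

ages = [
    24, 53, 44, 30, 34, 36, 59, 61, 30, 48,
    22, 65, 31, 33, 40, 37, 27, 29, 25, 23,
    39, 30, 40, 55, 50, 38, 32, 28, 31, 30,
    62, 41, 38, 66, 29, 31, 28, 24, 39, 64,
    57, 53, 38, 32, 30, 25, 28, 44, 37, 50,
]


def sample_skewness(values):
    """Adjusted sample skewness (assumed formula, see the note above)."""
    n = len(values)
    center = mean(values)
    spread = stdev(values)  # sample standard deviation (n - 1 denominator)
    cubed_z_scores = sum(((x - center) / spread) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed_z_scores


def skewness_standard_error(n):
    """Standard error of the skewness; depends only on the sample size."""
    return sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


skew = sample_skewness(ages)
skew_se = skewness_standard_error(len(ages))
skewness_ratio = skew / skew_se

print(f"mean = {mean(ages):.2f}, median = {median(ages)}")
print(f"skewness = {skew:.3f}, standard error = {skew_se:.3f}")
print(f"skewness ratio = {skewness_ratio:.2f}")
```

Because the standard error depends only on the number of cases, the same skewness value yields a larger ratio in a larger sample, which is worth keeping in mind when applying the cutoff of 2.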

Summary of the Relationship Between Descriptive Statistics and Distributions

Where is the center of the distribution? Answered by the mean
How spread out are the values from the center? Answered by the standard deviation
How symmetric is the distribution? Answered by the skewness
How peaked or flat is the distribution? Answered by the kurtosis

Graphical Displays

The graphical version of a frequency distribution is called a histogram. Here are two examples of histograms for the grouped age data, which were generated using Excel and SPSS. Review them and compare them with each other and with the grouped frequencies presented in the table above. Can you visualize where the mean, median, and mode are? Can you see why the distribution is skewed? The relatively fewer ages on the right side pull the mean away from the larger grouping of ages on the left.

histogram of grouped ages from Excel

Figure 1. Frequency distribution of participants' ages, generated using Excel.

histogram of grouped ages from SPSS

Figure 2. Frequency distribution of participants' ages, generated using SPSS.

These graphs are usually generated with the aid of a computer program, but here are the steps to construct one by hand:

Draw the axes - the horizontal (x-axis) indicates the distribution values, class intervals; the vertical (y-axis) indicates the frequency counts.
Mark endpoints of each class interval.
Label each class interval with the midpoint, the range, or the endpoints.
Plot the tally (frequency count) for each class interval.
Construct rectangles for each class interval.
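The steps above can be mimicked with a rough text-only sketch in Python. It is turned on its side, so each class interval gets a row and the bar grows to the right instead of upward; the function name text_histogram and the use of the # character for the bars are choices made here for illustration.

```python
ages = [
    24, 53, 44, 30, 34, 36, 59, 61, 30, 48,
    22, 65, 31, 33, 40, 37, 27, 29, 25, 23,
    39, 30, 40, 55, 50, 38, 32, 28, 31, 30,
    62, 41, 38, 66, 29, 31, 28, 24, 39, 64,
    57, 53, 38, 32, 30, 25, 28, 44, 37, 50,
]


def text_histogram(values, start, width):
    """Return one text row per class interval: label, bar, and count."""
    largest = max(values)
    lines = []
    low = start
    while low <= largest:
        high = low + width - 1                   # endpoints of the interval
        count = sum(1 for v in values if low <= v <= high)  # the tally
        bar = "#" * count                        # the "rectangle"
        lines.append(f"{low}-{high} | {bar} ({count})")
        low += width
    return lines


histogram_lines = text_histogram(ages, start=20, width=5)
print("\n".join(histogram_lines))
```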

The APA manual has separate sections on properly formatting tables and figures. The figures included above are properly formatted according to APA guidelines, with captions below the graphs. Please note that APA-formatted figures do not have titles - they have captions instead.

Here is an example of a properly APA-formatted table. Note that all numbers are aligned to the right. If the numbers included decimal points, the decimal points would provide the alignment point. Microsoft Word and other word-processing programs have special tab settings for tab-alignment of numbers. Notice also that APA-formatted tables have no vertical borders. Please consult the APA manual for additional examples. All tables and figures included in the required assignments for the course must be formatted to APA specifications.

Table 1
Distribution of Ages for Survey Participants (n=50)
Age	Frequency	Percent	Cumulative percent

20-24	4	8	8
25-29	8	16	24
30-34	12	24	48
35-39	8	16	64
40-44	5	10	74
45-49	1	2	76
50-54	4	8	84
55-59	3	6	90
60-64	3	6	96
65-69	2	4	100
Total	50	100

Histograms are only one of the many types of graphical displays available for numerical data. The type of graph that you should choose depends on the information that you are trying to communicate. As you read research reports, begin to develop an understanding of the graphs that researchers typically use. Here are some other types of graphical displays. For these examples, you will see some additional variables for the sample of 50 survey participants.

Stem-and-Leaf Plot

The stem-and-leaf plot is not used very often in research reports, but it is used to help students understand the transition from a frequency distribution in table form to a histogram, which displays the distribution in graphical form. The stem-and-leaf plot is appropriate for ordinal level variables or higher. Here is the stem-and-leaf plot for the ages of the 50 survey participants.

Age Stem-and-Leaf Plot

 Frequency    Stem &  Leaf

     4.00        2 .  2344
     8.00        2 .  55788899
    12.00        3 .  000001112234
     8.00        3 .  67788899
     5.00        4 .  00144
     1.00        4 .  8
     4.00        5 .  0033
     3.00        5 .  579
     3.00        6 .  124
     2.00        6 .  56

 Stem width:  10
 Each leaf:   1 case(s)

Before reading the explanation of this plot, try this: look at the right side of the plot by tilting your head to the right - if you have printed this or are viewing on a laptop, rotate the image 90 degrees counterclockwise. Can you see that the stem-and-leaf plot resembles the histogram? The numbers of leaves (i.e., frequencies) represent the heights of the histogram's bars. The stem-and-leaf plot displays an ordered list of the values in the distribution. The stem represents the first digit, the 10s digit - in this example, the stems represent 20, 30, 40, 50, and 60. The leaves represent the second digit, which in this case is the 1s digit. So, the numbers in the top row represent 22, 23, 24, and 24. Notice that there is actually more information in the stem-and-leaf plot than in the grouped histogram because we can re-generate the entire distribution. With the first bar of the histogram, we only know that there are four values between 20 and 24, but we don't know exactly which four values they are. The mode is fairly easy to find in this display - can you find it?
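Here is a small Python sketch of how such a plot can be built from the raw ages. It assumes two-digit values and splits each stem into a lower row (leaves 0-4) and an upper row (leaves 5-9), as in the plot above; the function name stem_and_leaf is hypothetical, and rows that would have no leaves are simply omitted.

```python
from collections import defaultdict

ages = [
    24, 53, 44, 30, 34, 36, 59, 61, 30, 48,
    22, 65, 31, 33, 40, 37, 27, 29, 25, 23,
    39, 30, 40, 55, 50, 38, 32, 28, 31, 30,
    62, 41, 38, 66, 29, 31, 28, 24, 39, 64,
    57, 53, 38, 32, 30, 25, 28, 44, 37, 50,
]


def stem_and_leaf(values):
    """Return rows of 'frequency  stem . leaves' with each stem split in two."""
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)   # 10s digit and 1s digit
        half = 1 if leaf >= 5 else 0     # 0: leaves 0-4, 1: leaves 5-9
        rows[(stem, half)].append(str(leaf))
    lines = []
    for (stem, _half), leaves in sorted(rows.items()):
        lines.append(f"{len(leaves):>2}  {stem} . {''.join(leaves)}")
    return lines


plot_lines = stem_and_leaf(ages)
print("\n".join(plot_lines))
```

Because every leaf is kept, the original list of ages can be rebuilt from the output, which is the extra information the grouped histogram lacks.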

Boxplot

A common way of describing a distribution is called the 5-number summary. These five numbers are used to draw a boxplot, also called a box-and-whiskers plot. The five numbers are the minimum value, the first quartile (Q1), the median (Q2), the third quartile (Q3), and the maximum value. These are plotted vertically on a graph, with vertical lines drawn between the bottom two points and between the top two points. These lower and upper vertical lines are the whiskers. Three horizontal lines are drawn through the first, second, and third quartiles. Then two vertical lines are drawn connecting the ends of the horizontal lines to form the box. The result looks like this:
boxplot of ages

The bottom and top points are the minimum (22) and maximum (66) values. The longer horizontal lines represent Q1 (29.75), Q2 or the median (36.5), and Q3 (48.5). The height of the box is the interquartile range. What should you understand about the distribution by looking at this boxplot? First, realize that each section between consecutive horizontal lines represents 25% of the values. Notice that the lower whisker is shorter than the upper whisker, which means that the lower values are less spread out than the higher ones. Glance back at the stem-and-leaf plot or the histogram and compare the spread of the lower values with that of the higher values. The histogram bars are taller for the younger ages and shorter for the older ages.
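The five numbers can be computed with Python's standard library, as a sketch. Quartile conventions differ between programs; the "exclusive" method of statistics.quantiles is used here because it happens to reproduce the quartiles quoted above for these 50 ages, not because it is the only valid choice.

```python
from statistics import quantiles

ages = [
    24, 53, 44, 30, 34, 36, 59, 61, 30, 48,
    22, 65, 31, 33, 40, 37, 27, 29, 25, 23,
    39, 30, 40, 55, 50, 38, 32, 28, 31, 30,
    62, 41, 38, 66, 29, 31, 28, 24, 39, 64,
    57, 53, 38, 32, 30, 25, 28, 44, 37, 50,
]

# The three cut points that divide the sorted ages into four equal parts.
q1, q2, q3 = quantiles(ages, n=4, method="exclusive")

five_number_summary = (min(ages), q1, q2, q3, max(ages))
interquartile_range = q3 - q1          # height of the box
lower_whisker = q1 - min(ages)         # length of the lower whisker
upper_whisker = max(ages) - q3         # length of the upper whisker

print("five-number summary:", five_number_summary)
print("interquartile range:", interquartile_range)
print("whisker lengths (lower, upper):", lower_whisker, upper_whisker)
```

Comparing the two whisker lengths gives a numeric version of the visual check described above: the shorter lower whisker means the lower quarter of the ages is packed more tightly than the upper quarter.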

Now let's see how a boxplot can identify outliers (remember, those are unusually large or small values in the distribution). For this example, data for an 80-year-old participant will be added to the distribution and a new boxplot will be generated.

boxplot with outlier

The new boxplot looks similar to the previous one except for the little circle labeled 51 (the 51 is the case number for the value) at the top. SPSS has designated this new value as an outlier because it is more than 1.5 times the interquartile range away from the top of the box (Q3). Values that are more than 3 times the interquartile range away from the top (or bottom) of the box are designated by SPSS as extreme values and are displayed with a different symbol. Outliers can distort the mean; so, the boxplot is a useful tool for identifying them. You can either argue to justify the removal of any outliers from the data set or use alternative statistics that are insensitive to extreme values.
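The 1.5 and 3 times rules can be sketched in Python. Programs differ in how they locate the edges of the box. This sketch assumes the edges are hinges, that is, the medians of the lower and upper halves of the sorted data, so its box edges can differ slightly from percentile-based quartiles. The function names are hypothetical.

```python
from statistics import median

ages = [
    24, 53, 44, 30, 34, 36, 59, 61, 30, 48,
    22, 65, 31, 33, 40, 37, 27, 29, 25, 23,
    39, 30, 40, 55, 50, 38, 32, 28, 31, 30,
    62, 41, 38, 66, 29, 31, 28, 24, 39, 64,
    57, 53, 38, 32, 30, 25, 28, 44, 37, 50,
]
ages_with_new_case = ages + [80]   # the added 80-year-old participant


def hinges(values):
    """Medians of the lower and upper halves of the sorted values.

    When the count is odd, the overall median belongs to both halves.
    """
    data = sorted(values)
    half = (len(data) + 1) // 2
    return median(data[:half]), median(data[-half:])


def flag_unusual_values(values, outlier_step=1.5, extreme_step=3.0):
    """Return (outliers, extreme values) based on distance from the box."""
    lower, upper = hinges(values)
    box_height = upper - lower
    outliers, extremes = [], []
    for value in sorted(values):
        # Distance beyond the nearer box edge; negative inside the box.
        distance = max(lower - value, value - upper)
        if distance > extreme_step * box_height:
            extremes.append(value)
        elif distance > outlier_step * box_height:
            outliers.append(value)
    return outliers, extremes


outliers, extremes = flag_unusual_values(ages_with_new_case)
print("box edges:", hinges(ages_with_new_case))
print("outliers:", outliers)
print("extreme values:", extremes)
```

Under this convention, the age of 80 lies just beyond the 1.5 times cutoff, so whether a borderline value is flagged can depend on how the box edges are computed.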

Pie Chart

A pie chart is appropriate to display a distribution of nominal (categorical or qualitative) data. The survey asked participants to select an ethnicity with which they identified. Here is the pie chart for their selections.

pie chart of ethnicities

Just a note - colorful graphics are fine for dissertation defense presentations and conference or workshop displays, but figures in your printed dissertation need to be in shades of gray.

Column and Bar Graphs

Column and bar graphs are also appropriate for nominal data. Column and bar graphs are similar to histograms, but there is an important difference. The horizontal axis (x-axis) for a histogram always represents an ordinal-level variable or higher. Column and bar graphs usually portray nominal variables. By combining two nominal variables, a bar chart can support comparisons between groups. Here is a grouped bar chart comparing ethnicities of female and male participants.

bar graph of ethnicity by gender

Line Graphs

Line graphs are often used to show changes in phenomena over time. In this example, a survey question asked about participants' attitudes in the fall and in the spring. The means for the attitudes at the two times were calculated for females and for males. The lines illustrate how male attitudes decreased over time while female attitudes increased.
line graph of attitudes

Beyond Univariate Statistics

Before introducing the last type of graphical display, a clearer distinction should be made about the data source for these graphs. So far, the statistics that you've studied have involved a single variable, and so, they are called univariate statistics. The two preceding graphs are examples of ways to combine multiple variables in order to make comparisons. In the next section, you'll learn about statistics that involve two variables, called bivariate statistics.

Scatterplot

A scatterplot represents the distribution of two variables simultaneously. The values for one variable (sometimes called the independent variable) are plotted along the horizontal axis (x-axis). The values of the other variable (sometimes called the dependent variable) are plotted along the vertical axis (y-axis). Each survey participant answered a question about age and another question about years of work experience. Age is plotted along the x-axis and years of work experience along the y-axis. Each dot represents a pair of answers for an individual participant. Study the graph below and describe any general trends that you notice. Why are there no points in the upper left portion of the graph?

scatterplot of age and experience

In addition to observing general patterns, scatterplots also help to identify unique individuals. In this example, the three dots in the lower right part of the graph represent older participants with relatively few years of work experience. Depending on the purposes of the study and the research questions being investigated, these individuals might be sought out for a deeper understanding of their experiences.

In the next section, you'll learn about a single statistic that attempts to describe the relationship between two variables.