Introduction to Statistics

Applications of Statistics

Think about when and where you've encountered statistics - in newspaper articles, on news reports, in journal articles, at the ball game, in advertising, at a swim meet, in product evaluations, in your child's homework, in your profession, in your classroom, or in testing results. Not only are statistical treatments of data widely found in our society, but these techniques are fundamental to almost all disciplines of study. Evidence of the increasing importance of statistics can be found in the early introduction of the topic in school curricula. For example, there is a content standard for statistics for kindergarten students. Thirty years ago, the first mention of statistics might have been in an introductory graduate research course. Let's explore possible reasons for this change.

Purpose

Statistical techniques are used to organize and analyze data, where data are unambiguous quantifications of observed phenomena (i.e., a regular way of assigning numbers to certain events or characteristics). In addition to organizing and analyzing data, statistical techniques include methods for illustrating the results of the statistical analysis. The prime example of this use of statistics involves the creation of graphical displays of information using charts and graphs. Not only is a picture worth a thousand words, it's worth even more numbers. Through these analyses and graphics, trends and other patterns may emerge that help researchers and others understand and explain phenomena in nature.

Brief History

One of the earliest applications of statistics derives from the need to make sense out of census data. In fact, the derivation of the word, statistics, includes the notion of understanding the political state. A census is used to understand the characteristics of the residents of a country. In the United States, a census is conducted every ten years. Among other decisions, the composition of the House of Representatives is based on census data. As you can see, the need to describe observed phenomena is central to the use of statistics.

The mathematics of statistics has its roots in probability theory - the branch of mathematics that deals with chance and uncertainty. Explaining the characteristics of games of chance through mathematics was one of the early goals of probability theory. Essentially, understanding games of chance involves predicting outcomes (e.g., heads or tails; blackjack; a royal flush; a roll of seven). How might understanding phenomena and predicting outcomes be linked? They are linked through a process of reasoning called statistical inference, which uses a known state of nature as the basis for predictions about the future or about a wider context. To read more about the general background of statistics, you might visit: http://en.wikipedia.org/wiki/Statistics.

Role of Computers

So, statistics combines tools to describe phenomena and make inferences based on descriptions - why has the subject increased in popularity, as is evident from its inclusion in kindergarten curricula? As with many modern trends, the answer can be linked to the influence of cheap, yet powerful, computers with graphical displays. Not too long ago, people who needed to make sense of data either needed to conduct tedious calculations by hand or needed access to a powerful mainframe computer. Imagine perusing lists of census data to determine the average household income in the Bay Area and then calculating the figure by hand. Now, not only are many results of statistical analyses readily available to you, but you can obtain the actual raw data and conduct your own analysis. Visit http://www.census.gov/ to view what is available. Putting these data and tools in the hands of more people requires a more general understanding of statistical concepts and techniques. As you will see first-hand, having the data and a statistical software program is necessary but not sufficient for conducting meaningful analyses. That's the reason for the increased emphasis on topics like statistical reasoning and statistical literacy.

Statistics as a Subset of the Tools for Research

If you've completed another Research Methods course, you've been introduced to many tools for conducting research. The primary division of these tools is usually along the continuum that runs from qualitative, interpretive inquiry to quantitative, scientific investigations. At the quantitative end of the continuum lie most of the statistical techniques that we will study. There are many, many more that we will not have the time to study. So, one of the goals for the course is to introduce you to fundamental statistical concepts that underlie most statistical procedures. As with qualitative and quantitative approaches in general, the specific statistical tests and routines that you should use are determined by the questions you are addressing. As you will see in the text and elsewhere, your selection of appropriate statistical tools can be guided by decision trees or flow charts that lead you through a series of questions intended to identify the appropriate statistical test to use for your given situation.

Examples of Data

Data are all around us - physical characteristics, such as height, weight, eye color, dominant hand, gender, age, ethnicity; social characteristics, such as socioeconomic level, citizenship, residency, marital status, family structure, years of education; or school-based data, such as grade in school, test scores, number of absences, number of referrals, grades, placements, abilities, aptitude, achievements. Each item named in the previous list represents a variable, which is a set of data points, all of which represent the same construct. In the context of a research study, a variable is a set of data that comprise different values - the values of variables vary! If you are studying students at a boys' high school, gender is not a variable in your study, because there is only one value for gender - male.

Scan through the list in the preceding paragraph once more. In addition to the categories of physical, social, and school data, can you discern other differences between these variables? For example, would you describe the process of assigning numbers to height to be similar to the process of assigning numbers to ethnic classifications? Hopefully, you find these two processes to differ in a fundamental way - namely, heights are measured with a measuring device like a tape measure and assigned a length, whereas ethnicities are based on ancestry and any number assigned to a specific ethnicity is completely arbitrary. For example, you could assign a 1 to Asian or a 5 to Asian and, as long as no other group is assigned a 1 or 5, either would be an adequate quantification of ethnicity.

Let's describe these differences more formally. Variables are measured at different levels - called levels of measurement. There are four levels: nominal, ordinal, interval, and ratio. Before explaining each one, you might wonder why these measurement levels matter. The reason is simple - the level of measurement determines which statistical techniques are appropriate to use. The assignment of numbers to values of the variable is often called coding.

Nominal - the numbers assigned to the values of the variable are completely arbitrary - they are just labels that are consistently applied. For example, gender is a variable that typically has two values, male and female. You can assign 1 to represent male and 2 to represent female or vice versa or 5 to represent female and 3 to represent male. As long as all males are assigned the same number and all females are assigned the same number (and the two numbers are different), the coding of gender is appropriate. Variables measured at the nominal level are also called qualitative or categorical variables. When there are just two values, as is the case with gender, the variable is called dichotomous. Dichotomous variables are quite common in research because they represent the division of a sample into two groups.

Ordinal - the numbers assigned to the values of the variable indicate order but not the actual size. The prototype to remember is rankings. For example, runners who finish a race are labeled first, second, and third (these are called ordinal numbers, by the way). The place in which they finish does not indicate their actual time though. In fact, the difference between first place and second place may be 2 seconds, while the difference between second and third may be 10 seconds. In more formal terms, the intervals between consecutive values on an ordinal scale are not equal. Think of ranking your students by some ability - the top five students may be very close in ability levels and then there may be a substantial decrease between the fifth place student and the sixth place student.

Interval - the numbers assigned to the values of the variable indicate measured amounts. The intervals between these numbers are equal. For example, temperature is measured on an interval scale - the interval between 40 degrees and 50 degrees is the same as the interval between 70 degrees and 80 degrees. Most educational variables are treated as if they were measured on interval scales.

Ratio - the numbers assigned to the values of the variable meet the properties of an interval scale and include a "true" zero, which makes ratio comparisons of values (such as "twice as much") meaningful. A true zero is the complete lack of the measured quantity. If we counted the change in students' pockets, students without any change would be assigned a 0. Reporting that Juan had twice the change that Nina had would make sense. On the other hand, if we measured math ability using a math test, a student who scored 0 on the test can't be said to lack all math ability. Furthermore, reporting that Beatrix, who scored a 90, is twice as able, mathematically, as Que, who scored a 45, isn't meaningful.
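To see why a true zero matters, here is a minimal Python sketch. It borrows the temperature example from the interval level and assumes the readings are in degrees Fahrenheit (the text does not name a scale); the function name and the amounts of pocket change are hypothetical and chosen only for illustration. The ratio of two temperatures changes when the same readings are converted to Celsius, while the ratio of two amounts of money does not depend on the unit.

```python
def fahrenheit_to_celsius(deg_f):
    """Convert a Fahrenheit reading to Celsius."""
    return (deg_f - 32) * 5 / 9


# Interval scale: the zero point is arbitrary, so ratios change with the unit.
warm_f, cool_f = 80, 40  # assumed Fahrenheit readings
ratio_f = warm_f / cool_f
ratio_c = fahrenheit_to_celsius(warm_f) / fahrenheit_to_celsius(cool_f)

# Ratio scale: zero means "no change at all", so ratios survive a unit change.
juan_cents, nina_cents = 50, 25  # hypothetical amounts of pocket change
ratio_cents = juan_cents / nina_cents
ratio_dollars = (juan_cents / 100) / (nina_cents / 100)

print("Temperature ratio in Fahrenheit:", ratio_f)
print("Temperature ratio in Celsius:", round(ratio_c, 2))
print("Change ratio in cents:", ratio_cents)
print("Change ratio in dollars:", ratio_dollars)
```

The two temperature ratios disagree, which is why "twice as warm" is not a meaningful statement on an interval scale, whereas "twice the change" holds whether the money is counted in cents or in dollars.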

Return to the previous list of variables and try your hand at classifying them according to these four levels of measurement.

Uses of Statistics - Types of Research Questions

Research questions that are answered through the use of statistics involve describing a current context or making an inference about a different, but related, setting. A typical descriptive question might be one that asks: What are the reading levels of first-grade students at ABC Elementary who use the Write-to-Read instructional program? A typical inferential question might be one that asks: What is the effect of the Write-to-Read instructional program for students in the XYZ district?

Descriptive Statistics

Descriptive statistics summarize data by reporting a number that represents the entire set of data. For example, the mean (average) score represents a summary of all of the scores on a test.
Descriptive statistics can also be used to organize a set of data. For example, a table of age ranges and tallies (frequencies) of participants within those age ranges helps to organize the observed values of the age variable.
Descriptive statistics allow researchers to illustrate entire sets of data so that overall patterns can be seen. For example, a pie chart showing ethnic classifications can describe these characteristics for a group of students. Likewise, a graph that compares reading levels and hearing abilities can illustrate how these two variables are related.

Inferential Statistics

Inferential statistics are used to compare groups on one or more variables. For example, reading ability of girls might be compared to that of boys for a subgroup of students in order to make general statements about the comparable abilities of girls and boys.
Inferential statistics can also be used to compare two variables within a group of people. For example, the relationship between nutritional habits and school achievement levels might be studied for a subgroup of students so that general statements might be made about how these two variables could be linked.
Similar to comparing variables, inferential statistics can be used to generate mathematical models that help to predict particular outcomes. For example, an admission officer might use historical data about high school performance and subsequent success in college to help inform an admissions decision.

Generalizing Results - from Sample to Population; from Today to Tomorrow; from Here to There

Remember that a researcher using quantitative methods, and specifically inferential statistics, is intent on producing results that generalize to other people, in other settings, and at other times. In order to achieve this goal, the proper selection of a particular set of participants, called a sample, is vital. In generalizing, the researcher has a large group of people in mind - this is called the population. If the generalization to the population from the sample is going to be seen by others as valid, the sample needs to mirror the larger population in every way that is determined to be important. Keep this important point in mind as you continue to read research reports and learn about statistics.

Suggestions for Succeeding in Studying Statistics

Math, in general, and statistics, in particular, are both very hierarchical by their nature. This means that later concepts and techniques depend on earlier ones. This situation can be either good or bad. If you build a solid foundation to start with and keep up with the material, you can build your skills incrementally. The downside is that if you do not understand a topic or concept, you cannot skip over it, hoping to avoid it in the future. Reading carefully and thoroughly, taking notes, working exercises, and practicing new skills will help you achieve success in this course.

Frequency Distributions and Graphical Displays

Frequency Distributions

You may be familiar with statistical techniques that summarize data by reporting the average (mean), median, standard deviation, or some other summary statistic. Single-number summaries of distributions, such as the mean, median, or mode, have definite limitations. Even when combined with measures of dispersion, such as the range, interquartile range, or standard deviation, a complete understanding of the entire distribution cannot be achieved. Perhaps there is a better way to communicate the nature of an entire distribution? Following the saying about words and pictures, pictures (i.e., graphs and charts) can portray a great deal of information about numbers, too. An intermediate step between having just a list of raw, unordered data and producing an informative graph will be to construct a table called a frequency distribution. A frequency distribution provides an ordered tally of the observed values and a count of the occurrences of each value. The observed values can be tallied individually or they can be grouped and then tallied. Let's look at a simple example.

Scores: 7, 5, 8, 9, 7, 6, 9, 8, 7, 4

Score	Frequency
4	1
5	1
6	1
7	3
8	2
9	2
sum	10

Quite a bit of information can be discerned from the table. First, the minimum score was 4, which was observed once. Second, the maximum score was 9, which was observed twice. Third, the most frequently occurring score (i.e., the mode) was a 7, which was observed three times. The total number of scores was 10 - represented by the sum in the last row. Let's add two more columns to the frequency distribution. The two new columns represent two types of percentages.

Score	Frequency	Percent	Cumulative Percent
4	1	10	10
5	1	10	20
6	1	10	30
7	3	30	60
8	2	20	80
9	2	20	100
sum	10	100

Recall that a percentage is calculated by dividing a number representing a part of a group by a number representing the whole group, and then multiplying by 100. [Percent means per 100.] In this example, the whole is the entire collection of scores - all 10 of them. The parts that relate to the percentages are the frequency counts in each row. So, the calculation for the percentage of scores of 5 is the following: 1 divided by 10 times 100, which equals 10%; for the percent of 7s, the calculation is 3 divided by 10 times 100, which equals 30%. Calculating percentages makes the frequencies more easily comparable. The last column is a running total of the Percent column. Consider why this might be useful information. If a 7 is a passing score, then a quick glance at the Cumulative Percent column shows that 30% of the scores were 6 or lower. A more positive outlook would be to report that 70% (100-30) of the scores were 7 or higher.
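The frequency, percent, and cumulative percent columns can also be computed with a few lines of code. The following is a minimal Python sketch, not part of the course software; the variable names are illustrative. It tallies the ten scores with collections.Counter and builds the cumulative percent from a running total of the frequencies.

```python
from collections import Counter

scores = [7, 5, 8, 9, 7, 6, 9, 8, 7, 4]

counts = Counter(scores)  # tally of each observed score
total = len(scores)       # the "whole" used for the percentages

rows = []
running_count = 0
for value in sorted(counts):
    freq = counts[value]
    running_count += freq
    percent = freq / total * 100
    cumulative_percent = running_count / total * 100
    rows.append((value, freq, percent, cumulative_percent))

print("Score\tFrequency\tPercent\tCumulative Percent")
for value, freq, percent, cumulative_percent in rows:
    print(f"{value}\t{freq}\t{percent:.0f}\t{cumulative_percent:.0f}")
print(f"sum\t{total}\t100")

# The mode is the score with the largest frequency count.
mode_value, mode_count = counts.most_common(1)[0]
print("Mode:", mode_value, "observed", mode_count, "times")
```

Accumulating the frequencies first and converting to a percentage afterward avoids small rounding drift that can appear when percentages themselves are added up.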

The previous example is very much simplified in order to introduce the meaning of the columns of a frequency distribution. In practice, frequency distributions involve more values and, quite often, researchers will group the values in order to make the frequency distribution more understandable. Here is another, more involved, example. Suppose you've conducted a survey in which you asked your colleagues to report their age (and they actually did). The data that you received are the following:

24, 53, 44, 30, 34, 36, 59, 61, 30, 48,
22, 65, 31, 33, 40, 37, 27, 29, 25, 23,
39, 30, 40, 55, 50, 38, 32, 28, 31, 30,
62, 41, 38, 66, 29, 31, 28, 24, 39, 64,
57, 53, 38, 32, 30, 25, 28, 44, 37, 50

Glancing at this list of numbers doesn't help to understand the ages of the survey participants. Let's construct a grouped frequency distribution by trying to implement the following suggested guidelines. The official term for a group is a class interval, and the term, bin, is used frequently as well.

Use a class interval that includes 2, 5, 10, or 20 values
Aim for 10-20 class intervals (range / number of class intervals = class interval width)
Tally observed values into class intervals - result gives frequency counts for each class interval (number of observed values within each class interval)

For the group of 50 ages listed above, the maximum age is 66 and the minimum is 22; so, the range is 66-22 or 44. If we divide 44 by 5, we get approximately 9 class intervals. A typical grouping of ages is into 5-year intervals (e.g., 20-24, 25-29). To sufficiently cover all of the ages, the lowest category should be 20-24 and the highest category should be 65-69, which gives 10 class intervals. In the frequency distribution table below, notice that none of the class intervals overlap and that there are no holes between intervals. [We're using the standard convention that you are 29 up until your 30th birthday - there are other ways to measure age, especially with younger people.]

Age	Frequency	Percent	Cumulative Percent
20-24	4	8	8
25-29	8	16	24
30-34	12	24	48
35-39	8	16	64
40-44	5	10	74
45-49	1	2	76
50-54	4	8	84
55-59	3	6	90
60-64	3	6	96
65-69	2	4	100
sum	50	100

This table might help you understand the ages a bit better. For example, the largest age group, which is called the modal class interval, is the range of ages from 30-34, representing 24% of the survey participants. Based on the grouped frequencies, we can't tell that the actual mode is 30, which occurs 5 times. For that information, you need to inspect the individual values, not the grouped values. From the grouped frequencies, you can also tell that 24% of participants are under 30 years of age and that 24% (100-76) are 50 or older.
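The grouped table can be rebuilt from the raw ages. The sketch below is a minimal Python illustration under the choices made in the text (5-year class intervals starting at 20); the variable names are illustrative. It also looks up the modal class interval and the actual mode of the ungrouped values.

```python
from collections import Counter

ages = [
    24, 53, 44, 30, 34, 36, 59, 61, 30, 48,
    22, 65, 31, 33, 40, 37, 27, 29, 25, 23,
    39, 30, 40, 55, 50, 38, 32, 28, 31, 30,
    62, 41, 38, 66, 29, 31, 28, 24, 39, 64,
    57, 53, 38, 32, 30, 25, 28, 44, 37, 50,
]

width = 5   # class interval width used in the text
start = 20  # lower limit of the first class interval

rows = []
running_count = 0
for lower in range(start, max(ages) + 1, width):
    upper = lower + width - 1
    # Intervals do not overlap and leave no holes: lower <= age <= upper.
    freq = sum(lower <= age <= upper for age in ages)
    running_count += freq
    rows.append((
        f"{lower}-{upper}",
        freq,
        freq / len(ages) * 100,
        running_count / len(ages) * 100,
    ))

print("Age\tFrequency\tPercent\tCumulative Percent")
for label, freq, percent, cumulative_percent in rows:
    print(f"{label}\t{freq}\t{percent:.0f}\t{cumulative_percent:.0f}")

# Modal class interval (from grouped data) versus the mode (from raw data).
modal_interval = max(rows, key=lambda row: row[1])[0]
mode_value, mode_count = Counter(ages).most_common(1)[0]
print("Modal class interval:", modal_interval)
print("Mode of the raw ages:", mode_value, "occurring", mode_count, "times")
```

Changing width or start in this sketch is a quick way to see how the choice of class intervals changes the shape of the grouped distribution.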

It may be somewhat difficult to tell just by looking at the table, but as we will see later, this distribution has relatively more small values than large ones. This situation represents what statisticians call a skew in the data. The skewness statistic measures the lack of symmetry of a distribution. If the mean is larger than the median, there generally is a positive skew. If the mean is smaller than the median, there generally is a negative skew. If the distribution is symmetric, the mean and median are equal and the skewness is 0.

There is one more descriptive statistic for a distribution - the kurtosis, which measures the peakedness of a distribution. A positive kurtosis represents a peaked distribution, a negative kurtosis represents a flat distribution, and a kurtosis of 0 represents a bell shape. We won't spend much time studying the kurtosis or the skewness. There are very thorough explanations of them here: http://jalt.org/test/bro_1.htm. There is one application of skewness that is worthy of a short note. Before that, review this output from SPSS about the full (not grouped) distribution of ages that was listed previously.

table of descriptive statistics for age example

So, what do the skewness and kurtosis indicate? First, because the kurtosis statistic is negative (-.497), this distribution is flatter than a standard, bell-shaped curve. Second, because the skewness is positive (also note that the mean is larger than the median), this distribution is not symmetric. In fact, as we'll discuss in the next module, this lack of symmetry is a warning signal about the accuracy of the mean as a measure of the center. Statisticians compute a skewness ratio and determine that the distribution is severely skewed if the ratio is 2 or higher or -2 or lower. SPSS does not calculate the skewness ratio, but it does give you the two numbers you need to compute it yourself. The skewness ratio equals the skewness divided by the standard error of the skewness. In this example, the skewness ratio is .784/.337 or 2.33, which indicates a severely, positively skewed distribution because the ratio is larger than 2. In this case, the mean should only be used with caution when describing the distribution.
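For readers without SPSS, the skewness ratio can be approximated in Python. The sketch below rests on an assumption: it uses the adjusted Fisher-Pearson skewness formula and the usual large-sample formula for its standard error, which are the formulas commonly documented for SPSS-style output; the text itself reports only the resulting numbers. The function names are illustrative.

```python
import math
import statistics

ages = [
    24, 53, 44, 30, 34, 36, 59, 61, 30, 48,
    22, 65, 31, 33, 40, 37, 27, 29, 25, 23,
    39, 30, 40, 55, 50, 38, 32, 28, 31, 30,
    62, 41, 38, 66, 29, 31, 28, 24, 39, 64,
    57, 53, 38, 32, 30, 25, 28, 44, 37, 50,
]


def sample_skewness(values):
    """Adjusted Fisher-Pearson skewness (assumed to match the SPSS output)."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    cubed_z = sum(((x - mean) / sd) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed_z


def skewness_standard_error(n):
    """Standard error of the skewness for a sample of size n (assumed formula)."""
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


skew = sample_skewness(ages)
std_error = skewness_standard_error(len(ages))
skewness_ratio = skew / std_error

print("Mean:", statistics.mean(ages), "Median:", statistics.median(ages))
print("Skewness:", round(skew, 3))
print("Standard error of skewness:", round(std_error, 3))
print("Skewness ratio:", round(skewness_ratio, 2))
```

Under these assumptions the results should come out close to the figures quoted above; other packages may use a slightly different skewness formula and therefore report slightly different values.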

Summary of the Relationship Between Descriptive Statistics and Distributions

Where is the center of the distribution? Answered by the mean
How spread out are the values from the center? Answered by the standard deviation
How symmetric is the distribution? Answered by the skewness
How peaked or flat is the distribution? Answered by the kurtosis

Graphical Displays

The graphical version of a frequency distribution is called a histogram. Here are two examples of histograms for the grouped age data, which were generated using Excel and SPSS. Review them and compare them with each other and with the grouped frequencies presented in the table above. Can you visualize where the mean, median, and mode are? Can you see why the distribution is skewed? The relatively fewer ages on the right side pull the mean away from the larger grouping of ages on the left.

histogram of grouped ages from Excel

Figure 1. Frequency distribution of participants' ages, generated using Excel.

histogram of grouped ages from SPSS

Figure 2. Frequency distribution of participants' ages, generated using SPSS.

These graphs are usually generated with the aid of a computer program, but here are the steps to construct one by hand:

Draw the axes - the horizontal (x-axis) indicates the distribution values, class intervals; the vertical (y-axis) indicates the frequency counts.
Mark endpoints of each class interval.
Label each class interval with the midpoint, the range, or the endpoints.
Plot the tally (frequency count) for each class interval.
Construct rectangles for each class interval.
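The steps above can be imitated in a rough text-only form. The following Python sketch is an illustration, not a replacement for the Excel or SPSS output: it assumes the same 5-year class intervals used in the grouped table, labels each interval by its endpoints, and draws each "rectangle" sideways as a row of # characters whose length equals the frequency count.

```python
ages = [
    24, 53, 44, 30, 34, 36, 59, 61, 30, 48,
    22, 65, 31, 33, 40, 37, 27, 29, 25, 23,
    39, 30, 40, 55, 50, 38, 32, 28, 31, 30,
    62, 41, 38, 66, 29, 31, 28, 24, 39, 64,
    57, 53, 38, 32, 30, 25, 28, 44, 37, 50,
]

width = 5  # class interval width, as in the grouped table

bars = []
for lower in range(20, 70, width):  # mark the endpoints of each class interval
    upper = lower + width - 1
    freq = sum(lower <= age <= upper for age in ages)  # tally for the interval
    bars.append((f"{lower}-{upper}", freq))            # label by endpoints

# Draw one sideways bar per class interval.
for label, freq in bars:
    print(f"{label} | {'#' * freq} ({freq})")
```

Because the bars run left to right rather than bottom to top, the output is the histogram rotated by 90 degrees; the relative bar lengths are what matter.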

The APA manual has separate sections on properly formatting tables and figures. The figures included above are properly formatted according to APA guidelines, with captions below the graphs. Please note that APA-formatted figures do not have titles - they have captions instead.

Here is an example of a properly APA-formatted table. Note that all numbers are aligned to the right. If the numbers included decimal points, the decimal points would provide the alignment point. Microsoft Word and other word-processing programs have special tab settings for tab-alignment of numbers. Notice also that APA-formatted tables have no vertical borders. Please consult the APA manual for additional examples. All tables and figures included in the required assignments for the course must be formatted to APA specifications.

Table 1
Distribution of Ages for Survey Participants (n=50)
Age	Frequency	Percent	Cumulative percent

20-24	4	8	8
25-29	8	16	24
30-34	12	24	48
35-39	8	16	64
40-44	5	10	74
45-49	1	2	76
50-54	4	8	84
55-59	3	6	90
60-64	3	6	96
65-69	2	4	100
Total	50	100

Histograms are only one of the many types of graphical displays available for numerical data. The type of graph that you should choose depends on the information that you are trying to communicate. As you read research reports, begin to develop an understanding of the graphs that researchers typically use. Here are some other types of graphical displays. For these examples, you will see some additional variables for the sample of 50 survey participants.

Stem-and-Leaf Plot

The stem-and-leaf plot is not used very often in research reports, but it is used to help students understand the transition from a frequency distribution in table form to a histogram, which displays the distribution in graphical form. The stem-and-leaf plot is appropriate for ordinal level variables or higher. Here is the stem-and-leaf plot for the ages of the 50 survey participants.

Age Stem-and-Leaf Plot

 Frequency    Stem &  Leaf

     4.00        2 .  2344
     8.00        2 .  55788899
    12.00        3 .  000001112234
     8.00        3 .  67788899
     5.00        4 .  00144
     1.00        4 .  8
     4.00        5 .  0033
     3.00        5 .  579
     3.00        6 .  124
     2.00        6 .  56

 Stem width:   10
 Each leaf:    1 case(s)

Before reading the explanation of this plot, try this: look at the right side of the plot by tilting your head to the right - if you have printed this or are viewing on a laptop, rotate the image 90 degrees counterclockwise. Can you see that the stem-and-leaf plot resembles the histogram? The numbers of leaves (i.e., frequencies) represent the heights of the histogram's bars. The stem-and-leaf plot displays an ordered list of the values in the distribution. The stem represents the first digit, the 10s digit - in this example, the stems represent 20, 30, 40, 50, and 60. The leaves represent the second digit, which in this case is the 1s digit. So, the numbers in the top row represent 22, 23, 24, and 24. Notice that there is actually more information in the stem-and-leaf plot than in the grouped histogram because we can re-generate the entire distribution. With the first bar of the histogram, we only know that there are four values between 20 and 24, but we don't know exactly which four values they are. The mode is fairly easy to find in this display - can you find it?
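A stem-and-leaf plot is simple enough to generate with a short program. The Python sketch below is one possible approach, not the SPSS procedure itself: it assumes two-digit values, uses the 10s digit as the stem and the 1s digit as the leaf, and splits each stem into two rows (leaves 0-4 and leaves 5-9) to mirror the layout shown above.

```python
from collections import defaultdict

ages = [
    24, 53, 44, 30, 34, 36, 59, 61, 30, 48,
    22, 65, 31, 33, 40, 37, 27, 29, 25, 23,
    39, 30, 40, 55, 50, 38, 32, 28, 31, 30,
    62, 41, 38, 66, 29, 31, 28, 24, 39, 64,
    57, 53, 38, 32, 30, 25, 28, 44, 37, 50,
]

leaves_by_row = defaultdict(list)
for age in sorted(ages):            # sorting keeps the leaves in order
    stem, leaf = divmod(age, 10)    # 10s digit is the stem, 1s digit the leaf
    half = 0 if leaf < 5 else 1     # split each stem: leaves 0-4, then 5-9
    leaves_by_row[(stem, half)].append(str(leaf))

plot_lines = []
for stem, half in sorted(leaves_by_row):
    leaves = "".join(leaves_by_row[(stem, half)])
    plot_lines.append(f"{len(leaves):5.2f}  {stem} . {leaves}")

print("Frequency  Stem & Leaf")
for line in plot_lines:
    print(line)
```

Each printed row shows the frequency, the stem, and the ordered leaves, so the original 50 values can be read back from the output.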

Boxplot

A common way of describing a distribution is called the 5-number summary. These five numbers are used to draw a boxplot, also called a box-and-whiskers plot. The five numbers are the minimum value, the first quartile (Q1), the second quartile (Q2, the median), the third quartile (Q3), and the maximum value. These are plotted on a graph vertically with vertical lines drawn between the bottom two points and the top two points. The upper and lower vertical lines are the whiskers. Three horizontal lines are drawn through the first, second, and third quartiles. Then two vertical lines are drawn connecting the ends of the horizontal lines to form the box. The result looks like this:
boxplot of ages

The bottom and top points are the minimum (22) and maximum (66) values. The longer horizontal lines represent Q1 (29.75), Q2 or the median (36.5), and Q3 (48.5). The height of the box is the interquartile range. What should you understand about the distribution by looking at this boxplot? First, realize that each section between consecutive horizontal lines represents 25% of the values. Notice that the lower whisker is shorter than the upper whisker, which means that the lower values are less spread out than the higher ones. Glance back at the stem-and-leaf plot or the histogram and compare the spread of the lower values with that of the higher values. The histogram bars are taller for the younger ages and shorter for the older ages.
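The five numbers behind the boxplot can be computed directly. This Python sketch uses statistics.quantiles from the standard library (Python 3.8 or later) with its default "exclusive" method; treating that method as the one behind the quartiles quoted above is an assumption, since quartile conventions differ between software packages.

```python
import statistics

ages = [
    24, 53, 44, 30, 34, 36, 59, 61, 30, 48,
    22, 65, 31, 33, 40, 37, 27, 29, 25, 23,
    39, 30, 40, 55, 50, 38, 32, 28, 31, 30,
    62, 41, 38, 66, 29, 31, 28, 24, 39, 64,
    57, 53, 38, 32, 30, 25, 28, 44, 37, 50,
]

# Three cut points that divide the sorted data into four equal parts.
q1, q2, q3 = statistics.quantiles(ages, n=4)

five_number_summary = (min(ages), q1, q2, q3, max(ages))
interquartile_range = q3 - q1  # the height of the box

print("Minimum, Q1, median, Q3, maximum:", five_number_summary)
print("Interquartile range:", interquartile_range)
```

Passing method="inclusive" to statistics.quantiles gives slightly different quartiles for the same data, which is a reminder to check which convention a given program uses before comparing results.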

Now let's see how a boxplot can identify outliers (remember, those are unusually large or small values in the distribution). For this example, data for an 80-year-old participant will be added to the distribution and a new boxplot will be generated.

boxplot with outlier

The new boxplot looks similar to the previous one except for the little circle labeled 51 (the 51 is the case number for the value) at the top. SPSS has designated this new value as an outlier because it is more than 1.5 times the interquartile range away from the top of the box (Q3). For values that are more than 3 times the interquartile range away from the top (or bottom) of the box, SPSS designates these to be extreme values and displays a different symbol. Outliers can distort the mean; so, the boxplot is a useful tool for identifying them. You can either argue to justify the removal of any outliers from the data set or use alternative statistics that are insensitive to extreme values.
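The 1.5 and 3 times rules can be checked with a short calculation. The Python sketch below makes one assumption worth stating loudly: it takes the edges of the box to be Tukey's hinges (the medians of the lower and upper halves of the sorted data), a convention often used for boxplots; other quartile conventions move the fences slightly and can change which values are flagged. The function name is illustrative.

```python
import statistics

ages = [
    24, 53, 44, 30, 34, 36, 59, 61, 30, 48,
    22, 65, 31, 33, 40, 37, 27, 29, 25, 23,
    39, 30, 40, 55, 50, 38, 32, 28, 31, 30,
    62, 41, 38, 66, 29, 31, 28, 24, 39, 64,
    57, 53, 38, 32, 30, 25, 28, 44, 37, 50,
]
ages_with_outlier = ages + [80]  # add the 80-year-old participant (case 51)


def tukey_hinges(values):
    """Return the lower and upper hinges (assumed box edges)."""
    data = sorted(values)
    half = (len(data) + 1) // 2  # each half includes the median when n is odd
    return statistics.median(data[:half]), statistics.median(data[-half:])


lower_hinge, upper_hinge = tukey_hinges(ages_with_outlier)
box_height = upper_hinge - lower_hinge

# Values beyond 1.5 box heights from the box are outliers;
# values beyond 3 box heights are extreme values.
outliers = [
    age for age in ages_with_outlier
    if age < lower_hinge - 1.5 * box_height or age > upper_hinge + 1.5 * box_height
]
extremes = [
    age for age in ages_with_outlier
    if age < lower_hinge - 3 * box_height or age > upper_hinge + 3 * box_height
]

print("Box edges:", lower_hinge, upper_hinge)
print("Outliers:", outliers)
print("Extreme values:", extremes)
```

With these assumptions the 80-year-old participant is flagged as an outlier but not as an extreme value, which agrees with the single circle described above.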

Pie Chart

A pie chart is appropriate to display a distribution of nominal (categorical or qualitative) data. The survey asked participants to select an ethnicity with which they identified. Here is the pie chart for their selections.

pie chart of ethnicities

Just a note - colorful graphics are fine for dissertation defense presentations and conference or workshop displays, but figures in your printed dissertation need to be in shades of gray.

Column and Bar Graphs

Column and bar graphs are also appropriate for nominal data. Column and bar graphs are similar to histograms, but there is an important difference. The horizontal axis (x-axis) for a histogram always represents an ordinal-level variable or higher. Column and bar graphs usually portray nominal variables. By combining two nominal variables, a bar chart can support comparisons between groups. Here is a grouped bar chart comparing ethnicities of female and male participants.

bar graph of ethnicity by gender

Line Graphs

Line graphs are often used to show changes in phenomena over time. In this example, a survey question asked about participants' attitudes in the fall and in the spring. The means for the attitudes at the two times were calculated for females and for males. The lines illustrate how male attitudes decreased over time while female attitudes increased.
line graph of attitudes

Beyond Univariate Statistics

Before introducing the last type of graphical display, a clearer distinction should be made about the data source for these graphs. So far, the statistics that you've studied have involved a single variable, and so, they are called univariate statistics. The two preceding graphs are examples of ways to combine multiple variables in order to make comparisons. In the next section, you'll learn about statistics that involve two variables, called bivariate statistics.

Scatterplot

A scatterplot represents the distribution of two variables simultaneously. The values for one variable (sometimes called the independent variable) are plotted based on the horizontal axis (x-axis). The values of the other variable (sometimes called the dependent variable) are plotted based on the vertical axis (y-axis). Each survey participant answered a question about age and another question about years of work experience. Age is plotted along the x-axis and years of work experience along the y-axis. Each dot represents a pair of answers for an individual participant. Study the graph below and describe any general trends that you notice. Why are there no points in the upper left portion of the graph?

scatterplot of age and experience

In addition to observing general patterns, scatterplots also help to identify unique individuals. In this example, the three dots in the lower right part of the graph represent older participants with relatively few years of work experience. Depending on the purposes of the study and the research questions being investigated, these individuals might be sought out for a deeper understanding of their experiences.

In the next section, you'll learn about a single statistic that attempts to describe the relationship between two variables.