Measures of Central Tendency and Measures of Dispersion

What is central tendency?

By describing the central tendency of a set of data, statisticians attempt to summarize the entire set using only one number. The center or average (mean); middle value or halfway point (median); or typical or most common value (mode) all attempt to summarize the data. These summaries represent three different ways to report the central tendency. Why are there different ways to determine this summary? The reason relates to the levels of measurement that were introduced previously. Let's explore these three measures (also called descriptive statistics) within the context of an example and learn when each is appropriate to use.

Before introducing the mean, there is another term that you should know. A set of data is called a distribution. In the remainder of these notes, the term, distribution, will be used to describe a set of data or numbers that have been collected and that represent different quantified observations of phenomena.

The Mean (M or X̄)

You probably know this descriptive statistic better by its more common name, the average. Because there are many averages, statisticians like to be more precise with the terms they use. So, what you probably refer to as the average of a distribution, we will join statisticians in calling the mean, or arithmetic mean, to be very precise. There are other forms of means that you may see in research reports but which are infrequently used. Before explaining the routine to calculate this number, consider what it represents. One way to think of this number is the following. Suppose you had 10 test scores, and you had a brick for each score (all equally weighted). Then suppose you had a board that had the scale for the scores drawn on it. One end would represent 0% and the other would represent 100%. The board is balanced, without any scores on it yet, at 50%, which is halfway between the two ends. Now you place the 10 bricks on the scores that they represent, stacking them up if two scores are the same. When you finish placing the bricks, it is very likely that the board isn't balanced at 50% anymore. Instead you will have to adjust the board until it is balanced. The number on the test score scale that matches the balancing point is the mean score. Getting bricks and a board and balancing the arrangement of bricks isn't the most efficient way to calculate the mean. Let's look at the calculation steps instead.

1. Add up all of the values (e.g., test scores).
2. Divide the sum in Step 1 by the number of values (e.g., number of tests).

The mathematical formula looks like this:

X̄ = ∑X / n

The first step of adding all the numbers happens quite frequently in statistical calculations; so statisticians have chosen to use the uppercase Greek letter sigma, ∑, to represent summation. Just remember the S in Sigma and the S in Summation to link the symbol with the process. Also, n represents the number of values that we're adding. The symbols used for the mean include M and X̄, pronounced X-bar.

Now let's use the formula with a small distribution of test scores. Even though this example will only involve a few scores, the procedure for calculating the mean is the same no matter how many scores there are.

Scores: 7, 5, 8, 9, 7, 6, 9, 8, 7, 4

Step 1. Add the values. ∑X = 7 + 5 + 8 + 9 + 7 + 6 + 9 + 8 + 7 + 4 = 70
Step 2. Divide by the number of values. X̄ = 70/10 = 7

If you have Excel, you can use it to calculate the mean. First, enter the numbers in a column - one number in each cell. Let's say that the numbers are in cells A1 through A10. Then, enter the following formula in cell A11 (or any other empty cell): = average(A1:A10)
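If you prefer a programming language to a spreadsheet, the same two steps can be sketched in Python. This is only an illustration using the ten scores from the example; the variable names are made up for this sketch.

```python
# Scores from the worked example above.
scores = [7, 5, 8, 9, 7, 6, 9, 8, 7, 4]

# Step 1: add up all of the values.
total = sum(scores)

# Step 2: divide the sum by the number of values.
n = len(scores)
mean = total / n

print(f"sum = {total}, n = {n}, mean = {mean}")
```

The printed sum and mean should match the hand calculation (70 and 7).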

That should be fairly straightforward. So, why isn't the mean sufficient for describing central tendency for all distributions? There are two reasons. The first reason is that the first step in the process, namely adding the values, doesn't always make sense. Think of a variable that represents ethnicity. Different ethnic classifications are arbitrarily assigned different numerical values. African American may be assigned a 1, while Asian is assigned a 3, and Caucasian is assigned a 5. Because these assignments are arbitrary, the actual values, as quantities, mean nothing. So adding them together would make no sense. Calculating a mean score for ethnicity would result in a meaningless number. Means make no sense for variables measured at a nominal level.

The second reason involves the mean's sensitivity to extreme values. What does this refer to? Recall the bricks and board example (or think of a teeter-totter). Placing a new brick near the middle (balancing point) has only a small effect on the balance, but if you placed a new brick at one of the far ends, it would have a great effect on the balance. This effect is larger when there is a small number of bricks. The brick on the far end is called an outlier. The presence of outliers makes the mean inaccurate as a measure of the center. An alternative measure of central tendency is needed in situations with outliers.

The Median (Mdn)

The second central tendency measure we'll study is the median. Think of the median as the middle point or halfway point of a distribution. Here is an example. Let's say that you had a group of nine middle-school students. As is often the case at this age, one of the students is unusually tall compared with the others in the group. Calculating the mean would be inappropriate because of the effect this one unusually tall student would have on the calculated mean. Instead, you decide to describe the "average" height as that of the halfway point. The first step is to line up all of the students according to their height - shortest to tallest. Then, because there are nine students, you go to the fifth student and measure her height. Her height is the median height for the group. Why? Because there are four students who are taller than she is and four students who are shorter than she is. So, the two steps for determining the median are to 1) sort the values and 2) pick the middle value. If you are following along closely, you should have a question - what if there is an even number of values? Suppose we have 10 students instead of 9 - how is the median determined in this case? There is just a slight twist - instead of picking the one middle value (because there isn't just one), you pick the middle two values. In our example of 10 students, the middle two are the fifth and sixth students. Then you average their heights by adding their heights and dividing by 2. Their average is the median for the group.

1. Sort the values from smallest to largest.
2. If there is an odd number of values, choose the middle value. If there is an even number of values, choose the middle two and average them - add them and divide by 2.

Scores: 7, 5, 8, 9, 7, 6, 9, 8, 7, 4

Step 1. Sort the values. 4, 5, 6, 7, 7, 7, 8, 8, 9, 9
Step 2. Because there are 10 scores, pick the middle two scores, which are 7 and 7. Add them, 7 + 7 = 14, and then divide by 2, 14/2 = 7.

If you have Excel, you can use it to calculate the median. First, enter the numbers in a column - one number in each cell. Let's say that the numbers are in cells A1 through A10. Then, enter the following formula in cell A11 (or any other empty cell): = median(A1:A10)
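The sort-then-pick-the-middle procedure can also be written out as a short Python function. This is a minimal sketch; the function name `median` is chosen here for illustration, and it assumes the list is not empty.

```python
def median(values):
    """Return the median: sort the values, then take the middle one(s)."""
    ordered = sorted(values)      # Step 1: sort from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of values: the single middle value.
        return ordered[mid]
    # Even number of values: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


scores = [7, 5, 8, 9, 7, 6, 9, 8, 7, 4]
print(median(scores))
```

With ten scores the function takes the even-count branch and averages the fifth and sixth sorted values, just as in Step 2 above.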

The median is insensitive to extreme values. In the example, you could replace one of the 9s with 100 and the median would not change. Likewise, you could replace the 4 with a 0 and the median would not change either. This illustrates the insensitivity of the median to extreme values. There are situations when neither the mean nor the median is appropriate to use. Because the first step in determining the median requires that the values be sorted, the measurement level of the variable must be at the ordinal level or higher. Like the case with the mean, reporting a median score for a nominal variable is meaningless.
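To see the difference in sensitivity for yourself, here is a small Python sketch that replaces one of the 9s with 100 and recomputes both statistics. It uses the `mean` and `median` functions from Python's standard `statistics` module; the variable names are just for this illustration.

```python
from statistics import mean, median

scores = [7, 5, 8, 9, 7, 6, 9, 8, 7, 4]

# Replace one of the 9s with an extreme value of 100.
with_outlier = scores.copy()
with_outlier[with_outlier.index(9)] = 100

print("original:     mean =", mean(scores), " median =", median(scores))
print("with outlier: mean =", mean(with_outlier), " median =", median(with_outlier))
```

The mean jumps from 7 to 16.1, a value larger than nine of the ten scores, while the median stays at 7.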

The Mode

So you have a nominal variable, like ethnicity, and want to consolidate the information and present a summary of this variable. Neither the mean nor the median is appropriate to use for nominal variables. Your only remaining option is to report the mode. The mode represents the most frequently occurring value. For the ethnicity variable, the mode represents the largest ethnic subgroup for a group of people. You might use the mode for other nominal variables, such as gender, eye color, or handedness. There might be occasions in which the mode helps you understand other types of variables. If you are arranging a classroom for groupwork, the mode for group size would tell you how many students to put at most tables. If you are ordering graduation caps and gowns, the modes for cap size and gown size indicate the most common size to order. Someone ordering shoes for a store would order more shoes for the modal (adjective for mode) foot size. Someone arranging tables at a restaurant would set the tables for the modal dining party size. Let's see how the mode is derived.

1. List the unique values that occur.
2. Tally the number of occurrences for each value - the one with the largest tally is the mode.

Scores: 7, 5, 8, 9, 7, 6, 9, 8, 7, 4

Step 1. List the unique values that occur.

Values: 4, 5, 6, 7, 8, 9

Step 2. Tally the occurrences

Values	Tally (frequency)
4	1
5	1
6	1
7	3
8	2
9	2

So, what is the mode? Is it 7 or 3? The mode is the most frequently occurring value, which is 7. The number of times that the modal value occurs is 3. So, in this example, the mode is 7. Notice that for this distribution the mean, median, and mode are all the same. This happens in certain situations but in most cases these three descriptive statistics will be different. Notice that in this distribution there is only one mode. Sometimes there are two modes or even more. In the context of education, you may have heard a group of students referred to as being bimodal, which would indicate that there are two distinct groups - perhaps good readers and struggling readers, for example.

If you have Excel, you can use it to calculate the mode. First, enter the numbers in a column - one number in each cell. Let's say that the numbers are in cells A1 through A10. Then, enter the following formula in cell A11 (or any other empty cell): = mode(A1:A10)
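The tally approach translates directly into Python. The sketch below uses `Counter` from the standard `collections` module to build the tally and returns a list of modes, since a distribution may have more than one; the variable names are assumptions made for this example.

```python
from collections import Counter

scores = [7, 5, 8, 9, 7, 6, 9, 8, 7, 4]

# Steps 1 and 2: list the unique values and tally their occurrences.
tally = Counter(scores)

# The mode is every value whose tally equals the largest tally.
largest = max(tally.values())
modes = sorted(value for value, count in tally.items() if count == largest)

for value in sorted(tally):
    print(value, tally[value])
print("mode(s):", modes, "frequency:", largest)
```

Keeping the mode (7) separate from its frequency (3) in the output helps avoid the confusion between the two that was mentioned above.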

Main Ideas about Central Tendency

What are the main ideas to remember about measures of central tendency?

There are three main types of measures: mean, median, and mode.
Each measure attempts to provide a summary for a distribution, namely where the center occurs.
Which measure to use depends on the specific characteristics of the distribution.

The only appropriate statistic for a nominal variable is the mode.
Medians are used for ordinal variables and interval or ratio variables with outliers.
The mean is used with interval or ratio variables without outliers.

Each measure has a particular process for determining its value.

The mode is determined from a tally of values.
The median is determined from a sorted list of values.
The mean is calculated by adding the values and dividing by the number of values.

Whenever something is summarized, numerically or otherwise, information is lost. Furthermore, when you choose a particular measure to represent a distribution, you are forced to make a compromise. Statisticians have introduced alternative measures of central tendency to reduce the limitations of the more traditional measures. For example, you may read about the trimean or trimmed means. The trimean involves a weighted average of the median with two other numbers that we'll meet in the next section, called the first and third quartiles. In calculating a trimmed mean, some percentage of the extreme scores are ignored and a traditional mean is calculated using the other scores. You may have seen trimmed means in action during judged sporting events, such as Olympic performances, where the highest and lowest marks are thrown out and the score is determined from the remaining judges' marks.
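As an illustration of the trimmed mean, here is a short Python sketch in the style of judged sporting events: it drops the lowest and highest value and averages what remains. The function name `trimmed_mean` and its `drop_each_end` parameter are hypothetical, and trimming one score from each end is just one possible choice; other trimmed means drop a fixed percentage of the scores instead.

```python
def trimmed_mean(values, drop_each_end=1):
    """Mean after discarding `drop_each_end` values from each end of the sorted list."""
    ordered = sorted(values)
    kept = ordered[drop_each_end:len(ordered) - drop_each_end]
    if not kept:
        raise ValueError("too few values left after trimming")
    return sum(kept) / len(kept)


scores = [7, 5, 8, 9, 7, 6, 9, 8, 7, 4]
print(trimmed_mean(scores))
```

For the ten example scores, the 4 and one 9 are discarded and the remaining eight scores are averaged.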

The next section addresses the need to describe the spread of scores in addition to their central tendency. With these two descriptive statistics, more summary information can be provided.

What is dispersion?

Deriving one of the measures of central tendency provides some idea about the center of a distribution, but you might want to report how well the derived measure of central tendency represents the other values in the distribution. Stated another way, you might want to indicate how much the other values in the distribution are spread away from the center. They may be closely grouped around the center, or they may be quite widespread. Dispersion refers to the spread of the scores in a distribution. Other synonyms for dispersion include differences, spread, deviation, and variability. It is extremely important to realize that these measures of dispersion apply to ordinal-level variables or higher - dispersion, as it will be covered here, does not apply to nominal variables.

The Range

The first measure of dispersion that we'll consider is the range. The range represents the distance between the smallest value in the distribution and the largest value. For example, if the smallest is 3 and the largest is 9, then the range equals the difference between the two, 9-3 or 6. Notice that the range depends on two numbers only - the maximum value and the minimum value. None of the other values in the distribution affect the range. The other numbers could be grouped close to the mean or grouped near the extremes. As a result, the range by itself provides little information about the spread of the distribution. Here's how the range is calculated.

1. Locate the maximum value (Max) and the minimum value (Min).
2. Subtract the minimum value from the maximum value. Range = Max - Min.

Scores: 7, 5, 8, 9, 7, 6, 9, 8, 7, 4

Step 1. Max = 9 and Min = 4
Step 2. Range = 9 - 4 = 5

If you have Excel, you can use it to calculate the range. First, enter the numbers in a column - one number in each cell. Let's say that the numbers are in cells A1 through A10. Then, enter the following formula in cell A11 (or any other empty cell): = max(A1:A10) - min(A1:A10)
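The same calculation in Python takes two built-in functions. The sketch below also includes a second, made-up distribution (`clustered`, not part of the original example) to show that two quite different sets of scores can share the same range.

```python
scores = [7, 5, 8, 9, 7, 6, 9, 8, 7, 4]

# Hypothetical comparison data: same Min and Max, but tightly clustered at 7.
clustered = [4, 7, 7, 7, 7, 7, 7, 7, 7, 9]


def value_range(values):
    """Range = Max - Min."""
    return max(values) - min(values)


print(value_range(scores))
print(value_range(clustered))
```

Both calls give the same answer, which is why the range by itself says little about how the values in between are spread.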

The Interquartile Range

There is another type of range that we will see later when we look at graphs of distributions. It is called the interquartile range and can be thought of as the range for the middle 50% of the values in the distribution. Think of a distribution being sorted and then divided into four groups - the first 25%, the second 25%, the third 25%, and the fourth 25% of the scores. The values that separate these four groups are called the quartiles. We already studied the second quartile (Q2), because it is the median or halfway point, or 50th percentile, of the distribution. Half of the values occur above the median and half below. Twenty-five percent of the values occur below the 25th percentile or Q1 or the first quartile, and 75% of the values occur below the 75th percentile or Q3 or the third quartile. The interquartile range is defined as the difference between Q3 and Q1. Because calculating the quartiles and the interquartile range requires more information about the distribution, it is viewed as a more stable measure of a distribution's spread. It also plays an important role in visually portraying a distribution using a boxplot, which we'll see later. One final note - because the interquartile range only requires the values to be sorted to find Q1 and Q3, it is an appropriate measure for ordinal-level variables or higher. We won't be calculating this statistic much, but you will see it used in the next section.
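If you are curious how the quartiles come out for the ten example scores, here is a sketch using `quantiles` from Python's standard `statistics` module (available in Python 3.8 and later). Be aware that there are several conventions for locating quartiles in a small data set. The `method="inclusive"` setting used here is an assumption made for this sketch, and Excel, SPSS, or a textbook may report slightly different values for Q1 and Q3.

```python
from statistics import quantiles

scores = [7, 5, 8, 9, 7, 6, 9, 8, 7, 4]

# n=4 splits the sorted data into four groups and returns the three cut points.
# method="inclusive" is one of several quartile conventions (an assumption here).
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")

iqr = q3 - q1  # interquartile range: spread of the middle 50% of the values

print("Q1 =", q1, " Q2 (median) =", q2, " Q3 =", q3, " IQR =", iqr)
```

Under this convention Q2 agrees with the median found earlier, and the interquartile range comes out smaller than the range of 5 because it covers only the middle half of the scores.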

The Standard Deviation

For interval and ratio-level variables, where we are able to calculate the mean of a distribution, you might be thinking that it might be informative to derive some sort of average for how the values are spread out around the mean. For example, if the average spread is large, then the values are not grouped near the mean, and if the average spread is small, then the values are grouped close to the mean. Statisticians have derived just such a statistic; however, due to one of the properties of the mean, the simplest way to calculate the average spread always (yes, always) results in an average spread of 0. Consequently, the calculation of this average spread statistic, called the standard deviation, is more complicated. Understanding this calculation and what a standard deviation represents is extremely important for the rest of the material in the course and for correctly interpreting and using statistics in education in general. As you read through the steps, realize that Excel or SPSS performs the calculation for you, but you need to understand how these programs derive the statistic.
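You can check the "always 0" claim yourself. The following Python sketch takes the distance of each score from the mean, keeping the plus or minus sign, and adds those distances up; the variable names are chosen only for this illustration.

```python
scores = [7, 5, 8, 9, 7, 6, 9, 8, 7, 4]

mean = sum(scores) / len(scores)

# Signed distance of each score from the mean.
deviations = [x - mean for x in scores]

print(deviations)
print("sum of deviations =", sum(deviations))
```

The negative distances cancel the positive ones exactly, so their sum (and therefore their average) is 0. That is the balancing-point property of the mean at work.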

First, a bit of terminology. What is a deviation and what might make it standard? A deviation is the distance between a value in a distribution and the mean. Every value in the distribution has a deviation associated with it. Because averaging the deviations (with their plus and minus signs) always results in a value of 0, a simple average deviation is uninformative. Instead, statisticians use a process of squaring the deviations, averaging them, and then taking the square root of the result to generate a standard deviation. [Squaring is multiplying a number by itself - 3 squared is 3 × 3 or 9. Taking the square root is the opposite operation - the square root of 9 is 3. For numbers that are not perfect squares (perfect squares include 4, 9, 16, 25, 36, 49, 64, 81, and 100, among infinitely many others), it is handy to have a spreadsheet program or a calculator.] Here are the steps and the formula for the standard deviation. Refer back to these steps to understand the example presented after the formula.

1. Calculate the mean, X̄.
2. Subtract the mean from each value - these are the deviations.
3. Square the deviations.
4. Sum the squared deviations.
5. Divide the sum of squared deviations by n-1 - this is called the variance.
6. Calculate the square root of the variance - this is the standard deviation.

The mathematical formula looks like this:

standard deviation = √( ∑(X - X̄)² / (n - 1) )

If you have Excel, you can use it to calculate the standard deviation. First, enter the numbers in a column - one number in each cell. Let's say that the numbers are in cells A1 through A10. Then, enter the following formula in cell A11 (or any other empty cell): = stdev(A1:A10)

Scores: 7, 5, 8, 9, 7, 6, 9, 8, 7, 4

Step 1.    The mean that we calculated earlier is X̄ = 7.
Step 2.    The deviations are shown in the second column below.
Step 3.    The squared deviations are shown in the third column below.
Step 4.    Summing the squared deviations gives 24, as shown in the third to last cell of the third column.
Step 5.    Dividing 24 by n-1 (10-1=9) gives 2.67, as shown in the second to last cell of the third column.
Step 6.    Calculating the square root of 2.67 gives 1.63, as shown in the last cell of the third column.

X	X - X̄	(X - X̄)²
7	7 - 7 = 0	0 × 0 = 0
5	5 - 7 = -2	(-2) × (-2) = 4
8	8 - 7 = 1	1 × 1 = 1
9	9 - 7 = 2	2 × 2 = 4
7	7 - 7 = 0	0 × 0 = 0
6	6 - 7 = -1	(-1) × (-1) = 1
9	9 - 7 = 2	2 × 2 = 4
8	8 - 7 = 1	1 × 1 = 1
7	7 - 7 = 0	0 × 0 = 0
4	4 - 7 = -3	(-3) × (-3) = 9
	sum of squared deviations	24
	sum divided by n-1 (9)	2.67
	square root (standard deviation)	1.63
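The six steps can be traced in a few lines of Python. This sketch follows the table column by column and then cross-checks the result against `stdev` from the standard `statistics` module, which also divides by n-1. The variable names are chosen for this illustration.

```python
import math
from statistics import stdev

scores = [7, 5, 8, 9, 7, 6, 9, 8, 7, 4]
n = len(scores)

mean = sum(scores) / n                          # Step 1: the mean
deviations = [x - mean for x in scores]         # Step 2: value minus mean
squared = [d ** 2 for d in deviations]          # Step 3: square each deviation
sum_of_squares = sum(squared)                   # Step 4: sum the squares
variance = sum_of_squares / (n - 1)             # Step 5: divide by n-1
standard_deviation = math.sqrt(variance)        # Step 6: square root

print("sum of squared deviations:", sum_of_squares)
print("variance:", round(variance, 2))
print("standard deviation:", round(standard_deviation, 2))

# Cross-check against the library function (sample standard deviation).
print("library stdev:", round(stdev(scores), 2))
```

The rounded values should match the last three rows of the table: 24, 2.67, and 1.63.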

Just to reiterate, no one calculates the standard deviation by hand following the steps listed here; however, understanding that the standard deviation is a measure of the amount of variation (or spread) in a distribution is very important. For example, if you have two standard deviations for similar sets of data and one is quite a bit larger than the other, the means that describe the centers of the two distributions are not equally accurate. The mean with the smaller standard deviation is a better summary of its distribution than the mean with the larger standard deviation. Think of the standard deviation as a quality-control measure for the mean. In fact, means should never be reported without accompanying standard deviations.

Here is one way to remember what the standard deviation tells you. Suppose you are attending a conference and are offered your choice of two dorm rooms for your housing. The only difference between the two rooms has to do with the plumbing. Because it is farther from the water heater, the shower water temperature in one room fluctuates more than in the other. Let's say that the hot water in Room 652 has a mean temperature of 100 degrees and a standard deviation of 6 degrees. Room 247 also has a mean temperature of 100 degrees but its standard deviation is only 2 degrees. Assuming that the distribution of water temperatures resembles a bell-shaped curve (much more about this in the weeks ahead), the person showering in Room 247 will generally experience hot water temperatures between 96 and 104 degrees 95% of the time - generally tolerable. Consider the person showering in Room 652. The hot water temperature in that shower will fluctuate between 88 and 112 degrees 95% of the time. [The range of temperatures reflects plus or minus two standard deviations from the mean of 100 degrees.] Which room would you pick?
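The arithmetic behind the two temperature ranges is simply the mean plus or minus two standard deviations. Here is a small Python sketch of that calculation; the function name `two_sd_interval` is made up for this example, and the 95% interpretation still rests on the bell-shaped-curve assumption stated above.

```python
def two_sd_interval(mean, sd):
    """Return (low, high) = mean minus and plus two standard deviations."""
    return (mean - 2 * sd, mean + 2 * sd)


# (mean temperature, standard deviation) for each room, in degrees.
rooms = {"Room 247": (100, 2), "Room 652": (100, 6)}

for name, (mean, sd) in rooms.items():
    low, high = two_sd_interval(mean, sd)
    print(f"{name}: roughly {low} to {high} degrees about 95% of the time")
```

Tripling the standard deviation triples the width of the interval, even though the two rooms have the same mean.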

How does this apply to educational settings? Would you prefer to teach in a classroom where the standard deviation of students' reading scores is large or small? Of course, this is not a math question. Students with different skill levels can benefit from interaction with more and less capable peers. Large fluctuations in prerequisite skill levels can serve to frustrate the teacher as well as the students, however. If you are an administrator and you don't really understand standard deviations, you might create classrooms where the mean score levels of students are equal but where some classrooms have more variation than others. As we are focusing on raising mean scores of students, leaving no students behind, should we set goals to increase standard deviations as well or should these be decreasing?

Correlations, Reliability and Validity, and Linear Regression

Correlations

A correlation describes a relationship between two variables. Unlike descriptive statistics in previous sections, correlations require two or more distributions and are called bivariate (for two) or multivariate (for more than two) statistics. There are different types of correlations that correspond to different levels of measurement. The first correlation you'll study requires that the two variables be measured at the interval or ratio level. The full name of this statistic is the Pearson product-moment correlation coefficient, and it is denoted by the letter, r. In research reports, you'll see references to Pearson r, correlation, correlation coefficient, or just r.

The formula for calculating r is one of the most complex that you will see. Correlation coefficients can be calculated by hand, but most people use a spreadsheet or statistics program. Two important aspects of this formula are that both an X distribution and a Y distribution are involved in the formula and that in addition to squaring and taking the square root of quantities, X and Y are multiplied together in the numerator (upper half) of the formula. The result of multiplying two numbers is called the product, which is why "product" is part of the full name of the correlation coefficient. Visit this site: http://allpsych.com/stats/unit2/14.html if you want to learn about evaluating the formula.

The formula for r looks like this:

formula for the Pearson product-moment correlation coefficient

If you have Excel, you can use it to calculate r, the Pearson correlation coefficient. First, enter the numbers in two columns - one column for Xs and one column for Ys. Let's say that the numbers are in cells A1 through A10 and B1 through B10. Then, enter the following formula in cell A11 (or any other empty cell): = correl(A1:A10,B1:B10)
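For readers who want to see the ingredients of the calculation, here is a Python sketch of one common way to write r: the sum of the products of the X and Y deviations, divided by the square root of the product of the two sums of squared deviations. This form is algebraically equivalent to other textbook versions, but it may not look identical to the formula shown in these notes. The data values are invented purely for illustration and are not the AGE and EXPERIENCE data used later.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 pairs)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]           # X deviations
    dy = [y - mean_y for y in ys]           # Y deviations
    # Numerator: X and Y deviations multiplied together (the "product").
    sum_products = sum(a * b for a, b in zip(dx, dy))
    sum_sq_x = sum(a * a for a in dx)
    sum_sq_y = sum(b * b for b in dy)
    return sum_products / math.sqrt(sum_sq_x * sum_sq_y)


# Hypothetical data, made up for this sketch only.
x_values = [25, 30, 35, 40, 45, 50]
y_values = [2, 5, 11, 14, 20, 24]

print(round(pearson_r(x_values, y_values), 3))
```

Note that the function fails with a division by zero if every X (or every Y) is the same, because a variable that never varies cannot be correlated with anything.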

Before you can interpret the numerical value of r, you need to determine if it was appropriate to calculate r to begin with. The Pearson correlation coefficient, r, describes the strength and direction of a linear relationship between two variables. The first step in exploring a linear relationship is to generate the scatterplot, which you learned about in the previous section. By inspecting the scatterplot, you can assess whether a line describes the pattern or if some other curve might provide a better description. Similar to the mean, the accuracy of a correlation coefficient can also be compromised by outliers in either distribution. Visually inspecting the scatterplot can identify these cases.

As mentioned, Pearson r's describe both strength and direction of a relationship. The sign of r, either + or -, indicates the direction of the relationship. A positive r indicates a direct relationship: as X increases so does Y, and as X decreases so does Y. A negative r indicates an inverse (indirect) relationship: as X increases, Y decreases, and as X decreases, Y increases. The absolute value of r indicates the strength of the relationship. Pearson r's range from -1 to +1. Values close to -1 or +1 indicate a strong linear relationship - the associated scatterplot displays the pattern of dots in a nearly straight line. A positive Pearson r isn't necessarily better (i.e., stronger) than a negative r - you need to compare the values while ignoring the signs.

Here are some websites where you can explore the meaning of the strength and direction of the correlation coefficient: http://www.stattucino.com/berrie/dsl/correlation.html, http://www.stat.berkeley.edu/~stark/Java/Html/Correlation.htm, and http://www.stat.uiuc.edu/courses/stat100/java/GCApplet/GCAppletFrame.html.

Returning to our previous example involving age and years of work experience, here is the scatterplot that was generated earlier.

scatterplot of age and experience

The Pearson correlation coefficient associated with these two variables is shown in the following SPSS output.

correlation matrix for age and experience

This output is an example of the simplest form of a correlation matrix. This matrix has two rows and two columns, resulting in four cells. The upper left cell contains the correlation of AGE with AGE, which is always 1. The lower right cell is similar. The upper right and lower left cells are copies of each other - both contain the correlation of AGE with EXPERIENCE. The Pearson r is reported first - in this case, r = .834, which is based on 50 pairs of numbers (N). You will learn the meaning of the middle number later in the course.

What meaning can be associated with r = .834? First, because r is positive (greater than 0), older ages are associated with more years of work experience and younger ages are associated with fewer years of work experience. Second, because .834 is close to 1, the linear pattern between AGE and EXPERIENCE is strong, which means that the points fall fairly close to a straight line. Statisticians like to be precise with their interpretations; so, levels of strengths have been proposed. Ignoring the plus or minus sign, r's have been assigned the following strengths:

r value	Strength
0-.2	very weak
.2-.4	weak
.4-.6	moderate
.6-.8	strong
.8-1	very strong

An even more precise measure of strength is to use the Coefficient of Determination, r², which represents the amount of variation both variables share - an indication of some underlying characteristic that they have in common. In this example, AGE and EXPERIENCE share .834² or about 70% of their variation. The other 30% of unshared variation (also called unexplained variance) is due to the points that fall far from the straight line.
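Both the strength labels and the coefficient of determination are easy to compute once you have r. The Python sketch below is only an illustration: the function name `strength_label` is made up, and because the table's ranges touch at their endpoints, the sketch assumes that a boundary value such as .4 belongs to the higher category.

```python
def strength_label(r):
    """Map a correlation to the verbal labels in the table, ignoring the sign."""
    size = abs(r)
    if size > 1:
        raise ValueError("a Pearson r must lie between -1 and +1")
    # Assumption: boundary values (.2, .4, .6, .8) fall in the higher category.
    if size < 0.2:
        return "very weak"
    if size < 0.4:
        return "weak"
    if size < 0.6:
        return "moderate"
    if size < 0.8:
        return "strong"
    return "very strong"


r = 0.834
r_squared = r ** 2  # coefficient of determination

print(strength_label(r))
print(round(r_squared, 3))
```

Because the sign is ignored, r = -.834 would receive the same label and the same r² as r = .834.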

Often when studying correlations between two variables, you might be tempted to infer a cause-and-effect relationship between them. The existence of a meaningful correlation coefficient does not indicate causation. Among school-age children, for example, foot size and reading ability are correlated, but neither causes the other; both increase with age.

The Pearson correlation coefficient applies to pairs of interval or ratio-level variables. Here are some other types of correlations for different measurement levels.

X	Y	Correlation
nominal	nominal	Phi coefficient
nominal	ordinal	Rank biserial coefficient (r_rb)
nominal	interval/ratio	Point biserial coefficient (r_pb)
ordinal	ordinal	Spearman rank-order coefficient (rho)
interval/ratio	interval/ratio	Pearson r

Reliability and Validity

The concepts of reliability and validity refer to properties of the instruments used in quantitative research to operationally define important variables. In more general ways, these concepts relate to the overall quality of a research study, including how uniformly, or consistently, the procedures are carried out (reliability), how well the collected data and their analysis support the results or findings (internal validity), and whether the results or findings extend to other contexts, or generalize (external validity).

It is quite appropriate for these topics to be addressed just after correlations have been explained, because most of the ways in which researchers estimate the reliability and validity of measurement instruments, procedures, or the use of their results involve the use of correlations. Furthermore, embedded in the process of quantitative research is the process of converting educational constructs into numerical values (i.e., operational definitions), which must be continually scrutinized for its legitimacy.

Whenever you hear or read the word, reliability, think of the synonym, consistency. Don't confuse reliability with validity. I'm sure that you know someone who is very reliable but always late or always wrong. Their response or behavior is always the same, but that doesn't mean it is appropriate. Reliability is necessary, but not sufficient, for validity.

Type of Reliability	Application
Test-retest	Use this type of reliability estimate whenever you are measuring a trait over a period of time. Example: teacher job satisfaction during the school year
Parallel forms	Use this type of reliability estimate whenever you need different forms of the same test to measure the same trait. Example: multiple forms of the SAT
Internal consistency	Use this type of reliability estimate whenever you need to summarize scores on individual items by an overall score. Example: combining the 20 items on a statistics test to represent level of knowledge about a particular aspect of statistics
Interrater	Use this type of reliability estimate whenever you involve multiple raters in scoring tests. Example: AP essay test grading

Validity is described as a unitary concept or property that can have multiple forms of evidence. Whether a test is valid or invalid does not depend on the test itself, but rather, validity depends on how the test results are used. For example, a statistics quiz may be a valid way to measure understanding of a statistical concept, but the same quiz is not valid if you intend to assign grades in English composition based on it. This situation may seem to be quite far fetched, but educators and politicians who do not carefully investigate the properties of a quantitative measurement may be making serious mistakes when basing decisions on the data. Here are some of the forms of evidence of validity that are presented in the text - there are others that you may encounter.

Type of Validity	Application
Content	Use this type of validity estimate whenever you are comparing test items with a larger domain of knowledge. Example: assessing the breadth of a comprehensive final exam
Criterion	Use this type of validity estimate whenever you are comparing a new test with an established standard. Example: developing a test to predict genius (predictive), or developing a new test comparable to the CAHSEE (high school exit exam)
Construct	Use this type of validity estimate whenever you are comparing a test with the elements of a theoretical definition of a trait. Example: developing a new test for musical intelligence - distinguishing between musical and other types of intelligence

With the exception of interrater reliability and content validity, all other estimates usually involve the calculation of a type of correlation. Multiple estimates may be used to evaluate the properties of an instrument in certain situations. One form of validity, namely predictive validity, relates a test score to some future event or condition. In the following section, you'll see how prediction and correlation can be related to help make decisions or analyze patterns.

Linear Regression

Linear regression is a mathematical routine that links two related variables by attempting to predict one variable using the other variable. The variable upon which the prediction is based (i.e., the predictor) is called the independent variable, which is plotted on the x-axis. The variable that is being predicted is called the dependent variable, and it is plotted on the y-axis. In our previous example, AGE was used as the independent variable and EXPERIENCE as the dependent variable. Think of it this way: Y depends on X - X precedes/predicts Y. Just because regression uses one variable to predict another does not mean that the two variables are causally related. Remember that correlation is not causation - the two variables are associated/related but not necessarily as a cause and an effect.

The process of applying linear regression techniques assumes that there is a basis of historically observed data on which to base future predictions. One example of this situation is the school admissions process, where data on applicants are compared with data from previously successful and unsuccessful students. Another example is actuarial work (e.g., life insurance rate-setting), where historical data about longevity is used to predict lifespan. Regression models for these examples are much more complicated than the straight line model that is used in linear regression. Here is the regression line for the age and experience example.

scatterplot of age and experience with regression line

Many lines can be drawn through the scatterplot of points, but one line provides the best fit, where best is defined as having a minimum of error. What is error? Whenever you use a theoretical model to make a prediction, and then check that prediction against historical data, the predicted results can either match the data or not. In the scatterplot shown above, the prediction matches the data where the line goes right through a point. Whenever the line misses a point, error occurs. The amount of error could be small (i.e., the line is near the point), or it could be large (i.e., the line is far from the point). Notice how the three lower points to the right generate quite a bit of error. One method for determining the line's location is called the method of least squares, where the vertical distance between a potential line and each point is squared and these squares are summed, and the line with the least sum of squares is selected as the best fit. The mathematics involves calculus and is beyond the scope of this introduction. The average amount by which the line misses the points is called the standard error of the estimate, which you can think of as similar to a standard deviation.
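
To make the least squares idea concrete, here is a minimal Python sketch. It assumes a small set of made-up age and experience values (not the data set used in these notes) and uses the standard closed-form results that the calculus produces for the slope and y-intercept. The function names are hypothetical, and the standard error of the estimate is computed here by dividing by n - 2, which is one common convention; some introductory texts divide by n instead.

```python
import math

# Hypothetical (age, experience) values for illustration only;
# these are NOT the data behind the scatterplot in these notes.
ages = [25, 30, 35, 40, 45, 50]
experience = [2, 6, 9, 13, 15, 20]


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line with the smallest sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares of the deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_error(xs, ys, slope, intercept):
    """Square each vertical miss between the line and a point, then add them up."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def standard_error_of_estimate(xs, ys, slope, intercept):
    """Typical size of a miss, using the n - 2 convention (an assumption)."""
    sse = sum_squared_error(xs, ys, slope, intercept)
    return math.sqrt(sse / (len(xs) - 2))


b, a = least_squares_line(ages, experience)
print("slope b:", round(b, 3))
print("y-intercept a:", round(a, 3))
print("sum of squared errors:", round(sum_squared_error(ages, experience, b, a), 3))
print("standard error of the estimate:", round(standard_error_of_estimate(ages, experience, b, a), 3))
```

Any other line you try, such as one with a slightly different slope or intercept, will produce a larger sum of squared errors for the same points.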

When conducting linear regression analysis using SPSS or Excel, the programs will generate the two numbers (b and a) needed to describe the line. The equation for a line is the following:

Y' = bX + a,

where Y' represents the prediction,
X represents the independent variable,
b is called the slope, and
a is called the y-intercept.

Excel uses a function named LINEST (for linear regression estimate). Please read the help notes in Excel about how to use this function.

SPSS provides the following table, among others that we'll see later.

coefficients for linear regression equation predict experience using age

Unfortunately, the values are not labeled consistently between statistical programs and statistics texts. The value for the y-intercept (a in the equation of the line) is found in the B column of the SPSS output within the (Constant) row. The value for the slope (b in the equation of the line) is also found in the B column within the AGE row. Remember that the slope (b) is paired with X in the equation, and, because X is the AGE variable, the slope is in the AGE row. The resulting equation is the following.

Y' = .627 X + (-13.374)

What does this equation tell us? It provides a way to estimate the likely years of work experience for a person if you already know the person's age. For example, a 42-year-old would be predicted to have almost 13 years of work experience (.627 × 42 - 13.374 = 12.96). [Do the multiplication first and then the subtraction.] Another way to interpret the equation is that for every 10 years of age, the predicted years of work experience increase by about 6.3 years.
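
The same arithmetic can be checked with a few lines of Python. This is only a sketch: the slope and y-intercept are the values reported above, while the function and constant names are made up for illustration.

```python
# Coefficients taken from the regression equation Y' = .627 X + (-13.374).
SLOPE = 0.627        # b, found in the AGE row
INTERCEPT = -13.374  # a, found in the (Constant) row


def predict_experience(age):
    """Predicted years of work experience for a given age (hypothetical helper)."""
    return SLOPE * age + INTERCEPT


# Prediction for a 42-year-old.
print(round(predict_experience(42), 2))  # 12.96

# Change in the prediction for a 10-year difference in age (10 times the slope).
print(round(predict_experience(52) - predict_experience(42), 2))  # 6.27
```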

One word of caution about predicting these values - notice that the range of age values started at 22 years old and ended at 66. The equation is only valid to use where there were data points in the original set of data for the independent variable, age. In other words, the prediction equation generates meaningless numbers for people younger than 22 or older than 66. Try calculating the predicted years of work experience for a 15-year-old. When employing prediction equations, the range of legitimate values for the independent variable must be known.
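
One way to respect this caution in practice is to have the prediction refuse ages outside the observed range. The sketch below assumes the 22 to 66 range and the coefficients stated in these notes; the function names and the choice to raise an error are illustrative assumptions, not something SPSS or Excel does for you.

```python
# Range of ages present in the original data, as stated in these notes.
MIN_AGE = 22
MAX_AGE = 66


def raw_prediction(age):
    """Apply the regression equation without any range check."""
    return 0.627 * age - 13.374


def checked_prediction(age):
    """Apply the equation only where the original data support it."""
    if not MIN_AGE <= age <= MAX_AGE:
        raise ValueError(
            f"age {age} is outside the observed range {MIN_AGE}-{MAX_AGE}"
        )
    return raw_prediction(age)


# The unchecked equation gives a negative number of years for a 15-year-old,
# which has no sensible meaning.
print(round(raw_prediction(15), 3))  # -3.969

# The checked version rejects the same input instead of returning a number.
try:
    checked_prediction(15)
except ValueError as err:
    print("Rejected:", err)
```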

Here's another example to consider. Let's say you are determining grade-level reading ability based on reading test scores. The test publisher includes a regression equation for calculating the reading ability levels. The regression equation was derived from the data of third-grade students who were moderately good readers. You have a student who tested very, very high, and the regression formula places her reading ability at that of a sophomore in college. Because the regression equation was not developed with high-scoring readers, the predicted reading ability is not valid to use.

To learn more about linear regression by working with some interactive applets, visit the following web pages:
http://www.ruf.rice.edu/~lane/stat_sim/reg_by_eye/index.html
http://bcs.whfreeman.com/ips4e/cat_010/applets/CorrelationRegression.html
http://www.stattucino.com/berrie/dsl/regression/regression.html
http://www.stat.sc.edu/~west/javahtml/Regression.html