Correlations, Reliability and Validity, and Linear Regression

Correlations

A correlation describes a relationship between two variables. Unlike the descriptive statistics in previous sections, correlations require two or more distributions and are called bivariate (for two) or multivariate (for more than two) statistics. There are different types of correlations that correspond to different levels of measurement. The first correlation you'll study requires that the two variables be measured at the interval or ratio level. The full name of this statistic is the Pearson product-moment correlation coefficient, and it is denoted by the letter r. In research reports, you'll see references to Pearson r, correlation, correlation coefficient, or just r.

The formula for calculating r is one of the most complex that you will see. Correlation coefficients can be calculated by hand, but most people use a spreadsheet or statistics program. Two important aspects of this formula are that both an X distribution and a Y distribution are involved in the formula and that in addition to squaring and taking the square root of quantities, X and Y are multiplied together in the numerator (upper half) of the formula. The result of multiplying two numbers is called the product, which is why "product" is part of the full name of the correlation coefficient. Visit this site: http://allpsych.com/stats/unit2/14.html if you want to learn about evaluating the formula.

The formula for r looks like this:

[Figure: formula for the Pearson product-moment correlation coefficient]
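Because the formula appears only as a figure, here is one widely used computational form written out. It is a standard textbook version and is assumed, not confirmed, to match the layout of the original figure. N is the number of (X, Y) pairs.

```latex
r = \frac{N\sum XY - \sum X \sum Y}
         {\sqrt{\left[N\sum X^2 - \left(\sum X\right)^2\right]
                \left[N\sum Y^2 - \left(\sum Y\right)^2\right]}}
```

Notice the product XY in the numerator and the squaring and square root in the denominator, as described above.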

If you have Excel, you can use it to calculate r, the Pearson correlation coefficient. First, enter the numbers in two columns - one column for Xs and one column for Ys. Let's say that the numbers are in cells A1 through A10 and B1 through B10. Then, enter the following formula in cell A11 (or any other empty cell): =CORREL(A1:A10,B1:B10)
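If you don't have a spreadsheet handy, the same calculation can be sketched in a few lines of Python. This is only an illustration of the computational formula, not what Excel does internally. The function name pearson_r and the ten pairs of numbers are made up for the example.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation for two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))  # the "product" in product-moment
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("r is undefined when either variable has no variation")
    return numerator / denominator

# Hypothetical data standing in for cells A1:A10 (X) and B1:B10 (Y).
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2, 4, 5, 4, 5, 7, 8, 9, 10, 12]
r = pearson_r(xs, ys)
print(round(r, 3))
```

Typing the same twenty numbers into Excel and using the CORREL formula should give the same value, apart from rounding.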

Before you can interpret the numerical value of r, you need to determine if it was appropriate to calculate r to begin with. The Pearson correlation coefficient, r, describes the strength and direction of a linear relationship between two variables. The first step in exploring a linear relationship is to generate the scatterplot, which you learned about in the previous section. By inspecting the scatterplot, you can assess whether a line describes the pattern or if some other curve might provide a better description. Similar to the mean, the accuracy of a correlation coefficient can also be compromised by outliers in either distribution. Visually inspecting the scatterplot can identify these cases.
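To get a feel for how much a single outlier can matter, here is a small Python sketch. The data are invented: eight points that fall exactly on a straight line, plus one added point that sits far from the others. The helper pearson_r is a made-up name for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson r computed from deviations around each mean."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical data: eight points that lie exactly on the line Y = 2X.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 4, 6, 8, 10, 12, 14, 16]
r_clean = pearson_r(xs, ys)

# Add one outlier: a large X paired with a very small Y.
r_with_outlier = pearson_r(xs + [20], ys + [0])

print(round(r_clean, 3), round(r_with_outlier, 3))
```

With these particular numbers, one extra point is enough to turn a perfect positive correlation into a weak negative one, which is why the scatterplot should be checked before r is interpreted.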

As mentioned, Pearson r's describe both the strength and the direction of a relationship. The sign of r, either + or -, indicates the direction of the relationship. A positive r indicates a direct relationship: as X increases so does Y, and as X decreases so does Y. A negative r indicates an inverse (indirect) relationship: as X increases, Y decreases, and as X decreases, Y increases. The absolute value of r indicates the strength of the relationship. Pearson r's range from -1 to +1. Values close to -1 or +1 indicate a strong linear relationship - the associated scatterplot displays the pattern of dots in a nearly straight line. Values close to 0 indicate little or no linear relationship. A positive Pearson r isn't necessarily better (i.e., stronger) than a negative r - you need to compare the values while ignoring the signs.

Here are some websites where you can explore the meaning of the strength and direction of the correlation coefficient: http://www.stattucino.com/berrie/dsl/correlation.html, http://www.stat.berkeley.edu/~stark/Java/Html/Correlation.htm, and http://www.stat.uiuc.edu/courses/stat100/java/GCApplet/GCAppletFrame.html.

Returning to our previous example involving age and years of work experience, here is the scatterplot that was generated earlier.

[Figure: scatterplot of age and experience]

The Pearson correlation coefficient associated with these two variables is shown in the following SPSS output.

[Figure: SPSS correlation matrix for age and experience]

This output is an example of the simplest form of a correlation matrix. This matrix has two rows and two columns, resulting in four cells. The upper left cell contains the correlation of AGE with AGE, which is always 1. The lower right cell is similar. The upper right and lower left cells are copies of each other - both contain the correlation of AGE with EXPERIENCE. The Pearson r is reported first - in this case, r = .834, which is based on 50 pairs of numbers (N). You will learn the meaning of the middle number later in the course.

What meaning can be associated with r = .834? First, because r is positive (greater than 0), older ages are associated with more years of work experience and younger ages are associated with fewer years of work experience. Second, because .834 is close to 1, the linear pattern between AGE and EXPERIENCE is strong, which means that the points fall fairly close to a straight line. Statisticians like to be precise with their interpretations, so levels of strength have been proposed. Ignoring the plus or minus sign, r's have been assigned the following strengths:

r value    Strength
0-.2       very weak
.2-.4      weak
.4-.6      moderate
.6-.8      strong
.8-1       very strong
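The table can also be written as a small lookup, sketched here in Python. The function name strength_label is made up for this example. One assumption is needed: the table does not say which category a boundary value such as .4 belongs to, so this sketch places it in the stronger category.

```python
def strength_label(r):
    """Return the verbal strength for a correlation, ignoring its sign."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)  # strength depends only on the absolute value
    # Assumption: a value exactly on a boundary goes to the stronger category.
    if size < 0.2:
        return "very weak"
    if size < 0.4:
        return "weak"
    if size < 0.6:
        return "moderate"
    if size < 0.8:
        return "strong"
    return "very strong"

print(strength_label(0.834))   # very strong
print(strength_label(-0.834))  # very strong (the sign is ignored)
print(strength_label(0.35))    # weak
```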

An even more precise measure of strength is the coefficient of determination, r², which represents the proportion of variation that the two variables share - an indication of some underlying characteristic that they have in common. In this example, AGE and EXPERIENCE share .834², or about 70%, of their variation. The other 30% of unshared variation (also called unexplained variance) is reflected in the points that fall away from the straight line.
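The arithmetic behind these percentages can be written out as follows, using only the r = .834 reported in the SPSS output:

```latex
r^2 = (.834)^2 \approx .696 \approx 70\%,
\qquad
1 - r^2 \approx .304 \approx 30\%
```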

Often when studying correlations between two variables, you might begin to attribute a cause-and-effect relationship between the two variables. The existence of a meaningful correlation coefficient does not indicate causation. Among school children, for example, foot size and reading ability are correlated, but neither causes the other; both simply increase as children get older.

The Pearson correlation coefficient applies to pairs of interval or ratio-level variables. Here are some other types of correlations for different measurement levels.

X                 Y                 Correlation
nominal           nominal           Phi coefficient
nominal           ordinal           Rank biserial coefficient (r_rb)
nominal           interval/ratio    Point biserial coefficient (r_pb)
ordinal           ordinal           Spearman rank-order coefficient (rho)
interval/ratio    interval/ratio    Pearson r
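As one illustration from this table, the Spearman rank-order coefficient can be computed by converting each variable to ranks and then calculating a Pearson r on the ranks. The Python sketch below takes that approach, giving tied values the average of their ranks. The function names and the data are made up for the example.

```python
import math

def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def pearson_r(xs, ys):
    """Pearson r computed from deviations around each mean."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)

def spearman_rho(xs, ys):
    """Spearman rho: a Pearson r calculated on the ranks of the data."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Hypothetical data: Y always rises with X, but not along a straight line.
xs = [10, 20, 30, 40, 50]
ys = [1, 4, 9, 16, 100]
print(round(spearman_rho(xs, ys), 3), round(pearson_r(xs, ys), 3))
```

Because the ranks of X and Y agree perfectly in this example, rho is 1 even though the Pearson r on the raw numbers is noticeably lower.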

Reliability and Validity

The concepts of reliability and validity refer to properties of the instruments used in quantitative research to operationally define important variables. In more general ways, these concepts relate to the overall quality of a research study, including how uniformly, or consistently, the procedures are carried out (reliability), how well the collected data and their analysis support the results or findings (internal validity), and whether the results or findings extend to other contexts, or generalize (external validity).

It is quite appropriate for these topics to be addressed just after correlations have been explained, because most of the ways in which researchers estimate the reliability and validity of measurement instruments, procedures, or the use of their results involve the use of correlations. Furthermore, embedded in the process of quantitative research is the process of converting educational constructs into numerical values (i.e., operational definitions), which must be continually scrutinized for its legitimacy.

Whenever you hear or read the word reliability, think of the synonym consistency. Don't confuse reliability with validity. I'm sure that you know someone who is very reliable but always late or always wrong. Their response or behavior is always the same, but that doesn't mean it is appropriate. Reliability is necessary, but not sufficient, for validity.

Type of Reliability: Application
Test-retest: Use this type of reliability estimate whenever you are measuring a trait over a period of time.
    Example: teacher job satisfaction during the school year
Parallel forms: Use this type of reliability estimate whenever you need different forms of the same test to measure the same trait.
    Example: multiple forms of the SAT
Internal consistency: Use this type of reliability estimate whenever you need to summarize scores on individual items by an overall score.
    Example: combining the 20 items on a statistics test to represent level of knowledge about a particular aspect of statistics
Interrater: Use this type of reliability estimate whenever you involve multiple raters in scoring tests.
    Example: AP essay test grading
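To make the link to correlations concrete, a test-retest estimate is commonly reported as the Pearson r between scores from two administrations of the same instrument. The Python sketch below assumes that approach. The scores for six teachers are invented purely for illustration, and pearson_r is a made-up helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson r computed from deviations around each mean."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical job-satisfaction scores for the same six teachers,
# measured in the fall and again in the spring.
fall = [32, 41, 28, 45, 37, 30]
spring = [34, 40, 27, 46, 35, 31]

reliability = pearson_r(fall, spring)
print(round(reliability, 2))
```

A value close to 1 would suggest that teachers kept roughly the same relative standing across the two administrations, which is the sense of consistency described above.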

Validity is described as a unitary concept or property that can have multiple forms of evidence. Whether a test is valid or invalid does not depend on the test itself; rather, validity depends on how the test results are used. For example, a statistics quiz may be a valid way to measure understanding of a statistical concept, but the same quiz is not valid if you intend to assign grades in English composition based on it. This situation may seem to be quite far-fetched, but educators and politicians who do not carefully investigate the properties of a quantitative measurement may be making serious mistakes when basing decisions on the data. Here are some of the forms of evidence of validity that are presented in the text - there are others that you may encounter.

Type of Validity: Application
Content: Use this type of validity estimate whenever you are comparing test items with a larger domain of knowledge.
    Example: assessing the breadth of a comprehensive final exam
Criterion: Use this type of validity estimate whenever you are comparing a new test with an established standard.
    Example: developing a test to predict genius (predictive), or developing a new test comparable to the CAHSEE, the high school exit exam (concurrent)
Construct: Use this type of validity estimate whenever you are comparing a test with the elements of a theoretical definition of a trait.
    Example: developing a new test for musical intelligence - distinguishing between musical and other types of intelligence

With the exception of interrater reliability and content validity, all other estimates usually involve the calculation of a type of correlation. Multiple estimates may be used to evaluate the properties of an instrument in certain situations. One form of validity, namely predictive validity, relates a test score to some future event or condition. In the following section, you'll see how prediction and correlation can be related to help make decisions or analyze patterns.

Linear Regression

Linear regression is a mathematical routine that links two related variables by attempting to predict one variable using the other variable. The variable upon which the prediction is based (i.e., the predictor) is called the independent variable, which is plotted on the x-axis. The variable that is being predicted is called the dependent variable, and it is plotted on the y-axis. In our previous example, AGE was used as the independent variable and EXPERIENCE as the dependent variable. Think of it this way: Y depends on X - X precedes/predicts Y. Just because regression uses one variable to predict another does not mean that the two variables are causally related. Remember that correlation is not causation - the two variables are associated/related but not necessarily as a cause and an effect.

The process of applying linear regression techniques assumes that there is a basis of historically observed data on which to base future predictions. One example of this situation is the school admissions process, where data on applicants are compared with data from previously successful and unsuccessful students. Another example is actuarial work (e.g., life insurance rate-setting), where historical data about longevity is used to predict lifespan. Regression models for these examples are much more complicated than the straight line model that is used in linear regression. Here is the regression line for the age and experience example.

[Figure: scatterplot of age and experience with regression line]

Many lines can be drawn through the scatterplot of points, but one line provides the best fit, where best is defined as having a minimum of error. What is error? Whenever you use a theoretical model to make a prediction, and then check that prediction against historical data, the predicted results can either match the data or not. In the scatterplot shown above, the prediction matches the data where the line goes right through a point. Whenever the line misses a point, error occurs. The amount of error could be small (i.e., the line is near the point), or it could be large (i.e., the line is far from the point). Notice how the three lower points to the right generate quite a bit of error. One method for determining the line's location is called the method of least squares. For each potential line, the vertical distance between the line and each point is squared, and these squares are summed; the line with the smallest sum of squares is selected as the best fit. The mathematics involves calculus and is beyond the scope of this introduction. The typical amount by which the line misses the points is called the standard error of the estimate, which you can think of as similar to a standard deviation.
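Although the derivation uses calculus, the resulting formulas for the best-fitting line are short. The Python sketch below applies the standard closed-form least-squares solution to six invented (age, experience) pairs; these are not the 50 pairs behind the SPSS output. The function names are made up, and the standard error of the estimate is assumed to divide by N - 2, a common convention for a line with one predictor.

```python
import math

def least_squares_line(xs, ys):
    """Return (b, a), the slope and y-intercept of the best-fit line Y' = bX + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return b, a

def sum_squared_error(xs, ys, b, a):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))

def standard_error_of_estimate(xs, ys, b, a):
    """Typical size of a miss; assumes the N - 2 convention in the denominator."""
    if len(xs) <= 2:
        raise ValueError("need more than 2 points")
    return math.sqrt(sum_squared_error(xs, ys, b, a) / (len(xs) - 2))

# Hypothetical ages and years of experience (not the course data set).
ages = [22, 30, 38, 46, 54, 62]
experience = [1, 6, 10, 16, 19, 26]

b, a = least_squares_line(ages, experience)
best_sse = sum_squared_error(ages, experience, b, a)
print(round(b, 3), round(a, 3), round(best_sse, 3))
print(round(standard_error_of_estimate(ages, experience, b, a), 3))

# Any other line, such as one with a slightly steeper slope, has a larger sum of squares.
print(sum_squared_error(ages, experience, b + 0.05, a) > best_sse)
```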

When conducting linear regression analysis using SPSS or Excel, the programs will generate the two numbers (b and a) needed to describe the line. The equation for a line is the following:

Y' = bX + a,

where Y' represents the prediction,
X represents the independent variable,
b is called the slope, and
a is called the y-intercept.

Excel uses a function named LINEST (for linear regression estimate). Please read the help notes in Excel about how to use this function.

SPSS provides the following table, among others that we'll see later.

[Figure: SPSS coefficients table for the linear regression equation predicting experience from age]

Unfortunately, the values are not labeled consistently between statistical programs and statistics texts. The value for the y-intercept (a in the equation of the line) is found in the B column of the SPSS output within the (Constant) row. The value for the slope (b in the equation of the line) is also found in the B column within the AGE row. Remember that the slope (b) is paired with X in the equation, and, because X is the AGE variable, the slope is in the AGE row. The resulting equation is the following.

Y' = .627 X + (-13.374)

What does this equation tell us? It provides a way to estimate the likely years of work experience for a person if you already know the person's age. For example, a 42-year-old would be predicted to have almost 13 years of work experience (.627 × 42 - 13.374 = 12.96). [Do the multiplication first and then the subtraction.] Another way to interpret the equation is that for every additional 10 years of age, the predicted years of work experience increase by about 6.3 years.
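The same calculation can be sketched in Python, using the slope and intercept reported in the SPSS output. The function and constant names are made up for this example.

```python
SLOPE = 0.627        # b, from the AGE row of the SPSS output
INTERCEPT = -13.374  # a, from the (Constant) row of the SPSS output

def predict_experience(age):
    """Predicted years of work experience: Y' = bX + a."""
    return SLOPE * age + INTERCEPT

print(round(predict_experience(42), 2))  # 12.96

# Ten additional years of age add 10 * b to the prediction.
print(round(predict_experience(52) - predict_experience(42), 2))  # 6.27
```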

One word of caution about predicting these values - notice that the range of age values started at 22 years old and ended at 66. The equation is only valid to use where there were data points in the original set of data for the independent variable, age. In other words, the prediction equation generates meaningless numbers for people younger than 22 or older than 66. Try calculating the predicted years of work experience for a 15-year-old. When employing prediction equations, the range of legitimate values for the independent variable must be known.
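One way to build this caution into a calculation is to have the prediction refuse ages outside the observed range. The Python sketch below does that, using the 22-to-66 range and the coefficients from the example. The names are made up, and raising an error is just one possible design choice.

```python
SLOPE = 0.627        # b, from the SPSS output
INTERCEPT = -13.374  # a, from the SPSS output
MIN_AGE, MAX_AGE = 22, 66  # range of ages in the original data

def predict_experience(age):
    """Predict years of experience, but only within the observed age range."""
    if not MIN_AGE <= age <= MAX_AGE:
        raise ValueError(
            f"age {age} is outside the observed range {MIN_AGE}-{MAX_AGE}"
        )
    return SLOPE * age + INTERCEPT

# Using the raw equation for a 15-year-old gives a negative number of years.
raw_15 = SLOPE * 15 + INTERCEPT
print(round(raw_15, 3))  # -3.969

# The guarded version reports the problem instead of returning a meaningless value.
try:
    predict_experience(15)
except ValueError as err:
    print(err)
```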

Here's another example to consider. Let's say you are determining grade-level reading ability based on reading test scores. The test publisher includes a regression equation for calculating the reading ability levels. The regression equation was derived from the data of third-grade students who were moderately good readers. You have a student who tested very, very high, and the regression formula places her reading ability at that of a sophomore in college. Because the regression equation was not developed with high-scoring readers, the predicted reading ability is not valid to use.

To learn more about linear regression by working with some interactive applets, visit the following web pages:
http://www.ruf.rice.edu/~lane/stat_sim/reg_by_eye/index.html
http://bcs.whfreeman.com/ips4e/cat_010/applets/CorrelationRegression.html
http://www.stattucino.com/berrie/dsl/regression/regression.html
http://www.stat.sc.edu/~west/javahtml/Regression.html