Correlation and Regression
In the previous module, correlation was introduced as a descriptive
statistic that describes the nature of a relationship between two
variables. There are various types of correlations depending on the
specific characteristics of the two variables being compared. One
commonly used correlation is the Pearson product-moment correlation
coefficient, which measures the extent of a linear relationship between
two interval- or ratio-level variables. The Pearson correlation,
represented by r, ranges from -1 to +1. The magnitude of r indicates
the degree to which the pattern of paired points represents a line. The
sign of r (- or +) indicates the slope of the line that represents the
relationship: a positive r indicates a direct relationship, and a negative
r indicates an indirect relationship. When r = 0, there is no linear
relationship between the two variables. The characteristics of
relationships that are represented by a correlation must be checked by
inspecting a scatterplot of the pairs of points.
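To make the computation concrete, here is a minimal Python sketch of a Pearson correlation; the scores in x and y are hypothetical illustrations, not the API data discussed below.

```python
# A minimal sketch of computing a Pearson correlation and its p-value.
# The data values here are hypothetical, not the San Francisco API scores.
from scipy.stats import pearsonr

x = [68, 72, 75, 80, 84, 90]   # hypothetical interval-level scores
y = [65, 70, 78, 79, 88, 91]

r, p_value = pearsonr(x, y)
print(f"r = {r:.3f}, p = {p_value:.4f}")
```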
For example, here is a scatterplot of 2006 and 2007 API scores for San
Francisco elementary schools.
Here is a
correlation matrix of various characteristics of the San Francisco
elementary schools. Meals represents the percentage of students who are
eligible for free or reduced-price lunch. P_EL represents the percentage of
English language learners in the schools. Not-HSG through Grad_Sch
represent percentages of parent education levels. Each cell in
the matrix includes the Pearson r, the significance level (p-value), and N,
which represents the number of schools.
As shown in the matrix above, correlation can be used in an inferential
test. The second number in each cell of the matrix is the level of
statistical significance (p-value) associated with the inferential test
of the correlation value.
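A matrix of this form can be approximated in Python by looping over pairs of columns; the data frame and column names below are hypothetical stand-ins for variables like api07 and Meals.

```python
# A sketch of building a correlation matrix that reports r, the p-value,
# and N for each pair of variables. Values and names are hypothetical.
import pandas as pd
from scipy.stats import pearsonr

df = pd.DataFrame({
    "api07": [700, 750, 800, 650, 720],
    "meals": [60, 40, 20, 80, 55],
})

for a in df.columns:
    for b in df.columns:
        pair = df[[a, b]].dropna()           # pairwise deletion of missing data
        r, p = pearsonr(pair[a], pair[b])
        print(f"{a} vs {b}: r = {r:.3f}, p = {p:.3f}, N = {len(pair)}")
```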
The assumptions for
conducting an inferential test of correlation rely on the concept of a
conditional distribution. A conditional distribution can be thought of
as either horizontal or vertical bands of a scatterplot. The
conditional distribution of Y given X is the distribution of Y values
for any given X value. The actual assumptions for the test are:
- Normality - both variables are normally distributed, as is each conditional distribution.
- Homoscedasticity - the standard deviations of the conditional distributions are all equal.
- Linearity - the means of the conditional distributions lie on a straight line.
These assumptions can be checked by inspecting the scatterplot,
identifying outliers, and analyzing skewness ratios, as in the sketch below.
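For example, a skewness ratio (the skewness statistic divided by its standard error) beyond roughly ±2 is often taken as a sign of meaningful skew. Here is a minimal sketch, using hypothetical scores and the common rough approximation SE ≈ √(6/n) for the standard error of skewness.

```python
# A minimal sketch of a skewness-ratio check. The scores are hypothetical,
# and SE ≈ sqrt(6/n) is a rough approximation to the standard error of skewness.
import math
from scipy.stats import skew

scores = [620, 650, 700, 720, 740, 760, 780, 800, 820, 860]
n = len(scores)
ratio = skew(scores) / math.sqrt(6 / n)   # skewness divided by its approximate SE
print(f"skewness ratio = {ratio:.2f}")    # |ratio| > ~2 suggests notable skew
```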
The null hypothesis for the
inferential test is:
H0: ρ = 0,
that is, the
correlation between the two variables in the population is equal to 0,
or, stated another way, the two variables are not correlated.
The alternative hypothesis is one
of the following:
H1: ρ ≠ 0 (the non-directional alternative hypothesis)
H1: ρ < 0 (the correlation is indirect)
H1: ρ > 0 (the correlation is direct)
The sampling distribution associated with the inferential test is Student's t distribution,
which is based on a function of r and n, namely
t = r√(n − 2) / √(1 − r²)
The associated degrees of
freedom for the test are n − 2. If the p-value obtained is less than the
preset value of α (usually .05), then the null hypothesis of no
correlation between the variables is rejected in favor of the
alternative hypothesis.
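As a rough illustration of that conversion, the sketch below turns r into t and then into a two-tailed p-value; the values of r and n are hypothetical, chosen only to resemble a cell like api06/api07.

```python
# A sketch of the inferential test of a correlation: convert r to t,
# then look up a two-tailed p-value on n - 2 degrees of freedom.
# The values of r and n here are hypothetical.
import math
from scipy.stats import t as t_dist

r, n = 0.938, 69
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
p_value = 2 * t_dist.sf(abs(t_stat), df=n - 2)
print(f"t = {t_stat:.3f}, p = {p_value:.4g}")
```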
In the matrix shown
above, the correlation between api06 and api07 of r = .938 is
statistically significant at the .05 level, and the correlation between
api07 and some_col (percentage of parents who have some college
education but not a degree) of r = -.220 is not statistically
significant. For correlations that are found to be statistically
significant, an effect size of r² can be
computed to report the amount of shared variance between the two
variables. This effect size measure is called the coefficient of determination.
In the example of api06 and api07, .938² =
.8798, which indicates that 88.0% of the variance is shared between the
two sets of API scores. Caution must be applied when statistically
significant correlations are found for large samples. A statistically
significant correlation does not necessarily indicate a meaningful relationship.
When a statistically and practically significant
correlation is found between two variables, it may be appropriate and
worthwhile to consider predicting one of the variables based on the
other variable. The variable used in the prediction is called the predictor
or independent variable.
The variable being predicted is called the criterion
or dependent variable. The
process used in the prediction is called simple linear regression and
uses the following linear equation:
Y' = bX + a
Y' is the predicted variable, b is the slope of the line, X is the
independent variable, and a is the y-intercept (the point at which the
line crosses the vertical axis). The equation parameters, a and b, can
be derived as detailed in the text, or can be determined
using the correlation between X and Y and the fact that the
line passes through the point (X̄, Ȳ), the means of the two variables.
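A minimal sketch of that derivation, assuming hypothetical X and Y scores: the slope is b = r(s_Y / s_X), and the intercept follows from the line passing through the means.

```python
# A sketch of deriving the regression parameters from the correlation:
# b = r * (s_Y / s_X), and a comes from the line passing through the means.
# The data values are hypothetical.
import statistics as st
from scipy.stats import pearsonr

x = [610, 650, 700, 740, 780, 820]
y = [630, 660, 690, 750, 790, 810]

r, _ = pearsonr(x, y)
b = r * st.stdev(y) / st.stdev(x)     # slope from r and the two standard deviations
a = st.mean(y) - b * st.mean(x)       # intercept: the line passes through (mean X, mean Y)
print(f"Y' = {b:.3f} X + {a:.3f}")
```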
The magnitude (absolute value) of the correlation coefficient
determines the accuracy of the prediction. This accuracy is based on
the difference between the observed Y value and the predicted Y' value
(Y - Y'), which is called a deviation, discrepancy, or residual. An
overall measure of the accuracy of the prediction is also given by
the standard error of
estimate, which is the square root of the sum of
the squared deviations divided by the sample size:
s_est = √( Σ(Y - Y')² / N )
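Here is a small sketch of that computation, using hypothetical observed and predicted values. (It follows the definition above, dividing by N; some texts divide by N − 2 instead.)

```python
# A sketch of the standard error of estimate as defined in the text:
# the square root of the mean squared residual. Values are hypothetical.
import math

y_obs  = [630, 660, 690, 750, 790, 810]   # hypothetical observed Y
y_pred = [640, 655, 700, 745, 780, 815]   # hypothetical predicted Y'

residuals = [yo - yp for yo, yp in zip(y_obs, y_pred)]
s_est = math.sqrt(sum(e**2 for e in residuals) / len(residuals))
print(f"standard error of estimate = {s_est:.2f}")
```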
Here is an example of a regression analysis
in which 2007 API scores for San Francisco elementary schools are
predicted using the 2006 API scores.
Note that the Model Summary
table displays R, R², and the standard error of
estimate, which was mentioned earlier. R is the correlation between Y'
and Y, which, for simple linear regression with one predictor variable,
is equal to the absolute value of r. The adjusted R²
corrects for the effects of small sample sizes, if any.
The ANOVA table shown below reports the sums of squares for the predicted
variable and the error (residuals/deviations/discrepancies). It is
important to note that the decimal places in the sum of squares column
are not aligned - use caution when reading these values. The
interpretation of this table is as follows. First, the large
value of F and the small significance level (p-value) indicate that the fit of
the regression model to the data is good - the null hypothesis of no
relationship between the data and the regression model is rejected. The
second important source of information is the sum of squares
column. The amount of
explained variance is given by the ratio of the Regression
sum of squares to the Total sum of squares, or 439291.6 divided by
499042.2, which equals .88, or 88%. This coincides with the value of R²
listed in the previous table. Both of these measures are effect sizes
for the regression model. Consequently, the remaining unexplained variance, which
is given by the ratio of the Residual sum of squares to the Total sum
of squares, is 12%.
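The arithmetic can be verified directly from the two sums of squares reported in the table:

```python
# A quick check of the effect-size arithmetic from the ANOVA table:
# explained variance = SS_regression / SS_total.
ss_regression = 439291.6
ss_total = 499042.2
print(f"R^2 = {ss_regression / ss_total:.3f}")               # ≈ .880, matching the Model Summary
print(f"unexplained = {1 - ss_regression / ss_total:.3f}")   # ≈ .120
```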
The Coefficients table provides the regression equation parameters, a and b, as well as
two t tests
about those parameters. In this model, the regression equation is
Y' = .962X + 34.988
The first t test assesses
whether the y-intercept, a, is different from 0. In this case, based on
the standard error, the value of the y-intercept is not statistically
significantly different from 0. This result may have meaning in
particular settings; in this example, it has no real importance. The
second t test
assesses whether the slope, b, is different from 0. This test is more
important than the previous one because if the slope is likely to be 0,
then so is the correlation and, consequently, the prediction would not
be very accurate. In this case, the slope of .962 is statistically
significantly different from 0 because the p-value associated with t = 22.194 is less
than .05.
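Here is a sketch of how such a fit and its two parameter t tests could be reproduced in Python with statsmodels; the score arrays are hypothetical rather than the actual San Francisco school data.

```python
# A sketch of fitting a simple linear regression and inspecting the
# t tests on the intercept (a) and slope (b). Data values are hypothetical.
import numpy as np
import statsmodels.api as sm

x = np.array([610, 650, 700, 740, 780, 820])
y = np.array([630, 660, 690, 750, 790, 810])

X = sm.add_constant(x)                # adds the intercept term a to the model
model = sm.OLS(y, X).fit()
print(model.params)                   # a (const) and b (slope)
print(model.tvalues, model.pvalues)   # t statistic and p-value for each parameter
```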
Discrepancy, Leverage, and Influence of Observations
Once a regression
equation with adequate prediction accuracy has been determined, more
specific analysis of individual points that were used to develop the
regression model can be performed. Three characteristics will be
introduced: Discrepancy (also known as deviation or residual),
Leverage, and Influence.
Discrepancy (deviation or residual) is the vertical distance between an
observed point and its predicted value, which is the point
directly above or below it on the regression line. Observed points that
fall far from the regression line have high discrepancy. Observed
points that fall on the regression line have no discrepancy.
Leverage is the horizontal distance that a
point falls from the center (mean) of the predictor (independent) variable.
Observed points far from the mean of the predictor variable (i.e., the
extreme scores for X) have high leverage. Those near the mean of X have
low leverage. The combination of discrepancy and leverage determines a point's
influence. Points with both high discrepancy and
high leverage have a greater degree of influence on the parameters of
the regression equation. Eliminating the points with high influence may
improve the accuracy of the regression model. These highly influential
points may indicate unique characteristics that merit more careful
investigation.
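These three quantities can be computed for every observation. Here is a sketch using statsmodels, with hypothetical scores; Cook's distance, a standard statistic not mentioned in the text, stands in as one common numeric measure of influence.

```python
# A sketch of computing discrepancy (residuals), leverage (hat values),
# and influence (Cook's distance) for each observation. Data are hypothetical.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

x = np.array([610, 650, 700, 735, 780, 820])
y = np.array([630, 660, 690, 596, 790, 810])   # one point falls far below the line

model = sm.OLS(y, sm.add_constant(x)).fit()
influence = OLSInfluence(model)

print(model.resid)                   # discrepancies (residuals)
print(influence.hat_matrix_diag)     # leverage for each observation
print(influence.cooks_distance[0])   # Cook's distance: one measure of influence
```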
Here are some examples of plots of
the regression line, discrepancy, leverage, and influence for the 2006
and 2007 API scores for San Francisco elementary schools.
In the scatterplot of the two API scores below, the regression line has
been superimposed to represent the predicted Y' scores. Notice that
most of the observed points fall fairly close to the line. Some of the
lower API 2007 scores fail to match the pattern.
The discrepancies are the vertical
distances between the observed points and the line. Notice that most
observed points are within a distance of 50 score points from the line.
One observation is over 150 points below the line.
As mentioned earlier, leverage is
the horizontal distance from the mean of X (API06). The mean API score
for these schools in 2006 was 774 points. The points with the most
leverage are the scores well below and well above 774.
The combination of discrepancy and leverage indicates the influence a
point has on the regression model. Note here that the two
points with the most influence are the circled ones. Closer
investigation of the data reveals that Cesar Chavez Elementary
went from an API score of 735 in 2006 to an API score of 596 in 2007,
when the school's predicted score was 742. John Muir
Elementary went from an API score of 615 in 2006 to an API
score of 573 in 2007, when the school's predicted score was 627. The
observation for Chavez Elementary has very high discrepancy and
relatively low leverage, because its 2006 API score was near the mean
of 774 points. The observation for Muir Elementary has moderate
discrepancy but high leverage, because its 2006 API score was among the
lowest. Inspecting the discrepancy plot will show two other schools
whose predicted scores were off by over 50 points, but because their
2006 API scores are closer to the mean than Muir's was, their influence
on the regression model is less.