Correlation and Regression
In the previous module, correlation was introduced as a descriptive
statistic that describes the nature of a relationship between two
variables. There are various types of correlations depending on the
specific characteristics of the two variables being compared. One
commonly used correlation is the Pearson product-moment correlation
coefficient, which measures the extent of a linear relationship between
two interval- or ratio-level variables. The Pearson correlation,
represented by r, ranges from -1 to +1. The magnitude of r indicates
the degree to which the pattern of paired points represents a line. The
sign of r (- or +) indicates the slope of the line that represents the
relationship: a positive r indicates a direct relationship, and a negative
r indicates an indirect relationship. When r = 0, there is no linear
relationship between the two variables. The characteristics of
relationships that are represented by a correlation must be checked by
inspecting a scatterplot of the pairs of points.
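To make the computation concrete, here is a minimal Python sketch of a Pearson correlation; the scores in x and y are hypothetical illustrations, not the API data discussed below.

```python
# A minimal sketch of computing a Pearson correlation and its p-value.
# The data values here are hypothetical, not the San Francisco API scores.
from scipy.stats import pearsonr

x = [68, 72, 75, 80, 84, 90]   # hypothetical interval-level scores
y = [65, 70, 78, 79, 88, 91]

r, p_value = pearsonr(x, y)
print(f"r = {r:.3f}, p = {p_value:.4f}")
```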
For example, here is a scatterplot of 2006 and 2007 API scores for San
Francisco elementary schools.
Here is a
correlation matrix of various characteristics of the San Francisco
elementary schools. Meals represents the percentage of students who are
eligible for free or reduced-price lunch. P_EL represents the percentage of
English language learners in the schools. Not-HSG through Grad_Sch
represent percentages of parent education levels. Each cell in
the matrix includes the Pearson r, the significance level (p-value), and N,
which represents the number of schools.
As shown in the matrix above, correlation can be used in an inferential
test. The second number in each cell of the matrix is the level of
statistical significance (p-value) associated with the inferential test
of the correlation value.
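A matrix of this form can be approximated in Python by looping over pairs of columns; the data frame and column names below are hypothetical stand-ins for variables like api07 and Meals.

```python
# A sketch of building a correlation matrix that reports r, the p-value,
# and N for each pair of variables. Values and names are hypothetical.
import pandas as pd
from scipy.stats import pearsonr

df = pd.DataFrame({
    "api07": [700, 750, 800, 650, 720],
    "meals": [60, 40, 20, 80, 55],
})

for a in df.columns:
    for b in df.columns:
        pair = df[[a, b]].dropna()           # pairwise deletion of missing data
        r, p = pearsonr(pair[a], pair[b])
        print(f"{a} vs {b}: r = {r:.3f}, p = {p:.3f}, N = {len(pair)}")
```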
The assumptions for
conducting an inferential test of correlation rely on the concept of a
conditional distribution. A conditional distribution can be thought of
as either horizontal or vertical bands of a scatterplot. The
conditional distribution of Y given X is the distribution of Y values
for any given X value. The actual assumptions for the test are:
- Normality - both variables are normally distributed, as is each conditional distribution.
- Homoscedasticity - the standard deviations of the conditional distributions are all equal.
- Linearity - the means of the conditional distributions lie on a straight line.
These assumptions can be checked by inspecting the scatterplot,
identifying outliers, and analyzing skewness ratios, as in the sketch below.
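For example, a skewness ratio (the skewness statistic divided by its standard error) beyond roughly ±2 is often taken as a sign of meaningful skew. Here is a minimal sketch, using hypothetical scores and the common rough approximation SE ≈ √(6/n) for the standard error of skewness.

```python
# A minimal sketch of a skewness-ratio check. The scores are hypothetical,
# and SE ≈ sqrt(6/n) is a rough approximation to the standard error of skewness.
import math
from scipy.stats import skew

scores = [620, 650, 700, 720, 740, 760, 780, 800, 820, 860]
n = len(scores)
ratio = skew(scores) / math.sqrt(6 / n)   # skewness divided by its approximate SE
print(f"skewness ratio = {ratio:.2f}")    # |ratio| > ~2 suggests notable skew
```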
The null hypothesis for the
inferential test is:
H0: ρ = 0,
that is, the
correlation between the two variables in the population is equal to 0,
or, stated another way, the two variables are not correlated.
The alternative hypothesis is one
of the following:
H1: ρ ≠ 0 (the non-directional alternative hypothesis)
H1: ρ < 0 (the correlation is indirect)
H1: ρ > 0 (the correlation is direct)
The sampling distribution associated with the inferential test is Student's t distribution,
which is based on a function of r and n, namely
t = r√(n − 2) / √(1 − r²)
The associated degrees of
freedom for the test are n − 2. If the p-value obtained is less than the
preset value of α (usually .05), then the null hypothesis of no
correlation between the variables is rejected in favor of the
alternative hypothesis.
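As a rough illustration of that conversion, the sketch below turns r into t and then into a two-tailed p-value; the values of r and n are hypothetical, chosen only to resemble a cell like api06/api07.

```python
# A sketch of the inferential test of a correlation: convert r to t,
# then look up a two-tailed p-value on n - 2 degrees of freedom.
# The values of r and n here are hypothetical.
import math
from scipy.stats import t as t_dist

r, n = 0.938, 69
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
p_value = 2 * t_dist.sf(abs(t_stat), df=n - 2)
print(f"t = {t_stat:.3f}, p = {p_value:.4g}")
```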
In the matrix shown
above, the correlation between api06 and api07 of r = .938 is
statistically significant at the .05 level, and the correlation between
api07 and some_col (percentage of parents who have some college
education but not a degree) of r = -.220 is not statistically
significant. For correlations that are found to be statistically
significant, an effect size of r² can be
computed to report the amount of shared variance between the two
variables. This effect size measure is called the coefficient of determination.
In the example of api06 and api07, .938² =
.8798, which indicates that 88.0% of the variance is shared between the
two sets of API scores. Caution must be applied when statistically
significant correlations are found for large samples. A statistically
significant correlation does not necessarily indicate a meaningful relationship.
When a statistically and practically significant
correlation is found between two variables, it may be appropriate and
worthwhile to consider predicting one of the variables based on the
other variable. The variable used in the prediction is called the predictor
or independent variable.
The variable being predicted is called the criterion
or dependent variable. The
process used in the prediction is called simple linear regression and
uses the following linear equation:
Y' = bX + a
Y' is the predicted variable, b is the slope of the line, X is the
independent variable, and a is the y-intercept (the point at which the
line crosses the vertical axis). The equation parameters, a and b, can
be derived as detailed in the text, or can be determined
using the correlation between X and Y and the fact that the
line passes through the point (X̄, Ȳ), the means of the two variables.
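A minimal sketch of that derivation, assuming hypothetical X and Y scores: the slope is b = r(s_Y / s_X), and the intercept follows from the line passing through the means.

```python
# A sketch of deriving the regression parameters from the correlation:
# b = r * (s_Y / s_X), and a comes from the line passing through the means.
# The data values are hypothetical.
import statistics as st
from scipy.stats import pearsonr

x = [610, 650, 700, 740, 780, 820]
y = [630, 660, 690, 750, 790, 810]

r, _ = pearsonr(x, y)
b = r * st.stdev(y) / st.stdev(x)     # slope from r and the two standard deviations
a = st.mean(y) - b * st.mean(x)       # intercept: the line passes through (mean X, mean Y)
print(f"Y' = {b:.3f} X + {a:.3f}")
```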
The magnitude (absolute value) of the correlation coefficient
determines the accuracy of the prediction. This accuracy is based on
the difference between the observed Y value and the predicted Y' value
(Y - Y'), which is called a deviation, discrepancy, or residual. An
overall measure of the accuracy of the prediction is also given by
the standard error of
estimate, which is the square root of the sum of
the squared deviations divided by the sample size:
s_est = √( Σ(Y - Y')² / N )
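Here is a small sketch of that computation, using hypothetical observed and predicted values. (It follows the definition above, dividing by N; some texts divide by N − 2 instead.)

```python
# A sketch of the standard error of estimate as defined in the text:
# the square root of the mean squared residual. Values are hypothetical.
import math

y_obs  = [630, 660, 690, 750, 790, 810]   # hypothetical observed Y
y_pred = [640, 655, 700, 745, 780, 815]   # hypothetical predicted Y'

residuals = [yo - yp for yo, yp in zip(y_obs, y_pred)]
s_est = math.sqrt(sum(e**2 for e in residuals) / len(residuals))
print(f"standard error of estimate = {s_est:.2f}")
```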
Here is an example of a regression analysis
in which 2007 API scores for San Francisco elementary schools are
predicted using the 2006 API scores.
Note that the Model Summary
table displays R, R², and the standard error of
estimate, which was mentioned earlier. R is the correlation between Y'
and Y, which, for simple linear regression with one predictor variable,
is equal to the absolute value of r. The adjusted R²
corrects for the effects of small sample sizes, if any.
The ANOVA table shown below reports the sums of squares for the predicted
variable and the error (residuals/deviations/discrepancies). It is
important to note that the decimal places in the sum of squares column
are not aligned - use caution when reading these values. The
interpretation of this table is as follows. First, the large
value of F and the small significance level (p-value) indicate that the fit of
the regression model to the data is good - the null hypothesis of no
relationship between the data and the regression model is rejected. The
second important source of information is the sum of squares
column. The amount of
explained variance is given by the ratio of the Regression
sum of squares to the Total sum of squares, or 439291.6 divided by
499042.2, which equals .88, or 88%. This coincides with the value of R²
listed in the previous table. Both of these measures are effect sizes
for the regression model. Consequently, the remaining unexplained variance, which
is given by the ratio of the Residual sum of squares to the Total sum
of squares, is 12%.
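The arithmetic can be verified directly from the two sums of squares reported in the table:

```python
# A quick check of the effect-size arithmetic from the ANOVA table:
# explained variance = SS_regression / SS_total.
ss_regression = 439291.6
ss_total = 499042.2
print(f"R^2 = {ss_regression / ss_total:.3f}")               # ≈ .880, matching the Model Summary
print(f"unexplained = {1 - ss_regression / ss_total:.3f}")   # ≈ .120
```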
The Coefficients table provides the regression equation parameters, a and b, as well as
two t tests
about those parameters. In this model, the regression equation is
Y' = .962X + 34.988
The first t test assesses
whether the y-intercept, a, is different from 0. In this case, based on
the standard error, the value of the y-intercept is not statistically
significantly different from 0. This result may have meaning in
particular settings; in this example, it has no real importance. The
second t test
assesses whether the slope, b, is different from 0. This test is more
important than the previous one because if the slope is likely to be 0,
then so is the correlation and, consequently, the prediction would not
be very accurate. In this case, the slope of .962 is statistically
significantly different from 0 because the p-value associated with t = 22.194 is less
than .05.
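Here is a sketch of how such a fit and its two parameter t tests could be reproduced in Python with statsmodels; the score arrays are hypothetical rather than the actual San Francisco school data.

```python
# A sketch of fitting a simple linear regression and inspecting the
# t tests on the intercept (a) and slope (b). Data values are hypothetical.
import numpy as np
import statsmodels.api as sm

x = np.array([610, 650, 700, 740, 780, 820])
y = np.array([630, 660, 690, 750, 790, 810])

X = sm.add_constant(x)                # adds the intercept term a to the model
model = sm.OLS(y, X).fit()
print(model.params)                   # a (const) and b (slope)
print(model.tvalues, model.pvalues)   # t statistic and p-value for each parameter
```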
Discrepancy, Leverage, and Influence of Observations
Once a regression
equation with adequate prediction accuracy has been determined, more
specific analysis of individual points that were used to develop the
regression model can be performed. Three characteristics will be
introduced: Discrepancy (also known as deviation or residual),
Leverage, and Influence.
Discrepancy (deviation or residual) is the vertical distance between an
observed point and its predicted value, which is the point
directly above or below it on the regression line. Observed points that
fall far from the regression line have high discrepancy. Observed
points that fall on the regression line have no discrepancy.
Leverage is the horizontal distance that a
point falls from the center (mean) of the predictor (independent) variable.
Observed points far from the mean of the predictor variable (i.e., the
extreme scores for X) have high leverage. Those near the mean of X have
low leverage. The combination of discrepancy and leverage determines a point's
influence. Points with both high discrepancy and
high leverage have a greater degree of influence on the parameters of
the regression equation. Eliminating the points with high influence may
improve the accuracy of the regression model. These highly influential
points may indicate unique characteristics that merit more careful
investigation.
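These three quantities can be computed for every observation. Here is a sketch using statsmodels, with hypothetical scores; Cook's distance, a standard statistic not mentioned in the text, stands in as one common numeric measure of influence.

```python
# A sketch of computing discrepancy (residuals), leverage (hat values),
# and influence (Cook's distance) for each observation. Data are hypothetical.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

x = np.array([610, 650, 700, 735, 780, 820])
y = np.array([630, 660, 690, 596, 790, 810])   # one point falls far below the line

model = sm.OLS(y, sm.add_constant(x)).fit()
influence = OLSInfluence(model)

print(model.resid)                   # discrepancies (residuals)
print(influence.hat_matrix_diag)     # leverage for each observation
print(influence.cooks_distance[0])   # Cook's distance: one measure of influence
```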
Here are some examples of plots of
the regression line, discrepancy, leverage, and influence for the 2006
and 2007 API scores for San Francisco elementary schools.
In the scatterplot of the two API scores below, the regression line has
been superimposed to represent the predicted Y' scores. Notice that
most of the observed points fall fairly close to the line. Some of the
lower API 2007 scores fail to match the pattern.
The discrepancies are the vertical
distances between the observed points and the line. Notice that most
observed points are within a distance of 50 score points from the line.
One observation is over 150 points below the line.
As mentioned earlier, leverage is
the horizontal distance from the mean of X (API06). The mean API score
for these schools in 2006 was 774 points. The points with the most
leverage are the scores well below and well above 774.
The combination of discrepancy and leverage indicates the influence a
point has on the regression model. Note here that the two
points with the most influence are the circled ones. Closer
investigation of the data reveals that Cesar Chavez Elementary
went from an API score of 735 in 2006 to an API score of 596 in 2007,
when the school's predicted score was 742. John Muir
Elementary went from an API score of 615 in 2006 to an API
score of 573 in 2007, when the school's predicted score was 627. The
observation for Chavez Elementary has very high discrepancy and
relatively low leverage, because its 2006 API score was near the mean
of 774 points. The observation for Muir Elementary has moderate
discrepancy but high leverage, because its 2006 API score was among the
lowest. Inspecting the discrepancy plot will show two other schools
whose predicted scores were off by over 50 points, but because their
2006 API scores are closer to the mean than Muir's was, their influence
on the regression model is less.