For example, here is a scatterplot of 2006 and 2007 API scores for San Francisco elementary schools.

Here is a correlation matrix of various characteristics of the San Francisco elementary schools. Meals represents the percentage of students who are eligible for free or reduced-price lunch. P_EL represents the percentage of English language learners in the schools. Not-HSG through Grad_Sch represent the percentages of parents at each education level. Each cell in the matrix includes the Pearson r, the significance level (p-value), and N, the number of schools.

The assumptions for conducting an inferential test of correlation rely on the concept of a conditional distribution. A conditional distribution can be thought of as either horizontal or vertical bands of a scatterplot. The conditional distribution of Y given X is the distribution of Y values for any given X value. The actual assumptions for the test are the following:

- Normality - both variables must be normally distributed, and each conditional distribution must also be normally distributed.
- Homoscedasticity - the standard deviations for each conditional distribution are equal.
- Linearity - the means of each conditional distribution lie on a straight line.

The null hypothesis for the inferential test is:

H₀: ρ = 0

that is, the correlation between the two variables in the population is equal to 0, or stated another way, the two variables are not correlated.

The alternative hypothesis is one of the following:

H₁: ρ ≠ 0

H₁: ρ < 0

H₁: ρ > 0

The distribution associated with the inferential test is Student's t distribution. The test statistic is a function of r and N, namely

t = r √(N - 2) / √(1 - r²)

The associated degrees of freedom for the test are N - 2. If the p-value obtained is less than the preset value of α (usually .05), then the null hypothesis of no correlation between the variables is rejected in favor of the alternative hypothesis.
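As a rough illustration of this test, the sketch below computes t = r √(N - 2) / √(1 - r²) and its degrees of freedom from a sample correlation. The function name `correlation_t` and the sample size of 69 schools are assumptions made only for this example; the actual N for any pair of variables should be read from the corresponding cell of the correlation matrix.

```python
import math


def correlation_t(r, n):
    """Return (t, df) for testing H0: rho = 0 from a sample r and sample size n."""
    if n < 3:
        raise ValueError("need at least 3 observations")
    if abs(r) >= 1:
        raise ValueError("r must be strictly between -1 and 1")
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)
    return t, df


# n = 69 is a hypothetical sample size used only for illustration.
t_stat, df = correlation_t(0.938, 69)
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```

The resulting t would then be compared with the critical value of Student's t for the chosen α, or converted to a p-value by statistical software.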

In the matrix shown above, the correlation between api06 and api07 of r = .938 is statistically significant at the .05 level, and the correlation between api07 and some_col (percentage of parents who have some college education but not a degree) of r = -.220 is not statistically significant. For correlations that are found to be statistically significant, an effect size of r² (the proportion of variance shared by the two variables) can be reported.

Y' = bX + a

where Y' is the predicted variable, b is the slope of the line, X is the independent variable, and a is the y-intercept (the point at which the line crosses the vertical axis). The equation parameters, a and b, can be derived as detailed in the text, or can be determined using the correlation between X and Y and the fact that the line passes through the point (X̄, Ȳ), the means of the two variables. The magnitude (absolute value) of the correlation coefficient determines the accuracy of the prediction. This accuracy is based on the difference between the observed Y value and the predicted Y' value (Y - Y'), which is called a deviation, discrepancy, or residual. An overall measure of the accuracy of the prediction is given by the standard error of estimate, which is the square root of the sum of the squared deviations divided by the sample size (many texts and software packages divide by N - 2 instead).
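One way to see how the parameters follow from the correlation is the minimal sketch below. It assumes the usual relationships b = r (s_Y / s_X) and a = Ȳ - bX̄, where X̄ and Ȳ are the means of the two variables, so the line passes through the point of means. The data values, the function names, and the `ddof` parameter are hypothetical and chosen only for illustration.

```python
import math


def fit_line(x, y):
    """Return (a, b, r): intercept, slope, and Pearson r for Y' = bX + a."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    b = r * math.sqrt(syy / sxx)   # slope from r and the ratio of spreads
    a = mean_y - b * mean_x        # forces the line through the point of means
    return a, b, r


def standard_error_of_estimate(x, y, a, b, ddof=0):
    """Square root of the summed squared residuals divided by (N - ddof)."""
    residuals = [yi - (b * xi + a) for xi, yi in zip(x, y)]
    return math.sqrt(sum(e ** 2 for e in residuals) / (len(x) - ddof))


# Hypothetical data, not taken from the API score file.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b, r = fit_line(x, y)
print(f"Y' = {b:.3f} X + {a:.3f}, r = {r:.3f}")
print("SE of estimate (divide by N):    ", standard_error_of_estimate(x, y, a, b))
print("SE of estimate (divide by N - 2):", standard_error_of_estimate(x, y, a, b, ddof=2))
```

Passing `ddof=2` gives the N - 2 version of the standard error that regression software typically reports.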

Note that the Model Summary table displays R and R², among other statistics.

The ANOVA table shown below reports the sums of squares for the predicted variable and the error (residuals/deviations/discrepancies). Note that the decimal places in the sum of squares column are not aligned, so use caution when reading these values. The table is interpreted as follows. First, the large value of F and the small significance (p-value) indicate that the regression model fits the data well: the null hypothesis of no relationship between the data and the regression model is rejected. Second, the sum of squares column gives the amount of explained variance, which is the ratio of the Regression sum of squares to the Total sum of squares, or 439291.6 divided by 499042.2, which equals .88 or 88%. This coincides with the value of R².
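The ratio can be checked with a few lines of arithmetic. The sketch below uses only the two sums of squares quoted above; the variable names are illustrative, and the residual sum of squares is obtained by subtraction rather than read from the table.

```python
import math

ss_regression = 439291.6   # Regression sum of squares quoted in the text
ss_total = 499042.2        # Total sum of squares quoted in the text

ss_residual = ss_total - ss_regression    # variation left unexplained
r_squared = ss_regression / ss_total      # proportion of explained variance
multiple_r = math.sqrt(r_squared)         # compare with the correlation of .938

print(f"Residual SS = {ss_residual:.1f}")
print(f"R squared   = {r_squared:.2f}")
print(f"R           = {multiple_r:.3f}")
```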

The Coefficients table provides the regression equation parameters, a and b, as well as two t tests about those parameters. In the model, the regression equation is

Y' = .962 X + 34.988

The first t test assesses whether the y-intercept, a, is different from 0. In this case, based on the standard error, the value of the y-intercept is not statistically significantly different from 0. This result may have meaning in particular settings. In this example, it has no real importance. The second t test assesses whether the slope is different from 0. This test is more important than the previous one because if the slope is likely to be 0, then so is the correlation and, consequently, the prediction would not be very accurate. In this case, the slope of .962 is statistically significantly different from 0 because the p-value associated with t = 22.194 is less than .0005.

Discrepancy (deviation or residual) is the vertical distance between an observed point and its predicted value, which is the point directly above or below it on the regression line. Observed points that fall far from the regression line have high discrepancy. Observed points that fall on the regression line have no discrepancy.

Leverage is the horizontal distance that a point falls from the center (mean) of the predictor (independent) variable. Observed points far from the mean for the predictor variable (i.e., the extreme scores for X) have high leverage. Those near the mean of X have low leverage.

The combination of discrepancy and leverage determines a point's influence. Points with both high discrepancy and high leverage have a greater degree of influence on the parameters of the regression equation. Eliminating the points with high influence may improve the accuracy of the regression model. These highly influential points may also indicate unique characteristics that merit more careful investigation.
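The three ideas can be put into numbers in several ways. The sketch below is one possibility, not necessarily the method behind the plots in this section: it measures discrepancy by the raw residual, leverage by the hat value 1/N + (X - X̄)² / Σ(X - X̄)², where X̄ is the mean of X, and influence by Cook's distance. The hat value and Cook's distance are standard diagnostics that are not named in the text, and the scores used here are hypothetical.

```python
def influence_measures(x, y):
    """Return residual, leverage (hat value), and Cook's distance per point."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                  # least-squares slope
    a = mean_y - b * mean_x        # least-squares intercept
    residuals = [yi - (b * xi + a) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - 2)
    rows = []
    for xi, e in zip(x, residuals):
        h = 1 / n + (xi - mean_x) ** 2 / sxx                # leverage
        cooks_d = (e ** 2 / (2 * mse)) * h / (1 - h) ** 2   # 2 parameters: a and b
        rows.append({"x": xi, "residual": e, "leverage": h, "cooks_d": cooks_d})
    return rows


# Hypothetical scores, not the San Francisco data.
x_scores = [600, 700, 750, 800, 850, 900]
y_scores = [560, 705, 750, 805, 850, 905]
for row in influence_measures(x_scores, y_scores):
    print(f"X={row['x']}: residual={row['residual']:.1f}, "
          f"leverage={row['leverage']:.3f}, Cook's D={row['cooks_d']:.3f}")
```

In this made-up data the first point lies farthest from the mean of X and also misses the line by a fair margin, so it receives the largest Cook's distance even though another point has a slightly larger residual.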

Here are some examples of plots of the regression line, discrepancy, leverage, and influence for the 2006 and 2007 API scores for San Francisco elementary schools.

In the scatterplot of the two API scores below, the regression line has been superimposed to represent the predicted Y' scores. Notice that most of the observed points fall fairly close to the line. Some of the lower API 2007 scores fail to match the pattern.

The discrepancies are the vertical distances between the observed points and the line. Notice that most observed points are within a distance of 50 score points from the line. One observation is over 150 points below the line.

As mentioned earlier, leverage is the horizontal distance from the mean of X (API06). The mean API score for these schools in 2006 was 774 points. The points with the most leverage are those with scores well below or well above 774.

The combination of discrepancy and leverage indicates the influence a point has on the regression model. Note here that the two points with the most influence are the circled ones. Closer investigation of the data reveals that Cesar Chavez Elementary went from an API score of 735 in 2006 to an API score of 596 in 2007, when the school's predicted score was 742. John Muir Elementary went from an API score of 615 in 2006 to an API score of 573 in 2007, when the school's predicted score was 627. The observation for Chavez Elementary has very high discrepancy and relatively low leverage, because its 2006 API score was near the mean of 774 points. The observation for Muir Elementary has moderate discrepancy but high leverage, because its 2006 API score was among the lowest. Inspecting the discrepancy plot will show two other schools whose predicted scores were off by over 50 points, but because their 2006 API scores are closer to the mean than Muir's was, their influence on the regression model is smaller.
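The predicted scores quoted for the two schools can be reproduced from the regression equation Y' = .962 X + 34.988. The sketch below is only a check of that arithmetic; the function name `predict_api07` is made up for the example, and the distance from the 2006 mean of 774 is used as a simple stand-in for leverage.

```python
def predict_api07(api06, slope=0.962, intercept=34.988):
    """Predicted 2007 API score from the 2006 score, using the fitted equation."""
    return slope * api06 + intercept


MEAN_API06 = 774  # mean 2006 score reported in the text

# (2006 score, 2007 score) for the two schools discussed above
schools = {
    "Cesar Chavez Elementary": (735, 596),
    "John Muir Elementary": (615, 573),
}

for name, (api06, api07) in schools.items():
    predicted = predict_api07(api06)
    residual = api07 - predicted             # discrepancy: observed minus predicted
    distance_from_mean = api06 - MEAN_API06  # horizontal distance from the mean of X
    print(f"{name}: predicted {predicted:.0f}, residual {residual:.0f}, "
          f"distance from mean {distance_from_mean}")
```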