correlational analysis with linear regression
- Here is the simple "scattergram" that plots how the states voted for president in 1980 and 1984.
[Figure: scattergram of REAGAN84 against REAGAN80; correlation, r = .90]
- The entries in two-dimensional space stand for the number of states that voted about the same for Reagan in each year.
- The correlation in vote for Reagan between the two years is very high, meaning that one could predict a state's vote in 1984 (Y) fairly well with knowledge of the state's vote in 1980 (X)
- -- IF one knew the formula for the underlying "regression line."
Determining the "regression line" that underlies the plot:
- The plot of
reagan84 with reagan80 suggests that a
"linear relationship" exists.
- One can imagine a
straight line through the swarm of points.
- The formula for such a line is
$\hat{Y}_i = a + bX_i$
where
- $\hat{Y}_i$ = the predicted value of the dependent variable, Y, for the ith case
- a = a constant, the point at which the line crosses the Y axis when X = 0
- b = a coefficient representing the "slope" of the line
- $X_i$ = the observed value of the independent variable for the ith case
- In fact, a line can be drawn that constitutes the "best fit" in the sense that it yields a smaller sum of squared deviations of the observed Ys from the line than any alternative line.
- Such a criterion for drawing a line is referred to as ordinary least squares (OLS).
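- More formally, the OLS line is the one for which
$\sum_i (Y_i - \hat{Y}_i)^2$ is a minimum
- Where: $\sum_i (Y_i - \hat{Y}_i)^2$ = sum of squared errors in prediction
- The square root of the mean of these squared deviations is the standard error of the estimate.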
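The least squares criterion can be checked numerically. The sketch below uses a small made-up data set (an assumption for illustration, not the state vote data) and the helper names `sse` and `ols_fit`, which are hypothetical. It fits the OLS line and then compares its sum of squared errors with that of a few nearby lines.

```python
# Hypothetical toy data for illustration (not the state vote data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]


def sse(a, b, xs, ys):
    """Sum of squared errors of prediction for the line Y-hat = a + b*X."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def ols_fit(xs, ys):
    """Return (a, b) for the least squares line of Y regressed on X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    b = ss_xy / ss_x
    a = mean_y - b * mean_x
    return a, b


a, b = ols_fit(xs, ys)
best = sse(a, b, xs, ys)

# Shift the intercept or the slope away from the OLS values and recompute.
alternatives = [(a + 0.5, b), (a - 0.5, b), (a, b + 0.2), (a, b - 0.2)]
worse = [sse(a_alt, b_alt, xs, ys) for a_alt, b_alt in alternatives]

print(f"OLS line: Y-hat = {a:.2f} + {b:.2f} X, SSE = {best:.2f}")
for (a_alt, b_alt), err in zip(alternatives, worse):
    print(f"a = {a_alt:.2f}, b = {b_alt:.2f}: SSE = {err:.2f}")
```

Every alternative line tried here has a larger sum of squared errors than the fitted line, which is what "best fit" means under this criterion.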
Computing the OLS
(Ordinary Least Squares) regression line (these
values are automatically computed within SPSS):
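- The slope of the line, b, is computed by this basic formula:
$b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$
- In words, this is the covariation of X and Y divided by the variation of X
- It is also the covariance of X and Y divided by the variance of X: $b = s_{xy} / s_x^2$
- The formula for a, the intercept, is
$a = \bar{Y} - b\bar{X}$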
- Note that if there is
no slope (i.e., an increase in X produces no
increase in Y), b=0
- Thus the second term
on the right would also be 0
- and the intercept,
a, would be equal to the mean of the
dependent variable, Y.
- Thus, the regression line in the scatterplot would be a horizontal straight line drawn at the mean of Y.
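The zero-slope case can be illustrated with a short sketch. The data are made up so that X and Y have no covariation, and `ols_fit` is a hypothetical helper name. With b = 0, the intercept equals the mean of Y.

```python
def ols_fit(xs, ys):
    """Return (a, b) for the least squares line of Y regressed on X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    b = ss_xy / ss_x
    a = mean_y - b * mean_x  # second term vanishes when b = 0
    return a, b


# Hypothetical data chosen so that the cross-products sum to zero.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 4.0, 5.0, 3.0]

a, b = ols_fit(xs, ys)
mean_y = sum(ys) / len(ys)

print(f"b = {b}, a = {a}, mean of Y = {mean_y}")
```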
Computing the regression coefficient, byx (variable Y regressed on X):
- Conceptually, the
regression coefficient is the ratio of the
covariation between both variables to the
variation of the independent variable.
- The regression coefficient byx is an unstandardized coefficient, which means that it is calculated for the "raw" (original) values of the variables.
- It represents the slope of the regression line: the amount of change in Y associated with a change of 1 unit of X.
- Calculating b using cross-products and standard deviations: for variable Y regressed on X,
$b_{yx} = \frac{SS_{yx}}{SS_x} = r \, \frac{s_y}{s_x}$
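Here is the line and the regression equation superimposed on the scatterplot:
[Figure: scatterplot of REAGAN84 against REAGAN80 with the regression line and equation superimposed]
(A scatterplot with the regression line can be produced in SPSS.)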
Example: Consider the regression equation predicting REAGAN84 from REAGAN80, as calculated by the scatterplot output:
$\hat{Y}_i = a + bX_i$
$\hat{Y}_i = 14.86 + .88X_i$
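Correlation output (from a previous run) shows:
- Cross-product deviations of REAGAN84 with REAGAN80 = 3554.4369 [Thus, SSyx = 3554]
- Variance of REAGAN80 = 80.95; N = 50 [Thus, SSx = 4047 (variance x N)]
- Computing the regression coefficient, b:
$b = \frac{SS_{yx}}{SS_x} = \frac{3554}{4047} = .878$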
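The arithmetic can be reproduced directly from the figures quoted in the correlation output. This is only a sketch: the variable names are made up, and it follows the notes in obtaining SSx as variance times N.

```python
# Figures quoted in the correlation output above.
ss_yx = 3554.4369      # cross-product deviations of REAGAN84 with REAGAN80
variance_x = 80.95     # variance of REAGAN80
n = 50                 # number of states

# The notes obtain SSx by multiplying the variance by N.
ss_x = variance_x * n

# Slope: covariation of X and Y divided by the variation of X.
b = ss_yx / ss_x

print(f"SSx = {ss_x:.1f}")
print(f"b = {b:.3f}")
```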
- The b coefficient of .88 in the regression equation means that the predicted percentage of a state's vote for Reagan in 1984 (Yi) increases by .88 for each additional percentage point that the state voted for Reagan in 1980 (Xi).
- If the slope is only .88, how did Reagan win more votes in 1984 than in 1980?
- Consider the value of the intercept, which is 14.86.
- The intercept adds almost 15 percentage points to every state's predicted 1984 vote, which more than offsets a slope below 1: a state that gave Reagan 50% in 1980 is predicted to give him 14.86 + .88(50) = 58.86% in 1984.
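To see how the intercept and slope combine, the sketch below applies the equation to a few 1980 vote percentages. The input values 40, 50, and 60 and the name `predict_1984` are made up for illustration; only a = 14.86 and b = .88 come from the notes.

```python
A = 14.86  # intercept from the regression equation
B = 0.88   # slope from the regression equation


def predict_1984(pct_1980):
    """Predicted 1984 Reagan percentage for a given 1980 percentage."""
    return A + B * pct_1980


# Hypothetical 1980 percentages, chosen only for illustration.
for pct in (40, 50, 60):
    predicted = predict_1984(pct)
    print(f"1980: {pct}%  predicted 1984: {predicted:.2f}%  gain: {predicted - pct:.2f}")

# The 1980 value at which the predicted 1984 vote would equal the 1980 vote.
break_even = A / (1 - B)
print(f"break-even 1980 percentage: {break_even:.1f}")
```

The break-even point lies above 100%, so for any possible 1980 percentage the equation predicts a higher 1984 percentage, with the predicted gain shrinking as the 1980 vote rises.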
What is the relationship between b (the slope) and r (the correlation coefficient)?
$b = r \, \frac{s_y}{s_x}$
- That is, if the two variables being correlated have equal standard deviations (sy = sx), then b = r, for r would be multiplied by 1.
The implication of all this is:
- The value of the slope, b, differs from the correlation coefficient, r, to the extent that the two variables being correlated, X and Y, differ in their standard deviations (sy ≠ sx).
- Thus the value of b (the slope) does not necessarily indicate the value of r (the correlation).
- If the two variables, X and Y, vary greatly in their standard deviations,
- it is possible to encounter a very small slope (e.g., b=.001) and a high correlation.
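A quick way to see this is to rescale Y and watch what happens to b and r. The sketch below uses made-up data and hypothetical helper names. Dividing Y by 1000 shrinks the slope by the same factor but leaves the correlation unchanged.

```python
import math


def sums_of_squares(xs, ys):
    """Return (SSx, SSy, SSxy) computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return ss_x, ss_y, ss_xy


def slope_and_r(xs, ys):
    """Return (b, r) for Y regressed on X."""
    ss_x, ss_y, ss_xy = sums_of_squares(xs, ys)
    return ss_xy / ss_x, ss_xy / math.sqrt(ss_x * ss_y)


# Hypothetical toy data with a strong linear relationship.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 11.0]

b, r = slope_and_r(xs, ys)

# Same Y values expressed in units 1000 times larger.
ys_small = [y / 1000 for y in ys]
b_small, r_small = slope_and_r(xs, ys_small)

print(f"original: b = {b:.4f}, r = {r:.4f}")
print(f"rescaled: b = {b_small:.4f}, r = {r_small:.4f}")
```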
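Equivalent methods for calculating b:
- using the covariance of XY and the variance of X: $b = s_{xy} / s_x^2$
- using r and the standard deviations of the X and Y variables, as described in the section above: $b = r \, (s_y / s_x)$
Either method yields the same b for REAGAN84 regressed on REAGAN80.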
How can we interpret the b coefficient?
- b coefficients refer to the slopes of the regression line.
- b coefficients are interpreted as the amount of change in the dependent variable (Y) that is associated with a change in one unit of the independent variable (X).
- b coefficients are unstandardized, which means that the magnitude of their values is relative to the means and standard deviations of the independent and dependent variables in the equation.
- This means that the slopes can be interpreted directly in terms of the raw values of X and Y,
- whether the values are percentages (as in the regression of REAGAN84 on REAGAN80),
- money amounts (as in the regression of tax paid on income),
- or measured on different scales (e.g., battlefield casualties on tonnage of bombs dropped).
- In the case of REAGAN84 regressed on REAGAN80, for example, b=.878 can be interpreted in terms of a state's voting percentage for Reagan in 1984 and in 1980.
- Because the value of a b coefficient depends on the scaling of the raw data, which is somewhat arbitrary (should time be measured in years, months, days?), b coefficients cannot be easily compared across variables measured on different scales.
- Later we will consider another type of regression coefficient, beta, which is standardized such that it adjusts for the different means and variances of the variables.
Note that there is "another" regression line for any two
correlated variables, X and Y.
- The product-moment correlation, r, is symmetrical:
- the correlation is the same whether X or Y is regarded as the independent or the dependent variable.
- But regression analysis is not symmetrical:
- when the dependent and independent variables are switched, a different formula defines the least squares line for X regressed on Y.
- See Schmidt, pp. 192-193, for calculating this "other" regression line.
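The asymmetry can be illustrated with a small sketch on made-up data (the helper name `sums_of_squares` is hypothetical). The slope for Y regressed on X divides the cross-products by the variation of X, while the slope for X regressed on Y divides them by the variation of Y. The two slopes differ, yet their product equals r squared, which is the same in either direction.

```python
import math


def sums_of_squares(xs, ys):
    """Return (SSx, SSy, SSxy) computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return ss_x, ss_y, ss_xy


# Hypothetical toy data for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 11.0]

ss_x, ss_y, ss_xy = sums_of_squares(xs, ys)

b_yx = ss_xy / ss_x                  # Y regressed on X
b_xy = ss_xy / ss_y                  # X regressed on Y
r = ss_xy / math.sqrt(ss_x * ss_y)   # symmetrical in X and Y

print(f"b_yx = {b_yx:.4f}, b_xy = {b_xy:.4f}, 1 / b_xy = {1 / b_xy:.4f}")
print(f"b_yx * b_xy = {b_yx * b_xy:.4f}, r squared = {r ** 2:.4f}")
```

Note that the second line is not simply the first one inverted: 1 / b_xy matches b_yx only when the correlation is perfect.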
There is another way to measure prediction "error" in
units of measurement for Y:
- The standard error of the estimate is the standard deviation of observed values, Y, around predicted values, $\hat{Y}$.
- It is discussed in the handout from Schmidt on pp. 191-192.
- It is reported as STD ERR OF EST in the scatterplot statistics.
- The standard error of the estimate is less frequently used in statistical analysis than the coefficient of determination, $r^2$.
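As a rough sketch of the computation, the code below finds the standard error of the estimate for a made-up data set. The divisor is an assumption: many texts and statistical packages divide the sum of squared errors by N - 2, while a plain root mean squared error divides by N, so both are shown. Which one Schmidt or SPSS uses is not confirmed here.

```python
import math

# Hypothetical toy data for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# OLS slope and intercept for Y regressed on X.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
a = mean_y - b * mean_x

# Sum of squared errors of prediction around the regression line.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

see_n = math.sqrt(sse / n)         # divisor N (assumption: root mean squared error)
see_df = math.sqrt(sse / (n - 2))  # divisor N - 2 (assumption: common textbook form)

print(f"SSE = {sse:.2f}")
print(f"standard error of estimate, divisor N:     {see_n:.4f}")
print(f"standard error of estimate, divisor N - 2: {see_df:.4f}")
```

Either version is expressed in the units of Y, which is what distinguishes it from the unitless coefficient of determination.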
Comments on the effect of the pattern of plots on the regression line and the value of the correlation coefficient:
- Regression and correlation analysis is most appropriate when the plot is linear and homoscedastic.
- Linear regression analysis underestimates the strength of a curvilinear relationship between X and Y.
- A homoscedastic plot occurs when the variances of observed Y values are equal regardless of the X values.
- When the plot is heteroscedastic, the accuracy of predictions from X to Y depends on the value of X.
- Note also that outliers, such as Washington, D.C., can affect the relationship, acting to either lower or raise it.
Example: Go here to see the effect of dropping Washington, D.C. from the analysis.
- Whether one does or does not exclude a case from the
analysis rests with the analyst.
- In this instance, the researcher might exclude
Washington, D.C., which is 100% urban, because it is not
really a "state."
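The effect of a single outlier on r can be sketched with made-up data (none of these values come from the state vote file, and `pearson_r` is a hypothetical helper name). In the first case one deviant point lowers an otherwise perfect correlation. In the second, one extreme point creates a high correlation among points that are otherwise unrelated.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return ss_xy / math.sqrt(ss_x * ss_y)


# Case 1 (hypothetical): five points on a line, plus one point well off it.
xs_line = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys_line = [2.0, 3.0, 4.0, 5.0, 6.0, 3.0]
r_lowered = pearson_r(xs_line, ys_line)
r_line_only = pearson_r(xs_line[:-1], ys_line[:-1])

# Case 2 (hypothetical): four unrelated points, plus one extreme point.
xs_cluster = [1.0, 2.0, 1.0, 2.0, 10.0]
ys_cluster = [1.0, 1.0, 2.0, 2.0, 10.0]
r_raised = pearson_r(xs_cluster, ys_cluster)
r_cluster_only = pearson_r(xs_cluster[:-1], ys_cluster[:-1])

print(f"case 1: r with outlier = {r_lowered:.3f}, without = {r_line_only:.3f}")
print(f"case 2: r with outlier = {r_raised:.3f}, without = {r_cluster_only:.3f}")
```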