- Measures for continuous variables: regression and correlation
- Theoretical statements are statements of relationships between variables
- They adopt the general form: the greater A, the greater B (or the greater A, the less B)
- In the social sciences,
these theoretical statements are probabilistic
rather than deterministic
- Because the
relationships don't hold unfailingly, there is room to
doubt whether the relationship exceeds that due to chance
- Regression and correlation analysis allow us to measure the strength of the observed relationship between variables that are hypothesized to be related.
- Regression analysis
produces the BEST prediction (in a least-squares sense)
of a Y variable from knowledge of X.
- Given any two-dimensional pattern of points, there is one line that can be passed through the points to minimize the "squared deviations" from that line.
- We will learn later
this week how to calculate that line.
- Correlation analysis
succinctly summarizes the "fit" of the observed values to
those predicted by the "regression line."
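The least-squares line described above can be previewed numerically. Here is a minimal sketch in Python; the data values and the function name are made up for illustration and are not from the course materials:

```python
# Least-squares regression line: the slope and intercept below minimize
# the sum of squared vertical deviations of the points from the line.
# Data values are illustrative only.

def least_squares_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariation of X and Y over variation of X
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b, a = least_squares_line(xs, ys)
print(a, b)  # intercept and slope of the best-fitting line
```

Correlation analysis, taken up below, then summarizes how closely the observed points cluster around this line.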
- The term "regression" comes from Sir Francis Galton's pioneering analyses in the 1880s on the relationship between parents' mean heights and the heights of their adult offspring.
- After plotting the relationship with parent height on the baseline and offspring height on the ordinate, he observed that the relationship was linear and devised a procedure for finding the straight line that best fit the plotted points, in the sense that it minimized the deviations of the points from the line.
- He called this line the "regression" or "reversion" line, because he noted that short parents tended to have slightly taller offspring while tall parents tended to have slightly shorter offspring.
- It was left to Karl Pearson to perfect the formula for summarizing the closeness of the fit of points to the regression line in the product-moment correlation coefficient.
- Neither regression nor correlation PROVES any causal connection between the variables, but the techniques do permit tests of whether the observed relationship exceeds what would be expected by chance.
The Pearson Product-Moment CORRELATION
- The correlation coefficient, commonly expressed as r, indicates the strength of a relationship between two variables that are assumed to be measured on an interval or ratio scale.
- Properties of the correlation coefficient:
- r ranges from -1.0 to +1.0; the sign indicates whether the relationship is direct (+) or inverse (-).
- The absolute value of
the coefficient indicates its strength.
- There are many ways to skin a cat--and to compute the correlation coefficient.
- The SPSS Guide on page 178 gives this as the formula for r:
      r = [NΣXY − (ΣX)(ΣY)] / √{[NΣX² − (ΣX)²][NΣY² − (ΣY)²]}
- Personally, I regard this formula as too difficult to explain, and I prefer the following as a definitional formula:
      r = [Σ(X − X̄)(Y − Ȳ) / N] / √{[Σ(X − X̄)² / N][Σ(Y − Ȳ)² / N]}
- Interpreting the numerator in the definitional formula (above) for the correlation coefficient
- The top term, Σ(X − X̄)(Y − Ȳ), is the sum of the products of the joint deviations of the individual X and Y scores from their respective means. This term is called the cross-product sum of squares or simply the cross-product.
- The cross-product
gives the correlation its name as the
product-moment correlation, for it is the product
of the moments (deviations) of the X and Y values
from their respective means.
- Another common
name for the cross-product is covariation,
for the larger the value, the more the two
variables vary together or covary.
- As in the formula above, the covariation can be adjusted for the number of observations by dividing by N (number of cases) to produce the covariance--the average or mean amount that the paired observations vary together.
- Interpreting the denominator in the definitional formula for the correlation coefficient
- Close examination of this term, √{[Σ(X − X̄)² / N][Σ(Y − Ȳ)² / N]}, reveals that the quantity under the square root is the product of the variances of the X and Y variables.
- This term is used
in the denominator to adjust for the overall
variation in the individual variables.
- In effect,
the complete formula expresses the extent to which
the X and Y variables covary as a proportion
of the product of the standard deviations of
the X and Y variables.
Sometimes, the deviations of X and Y from their means are expressed as lowercase x and y. Then r is expressed as:
      r = (Σxy / N) / √[(Σx² / N)(Σy² / N)]
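The definitional approach described above--covariation over the square root of the product of the variations--can be written out directly in Python. This is a sketch with made-up data; the function name is illustrative:

```python
# Pearson's r by the definitional route: cross-product of deviations
# over the square root of the product of the sums of squares.
# Data values are illustrative only.

def pearson_r(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    covariation = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # Σxy
    ss_x = sum((x - mx) ** 2 for x in xs)                           # Σx²
    ss_y = sum((y - my) ** 2 for y in ys)                           # Σy²
    # The N's in numerator and denominator cancel, so they are omitted.
    return covariation / (ss_x * ss_y) ** 0.5

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(pearson_r(xs, ys))
```

Note that dividing numerator and denominator by N changes nothing, which is why the covariance/standard-deviation version and this version agree.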
Alternative ways to calculate the correlation
- The method above can be
described as the ratio of covariance over the product
of the standard deviations.
- This method can be expressed in these alternative ways:
      r = covariance(X, Y) / (sX · sY)
      r = (Σxy / N) / (sX · sY)
- Alternatively, one can compute r with a formula that factors out the N's in the numerator and denominator--thus relying on only the covariation and the product of the square roots of the sums of squares (variation):
      r = Σxy / √(Σx² · Σy²)
REVIEW OF TERMS
- Measures of variation for a single variable
  - SUM OF SQUARES = SS (also known simply as VARIATION)
  - VARIANCE = SS / N ("mean" sum of squares, mean as in "average")
  - STANDARD DEVIATION (square root of variance)
- Measures of variation for two variables taken together
  - COVARIATION (sum of cross-product deviations = cross-product SS)
  - COVARIANCE (average cross-product deviation) = COVARIATION / N
- Comments on the suffixes "ATION" and "ANCE"
  - "ATION" refers to "SUM OF SQUARES"
  - "ANCE" refers to the average or mean sum of squares, so these terms have a divisor of N.
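The review terms above can all be computed in a few lines. A short Python sketch, again with made-up data:

```python
# Computing the review terms for one variable X and for the pair (X, Y).
# Data values are illustrative only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

ss_x = sum((x - mx) ** 2 for x in xs)  # VARIATION: sum of squares (SS)
variance_x = ss_x / n                  # VARIANCE: "mean" sum of squares
std_dev_x = variance_x ** 0.5          # STANDARD DEVIATION

covariation = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # cross-product SS
covariance = covariation / n           # average cross-product deviation

print(ss_x, variance_x, std_dev_x, covariation, covariance)
```

The "ATION"/"ANCE" pattern shows up directly in the last two lines: the "ANCE" quantities are just the "ATION" quantities divided by N.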
SPSS procedures for correlational analysis
- From the Analyze Menu choose Correlate
and then Bivariate
- Transfer the variables you want to correlate into
the right hand window.
- Before clicking on OK to run the correlation,
click on Options
- Check off both boxes in the "Statistics" area
- Then click the "Continue" button
and then "OK" to run the correlation.
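For comparison only, the same bivariate Pearson correlations that SPSS produces through the Correlate > Bivariate dialog can be computed in Python with pandas. The column names and values here are made-up stand-ins, not variables from the nustates2000 data:

```python
# A pandas equivalent of SPSS's bivariate correlation output.
# Column names and values are hypothetical placeholders.
import pandas as pd

df = pd.DataFrame({
    "median_income": [30, 35, 40, 45, 50],   # illustrative values
    "bush_vote_pct": [55, 52, 50, 47, 44],
})
print(df.corr(method="pearson"))  # Pearson r for every pair of columns
```

Like the SPSS output, `DataFrame.corr` returns a symmetric matrix with 1.0 on the diagonal.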
Your next research assignment:
- Formulate a hypothesis about the effect of a socioeconomic variable on any political variable at the state level
- Run Correlation with the nustates2000 data to compute the correlation coefficient
- One possibility would
be to hypothesize about the causes of the Bush vote in
the 2000 election.
- Alternatively, you may wish to hypothesize about the causes of voting turnout in 2000.
- Compute the correlation coefficient.
- How strong is the
relationship between your independent variable and
your dependent variable?
- I will give a
Valuable Prize to the student who produces
the strongest valid correlation coefficient.
- A valid coefficient is one that reflects a genuine association, one not based on artifacts of measurement.