Multiple Regression 

Review of points in linear regression analysis:

  •  When the regression line is calculated for raw data for X and Y,
    • the regression line passes through the means of both variables.
    • the "unstandardized" slope of the regression line, b, states the change in Y that is due to a change of one unit of X.
      • Obviously, interpreting the change in Y depends on the "units" used to measure both X and Y.
      • A slope of 0 (b=0) would indicate the absence of a correlation between X and Y.
      • The correlation coefficient, r, would thus also be zero.
    • Therefore, testing the significance of the slope (b) is equivalent to testing the significance of the correlation coefficient (r).
    • However, the simple magnitude of a non-zero b-coefficient is not a reliable guide to the size of the correlation coefficient, for the slope is dependent on the ratio of the covariation of X and Y, to the variation of X.
    • This is equivalent to the ratio of the covariance of X and Y to the variance of X.
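    $$b_x = \frac{\text{covariation of } X \text{ and } Y}{\text{variation of } X} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}$$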
  •  When the regression line is calculated for z-scores calculated for the same data:
    • the regression line still passes through the means of both variables, which in the case of z-scores are both 0.
    • Therefore, the intercept is 0, and the a term in the unstandardized regression equation simply drops out.
    • Because the data are standardized, the slope of the equation is itself a standardized value, called a beta-coefficient.
    • The beta-coefficient also measures change in Y coming from one unit change in X, but now the "units" are in "standard deviations."
    • The correlation coefficient, r, remains the same whether calculated for the same data in raw or standardized form.
    • Moreover, in simple regression with a single X, the correlation coefficient, r, is equal to the standardized beta-coefficient.
    • The great value of the beta-coefficient is that it expresses the "effect" of one variable on another without regard to how differently the variables are scaled.
    • It does this by expressing the change in Y in standard deviations produced from a change of one standard deviation in X. 
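
The review points above can be checked numerically. The following is a minimal sketch using only the Python standard library; the five data points and the helper names `covariation` and `z_scores` are made up for illustration and are not part of the course materials.

```python
from statistics import mean

# Hypothetical raw scores, chosen only to keep the arithmetic easy to follow.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]


def covariation(u, v):
    """Sum of cross-products of deviations from the two means."""
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))


def z_scores(u):
    """Standardize to mean 0 and standard deviation 1."""
    m = mean(u)
    sd = (covariation(u, u) / len(u)) ** 0.5
    return [(ui - m) / sd for ui in u]


# Raw data: slope = covariation of X and Y / variation of X.
b = covariation(x, y) / covariation(x, x)
a = mean(y) - b * mean(x)  # intercept that puts the line through both means

# Pearson r scales the covariation by the variation of both variables.
r = covariation(x, y) / (covariation(x, x) * covariation(y, y)) ** 0.5

# Standardized data: the same formulas applied to z-scores.
zx, zy = z_scores(x), z_scores(y)
beta = covariation(zx, zy) / covariation(zx, zx)
intercept_z = mean(zy) - beta * mean(zx)

print(f"b = {b:.3f}, a = {a:.3f}, r = {r:.3f}, beta = {beta:.3f}")
```

With these values the raw slope b differs from r, while the slope computed on z-scores matches r and the standardized intercept is zero up to rounding error.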

Multiple regression analysis:

  •  An extension of simple regression to the case of multiple independent variables, X1 to Xn, and a single dependent variable, Y:
    • It is most appropriate when Y is a continuous variable.
    • The classic model also assumes the X variables to be continuous.
      • They certainly cannot be polychotomous nominal,
      • but regression analysis can handle independent nominal variables in dichotomous form -- so-called "dummy" variables.
      • The effects of dummy variables are more easily interpreted when they are scored "0" and "1" to indicate the absence or presence of an attribute.
      • Then the presence of the attribute -- e.g., SOUTH indicating a southern state if "1" and non-southern if "0" -- turns on the associated coefficient, which shows the effect of SOUTH on Y.
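
As a rough illustration of how a 0/1 dummy "turns on" its coefficient, the sketch below regresses a made-up Y on a SOUTH dummy. The six scores are hypothetical; the point is only that the slope equals the difference between the two group means and the intercept equals the mean of the group scored "0".

```python
from statistics import mean

# Hypothetical Y scores for six states; SOUTH = 1 for southern, 0 otherwise.
south = [1, 1, 1, 0, 0, 0]
y = [10.0, 12.0, 14.0, 20.0, 22.0, 24.0]


def covariation(u, v):
    """Sum of cross-products of deviations from the two means."""
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))


# Simple regression of Y on the dummy variable.
b = covariation(south, y) / covariation(south, south)
a = mean(y) - b * mean(south)

# Group means for comparison.
mean_south = mean([yi for yi, s in zip(y, south) if s == 1])
mean_other = mean([yi for yi, s in zip(y, south) if s == 0])

print(f"a = {a:.1f} (mean of non-southern states = {mean_other:.1f})")
print(f"b = {b:.1f} (southern mean minus non-southern mean = {mean_south - mean_other:.1f})")
```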
  • Under the Analyze menu in SPSS 10, the Regression submenu offers several choices; choose Linear.
    • The box allows you to enter a single dependent variable but multiple independent variables.
    • The "Statistics" button offers several boxes to check; the most relevant for our class are
      • Regression coefficient "estimates"
      • Model fit
      • Descriptives
    • Unfortunately, the "Plots" button only produces plots for residuals, so it is not useful for us.
    • The "Method" box offers several options; the most useful for us are
      • Enter -- which automatically forces all the independent variables you listed into the equation.
      • Stepwise -- which proceeds step-by-step,
        • first entering the variable that explains the most variance -- if it is significant at .05
        • then the variable that explains most of the remaining variance -- if it is significant at .05
        • and so on.
  •  The mathematics of regression analysis operates on the intercorrelations among all the variables -- dependent and independent.
    • An intercorrelation matrix is a table of all possible bivariate Pearson correlations among variables Y, X1, . . , Xn
    • Regression analysis finds the additive combination of independent variables that produces the best linear relationship between the observed Y values and the Y values predicted by the resulting regression equation.
    • The multiple correlation coefficient, R, is equal to the product moment correlation, r, that would be calculated if the observed values were correlated with the values computed from the regression equation.
    • Similarly, the multiple R-squared is equal to the proportion of variance in the dependent variable Y that is explained by the additive combination of effects of the independent variables, X1 to Xn.
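
The sketch below illustrates these points for the two-predictor case, working from the intercorrelation matrix alone. The data are hypothetical, and the closed-form expressions for the two beta-coefficients are the standard two-predictor formulas, used here in place of whatever general matrix routine a statistics package runs internally.

```python
from statistics import mean


def corr(u, v):
    """Pearson product-moment correlation."""
    mu, mv = mean(u), mean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5


def z_scores(u):
    """Standardize to mean 0 and standard deviation 1."""
    m = mean(u)
    sd = (sum((a - m) ** 2 for a in u) / len(u)) ** 0.5
    return [(a - m) / sd for a in u]


# Hypothetical data for Y and two independent variables.
data = {
    "Y": [3.0, 2.0, 6.0, 5.0, 9.0, 7.0],
    "X1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "X2": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
}

# Intercorrelation matrix: every bivariate Pearson r among Y, X1, X2.
names = list(data)
matrix = {(p, q): corr(data[p], data[q]) for p in names for q in names}
for p in names:
    print(f"{p:>2}", " ".join(f"{matrix[(p, q)]:6.3f}" for q in names))

r_y1, r_y2, r_12 = matrix[("Y", "X1")], matrix[("Y", "X2")], matrix[("X1", "X2")]

# Two-predictor beta-coefficients computed from the correlations alone.
beta1 = (r_y1 - r_y2 * r_12) / (1 - r_12 ** 2)
beta2 = (r_y2 - r_y1 * r_12) / (1 - r_12 ** 2)
r_squared = beta1 * r_y1 + beta2 * r_y2

# Multiple R as the simple correlation between observed and predicted Y.
zy, z1, z2 = (z_scores(data[k]) for k in ("Y", "X1", "X2"))
predicted = [beta1 * u + beta2 * v for u, v in zip(z1, z2)]
multiple_r = corr(zy, predicted)

print(f"R = {multiple_r:.3f}, R-squared = {r_squared:.3f}")
```

Squaring the correlation between observed and predicted values gives the same R-squared as the formula based on the beta-coefficients.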
  •  Multiple regression analysis assesses only direct, additive effects according to this model:
    X1 and X2 are uncorrelated, such that rx1x2 = 0.
  • If this causal model actually operated for the variables, then one could obtain the same results by computing two separate, simple regression equations and adding together the variance in Y that is explained separately by X1 and X2.

    Equation 1:  Y = beta(X1) = .5(X1)     r = .5     r-squared = .25
    Equation 2:  Y = beta(X2) = .4(X2)     r = .4     r-squared = .16

    Variance explained by adding variance of X1 and X2 = .25 + .16 = .41

  • This would be equivalent to constructing a regression equation from the beta-coefficients in the two independent regression equations:
EXAMPLE: Y = beta(X1) + beta(X2) = .5(X1) + .4(X2)
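
The additive example can be reproduced with constructed data. The sketch below builds X1, X2, and an error term that are exactly uncorrelated with one another (a balanced eight-case design, invented for this purpose) and weights them so that the simple correlations with Y come out to .5 and .4, as in the example. The two-predictor beta formulas are the standard ones.

```python
from math import sqrt
from statistics import mean


def corr(u, v):
    """Pearson product-moment correlation."""
    mu, mv = mean(u), mean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5


# Balanced design: X1, X2, and the error term are mutually uncorrelated,
# each with mean 0 and standard deviation 1 (hypothetical values).
x1 = [-1, -1, -1, -1, 1, 1, 1, 1]
x2 = [-1, -1, 1, 1, -1, -1, 1, 1]
e = [-1, 1, -1, 1, -1, 1, -1, 1]

# Y = .5(X1) + .4(X2) + error, with the error scaled so Y has variance 1.
y = [0.5 * a + 0.4 * b + sqrt(0.59) * c for a, b, c in zip(x1, x2, e)]

r_12 = corr(x1, x2)
r_y1 = corr(y, x1)
r_y2 = corr(y, x2)

# Adding the variance explained by two separate simple regressions.
sum_of_r_squared = r_y1 ** 2 + r_y2 ** 2

# Multiple regression on both predictors (two-predictor formulas).
beta1 = (r_y1 - r_y2 * r_12) / (1 - r_12 ** 2)
beta2 = (r_y2 - r_y1 * r_12) / (1 - r_12 ** 2)
multiple_r_squared = beta1 * r_y1 + beta2 * r_y2

print(f"r(X1, X2) = {r_12:.3f}")
print(f"separate r-squared values sum to {sum_of_r_squared:.3f}")
print(f"multiple R-squared = {multiple_r_squared:.3f}")
```

Because the correlation between X1 and X2 is zero here, the betas from the multiple regression equal the simple correlations and the separately explained variances add up to the multiple R-squared.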
  •  In most cases the simple model of uncorrelated independent variables does not hold, and the effects of variables computed from separate simple regressions cannot simply be added together.
  •  The value of multiple regression analysis is that it discounts for overlapping explanation of Y between correlated independent variables and expresses the NET effects of each independent variable controlling for any others in the equation.
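
To see the discounting at work, the sketch below keeps the simple correlations of .5 and .4 from the notes' example but assumes, purely for illustration, that X1 and X2 correlate at .6. The two-predictor formulas then give the net (beta) effects and the multiple R-squared.

```python
# Hypothetical correlations: r(Y, X1), r(Y, X2), and r(X1, X2).
r_y1, r_y2, r_12 = 0.5, 0.4, 0.6

# Naive total from two separate simple regressions.
naive_sum = r_y1 ** 2 + r_y2 ** 2

# Net effects of each variable, controlling for the other.
beta1 = (r_y1 - r_y2 * r_12) / (1 - r_12 ** 2)
beta2 = (r_y2 - r_y1 * r_12) / (1 - r_12 ** 2)

# Variance in Y explained by the two variables together.
r_squared = beta1 * r_y1 + beta2 * r_y2

print(f"naive sum of r-squared values = {naive_sum:.3f}")
print(f"beta1 = {beta1:.3f}, beta2 = {beta2:.3f}")
print(f"multiple R-squared = {r_squared:.3f}")
```

Under these assumed correlations both betas are smaller than the corresponding simple correlations, and the multiple R-squared falls well short of the naive sum, because the explanation shared by X1 and X2 is counted only once.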
  •  Multiple regression through the stepwise procedure in SPSS demonstrates these features.
    • The first variable entered in the equation is that which has the highest simple correlation with the dependent variable Y -- that is, the program selects as the first variable that which explains most of the variance in Y.
    • The next variable entered is that which has the highest partial correlation with Y -- that is, the variable that explains most of the remaining variance in Y while controlling for the first -- assuming that the variable meets the stated level of significance (the default for entering a variable is .05).
    • With the addition of each new variable, a variable previously IN the equation may drop below significance (the default value for removing a variable already in the equation is .10) and be removed from the equation -- its explanatory power replaced by other variables now in the equation.
    • And so it goes with each remaining variable in the set of independent variables until one of two conditions is met:
      • All the variables are entered in the equation, which usually happens only when the number of variables is less than 5.
      • No additional variables would show "significant" effects IF added to the equation.
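
The first two steps of the stepwise logic can be sketched from a correlation matrix. Everything here is an assumption for illustration: the correlations, the sample size of 200, and the helper names are hypothetical, and a fixed F-to-enter cutoff of 3.84 (the large-sample .05 critical value for one numerator degree of freedom) stands in for the exact probability test that SPSS applies. The removal step is omitted.

```python
from math import sqrt

N = 200  # hypothetical number of cases

# Hypothetical simple correlations of each X with Y ...
r_y = {"X1": 0.60, "X2": 0.50, "X3": 0.20}
# ... and among the X variables.
r_x = {("X1", "X2"): 0.70, ("X1", "X3"): 0.10, ("X2", "X3"): 0.05}

F_TO_ENTER = 3.84  # assumed stand-in for the .05 entry criterion


def r_between(a, b):
    """Look up the correlation between two X variables in either order."""
    return r_x[(a, b)] if (a, b) in r_x else r_x[(b, a)]


def partial_with_y(v, control):
    """Partial correlation of v with Y, controlling for one variable."""
    num = r_y[v] - r_y[control] * r_between(v, control)
    den = sqrt((1 - r_y[control] ** 2) * (1 - r_between(v, control) ** 2))
    return num / den


# Step 1: the variable with the highest simple correlation with Y.
first = max(r_y, key=lambda v: abs(r_y[v]))
f_first = r_y[first] ** 2 * (N - 2) / (1 - r_y[first] ** 2)

# Step 2: among the rest, the highest partial correlation with Y.
partials = {v: partial_with_y(v, first) for v in r_y if v != first}
second = max(partials, key=lambda v: abs(partials[v]))
f_second = partials[second] ** 2 * (N - 3) / (1 - partials[second] ** 2)

entered = [first] if f_first >= F_TO_ENTER else []
if entered and f_second >= F_TO_ENTER:
    entered.append(second)

print("partial correlations after step 1:", partials)
print("entered:", entered)
```

In this made-up matrix X2 has the second-highest simple correlation with Y, but because it overlaps heavily with X1 its partial correlation is lower than that of X3, so X3 is the candidate at the second step.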
  • Interpretation of the resulting regression equation can be done using either b-coefficients or beta-coefficients -- or both:
    • b-coefficients, which are unstandardized, show the net effect in Y which is associated with one unit change in X -- all in raw data values.
    • beta-coefficients, which are standardized, show the net effect in Y which is associated with one unit change in X -- but now the changes are in standard deviations of both variables.
      • Because b-coefficients deal with raw (or "original") values, the b-coefficients should be used to construct the prediction equation from the X variables to the Y variable.
      • Because beta-coefficients are standardized, they should be used to compare the "effects" of variables within equations.
    • Both b-coefficients and beta-coefficients can be interpreted as controlling for the effects of other variables.
    • If the b-coefficient is significant, determined by applying the t-test to the ratio of the coefficient to its standard error, then the beta-coefficient is significant.
    • The F value associated with a multiple regression equation tests for the significance of the multiple R for the entire equation.
      • It is possible to have a significant R and some variables that are not significant.
      • These probably should be removed from the equation, for they are not adding any appreciable explanation -- but there may be theoretical reasons for leaving them in.
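
The relationships among b, beta, the t-test, and the overall F can be written out in a few lines. The numbers below are hypothetical regression output, not results from any course data set. The rescaling of the standard error treats the sample standard deviations as fixed constants, which is why the two t-ratios coincide.

```python
# Hypothetical output for one predictor in a two-predictor equation.
b = 2.0        # unstandardized coefficient
se_b = 0.5     # standard error of b
sd_x = 3.0     # standard deviation of X
sd_y = 10.0    # standard deviation of Y

# A beta-coefficient is the b-coefficient rescaled into standard deviations.
beta = b * sd_x / sd_y
se_beta = se_b * sd_x / sd_y  # same rescaling applied to the standard error

# t-test: ratio of a coefficient to its standard error.
t_b = b / se_b
t_beta = beta / se_beta

# F-test for the multiple R of the whole equation
# (k independent variables, n cases; values are hypothetical).
r_squared, k, n = 0.41, 2, 50
f_value = (r_squared / k) / ((1 - r_squared) / (n - k - 1))

print(f"beta = {beta:.2f}, t for b = {t_b:.2f}, t for beta = {t_beta:.2f}")
print(f"F({k}, {n - k - 1}) = {f_value:.2f}")
```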
  • Evaluating the "fit" of a multiple regression equation, or the strength of the relationship:
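    • The R-squared value for a multiple regression equation never decreases, and usually increases, as new variables are added, whether or not they matter; it can reach 1.0 once the number of coefficients in the equation (including the intercept) equals the total number of cases, N.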
    • The "adjusted" R-squared allows for the additional explanation of a new variable matched against the loss of one degree of freedom for entering it in the equation.
    • If the adjusted R-squared goes down with the addition of a new variable, that variable ordinarily should not be included in the equation.
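
A short sketch of the adjustment, using the usual formula 1 - (1 - R-squared)(N - 1)/(N - k - 1), where k is the number of independent variables. The R-squared values and the sample size of 30 are hypothetical.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R-squared for n cases and k independent variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


# Hypothetical example: a third variable raises R-squared only slightly.
adj_two = adjusted_r_squared(0.410, n=30, k=2)
adj_three = adjusted_r_squared(0.415, n=30, k=3)

print(f"two variables:   R-squared = .410, adjusted = {adj_two:.4f}")
print(f"three variables: R-squared = .415, adjusted = {adj_three:.4f}")
```

Here the raw R-squared rises from .410 to .415, but the adjusted value falls, which by the rule above suggests leaving the third variable out.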