4 Linear regression

4.1 Linear regression line of best fit using OLS

Linear regression is a method to summarize associations between variables. To calculate a linear regression line of best fit, researchers often use the ordinary least squares (OLS) method. Let’s discuss how OLS works, using the plot below, which has three points, with (X,Y) values of (0,2), (1,4), and (2,12). The diagonal line in the plot below is the line of best fit from an ordinary least squares regression. For causal claims, the convention is to use Y for an outcome and X for an influence on that outcome.

In the plot below, the red vertical lines indicate the residuals. A residual is the difference between an observed Y and the corresponding predicted Y.

For an OLS ordinary least squares regression, the line of best fit minimizes the sum of the squared residuals, so that another line cannot be drawn that would produce a smaller sum of the squared residuals. In the table below, the squared residuals are 1, 4, and 1, and the sum of the squared residuals is 6 (which is 1+4+1). So no other line of best fit for these three points would produce a sum of the squared residuals that is less than 6.

X Observed Y Predicted Y Residual Squared residual
0 2 1 1 1
1 4 6 -2 4
2 12 11 1 1

4.2 Simple linear regression

A simple linear regression involves two variables. For a simple linear regression, a straight line of best fit is drawn through the center of the set of observations, to indicate the general association between the two variables. The plot below illustrates this, using variation in an X variable on the horizontal axis (the number of police officers per 100,000 residents in a state in the United States in 2018), to predict variation in a Y variable on the vertical axis (the number of homicides per 100,000 residents in the state in 2018). Each point in the plot represents one of the 50 states in the United States.

Statistical software can draw the straight line of best fit through the points and can report statistical output about this line of best fit, like below:

reg HOMICIDES POLICE, noheader
running D:\OneDrive - IL State University\2-teaching\R for teaching\POL497\prof

> ile.do ...


------------------------------------------------------------------------------
   HOMICIDES |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      POLICE |      0.016      0.006     2.75   0.008        0.004       0.028
       _cons |     -0.290      2.194    -0.13   0.895       -4.701       4.121
------------------------------------------------------------------------------

The linear regression output has a few important numbers. Let’s start with the -0.290 estimate for the coefficient on the constant (_cons, also referred to as the constant/intercept). The constant/intercept for a linear regression is the predicted outcome when all predictors are set to zero. In this case, the constant/intercept indicates that the predicted homicide rate per 100,000 residents is -0.290 for a state that had zero police officers. It’s impossible to have a negative homicide rate, but linear regression merely draws a line of best fit through points, and nothing prevents a linear regression from producing impossible predictions.

For our analysis, a more important number is the 0.016 estimate on the POLICE predictor. For a linear regression, the estimate for a predictor can be thought of as a slope: for a one-unit increase in the predictor, the predicted outcome changes by the coefficient for the predictor. In this case, the 0.016 coefficient estimate indicates that, for each one-unit increase in POLICE, the HOMICIDES outcome is predicted to increase by 0.016. That positive coefficient indicates that states with a higher number of police officers per 100,000 residents are predicted to have a higher homicide rate, on average, compared to states with a lower number of police officers per 100,000 residents.

The column “P>|t|” reports two-tailed p-values. The “noheader” option omits some output that we don’t need to discuss yet.


Notice that the linear regression coefficient estimates can be placed in the formula for a line: Y = mX + b, in which m is the slope and b is the y-intercept:

HOMICIDES = 0.016*POLICE + -0.290
We can use this formula to make predictions. Let’s calculate the predicted number of homicides per 100,000 residents for a state that had 400 police officers per 100,000 residents:

HOMICIDES = 0.016*POLICE + -0.290
HOMICIDES = 0.016*400    + -0.290
HOMICIDES = 6.11

That prediction merely tells us where the X and Y meet the line of best fit, as indicated by the red dashed line:


Often a linear regression equation places the constant/intercept first:

HOMICIDES = -0.290 + 0.016*POLICE

But the order of the terms of the equation doesn’t change the predictions.

Sample practice items

The output below is based on survey data from the ANES 2016 Time Series Study. The output is for an analysis that uses respondent years of age (a variable called “AGE”, coded from 18 to 90) to predict respondent feeling thermometer ratings about police (a variable called “FTPOLICE”, coded from 0 for very cold ratings to 100 for very warm ratings):

reg FTPOLICE AGE, noheader
running D:\OneDrive - IL State University\2-teaching\R for teaching\POL497\prof

> ile.do ...


------------------------------------------------------------------------------
    FTPOLICE |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         AGE |      0.267      0.021    12.66   0.000        0.225       0.308
       _cons |     62.268      1.105    56.34   0.000       60.101      64.435
------------------------------------------------------------------------------

The constant/intercept coefficient of 62.268 indicates the predicted rating about police among…

  1. respondents who are zero years old
  2. respondents who are 18 years old
  3. respondents who are the average age among respondents
  4. respondents who are 62.268 years old
Answer
  1. respondents who are zero years old

What does the age coefficient of 0.267 indicate?

  1. How much the predicted rating about police changes for each one-unit increase in age
  2. The difference in predicted ratings about police between a young respondent and an old respondent
  3. The predicted rating about police among a respondent zero years old
  4. The predicted rating about police among a respondent at the average age among respondents
Answer
  1. How much the predicted rating about police changes for each one-unit increase in age

Based on the linear regression output, what would be the predicted rating about police from a 50-year-old respondent, to two decimal places?

  1. 13.5
  2. 48.9
  3. 62.3
  4. 75.6
Answer
  1. 75.6
    62.268 + 0.267*50 = 75.6

Let’s use the linear regression below, which uses a measure of respondent age to predict respondent ratings about the U.S. Congress (FTCONGRESS), based on data from the ANES 2016 Time Series Study:

reg FTCONGRESS AGE, noheader
running D:\OneDrive - IL State University\2-teaching\R for teaching\POL497\prof

> ile.do ...


------------------------------------------------------------------------------
  FTCONGRESS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         AGE |     -0.064      0.021    -2.98   0.003       -0.105      -0.022
       _cons |     45.887      1.121    40.95   0.000       43.690      48.084
------------------------------------------------------------------------------

What does the 45.887 coefficient for the constant/intercept indicate?

  1. The predicted mean rating about the U.S. Congress is 45.887.
  2. The predicted mean rating about the U.S. Congress is 45.887 among respondents who are 18 years old.
  3. The predicted mean rating about the U.S. Congress is 45.887 among respondents who are 0 years old.
Answer
  1. The predicted mean rating about the U.S. Congress decreases by -0.063 for each one-unit increase in age.

What does the -0.064 coefficient estimate for age indicate?

  1. The mean rating about the U.S. Congress is -0.064.
  2. The predicted mean rating about the U.S. Congress decreases by -0.064 for each one-unit increase in age.
  3. The predicted mean rating about the U.S. Congress is 0.064 lower for old respondents than for young respondents.
Answer
  1. The predicted mean rating about the U.S. Congress decreases by -0.064 for each one-unit increase in age.

Does the analysis provide sufficient evidence at the conventional level in political science that, at least in these data, respondent age associates with respondent ratings about the U.S. Congress.

  1. Yes
  2. No
Answer
  1. Yes

Does the analysis provide sufficient evidence at the conventional level in political science that, at least in these data and at least on average, respondents getting older caused respondent ratings about the U.S. Congress to get lower?

  1. Yes
  2. No
Answer
  1. No

Let’s practice interpreting a linear regression, using survey data from the ANES 2016 Time Series Study. The output below predicts a participant’s ratings about Donald Trump (FTTRUMP) using a predictor for the participant’s age in years (AGE):

reg FTTRUMP AGE, noheader
running D:\OneDrive - IL State University\2-teaching\R for teaching\POL497\prof

> ile.do ...


------------------------------------------------------------------------------
     FTTRUMP |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         AGE |       0.24       0.03     7.31   0.000         0.18        0.31
       _cons |      30.04       1.74    17.24   0.000        26.62       33.45
------------------------------------------------------------------------------

What does the 30.04 coefficient estimate for the constant/intercept indicate?

  1. The mean rating about Donald Trump is predicted to be 30.04.
  2. The mean rating about Donald Trump is predicted to be 30.04 among participants of age zero.
  3. The mean rating about Donald Trump is predicted to be 30.04 among participants of the average age.
  4. The mean rating about Donald Trump is predicted to increase 30.04 for a one-unit increase in participant age.
  5. The mean rating about Donald Trump is predicted to be 30.04 units higher for old participants than for young participants.
Answer
  1. The mean rating about Donald Trump is predicted to be 30.04 among participants of age zero.

What does the 0.24 coefficient estimate for AGE indicate?

  1. The mean rating about Donald Trump is predicted to be 0.24.
  2. The mean rating about Donald Trump is predicted to be 0.24 among participants of age zero.
  3. The mean rating about Donald Trump is predicted to be 0.24 among participants of the average age.
  4. The mean rating about Donald Trump is predicted to increase 0.24 for a one-unit increase in participant age.
  5. The mean rating about Donald Trump is predicted to be 0.24 units higher for old participants than for young participants.
Answer
  1. The mean rating about Donald Trump is predicted to increase 0.24 for a one-unit increase in participant age.

Of the following, which are justified interpretations of the p-value p<0.05 for the 0.24 coefficient estimate for AGE?

  1. The p-value indicates that there is sufficient evidence at the conventional level in political science that participant age positively associates with ratings about Donald Trump, at least among these participants and at least on average.
  2. The p-value indicates that there is sufficient evidence at the conventional level in political science that getting older causes a participant to have higher ratings about Donald Trump, at least among these participants and at least on average.
  3. Both of the above
  4. Neither of the above
Answer
  1. The p-value indicates that there is sufficient evidence at the conventional level in political science that participant age positively associates with ratings about Donald Trump, at least among these participants and at least on average.

4.3 Linear regression with categorical predictors

Let’s use survey data from the ANES 2016 Time Series Study to illustrate how to read statistical output for categorical predictors in a linear regression. The outcome variable is respondent feeling thermometer ratings about feminists (a variable called “FTFEMINISTS”, coded from 0 to 100). Our predictor is PARTY06, which is participant partisan identification, on a scale that has seven levels that range from 0 for “strong Democrat” to 6 for “strong Republican”. The plot below has dots to indicate the observations (jittered a bit left or right, so that we can see more of the observations that would otherwise be on top of each other). The blue line in the plot below is the line of best fit through these points, which indicates an on-average negative association between PARTY06 and FTFEMINISTS.

Below is the statistical output for this blue regression line. The 71.20 coefficient for the constant/intercept indicates the y-intercept of the line. This y-intercept indicates the predicted value of the outcome when all predictors are set to zero. In this case, when PARTY06 is zero, that’s a strong Democrat, so the mean predicted value of FTFEMINISTS is 71.20 among strong Democrats. The -5.32 coefficient for PARTY06 indicates the slope of the line. The slope of the line indicates the change in the predicted outcome for a one-unit change in the predictor. In this case, as PARTY06 changes from, say, 0 to 1, the predicted value of FTFEMINISTS changes from 71.20 to (71.20 + -5.32), so that the mean predicted value of FTFEMINISTS is 65.88 among not strong Democrats.

reg FTFEMINISTS PARTY06, noheader
running D:\OneDrive - IL State University\2-teaching\R for teaching\POL497\prof

> ile.do ...


------------------------------------------------------------------------------
 FTFEMINISTS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     PARTY06 |      -5.32       0.18   -29.17   0.000        -5.68       -4.96
       _cons |      71.20       0.65   109.44   0.000        69.92       72.47
------------------------------------------------------------------------------

The y-intercept and slope of the line can be used to write a formula for calculating predicted values of the outcome, using the format Y=mX + b. In this case, and placing b before mX:

FTEMINISTS = 71.20 + (-5.32 * PARTY06)

So the predicted value of FTFEMINISTS among strong Republicans would be:

FTEMINISTS = 71.20 + (-5.32 * PARTY06)
FTEMINISTS = 71.20 + (-5.32 * 6)
FTEMINISTS = 71.20 + (-31.92)
FTEMINISTS = 39.28

Our original linear regression using PARTY06 to predict FTFEMINISTS used a single line of best fit. That line of best fit makes better predictions than random guessing, but a single line of best fit isn’t the best that we can do. Check the plot below, in which the red dots indicate the mean value of FTFEMINISTS at different levels of PARTY06. The line of best fit prediction is close for PARTY06 of 0 and PARTY06 of 6, but is too high for PARTY06 of 2 and is too low for PARTY06 of 3.

To improve our predictions, we can model PARTY06 as a categorical predictor and then make separate predictions for each category of PARTY06. Let’s conduct a linear regression doing that, below. In Stata, the “i.” prefix tells Stata to treat the predictor as a categorical predictor (the “i” stands for “indicator”).

Below is the linear regression output:

reg FTFEMINISTS i.PARTY06, noheader
running D:\OneDrive - IL State University\2-teaching\R for teaching\POL497\prof

> ile.do ...


------------------------------------------------------------------------------
 FTFEMINISTS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     PARTY06 |
Not stron..  |     -11.50       1.36    -8.46   0.000       -14.17       -8.84
Lean Demo..  |      -7.07       1.41    -5.02   0.000        -9.83       -4.31
Independent  |     -21.51       1.36   -15.78   0.000       -24.19      -18.84
Lean Repu..  |     -27.89       1.39   -20.06   0.000       -30.62      -25.17
Not stron..  |     -24.47       1.40   -17.53   0.000       -27.21      -21.73
Strong Re..  |     -33.18       1.27   -26.09   0.000       -35.67      -30.69
             |
       _cons |      73.08       0.84    86.57   0.000        71.43       74.74
------------------------------------------------------------------------------

For the categorical predictor above – and for all categorical predictors – one of the categories much be omitted, to be used as the reference category. This reference category is placed into the constant/intercept. In the linear regression above, the omitted category is “Strong Democrat”, and all other categories are interpreted relative to the omitted category. So the constant/intercept of 73.08 is the predicted level of FTFEMINISTS among the omitted category of strong Democrat. The -11.50 coefficient for not strong Democrat indicates that the predicted level of FTFEMINISTS among not strong Democrats is 11.50 units below the predicted level of FTFEMINISTS among the omitted category of strong Democrat (so that’s 73.08 minus 11.50, which is 61.58). The -7.07 coefficient for Lean Democrat indicates that the predicted level of FTFEMINISTS among participants who lean Democrat is 7.07 units below the predicted level of FTFEMINISTS among the omitted category of strong Democrat (so that’s 73.08 minus 7.07, which is 66.01).

Just like before, we can write an equation for the predictions:

FTEMINISTS = 73.08 
             + -11.50*(Not strong Democrat) 
             +  -7.07*(Lean Democrat) 
             + -21.51*(Independent)
             + -27.89*(Lean Republican) 
             + -24.47*(Not strong Republican) 
             + -33.18*(Strong Republican)

Important note: For a categorical predictor, the coefficient always refers to a comparison with the omitted category. The numeric coding of a predictor does not matter when the predictor is used in a regression as a categorical predictor. The PARTY06 variable is coded so that Strong Republican is coded 6, but the calculation of the coefficient for the “Strong Republican” category of PARTY06 does not use this 6. So, for example, to get the predicted level of FTFEMINISTS among Strong Republicans, we start with the constant/intercept of 73.08 and then add in the -33.18 coefficient for Strong Republican, to get a predicted FTFEMINISTS level of 39.90 among Strong Republicans. For a categorical predictor, do not multiply the -33.18 coefficient for Strong Republican by the numeric coding of 6 for Strong Republican. The -33.18 is essentially multiplied only by 1, because the -33.18 coefficient indicates the predicted change from the omitted category to the Strong Republican category.

Let’s illustrate that below, by plugging in 0 if the Strong Republican category is not used and 1 if the Strong Republican category is used.

FTEMINISTS = 73.08 
             + -11.50*(Not strong Democrat) 
             +  -7.07*(Lean Democrat) 
             + -21.51*(Independent)
             + -27.89*(Lean Republican) 
             + -24.47*(Not strong Republican) 
             + -33.18*(Strong Republican)
             
FTEMINISTS = 73.08 
             + -11.50*(0) 
             +  -7.07*(0) 
             + -21.51*(0)
             + -27.89*(0) 
             + -24.47*(0) 
             + -33.18*(1)      
             
FTEMINISTS = 39.90          

Let’s do another example, to get the predicted level of FTEMINISTS among a Lean Democrat respondent:

FTEMINISTS = 73.08 
             + -11.50*(Not strong Democrat) 
             +  -7.07*(Lean Democrat) 
             + -21.51*(Independent)
             + -27.89*(Lean Republican) 
             + -24.47*(Not strong Republican) 
             + -33.18*(Strong Republican)
             
FTEMINISTS = 73.08 
             + -11.50*(0) 
             +  -7.07*(1) 
             + -21.51*(0)
             + -27.89*(0) 
             + -24.47*(0) 
             + -33.18*(0)     
             
FTEMINISTS = 66.01      

Let’s conduct the same linear regression but exclude the “Independent” category. The constant/intercept of 51.57 is now the predicted outcome among the omitted category, of Independents.

reg FTFEMINISTS ib3.PARTY06, noheader
running D:\OneDrive - IL State University\2-teaching\R for teaching\POL497\prof

> ile.do ...


------------------------------------------------------------------------------
 FTFEMINISTS |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     PARTY06 |
Strong De..  |      21.51       1.36    15.78   0.000        18.84       24.19
Not stron..  |      10.01       1.51     6.63   0.000         7.05       12.97
Lean Demo..  |      14.45       1.55     9.29   0.000        11.40       17.49
Lean Repu..  |      -6.38       1.54    -4.15   0.000        -9.40       -3.36
Not stron..  |      -2.96       1.54    -1.92   0.055        -5.98        0.07
Strong Re..  |     -11.67       1.43    -8.15   0.000       -14.47       -8.86
             |
       _cons |      51.57       1.07    48.18   0.000        49.47       53.67
------------------------------------------------------------------------------

Just like before, we can write an equation for the predictions:

FTEMINISTS = 51.57 
             +  21.51*(Strong Democrat) 
             +  10.01*(Not strong Democrat) 
             +  14.45*(Lean Democrat) 
             +  -6.38*(Lean Republican) 
             +  -2.96*(Not strong Republican) 
             + -11.67*(Strong Republican)

So if we wanted to get a prediction for a participant who leans Democrat:

FTEMINISTS = 51.57 
             +  21.51*(Strong Democrat) 
             +  10.01*(Not strong Democrat) 
             +  14.45*(Lean Democrat) 
             +  -6.38*(Lean Republican) 
             +  -2.96*(Not strong Republican) 
             + -11.67*(Strong Republican)
               
FTEMINISTS = 51.57 
             +  21.51*(0)               
             +  10.01*(0)                   
             +  14.45*(1) 
             +  -6.38*(0)               
             +  -2.96*(0)                     
             + -11.67*(0)
               
FTEMINISTS = 51.57 
             + 14.45     
 
FTEMINISTS = 66.02

That 66.02 prediction for lean Democrat from the regression that omitted the Independent category is the same as the 66.01 prediction from the regression that omitted the strong Republican category, with the difference due only to rounding error.

Sample practice items

Let’s predict a participant’s ratings about Donald Trump, but let’s use a predictor for the participant’s race, coded 1 for White, 2 for Black, 3 for Asian, or 99 for Other race, with “Other race” as the omitted category:

reg FTTRUMP ib99.RACE, noheader
running D:\OneDrive - IL State University\2-teaching\R for teaching\POL497\prof

> ile.do ...


------------------------------------------------------------------------------
     FTTRUMP |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        RACE |
      White  |      15.04       1.60     9.38   0.000        11.89       18.18
      Black  |     -11.43       2.35    -4.86   0.000       -16.04       -6.82
      Asian  |       5.38       3.49     1.54   0.123        -1.46       12.22
             |
       _cons |      32.09       1.46    21.96   0.000        29.23       34.96
------------------------------------------------------------------------------

What does the 32.38 coefficient estimate for the constant/intercept indicate?

  1. The mean rating about Donald Trump is predicted to be 32.38.
  2. The mean rating about Donald Trump is predicted to be 32.38 among “Other race” participants.
  3. The mean rating about Donald Trump is predicted to increase by 32.38 for a one-unit increase in participant race.
  4. The mean rating about Donald Trump is predicted to be 32.38 units higher for “Other race” participants than for residual participants.
Answer
  1. The mean rating about Donald Trump is predicted to be 32.38 among “Other race” participants.

What does the 14.75 coefficient estimate for RACEWhite indicate?

  1. The mean rating about Donald Trump is predicted to be 14.75 among White participants.
  2. The mean rating about Donald Trump is predicted to be 14.75 higher among White participants than among all other participants.
  3. The mean rating about Donald Trump is predicted to be 14.75 higher among White participants than among “Other race” participants.
Answer
  1. The mean rating about Donald Trump is predicted to be 14.75 higher among White participants than among “Other race” participants.

Of the following, which are justified interpretations of the p-value p=0.14 for the coefficient estimate for RACEAsian?

  1. The p-value indicates that there is insufficient evidence at the conventional level in political science that Asians rate Donald Trump any different on average than all other participants rate Donald Trump.
  2. The p-value indicates that there is insufficient evidence at the conventional level in political science that Asians rate Donald Trump any different on average than “Other race” participants rate Donald Trump.
Answer
  1. The p-value indicates that there is insufficient evidence at the conventional level in political science that Asians rate Donald Trump any different on average than “Other race” participants rate Donald Trump.

4.4 Note on terminology

The variable that an analysis attempts to explain variation in has many names, such as the outcome variable, the predicted variable, the dependent variable, or Y. The variables that are used to try to explain variation in Y have many names, such as the explanatory variables, the independent variables, or Xs. The terms “outcome variable” and “predictor variables” seem relatively intuitive, so I’ll try to use them. But one trick to remember the difference between a dependent variable and an independent variable is that the predicted value of a dependent variable depends on the values of the independent variables.