Including a continuous predictor with a separate predictor indicating missing values in the continuous predictor

I posted earlier about Filindra et al 2022 "Beyond Performance: Racial Prejudice and Whites' Mistrust of Government". This post discusses part of the code for Filindra et al 2022.

---

Tables in Filindra et al 2022 have a pair of variables called "conservatism (ideology)" and "conservatism not known" and a pair of variables called "income" and "income not known". As an example of how the "not known" variables work: if a respondent in the 2016 data did not provide a substantive response to the ideology item, Filindra et al 2022 coded that respondent as 1 on the dichotomous 0-or-1 "conservatism not known" variable and imputed a value of zero on the seven-level "conservatism (ideology)" variable, with zero indicating "extremely liberal".

I don't recall seeing that method before, so I figured I would post about it. I reproduced the Filindra et al 2022 Table 1 results for the 2016 data and then changed the value imputed for "conservatism (ideology)" among the "not known" respondents from 0 (extremely liberal) to 1 (extremely conservative). That changed the coefficient and t-statistic for the "conservatism not known" predictor, but did not change the coefficient or t-statistic for the "conservatism (ideology)" predictor or for any other predictor (log of the Stata output).
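Here is a minimal sketch of why that happens, in Python rather than Stata and on simulated data rather than the Filindra et al 2022 data. Every name in it (`ideology`, `control`, `fit`, and so on) is made up for illustration, and the regression is plain OLS solved from the normal equations. The idea is that the "not known" indicator absorbs whatever value is imputed, so switching the imputed value from 0 to 1 shifts only the indicator's coefficient, by exactly the ideology coefficient.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(rows, y):
    """Return OLS coefficients by solving the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Simulated data (an assumption, not the survey data): a seven-level ideology
# measure scaled 0 to 1, with None for "not known", plus one other control.
rng = random.Random(0)
n = 300
ideology = [None if rng.random() < 0.15 else rng.randrange(7) / 6 for _ in range(n)]
control = [rng.gauss(0, 1) for _ in range(n)]
y = [
    0.5 + 0.8 * (0.4 if v is None else v) + 0.3 * c + rng.gauss(0, 1)
    for v, c in zip(ideology, control)
]


def fit(fill):
    """Regress y on [constant, ideology with `fill` imputed, not-known indicator, control]."""
    rows = [
        [1.0, fill if v is None else v, 1.0 if v is None else 0.0, c]
        for v, c in zip(ideology, control)
    ]
    return ols(rows, y)


fit_0 = fit(0.0)  # "not known" respondents imputed as 0
fit_1 = fit(1.0)  # "not known" respondents imputed as 1

for name, b0, b1 in zip(["constant", "ideology", "not known", "control"], fit_0, fit_1):
    print(f"{name:10s} fill=0: {b0: .6f}   fill=1: {b1: .6f}")
```

In this sketch the constant, ideology, and control coefficients match across the two fits, and the "not known" coefficient with an imputed value of 1 equals the "not known" coefficient with an imputed value of 0 minus the ideology coefficient.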

---

I think that it might have been from Schaffner et al 2018 that I picked up the use of categories as a way to not lose observations from an analysis merely because the observation has a missing value for a predictor. For example, if a respondent doesn't indicate their income, then income can be coded as a series of categories with non-response as its own category (such as income $20,000 or lower; income $20,001 to $40,000; ...; income $200,001 and higher; and income missing). Thus, in a regression with this categorical predictor for income, observations are not lost merely because they lack a substantive value for income. Another nice feature of this categorical approach is that it permits nonlinear associations, in which, for example, the association of income with the outcome might level off at higher categories.
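As a rough illustration of that categorical coding, here is a small Python sketch. The bin edges, labels, and function names are hypothetical and chosen only to keep the example short; they are not the categories from Schaffner et al 2018 or from any particular survey.

```python
# Hypothetical bin upper bounds and labels, for illustration only.
UPPER_BOUNDS = [20_000, 40_000, 60_000]
LABELS = ["20000 or lower", "20001 to 40000", "40001 to 60000", "60001 and higher", "missing"]


def income_category(income):
    """Map an income (or None for non-response) to a category label."""
    if income is None:
        return "missing"
    for bound, label in zip(UPPER_BOUNDS, LABELS):
        if income <= bound:
            return label
    return LABELS[len(UPPER_BOUNDS)]


def dummy_rows(incomes, reference=LABELS[0]):
    """Return one row of 0/1 indicators per respondent, omitting the reference category."""
    levels = [label for label in LABELS if label != reference]
    return [
        [1 if income_category(value) == level else 0 for level in levels]
        for value in incomes
    ]


incomes = [15_000, 38_000, None, 250_000, 20_000]
for value, row in zip(incomes, dummy_rows(incomes)):
    print(value, income_category(value), row)
```

Respondents with no reported income get a 1 in the "missing" column instead of being dropped, so every row can enter the regression.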

But dealing with missing values on a control by using categorical predictors can produce long regression output, with, for example, fifteen categories of income, eight categories of ideology, ten categories of age, etc. The Filindra et al 2022 method seems like a reasonable shortcut, as long as it's understood that results for the "not known" predictors depend on the choice of imputed value. But these "not known" predictors aren't common in the research that I read, so maybe there is another flaw in that method that I'm not aware of.

---

NOTE

1. I needed to edit line 1977 in the Filindra et al 2022 code to:

recode V162345 V162346 V162347 V162348 V162349 V162350 V162351 V162352 (-9/-5=.)
