Research involves a lot of decisions, which in turn provide a lot of opportunities for research to be incorrect or substandard: mistakes in recoding a variable, use of an improper statistical method, or ignorance of unintuitive elements of statistical software, such as how Stata treats missing values in logical expressions.
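As a quick illustration of that last point: Stata treats a missing numeric value as larger than any nonmissing number, so a logical expression such as x > 100 is true for observations with a missing x unless missingness is handled explicitly. A minimal sketch with made-up data:

* Stata treats numeric missing as larger than any nonmissing number
display 5 < .    // displays 1: the comparison is true

clear
set obs 3
generate x = 50 in 1
replace x = 150 in 2
* x is left missing in observation 3
generate flag_naive = (x > 100)                   // flags observation 2 AND the missing observation 3
generate flag_safer = (x > 100) if !missing(x)    // flags observation 2 only; observation 3 stays missing
list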

Peer and editorial review provides opportunities to catch flaws in research, but some journals that publish political science don't seem to be consistently doing a good enough job at this. Below, I'll provide a few examples that I happened upon recently and then discuss potential ways to help address this.

---

Feinberg et al 2022

PS: Political Science & Politics published Feinberg et al 2022 "The Trump Effect: How 2016 campaign rallies explain spikes in hate", which claims that:

Specifically, we established that the words of Donald Trump, as measured by the occurrence and location of his campaign rallies, significantly increased the level of hateful actions directed toward marginalized groups in the counties where his rallies were held.

After Feinberg et al published a similar claim in the Monkey Cage in 2019, I asked the lead author about the results when the predictor of hosting a Trump rally is replaced with a predictor of hosting a Hillary Clinton rally.

I didn't get a response from Ayal Feinberg, but Lilley and Wheaton 2019 reported that the point estimate for the effect on the count of hate-motivated events is larger for hosting a Hillary Clinton rally than for hosting a Donald Trump rally. Remarkably, the Feinberg et al 2022 PS article does not address the Lilley and Wheaton 2019 claim about Clinton rallies, even though the supplemental file for the Feinberg et al 2022 PS article discusses a different criticism from Lilley and Wheaton 2019.

The Clinton rally counterfactual is an obvious way to assess the claim that something about Trump increased hate events. Even if the reviewers and editors for PS didn't think to ask about the Clinton rally counterfactual, that counterfactual analysis appears in the Reason magazine criticism that Feinberg et al 2022 discusses in its supplemental files, so the analysis was presumably available to the reviewers and editors.

Will May has published a PubPeer comment discussing other flaws of the Feinberg et al 2022 PS article.

---

Christley 2021

The impossible "p < .000" appears eight times in Christley 2021 "Traditional gender attitudes, nativism, and support for the Radical Right", published in Politics & Gender.

Moreover, Christley 2021 indicates that (emphasis added):

It is also worth mentioning that in these data, respondent sex does not moderate the relationship between gender attitudes and radical right support. In the full model (Appendix B, Table B1), respondent sex is correlated with a higher likelihood of supporting the radical right. However, this finding disappears when respondent sex is interacted with the gender attitudes scale (Table B2). Although the average marginal effect of gender attitudes on support is 1.4 percentage points higher for men (7.3) than it is for women (5.9), there is no significant difference between the two (Figure 5).

Table B2 of Christley 2021 has 0.64 and 0.250 for the logit coefficient and standard error for the "Male*Gender Scale" interaction term, with no statistical significance asterisks; the 0.64 is the only table estimate not reported to three decimal places, so it's not clear to me from the table whether the asterisks are missing or whether the estimate should be, say, 0.064 instead of 0.64. The sample size for the Table B2 regression is 19,587, so a statistically significant 1.4-percentage-point difference isn't obviously out of the question, from what I can tell.
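For what it's worth, a naive Wald check from the reported numbers (ignoring any adjustments in the published model) gives very different answers for the two possibilities:

* z and two-sided p if the interaction coefficient is 0.64 with SE 0.250
display "z: " 0.64/0.250
display "p: " 2*(1 - normal(0.64/0.250))     // about 0.01, which would normally earn asterisks
* z and two-sided p if the coefficient is instead 0.064
display "z: " 0.064/0.250
display "p: " 2*(1 - normal(0.064/0.250))    // about 0.80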

---

Hua and Jamieson 2022

Politics, Groups, and Identities published Hua and Jamieson 2022 "Whose lives matter? Race, public opinion, and military conflict".

Participants were assigned to a control condition with no treatment, to a placebo condition with an article about baseball gloves, or to an article about a U.S. service member being killed in combat. The experimental manipulation was the name of the service member, intended to signal race: Connor Miller, Tyrone Washington, Javier Juarez, Duc Nguyen, and Misbah Ul-Haq.

Inferences from Hua and Jamieson 2022 include:

When faced with a decision about whether to escalate a conflict that would potentially risk even more US casualties, our findings suggest that participants are more supportive of escalation when the casualties are of Pakistani and African American soldiers than they are when the deaths are soldiers from other racial–ethnic groups.

But, from what I can tell, this inference of participants being "more supportive" depending on the race of the casualties is based on differences in statistical significance when each racial condition is compared to the control condition. Figure 5 indicates that, for this escalation outcome, the confidence intervals for the racial conditions overlap enough to prevent a confident claim of "more supportive" when comparing the racial conditions to each other.

Figure 5 seems to plot estimates from the first column in Table C.7. The largest racial gap in estimates is between the Duc Nguyen condition (0.196 estimate and 0.133 standard error) and the Tyrone Washington condition (0.348 estimate and 0.137 standard error). So this difference in means is 0.152, and I don't think that there is sufficient evidence to infer that these estimates differ from each other. 83.4% confidence intervals would be about [0.01, 0.38] and [0.15, 0.54].
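Here is the back-of-the-envelope arithmetic for that comparison, treating the two estimates as independent and using a normal approximation (a sketch from the reported numbers, not a reanalysis of the data):

* Difference between the Tyrone Washington and Duc Nguyen estimates
display "difference: " 0.348 - 0.196
display "SE of the difference: " sqrt(0.137^2 + 0.133^2)
display "z: " (0.348 - 0.196) / sqrt(0.137^2 + 0.133^2)
display "two-sided p: " 2*(1 - normal((0.348 - 0.196) / sqrt(0.137^2 + 0.133^2)))    // about 0.43

* Approximate 83.4% confidence intervals (multiplier = invnormal(1 - 0.166/2), about 1.39)
display "Duc Nguyen: " 0.196 - invnormal(0.917)*0.133 " to " 0.196 + invnormal(0.917)*0.133
display "Tyrone Washington: " 0.348 - invnormal(0.917)*0.137 " to " 0.348 + invnormal(0.917)*0.137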

---

Walker et al 2022

PS: Political Science & Politics published Walker et al 2022 "Choosing reviewers: Predictors of undergraduate manuscript evaluations", which, for the regression predicting reviewer ratings of manuscript originality, interpreted a statistically significant -0.288 OLS coefficient for "White" as indicating that "nonwhite reviewers gave significantly higher originality ratings than white reviewers". But the table note indicates that the "originality" outcome variable is coded 1 for yes, 2 for maybe, and 3 for no, so that the "higher" originality ratings actually indicate lower ratings of originality.

Moreover, Walker et al 2022 claims that:

There is no empirical linkage between reviewers' year in school and major and their assessment of originality.

But Table 2 indicates p<0.01 evidence that reviewer major associates with assessments of originality.

And the "a", "b", and "c" notes for Table 2 are incorrectly matched to the descriptions; for example, the "b" note about the coding of the originality outcome is attached to the other outcome.

The "higher originality ratings" error has been corrected, but not the other errors. I mentioned only the "higher" error in this tweet, so maybe that explains that. It'll be interesting to see if PS issues anything like a corrigendum about "Trump rally / hate" Feinberg et al 2022, given that the flaw in Feinberg et al 2022 seems a lot more important.

---

Fattore et al 2022

Social Science Quarterly published Fattore et al 2022 "'Post-election stress disorder?' Examining the increased stress of sexual harassment survivors after the 2016 election". For a sample of women participants, the analysis uses reported experience being sexually harassed to predict a dichotomous measure of stress due to the 2016 election, net of controls.

Fattore et al 2022 Table 1 reports the standard deviation for a presumably multilevel categorical race variable that ranges from 0 to 4 and for a presumably multilevel categorical marital status variable that ranges from 0 to 2. Fattore et al 2022 elsewhere indicates that the race variable was coded 0 for white and 1 for minority, but indicates that the marital status variable is coded 0 for single, 1 for married/coupled, and 2 for separated/divorced/widowed, so I'm not sure how to interpret regression results for the marital status predictor.

And Fattore et al 2022 has this passage:

With 95 percent confidence, the sample mean for women who experienced sexual harassment is between 0.554 and 0.559, based on 228 observations. Since the dependent variable is dichotomous, the probability of a survivor experiencing increased stress symptoms in the post-election period is almost certain.

I'm not sure how to interpret that passage: Is the 95% confidence interval that thin (0.554, 0.559) based on 228 observations? Is the mean estimate of about 0.554 to 0.559 being interpreted as almost certain? Here is the paragraph that that passage is from.
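For reference, a simple binomial approximation (ignoring weights and any design effects) suggests that a 95% confidence interval for a proportion of about 0.556 based on 228 observations should be far wider than (0.554, 0.559):

* Normal-approximation 95% CI for a proportion of 0.556 with n = 228
display "SE: " sqrt(0.556*(1 - 0.556)/228)
display "95% CI: " 0.556 - 1.96*sqrt(0.556*(1 - 0.556)/228) " to " 0.556 + 1.96*sqrt(0.556*(1 - 0.556)/228)
* roughly 0.49 to 0.62, much wider than the reported interval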

---

Hansen and Dolan 2022

Political Behavior published Hansen and Dolan 2022 "Cross‑pressures on political attitudes: Gender, party, and the #MeToo movement in the United States".

Table 1 of Hansen and Dolan 2022 reported results from a regression limited to 694 Republican respondents in a 2018 ANES survey, which indicated that the predicted feeling thermometer rating about the #MeToo movement was 5.44 units higher among women than among men, net of controls, with a corresponding standard error of 2.31 and a statistical significance asterisk. However, Hansen and Dolan 2022 interpreted this to not provide sufficient evidence of a gender gap:

In 2018, we see evidence that women Democrats are more supportive of #MeToo than their male co-partisans. However, there was no significant gender gap among Republicans, which could signal that both women and men Republican identifiers were moved to stand with their party on this issue in the aftermath of the Kavanaugh hearings.

Hansen and Dolan 2022 indicated that this inference of no significant gender gap is because, in Figure 1, the relevant 95% confidence interval for Republican men overlapped with the corresponding 95% confidence interval for Republican women.

Footnote 9 of Hansen and Dolan 2022 noted that assessing statistical significance using overlap of 95% confidence intervals is a "more rigorous standard" than using a p-value threshold of p=0.05 in a regression model. But Footnote 9 also claimed that "Research suggests that using non-overlapping 95% confidence intervals is equivalent to using a p < .06 standard in the regression model (Schenker & Gentleman, 2001)", and I don't think that this "p < .06" claim is correct, or at least I think that it is misleading.

My Stata analysis of the data for Hansen and Dolan 2022 indicated that the p-value for the gender gap among Republicans on this item is p=0.019, which is about what would be expected given data in Table 1 of a t-statistic of 5.44/2.31 and more than 600 degrees of freedom. From what I can tell, the key evidence from Schenker and Gentleman 2001 is Figure 3, which indicates that the probability of a Type 1 error using the overlap method is about equivalent to p=0.06 only when the ratio of the two standard errors is about 20 or higher.
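The arithmetic for the Table 1 check mentioned above (a rough version that ignores the survey adjustments):

* t-statistic and two-sided p-value from the reported coefficient and standard error,
* using roughly 600 residual degrees of freedom
display "t: " 5.44/2.31
display "two-sided p: " 2*ttail(600, 5.44/2.31)    // about 0.019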

This discrepancy in inferences might have been avoided if 83.4% confidence intervals were more commonly taught and recommended by editors and reviewers, for visualizations in which the key comparison is between two estimates.
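The logic behind the 83.4% level, for two independent estimates with roughly equal standard errors: each interval uses a multiplier of about 1.96/sqrt(2), so non-overlap corresponds to the difference between the estimates exceeding about 1.96 times the standard error of the difference. A sketch of the arithmetic:

* Multiplier for 83.4% intervals and the implied confidence level
display "multiplier: " 1.96/sqrt(2)                      // about 1.39
display "confidence level: " 2*normal(1.96/sqrt(2)) - 1  // about 0.834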

---

Footnote 10 of Hansen and Dolan 2022 states:

While Fig. 1 appears to show that Republicans have become more positive towards #MeToo in 2020 when compared to 2018, the confidence bounds overlap when comparing the 2 years.

I'm not sure what that refers to. Figure 1 of Hansen and Dolan 2022 reports estimates for Republican men in 2018, Republican women in 2018, Republican men in 2020, and Republican women in 2020, with point estimates increasing in that order. Neither 95% confidence interval for Republicans in 2020 overlaps with either 95% confidence interval for Republicans in 2018.

---

Other potential errors in Hansen and Dolan 2022:

[1] The code for the 2020 analysis uses V200010a, which is a weight variable for the pre-election survey, even though the key outcome variable (V202183) was on the post-election survey.

[2] Appendix B Table 3 indicates that 47.3% of the 2018 sample was Republican and 35.3% was Democrat, but the sample sizes for the 2018 analysis in Table 1 are 694 for the Republican only analysis and 1001 for the Democrat only analysis.

[3] Hansen and Dolan 2022 refers multiple times to predictions of feeling thermometer ratings as predicted probabilities, and notes for Tables 1 and 2 indicate that the statistical significance asterisk is for "statistical significance at p > 0.05".

---

Conclusion

I sometimes make mistakes, such as misspelling an author's name in a prior post. In 2017, I preregistered an analysis that used overlap of 95% confidence intervals to assess evidence for the difference between estimates, instead of a preferable direct test for a difference. So some of the flaws discussed above are understandable. But I'm not sure why all of these flaws got past review at respectable journals.

Some of the flaws discussed above are, I think, substantial, such as the political bias reflected in Feinberg et al 2022 not reporting a parallel analysis for Hillary Clinton rallies, especially with the Trump rally result being prominent enough to get a fact check from PolitiFact in 2019. Some of the flaws discussed above are trivial, such as "p < .000". But even trivial flaws might justifiably be interpreted as reflecting a review process that is less rigorous than it should be.

---

I think that peer review is valuable at least for its potential to correct errors in analyses and to get researchers to report results that they otherwise wouldn't report, such as a robustness check suggested by a reviewer that undercuts the manuscript's claims. But peer review as currently practiced doesn't seem to do that well enough.

Part of the problem might be that peer review at a lot of political science journals combines [1] assessment of the contribution of the manuscript and [2] assessment of the quality of the analyses, often for manuscripts that are likely to be rejected. Some journals might benefit from having a (or having another) "final boss" who carefully reads conditionally accepted manuscripts only for assessment [2], to catch minor "p < .000" types of flaws, to catch more important "no Clinton rally analysis" types of flaws, and to suggest robustness checks and additional analyses.

But even better might be opening peer review to volunteers, who collectively could plausibly do a better job than a final boss could do alone. I discussed the peer review volunteer idea in this symposium entry. The idea isn't original to me; for example, Meta-Psychology offers open peer review. The modal number of peer review volunteers for a publication might be zero, but there is a good chance that I would have raised the "no Clinton rally analysis" criticism had PS posted a conditionally accepted version of Feinberg et al 2022.

---

Another potentially good idea would be for journals or an organization such as APSA to post at least a small set of generally useful advice, such as reporting results for a test for differences between estimates if the manuscript suggests a difference between estimates. More specific advice could be posted by topic, such as, for count analyses, advice about predicting counts in which the opportunity varies by observation: Lilley and Wheaton 2019 discussed this page, but I think that this page has an explanation that is easier to understand.

---

NOTES

1. It might be debatable whether this is a flaw per se, but Long 2022 "White identity, Donald Trump, and the mobilization of extremism" reported correlational results from a survey experiment but, from what I can tell, didn't indicate whether any outcomes differed by treatment.

2. Data for Hansen and Dolan 2022. Stata code for my analysis:

desc V200010a V202183

svyset [pw=weight]

svy: reg metoo education age Gender race income ideology2 interest media if partyid2=="Republican"

svy: mean metoo if partyid2=="Republican" & women==1

3. The journal Psychological Science is now publishing peer reviews. Peer reviews are also available for the journal Meta-Psychology.

4. Regarding the prior post about Lacina 2022 "Nearly all NFL head coaches are White. What are the odds?", Bethany Lacina discussed that with me on Twitter. I have published an update at that post.

5. I emailed or tweeted to at least some authors of the aforementioned publications discussing the planned comments or indicating at least some of the criticism. I received some feedback from one of the authors, but the author didn't indicate that I had permission to acknowledge the author.


I posted earlier about Jardina and Piston 2021 "The Effects of Dehumanizing Attitudes about Black People on Whites' Voting Decisions".

Jardina and Piston 2021 limited the analysis to White respondents, even though the Qualtrics_BJPS dataset at the Dataverse page for Jardina and Piston 2021 contained observations for non-White respondents. The Qualtrics_BJPS dataset had variables such as aofmanpic_1 and aofmanpic_6, and I didn't know which of these variables corresponded to which target groups.

My post indicated a plan to follow up if I got sufficient data to analyze responses from non-White participants. Replication code has now been posted at version 2 of the Dataverse page for Jardina and Piston 2021, so this is that planned post.

---

Version 2 of the Jardina and Piston 2021 Dataverse page has a Qualtrics dataset (Qualtrics_2016_BJPS_raw) that differs from the version 1 Qualtrics dataset (Qualtrics_BJPS): for example, the version 2 Qualtrics dataset doesn't contain data for non-White respondents, doesn't contain respondent ID variables V1 and uid, and doesn't contain variables such as aofmanpic_2.

I ran the Jardina and Piston 2021 "aofman" replication code on the Qualtrics_BJPS dataset to get a variable named "aofmanwb". Applied to the version 2 dataset, the same code produced the output for the Trump analysis in Table 1 of Jardina and Piston 2021, so this aofmanwb variable is the "Ascent of man" dehumanization measure, coded so that rating Blacks as equally evolved as Whites is 0.5, rating Whites as more evolved than Blacks runs from just above 0.5 to 1, and rating Blacks as more evolved than Whites runs from just under 0.5 down to zero.

The version 2 replication code for Jardina and Piston 2021 suggests that aofmanpic_1 is for rating how evolved Blacks are and aofmanpic_4 is for rating how evolved Whites are. So unless these variable names were changed between versions of the dataset, the version 2 replication code should produce the "Ascent of man" dehumanization measure when applied to the version 1 dataset, which is still available at the Jardina and Piston 2021 Dataverse page.

To check, I ran commands such as "reg aofmanwb ib4.ideology if race==1 & latino==2" in both datasets, and got similar but not exact results, with the difference presumably due to the differences between datasets discussed in the notes below.

---

The version 1 Qualtrics dataset didn't contain a variable that I thought was a weight variable, so my analyses below are unweighted.

In the version 1 dataset, the medians of aofmanwb were 0.50 among non-Latino Whites in the sample (N=450), 0.50 among non-Latino Blacks in the sample (N=98), and 0.50 among respondents coded Asian, Native American, or Other (N=125). Respective means were 0.53, 0.48, and 0.51.

Figure 1 of Jardina and Piston 2021 mentions the use of sliders to select responses to the items about how evolved target groups are, and I think that some unequal ratings might be due to respondent imprecision instead of an intent to dehumanize, such as if a respondent intended to select 85 for each group in a pair, but moved the slider to 85 for one group and 84 for the other group, and then figured that this was close enough. So I'll report percentages below using a strict definition that counts anything differing from 0.5 on the 0-to-1 scale as dehumanization, but I'll also report percentages with a tolerance for potential unintentional dehumanization.

---

For the strict coding of dehumanization, I recoded aofmanwb into a variable that had levels for [1] rating Blacks as more evolved than Whites, [2] equal ratings of how evolved Blacks and Whites are, and [3] rating Whites as more evolved than Blacks.
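A minimal sketch of that recode (aofman3cat is a name that I'm using here for illustration; the last condition excludes missing values because Stata treats missing as larger than any number):

* Strict coding: any deviation from 0.5 counts as dehumanization
generate aofman3cat = .
replace aofman3cat = 1 if aofmanwb < 0.5                       // Blacks rated more evolved than Whites
replace aofman3cat = 2 if aofmanwb == 0.5                      // equal ratings
replace aofman3cat = 3 if aofmanwb > 0.5 & !missing(aofmanwb)  // Whites rated more evolved than Blacks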

In the version 1 dataset, 13% of non-Latino Whites in the sample rated Blacks more evolved than Whites, with an 83.4% confidence interval of [11%, 16%], and 39% rated Whites more evolved than Blacks [36%, 43%]. 42% of non-Latino Blacks in the sample rated Blacks more evolved than Whites [35%, 49%], and 23% rated Whites more evolved than Blacks [18%, 30%]. 19% of respondents not coded Black or White in the sample rated Blacks more evolved than Whites [15%, 25%], and 38% rated Whites more evolved than Blacks [32%, 45%].

---

For the non-strict coding of dehumanization, I recoded aofmanwb into a variable that had levels that included [1] rating Blacks at least 3 units more evolved than Whites on a 0-to-100 scale, and [5] rating Whites at least 3 units more evolved than Blacks on a 0-to-100 scale.

In the version 1 dataset, 8% of non-Latino Whites in the sample rated Blacks more evolved than Whites [7%, 10%], and 30% rated Whites more evolved than Blacks [27%, 34%]. 34% of non-Latino Blacks in the sample rated Blacks more evolved than Whites [27%, 41%], and 21% rated Whites more evolved than Blacks [16%, 28%]. 13% of respondents not coded Black or White in the sample rated Blacks more evolved than Whites [9%, 18%], and 31% rated Whites more evolved than Blacks [26%, 37%].

---

NOTES

1. Variable labels in the Qualtrics dataset ("male" coded 0 for "Male" and 1 for "Female") and associated replication commands suggest that Jardina and Piston 2021 might have reported results for a "Female" variable coded 1 for male and 0 for female, which would explain why Table 1 Model 1 of Jardina and Piston 2021 indicates that females were predicted to have higher ratings about Trump net of controls at p<0.01 compared to males, even though the statistically significant coefficients for "Female" in the analyses from other datasets in Jardina and Piston 2021 are negative when predicting positive outcomes for Trump.

The "Female" variable in Jardina and Piston 2021 Table 1 Model 1 is right above the statistically significant coefficient and standard error for age, of "0.00" and "0.00". The table note indicates that "All variables are transformed onto a 0 to 1 scale.", but that isn't correct for the age predictor, which ranges from 19 to 86.

2. I produced a plot like Jardina and Piston 2021 Figure 3, but with a range from most dehumanization of Whites relative to Blacks to most dehumanization of Blacks relative to Whites. The 95% confidence interval for Trump ratings at most dehumanization of Whites relative to Blacks did not overlap with the 95% confidence interval for Trump ratings at no / equal dehumanization of Whites and Blacks. But, as indicated in my later analyses, that might merely be due to the Jardina and Piston 2021 use of aofmanwb as a continuous predictor: the aforementioned inference wasn't supported using 83.4% confidence intervals when the aofmanwb predictor was trichotomized as described above.

3. Regarding differences between Qualtrics datasets posted to the Jardina and Piston 2021 Dataverse page, the Stata command "tab race latino, mi" returns 980 respondents who selected "White" for the race item and "No" for the Latino item in the version 1 Qualtrics dataset, but returns 992 respondents who selected "White" for the race item and "No" for the Latino item in the version 2 Qualtrics dataset.

Both version 1 and version 2 of the Qualtrics datasets contain exactly one observation with a 1949 birth year and a state of Missouri. In both datasets, this observation has codes that indicate a White non-Latino neither-liberal-nor-conservative male Democrat with some college but no degree who has an income of $35,000 to $39,999. That observation has values of 100 for aofmanvinc_1 and 100 for aofmanvinc_4 in the version 2 Qualtrics dataset, but, in the version 1 Qualtrics dataset, that observation has no numeric values for aofmanvinc_1, aofmanvinc_4, or any other variable starting with "aofman".

I haven't yet received an explanation about this from Jardina and/or Piston.

4. Below is a description of more checking about whether aofmanwb is correctly interpreted above, given that the Dataverse page for Jardina and Piston 2021 doesn't have a codebook.

I dropped all cases in the original dataset not coded race==1 and latino==2. Case 7 in the version 2 dataset is from New York, born in 1979, has an aofmanpic_1 of 84, and an aofmanpic_4 of 92; this matches Case 7 in the version 1 dataset when dropping aforementioned cases. Case 21 in the version 1 dataset is from South Carolina, born in 1966, has an aofmanvinc_1 of 79, and an aofmanvinc_4 of 75; this matches Case 21 in the version 2 dataset when dropping aforementioned cases. Case 951 in the version 1 dataset is from Georgia, born in 1992, has an aofmannopi_1 of 77, and an aofmannopi_4 of 65; this matches Case *964* in the version 2 dataset when dropping aforementioned cases.

5. From what I can tell, for anyone interested in analyzing the data, thermind_2 in the version 2 dataset is the feeling thermometer about Donald Trump, and thermind_4 is the feeling thermometer about Barack Obama.

6. Stata code and output from my analysis.


Political Behavior recently published Filindra et al 2022 "Beyond Performance: Racial Prejudice and Whites' Mistrust of Government". Hypothesis 1 is the expectation that "...racial prejudice (anti-Black stereotypes) is a negative and significant predictor of trust in government".

Filindra et al 2022 limits the analysis to White respondents and measures anti-Black stereotypes by combining responses to available items in which respondents rate Blacks on seven-point scales, ranging from hardworking to lazy, and/or from peaceful to violent, and/or from intelligent to unintelligent. The data include items about how respondents rate Whites on these scales, but Filindra et al 2022 didn't use these responses to measure anti-Black stereotyping.

But information about how respondents rate Whites is useful for measuring anti-Black stereotyping. For example, a respondent who rates all racial groups at the midpoint of a stereotype scale hasn't indicated an anti-Black stereotype; this respondent's rating about Blacks doesn't differ from the respondent's rating about other racial groups, and it's not clear to me why rating Blacks equal to all other racial groups would be a moderate amount of "prejudice" in this case.

But this respondent who rated all racial groups equally on the stereotype scales nonetheless falls halfway along the Filindra et al 2022 measure of "negative Black stereotypes", in the same location as a respondent who rated Blacks at the midpoint of the scale and rated all other racial groups at the most positive end of the scale.

---

I think that this flawed measurement means that more analyses need to be conducted to know whether the key Filindra et al 2022 finding is merely due to the flawed measure of racial prejudice. Moreover, I think that more analyses need to be conducted to know whether Filindra et al 2022 overlooked evidence of the effect of prejudice against other racial groups.

Filindra et al 2022 didn't indicate whether their results held when using a measure of anti-Black stereotypes that placed respondents who rated all racial groups equally into a different category than respondents who rated Blacks less positively than all other racial groups and a different category than respondents who rated Blacks more positively than all other racial groups. Filindra et al 2022 didn't even report results when their measure of anti-White stereotypes was included in the regressions estimating the effect of anti-Black stereotypes.

A better review process might have produced a Filindra et al 2022 that resolved questions such as: Is the key Filindra et al 2022 finding merely because respondents who don't trust the government rate *all* groups relatively low on stereotype scales? Is the key finding because anti-Black stereotypes and anti-White stereotypes and anti-Hispanic stereotypes and anti-Asian stereotypes *each* reduce trust in government? Or are anti-Black stereotypes the *only* racial stereotypes that reduce trust in government?

Even if anti-Black stereotypes among Whites are the most important combination of racial prejudice and respondent demographics, other combinations of racial stereotype and respondent demographics are important enough to report on and can help readers better understand racial attitudes and their consequences.

---

NOTES

1. Filindra et al 2022 did note that:

Finally, another important consideration is the possibility that other outgroup attitudes or outgroup-related policy preferences may also have an effect on public trust.

That's sort of close to addressing some of the alternate explanations that I suggested, but the Filindra et al 2022 measure for this is a measure about immigration *policy* and not, say, the measures of stereotypes about Hispanics and about Asians that are included in the data.

2. Filindra et al 2022 suggested that:

Future research should focus on the role of attitudes towards immigrants and other racial groups—such as Latinos— and ethnocentrism more broadly in shaping white attitudes toward government.

But it's not clear to me why such analyses aren't included in Filindra et al 2022.

Maybe the expectation is that another publication should report results that include the measures of anti-Hispanic stereotypes and anti-Asian stereotypes in the ANES data. And another publication should report results that include the measures of anti-White stereotypes in the ANES data. And another publication should report results that include or focus on respondents in the ANES data who aren't White. But including all this in Filindra et al 2022 or its supplemental information would be more efficient and could produce a better understanding of political attitudes.

3. Filindra et al 2022 indicated that:

All variables in the models are rescaled on 0–1 scales consistent with the nature of the original variable. This allows us to conceptualize the coefficients as maximum effects and consequently compare the size of coefficients across models.

Scaling all predictors to range from 0 to 1 means that comparison of coefficients likely produces better inferences than if the predictors were on different scales, but differences in 0-to-1 coefficients can also be due to differences in the quality of the measurement of the underlying concept, as discussed in this prior post.

4. Filindra et al 2022 justified not using a differenced stereotype measure, citing evidence such as (from footnote 2):

Factor analysis of the Black and white stereotype items in the ANES confirms that they do not fall on a single dimension.

The reported factor analysis was on ANES 2020 data and included a measure of "lazy" stereotypes about Blacks, a measure of "violent" stereotypes about Blacks, a feeling thermometer about Blacks, a measure of "lazy" stereotypes about Whites, a measure of "violent" stereotypes about Whites, and a feeling thermometer about Whites.[*] But a "differenced" stereotype measure shouldn't be constructed by combining measures like that, as if the measure of "lazy" stereotypes about Blacks is independent of the measure of "lazy" stereotypes about Whites.

A "differenced" stereotype measure could be constructed by, for example, subtracting the "lazy" rating about Whites from the "lazy" rating about Blacks, subtracting the "violent" rating about Whites from the "violent" rating about Blacks, and then summing these two differences. That measure could help address the alternate explanation that the estimated effect for rating Blacks low is because respondents who rate Blacks low also rate all other groups low. That measure could also help address the concern that using only a measure of stereotypes about Blacks underestimates the effect of these stereotypes.

Another potential coding is a categorical measure, coded 1 for rating Blacks lower than Whites on all stereotype measures, 2 for rating Blacks equal to Whites on all stereotype measures, 3 for rating Blacks higher than Whites on all stereotype measures, and 4 for a residual category. The effect of anti-Black stereotypes could be estimated as the difference net of controls between category 1 and category 2.

Filindra et al 2022 provided justifications other than the factor analysis for not using a differenced stereotype measure, but, even if you agree that stereotype scale ratings about Blacks should not be combined with stereotype scale ratings about Whites, the Filindra et al 2022 arguments don't preclude including their measure of anti-White prejudice as a separate predictor in the analyses.

[*] I'm not sure why the feeling thermometer responses were included in a factor analysis intended to justify not combining stereotype scale responses.

5. I think that labels for the panels of Filindra et al 2022 Figure 1 and the corresponding discussion in the text are backwards: the label for each plot in Figure 1a appears to be "Negative Black Stereotypes", but the Figure 1a label is "Public Trust"; the label for each plot in Figure 1b appears to be "Level of Trust in Govt", but the Figure 1b label is "Anti-Black stereotypes".

My histogram of the Filindra et al 2022 measure of anti-Black stereotypes for the ANES 2020 Time Series Study looks like their 2020 plot in Figure 1a.

6. I'm not sure what the second sentence is supposed to mean, from this part of the Filindra et al 2022 conclusion:

Our results suggest that white Americans' beliefs about the trustworthiness of the federal government have become linked with their racial attitudes. The study shows that even when racial policy preferences are weakly linked to trust in government racial prejudice does not. Analyses of eight surveys...

7. Data source for my analysis: American National Election Studies. 2021. ANES 2020 Time Series Study Full Release [dataset and documentation]. July 19, 2021 version. www.electionstudies.org.


I posted to OSF data, code, and a report for my unpublished "Public perceptions of human evolution as explanations for racial group differences" [sic] project that was from a survey that YouGov ran for me in 2017, using funds from Illinois State University New Faculty Start-up Support and the Illinois State University College of Arts and Sciences. The report describes results from preregistered analyses, but below I'll highlight key results.

---

The key item asked participants whether God's design and/or evolution, or neither, helped cause a particular racial difference:

Some racial groups have [...] compared to other racial groups. Select ALL of the reasons below that you think help cause this difference:
□ Differences in how God designed these racial groups
□ Genetic differences that evolved between these racial groups
○ None of the above

Participants were randomly assigned to receive one racial difference in the part of the item marked [...] above. Below are the racial differences asked about, along with the percentage assigned to that item who selected only the "evolved" response option:

70% a greater risk for certain diseases
55% darker skin on average
54% more Olympic-level runners
49% different skull shapes on average
26% higher violent crime rates on average
24% higher math test scores on average
21% lower math test scores on average
18% lower violent crime rates on average

---

Another item on the survey (discussed at this post) asked about evolution. The reports that I posted for these items removed all or a lot of the discussion and citation of literature from the manuscripts that I had submitted to journals but that were rejected, in case I can use that material for a later manuscript.


I posted to OSF data, code, and a report for my unpublished "Public Perceptions of the Potential for Human Evolution" project that was from a survey that YouGov ran for me in 2017, using funds from Illinois State University New Faculty Start-up Support and the Illinois State University College of Arts and Sciences. The report describes results from preregistered analyses, but below I'll highlight key results.

---

"Textbook" evolution

About half of participants received an item that asked about what I think might be reasonably described as a textbook description of evolution, in which one group is more reproductively successful than another group. The experimental manipulations involved whether the more successful group had high intelligence or low intelligence and whether the response options mentioned or did not mention "evolved".

Here is the "high intelligence" item, with square brackets indicating the "evolved" manipulation:

If, in the future, over thousands of years, people with high intelligence have more children and grandchildren than people with low intelligence have, which of the following would be most likely to happen?

  • The average intelligence of humans would [increase/evolve to be higher].
  • The average intelligence of humans would [remain the same/not evolve to be higher or lower].
  • The average intelligence of humans would [decrease/evolve to be lower].

Percentages from analyses weighted to reflect U.S. population percentages were 55% for the "increase" option (N=245) and 49% for the "evolve to be higher" option (N=260), with the residual category including other responses and non-responses. So about half of participants selected what I think is the intuitive response.

Here is the "low intelligence" item:

If, in the future, over thousands of years, people with low intelligence have more children and grandchildren than people with high intelligence have, which of the following would be most likely to happen?

  • The average intelligence of humans would [increase/evolve to be higher].
  • The average intelligence of humans would [remain the same/not evolve to be higher or lower].
  • The average intelligence of humans would [decrease/evolve to be lower].

Percentages from analyses weighted to reflect U.S. population percentages were 41% for the "decrease" option (N=244) and 35% for the "evolve to be lower" option (N=244), with the residual category including other responses and non-responses.

So, compared to the "high intelligence" item, participants were less likely (p<0.05) to select what I think is the intuitive response for the "low intelligence" item.

---

Evolution due to separation into different environments

Participants not assigned to the aforementioned item received an item about whether the participant would expect differences to arise between groups separated into different environments, but the item did not include an indication of particular differences in the environments. The experimental manipulations were whether the item asked about intelligence or height and whether the response options mentioned or did not mention "evolved".

Here is the intelligence item, with square brackets indicating the "evolved" manipulation:

Imagine two groups of people. Each group has some people with high intelligence and some people with low intelligence, but the two groups have the same average intelligence as each other. If these two groups were separated from each other into different environments for tens of thousands of years and had no contact with any other people, which of the following would be more likely to happen?

  • After tens of thousands of years, the two groups would still have the same average intelligence as each other.
  • After tens of thousands of years, one group would [be/have evolved to be] more intelligent on average than the other group.

Percentages from analyses weighted to reflect U.S. population percentages were 32% for the "be more intelligent" option (N=260) and 29% for the "evolved to be more intelligent" option (N=236), with the residual category including other responses and non-responses.

Here is the height item:

Imagine two groups of people. Each group has some short people and some tall people, but the two groups have the same average height as each other. If these two groups were separated from each other into different environments for tens of thousands of years and had no contact with any other people, which of the following would be more likely to happen?

  • After tens of thousands of years, the two groups would still have the same average height as each other.
  • After tens of thousands of years, one group would [be/have evolved to be] taller on average than the other group.

Percentages from analyses weighted to reflect U.S. population percentages were 32% for the "be taller" option (N=240) and 32% for the "evolved to be taller" option (N=271), with the residual category including other responses and non-responses.

So not much variation in these percentages between the intelligence version of the item and the height version of the item. And only about 1/3 of participants indicated an expectation of intelligence or height differences arising between groups separated from each other into different environments for tens of thousands of years.

---

Another item on the survey (eventually discussed at this post) asked about evolution and racial differences. The reports that I posted for these items removed all or a lot of the discussion and citation of literature from the manuscripts that I had submitted to journals but that were rejected, in case I can use that material for a later manuscript.


The Journal of Race, Ethnicity, and Politics published Buyuker et al 2020: "Race politics research and the American presidency: thinking about white attitudes, identities and vote choice in the Trump era and beyond".

Table 2 of Buyuker et al 2020 reported regressions predicting Whites' projected and recalled vote for Donald Trump over Hillary Clinton in the 2016 U.S. presidential election, using predictors such as White identity, racial resentment, xenophobia, and sexism. Xenophobia placed into the top tier of predictors, with an estimated maximum effect of 88 percentage points going from the lowest to the highest value of the predictor, and racial resentment placed into the second tier, with an estimated maximum effect of 58 percentage points.

I was interested in whether this difference is at least partly due to how well each predictor was measured. Here are characteristics of the predictors among Whites, which indicate that xenophobia was measured at a much more granular level than racial resentment was:

RACIAL RESENTMENT
4 items
White participants fell into 22 unique levels
4% of Whites at the lowest level of racial resentment
9% of Whites at the highest level of racial resentment

XENOPHOBIA
10 items
White participants fell into 1,096 unique levels
1% of Whites at the lowest level of xenophobia
1% of Whites at the highest level of xenophobia

So it's at least plausible from the above results that xenophobia might have outperformed racial resentment merely because the measurement of xenophobia was better than the measurement of racial resentment.

---

Racial resentment was measured with four items that each had five response options, so I created a reduced xenophobia predictor using the four xenophobia items that each had exactly five response options; these items were about desired immigration levels and agreement or disagreement with statements that "Immigrants are generally good for America's economy", "America's culture is generally harmed by immigrants", and "Immigrants increase crime rates in the United States".

I re-estimated the Buyuker et al 2020 Table 2 model replacing the original xenophobia predictor with the reduced xenophobia predictor: the maximum effect for xenophobia (66 percentage points) was similar to the maximum effect for racial resentment (66 percentage points).

---

Among Whites, vote choice correlated between r=0.50 and r=0.58 with each of the four racial resentment items and between r=0.39 and r=0.56 with nine of the ten xenophobia items. The exception was the seven-point item that measured attitudes about building a wall on the U.S. border with Mexico, which correlated with vote choice at r=0.72.

Replacing the desired immigration levels item in the reduced xenophobia predictor with the border wall item produced a larger estimated maximum effect for xenophobia (85 percentage points) than for racial resentment (60 percentage points). Removing all predictors from the model except for xenophobia and racial resentment, the reduced xenophobia predictor with the border wall item still produced a larger estimated maximum effect than did racial resentment: 90 percentage points, compared to 74 percentage points.

But the larger effect for xenophobia is not completely attributable to the border wall item: using a predictor that combined the other nine xenophobia items produced a maximum effect for xenophobia (80 percentage points) that was larger than the maximum effect for racial resentment (63 percentage points).

---

I think that the main takeaway from this post is that, when comparing the estimated effects of predictors, inferences can depend on how well each predictor is measured, so such analyses should discuss the quality of the predictors. An imbalance in which participants fall into 22 levels for one predictor and 1,096 levels for another predictor seems likely to bias comparisons in favor of the more granular predictor, all else equal.

Moreover, I think that, for predicting 2016 U.S. presidential vote choice, it's at least debatable whether a xenophobia predictor should include an item about a border wall with Mexico, because including that item means that, instead of xenophobia measuring attitudes about immigrants per se, the xenophobia predictor conflates these attitudes with attitudes about a policy proposal that is very closely connected with Donald Trump.

---

It's not ideal to use regression to predict maximum effects, so I estimated a model using only the racial resentment predictor and the reduced four-item xenophobia predictor with the border wall item, but including a predictor for each level of the predictors. That model predicted failure perfectly for some levels of the predictors, so I recoded the predictors until those errors were eliminated, which involved combining the three lowest racial resentment levels (so that racial resentment ran from 2 through 16) and combining the 21st and 22nd levels of the xenophobia predictor (so that xenophobia ran from 0 through 23). In a model with only those two recoded predictors, the estimated maximum effects were 81 percentage points for xenophobia and 76 percentage points for racial resentment. Using all Buyuker et al 2020 predictors, the respective percentage points were 65 and 63.

---

I then predicted Trump/Clinton vote choice using only the 22-level racial resentment predictor and the full 1,096-level xenophobia predictor, but placing the values of the predictors into ten levels; the original scale for the predictors ran from 0 through 1, and, for the 10-level predictors, the first level for each predictor was from 0 to 0.1, a second level was from above 0.1 to 0.2, and a tenth level was from above 0.9 to 1. Using these predictors as regular predictors without "factor" notation, the gap in maximum effects was about 24 percentage points, favoring xenophobia. But using these predictors with "factor" notation, the gap favoring xenophobia fell to about 9.5 percentage points.
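One way to construct those 10-level versions (a sketch, with xenophobia standing in for the 0-to-1 predictor and xeno10 a name that I'm using for illustration; edge values might need care with floating-point precision):

* Bin a 0-to-1 predictor into 10 levels: 0 to 0.1 = 1, above 0.1 to 0.2 = 2, ..., above 0.9 to 1 = 10
generate xeno10 = ceil(xenophobia*10) if !missing(xenophobia)
replace xeno10 = 1 if xenophobia == 0    // put exact zeros into the first level
* "factor" notation then enters the model as i.xeno10 rather than as a single regular predictor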

Plots below illustrate the difference in predictions for xenophobia: the left panel uses a regular 10-level xenophobia predictor, and the right panel uses each of the 10 levels of that predictor as a separate predictor.

---

So I'm not sure that these data support the inference that xenophobia is in a higher tier than racial resentment, for predicting Trump/Clinton vote in 2016. The above analyses seem to suggest that much or all of the advantage for xenophobia over racial resentment in the Buyuker et al 2020 analyses was due to model assumptions and/or better measurement of xenophobia.

---

Another concern about Buyuker et al 2020 is with the measurement of predictors such as xenophobia. The xenophobia predictor is more accurately described as something such as attitudes about immigrants. If some participants are more favorable toward immigrants than toward natives, and if these participants locate themselves at low levels of the xenophobia predictor, then the effect of xenophilia among these participants is possibly being added to the effect of xenophobia.

Concerns are similar for predictors such as racial resentment and sexism. See here and here for evidence that low levels of similar predictors associate with bias in the opposite direction.

---

NOTES

1. Thanks to Beyza Buyuker for sending me replication materials for Buyuker et al 2020.

2. Stata code for my analyses. Stata output for my analyses.

3. ANES 2016 citations:

The American National Election Studies (ANES). 2016. ANES 2012 Time Series Study. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2016-05-17. https://doi.org/10.3886/ICPSR35157.v1.

ANES. 2017. "User's Guide and Codebook for the ANES 2016 Time Series Study". Ann Arbor, MI, and Palo Alto, CA: The University of Michigan and Stanford University.

Tagged with: , , , ,

Participants in studies reported on in Regina Bateson's 2020 Perspectives on Politics article "Strategic Discrimination" were asked to indicate the percentage of other Americans that the participant thought would not vote for a woman for president and the percentage of other Americans that the participant thought would not vote for a black person for president.

Bateson 2020 Figure 1 reports that, in the nationally representative Study 1 sample, mean participant estimates were that 47% of other Americans would not vote for a woman for president and that 42% of other Americans would not vote for a black person for president. I was interested in the distribution of responses, so I plotted in the histograms below participant estimates to these items, using the Bateson 2020 data for Study 1.

This first set of histograms is for all participants:

This second set of histograms is for only participants who passed the attention check:

---

I was also interested in estimates from participants with a graduate degree, given that so many people in political science have a graduate degree. Bateson 2020 Appendix Table 1.33 indicates that, among participants with a graduate degree, estimates were that 58.3% of other Americans would not vote for a woman for president and that 56.6% of other Americans would not vote for a black person for president.

But these estimates differ depending on whether the participant correctly responded to the attention check item: for the item about the percentage of other Americans who would not vote for a woman for president, the mean estimate was 47% [42, 52] for the 84 graduate degree participants who correctly responded to the attention check and was 68% [63, 73] for the 97 graduate degree participants who did not correctly respond to the attention check; for the item about the percentage of other Americans who would not vote for a black person for president, respective estimates were 44% [39, 49] and 67% [62, 73].

Participants who reported having a graduate degree were 20 percentage points more likely to fail the attention check than participants who did not report having a graduate degree, p<0.001.

---

These data were collected in May 2019, after Barack Obama had been elected president twice and after Hillary Clinton won the popular vote for president, and each aforementioned mean estimate seems to be a substantial overestimate of discrimination against women presidential candidates and Black presidential candidates, compared to point estimates from relevant list experiments reported in Carmines and Schmidt 2020 and compared to point estimates from list experiments and direct questions cited in Bateson 2020 footnote 8.

---

NOTES

1. Stata code for my analysis.

2. R code for the first histogram.
