In "Gendered Nationalism and the 2016 US Presidential Election: How Party, Class, and Beliefs about Masculinity Shaped Voting Behavior" (Politics & Gender 2019), Melissa Deckman and Erin Cassese reported a Table 2 model that had a sample size of 750 and a predictor for college degree that had a logit coefficient of -0.57 and a standard error of 0.28, so the associated t-statistic is -0.57/28, or about -2.0, which produces a p-value of about 0.05.

The college degree coefficient fell to -0.27 when a "gendered nationalism" predictor was added to the model, and Deckman and Cassese 2019 indicated (pp. 17-18) that:

A post hoc Wald test comparing the size of the coefficients between the two models suggests that the coefficient for college was significantly reduced by the inclusion of the mediator [F(1,678) = 7.25; p < .0072]...

From what I can tell, this means that there is stronger evidence for the -0.57 coefficient differing from the -0.27 coefficient (p<0.0072) than for the -0.57 coefficient differing from zero (p≈0.05).

This type of odd result has been noticed before.

---

For more explanation, below are commands that can be pasted into Stata to produce a similar result:

clear all
set seed 123
set obs 500
* Simulate an outcome Y and two predictors that are each correlated with Y
gen Y = runiform(0,10)
gen X1 = 0.01*(Y + runiform(0,10)^2)
gen X2 = 0.01*(Y + 2*runiform(0,10))
* OLS of Y on X1 alone, without the svy prefix
reg Y X1
* Constant weight of 1 for each observation, used to svyset the data
egen weight = fill(1 1 1 1 1)
svyset [pw=weight]
* Fit and store the model with X1 alone and the model with X1 and X2
svy: reg Y X1
estimates store X1alone
svy: reg Y X1 X2
estimates store X1paired
* Combine the stored estimates so coefficients can be compared across models
suest X1alone X1paired
* Test the X1-only coefficient against zero
lincom _b[X1alone:X1] - 0
* Display the X1 coefficient from the model that includes X2
di _b[X1paired:X1]
* Test the X1-only coefficient against the fixed number 0.4910762
lincom _b[X1alone:X1] - 0.4910762
* Test the difference between the two estimated X1 coefficients
lincom _b[X1alone:X1] - _b[X1paired:X1]

The X1 coefficient is 0.8481948 in the "reg Y X1" model and is 0.4910762 in the "reg Y X1 X2" model. Results for the "lincom _b[X1alone:X1] - _b[X1paired:X1]" command indicate that the p-value is 0.040 for the test that the 0.8481948 coefficient differs from the 0.4910762 coefficient. But results for the "lincom _b[X1alone:X1] - 0.4910762" command indicate that the p-value is 0.383 for the test that the 0.8481948 coefficient differs from the number 0.4910762.

So, from what I can tell, there is stronger evidence that the 0.8481948 X1 coefficient differs from an imprecisely estimated coefficient that has the value of 0.4910762 than that it differs from the fixed number 0.4910762.

---

As indicated in the link above, this odd result appears attributable to the variance sum law:

Variance(X-Y) = Variance(X) + Variance(Y) - 2*Covariance(X,Y)

For the test of whether the 0.8481948 X1 coefficient differs from the 0.4910762 X1 coefficient, the formula is:

Variance(X-Y) = Variance(X) + Variance(Y) - 2*Covariance(X,Y)

But for the test of whether a coefficient differs from a fixed number (such as the test of whether the -0.57 coefficient differs from zero, or whether the 0.8481948 coefficient differs from the number 0.4910762), the variance and covariance terms involving the constant are zero, so the formula reduces to:

Variance(X-Y) = Variance(X) + 0 - 0

For the simulated data, subtracting 2*Covariance(X,Y) reduces Variance(X-Y) more than adding the Variance(Y) increases Variance(X-Y), which explains how the p-value can be lower for comparing the two coefficients to each other than for comparing one coefficient to the value of the other coefficient.

See the code below:

suest X1alone X1paired
* Variance-covariance matrix for the combined estimates
matrix list e(V)
* z-statistic treating 0.4910762 as a fixed number: the denominator uses only Var(X)
di (.8481948-.4910762)/sqrt(.16695974)
* z-statistic for the coefficient difference: Var(X) + Var(Y) - 2*Cov(X,Y)
di (.8481948-.4910762)/sqrt(.16695974+.14457114-2*.14071065)
* Wald test of the equality of the two X1 coefficients
test _b[X1alone:X1] = _b[X1paired:X1]

Stata output here.

---

Back in 2016, SocImages tweeted a link to a post entitled "Trump Supporters Substantially More Racist Than Other Republicans". The "more racist" label refers to Trump supporters being more likely than Cruz supporters and Kasich supporters to indicate on stereotype scales that Blacks "in general" are less intelligent, more lazy, more rude, more violent, and more criminal than Whites "in general". I had a brief Twitter discussion with Philip Cohen and offered to move the discussion to a blog post. Moreover, I collected some relevant data, which is reported on in a new publication in Political Studies Review.

---

In 2017, Turkheimer, Harden, and Nisbett in Vox estimated the Black/White IQ gap to be closer to 10 points than to 15 points. Ten points would be a relatively large gap, about 2/3 of a standard deviation. Suppose that a person reads this Vox article and reads the IQ literature and, as a result, comes to believe that IQ is a valid enough measure of intelligence for it to be likely that the Black/White IQ gap reflects a true difference in mean intelligence. This person later responds to a survey, rating Whites in general one unit higher on a stereotype scale for intelligence than the person rates Blacks in general. My question, for anyone who thinks that such stereotype scale responses can be used as a measure of anti-Black animus, is:

Why is it racist for this person to rate Whites in general one unit higher than Blacks in general on a stereotype scale for intelligence?

I am especially interested in a response that is general enough to indicate whether it would be sexist against men to rate men in general higher than women in general on a stereotype scale for criminality.

---

In 2019, Michael Tesler published a Monkey Cage post subtitled "The majority of people who hold racist beliefs say they have an African American friend". Here is a description of these racist beliefs:

Not many whites in the survey took the overtly racist position of saying 'most blacks' lacked those positive attributes. The responses ranged from 9 percent of whites who said 'most blacks' aren't intelligent to 20 percent who said most African Americans aren't law-abiding or generous.

My analysis of the Pew Research Center data used in the Tesler 2019 post indicated that Tesler 2019 labeled as "overtly racist" the belief that most Blacks are not intelligent, even if a participant also indicated that most Whites are not intelligent.

In the Pew Research Center data (citation below), including Don't Knows and refusals, 118 of 1,447 Whites responded "No" to the question of whether most Blacks are intelligent, which is about 8 percent. However, 57 of the 118 Whites who responded "No" to the question of whether most Blacks are intelligent also responded "No" to the question of whether most Whites are intelligent. Thus, based on these intelligence items, 48 percent of the White participants who Tesler 2019 coded as taking an "overtly racist position" against Blacks also took a (presumably) overtly racist position against Whites. It could be that about half of the Whites who are openly racist against Blacks are also openly racist against Whites, or it could be that most or all of these 57 White participants have a nonracial belief that most people are not intelligent.
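
The two percentages follow from simple division:

* share of White participants answering "No" about whether most Blacks are intelligent (about 8 percent)
di 118/1447
* share of those 118 participants also answering "No" about whether most Whites are intelligent (about 48 percent)
di 57/118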

Even the classification of responses from the 56 Whites who reported "No" for whether most Blacks are intelligent and "Yes" for whether most Whites are intelligent should engage with the literature on the distribution of IQ test scores in the United States and with the possibility that at least some of these 56 Whites used the median U.S. IQ as the threshold for being intelligent.

---

I offered Michael Tesler an opportunity to reply. His reply is below:

Scholars have long disputed what constitutes racism in survey research.  Historically, these disagreements have centered around whether racial resentment items like agreeing that “blacks could be just as well off as whites if they only tried harder” are really racism or prejudice.  Because of these debates, I have avoided calling whites who score high on the racial resentment scale racists in both my academic research and my popular writing.

Yet even scholars who are most critical of the racial resentment measure, such as Ted Carmines and Paul Sniderman, have long argued that self-reported racial stereotypes are “self-evidently valid” measures of prejudice.  So, I assumed it would be relatively uncontroversial to say that whites who took the extreme position of saying that MOST BLACKS aren’t intelligent/hardworking/honest/law-abiding hold racist beliefs.  As the piece in question noted, very few whites took such extreme positions—ranging from 9% who said most blacks aren’t intelligent to 20% who said most blacks are not law-abiding.

If anything, then, the Pew measure of stereotypes used severely underestimates the extent of white racial prejudice in the country.  Professor Zigerell suggests that differencing white from black stereotypes is a better way to measure prejudice.  But this isn’t a very discerning measure in the Pew data because the stereotypes were only asked as dichotomous yes-no questions.  It’s all the more problematic in this case since black stereotypes were asked immediately before white stereotypes in the Pew survey and white respondents may have rated their own group less positively to avoid the appearance of prejudice.

In fact, Sniderman and Carmines’s preferred measure of prejudice—the difference between 7-point anti-white stereotypes and 7-point anti-black stereotypes—reveals far more prejudice than I reported from the Pew data.  In the 2016 American National Election Study (ANES), for example, 48% of whites rated their group as more hardworking than blacks, compared to only 13% in the Pew data who said most blacks are not hardworking.  Likewise, 53% of whites in the 2016 ANES rated blacks as more violent than whites and 25% of white Americans in the pooled 2010-2018 General Social Survey rated whites as more intelligent than blacks.

Most importantly, the substantive point of the piece in question—that whites with overtly racist beliefs still overwhelmingly claim they have black friends—remains entirely intact regardless of measurement.  Even if one wanted to restrict racist beliefs to only those saying most blacks are not intelligent/law-abiding AND that most whites are intelligent/law-abiding, 80%+ of these individuals who hold racist beliefs reported having a black friend in the 2009 Pew Survey.

All told, the post in question used a very narrow measure, which found far less prejudice than other valid stereotype measures, to make the point that the vast majority of whites with overtly racist views claim to have black friends.  Defining prejudice even more narrowly leads to the exact same conclusion.

I'll add a response in the comments.

---

NOTES

1. The title of the Tesler 2019 post is "No, Mark Meadows. Having a black friend doesn't mean you're not racist".

2. Data citation: Pew Research Center for the People & the Press/Pew Social & Demographic Trends. Pew Research Center Poll: Pew Social Trends--October 2009-Racial Attitudes in America II, Oct, 2009 [dataset]. USPEW2009-10SDT, Version 2. Princeton Survey Research Associates International [producer]. Cornell University, Ithaca, NY: Roper Center for Public Opinion Research, RoperExpress [distributor], accessed Aug-14-2019.

3. "White" and "Black" in the data analysis refer to non-Hispanic Whites and non-Hispanic Blacks.

4. In the Pew data, more White participants (147) reported "No" for the question of whether most Whites are intelligent, compared to the number of White participants (118) who reported "No" for the question of whether most Blacks are intelligent.

Patterns were similar among the 812 Black participants: 145 Black participants reported "No" for the question of whether most Whites are intelligent, but only 93 Black participants reported "No" for the question of whether most Blacks are intelligent.

Moreover, 76 White participants reported "Yes" for the question of whether most Blacks are intelligent and "No" for the question of whether most Whites are intelligent.

5. Stata code:

* Distribution of the race/ethnicity variable, including missing values
tab racethn, mi

* Cross-tabulations of the intelligence stereotype items (q69b, q70b),
* among non-Hispanic White participants and non-Hispanic Black participants
tab q69b q70b if racethn==1, mi
tab q69b q70b if racethn==2, mi

---

Below is a discussion of small study effects in the data for the 2017 PNAS article, "Meta-analysis of field experiments shows no change in racial discrimination in hiring over time", by Lincoln Quillian, Devah Pager, Ole Hexel, and Arnfinn Midtbøen. The first part is the initial analysis that I sent to Dr. Quillian. The Quillian et al. team replied here, also available via this link a level up. I responded to this reply below my initial analysis and will notify Dr. Quillian of the reply. Please note that Quillian et al. 2017 mentions publication bias analyses on page 5 of its main text and in Section 5 of the supporting information appendix.

---

Initial analysis

A conclusion of the Quillian et al. 2017 PNAS article is that levels of discrimination against Black job applicants in the United States have not changed much, or at all, over the past 25 years, based on a meta-analysis that focuses on 1989-2015 field experiments assessing discrimination against Black or Hispanic job applicants relative to White applicants. The credibility of this conclusion depends at least on the meta-analysis including the population of relevant field experiments or a representative set of relevant field experiments. However, the graph below for the set of Black/White discrimination field experiments in the dataset is consistent with what would be expected if the meta-analysis did not have a complete set of studies.

Comment Q2017 Figure 1

The graphs plot a measure of the precision of each study against the corresponding effect size estimate, from the dmap_update_1024recoded_3.dta dataset available here. For a population of studies or for a representative set of studies, the pattern of points is expected to approximate a symmetric pyramid peaking at zero on the y-axis. The logic of this expectation is that, if there were a single true underlying effect, the size of that effect would be the estimated effect size from a perfectly-precise study, which would have a standard error of zero. The average effect size for less-than-perfectly-precise studies should also approximate the true effect size, but any given less-than-perfectly-precise study would not necessarily produce an estimate of the true effect size and would be expected to produce estimates that often fall to one side or the other side of the true effect size, with estimates from lower-precision studies falling further on average from the true effect size than estimates from higher-precision studies, thus creating the expected symmetric pyramid shape.

Egger's test assesses asymmetry in the shape of a pattern of points. The p-value of 0.003 for the Black/White set of studies indicates the presence of sufficient evidence to conclude with reasonable certainty that the pattern of points for the 1989-2015 set of Black/White discrimination field experiments is asymmetric. This particular pattern of asymmetry could have been caused by the higher-precision studies having tested for discrimination in situations with lower levels of anti-Black discrimination relative to situations for the lower-precision studies. But this pattern could also have been produced by suppression of low-precision studies that had null results or had results that indicated discrimination favoring Blacks relative to Whites.
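
For readers unfamiliar with these tools, below is a minimal sketch of a funnel plot and Egger's test using Stata's meta suite (Stata 16 or later); the variable names effect and se_effect are placeholders, not the names used in the Quillian et al. dataset:

* declare precomputed effect sizes and their standard errors
meta set effect se_effect
* funnel plot of precision against effect size
meta funnelplot
* regression-based Egger test for small study effects
meta bias, egger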

Any inference from analyses of the set of 1989-2015 Black/White discrimination field experiments should thus consider the possibility that the set is incomplete and that any such incompleteness might bias inferences. For example, assessing patterns over time without any adjustment for possible missing studies requires an assumption that the inclusion of any missing studies would not alter the particular inference being made. That might be a reasonable assumption, but it should be identified as an assumption of any such inference.

The graphs below attempt to assess this assumption, by plotting estimates for the 10 earliest 1989-2015 Black/White field experiments and the 10 latest 1989-2015 Black/White field experiments, excluding the study that had no year indicated in the dataset for the year of the fieldwork. Both graphs are at least suggestive of the same type of small study effects.

Comment Q2017 Figure 2

Statistical methods have been developed to estimate the true effect size in meta-analyses after accounting for the possibility that the meta-analysis does not include the population of relevant studies or at least a representative set of relevant studies. For example, the top 10 percent by precision method, the trim-and-fill method with a linear estimator, and the PET-PEESE method cut the estimate of discrimination across the Black/White discrimination field experiments from 36 percent fewer callbacks or interviews to 25 percent, 21 percent, and 20 percent, respectively. These estimates, though, depend heavily on a lack of publication bias in highly-precise studies, which adds another assumption to these analyses and underscores the importance of preregistering studies.
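
As a rough sketch of two of these adjustments, again using the placeholder variable names from the sketch above (and not the exact implementations behind the estimates reported here):

* trim-and-fill estimate of the mean effect after imputing possibly missing studies
meta trimfill
* PET-PEESE: weighted least squares of the effect size on its standard error (PET)
* or on its variance (PEESE), with inverse-variance weights; the intercept serves
* as the bias-adjusted estimate
gen double var_effect = se_effect^2
regress effect se_effect [aweight=1/var_effect]
regress effect var_effect [aweight=1/var_effect]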

Social science should inform public beliefs and public policy, but the ability of social scientists to not report data that have been collected and analyzed cannot help but undercut this important role for social science. Social scientists should consider preregistering their plans to conduct studies and their planned research designs for analyzing data, to restrict their ability to suppress undesired results and to thus add credibility to their research and to social science in general.

---

Reply from the Quillian et al. team

Here

---

My response to the Quillian et al. reply

[1] The second section heading in the Quillian et al. reply correctly states that "Tests based on funnel plot asymmetry often generate false positives as indicators of publication bias". The Quillian et al. reply reported the funnel plot to the left below and the Egger's test p-value of 0.647 for the set of 13 Black/White discrimination resume audit (correspondence) field experiments, which provide little-to-no evidence of small study effects or publication bias. However, the funnel plot of the residual set of 8 Black/White discrimination field experiments (the in-person audits) has an asymmetric shape and an Egger's test p-value of 0.043, indicative of small study effects.

Comment Q2017 Figure 3

The Quillian et al. reply indicated that "Using only resume audits to analyze change over time gives no trend (the linear slope is -.002, almost perfectly flat, shown in figure 3 in our original paper, and the weighted-average discrimination ratio is 1.32, only slightly below the ratio of all studies of 1.36)". For me at least, the lack of a temporal pattern in the resume audit (correspondence) field experiments is more convincing after seeing the funnel plot pattern than it was before seeing that pattern, although the inference is now limited to racial discrimination between 2001 and 2015 because the dataset contains no correspondence field experiments conducted between 1989 and 2000. The top graph below illustrates this nearly-flat -0.002 slope for correspondence audit field experiments. Presuming no publication bias or presuming a constant effect of publication bias, it is reasonable to infer that there was no decrease in the level of White-over-Black favoring in correspondence audit field experiments between 2001 and 2015.

Comment Q2017 Figure 4

But presuming no publication bias or presuming a constant effect of publication bias, the slope for in-person audits in the bottom graph above indicates a potentially alarming increase in discrimination favoring Whites over Blacks, from the early 1990s to the post-2000 years, with a slope of 0.03 and a corresponding p-value of 0.08. But maybe there is a good reason not to include the three field experiments from 1990 and 1991, given the decade gap between the latest of these three field experiments and the set of post-2000 field experiments. If so, the slope of the line for Black/White discrimination correspondence studies and in-person audit studies pooled together from 2001 to 2015 is -0.02, with a p-value of 0.059, as depicted below.

[2] I don't object to the use of the publication bias test reported on in Quillian et al. 2017. My main objections are to the non-reporting of a funnel plot and to basing the inference that "publication or write-up bias is unlikely to have produced inflated discrimination estimates" (p. 6 of the supporting information appendix) on a null result from a regression with 21 points and five independent variables. Trim-and-fill lowered the meta-analysis estimate from 0.274 to 0.263 for the 1989-2015 Black/White discrimination correspondence audits, but lowered the 1989-2015 Black/White discrimination in-person audit meta-analysis estimate from 0.421 to 0.158. The trim-and-fill decrease for the pooled set of 1989-2015 Black/White discrimination field experiments is from 0.307 to 0.192.

Funnel plots and corresponding tests of funnel plot asymmetry indicate at most the presence of small study effects, which could be caused by phenomena other than publication bias. The Quillian et al. reply notes that "we find evidence that the difference between in person versus resume audit may create false positives for this test" (p. 4). This information and the reprinted funnel plots below are useful because they suggest multiple reasons to not pool results from in-person audits and correspondence audits for Black/White discrimination, such as [i] the possibility of publication bias in the in-person audit set of studies or [ii] possible differences in mean effect sizes for in-person audits compared to correspondence audits.

Comment Q2017 Figure 3

Maybe the best way to report these results is a flat line for correspondence audits indicating no change between 2001 and 2015 (N=13) and a downward-sloping-but-not-statistically-significant line for in-person audits between 2001 and 2015 (N=5), with an upward-sloping-but-not-statistically-significant line for in-person audits between 1989 and 2015 (N=8).

[3] This section discusses the publication bias test used by Quillian et al. 2017. I'll use "available" to describe field experiments retrieved in the search for published and unpublished field experiments.

The Quillian et al. reply (pp. 1-2) describes the logic of the publication bias test that they used as:

If publication bias is a serious issue, then studies that focus on factors other than race/ethnic discrimination should show lower discrimination than studies focused primarily on race/ethnicity, because for the latter studies (but not the former) publication should be difficult for studies that do not find significant evidence of racial discrimination.

The expectation, as I understand it, is that discrimination field experiments with race as the primary focus will have a range of estimates, some of which are statistically significant and some of which are not. If there is publication bias such that race-as-the-primary-focus field experiments that do not find discrimination against Blacks are less likely to be available than race-as-the-primary-focus field experiments that find discrimination against Blacks, then the estimate of discrimination against Blacks in the available race-as-the-primary-focus field experiments should be artificially inflated above the true value of racial discrimination. This publication bias test compares this presumed inflated effect size to the effect size from field experiments in which race was not the primary focus, which presumably is closer to the true value of racial discrimination because non-availability of the non-race-as-the-primary-focus field experiments is driven not by the p-value and direction for racial discrimination but instead or primarily by the p-value and direction for the other type of discrimination. The publication bias test is whether the effect size for the available non-race-focused discrimination field experiments is smaller than the effect size for the available race-focused discrimination field experiments.

The effect size for racial discrimination from field experiments in which race was not the primary focus might still be inflated in the presence of publication bias because [non-race-as-the-primary-focus field experiments that don't find discrimination in the primary focus but do find discrimination in the race manipulation] are plausibly more likely to be available than [non-race-as-the-primary-focus field experiments that don't find discrimination in the primary focus or in the race manipulation].

But let's stipulate that the racial discrimination effect size from non-race-as-the-primary-focus field experiments should be smaller than the racial discrimination effect size from race-as-the-primary-focus field experiments. If so, how large must this expected difference be such that the observed null result (0.051 coefficient, 0.112 standard error) in the N=21 five-independent-variable regression in Table S7 of Quillian et al. 2017 should be interpreted as evidence of the absence of nontrivial levels of publication bias?

For what it's worth, the publication bias test in the regression below reflects the test used in Quillian et al. 2017, but with a different model and with removal of the three field experiments from 1990 and 1991, such that the sample is the set of Black/White discrimination field experiments from 2001 to 2015. The control for the study method indicates that in-person audits have an estimated 0.40 larger effect size than correspondence audits. The 95 percent confidence interval for the race_not_focus predictor ranges from -0.21 to 0.18. Is that range inconsistent with the expected value based on this test if there were nontrivial amounts of publication bias?
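
A sketch of this kind of regression in Stata, building on the meta set declaration sketched earlier and using hypothetical variable names (race_not_focus = 1 if race was not the study's primary focus, in_person = 1 for in-person audits, year = year of the fieldwork):

* meta-regression restricted to the 2001-2015 studies
meta regress race_not_focus in_person if year >= 2001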

Comment Q2017 Figure 6

---

Data available at the webpage for Quillian et al. 2017 [here]

My R code [here]

My Stata code [here]

---

I had a recent Twitter exchange about a Monkey Cage post.

Below, I use statistical power calculations to explain why the Ahlquist et al. paper, or at least the list experiment analysis cited in the Monkey Cage post, is not compelling.

---

Discussing the paper (published version here), Henry Farrell wrote:

So in short, this research provides exactly as much evidence supporting the claim that millions of people are being kidnapped by space aliens to conduct personally invasive experiments on, as it does to support Trump's claim that millions of people are engaging in voter fraud.

However, a survey with a sample size of three would also not be able to differentiate the percentage of U.S. residents who commit vote fraud from the percentage of U.S. residents abducted by aliens. For studies that produce a null result, it is necessary to assess the ability of the study to detect an effect of a particular size, to get a sense of how informative that null result is.

The Ahlquist et al. paper has a footnote [31] that can be used to estimate the statistical power for their list experiments: more than 260,000 total participants would be needed for a list experiment to have 80% power to detect a 1 percentage point difference between treatment and control groups, using an alpha of 0.05. The power calculator here indicates that the corresponding estimated standard deviation is at least 0.91 [see note 1 below].
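
A sketch of how that standard deviation can be backed out in Stata from the footnote's figures, using the standard two-sample power formula with a two-sided alpha of 0.05, 80% power, and 130,000 participants per group:

* implied common standard deviation for detecting a 0.01 difference
di 0.01*sqrt(130000/2)/(invnormal(0.975) + invnormal(0.80))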

So let's assume that list experiment participants are truthful and that we combine the 1,000 participants from the first Ahlquist et al. list experiment with the 3,000 participants from the second Ahlquist et al. list experiment, so that we'd have 2,000 participants in the control sample and 2,000 participants in the treatment sample. Statistical power calculations using an alpha of 0.05 and a standard deviation of 0.91 indicate that there is:

  • a 5 percent chance of detecting a 1% rate of vote fraud.
  • an 18 percent chance of detecting a 3% rate of vote fraud.
  • a 41 percent chance of detecting a 5% rate of vote fraud.
  • a 79 percent chance of detecting an 8% rate of vote fraud.
  • a 94 percent chance of detecting a 10% rate of vote fraud.
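
These figures can be closely reproduced with Stata's power command, treating the list experiment difference in means as a two-sample comparison of means with a common standard deviation of 0.91, an alpha of 0.05 (the command's default), and 2,000 participants per group:

power twomeans 0 (0.01 0.03 0.05 0.08 0.10), sd(0.91) n(4000)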

---

Let's return to the claim that millions of U.S. residents committed vote fraud and use 5 million for the number of adult U.S. residents who committed vote fraud in the 2016 election, eliding the difference between illegal votes and illegal voters. There are roughly 234 million adult U.S. residents (reference), so 5 million vote fraudsters would be 2.1% of the adult population, and a 4,000-participant list experiment would have about an 11 percent chance of detecting that 2.1% rate of vote fraud.
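
The same approach gives the figures for a 2.1% rate:

* 5 million of roughly 234 million adults is about 2.1 percent
di 5/234
* power to detect a 2.1 percentage point difference with 2,000 per group
power twomeans 0 0.021, sd(0.91) n(4000)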

Therefore, if 5 million adult U.S. residents really did commit vote fraud, a list experiment with the sample size of the pooled Ahlquist et al. 2014 list experiments would produce a statistically-significant detection of vote fraud about 1 of every 9 times the list experiment was conducted. The fact that Ahlquist et al. 2014 didn't detect voter impersonation at a statistically-significant level doesn't appear to compel any particular belief about whether the rate of voter impersonation in the United States is large enough to influence the outcome of presidential elections.

---

NOTES

1. Enter 0.00 for mu1, 0.01 for mu2, 0.91 for sigma, 0.05 for alpha, and a 130,000 sample size for each sample; then hit Calculate. The power will be 0.80.
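
A Stata equivalent of this calculation, solving for the sample size instead of the power:

* should report roughly 130,000 per group (about 260,000 in total)
power twomeans 0 0.01, sd(0.91) power(0.8)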

2. I previously discussed the Ahlquist et al. list experiments here and here. The second link indicates that an Ahlquist et al. 2014 list experiment did detect evidence of attempted vote buying.

---

Researchers often have the flexibility to report only the results they want to report, so an important role for peer reviewers is to request that researchers report results that a reasonable skeptical reader might suspect have been strategically unreported. I'll discuss two publications where obvious peer review requests do not appear to have been made and, presuming these requests were not made, how requests might have helped readers better assess evidence in the publication.

---

Example 1. Ahlquist et al. 2014 "Alien Abduction and Voter Impersonation in the 2012 U.S. General Election: Evidence from a Survey List Experiment"

Ahlquist et al. 2014 reports on two list experiments: one list experiment is from December 2012 and has 1,000 cases, and another list experiment is from September 2013 and has 3,000 cases.

Figure 1 of Ahlquist et al. 2014 reports results for the 1,000-person list experiment estimating the prevalence of voter impersonation in the 2012 U.S. general election; the 95% confidence intervals for the full sample and for each reported subgroup cross zero. Figure 2 reports results for the full sample of the 3,000-person list experiment estimating the prevalence of voter impersonation in the 2012 U.S. general election, but Figure 2 did not include subgroup results. Readers are thus left to wonder why subgroup results were not reported for the larger sample that had more power to detect an effect among subgroups.

Moreover, the main voting irregularity list experiment reported in Ahlquist et al. 2014 concerned voter impersonation, but, in footnote 15, Ahlquist et al. discuss another voting irregularity list experiment that was part of the study, about whether political candidates or activists offered the participant money or a gift for their vote:

The other list experiment focused on vote buying and closely mimicked that described in Gonzalez-Ocantos et al. (2012). Although we did not anticipate discovering much vote buying in the USA we included this question as a check, since a similar question successfully discovered voting irregularities in Nicaragua. As expected we found no evidence of vote buying in the USA. We omit details here for space considerations, though results are available from the authors and in the online replication materials...

The phrasing of the footnote does not make clear whether the inference of "no evidence of vote buying in the USA" is restricted to an analysis of the full sample or also covers analyses of subgroups.

So the article leaves at least two questions unanswered for a skeptical reader:

  1. Why report subgroup analyses for only the smaller sample?
  2. Why not report the overall estimate and subgroup analyses for the vote buying list experiment?

Sure, for question 2, Ahlquist et al. indicate that the details of the vote buying list experiment were omitted for "space considerations"; however, the 16-page Ahlquist et al. 2014 article is shorter than the other two articles in the journal issue, which are 17 pages and 24 pages.

Peer reviewer requests that could have helped readers were to request a detailed report on the vote buying list experiment and to request a report of subgroup analyses for the 3,000-person sample.

---

Example 2. Sen 2014 "How Judicial Qualification Ratings May Disadvantage Minority and Female Candidates"

Sen 2014 reports logit regression results in Table 3 for four models predicting the ABA rating given to U.S. District Court nominees from 1962 to 2002, with ratings dichotomized into (1) well qualified or exceptionally well qualified and (2) not qualified or qualified.

Model 1 includes a set of variables such as the nominee's sex, race, partisanship, and professional experience (e.g., law clerk, state judge). Compared to model 1, model 2 omits the partisanship variable and adds year dummies. Compared to model 2, model 3 adds district dummies and interaction terms for female*African American and female*Hispanic. And compared to model 3, model 4 removes the year dummies and adds a variable for years of practice and a variable for the nominee's estimated ideology.

The first question raised by the table is the omission of the partisanship variable for models 2, 3, and 4, with no indication of the reason for that omission. The partisanship variable is not statistically significant in model 1, and Sen 2014 notes that the partisanship variable "is never statistically significant under any model specification" (p. 44), but it is not clear why the partisanship variable is dropped in the other models because other variables appear in all four models and never reach statistical significance.

The second question raised by the table is why years of practice appears in only the fourth model, in which roughly one-third of cases are lost due to the inclusion of estimated nominee ideology. Sen 2014 Table 2 indicates that male and white nominees had substantially more years of practice than female and black nominees: men (16.87 years), women (11.02 years), whites (16.76 years), and blacks (10.08 years); therefore, any model assessing whether ABA ratings are biased should account for sex and race differences in years of practice, under the reasonable expectation that nominees should receive higher ratings for more experience.

Peer reviewer requests that could have helped readers were to request a discussion of the absence of the partisanship variable from models 2, 3, and 4, and to request that years of experience be included in more of the models.

---

Does it matter?

Data for Ahlquist et al. 2014 are posted here. I reported on my analysis of the data in a manuscript rejected after peer review by the journal that published Ahlquist et al. 2014.

My analysis indicated that the weighted list experiment estimate of vote buying for the 3,000-person sample was 5 percent (p=0.387), with a 95% confidence interval of [-7%, 18%]. I'll echo my earlier criticism and note that a 25-percentage-point-wide confidence interval is not informative about the prevalence of voting irregularities in the United States because all plausible estimates of U.S. voting irregularities fall within 12.5 percentage points of zero.
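
For readers unfamiliar with how such estimates are computed, here is a minimal sketch of the standard difference-in-means list experiment estimator in Stata, with hypothetical variable names (item_count = number of list items reported, treat = 1 if assigned the list containing the sensitive item, weight = survey weight):

* the coefficient on treat estimates the proportion reporting the sensitive behavior
regress item_count treat [pw=weight]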

Ahlquist et al. 2014 footnote 14 suggests that imputed data on participant voter registration were available, so a peer reviewer could have requested reporting of the vote buying list experiments restricted to registered voters, given that only registered voters have a vote to trade. I did not see a variable for registration in the dataset for the 1,000-person sample, but the list experiment for the 3,000-person sample produced the weighted point estimate that 12 percent of persons listed as registered to vote were contacted by political candidates or activists around the 2012 U.S. general election with an offer to exchange money or gifts for a vote (p=0.018).

I don't believe that this estimate is close to correct, and, given sufficient subgroup analyses, some subgroup analyses would be expected to produce implausible or impossible results, but peer reviewers requesting these data might have produced a more tentative interpretation of the list experiments.

---

For Sen 2014, my analysis indicated that the estimates and standard errors for the partisanship variable (coded 1 for nomination by a Republican president) inflate unusually high when that variable is included in models 2, 3, and 4: the coefficient and standard error for the partisanship variable are 0.02 and 0.11 in model 1, but inflate to 15.87 and 535.41 in model 2, 17.90 and 1,455.40 in model 3, and 18.21 and 2,399.54 in model 4.

The Sen 2014 dataset had variables named Bench.Years, Trial.Years, and Private.Practice.Years. The years of experience for these variables overlap (e.g., nominee James Gilstrap was born in 1957 and respectively has 13, 30, and 30 years for these variables); therefore, the variables cannot be summed to construct a variable for total years of legal experience that does not include double- or triple-counting for some cases. Bench.Years correlates with Trial.Years at -0.47 and with Private.Practice.Years at -0.39, but Trial.Years and Private.Practice.Years correlate at 0.93, so I'll include only Bench.Years and Trial.Years, given that Trial.Years appears more relevant for judicial ratings than Private.Practice.Years.

My analysis indicated that women and blacks had a higher Bench.Years average than men and whites: men (4.05 years), women (5.02 years), whites (4.02 years), and blacks (5.88 years). Restricting the analysis to nominees with nonmissing nonzero Bench.Years, men had slightly more experience than women (9.19 years to 8.36 years) and blacks had slightly more experience than whites (9.33 years to 9.13 years).

Adding Bench.Years and Trial.Years to the four Table 3 models did not produce any meaningful difference in results for the African American, Hispanic, and Female variables, but the p-value for the Hispanic main effect fell to 0.065 in model 4 with Bench.Years added.

---

I estimated a simplified model with the following variables predicting the dichotomous ABA rating variable for each nominee with available data: African American nominee, Hispanic nominee, female nominee, Republican nominee, nominee age, law clerk experience, law school tier (from 1 to 6), Bench0 and Trial0 (no bench or trial experience respectively), Bench.Years, and Trial.Years. These variables reflect demographics, nominee quality, and nominee experience, with a presumed penalty for nominees who lack bench and/or trial experience. Results are below:
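
As a schematic of this specification in Stata syntax (the variable names below are hypothetical stand-ins; the actual estimation used the Sen dataset's variable names in R):

* aba_high = 1 for a well qualified or exceptionally well qualified rating
logit aba_high african_american hispanic female republican age law_clerk law_school_tier bench0 trial0 bench_years trial_years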

aba1

The female coefficient was not statistically significant in the above model (p=0.789), but the coefficient was much closer to statistical significance when adding a control for the year of the nomination:

aba2

District.Court.Nomination.Year was positively related to the dichotomous ABA rating variable (r=0.16) and to the female variable (r=0.29), and the ABA rating increased faster over time for women than for men (but not at a statistically-significant level: p=0.167), so I estimated a model that interacted District.Court.Nomination.Year with Female and with the race/ethnicity variables:

aba3

The model above provides some evidence for an over-time reduction of the sex gap (p=0.095) and the black/white gap (p=0.099).

The next model is the second model reported above, but with estimated nominee ideology added, coded with higher values indicating higher levels of conservatism:

aba4

So there is at least one reasonable model specification that produces evidence of bias against conservative nominees, at least to the extent that the models provide evidence of bias. After all, ABA ratings are based on three criteria—integrity, professional competence, and judicial temperament—but the models include information for only professional competence, so a sex, race, and ideological gap in the models could indicate bias and/or could indicate a sex, race, and ideological gap in nonbiased ABA evaluations of integrity and/or judicial temperament and/or elements of professional competence that are not reflected in the model measures. Sen addressed the possibility of gaps in these other criteria, starting on page 47 of the article.

For what it's worth, evidence of bias against conservatives is stronger when excluding the partisanship control:

aba5

---

The above models for the Sen reanalysis should be interpreted to reflect the fact that there are many reasonable models that could be reported. My assessment from the models that I estimated is that the black/white gap is extremely if not completely robust, the Hispanic/white gap is less robust but still very robust, the female/male gap is less robust but still somewhat robust, and the ideology gap is the least robust of the group.

I'd have liked the peer reviewers of Sen 2014 to have requested results for the peer reviewers' preferred models, with the requested models based only on available data and with results reported at least in an online supplement. This would provide reasonable robustness checks for an analysis for which there are many reasonable model specifications. Maybe that happened: the appendix table in the working paper version of Sen 2014 is somewhat different than the published logit regression table. In any event, indicating which models were suggested by peer reviewers might help reduce skepticism about the robustness of reported models, to the extent that models suggested by a peer reviewer have not been volunteered by the researchers.

---

NOTES FOR AHLQUIST ET AL. 2014:

1. Subgroup analyses might have been reported for only the smaller 1,000-person sample because the smaller sample was collected first. However, that does not mean that the earlier sample should be the only sample for which subgroup analyses are reported.

2. Non-disaggregated results for the 3,000-person vote buying list experiment and disaggregated results for the 1,000-person vote buying list experiment were reported in a prior version of Ahlquist et al. 2014, which Dr. Ahlquist sent me. However, a reader of Ahlquist et al. 2014 might not be aware of these results, so Ahlquist et al. 2014 might have been improved by including these results.

---

NOTES FOR SEN 2014:

1. Ideally, models would include a control for twelve years of experience, given that the ABA Standing Committee on the Federal Judiciary "...believes that a prospective nominee to the federal bench ordinarily should have at least twelve years' experience in the practice of law" (p. 3, here). Sen 2014 reports results for a matching analysis that reflects the 12 years threshold, at least for the Trial.Years variable, but I'm less confident in matching results, given the loss of cases (e.g., from 304 women in Table 1 to 65 women in Table 4) and the loss of information (e.g., cases appear to be matched so that nominees with anywhere from 0 to 12 years on Trial.Years are matched on Trial.Years).

2. I contacted the ABA and sent at least one email to the ABA liaison for the ABA committee that handles ratings for federal judicial nominations, asking whether data could be made available for nominee integrity and judicial temperament, such as a dichotomous indication whether an interviewee had raised concerns about the nominee's integrity or judicial temperament. The ABA Standing Committee on the Federal Judiciary prepares a written statement (e.g., here) that describes such concerns for nominees rated as not qualified, if the ABA committee is asked to testify at a Senate Judiciary Committee hearing for the nominee (see p. 8 here). I have not yet received a reply to my inquiries.

---

GENERAL NOTES

1. Data for Ahlquist et al. 2014 are here. Code for my additional analyses is here.

2. Dr. Sen sent me data and R code, but the Sen 2014 data and code do not appear to be online now. Maya Sen's Dataverse is available here. R code for the supplemental Sen models described above is here.

---

Journal requirements to post data and code for published articles are a major improvement in the conduct of social science, because such requirements increase the ability of researchers to assess the correctness and robustness of reported results and because they presumably produce more careful analyses by researchers who are aware that their data and code will be made public.

But the DA-RT agreement to "[r]equire authors to ensure that cited data are available at the time of publication through a trusted digital repository" does not address selective reporting. For example, the current replication policy for the journal Political Behavior requires only that "[a]uthors of accepted manuscripts will be required to deposit all of the data and script files needed to replicate the published results in a trusted data repository such as ICPSR or Dataverse" (emphasis added).

This permits researchers to selectively report experiments, experimental conditions, and potential outcome variables, and to then delete the corresponding data from the dataset that is made public. Readers thus often cannot be sure whether the reported research has been selectively reported.

---

Consider uncertainty about the survey experiment reported in Filindra and Kaplan 2016, described in the article's abstract as follows (p. 255):

To determine whether racial prejudice depresses white support for gun control, we designed a priming experiment which exposed respondents to pictures of blacks and whites drawn from the IAT. Results show that exposure to the prime suppressed support for gun control compared to the control, conditional upon a respondent's level of racial resentment.

But here is a description of the experimental treatment (p. 261):

Under the guise of a cognitive test, we exposed 600 survey participants who self-identified as white to three pictures of the faces of black individuals and another three of white individuals.

I wasn't sure why a survey experiment intended "[t]o determine whether racial prejudice depresses white support for gun control" would have as its only treatment a prime that consisted of photos of both blacks and whites. It seems more logical for a "racial prejudice" experiment to have one condition in which participants were shown photos of blacks and another condition in which participants were shown photos of whites; then responses to gun control items that followed the photo primes could be compared for the black photo and white photo conditions.

Readers of Filindra and Kaplan 2016 might suspect that there were unreported experimental conditions in which participants were shown photos of blacks or were shown photos of whites. But readers cannot know from the article whether there were unreported conditions.

---

I didn't know of an easier way to eliminate the uncertainty about whether there were unreported conditions in Filindra and Kaplan 2016 than to ask the researchers, so I sent the corresponding author an email asking about the presence of unreported experimental conditions involving items about guns and photos of blacks and/or whites. Dr. Filindra indicated that there were no unreported conditions involving photos of blacks and/or whites, but that there were unreported non-photo conditions that are planned for forthcoming work.

---

My correspondence with Dr. Filindra made me more confident in their reported results, but such correspondence is a suboptimal way to increase confidence in reported results: it took time from Drs. Filindra and Kaplan and from me, and the information from our correspondence is, as far as I am aware, available only to persons reading this blog post.

There are multiple ways for journals and researchers to remove uncertainty about selective reporting and thus increase research transparency, such as journals requiring the posting of all collected data, journals requiring researchers to make disclosures about the lack of selective reporting, and researchers preregistering plans to collect and analyze data.
