In May 2020, PS published a correction to Mitchell and Martin 2018 "Gender Bias in Student Evaluations" that reflected concerns I had raised in a March 2019 blog post. That correction didn't mention me, and neither did a second correction that PS published in May 2020 that was also due to my work. So below I'll note evidence that the corrections were due to my work, which might be useful in documenting my scholarly contributions for, say, an end-of-year review or promotion application.

---

In August 2018, I alerted the authors of Mitchell and Martin 2018 (hereafter MM) to concerns about potential errors in MM. I'll post one of my messages below. My sense at the time was that the MM authors were not going to correct MM (and the lead author of MM was defending MM as late as June 2019), so I published a March 2019 blog post about my concerns and in April 2019 I emailed PS a link to my blog post and a suggestion that MM "might have important errors in inferential statistics that warrant a correction".

In May 2019, a PS editor indicated to me that the MM authors had chosen not to issue a correction and that PS invited me to submit a comment on MM that would pass through the normal peer review process. I transformed my blog post into a manuscript comment, which involved, among other things, coding all open-ended student evaluation comments and calculating what I thought the correct results should be for the three main MM tables. Moreover, for completeness, I contacted Texas Tech University and eventually filed a Public Information Act request, because no one I communicated with at Texas Tech knew for certain why student evaluation data were not available online for certain sections of the course for which MM Table 4 reported student evaluation results.

I submitted a comment manuscript to PS in August 2019 and submitted a revision based on editor feedback in September 2019. Here is the revised submitted manuscript. In January 2020, I received an email from PS indicating that my manuscript was rejected after peer review and that PS would request a corrigendum from the authors of MM.

In May 2020, PS published a correction to MM, but I don't think that the correction is complete: for example, as I discussed in my blog post and manuscript comment, I think that the inferential statistics in MM Table 4 were incorrectly based on a calculation in which multiple ratings from the same student were treated as independent ratings.
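To illustrate the type of adjustment I have in mind, here is a minimal R sketch with hypothetical variable names (a data frame ratings with columns student_id, instructor, and rating, with multiple ratings per student); this is not the analysis in MM or in the correction, just one simple way to respect the clustering by student:

# Average each student's ratings so that each student contributes one observation,
# then compare instructors using the student-level means
student_means <- aggregate(rating ~ student_id + instructor, data = ratings, FUN = mean)
t.test(rating ~ instructor, data = student_means)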

---

For the Comic-Con correction that PS issued in May 2020, I'll quote from my manuscript documenting the error of inference in the article:

I communicated concerns about the Owens et al. 2020 "Comic-Con" article to the first two authors in November 2019. I did not hear of an attempt to publish a correction, and I did not receive a response to my most recent message, so I submitted this manuscript to PS: Political Science & Politics on Feb 4, 2020. PS published a correction to "Comic-Con" on May 11, 2020. PS then rejected my manuscript on May 18, 2020 "after an internal review".

Here is an archive of a tweet thread, documenting that in September 2019 I alerted the lead "Comic-Con" author to the error of inference, and the lead author did not appear to understand my point.

---

NOTES:

1. My PS symposium entry "Left Unchecked" (published online in June 2019) discussed elements of MM that ended up being addressed in the MM correction.

2. Here is an email that I sent the MM authors in August 2018:

Thanks for the data, Dr. Mitchell. I had a few questions, if you don't mind:

[1] The appendix indicates for the online course analysis that: "For this reason, we examined sections in the mid- to high- numerical order: sections 6, 7, 8, 9, and 10". But I think that Dr. Martin taught a section 11 course (D11) that was included in the data.

[2] I am not certain about how to reproduce the statistical significance levels for Tables 1 and 2. For example, for Table 1, I count 23 comments for Dr. Martin and 45 comments for Dr. Mitchell, for the N=68 in the table. But a proportion test in Stata for the "Referred to as 'Teacher'" proportions (prtesti 23 0.152 45 0.244) produces a z-score of -0.8768, which does not seem to match the table asterisks indicating a p-value of p<0.05.

[3] Dr. Martin's CV indicates that he was a visiting professor at Texas Tech in 2015 and 2016. For the student comments for POLS 3371 and POLS 3373, did Dr. Martin's official title include "professor"? If so, then that might influence inferences about any difference in the frequency of student use of the label "professor" between Dr. Martin and Dr. Mitchell. I didn't see "professor" as a title in Dr. Mitchell's CV, but the inferences could also be influenced if Dr. Mitchell had "professor" in her title for any of the courses in the student comments analysis, or for the Rate My Professors comments analysis.

[4] I was able to reproduce the results for the Technology analysis in Table 4, but, if I am correct, the statistical analysis seems to assume that the N=153 for Dr. Martin and the N=501 for Dr. Mitchell are for 153 and 501 independent observations. I do not think that this is correct, because my understanding of the data is that the 153 observations for Dr. Martin are 3 observations for 51 students and that the 501 observations for Dr. Mitchell are 3 observations for 167 students. I think that the analysis would need to adjust for the non-independence of some of the observations.

Sorry if any of my questions are due to a misunderstanding. Thank you for your time.

Best,

L.J

---

The average eighth grade math score on the 2019 National Assessment of Educational Progress (NAEP) was 310 for Asian/Pacific Islander students, 292 for White students, 268 for Hispanic students, and 260 for Black students. This pattern has been consistent for many years, for fourth grade students (Figure 3), for eighth grade students (Figure 4), and for twelfth grade students (Figure 5).

However, before inferring that Asian/Pacific Islander students are better in math on average than are White students and Hispanic students and Black students, be aware that this inference could be labeled "prejudice" in peer-reviewed research such as Piston 2010 and Hopkins and Washington 2020, which measured "prejudice" as a difference in ratings of groups on stereotype scales for certain characteristics.

---

Piston 2010 conceptualized "prejudice" with "an etymological perspective":

An assessment that one racial group possesses a negative attribute relative to another racial group is a "pre-judgment"; it precedes, but may or may not influence, the evaluation of an individual member of that group, such as Barack Obama.

So, if you make a good faith interpretation of NAEP scores and/or SAT scores and infer that Asian/Pacific Islander students are better on average in math than are White students and Hispanic students and Black students, that would be "prejudice" by the analysis in Piston 2010.

---

Such an inference might not be "prejudice" based on Hopkins 2019:

We define prejudice as a standing, negative predisposition toward a social group held in the face of contradictory information.

Based on this, Hopkins 2019 seems to require evidence that Asian/Pacific Islander students are not better on average in math than are White students and Hispanic students and Black students ("contradictory information") before labeling that belief as "prejudice".

I asked Dan Hopkins in a tweet what "contradictory information" he was referring to for his use of "prejudice", and, perhaps as a consequence, Hopkins and Washington 2020 removed the "held in the face of contradictory information" restriction. From Hopkins and Washington 2020:

'Prejudice' refers to a standing, negative predisposition toward a social group.

So, by Hopkins and Washington 2020, it would be "prejudice" to have a justified standing, negative predisposition toward a hate group that regularly commits terrorism. That might be a proper conceptualization of "prejudice", but I would be interested in seeing Hopkins or Washington use "prejudice" in that way.

Hopkins and Washington 2020 used stereotype scale differences as measures of "prejudice", but it seems possible to perceive that members of one group perform better on average on some measure than members of another group, without having a "standing, negative predisposition" toward either group, especially because nothing in these traditional stereotype scales indicates that the scales measure belief about innate or genetic characteristics.

---

From what I can tell, the belief that U.S. Asian/Pacific Islander students are better in math on average than are White students and Hispanic students and Black students would be "prejudice" under the conceptualizations in Piston 2010 and Hopkins and Washington 2020, even though I think that this belief can result from a good faith interpretation of high quality evidence. I thus think that use of the conceptualizations of "prejudice" in Piston 2010 or Hopkins and Washington 2020 has the potential to be misleading and to corrode public discourse.

The potential to mislead arises because I think that "prejudice" has a negative connotation in everyday language, and I don't think that a good faith interpretation of high quality evidence should receive a label with a negative connotation. I am not aware of anything that prevents researchers from labeling such stereotype scale responses as "stereotype scale differences" or something similar that more precisely describes the phenomenon being measured.

The potential to corrode public discourse arises because fear of being labeled "prejudiced" can make people less likely to express beliefs derived from a good faith interpretation of high quality evidence, and I don't think that, barring some compelling reason otherwise, people should be discouraged from expressing such beliefs.

---

1.

Researchers reporting results from an experiment often report estimates of the treatment effect at particular levels of a predictor. For example, a panel of Figure 2 in Barnes et al. 2018 plotted, over a range of hostile sexism, the estimated difference in the probability of reporting being very unlikely to vote for a female representative target involved in a sex scandal relative to the probability of reporting being very unlikely to vote for a male representative target involved in a sex scandal. For another example, Chudy 2020 plotted, over a range of racial sympathy, estimated punishments for a Black culprit target and a White culprit target. Both of these plots report estimates derived from a regression. However, as indicated in Hainmueller et al. 2020, regression can nontrivially misestimate a treatment effect at particular levels of a predictor.

This post presents another example of this phenomenon, based on data from the experiment in Costa et al. 2020 "How partisanship and sexism influence voters' reactions to political #MeToo scandals" (link to a correction to Costa et al. 2020).

---

2.

The Costa et al. 2020 experiment had a control condition, two treatment conditions, and multiple outcome variables, but my illustration will focus on only two conditions and only one outcome variable. Participants were asked to respond to four items measuring participant sexism and to rate a target male senator on a 0-to-10 scale. Participants who were then randomized to the "sexual assault" condition were provided a news story indicating that the senator had been accused of groping two women without consent. Participants who were instead randomized to the control condition were provided a news story about the senator visiting a county fair. The outcome variable of interest for this illustration is the percent change in the favorability of the senator, from the pretest to the posttest.

Estimates in the left panel of Figure 1 are based on a linear regression predicting the outcome variable of interest, using as predictors a pretest measure of participant sexism ranging from 0 (low sexism) to 16 (high sexism), a dichotomous variable coded 1 for participants in the sexual assault condition and 0 for participants in the control condition, and the interaction of these two predictors. The panel plots the point estimates and 95% confidence intervals for the estimated difference in the outcome variable between the control condition and the sexual assault condition, at each observed level of the participant sexism index.
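Here is a minimal R sketch of this regression-based approach, assuming a data frame dat that has been restricted to the control and sexual assault conditions and that uses the variable names from the replication code in the notes below (perchange_vote, condition2, sexism16); the "Control" level label follows the interflex code in the notes, and the 0/1 assault indicator is my own construction:

# Regression-based estimates (left panel of Figure 1): linear model with an
# interaction between the sexism index and the condition indicator.
# Assumes dat contains only the control and sexual assault conditions.
dat$assault <- as.numeric(dat$condition2 != "Control")  # 1 = sexual assault condition

fit <- lm(perchange_vote ~ sexism16 * assault, data = dat)

# Estimated assault-minus-control difference at each observed sexism level:
# difference(x) = b_assault + b_interaction * x
b <- coef(fit)
V <- vcov(fit)
x <- sort(unique(dat$sexism16))
est <- b["assault"] + b["sexism16:assault"] * x
se <- sqrt(V["assault", "assault"]
           + x^2 * V["sexism16:assault", "sexism16:assault"]
           + 2 * x * V["assault", "sexism16:assault"])
data.frame(sexism16 = x, estimate = est, lower = est - 1.96*se, upper = est + 1.96*se)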

The leftmost point indicates that the "least sexist" participants in the sexual assault condition were estimated to have a value of the outcome variable that was about 52 units less than the "least sexist" participants in the control condition; the "least sexist" participants in the control were estimated to have increased their rating of the senator by 4.6 percent, and the "least sexist" participants in the sexual assault condition were estimated to have reduced their rating of the senator by 47.6 percent.

The rightmost point of the plot indicates that the "most sexist" participants in the sexual assault condition were estimated to have about the same value of the outcome variable as the "most sexist" participants in the control condition; the "most sexist" participants in the control were estimated to have increased their rating of the senator by 1.7 percent, and the "most sexist" participants in the sexual assault condition were estimated to have increased their rating of the senator by 2.1 percent. Based on this rightmost point, a reader could conclude about the sexual assault allegations, as Costa et al. 2020 suggested, that:

...the most sexist subjects react about the same way to sexual assault and sexist jokes allegations as they do to the control news story about the legislator attending a county fair.

However, the numbers at the inside bottom of the Figure 1 panels indicate the sample size at that level of the sexism index, across the control condition and the sexual assault condition. These numbers indicate that the regression-based estimate for the "most sexist" participants was nontrivially based on the behavior of other participants.

Estimates in the right panel of Figure 1 are instead based on t-tests conducted for participants at only the indicated level of the sexism index. As in the left panel, the estimate for the "least sexist" participants falls between -50 and -60, and, for the next few higher observed values of the sexism index, estimates tend to rise and/or tend to get closer to zero. But the tendency does not persist above the midpoint of the sexism index. Moreover, the point estimates in the right panel for the three highest values of the sexism index do not fall within the corresponding 95% confidence intervals in the left panel.
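As a rough sketch of this per-level approach, again under the assumed variable names above, one could run a separate t-test at each observed value of the sexism index (values of the index with very few participants in either condition may need to be pooled, as the post does below for 15 and 16):

# Per-level estimates (right panel of Figure 1): at each observed value of the
# sexism index, compare the two conditions using only participants at that value.
levels_x <- sort(unique(dat$sexism16))
per_level <- do.call(rbind, lapply(levels_x, function(v) {
  sub <- dat[dat$sexism16 == v, ]
  tt <- t.test(sub$perchange_vote[sub$assault == 1],
               sub$perchange_vote[sub$assault == 0])
  data.frame(sexism16 = v,
             estimate = unname(tt$estimate[1] - tt$estimate[2]),  # assault minus control
             lower = tt$conf.int[1], upper = tt$conf.int[2],
             n = nrow(sub))
}))
per_level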

The p-value fell below p=0.05 for the 28 participants at 15 or 16 on the sexism index, with a point estimate of -22. The sample size was 1,888 across these two conditions, so participants at 15 or 16 on the sexism index represent the top 1.5% of participants on the sexism index across these two conditions. Therefore, the sexual assault treatment appears to have had an effect on these "very sexist" participants.

---

3.

Regression can reveal patterns in data. For example, linear regression estimates correctly indicated that, in the Costa et al. 2020 experiment, the effect of the sexual assault treatment relative to the control was closer to zero for participants at higher levels of the sexism index than for participants at lower levels of the sexism index. However, as indicated in the illustration above, regression can produce misestimates of an effect at particular levels of a predictor. Therefore, inferences about an estimated effect at a particular level of a predictor should be based only on cases at or around that level of the predictor and should not be influenced by other cases.

---

NOTES

1. Costa et al. 2020 data.

2. Stata code for the analysis.

3. R code for the plot. CSV file for the R plot.

4. The interflex R package (Hainmueller et al. 2020) produced the plot below, using six bins. The leveling off at higher values of the sexism index also appears in this interflex plot:

R code to add to the corrected Costa et al. 2020 code:

library(interflex)  # binning estimator from Hainmueller et al. 2020

# Rescale pre_sexism to the 0-to-16 sexism index used in the post
dat$sexism16 <- (dat$pre_sexism-1)*4
summary(dat$sexism16)

# Binning estimator with six bins; the control condition is the base category
p1 <- inter.binning(data=dat, Y="perchange_vote", D="condition2", X="sexism16", nbins=6, base="Control")
plot(p1)

---

In a Monkey Cage post and Chapter 6 of their Ignored Racism book, Mark D. Ramirez and David A.M. Peterson reported on a conjoint experiment in which White adult U.S. citizens were shown profiles of two target persons and were asked "Which of these citizens do you prefer to keep registered to vote?". The experiment manipulated target characteristics in the profiles, such as race, gender, and criminal status.

Latina/o racism-ethnicism (LRE) was measured with responses to four "modern racism"-type items, such as "Many other ethnic groups have successfully integrated into American culture. Latinos and Hispanics should do the same without any special favors".

Results in Figure 6.7 indicated that high LRE participants favored White targets over Hispanic targets. But Figure 6.7 results also indicated that low LRE participants favored Hispanic targets over White targets. This experiment thus provided further evidence that a nontrivial percentage of participants at low levels of modern racism / modern sexism items have racial bias and/or gender bias. Here is a prior post on a study indicating that persons at low levels of hostile sexism discriminated against men.

---

PS Political Science & Politics recently published Liu et al. 2020 "The Gender Citation Gap in Undergraduate Student Research: Evidence from the Political Science Classroom". The authors use their study to discuss methods to address gender bias in citations among students:

To the extent that women, in fact, are underrepresented in undergraduate student research, the question becomes: What do we, as a discipline, do about this?...

However, Liu et al. 2020 do not establish that women authors were unfairly underrepresented in student research, because Liu et al. 2020 did not compare citation patterns to a benchmark of the percentage of women that should be cited in the absence of gender bias.

PS Political Science & Politics has a relevant article for benchmarking: Teele and Thelen 2017, in which Table 1 reports the percentage of authors who are women for research articles published from 2000 to 2015 in ten top political science journals. Based on that table, about 26.3% of authors were women.

The Liu et al. 2020 student sample had 75 male students and 65 female students, with male students citing 21.2% women authors and female students citing 33.1% women authors, so the percentage of women cited by the students overall was about 26.7% when weighted by student gender, which is remarkably close to the 26.3% benchmark.
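As a quick check of that arithmetic (an R sketch using the counts and percentages reported above):

# Percentage of women authors cited, averaged over students and weighted by the
# number of male and female students
n_male <- 75; n_female <- 65
pct_male <- 21.2; pct_female <- 33.1
(n_male*pct_male + n_female*pct_female) / (n_male + n_female)  # about 26.7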

There might be sufficient evidence to claim that the 95% confidence interval for male students does not contain the proper benchmark, and the same might be true for female students, but the 26.3% benchmark from Teele and Thelen 2017 might not be the correct benchmark: for example, maybe students wrote more on topics for which women have published relatively more, or maybe students drew from publications from before 2000 (during which women were a smaller percentage of political scientists than from 2000 to 2015). But the correct benchmark for inferring that women authors were unfairly underrepresented should have been addressed before PS published the final paragraph of Liu et al. 2020, with recommendations about how to address women's under-representation in undergraduate student research.

---

The 2018 Cooperative Congressional Election Survey included two items labeled as measures of "sexism", for which respondents received five response options from "strongly agree" to "strongly disagree". One of these sexism measures is the Glick and Fiske 1996 hostile sexism statement that "Feminists are making entirely reasonable demands of men". This item was recently used in the forthcoming Schaffner 2020 article in the British Journal of Political Science.

It is not clear to me what "demands" the statement refers to. Moreover, it seems plausible that Democrats would conceptualize these demands differently than Republicans do, so that, in effect, many Democrats would be responding to a different item than many Republicans. Democrats might be more likely to think of reasonable demands such as support for equal pay for equal work, but Republicans might be more likely to think of more disputable demands such as support for taxpayer-funded late-term abortions.

---

To assess whether CCES 2018 respondents were thinking only of reasonable demands such as men's support for equal pay for equal work, let's check data from the 2016 American National Election Studies Time Series Study, which asked post-election survey participants to respond to the item: "Do you favor, oppose, or neither favor nor oppose requiring employers to pay women and men the same amount for the same work?".

In weighted ANES 2016 data, 87% of participants asked that item favored requiring employers to pay women and men the same amount for the same work, including non-substantive responses, with a 95% confidence interval of [86%, 89%]. However, in weighted CCES 2018 post-election data, only 38% of participants somewhat or strongly agreed that feminists are making entirely reasonable demands of men, including non-substantive responses, with a 95% confidence interval of [37%, 39%].

So, in these weighted national samples, 87% favored requiring employers to pay women and men the same amount for the same work, but only 38% agreed that feminists are making entirely reasonable demands of men. I think that this is strong evidence that a large percentage of U.S. adults do not think of only reasonable demands when responding to the statement that "Feminists are making entirely reasonable demands of men".

---

To address the concern that the interpretation of the "demands" differs by partisanship, here are support levels by partisan identification:

Democrats

  • 92% favor requiring employers to pay women and men the same amount for the same work [2016 ANES]
  • 59% agree that feminists are making entirely reasonable demands of men [2018 CCES]
  • 33 percentage-point difference

Republicans

  • 84% favor requiring employers to pay women and men the same amount for the same work [2016 ANES]
  • 18% agree that feminists are making entirely reasonable demands of men [2018 CCES]
  • 66 percentage-point difference

So that's an 8-point Democrat/Republican gap in favoring requiring employers to pay women and men the same amount for the same work, but a 41-point Democrat/Republican gap in agreement that feminists are making entirely reasonable demands of men.

I think that this is at least suggestive evidence that a nontrivial percentage of Democrats and an even higher percentage of Republicans are not thinking of reasonable feminist demands such as support for equal pay for equal work. If, when responding to the "feminist demands" item, Democrats on average think of different demands than Republicans do, then it seems like poor research design to infer sexism and its association with politically relevant variables from an item so vague that different political groups interpret it differently.

---

NOTES:

1. ANES 2016 citations:

The American National Election Studies (ANES). 2016. ANES 2012 Time Series Study. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2016-05-17. https://doi.org/10.3886/ICPSR35157.v1.

ANES. 2017. "User's Guide and Codebook for the ANES 2016 Time Series Study". Ann Arbor, MI, and Palo Alto, CA: The University of Michigan and Stanford University.

2. CCES 2018 citation:

Stephen Ansolabehere, Brian F. Schaffner, and Sam Luks. Cooperative Congressional Election Study, 2018: Common Content. [Computer File] Release 2: August 28, 2019. Cambridge, MA: Harvard University [producer] http://cces.gov.harvard.edu.

3. ANES 2016 Stata code:

* Equal pay item (V162149)
tab V162149

* Restrict to respondents who completed the post-election interview
tab V160502
keep if V160502==1
tab V162149

* Code 1 for favoring equal pay; code oppose, neither, and non-substantive responses as 0
gen favorEQpay = V162149
recode favorEQpay (-9 -8 2 3=0)
tab V162149 favorEQpay, mi

* Weighted estimates, overall and by party identification (V161155: 1=Democrat, 2=Republican)
svyset [pweight=V160102], strata(V160201) psu(V160202)
svy: prop favorEQpay
tab V161155
svy: prop favorEQpay if V161155==1
svy: prop favorEQpay if V161155==2

4. CCES 2018 Stata code:

* "Feminists are making entirely reasonable demands of men" item (CC18_422d), by post-election participation
tab CC18_422d tookpost, mi
tab CC18_422d tookpost, mi nol

* Restrict to respondents who took the post-election wave
keep if tookpost==2
tab CC18_422d, mi

* Code 1 for strongly or somewhat agree; code other responses, including non-substantive responses, as 0
gen femagree = CC18_422d
recode femagree (3/5 .=0) (1/2=1)
tab CC18_422d femagree, mi

* Weighted estimates, overall and by partisan identification (CC18_421a)
svyset [pw=commonpostweight]
svy: prop femagree
tab CC18_421a
svy: prop femagree if CC18_421a==1
svy: prop femagree if CC18_421a==2

---

Brian Schaffner posted a paper ("How Political Scientists Should Measure Sexist Attitudes") that engaged critiques I raised in this symposium entry about the gender asymmetry in research on gender attitudes. This post provides comments on the part of the paper that engages with my critiques.

---

Schaffner placed men as the subject of five hostile sexism items, used responses to these items to construct a male-oriented hostile sexism scale, placed that scale into a regression alongside a female-oriented hostile sexism scale, and discussed results, such as (p. 39):

...including this scale in the models of candidate favorability or issue attitudes does not alter the patterns of results for the hostile sexism scale. The male-oriented scale demonstrates no association with gender-related policies, with coefficients close to zero and p-values above .95 in the models asking about support for closing the gender pay gap and relaxing Title IX.

The male-oriented versions of these hostile sexism items include "Most men interpret innocent remarks or acts as being sexist" and "Many men are actually seeking special favors, such as hiring policies that favor them over women, under the guise of asking for equality".

These item stems reflect negative stereotypes about women, and it's not clear to me that the items should be expected to perform as well at measuring "hostility towards men" (p. 39) as they perform at measuring hostility against women when women are the target of the items. I discussed in this prior post Schaffner 2019 Figure 2, which indicated that participants at low levels of hostile sexism discriminated against men; so the Schaffner 2019 data contain participants who prefer women to men, but the male-oriented version of hostile sexism doesn't sort them sufficiently well.

If a male-oriented hostile sexism scale is to compete in a regression against a female-oriented hostile sexism scale, then interpretation of the results needs to be informed by how well each scale measures sexism against its target. I think an implication of the Schaffner 2019 results is that placing men as the target of hostile sexism items doesn't produce a good measure of sexism against men.

---

The male-oriented hostile sexism scale might be appropriate as a "differencer", in the way that stereotype scale responses about Whites can be used to better measure stereotype scale responses about Blacks. For example, for the sexism items, a sincerely-responding participant who strongly agrees that people in general are too easily offended would be coded as a hostile sexist by the woman-oriented hostile sexism item but would be coded as neutral by a "differenced" hostile sexism item.

I don't know that this differencing should be expected to overturn inferences, but I think that it is plausible that this differencing would improve the sorting of participants by levels of sexism.
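To illustrate the idea, here is a minimal sketch of one way to construct such a differenced measure, with hypothetical variable names: hs_women and hs_men for the female-oriented and male-oriented hostile sexism scales, each coded to run from 0 to 1.

# Hypothetical differenced hostile sexism measure: female-oriented scale minus the
# male-oriented scale, rescaled to run from 0 (low) to 1 (high)
dat$hs_differenced <- (dat$hs_women - dat$hs_men + 1) / 2

# A sincerely-responding participant who agrees with an item stem equally regardless
# of whether women or men are the target lands at the 0.5 midpoint, so only the
# asymmetry between targets counts toward the measure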

---

Schaffner 2019 Figure A.1 indicates that hostile sexism was associated with lower favorability ratings of female candidates Warren and Harris and with higher favorability ratings of Trump; see Table A.4 for more on this, and see Table A.5 for associations with policy preferences. However, given that low levels of hostile sexism are associated with discrimination against men, I don't think that these associations in isolation are informative about whether sexism against women causes such support for political candidates or policies.

---

If I were to analyze the Schaffner 2019 data, here are a few things that I would look for:

[1] Comparison of the coefficient for the female-oriented hostile sexism scale to the coefficient for a "differenced" hostile sexism scale, predicting Trump favorability ratings.

[2] Assessment of whether responses to certain items predict discrimination by target sex in the conjoint experiment, such as for participants who strongly supported or strongly opposed the pay gap policy item or participants with relatively extreme ratings of Warren, Harris, and Trump (say, top 25% and bottom 25%).
