The Peterson et al. 2019 PLOS ONE article "Mitigating gender bias in student evaluations of teaching" reported on an experiment conducted with students across four Spring 2018 courses: two introduction to biology courses and two introduction to American politics courses, with each pair taught by one female instructor and one male instructor. Students completing evaluations of these teachers were randomly assigned either to receive or not to receive a statement about how student evaluations of teachers are often biased against women and instructors of color.

The results clearly indicated that "this intervention improved the SET scores for the female faculty" (p. 8). But that finding alone does not address the mitigation of bias promised in the article's title because, as the article indicates, "It is also possible that the students with female instructors who received the anti-bias language overcompensated their evaluations for the cues they are given" (p. 8).

---

For the sake of illustration, let's assume that the two American politics teachers were equal to each other and that the two biology teachers were equal to each other; if so, data from the Peterson et al. 2019 experiment for the v19 overall evaluation of teaching item illustrate how the treatment can both mitigate and exacerbate gender bias in student evaluations.

Here are the mean student ratings on v19 for the American politics instructors:

4.65   Male American politics teacher CONTROL
4.17   Female American politics teacher CONTROL
4.58   Male American politics teacher TREATMENT
4.53   Female American politics teacher TREATMENT

So, for the American politics teachers, the female teacher had a 0.49 disadvantage in the control condition (p=0.02) but only a 0.05 disadvantage in the treatment condition (p=0.79). But here are the means for the biology teachers:

3.72   Male biology teacher CONTROL
4.02   Female biology teacher CONTROL
3.73   Male biology teacher TREATMENT
4.44   Female biology teacher TREATMENT

So, for the biology teachers, the male teacher had a 0.29 disadvantage in the control condition (p=0.25) and a 0.71 disadvantage in the treatment condition (p<0.01).
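To make the contrast concrete, here is a minimal R sketch of the gender gaps implied by the rounded means displayed above (the gaps reported in the text come from the unrounded data, so the second decimal can differ slightly):

gap <- function(male_mean, female_mean) male_mean - female_mean

# American politics: the gap shrinks under treatment (mitigation)
gap(4.65, 4.17)  # 0.48: female disadvantage in the control condition
gap(4.58, 4.53)  # 0.05: female disadvantage in the treatment condition

# Biology: the male disadvantage widens under treatment (exacerbation)
gap(3.72, 4.02)  # -0.30: male disadvantage in the control condition
gap(3.73, 4.44)  # -0.71: male disadvantage in the treatment condition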

---

I did not see any data reported in the PLOS ONE article that can resolve whether the treatment mitigated, exacerbated, or did not affect gender bias in the student evaluations of the biology teachers or the American politics teachers. The article's claim about addressing the mitigation of bias is, by my read of the article, rooted in the "decidedly mixed" (p. 2) literature and, in particular, in its reference 5, to MacNell et al. 2015. For example, from Peterson et al. 2019:

These effects [from the PLOS ONE experiment] were substantial in magnitude; as much as half a point on a five-point scale. This effect is comparable with the effect size due to gender bias found in the literature [5].

The MacNell et al. 2015 sample was students evaluating assistant instructors for an online course, with sample sizes for the four cells (actual instructor gender × perceived instructor gender) of 8, 12, 12, and 11. That's the basis for "the effect size due to gender bias found in the literature": a non-trivially underpowered experiment with 43 students across four cells evaluating *assistant* instructors in an *online* course.

It seems reasonable that, before college or university departments use the Peterson et al. 2019 treatment, there should be more research to assess whether the treatment mitigates, exacerbates, or does not change gender bias in student evaluations in the situations in which the treatment is used. For what it's worth, based on a million or so Rate My Professors evaluations, the gender difference has been reported to be about 0.13 on a five-point scale, a difference that has been illustrated as the equivalent of 168 additional steps in a 5,117-step day. If the true gender bias in student evaluations were 0.13 units against women, the roughly 0.4-unit or 0.5-unit Peterson et al. 2019 treatment effect would have exacerbated gender bias in student evaluations of teaching, flipping the direction of the gap instead of closing it.
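The arithmetic behind that illustration and the overshoot claim can be checked directly in R; the 0.45 below is my illustrative midpoint of the reported 0.4-to-0.5 treatment effect:

0.13 / 4    # 0.0325: a 0.13 gap spans about 3.3% of the 4-unit range of a 1-to-5 scale
168 / 5117  # 0.0328: the extra-steps illustration expresses about the same share
0.45 - 0.13 # 0.32: a boost that size would more than offset a 0.13 bias against women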

---

NOTES:

1. Thanks to Dave Peterson for comments.

2. From what I can tell, if the treatment truly mitigated gender bias among students evaluating the biology teachers, that would mean that the male biology teacher truly did a worse job teaching than the female biology teacher did: with bias removed, the treatment-condition ratings of 3.73 and 4.44 would reflect true teaching performance.

3. I created an index combining the v19, v20, and v23 items, which are, respectively, the overall evaluation of teaching, a rating of teaching effectiveness, and the overall evaluation of the course. Here are the mean student ratings on the index for the American politics instructors:

4.56   Male American politics teacher CONTROL
4.21   Female American politics teacher CONTROL
4.36   Male American politics teacher TREATMENT
4.46   Female American politics teacher TREATMENT

So, for the American politics teachers, the female teacher had a 0.35 disadvantage in the control condition (p=0.07) but a 0.10 advantage in the treatment condition (p=0.59). But here are the means for the biology teachers:

3.67   Male biology teacher CONTROL
3.90   Female biology teacher CONTROL
3.64   Male biology teacher TREATMENT
4.39   Female biology teacher TREATMENT

So, for the biology teachers, the male teacher had a 0.23 disadvantage in the control condition (p=0.35) and a 0.75 disadvantage in the treatment condition (p<0.01).

4. Regarding MacNell et al. 2015 being underpowered: using the bottom-right cell of MacNell et al. 2015 Table 2 to produce a gender bias estimate of 0.50 standard deviations, statistical power was 36% for an experiment with 20 student evaluations of instructors who were a woman or a man pretending to be a woman and 23 student evaluations of instructors who were a man or a woman pretending to be a man. If the true gender bias in student evaluations is, say, 0.25 standard deviations, then the MacNell et al. study had only a 13% chance of detecting that effect.

R code:

library(pwr)

# Power to detect the d = 0.50 estimate (MacNell et al. 2015, Table 2)
# with 20 evaluations in one perceived-gender group and 23 in the other
pwr.t2n.test(n1=20, n2=23, d=0.50, sig.level=0.05)

# Power to detect a smaller true effect of d = 0.25
pwr.t2n.test(n1=20, n2=23, d=0.25, sig.level=0.05)

5. Stata code:

* Overall evaluation of teaching
* (bio==0: American politics courses; bio==1: biology courses)

ttest v19 if bio==0 & treatment==0, by(female)
ttest v19 if bio==0 & treatment==1, by(female)
ttest v19 if bio==1 & treatment==0, by(female)
ttest v19 if bio==1 & treatment==1, by(female)

* Teaching effectiveness

ttest v20 if bio==0 & treatment==0, by(female)
ttest v20 if bio==0 & treatment==1, by(female)
ttest v20 if bio==1 & treatment==0, by(female)
ttest v20 if bio==1 & treatment==1, by(female)

* Overall evaluation of the course

ttest v23 if bio==0 & treatment==0, by(female)
ttest v23 if bio==0 & treatment==1, by(female)
ttest v23 if bio==1 & treatment==0, by(female)
ttest v23 if bio==1 & treatment==1, by(female)

 

* Item distributions and correlations
sum v19 v20 v23
pwcorr v19 v20 v23

* Principal-components factor analysis to check that the items load on one factor
factor v19 v20 v23, pcf

* Index: simple average of the three items
gen index = (v19 + v20 + v23)/3
sum index v19 v20 v23

 

* Same comparisons on the index
ttest index if bio==0 & treatment==0, by(female)
ttest index if bio==0 & treatment==1, by(female)
ttest index if bio==1 & treatment==0, by(female)
ttest index if bio==1 & treatment==1, by(female)

---

In the 2019 PS: Political Science & Politics article "How Many Citations to Women Is 'Enough'? Estimates of Gender Representation in Political Science", Michelle L. Dion and Sara McLaughlin Mitchell address a question about "the normative standard for the amount women should be cited" (p. 1).

The first proposed Dion and Mitchell 2019 measure is the proportion of female members of the American Political Science Association (APSA) by section and primary field, using data from 2018. According to Dion and Mitchell 2019: "When political scientists compose course syllabi, graduate reading lists, and research bibliographies, these membership data provide guidance about the minimum representation of scholarship by women that should be included to be representative by gender" (p. 3).

But is APSA section membership in 2018 a reasonable benchmark for gender representation in course syllabi that include readings from throughout history?

Hardt et al. 2019 reported on data for readings assigned in the training of political science graduate students. Below are percentages of graduate student readings in these data that had a female first author:

Time Period      Female First Author %
Before 1970      3.5%
1970 to 1979     6.7%
1980 to 1989     11.3%
1990 to 1999     15.7%
2000 to 2009     21.0%
2010 to 2018     24.6%

So the pattern is increasing representation of women over time. If this pattern reflects increasing representation of women over time in APSA section membership, or increasing representation of women among the set of researchers whose research interests include the topic of a particular section, then APSA section membership data from 2018 will overstate the percentage of women needed for fair gender representation on syllabi or research bibliographies. For illustrative purposes, if a section had 20% women across the 1990s, 30% women across the 2000s, and 40% women across the 2010s, a fair "section membership" benchmark for gender representation on syllabi would not be a flat 40%; rather, it would be something like 20% women for syllabi readings from the 1990s, 30% for readings from the 2000s, and 40% for readings from the 2010s.
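A minimal R sketch of that era-matched benchmark, using the hypothetical 20/30/40 percentages above and assuming, purely for illustration, a syllabus that draws equally on the three decades:

# Hypothetical share of women in the section, by decade (numbers from the example above)
share_women <- c("1990s" = 0.20, "2000s" = 0.30, "2010s" = 0.40)

# Assumed share of syllabus readings drawn from each decade
reading_share <- c(1/3, 1/3, 1/3)

# Era-matched benchmark: weight each decade's membership share by its reading share
sum(share_women * reading_share)  # 0.30, not the 0.40 from end-of-period membership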

---

Dion and Mitchell 2019 propose another measure that is biased in the same direction and for the same reason: the gender distribution of authors by journal from 2007 through 2016, for available years.

About 68% of the readings in the Hardt et al. 2019 graduate training data were published prior to 2007: 15% of these pre-2007 readings had a female first author, but 24% of the 2007-2016 readings in the data had a female first author.

Older readings appear in the Hardt et al. 2019 data with decent frequency: 42% of the readings that had the gender of the first author coded were published before 2000. However, the Dion and Mitchell 2019 measure of journal representation from 2007 to 2016 ignores these older readings, which biases the measure in favor of women if fair representation means matching the representation in the relevant pool of syllabi-worthy journal articles.
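A rough check in R of the size of this bias, treating the remaining 32% of readings as if all fell in the 2007-2016 window (an approximation, since some were published after 2016):

# Weighted average of the pool implied by the Hardt et al. 2019 figures
0.68 * 0.15 + 0.32 * 0.24  # about 0.18, well below the 24% a 2007-2016 benchmark implies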

---

In a sense, this bias in the Dion and Mitchell 2019 measures might not matter much if the measures are used in the biased manner that Dion and Mitchell 2019 proposed (p. 6):

We remedy this gap by explicitly providing conservative estimates of gender diversity based on organization membership and journal article authorship for evaluating gender representation. Instructors, researchers, and editors who want to ensure that references are representative can reference these as floors (rather than ceilings) for minimally representative citations.

The Dion and Mitchell 2019 suggestion above is that instructors, researchers, and editors who want to ensure that references are representative use a conservative estimate as a floor. Both the conservative nature of the estimate and its use as a floor would produce a bias favoring women, so I'm not sure how that is helpful for instructors, researchers, and editors who want to ensure that references are representative.

---

NOTE:

1. Stata code for the analysis of the Hardt et al. 2019 data:

* Female first-author share by decade of publication
tab female1 if year<1970
tab female1 if year>=1970 & year<1980
tab female1 if year>=1980 & year<1990
tab female1 if year>=1990 & year<2000
tab female1 if year>=2000 & year<2010
tab female1 if year>=2010 & year<2019

 

tab female1

* Share of readings published before 2000
tab female1 if year<2000

* 36,791 of the 87,398 readings with first-author gender coded were pre-2000
di 36791/87398
