The Peterson et al. 2019 PLOS ONE article "Mitigating gender bias in student evaluations of teaching" reported on an experiment conducted with students across four Spring 2018 courses: an introduction to biology course taught by a female instructor, an introduction to biology course taught by a male instructor, an introduction to American politics course taught by a female instructor, and an introduction to American politics course taught by a male instructor. Students completing evaluations of these instructors were randomly assigned to receive, or not to receive, a statement noting that student evaluations of teaching are often biased against women and instructors of color.

The results clearly indicated that "this intervention improved the SET scores for the female faculty" (p. 8). But that result does not by itself address the mitigation of bias promised in the article's title because, as the article acknowledges, "It is also possible that the students with female instructors who received the anti-bias language overcompensated their evaluations for the cues they are given" (p. 8).

---

For the sake of illustration, let's assume that the two American politics teachers were equally effective and that the two biology teachers were equally effective; if so, data from the Peterson et al. 2019 experiment for the v19 overall evaluation of teaching item illustrate how the treatment can both mitigate and exacerbate gender bias in student evaluations.

Here are the mean student ratings on v19 for the American politics instructors:

4.65     Male American politics teacher CONTROL
4.17     Female American politics teacher CONTROL
4.58     Male American politics teacher TREATMENT
4.53     Female American politics teacher TREATMENT

So, for the American politics teachers, the female teacher had a 0.49-point disadvantage in the control condition (p=0.02) but only a 0.05-point disadvantage in the treatment condition (p=0.79). But here are the means for the biology teachers:

3.72     Male biology teacher CONTROL
4.02     Female biology teacher CONTROL
3.73     Male biology teacher TREATMENT
4.44     Female biology teacher TREATMENT

So, for the biology teachers, the male teacher had a 0.29-point disadvantage in the control condition (p=0.25) and a 0.71-point disadvantage in the treatment condition (p<0.01).
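
As a quick arithmetic check, here is a minimal R sketch (mine, not the article's) that recomputes the female-minus-male gaps from the rounded means above; the 0.48 and 0.30 gaps differ by a hundredth from the 0.49 and 0.29 in the text, which I assume reflects rounding of the reported cell means:

# Reported v19 cell means, rounded to two decimals
male   <- c(pol_control=4.65, pol_treatment=4.58, bio_control=3.72, bio_treatment=3.73)
female <- c(pol_control=4.17, pol_treatment=4.53, bio_control=4.02, bio_treatment=4.44)

# Positive values favor the female teacher in that cell
round(female - male, 2)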

---

I did not see any data reported in the PLOS ONE article that can resolve whether the treatment mitigated, exacerbated, or did not affect gender bias in the student evaluations of the biology teachers or the American politics teachers. The article's claim about mitigating bias is, by my read of the article, rooted in the "decidedly mixed" (p. 2) literature and, in particular, in its reference 5, MacNell et al. 2015. For example, from Peterson et al. 2019:

These effects [from the PLOS ONE experiment] were substantial in magnitude; as much as half a point on a five-point scale. This effect is comparable with the effect size due to gender bias found in the literature [5].

The MacNell et al. 2015 sample consisted of students evaluating assistant instructors for an online course, with sample sizes for the four cells (actual instructor gender × perceived instructor gender) of 8, 12, 12, and 11. That's the basis for "the effect size due to gender bias found in the literature": a non-trivially underpowered experiment with 43 students across four cells evaluating *assistant* instructors in an *online* course.

It seems reasonable that, before college or university departments use the Peterson et al. 2019 treatment, there should be more research assessing whether the treatment mitigates, exacerbates, or does not change gender bias in student evaluations in the situations in which the treatment would be used. For what it's worth, the gender difference has been reported to be about 0.13 points on a five-point scale based on a million or so Rate My Professors evaluations, a difference illustrated in that report as 168 additional steps in a 5,117-step day. If the true gender bias in student evaluations were 0.13 points against women, the roughly 0.4-point or 0.5-point Peterson et al. 2019 treatment effect would have overcorrected and thus exacerbated gender bias in student evaluations of teaching.
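
To make the arithmetic in that last sentence explicit, here is a sketch under those stated assumptions (a 0.13-point true bias against women and a treatment boost of roughly 0.45 points, the midpoint of the 0.4-to-0.5 range; both are illustrative numbers from the discussion above, not estimates of mine):

true_gap <- -0.13             # assumed female-minus-male gap absent any intervention
boost    <- 0.45              # rough midpoint of the 0.4-to-0.5 treatment effect
post_gap <- true_gap + boost  # 0.32, now favoring women
abs(post_gap) > abs(true_gap) # TRUE: the gap is larger in magnitude, with sign flipped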

---

NOTES:

1. Thanks to Dave Peterson for comments.

2. From what I can tell, if the treatment truly mitigated gender bias among students evaluating the biology teachers, that would mean that the male biology teacher truly did a worse job teaching than the female biology teacher did.

3. I created an index combining the v19, v20, and v23 items, which are, respectively, the overall evaluation of teaching, a rating of teaching effectiveness, and the overall evaluation of the course. Here are the mean student ratings on the index for the American politics instructors:

4.56     Male American politics teacher CONTROL
4.21     Female American politics teacher CONTROL
4.36     Male American politics teacher TREATMENT
4.46     Female American politics teacher TREATMENT

So, for the American politics teachers, the female teacher had a 0.35-point disadvantage in the control condition (p=0.07) but a 0.10-point advantage in the treatment condition (p=0.59). But here are the means for the biology teachers:

3.67     Male biology teacher CONTROL
3.90     Female biology teacher CONTROL
3.64     Male biology teacher TREATMENT
4.39     Female biology teacher TREATMENT

So, for the biology teachers, the male teacher had a 0.23-point disadvantage in the control condition (p=0.35) and a 0.75-point disadvantage in the treatment condition (p<0.01).
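
For what the index construction looks like outside of Stata, here is an R one-liner mirroring the gen index line in note 5; the v19, v20, and v23 vectors below are hypothetical stand-ins, not the experiment's data:

# Hypothetical responses on the five-point scale
v19 <- c(5, 4, 3); v20 <- c(5, 4, 4); v23 <- c(4, 4, 3)

# Rowwise mean of the three items, as in the Stata code in note 5
index <- (v19 + v20 + v23) / 3
round(index, 2)  # 4.67 4.00 3.33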

4. Regarding MacNell et al. 2015 being underpowered: if we use the bottom-right cell of Table 2 in MacNell et al. 2015 to produce a gender-bias estimate of 0.50 standard deviations, the statistical power was 36% for an experiment with 20 student evaluations of instructors who were a woman or a man pretending to be a woman and 23 student evaluations of instructors who were a man or a woman pretending to be a man. If the true effect of gender bias in student evaluations is, say, 0.25 standard deviations, then the MacNell et al. study had only a 13% chance of detecting that effect.

R code:

library(pwr)

# Power given MacNell et al.'s cell sizes and the d = 0.50 gender-bias estimate
pwr.t2n.test(n1=20, n2=23, d=0.50, sig.level=0.05)  # power is about 0.36

# Power if the true gender-bias effect is instead d = 0.25
pwr.t2n.test(n1=20, n2=23, d=0.25, sig.level=0.05)  # power is about 0.13
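
As a check on what pwr.t2n.test computes, here is a by-hand version in base R (my own sketch, not from the article): power is the probability that |t| exceeds the two-sided critical value under a noncentral t distribution with noncentrality parameter d*sqrt(n1*n2/(n1+n2)).

power_check <- function(n1, n2, d, alpha = 0.05) {
  ncp  <- d * sqrt(n1 * n2 / (n1 + n2))  # noncentrality parameter
  df   <- n1 + n2 - 2
  crit <- qt(1 - alpha/2, df)            # two-sided critical value
  # probability mass of the noncentral t beyond the critical values
  pt(crit, df, ncp = ncp, lower.tail = FALSE) + pt(-crit, df, ncp = ncp)
}

power_check(20, 23, 0.50)  # about 0.36
power_check(20, 23, 0.25)  # about 0.13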

5. Stata code:

* Overall evaluation of teaching

ttest v19 if bio==0 & treatment==0, by(female)

ttest v19 if bio==0 & treatment==1, by(female)

ttest v19 if bio==1 & treatment==0, by(female)

ttest v19 if bio==1 & treatment==1, by(female)

* Teaching effectiveness:

ttest v20 if bio==0 & treatment==0, by(female)

ttest v20 if bio==0 & treatment==1, by(female)

ttest v20 if bio==1 & treatment==0, by(female)

ttest v20 if bio==1 & treatment==1, by(female)

* Overall evaluation of the course

ttest v23 if bio==0 & treatment==0, by(female)

ttest v23 if bio==0 & treatment==1, by(female)

ttest v23 if bio==1 & treatment==0, by(female)

ttest v23 if bio==1 & treatment==1, by(female)

* Examine the three items and combine them into an index

sum v19 v20 v23

pwcorr v19 v20 v23

factor v19 v20 v23, pcf

gen index = (v19 + v20 + v23)/3

sum index v19 v20 v23

* Gender comparisons on the index, by course and condition

ttest index if bio==0 & treatment==0, by(female)

ttest index if bio==0 & treatment==1, by(female)

ttest index if bio==1 & treatment==0, by(female)

ttest index if bio==1 & treatment==1, by(female)


I came across an interesting site, Dynamic Ecology, and saw a post on self-archiving of journal articles. The post mentioned SHERPA/RoMEO, which lists archiving policies for many journals. The only journal covered by SHERPA/RoMEO that I have published in that permits self-archiving is PS: Political Science & Politics, so I am linking below to pdfs of PS articles that I have published.

---

This first article attempts to help graduate students who need seminar paper ideas. The article grew out of a graduate seminar in US voting behavior with David C. Barker. I noticed that several articles on the seminar reading list had placed in top-tier journals despite making an incremental theoretical contribution and using publicly available data, which was something that I, as a graduate student, felt I could realistically aspire to.

For instance, John R. Petrocik in 1996 provided evidence that candidates and parties "owned" certain issues, such as Democrats owning care for the poor and Republicans owning national defense. Danny Hayes extended that idea by using publicly available ANES data to provide evidence that candidates and parties owned certain traits, such as Democrats being more compassionate and Republicans being more moral.

The original manuscript identified the Hayes article as a travel-type article in which the traveling is done by analogy. The final version of the manuscript lost the Hayes citation but had 19 other ideas for seminar papers. Ideas on the cutting room floor included replication and picking a fight with another researcher.

Of Publishable Quality: Ideas for Political Science Seminar Papers. 2011. PS: Political Science & Politics 44(3): 629-633.

  1. pdf version, copyright held by American Political Science Association

---

This next article grew out of reviews that I conducted for friends, colleagues, and journals. I noticed that I kept making the same or similar comments, so I produced a central repository for generalized forms of these comments in the hope that, for example, I never again review a manuscript that formally lists hypotheses about the control variables.

Rookie Mistakes: Preemptive Comments on Graduate Student Empirical Research Manuscripts. 2013. PS: Political Science & Politics 46(1): 142-146.

  1. pdf version, copyright held by American Political Science Association

---

The next article grew out of friend and colleague Jonathan Reilly's dissertation. Jonathan noticed that studies of support for democracy had treated "don't know" responses as if the respondents had never been asked the question. So even though only 73 percent of respondents in China expressed support for democracy, that figure was reported as 96 percent because "don't know" responses were removed from the analysis.

The manuscript initially did not include imputation of preferences for non-substantive responders, but a referee encouraged us to estimate missing preferences. My prior was that multiple imputation was "making stuff up," but research into missing data methods taught me that the alternative, deletion of cases, assumes that cases are missing completely at random, which did not appear to be true in our study: the percent of missing cases in a country correlated at -0.30 and -0.43 with the country's Polity IV democracy rating, which meant that respondents were more likely to issue a non-substantive response in countries where political and social liberties were more restricted.
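
As an illustration of that missingness check (with made-up numbers, not the article's data), the country-level diagnostic amounts to correlating the nonresponse rate with a democracy rating:

# Hypothetical country-level data: percent non-substantive responses and a
# Polity-style democracy rating running from -10 (autocracy) to 10 (democracy)
pct_dk <- c(27, 15, 8, 20, 5, 3, 12)
polity <- c(-7, -3, 8, -5, 9, 10, 2)
cor(pct_dk, polity)  # negative: more nonresponse where liberties are restricted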

Don’t Know Much about Democracy: Reporting Survey Data with Non-Substantive Responses. 2012. PS: Political Science & Politics 45(3): 462-467. Second author, with Jonathan Reilly.

  1. pdf version, copyright held by American Political Science Association

This post at Active Learning in Political Science describes a discussion of inequality that followed an unequal distribution of chocolate to students, meant to reflect unequal GDPs among countries:

The students then led a discussion about how the students felt, whether the wealthy students were obligated to give up some of their chocolate, and how they would convince the wealthy students to do so. Violence entered the conversation (jokingly) at one point. Eventually the discussion turned to the real-world implications, and the chocolate was widely shared.

Use of a prop like chocolate has advantages, such as raising student interest and making the discussion memorable, which likely fosters learning. But the simulation itself clouded or removed many of the features of inequality necessary for a quality discussion of global inequality and aid:

  1. A discussion of inequality among students in the same room diverts attention from impediments to sharing that real countries face: it is nearly costless to pass chocolate to the person next to you, but there is a substantial cost to packaging and shipping goods across the world.
  2. Presumably none of the students had the negative features of a regime like North Korea that would raise questions about whether direct aid might be more harmful than beneficial.
  3. The method of production of the chocolate in the simulation bears no relationship to the method of production for GDP, chocolate, or any good in the real world: countries do not "receive" goods or wealth independent of mechanisms related to the country's natural resources, the education or skill level of the population, political choices, history, and so on.
  4. The parameters of the simulation ensured that the total amount of chocolate was static, so that the production of more chocolate was not an option for the students.

The problem with simulations such as this is that the focus is placed on the simulated instead of the real.
