Let's continue our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias". See here for the first entry in the series and here for other entries.

---

7.

Spooren et al. 2013 "On the validity of student evaluation of teaching: The state of the art" is a review that, as far as I can tell, does not report novel data on unfair sex or race bias in student evaluations of teaching.

---

8.

Laube et al. 2007 "The impact of gender on the evaluation of teaching: What we know and what we can do" is a review that, as far as I can tell, does not report novel data on unfair sex or race bias in student evaluations of teaching.

---

9.

Stark and Freishtat 2014 "An evaluation of course evaluations" is a discussion that, as far as I can tell, does not report novel data on unfair sex or race bias in student evaluations of teaching.

---

Comments are open if you disagree, but I don't think that any of these three studies report a novel test for unfair sex or race bias in student evaluations of teaching using a research design with internal validity. I think that these publications would be more appropriate in a separate section of Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" instead of in their list of academic articles, book chapters, and working papers finding bias.


Let's continue our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias". See here for the first entry in the series and here for other entries.

---

4.

El-Alayli et al. 2018 "Dancing backwards in high heels: Female professors experience more work demands and special favor requests, particularly from academically entitled students" does not present novel evidence about bias in student evaluations of teaching. Instead: "The current research examined the extra burdens experienced by female professors in academia in the form of receiving more work demands from their students" (p. 145).

---

5.

Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" lists as "finding bias" Hessler et al. 2018 "Availability of cookies during an academic course session affects evaluation of teaching". I'm not sure why this study is included in a list that one of the Holman et al. 2019 coauthors described as a "list of 76 articles demonstrating gender and/or racial bias in student evaluations". The Hessler et al. 2018 experimental design focused on the provision or non-provision of cookies; the study also varied the teacher, with Teacher A handling 10 groups of students and Teacher B handling the other 10 groups, but the p-value for this teacher variable was 0.514 in the Table 3 regression predicting the summation score.

---

6.

The Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" list doesn't provide a summary for Uttl et al. 2017 "Meta-analysis of faculty's teaching effectiveness: Student evaluation of teaching ratings and student learning are not related", so I'm not sure why this study is included in a list that one of the Holman et al. 2019 coauthors described as a "list of 76 articles demonstrating gender and/or racial bias in student evaluations".

For what it's worth, I don't know that student evaluations of teaching being uncorrelated with learning is much of a problem, unless student evaluations of teaching are used as a measure of student learning. For example, if an instructor received a low score on an item asking about the instructor's availability outside of class because the instructor is not available outside of class, then I don't see why responses to that instructor availability item would need to be correlated with student learning in order to be a valid measure of the instructor's availability outside of class.

---

Comments are open if you disagree, but I don't think that any of these three studies report a novel test for unfair sex or race bias in student evaluations of teaching using a research design with internal validity.


My prior post on Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" indicated that:

I think there would be value in a version of "Evidence of Bias in Standard Evaluations of Teaching" that accurately summarizes each study that has tested for unfair bias in student evaluations of teaching using a research design with internal validity and plausibly sufficient statistical power, especially if each summary were coupled with a justification of why the study provides credible evidence about unfair bias in student evaluations of teaching.

Pursuant to a discussion with Holman et al. 2019 co-author Dr. Rebecca Kreitzer, I thought that it might be a good idea for me to occasionally read and discuss a study that Holman et al. has categorized as finding bias.

---

1.

I have already posted about Peterson et al. 2019 "Mitigating gender bias in student evaluations of teaching". Holman et al. 2019 includes that article in the list of academic articles, book chapters, and working papers finding bias, so let's start there...

I do not perceive how the results in Peterson et al. 2019 can be read as finding bias. Feel free to read the article yourself or to read the Holman et al. 2019 summary of the article. Peterson et al. 2019 indicates that their results "indicate that a relatively simple intervention in language can potentially mitigate gender bias in student evaluation of teaching", but their research design does not permit an inference that bias was present among students in the control group.

---

2.

Given that I am familiar with the brilliance research discussed in this Slate Star Codex post, let's move on to Storage et al. 2016 "The frequency of 'brilliant' and 'genius' in teaching evaluations predicts the representation of women and African Americans across fields", which reported evidence of a difference found in RateMyProfessors data:

Across the 18 fields in our analysis, "brilliant" was used in a 1.81:1 male:female ratio and "genius" in a 3.10:1 ratio... In contrast, we found little evidence of gender bias in use of "excellent" and "amazing" in online evaluations, with male:female ratios of 1.08:1 and 0.91:1, respectively.

But is the male/female imbalance in the frequency of "brilliant" and "genius" an unfair bias? One alternate explanation is that male instructors are more likely than female instructors to be in fields in which students use "brilliant" and "genius" in RateMyProfessors comments; that pattern appears in Storage et al. 2016 Figure 2. Another alternate explanation is that a higher percentage of male instructors than female instructors are "brilliant" and "genius"; for what it's worth, my analysis here indicates that male test-takers are disproportionately at the highest scores on the SAT-Math test, even accounting for the higher number of female SAT test-takers.
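To illustrate the field-composition explanation, here is a minimal sketch with purely hypothetical numbers: the fields, instructor counts, and comment rates below are invented and are not from Storage et al. 2016. The sketch only shows how an overall male:female imbalance in "brilliant" comments can arise even if, within every field, male and female instructors receive "brilliant" comments at identical rates.

    # Hypothetical illustration only: the numbers below are invented, not from Storage et al. 2016.
    # Suppose "brilliant" comments are common in one field and rare in another, and that within
    # each field male and female instructors receive "brilliant" comments at the same rate.
    fields = {
        # field: (male instructors, female instructors, expected "brilliant" comments per instructor)
        "field_A": (800, 200, 0.30),
        "field_B": (200, 800, 0.05),
    }

    male_brilliant = sum(m * rate for m, f, rate in fields.values())
    female_brilliant = sum(f * rate for m, f, rate in fields.values())

    print(f"male 'brilliant' comments:   {male_brilliant:.0f}")    # 250
    print(f"female 'brilliant' comments: {female_brilliant:.0f}")  # 100
    print(f"overall male:female ratio:   {male_brilliant / female_brilliant:.2f}:1")  # 2.50:1

The point of the sketch is only that an overall ratio by itself cannot distinguish field composition from unfair bias in how students comment on individual instructors.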

It's certainly possible that, accounting for these and other plausible alternate explanations, student comments are unfairly more likely to refer to male instructors than female instructors as "brilliant" and "genius". But it's not clear that the Storage et al. 2016 analysis permits such an inference of unfair bias.

From what I can tell, the main implication of research on bias in student evaluations of teaching concerns whether student evaluations of teaching should be used in employment decisions. Data from Storage et al. 2016 are from RateMyProfessors, so another hurdle for anyone properly using Storage et al. 2016 for the purpose of undercutting the use of student evaluations of teaching in employment decisions is producing a plausible argument that the "brilliant" and "genius" pattern in RateMyProfessors comments is representative of comments on student evaluations conducted by a college or university that are used in employment decisions.

Another hurdle is establishing that any instructor's employment would be nontrivially affected by a less-frequent-than-deserved use of "brilliant" and "genius" in student evaluation comments conducted by a college or university or on the RateMyProfessors site.

---

3.

Let's move on to another publication that Holman et al. 2019 has listed as finding bias: Piatak and Mohr 2019 "More gender bias in academia? Examining the influence of gender and formalization on student worker rule following".

It's not clear to me why an article reporting on a study of "student worker rule following" should be included in a list of "Evidence of Bias in Standard Evaluations of Teaching".

---

Comments are open if you disagree, but I don't see anything in Peterson et al. 2019 or Storage et al. 2016 or Piatak and Mohr 2019 that indicates a test for unfair bias in student evaluations of teaching using a research design with internal validity: from what I can tell, Peterson et al. 2019 had no test for unfair bias, Storage et al. 2016 did not address plausible alternate explanations, and Piatak and Mohr 2019 isn't even about student evaluations of teaching.


"Evidence of Bias in Standard Evaluations of Teaching" (Mirya Holman, Ellen Key, and Rebecca Kreitzer, 2019) has been cited as evidence of bias in student evaluations of teaching.

I am familiar with Mitchell and Martin 2018, so let's check how that study is summarized in the list, as archived on 20 November 2019. I count three substantive errors and one spelling error in the summary, not counting the "fgender" in the header or the singular "RateMyProfessor". The three substantive errors:

  1. The summary referred to the online courses as being from different universities, but all of the online courses in the Mitchell and Martin 2018 analysis were at the same university.
  2. The summary referred to "female instructors" and "male professors", but the Mitchell and Martin 2018 analysis compared comments and evaluations for only one female instructor to comments and evaluations for only one male instructor.
  3. The summary indicated that female instructors were evaluated differently in intelligence, but no Mitchell and Martin 2018 table reported a statistical significance asterisk for the Intelligence/Competency category.

---

The aforementioned errors in the summary of Mitchell and Martin 2018 can be easily fixed, but that would not address a flaw in a particular use of the list: from what I can tell, Mitchell and Martin 2018 itself has errors that undercut the inference about students using different language when evaluating female instructors than when evaluating male instructors. Listing that study and other studies based on an uncritical reading of results shouldn't be convincing evidence of bias in student evaluations of teaching, especially if the categorization of studies does not indicate whether "bias" is operationalized as an unfair difference or as a mere difference.

I think there would be value in a version of "Evidence of Bias in Standard Evaluations of Teaching" that accurately summarizes each study that has tested for unfair bias in student evaluations of teaching using a research design with internal validity and plausibly sufficient statistical power, especially if each summary were coupled with a justification of why the study provides credible evidence about unfair bias in student evaluations of teaching. But I don't see why anyone should be convinced by "Evidence of Bias in Standard Evaluations of Teaching" in its current form.


This post discusses whether the lowest levels of hostile sexism are properly understood as indicating the lowest measured levels of sexism.

---

Barnes et al. 2018 "Sex and Corruption: How Sexism Shapes Voters' Responses to Scandal" in Politics, Groups, and Identities (ungated) reported results from an experiment with an outcome variable with four levels, from "very unlikely" to "very likely", on which participants indicated how likely they would be to vote for a hypothetical representative in the next election. Treatments were the representative's sex (man or woman) and the type of scandal that the representative had been involved in (corruption or sex).

Hostile sexism (Glick and Fiske 1996) was measured with three items:

  1. Women are too easily offended
  2. Most women fail to appreciate all that men do for them
  3. Women exaggerate their problems

Below is the hostile sexism panel for the sex scandal condition, from Barnes et al. 2018 Figure 2. The right side of the panel suggests that participants at the highest levels of hostile sexism were biased against women. But the left side of the panel suggests that participants at the lowest levels of hostile sexism were biased against men.

That low levels of hostile sexism do not indicate the absence of sexism seems plausible given that, in the article, the lowest level of hostile sexism for participants responding to all hostile sexism items required participants to disagree as much as possible on a 7-point scale with mildly negative statements about women, such as the statement that "Most women fail to appreciate all that men do for them". Strong disagreement with this statement is equivalent to expressing the view that most women appreciate all that men do for them, and it seems at least possible that persons with such a positive view of women might be unfairly biased in favor of women. Another way to think of it is that persons unfairly biased in favor of women must fall somewhere on the hostile sexism measure, and it seems plausible that these persons would place themselves at or toward the lower end of the measure.
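As a minimal sketch of the scale arithmetic, assume (this is my assumption, not something stated in the article) that the three 7-point items are averaged and the average is rescaled to run from 0 to 1; under that coding, a score of 0 is possible only for a participant who gives the strongest possible disagreement to every one of the three statements about women.

    def hostile_sexism_score(responses, scale_min=1, scale_max=7):
        # Rescale the mean of 7-point agreement responses to the 0-to-1 range.
        # This mirrors a common coding convention; it is an assumption, not a
        # description of the exact coding used in Barnes et al. 2018.
        mean = sum(responses) / len(responses)
        return (mean - scale_min) / (scale_max - scale_min)

    print(hostile_sexism_score([1, 1, 1]))  # 0.0: strongest disagreement with all three items
    print(hostile_sexism_score([1, 1, 2]))  # about 0.06: one slightly-less-extreme response moves the score off the floor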

"Sex and Corruption" co-author Emily Bacchus sent me data and code for the article, and these data indicate that the patterns for the dichotomous "very unlikely" outcome variable in the above plot hold when the outcome variable is coded with all four measured levels of vote likelihood, as in the plot below, in which light blue dots are for the male candidate and pink dots are for the female candidate:

Further analysis suggested that, in the sex scandal plot, much or all of the modeled discrimination against men at the lower levels of hostile sexism is due to the linear model and to a relatively large discrimination against women at higher levels of hostile sexism. For example, for levels of hostile sexism from 0.75 through 1, there is a 0.75 discrimination against women (Ns of 20 and 32, p<0.01), and only 4 participants scored a 1 for hostile sexism; for levels of hostile sexism from 0.25 through 0.75, there is a 0.05 discrimination against men (Ns of 169 and 155, p=0.57); for levels of hostile sexism from 0 through 0.25, there is a 0.20 discrimination against men (Ns of 95 and 94, p=0.07); and for levels of hostile sexism at exactly 0, there is a 0.09 discrimination against men (Ns of 35 and 28, p=0.70).
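For readers who want to check comparisons like these against the shared replication data, a sketch of the general approach is below: a difference in means between the female-candidate and male-candidate conditions within each hostile sexism range. The file name and column names (scandal_type, hostile_sexism, female_candidate, vote_likelihood) are placeholders that I made up, and whether a Welch t-test matches the tests behind the p-values reported above is my assumption.

    import pandas as pd
    from scipy import stats

    # Placeholder file and column names; the shared Barnes et al. 2018 replication
    # data may be organized differently.
    df = pd.read_csv("barnes_et_al_2018.csv")
    sex_scandal = df[df["scandal_type"] == "sex"]

    for lo, hi in [(0.75, 1.00), (0.25, 0.75), (0.00, 0.25), (0.00, 0.00)]:
        subset = sex_scandal[(sex_scandal["hostile_sexism"] >= lo) & (sex_scandal["hostile_sexism"] <= hi)]
        female = subset.loc[subset["female_candidate"] == 1, "vote_likelihood"]
        male = subset.loc[subset["female_candidate"] == 0, "vote_likelihood"]
        diff = female.mean() - male.mean()  # negative values indicate lower vote likelihood for the female candidate
        t_stat, p_value = stats.ttest_ind(female, male, equal_var=False)
        print(f"hostile sexism {lo:.2f} to {hi:.2f}: diff = {diff:.2f}, "
              f"Ns = ({len(female)}, {len(male)}), p = {p_value:.2f}")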

---

Recent political science research that I am familiar with and that has used a hostile sexism measure has, I think, at least implied that lower levels of hostile sexism are normatively good. For example, the Barnes et al. 2018 article discussed "individuals who hold sexist attitudes" (p. 14, implying that some participants did not hold sexist attitudes), and a plot in Luks and Schaffner 2019 labeled the low end of a hostile sexism measure as "least sexist". However, it is possible that a nontrivial share of persons at the lower levels of hostile sexism are sexist against men. I don't think that this possibility can be conclusively accepted or rejected based on the Barnes et al. 2018 data, but I do think that it matters whether the proper labeling of the low end of hostile sexism is "least sexist" or is "most sexist against men", to the extent that such unambiguous labels can be properly used for the lower end of the hostile sexism measure.

---

NOTES

Thanks to Emily Beaulieu and her co-authors for comments and for sharing data and code, and thanks to Peter Glick and Susan Fiske for comments.


On October 27, 2019, U.S. Representative Katie Hill announced her resignation from Congress after her involvement in a sex scandal, claiming that she was leaving "because of a double standard".

There is a recently published article that reports on an experiment that can be used to assess such a double standard among the public, at least with an MTurk sample of over 1,000, with women about 45% of the sample: Barnes et al. 2018 "Sex and corruption: How sexism shapes voters' responses to scandal" in Politics, Groups, and Identities (ungated). Participants in the Barnes et al. 2018 experiment indicated on a four-point scale how likely they would be to vote for a representative in the next election; the experiment manipulated the hypothetical U.S. Representative's sex (man or woman) and the type of scandal that the representative had been involved in (corruption or sex).

Results in Barnes et al. 2018 Figure 1 indicated that, among participants assigned to the sex scandal condition, reported vote likelihoods for the female representative were not lower than reported vote likelihoods for the male representative.

---

The Monkey Cage published a post by Michael Tesler, entitled "Was Rep. Katie Hill held to a higher standard than men in Congress? This research suggests she was". The post did not mention the Barnes et al. 2018 experiment.

---

Mischiefs of Faction published a post by Gregory Koger and Jeffrey Lazarus that did mention the Barnes et al. 2018 experiment, but the Koger/Lazarus post did not mention the null finding across the full sample. The post instead mentioned correlates of relative disfavoring of the female candidate (links omitted in the quoted passage below):

One answer is that there is sexist double standard for female politicians. One recently published article (ungated) by Tiffany Barnes, Emily Beaulieu, and Gregory Saxton finds that citizens are more likely to disapprove of a sex scandal by a female politician if they a) generally disapprove of women "usurping men's power," or b) see themselves as protectors of women, with protection contingent upon conformity to traditional gender roles. Both dynamics help explain why alleged House-rule-breaker Hill is resigning, while alleged federal-lawbreaker Hunter was reelected in 2018 and shows no interest in resigning.

The Koger/Lazarus post doesn't explain why these correlates are more important than the result among all participants or, for that matter, more important than the dynamic in Barnes et al. 2018 Figure 2 among participants with low hostile sexism scores.

The Koger/Lazarus post suggests that the Barnes et al. 2018 experiment detected a correlation between relative disfavoring of the female politician involved in a sex scandal and participant responses to a benevolent sexism scale (the "b" part of the passage quoted above). I don't think that is a correct description of the results: see Barnes et al. 2018 Table 1, Barnes et al. 2018 Figure 2, and/or the Barnes et al. 2018 statement that "Participants are thus unlikely to differentiate between the sex of the representative when responding to allegations about the representative's involvement in a sex scandal, regardless of the participant's level of benevolent sexism" (p. 13).

For what it's worth, the Barnes et al. 2018 abstract can be read as suggesting that the experiment did detect a bias among persons with high scores on a benevolent sexism scale.

---

Barnes et al. 2018 is a recently published large-sample experiment that found that, in terms of vote likelihood, participants assigned to a hypothetical female U.S. Representative involved in a sex scandal treated that female representative remarkably similarly to the way in which participants assigned to the hypothetical male representative involved in a sex scandal treated that male representative. This result was not mentioned in either of two political science blog posts discussing the claim of a gender double standard made by a female U.S. Representative involved in a sex scandal.
