Tour of research on student evaluations of teaching [74-end], plus a discussion

Let's conclude our discussion of the studies listed as "finding bias" in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching". See here for the first entry in the series and here for other entries.

This post focuses on what I consider to be promising research designs.

---

74.

Mitchell and Martin 2018 "Gender Bias in Student Evaluations" has two studies. The first study reports a comparison of comments about one female instructor to comments about one male instructor, drawn from official course evaluations and RateMyProfessors comments, on dimensions such as competence, appearance, personality, and use of "teacher" or "professor". The study establishes little in the way of an all-else-equal element for instructor teaching style or effectiveness, and it didn't even involve the instructors teaching the same set of courses. Moreover, as I indicated here, the p-values for the reported results are biased downward in a way that undercuts inferences that there is a statistical difference.

The second study has a better research design, with Mitchell and Martin teaching different sections of the same course, so that "all lectures, assignments, and content were exactly the same in all sections" (p. 650).

But this study also has errors in the p-values and lacks an all-else-equal element that could sufficiently eliminate all plausible explanations other than gender bias ("The only aspects of the course that varied between Dr. Mitchell's and Dr. Martin's sections were the course grader and contact with the instructor", p. 650). Moreover, Martin taught the higher-numbered sections, for which students plausibly differed from students in the lower-numbered sections (e.g., from what I can tell, response rates to the student evaluations were 17 percent for Mitchell and 12 percent for Martin); and Martin and Mitchell had taught at the university before and thus could have developed reputations that explain any difference in student evaluations for the course.

And the analysis for the second study excluded student evaluation data for sections 1 to 5 of the course; these data are not available on the university website.

---

75.

Rivera and Tilcsik 2019 "Scaling Down Inequality: Rating Scales, Gender Bias, and the Architecture of Evaluation" has a preregistered survey experiment and a quasi-natural experiment.

The quasi-natural experiment involved a university switching from a 10-point evaluation scale to a 6-point evaluation scale. Data from 105,034 student ratings of 369 instructors in 235 courses over 20 semesters with a 10-point scale and 9 semesters with a 6-point scale indicated that male instructors were more likely than were female instructors to get the highest rating on the 10-point scale (p. 257) but were not more likely than were female instructors to get the highest rating on the 6-point scale (p. 258); moreover, the 0.5-point male/female gap for the 10-point scale was reduced to a 0.1-point male/female gap for the 6-point scale. The authors reported results that addressed potential confounds, such as: "the change we observe is not driven by a general linear trend toward higher ratings for women (Models 8 and 11)" (p. 263).
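To make the structure of that comparison concrete, here is a minimal R sketch of the kind of gender-by-scale interaction test the design implies; the data are simulated and the variable names are mine, and this is not necessarily the specification that Rivera and Tilcsik used:

# simulated stand-in for the quasi-experimental data (hypothetical, for illustration only)
set.seed(1)
ratings <- data.frame(male_instructor=rbinom(10000, 1, 0.5), six_point=rbinom(10000, 1, 0.3))
# build in a male advantage in top ratings that disappears under the 6-point scale
p <- with(ratings, plogis(-1 + 0.3*male_instructor*(1 - six_point)))
ratings$top_rating <- rbinom(10000, 1, p)
fit <- glm(top_rating ~ male_instructor*six_point, family=binomial, data=ratings)
summary(fit)   # the interaction term estimates how much the male advantage in top ratings changed under the 6-point scale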

The survey experiment was intended to build on these results with a research design that permitted a stronger causal inference. The experiment was conducted online with 400 students from 40 universities, with experimental manipulations of the sex of the instructor and of whether the rating scale had 6 points or 10 points. The teaching stimulus involved students receiving "identical excerpts from the transcript of a lecture and [being] randomly assigned either a male or a female name to the instructor who had ostensibly given the lecture" (p. 256).

For the 10-point scale, the mean rating was 7.8 for the male instructor and was 7.1 for the female instructor, a difference at p<0.05 of about 0.32 standard deviations, using the 0.64 gap mentioned on page 265. However, for the 6-point scale, the mean rating was 4.9 for the male instructor and was 4.8 for the female instructor, a difference at p>0.05 of about 0.1 standard deviations.
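For the arithmetic behind these standardized differences: a standardized difference is the raw gap in rating points divided by the standard deviation of the ratings, so the numbers above imply a ratings standard deviation of roughly 2 points on the 10-point scale. That's my back-calculation, not a figure from the article:

gap_10pt <- 0.64    # male/female gap on the 10-point scale (p. 265)
d_10pt <- 0.32      # approximate standardized difference noted above
gap_10pt / d_10pt   # implies a ratings standard deviation of about 2 points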

Results from the survey experiment also indicated that participants were more likely to use superlative words to describe the male instructor than to describe the female instructor.

I think that Rivera and Tilcsik 2019 is a great study that makes a convincing case that numeric student evaluations of teaching should have no more than 6 points. But I don't think that it provides convincing evidence that numeric student evaluations of teaching with 6 or fewer points should not be used in employment decisions, given that, with the 6-point scale, the experiment did not detect at p<0.05 a difference between the mean rating for the male instructor and the mean rating for the female instructor, and the quasi-experimental data did not even show a meaningfully large difference.

The survey experiment did provide evidence of gender bias in student use of superlatives, in items that the preregistration form indicated were for exploratory purposes, but I don't know that this sort of superlative has a nontrivial effect on the employment decisions that student evaluations of teaching are properly used for.

Two notes and/or criticisms...

First, the topic of the lecture in the survey experiment was selected with gender in mind (p. 256):

All participants read an identical excerpt from the transcript of a lecture on the social and economic implications of technological change. We chose this topic because it has potentially broad appeal, and both technology and economics are traditionally male-dominated fields.

I don't know why it would be a good idea to select a topic that drew on two male-dominated fields, instead of selecting a gender-neutral topic or, preferably, adding a dimension to the survey experiment in which some participants received a topic from a male-dominated field and other participants received a topic from a female-dominated field. Limiting the topic to male-dominated fields undercuts the ability to generalize any bias detected in favor of the male instructor.

Second, the article suggests that teaching quality was held constant in the survey experiment ("...randomly vary the focal instructor's (perceived) gender and the rating scale while holding constant instructor quality", p. 256, emphasis in the original). But the only "teaching" that participants were exposed to was the teaching that could be inferred from reading an excerpt of a lecture transcript, which undercuts the ability to generalize results to in-person teaching, especially in-person teaching over a semester.

---

76.

The shortcoming of the survey experiment methodology in Rivera and Tilcsik 2019 is that the exposure to the instructor is too brief to permit the inference that any trivial-to-moderate bias detected in the survey experiment would persist across an entire semester of exposure to the instructor, especially if the exposure is two or three days per week for 12 or more weeks. Relatedly, from Anderson and Kanner 2011: "Many studies have found that stereotypes decrease as individuals get to know out-group members (e.g., Anderssen, 2002)" (p. 1560).

MacNell et al. 2014 "What's in a Name: Exposing Gender Bias in Student Ratings of Teaching" addresses this shortcoming by having a male assistant instructor and a female assistant instructor for an introductory-level anthropology/sociology course each teach one section as themselves and another section under the other assistant instructor's identity; students were randomly assigned to section. If conducted properly and preregistered, this sort of study could provide strong evidence of gender bias. But as Benton and Li 2014 indicate, the study or the article has multiple important flaws, such as the assistant instructors not being blind to the gender that they were presenting as and the article not reporting results for the item that asked about the instructor's overall quality of teaching.

The experiment is also substantially underpowered, based on estimates from Rivera and Tilcsik 2019. The 6-point scale in the Rivera and Tilcsik 2019 survey experiment produced an estimate of the bias against female instructors of 0.10 standard deviations, and that estimate was not statistically significant. Using that estimate and the MacNell et al. 2014 sample sizes, the MacNell et al. 2014 study had 6 percent statistical power. The items in MacNell et al. 2014 had a 5-point scale, but the statistical power is only 25 percent even using the 0.40 standard deviation estimate from the Rivera and Tilcsik 2019 survey experiment conditions with the 10-point scale. R code:

library(pwr)

# power of a two-sample t-test at the MacNell et al. 2014 group sizes (n=20 and n=23)
pwr.t2n.test(n1=20, n2=23, d=0.1)   # d from the Rivera and Tilcsik 6-point scale: about 6 percent power

pwr.t2n.test(n1=20, n2=23, d=0.4)   # d from the Rivera and Tilcsik 10-point scale conditions: about 25 percent power
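For context, here is the flip side of the same calculation: the approximate per-group sample size that would be needed for 80 percent power at those effect sizes (my addition, using the same pwr package and its default two-sample t-test settings):

pwr.t.test(d=0.1, power=0.80)   # roughly 1,570 students per group
pwr.t.test(d=0.4, power=0.80)   # roughly 100 students per group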

The MacNell et al. 2014 data are here. My analysis indicated that, comparing the perceived female instructor group to the perceived male instructor group, the p-value is p=0.15 for the overall evaluation item (a 0.46 standard deviation difference) and is p=0.07 for an index of all 15 items (a 0.58 standard deviation difference). The estimated bias on the index was 0.35 standard deviations (p=0.51) among male students and was 0.81 standard deviations (p=0.055) among female students, although the p-value for this 0.46 standard deviation difference between the student-gender estimates is p=0.45. Stata code:

* standardize the overall-evaluation item and compare by perceived instructor gender
egen overall_std = std(overall)
ttest overall_std, by(taidgender) unp une

* sum all 15 evaluation items into an index, standardize it, and compare by perceived instructor gender
gen index = professional + respect + caring + enthusiastic + communicate + helpful + feedback + prompt + consistent + fair + responsive + praised + knowledgeable + clear + overall
egen index_std = std(index)
ttest index_std, by(taidgender) unp une

* repeat the index comparison separately for each student gender,
* then estimate the student-gender difference with an interaction model
ttest index_std if gender==1, by(taidgender) unp une
ttest index_std if gender==2, by(taidgender) unp une
reg index_std taidgender##gender
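For readers without Stata, here is a rough R equivalent of the commands above, assuming a data frame d that contains the MacNell et al. 2014 data with the variable names used in the Stata code; treat it as a sketch rather than an exact replication:

# standardize the overall item and compare by perceived instructor gender (Welch test, like Stata's unequal option)
d$overall_std <- as.numeric(scale(d$overall))
t.test(overall_std ~ taidgender, data=d)

# sum the 15 items into an index, standardize, and compare by perceived instructor gender
items <- c("professional", "respect", "caring", "enthusiastic", "communicate", "helpful", "feedback", "prompt", "consistent", "fair", "responsive", "praised", "knowledgeable", "clear", "overall")
d$index <- rowSums(d[, items])
d$index_std <- as.numeric(scale(d$index))
t.test(index_std ~ taidgender, data=d)

# separate comparisons by student gender, then an interaction model
t.test(index_std ~ taidgender, data=subset(d, gender==1))
t.test(index_std ~ taidgender, data=subset(d, gender==2))
summary(lm(index_std ~ factor(taidgender)*factor(gender), data=d))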

---

Discussion

Detecting gender bias in student evaluations of teaching is not simple. Non-experimental research that doesn't control for gender differences in teaching quality or teaching style and doesn't control for student differences isn't of much value, unless the detected difference in evaluations is too large to be plausibly explained by a combination of gender differences in teaching quality or teaching style and non-random student selection into instructors and courses. And experimental research that involves only a brief exposure to a target instructor isn't informative about whether any detected gender bias would persist over an entire semester of exposure to an instructor.

I think that studies 74 through 76 have among the strongest research designs in the literature on bias in student evaluations of teaching, if not the strongest research designs, or at least have the potential for convincing research designs if the aforementioned flaws are addressed. However, even setting aside the fact that the online research designs in these studies produce inferences that might not apply to face-to-face courses, I don't know what about these studies or other studies should lead a department or university to not use student evaluations of teaching in employment decisions.

I think that a research design of the type used in MacNell et al. 2014 could provide convincing evidence of bias in student evaluations of teaching, if preregistered, sufficiently powered, and conducted with instructors blinded to condition. Even better would be for the research to be conducted or supervised by a team of researchers in an adversarial collaboration.

An interesting research design has been to analyze evaluations to assess whether the sub-items that predict an overall evaluation differ for female instructors and male instructors. However, it's not clear to me that such differences would be a bias that should lead a department or university to not use student evaluations of teaching in employment decisions, unless these biases manifest in the student responses to the items and not only in the correlations among responses to the items.
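To illustrate what that research design looks like in practice, here is a minimal R sketch of a regression that asks whether the weight a sub-item carries in predicting the overall rating differs by instructor gender; the data are simulated and the variable names are mine:

# simulated evaluations in which "caring" is weighted more heavily for female instructors (hypothetical)
set.seed(2)
n <- 500
evals <- data.frame(female_instructor=rbinom(n, 1, 0.5), caring=rnorm(n), knowledgeable=rnorm(n))
evals$overall <- with(evals, 0.3*knowledgeable + (0.2 + 0.3*female_instructor)*caring + rnorm(n))
fit <- lm(overall ~ (caring + knowledgeable)*female_instructor, data=evals)
summary(fit)   # the caring:female_instructor term estimates the gender difference in that sub-item's weight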

---

I think that student evaluations of teaching are a useful tool when used properly, but, if I were opposed to their use in employment decisions or otherwise, I don't know that I would focus on the claims of gender and race biases, such as "Mounting evidence of favouritism towards white male instructors doesn't dissuade universities". I think that it's correct that any gender or race bias would be smaller than a "beauty premium" favoring attractive instructors (e.g., Lombardo and Tocci 1979, Hamermesh and Parker 2005, Wallisch and Cachia 2018).

---

Let me end the post with a discussion of the Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" list of 76 studies "finding bias", which I have been working through pursuant to a suggestion by a Holman et al. 2019 coauthor. The list was presented by a coauthor in a widely distributed tweet as a "list of 76 articles demonstrating gender and/or racial bias in student evaluations".

But the Holman et al. 2019 list has important flaws. The list is incomplete, and it is not clear that omitted studies are missing at random. For example, the Feldman 1992 review of experimental studies indicates that "Most of the laboratory studies reviewed here (and summarized in the Appendix) found that male and female teachers did not differ in college student's overall evaluation of them as professionals..." (emphasis in the original), and the Feldman 1993 review of non-experimental studies indicates that "a majority of studies have found that male and female college teachers do not differ in the global ratings they receive from their students" and that "when statistically significant differences are found, more of them favor women than men". This is not the impression provided by the Holman et al. 2019 list, which includes over fifteen pre-1991 "finding bias" studies, one pre-1991 "bias favoring women" study, and zero pre-1991 "no gender or race bias" studies.

Moreover, the Holman et al. 2019 list has at least nine entries that are essays or reviews presenting no novel non-anecdotal data (Schuster and Van Dyne 1985, Feldman 1992, Feldman 1993, Andersen and Miller 1997, Baldwin and Blattner 2003, Huston 2006, Laube et al. 2007, Spooren et al. 2013, and Stark and Freishtat 2014); at least five entries that might not be properly classified as being about student evaluations of teaching (Brooks 1982, Heilman and Okimoto 2007, El-Alayli et al. 2018, Drake et al. 2019, and Piatak and Mohr 2019); at least three entries that might be better classified as duplicates of other entries (Boring 2015, Boring et al. 2016, and "Freiderike" et al. 2017); at least one entry that might be better classified as "no bias" or as "bias favoring women" (Basow and Montgomery 2005); and at least three entries that might be better not listed as finding bias, if the bias is intended to be a bias in favor of straight white men or some combination thereof (Greenwald and Gillmore 1997, Uttl et al. 2017, and Hessler et al. 2018).

It's also worth noting that a nontrivial percentage of studies listed as "finding bias" are at least 20 years old: 6 studies from before 1980, 17 studies from before 1990, and 28 studies (37%) from before 1999.

I think that there is a public benefit to a list of studies assessing bias in student evaluations of teaching, but the list should at least be representative or be accompanied by an indication of evidence that the list is representative. For what it's worth, on November 26, I tweeted to two Holman et al. 2019 coauthors a link to a post listing errors in one of the 76 summaries; the errors were still there on December 16.
