The Journal of Academic Ethics published Kreitzer and Sweet‑Cushman 2021 "Evaluating Student Evaluations of Teaching: A Review of Measurement and Equity Bias in SETs and Recommendations for Ethical Reform".

---

Kreitzer and Sweet‑Cushman 2021 reviewed "a novel dataset of over 100 articles on bias in student evaluations of teaching" (p. 1), later described as "an original database of more than 90 articles on evaluative bias constructed from across academic disciplines" (p. 2), but the article does not provide a specific size for the dataset/database.

I'll focus on the Kreitzer and Sweet‑Cushman 2021 discussion of evidence for an "equity bias".

---

Footnote 4

Let's start with Kreitzer and Sweet‑Cushman 2021 footnote 4:

Research also finds that the role of attractiveness is more relevant to women, who are more likely to get comments about their appearance (Mitchell & Martin, 2018; Key & Ardoin, 2019). This is problematic given that attractiveness has been shown to be correlated with evaluations of instructional quality (Rosen, 2018)

Mitchell and Martin 2018 reported two findings about comments on instructor appearance. MM2018 Table 1 reported on a content analysis of official university course evaluations, which indicated that 0% of comments for the woman instructor and 0% of comments for the man instructor were appearance-related. MM2018 Table 2 reported on a content analysis of Rate My Professors comments, which indicated that 10.6% of comments for the woman instructor and 0% of comments for the man instructor were appearance-related, with p<0.05 for the difference between the 10.6% and the 0%.

So Kreitzer and Sweet‑Cushman 2021 footnote 4 cited the p<0.05 Rate My Professors finding but not the zero result for the official university course evaluations, even though official university course evaluations are presumably much more informative about bias in student evaluations as used in practice, compared to Rate My Professors comments that presumably are unlikely to be used for faculty tenure, promotion, and end-of-year evaluations.

Note also that Kreitzer and Sweet‑Cushman 2021 reported this Rate My Professors appearance-related finding without indicating the low quality of the research design: Mitchell and Martin 2018 compared comments about one woman instructor (Mitchell herself) to comments about one man instructor (Martin himself), from a non-experimental research design.

Moreover, the p<0.05 evidence for this "appearance" finding from Mitchell and Martin is based on an error by Mitchell and/or Martin. I blogged about the error in 2019, and MM2018 was eventually corrected (26 May 2020) to indicate that there is insufficient evidence (p=0.3063) to infer that the 10.6 percentage point gender difference in appearance-related comments is sufficiently inconsistent with chance. However, Kreitzer and Sweet‑Cushman 2021 (accepted 27 Jan 2021) cited this "appearance" finding from the uncorrected version of the article.

---

And I'm not sure what footnote 4 is referencing in Key and Ardoin 2019. The closest Key and Ardoin 2019 passage that I see is below:

Another telling point is whether students comment on the faculty member's teaching and expertise, or on such personal qualities as physical appearance or fashion choices. Among the students who'd received the bias statement, comments on female faculty were substantially more likely to be about the teaching.

But this Key and Ardoin 2019 passage is about a difference between groups in comments about female faculty (involving personal qualities and not merely comments on appearance), and does not compare comments about female faculty to comments about male faculty, which is what would be needed to support the Kreitzer and Sweet‑Cushman 2021 claim in footnote 4.

---

And for the Kreitzer and Sweet‑Cushman 2021 claim that "the role of attractiveness is more relevant to women", consider this passage from Hamermesh and Parker (2005: 373):

The reestimates show, however, that the impact of beauty on instructors' course ratings is much lower for female than for male faculty. Good looks generate more of a premium, bad looks more of a penalty for male instructors, just as was demonstrated (Hamermesh & Biddle, 1994) for the effects of beauty in wage determination.

This finding is the *opposite* of the claim that "the role of attractiveness is more relevant to women".

Kreitzer and Sweet‑Cushman 2021 cited Hamermesh and Parker 2005 elsewhere, so I'm not sure why Kreitzer and Sweet‑Cushman 2021 footnote 4 claimed that "the role of attractiveness is more relevant to women" without at least noting the contrary evidence from Hamermesh and Parker 2005.

---

"...react badly when those expectations aren't met"

From Kreitzer and Sweet‑Cushman 2021 (p. 4):

Students are also more likely to expect special favors from female professors and react badly when those expectations aren't met or fail to follow directions when they are offered by a woman professor (El-Alayli et al., 2018; Piatak & Mohr, 2019).

From what I can tell, neither Piatak and Mohr 2019 nor Study 1 of El-Alayli et al. 2018 supports the "react badly when those expectations aren't met" part of this claim. I think that this claim refers to the "negative emotions" measure of El-Alayli et al. 2018 Study 2, but I don't think that the El-Alayli et al. 2018 data support that inference.

El-Alayli et al. 2018 *claimed* that there was a main effect of professor gender for the "negative emotions" measure, but I think that that claim is incorrect: the relevant means in El-Alayli et al. 2018 Table 1 are 2.38 and 2.28, with a sample size of 121 across two conditions and corresponding standard deviations of 0.93 and 0.93, so there is insufficient evidence of a main effect of professor gender for that measure.
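For what it's worth, here is a minimal R sketch of that check, computing a two-sample t-test from those summary statistics; the 60/61 split of the 121 participants across the two professor-gender conditions is my assumption, not a number reported in the article:

# Two-sample t-test from summary statistics (means 2.38 vs. 2.28, SDs 0.93 and 0.93),
# assuming the 121 participants split roughly evenly (60/61) across conditions
m1 <- 2.38; s1 <- 0.93; n1 <- 60
m2 <- 2.28; s2 <- 0.93; n2 <- 61
se <- sqrt(s1^2/n1 + s2^2/n2)
t  <- (m1 - m2)/se
2*pt(-abs(t), n1 + n2 - 2)   # roughly p = 0.56

Under those assumed cell sizes, the p-value is far from 0.05, consistent with the claim of insufficient evidence.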

---

"...no discipline where women receive higher evaluative scores"

From Kreitzer and Sweet‑Cushman 2021 (p. 4):

Rosen (2018), using a massive (n = 7,800,000) Rate My Professor sample, finds there is no discipline where women receive higher evaluative scores.

I think that the relevant passage from Rosen 2018 is:

Importantly, out of all the disciplines on RateMyProfessors, there are no fields where women have statistically higher overall quality scores than men.

But this claim is based on an analysis limited to instructors rated "not hot", so Rosen 2018 doesn't support the Kreitzer and Sweet‑Cushman 2021 claim, which was phrased without that "not hot" caveat.

My concern with limiting the analysis to "not hot" instructors was that Rosen 2018 indicated that "hot" instructors on average received higher ratings than "not hot" instructors and that a higher percentage of women instructors than of men instructors received a "hot" rating. Thus, it seemed plausible to me that restricting the analysis to "not hot" instructors removed a higher percentage of highly-rated women than of highly-rated men.

I asked Andrew S. Rosen about gender comparisons by field for the Rate My Professors ratings for all professors, not limited to "not hot" professors. He indicated that, of the 75 fields with the largest number of Rate My Professors ratings, men faculty had a higher mean overall quality rating than women faculty at p<0.05 in many of these fields, but that, in exactly one of these fields (mathematics), women faculty had a higher mean overall quality rating than men faculty at p<0.05, with women faculty in mathematics also having a higher mean clarity rating and a higher mean helpfulness rating than men faculty in mathematics (p<0.05). Thanks to Andrew S. Rosen for the information.

By the way, the 7.8 million sample size cited by Kreitzer and Sweet‑Cushman 2021 is for the number of ratings, but I think that the more relevant sample size is the number of instructors who were rated.

---

"designs", plural

From Kreitzer and Sweet‑Cushman 2021 (p. 4):

Experimental designs that manipulate the gender of the instructor in online teaching environments have even shown that students offered lower evaluations when they believed the instructor was a woman, despite identical course delivery (Boring et al., 2016; MacNell et al., 2015).

The plural "experimental designs" and the citation of two studies suggests that one of these studies replicated the other study, but, regarding this "believed the instructor was a woman, despite identical course delivery" research design, Boring et al. 2016 merely re-analyzed data from MacNell et al. 2015, so the two cited studies are not independent of each other such that a plural "experimental designs" would be justified.

And Kreitzer and Sweet‑Cushman 2021 reported the finding without mentioning shortcomings of the research design, such as a sample size small enough (N=43 across four conditions) to raise reasonable questions about the replicability of the result.
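To give a rough sense of the replicability concern, here is a minimal power sketch using the pwr package and the 20-versus-23 split by perceived instructor gender that I use later in this post; the d=0.5 effect size is an arbitrary benchmark and not an estimate from the study:

library(pwr)
# Power to detect a half-standard-deviation difference with 20 and 23 participants per group;
# d = 0.5 is an arbitrary benchmark
pwr.t2n.test(n1=20, n2=23, d=0.5)   # power is roughly 0.35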

---

Discussion

I think that it's plausible that there are unfair equity biases in student evaluations of teaching, but I'm not sure that Kreitzer and Sweet‑Cushman 2021 is convincing about that.

My reading of the literature on unfair bias in student evaluations of teaching is that the research isn't of consistently high enough quality that a credulous review establishes anything: a lot of the research designs don't permit causal inference of unfair bias, and a lot of the research designs that could permit causal inference have other flaws.

Consider the uncorrected Mitchell and Martin 2018: is it plausible that a respectable peer-reviewed journal would publish results from a similar research design that claimed no gender bias in student comments, in which the data were limited to a non-experimental comparison of comments about only two instructors? Or is it plausible that a respectable peer-reviewed journal would publish a four-condition N=43 version of MacNell et al. 2015 that found no gender bias in student ratings? I would love to see these small-N null-finding peer-reviewed publications, if they exist.

But maybe non-experimental "N=2 instructors" studies and experimental "N=43 students" studies that didn't detect gender bias in student evaluations of teaching exist, but haven't yet been published. If so, then did Kreitzer and Sweet‑Cushman try to find them? From what I can tell, Kreitzer and Sweet‑Cushman 2021 does not indicate that the authors solicited information about unpublished research through, say, posting requests on listservs or contacting researchers who have published on the topic.

I plan to tweet a link to this post tagging Dr. Kreitzer and Dr. Sweet‑Cushman, and I'm curious to see whether Kreitzer and Sweet‑Cushman 2021 is corrected or otherwise updated to address any of the discussion above.


In May 2020, PS published a correction to Mitchell and Martin 2018 "Gender Bias in Student Evaluations", which reflected concerns that I raised in a March 2019 blog post. That correction didn't mention me, and in May 2020 PS published another correction that didn't mention me but was due to my work, so I'll note below evidence that the corrections were due to my work, which might be useful in documenting my scholarly contributions for, say, an end-of-the-year review or promotion application.

---

In August 2018, I alerted the authors of Mitchell and Martin 2018 (hereafter MM) to concerns about potential errors in MM. I'll post one of my messages below. My sense at the time was that the MM authors were not going to correct MM (and the lead author of MM was defending MM as late as June 2019), so I published a March 2019 blog post about my concerns and in April 2019 I emailed PS a link to my blog post and a suggestion that MM "might have important errors in inferential statistics that warrant a correction".

In May 2019, a PS editor indicated to me that the MM authors had chosen to not issue a correction, and PS invited me to submit a comment on MM that would pass through the normal peer review process. I transformed my blog post into a manuscript comment, which involved, among other things, coding all open-ended student evaluation comments and calculating what I thought the correct results should be in the main three MM tables. Moreover, for completeness, I contacted Texas Tech University and eventually filed a Public Information Act request, because no one I communicated with at Texas Tech about this knew for certain why student evaluation data were not available online for certain sections of the course that MM Table 4 reported student evaluation results for.

I submitted a comment manuscript to PS in August 2019 and submitted a revision based on editor feedback in September 2019. Here is the revised submitted manuscript. In January 2020, I received an email from PS indicating that my manuscript was rejected after peer review and that PS would request a corrigendum from the authors of MM.

In May 2020, PS published a correction to MM, but I don't think that the correction is complete: for example, as I discussed in my blog post and manuscript comment, I think that the inferential statistics in MM Table 4 were incorrectly based on a calculation in which multiple ratings from the same student were treated as independent ratings.

---

For the Comic-Con correction that PS issued in May 2020, I'll quote from my manuscript documenting the error of inference in the article:

I communicated concerns about the Owens et al. 2020 "Comic-Con" article to the first two authors in November 2019. I did not hear of an attempt to publish a correction, and I did not receive a response to my most recent message, so I submitted this manuscript to PS: Political Science & Politics on Feb 4, 2020. PS published a correction to "Comic-Con" on May 11, 2020. PS then rejected my manuscript on May 18, 2020 "after an internal review".

Here is an archive of a tweet thread, documenting that in September 2019 I alerted the lead "Comic-Con" author to the error of inference, and the lead author did not appear to understand my point.

---

NOTES:

1. My PS symposium entry "Left Unchecked" (published online in June 2019) discussed elements of MM that ended up being addressed in the MM correction.

2. Here is an email that I sent the MM authors in August 2018:

Thanks for the data, Dr. Mitchell. I had a few questions, if you don't mind:

[1] The appendix indicates for the online course analysis that: "For this reason, we examined sections in the mid- to high- numerical order: sections 6, 7, 8, 9, and 10". But I think that Dr. Martin taught a section 11 course (D11) that was included in the data.

[2] I am not certain about how to reproduce the statistical significance levels for Tables 1 and 2. For example, for Table 1, I count 23 comments for Dr. Martin and 45 comments for Dr. Mitchell, for the N=68 in the table. But a proportion test in Stata for the "Referred to as 'Teacher'" proportions (prtesti 23 0.152 45 0.244) produces a z-score of -0.8768, which does not seem to match the table asterisks indicating a p-value of p<0.05.

[3] Dr. Martin's CV indicates that he was a visiting professor at Texas Tech in 2015 and 2016. For the student comments for POLS 3371 and POLS 3373, did Dr. Martin's official title include "professor"? If so, then that might influence inferences about any difference in the frequency of student use of the label "professor" between Dr. Martin and Dr. Mitchell. I didn't see "professor" as a title in Dr. Mitchell's CV, but the inferences could also be influenced if Dr. Mitchell had "professor" in her title for any of the courses in the student comments analysis, or for the Rate My Professors comments analysis.

[4] I was able to reproduce the results for the Technology analysis in Table 4, but, if I am correct, the statistical analysis seems to assume that the N=153 for Dr. Martin and the N=501 for Dr. Mitchell are for 153 and 501 independent observations. I do not think that this is correct, because my understanding of the data is that the 153 observations for Dr. Martin are 3 observations for 51 students and that the 501 observations for Dr. Mitchell are 3 observations for 167 students. I think that the analysis would need to adjust for the non-independence of some of the observations.

Sorry if any of my questions are due to a misunderstanding. Thank you for your time.

Best,

L.J


Let's conclude our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias". See here for the first entry in the series and here for other entries.

This post focuses on what I consider to be promising research designs.

---

74.

Mitchell and Martin 2018 "Gender Bias in Student Evaluations" has two studies. The first study reports a comparison of comments on one female instructor to comments on one male instructor, from official course evaluations and RateMyProfessors comments, on dimensions such as competence, appearance, personality, and use of "teacher" or "professor". The study does little to establish an all-else-equal comparison for instructor teaching style or effectiveness, and the study didn't even involve the instructors teaching the same set of courses. Moreover, as I indicated here, the p-values for the reported results are biased downward in a way that undercuts inferences that there is a statistical difference.

The second study has a better research design, with Mitchell and Martin teaching different sections of the same course, so that "all lectures, assignments, and content were exactly the same in all sections" (p. 650).

But this study also has errors in the p-values and lacks an all-else-equal element that could sufficiently eliminate all plausible explanations other than gender bias ("The only aspects of the course that varied between Dr. Mitchell's and Dr. Martin's sections were the course grader and contact with the instructor", p. 650). Moreover, Martin taught the higher-numbered sections, for which students plausibly differed from students in the lower-numbered sections (e.g., from what I can tell, response rates to the student evaluations were 17 percent for Mitchell and 12 percent for Martin); and Martin and Mitchell had taught before at the university and thus could have developed reputations that caused any difference in student evaluations for the course.

And the analysis for the second study excluded student evaluation data for sections 1 to 5 of the course; these data are not available on the university website.

---

75.

Rivera and Tilcsik 2019 "Scaling Down Inequality: Rating Scales, Gender Bias, and the Architecture of Evaluation" has a preregistered survey experiment and a quasi-natural experiment.

The quasi-natural experiment involved a university switching from a 10-point evaluation scale to a 6-point evaluation scale. Data from 105,034 student ratings of 369 instructors in 235 courses over 20 semesters with a 10-point scale and 9 semesters with a 6-point scale indicated that male instructors were more likely than were female instructors to get the highest rating on the 10-point scale (p. 257) but were not more likely than were female instructors to get the highest rating on the 6-point scale (p. 258); moreover, the 0.5-point male/female gap for the 10-point scale was reduced to a 0.1-point male/female gap for the 6-point scale. The authors reported results that addressed potential confounds, such as: "the change we observe is not driven by a general linear trend toward higher ratings for women (Models 8 and 11)" (p. 263).

The survey experiment was intended to build on these results with a research design that permitted a stronger causal inference. The experiment was conducted online with 400 students from 40 universities, in which the experimental manipulations involved the sex of the instructor and whether the rating scale had 6 points or 10 points. The teaching samples involved students receiving "identical excerpts from the transcript of a lecture and [being] randomly assigned either a male or a female name to the instructor who had ostensibly given the lecture" (p. 256).

For the 10-point scale, the mean rating was 7.8 for the male instructor and was 7.1 for the female instructor, a difference at p<0.05 of about 0.32 standard deviations, using the 0.64 gap mentioned on page 265. However, for the 6-point scale, the mean rating was 4.9 for the male instructor and was 4.8 for the female instructor, a difference at p>0.05 of about 0.1 standard deviations.

Results from the survey experiment also indicated evidence that participants were more likely to use superlative words to describe the male instructor than to describe the female instructor.

I think that Rivera and Tilcsik 2019 is a great study that is convincing that numeric student evaluations of teaching should have no more than 6 points. But I don't think that it is convincing evidence that numeric student evaluations of teaching with 6 or fewer points should not be used in employment decisions, given that the experiment did not detect at p<0.05 a difference between the mean rating for the male instructor and the mean rating for the female instructor when using a 6-point scale and did not even detect a meaningfully large difference in the quasi-experimental data.

The survey experiment did provide evidence of gender bias in student use of superlatives in items that the preregistration form indicated were for exploratory purposes, but I don't know that these sorts of superlatives have a nontrivial effect on the employment decisions that student evaluations of teaching are properly used for.

Two notes and/or criticisms...

First, the topic of the lecture in the survey experiment was selected with gender in mind (p. 256):

All participants read an identical excerpt from the transcript of a lecture on the social and economic implications of technological change. We chose this topic because it has potentially broad appeal, and both technology and economics are traditionally male-dominated fields.

I don't know why it would be a good idea to select a topic that drew on two male-dominated fields instead of selecting a topic that was gender neutral or, preferably, adding a dimension to the survey experiment in which some participants received a topic from a male-dominated field and other participants received a topic from a female-dominated field. Limiting the topic to male-dominated fields undercuts the ability to generalize any bias detected in favor of the male instructors.

Second, the article suggests that teaching quality was held constant in the survey experiment ("...randomly vary the focal instructor's (perceived) gender and the rating scale while holding constant instructor quality", p. 256, emphasis in the original). But the only "teaching" that participants were exposed to was the teaching that could be inferred from reading lecture notes, which undercuts the ability to generalize results to in-person teaching, especially in-person teaching over a semester.

---

76.

The shortcoming of the survey experiment methodology in Rivera and Tilcsik 2019 is that the exposure to the instructor is too brief to permit the inference that any trivial-to-moderate bias detected in the survey experiment will survive an entire semester of exposure to the instructor, especially if the exposure is two or three days per week for 12 or more weeks. Related, from Anderson and Kanner 2011: "Many studies have found that stereotypes decrease as individuals get to know out-group members (e.g., Anderssen, 2002)" (p. 1560).

MacNell et al. 2014 "What's in a Name: Exposing Gender Bias in Student Ratings of Teaching" addresses this shortcoming by having a male assistant instructor and a female assistant instructor for an introductory-level anthropology/sociology course act as themselves for a section and act as the other assistant instructor for another section; students were randomly assigned to section. If conducted properly and preregistered, this sort of study could provide strong evidence of gender bias. But as Benton and Li 2014 indicate, the study or the article have multiple important flaws, such as the assistant instructors not being blind to the sex that the instructor was acting as and the article not reporting results for the item that asked about the instructor's overall quality of teaching.

The experiment is also substantially underpowered, based on estimates from Rivera and Tilcsik 2019. The 6-point scales in the Rivera and Tilcsik 2019 survey experiment produced an estimate of the bias against female instructors of 0.10 standard deviations, and that estimate was not statistically significant. Using that estimate and the MacNell et al. 2014 sample sizes, the MacNell et al. 2014 study had 6 percent statistical power. The items in MacNell et al. 2014 had a 5-point scale, but the statistical power is only 25 percent even using the 0.40 standard deviation estimate from the Rivera and Tilcsik 2019 survey experiment conditions with the 10-point scale. R code:

library(pwr)

# Power with the MacNell et al. 2014 group sizes and the d = 0.1 estimate
# from the Rivera and Tilcsik 2019 6-point-scale conditions
pwr.t2n.test(n1=20, n2=23, d=0.1)

# Power with the d = 0.4 estimate from the 10-point-scale conditions
pwr.t2n.test(n1=20, n2=23, d=0.4)

The MacNell et al. 2014 data are here. My analysis indicated that, comparing the perceived female instructor group to the perceived male instructor group, the p-value is p=0.15 for the overall evaluation item (a 0.46 standard deviation difference) and is p=0.07 for an index of all 15 items (a 0.58 standard deviation difference). The estimated bias on the index was 0.35 standard deviations (p=0.51) among male students and was 0.81 standard deviations (p=0.055) among female students, although the p-value is only p=0.45 for this 0.46 standard deviation difference. Stata code:

* Standardize the overall evaluation item and compare by perceived instructor gender
egen overall_std = std(overall)
ttest overall_std, by(taidgender) unp une

* Build an index of all 15 items, standardize it, and compare by perceived instructor gender
gen index = professional + respect + caring + enthusiastic + communicate + helpful + feedback + prompt + consistent + fair + responsive + praised + knowledgeable + clear + overall
egen index_std = std(index)
ttest index_std, by(taidgender) unp une

* Repeat the index comparison separately for male students and for female students,
* and test the student gender by perceived instructor gender interaction
ttest index_std if gender==1, by(taidgender) unp une
ttest index_std if gender==2, by(taidgender) unp une
reg index_std taidgender##gender

---

Discussion

Detecting gender bias in student evaluations of teaching is not simple. Non-experimental research that doesn't control for gender differences in teaching quality or teaching style and doesn't control for student differences isn't of much value unless the detected difference in evaluations is large enough to not be plausibly explained by a combination of gender differences in teaching quality or teaching style and non-random student selection to instructors and courses. And experimental research that involves only a brief exposure to a target instructor isn't informative about whether any detected gender bias will persist for an entire semester of exposure to an instructor.

I think that studies 74 through 76 have among the strongest research designs in the literature on bias in student evaluations of teaching, if not the strongest research designs, or at least have the potential for convincing research designs if aforementioned flaws are addressed. However, even accounting for the fact that online research designs in these studies produce inferences that might not apply to face-to-face courses, I don't know what about these studies or other studies should lead a department or university to not use student evaluations of teaching in employment decisions.

I think that the research design of the type in MacNell et al. 2014 could provide convincing evidence of bias in student evaluations of teaching, if preregistered, sufficiently powered, and with instructors blinded to condition. Even better would be for the research to be conducted or supervised by a team of researchers in an adversarial collaboration.

An interesting research design has been to analyze evaluations to assess whether the sub-items that predict an overall evaluation differ for female instructors and male instructors. However, it's not clear to me that such differences would be a bias that should lead a department or university to not use student evaluations of teaching in employment decisions, unless these biases manifest in the student responses to the items and not only in the correlations among responses to the items.
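For readers unfamiliar with that design, here is a hypothetical R sketch of the sort of analysis described above; the variable names and the simulated data are made up for illustration:

# Simulated data: a binary instructor-gender indicator and two sub-item ratings
set.seed(1)
n <- 500
d <- data.frame(female  = rbinom(n, 1, 0.5),
                clarity = rnorm(n),
                warmth  = rnorm(n))
d$overall <- 0.6*d$clarity + 0.2*d$warmth + rnorm(n)
# The clarity:female and warmth:female interaction terms test whether the sub-items
# predict the overall evaluation differently for female instructors and male instructors
summary(lm(overall ~ (clarity + warmth)*female, data=d))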

---

I think that student evaluations of teaching are a useful tool when used properly, but, if I were opposed to their use in employment decisions or otherwise, I don't know that I would focus on the claims of gender and race biases, such as "Mounting evidence of favouritism towards white male instructors doesn't dissuade universities". I think that it's correct that any gender or race bias would be smaller than a "beauty premium" favoring attractive instructors (e.g., Lombardo and Tocci 1979, Hamermesh and Parker 2005, Wallisch and Cachia 2018).

---

Let me end the post with a discussion of the Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" list of 76 studies "finding bias" that I have been working through pursuant to a suggestion by a Holman et al. coauthor. This list was presented by a Holman et al. 2019 coauthor in a widely-distributed tweet as a "list of 76 articles demonstrating gender and/or racial bias in student evaluations".

But the Holman et al. 2019 list has important flaws. The list is incomplete, and it is not clear that omitted studies are missing at random. For example, the Feldman 1992 review of experimental studies indicates that "Most of the laboratory studies reviewed here (and summarized in the Appendix) found that male and female teachers did not differ in college student's overall evaluation of them as professionals..." (emphasis in the original), and the Feldman 1993 review of non-experimental studies indicates that "a majority of studies have found that male and female college teachers do not differ in the global ratings they receive from their students" and that "when statistically significant differences are found, more of them favor women than men". This is not the impression provided by the Holman et al. 2019 list, which includes over fifteen pre-1991 "finding bias" studies, one pre-1991 "bias favoring women" study, and zero pre-1991 "no gender or race bias" studies.

Moreover, the Holman et al. 2019 list has at least nine studies that are essays or reviews that present no novel non-anecdotal data (Schuster and Van Dyne 1985, Feldman 1992, Feldman 1993, Andersen and Miller 1997, Baldwin and Blattner 2003, Huston 2006, Laube et al. 2007, Spooren et al. 2013, and Stark and Freishtat 2014), has at least five studies that might not be properly classified as being about student evaluations of teaching (Brooks 1982, Heilman and Okimoto 2007, El-Alayli et al. 2018, Drake et al. 2019, and Piatak and Mohr 2019), has at least three studies that might be better classified as duplicates of other entries (Boring 2015, Boring et al. 2016, and "Freiderike" et al. 2017), has at least one study that might be better classified as "no bias" or as "bias favoring women" (Basow and Montgomery 2005), and has at least three studies that might be better not being listed as finding bias if the bias is intended to be a bias in favor of straight white men or some combination thereof (Greenwald and Gillmore 1997, Uttl et al. 2017, and Hessler et al. 2018).

It's also worth noting that a nontrivial percentage of studies listed as "finding bias" are at least 20 years old: 6 studies from before 1980, 17 studies from before 1990, and 28 studies (37%) from before 1999.

I think that there is a public benefit to a list of studies assessing bias in student evaluations of teaching, but the list should at least be representative or be accompanied by an indication of evidence that the list is representative. For what it's worth, on November 26, I tweeted to two Holman et al. 2019 coauthors a link to a post listing errors in one of the 76 summaries; the errors were still there on December 16.


Let's continue our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias". See here for the first entry in the series and here for other entries.

This post should take us through all but three of the remaining Holman et al. 2019 "finding bias" entries.

---

For my numbering, I had the Huston 2006 review listed as both #10 and #16, so I am replacing my #16 with another review/essay-type publication: Sandler 1991 "Women Faculty at Work in the Classroom: Or, Why It Still Hurts To Be a Woman in Labor". Here is a sample from Sandler 1991 (p. 13):

Humor is a good way to handle some issues. If students call you Ms. or Mrs. or Miss, you can jokingly say "Oops, I've lost my professorship (or doctorate) again."

The above quote is the full text of one of the bullet points from Sandler 1991, but Sandler has more nuanced advice on her website (footnote omitted):

Humor is a good way to handle some issues, partly because it indicates that you are not taking what is happening as a personal affront, i.e., humor can be a way of showing strength because it shows that you are in charge. For example, if students call you Ms. or Mrs. or Miss, you can jokingly say, "Oops, I've lost my professorship (or doctorate) again." Although this works well with people who are comfortable using humor, it carries the risk of backfiring by putting the faculty member into a "joking match" with students.

I'm not sure how "Oops, I've lost my professorship (or doctorate) again" indicates that you are not taking what is happening as a personal affront. For what it's worth, I think that instructors who want to have such a discussion about a student's non-use of a title such as "Dr." should start that discussion with "I think that you should refer to me as 'Dr' because...".

Regardless, I don't see anything in Sandler 1991 that can be cited as novel evidence that student evaluations of teaching are biased.

---

43.

Kaschak 1981 "Another Look at Sex Bias in Students Evaluations of Professors" reports data from 40 male undergraduates and 40 female undergraduates in an experiment that had target instructors who were male or female and who were in one of six fields. Students rated instructors on six items; across seven tests per item (involving the sex of the student, the sex of the instructor, the teaching field, and combinations of these factors), 6 of the 42 (14%) potential comparisons produced a p-value less than 0.05.

Main effects for instructor sex appeared for "powerful" and "effective" (favoring the male instructor) but did not appear for "concerned", "likeable", "excellent", or "would take this course".

I don't think that a publication from 1981 should be used to inform policy about the use of student evaluations of teaching in 2020 or beyond, but I think it would be a good idea for student evaluations of teaching to not ask students to rate how "powerful" the instructor is or appeared to be.

---

44.

Statham et al. 1991 Gender and University Teaching: A Negotiated Difference is a book that reports on a study that involved interviews with professors, observations of professors, and surveys of students for 167 professors at a large Midwestern state university: 31 female professors and 57 male professors from male-dominated departments (80%+ men) and 40 female professors and 39 male professors from departments that were not male-dominated.

Results indicated several gender differences in teaching style: among other things and controlling for factors such as class size and course level, the women professors tended to spend more class time involving students than did the men professors (Table 3.2), the women professors offered more positive evaluations and more negative evaluations of students than did the men professors (Table 4.2), and the women professors offered more personalizations such as personal statements about themselves than did the men professors (Table 5.1). Moreover, there were no gender differences at p<0.05 in students challenging professors (Table 4.1).

Table 6.3 reports results from student evaluations for six competence-related items and five likability items, with the item scales ranging from 1 for strongly agree to 5 for strongly disagree. None of the 11 items had a p<0.05 difference between the mean for the men professors and the mean for the women professors.

Some differences between women professors and men professors appeared in associations of student evaluations by sex for instructional activities (Table 6.4), authority management techniques (Table 6.5), and personalizing activities (Table 6.6). For example, acknowledgement of student contributions positively associated at p<0.001 with likability among women professors but negatively associated at p<0.001 with likability among men professors.

Some of the patterns might be due to gender bias among students. For example, Table 6.4 indicates that students' solicitation of information negatively associated at p<0.001 with competence ratings of women professors but did not associate at p<0.05 with competence ratings for men professors. However, even presuming that the p-value for this difference is less than 0.05, for all we know students' solicitation of information truly did negatively associate with competence among women professors but not among men professors.

The finding of gender differences in teaching style, if correct, is relevant for studies of bias in student evaluations of teaching that do not sufficiently control for teaching style.

---

45.

Freeman 1994 "Student Evaluations of College Instructors: Effects of Type of Course Taught, Instructor Gender and Gender Role, and Student Gender" reports results from two experiments with undergraduates from introduction to psychology classes. In Experiment 1, each student responded to descriptions of three female instructors or descriptions of three male instructors; there was no main effect for instructor gender, but, with regard to gender role, androgynous instructors (mean of 5.99) were rated more effective than feminine instructors (4.47) or masculine instructors (4.40). Experiment 2 did not concern student evaluations of teaching.

---

46.

Basow 1995 "Student Evaluations of College Professors: When Gender Matters" reports on evaluations from students of 37 female faculty and 99 male faculty at a private undergraduate institution over four semesters, with a more detailed discussion of results from Fall 1986 (5,403 evaluations) and Spring 1990 (5,216 evaluations). Discussing the results, Basow 1995 indicates that (p. 622):

It can be argued that the effect sizes of the gender variables are so small, individually accounting for only .5% to 4% of the variance in the instructor ratings, as to be negligible.

I think that Holman et al. 2019 reversed the summaries for Basow and Silberg 1987 and Basow 1995.

---

47.

Burns-Glover and Veith 1995 "Revisiting Gender and Teaching Evaluations: Sex Still Makes a Difference" reported results from an experiment involving 78 undergraduates from a small liberal arts college who were asked to indicate which personality characteristics the university should look for in a tenure-track candidate referred to as [Sam/Sarah/Dr.] Larson. Students were asked to rate on a 0-to-6 scale 52 traits such as rational, daring, soft, and jolly. For the next page of the questionnaire, students were asked to rate on a 0-to-6 scale 25 behaviors such as "is an expert in field of study".

Table 1 lists the 20 traits that had a mean above the midpoint of the scale: 13 of these traits were male-typed and 7 were female-typed, but I don't see in the article an indication of the total number of male-typed traits or the total number of female-typed traits. Three of the traits were rated higher at p<0.05 for Sarah than for Sam or Dr.: self-confident, stable, and steady; there were no other traits listed in Table 1 with a main effect for the Sarah/Sam/Dr. manipulation.

Responses about behaviors were analyzed with a discriminant analysis. For instructor gender, the modal category limited to Sarah or Sam was 51 percent. However, knowing the full set of ratings for a given student permitted correct classification of the rated instructor 91 percent of the time (39 of 43). The authors make an argument that the discriminant analysis results for "Dr." suggest that students perceived "Dr. Larson" to be male; that might be true, but "Sam" is an unfortunate choice to signal instructor sex, given that "Sam" can be short for Samantha.
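For readers unfamiliar with the method, here is a hypothetical R sketch of that sort of discriminant analysis; the behavior ratings and the simulated data are made up for illustration:

library(MASS)
# Simulated data: each row is a student's ratings of the target instructor on three behaviors
set.seed(1)
n <- 86
d <- data.frame(instructor = factor(rep(c("Sarah", "Sam"), each = n/2)),
                expert     = rnorm(n),
                organized  = rnorm(n),
                supportive = rnorm(n))
fit <- lda(instructor ~ expert + organized + supportive, data = d)
# In-sample classification rate, analogous to the 91 percent figure mentioned above
mean(predict(fit)$class == d$instructor)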

The data are interesting, but I don't know that it would matter if student responses truly differed as indicated in this study, if these differences did not translate into a bias detectable in student evaluations of teaching.

---

48.

Andersen and Miller 1997 "Gender and Student Evaluations of Teaching" is a review. Here is a quote that Andersen and Miller 1997 relayed (p. 127) from Sandler 1991 (not the version published in Communication Education but another version; see this link for the quote):

One male student continually objected to many of the statements made by a woman faculty member in her class. He would call out comments such as "That doesn't make sense," "I disagree with that," and similar statements in response to the professor's substantive remarks. She recognized that his comments were not related to the substance of her statements when the following occurred: the faculty member, using her own experience as a teaching example, began to state that she had been at a supermarket and the male student immediately interrupted to call out "That's not true." (Sandler 1991, pp. 5).

Maybe that happened, and maybe that was due to gender bias.

---

49.

Baker and Copp 1997 "Gender Performance Matters Most: The Interaction of Gendered Expectations, Feminist Course Content and Pregnancy in Students' Course Evaluations" reports an analysis of data from 1992 (student evaluations N=245 across three terms) about "the complexities of students' gendered expectations by considering what happens when a woman professor (Dr. Phyllis Baker, coauthor) teaches a course with controversial (feminist) content and then becomes pregnant" (p. 29).

Maybe something here generalizes to other situations in 2020 and beyond, and, if so, it would be nice to have updated research on that.

---

50.

Greenwald and Gillmore 1997 "No Pain, No Gain? The Importance of Measuring Course Workload in Student Ratings of Instruction" concerns grading leniency and student evaluations, and presents no evidence "demonstrating gender and/or racial bias in student evaluations".

---

51.

Moore 1997 "Student Resistance to Course Content: Reactions to the Gender of the Messenger" discusses resistance that a female instructor received and her attempts to address this resistance. Here is a sample (p. 130):

I asked the class how they would have responded if the male professor was teaching this course with the same primary textbook with the word "feminist" in the title. A female student reported that a male professor then would be speaking against males, so no favoritism would be apparent, whereas a female relaying the same message would be favoring "girls." The recurrent theme was that I am necessarily self-interested when I teach about family, especially when I present any data or theory that may be interpreted as favorable to women or unfavorable to men. I sensed that students felt that the male professor was not only more objective and scientific in his approach, but also that he speaks for everyone. I, on the other hand, am only capable of proffering "the women's" perspective.

This does not appear to be a correct reading of the female student's statement, which appears to make the point that a member of Group X speaking against Group X is properly treated differently than a member of Group Y speaking against Group X. The question of fairness would thus involve whether a woman speaking against men is treated the same as a man speaking against women.

I'm not sure what evidence of bias in student evaluations this publication offers. The publication notes that "While women's studies courses earn high evaluations, instructors of those courses are often attacked personally in those evaluations, accused of bias and of hating men" (p. 132). But if student evaluations of teaching aren't used to compare instructors of women's studies courses to instructors of courses in different departments, then I'm not sure that this would be a bias that would influence employment outcomes.

---

52.

Basow 2000 "Best and Worst Professors: Gender Patterns in Students' Choices" reports responses from 61 female students and 47 male students to two open-ended items, in counterbalanced order: "Think of the best professor you've had in college...describe what made him or her the 'best', in your opinion" and the same item but with "worst" substituted for "best". Results included these findings: "...about twice as many male as female faculty were chosen as 'best' by this sample; this is proportional to the male/female ratio of faculty at the college", and "For choice of 'worst' professors, there was no student gender by faculty gender interaction. Students made their choice proportional to the number of male and female faculty they estimated they had had" (pp. 411-412).

This sounds similar to Basow et al. 2006 "Gender Patterns in College Students' Choices of Their Best and Worst Professors".

---

53.

Basow 2000 "Gender Dynamics in the Classroom" is from a book that isn't in my institutional library. Here's a description from Maricic et al. 2016 "Gender Bias in Student Assessment of Teaching Performance" (p. 138):

Interesting studies were conducted by Basow (2000b) and Sprague and Massoni (2005). Namely, both studies asked students to depict their "best" and "worst" male and female teachers. Their results were conclusive. Traits of "best" female teachers were caring and nurturing, while the traits of "best" male teachers were funny and entertaining. On the other hand, when it comes to "worst" teachers, common traits for both genders were unorganised, unclear, indifferent, and rude. "Worst" female teachers were just the opposite of the "best" female teacher; they were characterized as rigid, mean, and unfair. Interestingly, "worst" male teachers were self-centred and unenthusiastic.

But Maricic et al. 2016 might have confused the two Basow 2000 citations. Neither "best professor" nor "best professors" shows up in a Google Books search of the Basow 2000 "Gender Dynamics in the Classroom" chapter.

---

54.

Sinclair and Kunda 2000 "Motivated Stereotyping of Women: She's Fine If She Praised Me but Incompetent If She Criticized Me" reports results from three studies.

In Study 1, 83 male undergraduates and 97 female undergraduates provided ratings regarding courses that the student had taken the prior term. Results indicated that "students' evaluations of female instructors are more dependent on the grades they have received from them than are their evaluations of male instructors" (p. 1333); for example, using the first course that a student rated, the drop-off in mean evaluations from a high grade to a low grade was 6.22 for male instructors and was 20.54 for female instructors, with female instructors rated lower than male instructors in the low grade group but with no p<0.05 difference in the high grade group.

In Study 2, 54 male undergraduates watched a male confederate manager or a female confederate manager provide positive feedback or negative feedback on the participant's responses to an interpersonal skills questionnaire. The pattern from Study 1 conceptually replicated in Study 2: no p<0.05 gender difference in ratings of the manager's skill in the positive feedback condition, a p<0.05 difference in ratings of the manager's skill in the negative feedback condition, and a p<0.05 difference in the difference. For participants' rating of manager competence, only a main effect appeared at p<0.05.

The Study 2 p<0.05 differences in ratings of manager skill based on 54 participants spread across four conditions suggest a large effect size: the "penalty" for negative feedback was d=0.66 for the male managers and was d=2.05 for the female managers.
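As a rough benchmark for how large a difference needs to be to reach p<0.05 with cells of that size, here is a minimal pwr sketch; the assumption of roughly 13 participants per cell is mine, based on 54 participants spread across four conditions:

library(pwr)
# Effect size that would give only a coin-flip (50 percent) chance of p < 0.05
# with roughly 13 participants per cell
pwr.t.test(n=13, power=0.5, sig.level=0.05)   # d is roughly 0.8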

Study 3 involved participants from Study 2 paired with another participant, to assess whether the key patterns from Study 2 replicate when the positive feedback or the negative feedback is provided to another person. Results indicated no differences when participants rated the female manager or the male manager who had provided feedback to a person who was not the participant: "Male and female evaluators were given comparable ratings when providing negative feedback as well as when providing positive feedback" (p. 1339).

---

55.

Arbuckle and Williams 2003 "Students' Perceptions of Expressiveness: Age and Gender Effects on Teacher Evaluations" reports results from an experiment involving a computer-generated stick figure lecturer with a female voice that in prior research had been attributed by students about equally to a man or a woman. The 352 student participants in the present study watched the 35-minute video that had the stick figure lecturer interspersed, and the students then completed an evaluation that indicated to participants that the professor was a [male/female] [under age 35/over age 55].

The lecturer in the male condition was rated higher on 4 of the 9 dependent variables for which results were reported: enthusiasm, felt accepted, meaningful voice tone, and showed interest. The 5 remaining dependent variables were: precise teaching, logical and organized, seemed conscientious, used scientific terminology, and relaxed and confident. Table II indicated that the higher ratings in the male condition were nearly entirely attributable to the male under age 35 condition.

My calculations indicated that the difference between the mean for the male condition and the mean for the female condition was 0.31 on the 6-point scale and that the pooled standard deviation was about 1.33 for the male condition and 1.38 for the female condition, so the 0.31 difference was about 0.23 standard deviations. The p-value was 0.03 for a t-test for a difference in the mean rating in the male condition compared to the mean rating in the female condition, presuming that participants were equally assigned to the male condition and the female condition (ttesti 176 0 1.33 176 0.31 1.38).
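Here is an R equivalent of that Stata ttesti command, again presuming that participants were equally assigned to the male condition and the female condition:

# Two-sample t-test from summary statistics: a 0.31 difference in means,
# SDs of 1.33 and 1.38, and 176 participants per condition
se <- sqrt(1.33^2/176 + 1.38^2/176)
t  <- 0.31/se
2*pt(-abs(t), 2*176 - 2)   # roughly p = 0.03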

The evaluations had 12 items, but results were not reported for the 3 of the 12 dependent variables that had "negative skews and a heterogeneity of variance" (p. 510). The appendix indicated that the three dropped items were: "Presented scientific principles rather than opinions", "Helped me to gain a broader understanding", and "Appeared to know and understand the subject well". Those seem like important items, and it's odd and a bit suspicious that these 3 items were dropped from the analysis; in real life, faculty evaluations do not exclude student evaluation items merely because those items have a heterogeneity of variance.

So we are left with no evidence of a gender difference on important items such as "logical and organized" and "appeared to know and understand the subject well", but with evidence of a gender difference on the less important items of enthusiasm, felt accepted, meaningful voice tone, and showed interest. It would be fine with me if student evaluations of teaching did not include a "meaningful voice tone" item.

Based on this 1995 AP story, there appears to be an unpublished paper by the co-authors providing more experimental evidence of bias in student evaluations.

---

56.

Ewing et al. 2003 "Prejudice against Gay Male and Lesbian Lecturers" reports results from a study involving 261 introduction to psychology students. Student participants were given a curriculum vitae for the guest lecturer, with the curriculum vitae signaling or not signaling that the instructor was gay or lesbian. The assignment was not purely random, and was based on which side of the room the student sat; moreover, students had not been randomly assigned to the courses, so the students were non-randomly placed into the female instructor or male instructor condition. The guest lecturer then gave a lecture that was intended to be strong (animated or direct) or weak (dry and indirect); the lecture topic concerned advanced studies and careers related to psychology.

Ratings were lower for the weak lecture than for the strong lecture, but results indicated no main effect for lecturer sex or sexual orientation. However, results did indicate an interaction that was the opposite of that which would be expected if students used the weak lecture to derogate gay and lesbian lecturers: "after a strong lecture, students rated acknowledged gay male and lesbian lecturers more negatively than lecturers of unspecified sexual orientation; but after a weak lecture, students rated acknowledged gay male and lesbian lecturers more positively than lecturers of unspecified sexual orientation" (p. 576).

---

57.

Anderson and Smith 2005 "Students' Preconceptions of Professors: Benefits and Barriers According to Ethnicity and Gender" reports on an experiment involving 633 undergraduate responses based on a syllabus for a Race, Gender and Inequality course, with variation in instructor gender, instructor ethnicity, and instructor teaching style. From what I can tell, there were no main effects detected for instructor gender or instructor ethnicity in terms of instructor warmth, instructor capability, or instructor political bias.

I'll put the other two Anderson publications next...

---

58.

Anderson 2010 "Students' Stereotypes of Professors: An Exploration of the Double Violations of Ethnicity and Gender" reports on an experiment involving 594 undergraduate responses based on a course syllabus with variation in instructor gender, instructor ethnicity, instructor teaching style, and course taught. The most relevant outcome variable is student responses about the instructor's professional competence, for which "there were no significant effects associated with this analysis" (p. 466). The abstract indicates that "Women professors were viewed as more warm than men professors even though their course syllabuses were identical", but that does not seem like an important finding for deciding whether to use student evaluations of teaching in employment decisions.

---

59.

Anderson and Kanner 2011 "Inventing a Gay Agenda: Students' Perceptions of Lesbian and Gay Professors" reports on an experiment involving 622 undergraduate responses based on a syllabus for a Psychology of Human Sexuality course with variation in instructor gender, instructor sexual orientation, instructor professor political ideology, and typographical errors. Results did not indicate a difference in perceived competence for gay/lesbian instructors compared to heterosexual instructors and did not indicate an interaction of instructor sexual orientation with the presence of errors in the syllabus. But results did indicate that students rated gay/lesbian instructors as more politically biased than students rated heterosexual instructors, with a difference equivalent to 0.24 standard deviations based on summary statistics on page 1548.

Anderson and Kanner 2011 reports a second study (N=545) that had the same design as the first study but included measures for homonegativity, with results indicating, among other things, that "Among modern homonegatives, lesbian/gay professors (M = 3.07, SD = 0.09) were viewed as more politically biased than were heterosexuals (M = 2.52, SD = 0.08)..." (p. 1556). About half of the students were classified as non-homonegative, and it's not clear to me from the article whether, overall, students rated gay/lesbian instructors as more politically biased than students rated heterosexual instructors, which would replicate a key finding in the first study.

It's possible that, for a Psychology of Human Sexuality course, students on average unfairly rate gay or lesbian instructors as more politically biased than students rate heterosexual professors. But unless there is a "political bias" item on student evaluations of teaching or unless this unfairness manifests in other ratings, I don't know how relevant that unfairness is for assessing whether to use student evaluations of teaching in employment decisions.

---

60.

Basow and Montgomery 2005 "Student Ratings and Professor Self-Ratings of College Teaching: Effects of Gender and Divisional Affiliation" reports on results from evaluations of 23 male professors and 20 female professors at a small liberal arts college, rated by 407 male students, 365 female students, and 31 students who did not report their gender. Professors were at the assistant professor rank or higher, taught 100- or 200-level courses, and were matched by division. Evaluations were conducted between weeks 7 through 12 of a 14-week semester.

Based on my read of the results of this study, it's not clear to me why Holman et al. 2019 would classify this study as "finding bias", given that the "finding bias" category is presented as finding bias against women or nonwhites (Holman et al. has a separate category for "bias favoring women"). Here is the Holman et al. 2019 summary of Basow and Montgomery 2005 related to gender bias in student evaluations:

Based on the results of the study, Basow and Montgomery concluded that professor gender and divisional affiliation (department/field of study) contributed to the results of student evaluations. Female professors were rated higher than male professors on two interpersonal factors and on scholarship, and natural science courses were rated the lowest for most factors. Humanities professors received the highest overall ratings however, male professors in the humanities received lower ratings than female professors.

---

61.

DiPietro and Faye 2005 "Online Student-Ratings-of-Instruction (SRI) Mechanisms for Maximal Feedback to Instructors" is an unpublished paper that I didn't locate. Here is a summary from Smith 2009 "Student Ratings of Teaching Effectiveness for Faculty Groups Based on Race and Gender" (p. 617):

Of the three groups of faculty (Hispanic, Asian-American, and White) included in the DiPietro and Faye (2005) study, Hispanic faculty received the lowest course evaluation ratings. Asian-American faculty received slightly better course evaluations than their Hispanic colleagues, but their scores were still lower than the scores of White faculty. The number of African-American faculty in DiPietro and Faye study was too small to draw any conclusions.

I'm not sure why Smith 2009 isn't in the Holman et al. 2019 list.

---

62.

Hammermesh and Parker 2005 "Beauty in the Classroom: Instructors' Pulchritude and Putative Pedagogical Productivity" reports an analysis of student evaluations of instructors at the University of Texas at Austin from 2000 to 2002. Results in Table 3 indicate that instructor beauty associates with higher evaluations, that male instructors are rated higher than female instructors, and that the association of instructor beauty with evaluations is stronger for male faculty than for female faculty. The analysis included only predictors for these instructor or course characteristics: beauty, sex, minority status, non-native English status, tenure-track status, lower division course, and one-credit course.

---

63.

Abel and Meltzer 2007 "Student Ratings of a Male and Female Professors' Lecture on Sex Discrimination in the Workforce" reports results from an experiment in which psychology students, among other things, responded to ten items about a written lecture and the lecturer, with a variation in whether the lecture was attributed to Dr. Michael Smith or Dr. Mary Smith. Mean responses differed at p<0.05 for 3 of the 5 items about the lecture and 3 of the 5 items about the instructor.

So the experiment has internal validity. But the experiment also suffers from issues common to research published prior to the replication crisis, such as a lack of preregistration and a small sample size (43 in the male instructor condition and 44 in the female instructor condition) that provided only a 21 percent chance of detecting a 0.25 standard deviation difference. R code for that:

library(pwr)
# power of a two-sample t-test (n1 = 43, n2 = 44) to detect d = 0.25 at a two-sided alpha of 0.05
pwr.t2n.test(n1=43, n2=44, d=0.25, sig.level=0.05)

But some observed differences were much larger than 0.25 standard deviations, such as the 0.77 standard deviation difference for the item "The professor presented the information in a sexist light", so the experiment would be sufficiently powered (0.94) to detect a difference that large. And, again, it would be fine with me for student evaluations of teaching to not include an item such as "The professor presented the information in a sexist light".
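For reference, here is the corresponding power calculation for that larger difference, using the same pwr call as above (a quick sketch assuming a two-sided test at the 0.05 level):

library(pwr)
# power of the same two-sample t-test to detect a 0.77 SD difference
pwr.t2n.test(n1=43, n2=44, d=0.77, sig.level=0.05)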

The lecture concerned gender differences in occupation and ended with a claim that:

...current research shows that men and women equally employed in the same male-dominated position, with equal education, skills and credentials have different pay scales. Typically, men still receive higher salaries than women for doing the same things. This is largely the result of a historical male dominated society in the United States that still exists today.

So even accepting the key finding from the experiment, the finding is limited to psychology students rating a man instructor asserting that men have an unfair advantage higher than they rated a woman instructor asserting that men have an unfair advantage. There is no parallel experiment reported here comparing men instructors asserting that women have an unfair advantage to women instructors asserting that women have an unfair advantage, and there is no experiment reported here comparing men instructors to women instructors for a lecture not directly related to gender.

It's worth noting that the unrepresentative context of the experiment—a man instructor and a woman instructor asserting that men have an unfair advantage—sometimes gets lost such that the experiment is described in general terms, such as in Nittrouer et al. 2017, with citation 18 referring to Abel and Meltzer 2007:

Additional studies similarly show that female (versus male) teachers are rated more negatively (14–17). For example, participants who read a lecture, which was posited as having been written and delivered by a male or female professor, rated the lecture by the male (versus the female) professor significantly more positively (18).

---

Here's a passage from the discussion in Abel and Meltzer 2007 (p. 179):

Finally, we are currently designing an experimental study comparing student ratings for male and female professors presenting more mundane lecture information versus the more emotionally charged lecture as used in this study which could then examine the effects for type of lecture information related to sex of student and sex of professor.

I'm not sure what happened to that study.

---

64.

McPherson et al. 2009 "What Determines Student Evaluation Scores? A Random Effects Analysis of Undergraduate Economics Classes" reports on student evaluation data from economic courses at the University of North Texas from 1994 through 2005, using the four items that the Department of Economics has chosen. Reporting results separately for the introductory principles course and upper-level courses, main effects appeared for instructor sex in both models, although differences were small (0.094 units and 0.066 units on a 4-point scale), and a main effect appeared for White instructors in the upper-level courses (0.120 units).

Models contained controls for factors such as instructor age and instructor experience, but nothing that would credibly measure instructor effectiveness.

---

65.

Boring 2015 "Gender Biases in Student Evaluations of Teachers and Their Impact on Teacher Incentives" reports results for 22,665 observations made by 4,423 undergraduate students of 372 different teachers at a university in France.

The abstract indicates that: "The results of generalized ordered logit regressions and fixed-effects models suggest that male teachers tend to receive higher SET scores because of students' gender biases". But the evidence indicates only that this is a bias in the sense of difference.

The abstract indicates that: "Men are perceived as being more knowledgeable (male gender stereotype) and obtain higher SET scores than women, but students appear to learn as much from women as from men, suggesting that female teachers are as knowledgeable as men". But teaching evaluations aren't intended as measures of teaching effectiveness ("Student ratings have never been intended to serve as a proxy for learning", Linse 2017: 95) and, even if they were, it is an inferential leap from student final exam scores to teacher knowledge.

---

66.

Boring et al. 2016 "Student Evaluations of Teaching (Mostly) Do Not Measure Teaching Effectiveness" analyze data from MacNell et al. 2014 and Boring 2015.

The discussion of the Boring 2015 data in Boring et al. 2016 has added the term "natural experiment"; however, the research design did not hold the quality of instruction constant between female instructors and male instructors: "the natural experiment does not allow us to control for differences in teaching styles across instructors" (p. 7).

---

67.

I suspect that Boring 2017 "Gender Biases in Student Evaluations of Teaching" might be a substantially reworked version of Boring 2015 "Gender Biases in Student Evaluations of Teachers and their Impact on Teacher Incentives", given that, among other things, the study setup and findings are similar, the acknowledgements are similar, and the funding grant number (612413) is the same.

---

68.

Let's transition from the "natural" experiment of the Boring 2015 data from France to the "quasi-experimental" data on 19,952 student evaluations from a School of Business at a university in the Netherlands between 2009 and 2013, reported on in Mengel et al. 2017 "Gender Bias in Teaching Evaluations". Students were randomly assigned to section instructors. Controlling for factors such as student grades, female instructors were rated 0.21 standard deviations lower on instructor-related items (pp. 552-553) than male instructors were by male students and 0. 08 standard deviations lower by female students.

However, Table 7 indicates that these differences varied by instructor rank in a pattern that does not seem consistent with the differences being due to a straightforward student gender bias: male students rated female instructors lower than they rated male instructors among student instructors and Ph.D. student instructors but not among lecturers and professors; moreover, female students rated female instructors lower than they rated male instructors among student instructors but rated female instructors higher than they rated male instructors among lecturers and professors.

---

69.

I think the "Freiderike, M., J. Sauermann, & U. Zolitz" 2017 "Gender Bias in Teaching Evaluations" paper listed in Holman et al. 2019 became the Mengel et al. 2017 "Gender Bias in Teaching Evaluations" article: for example, data in the paper's Table 7 are identical to data in the article's Table 7. And "Freiderike" is the first name of the first author Friederike Mengel.

---

70.

Wagner et al. 2016 "Gender, Ethnicity and Teaching Evaluations: Evidence from Mixed Teaching Teams" reports data from student evaluations in a graduate school in the Netherlands from 2010 through 2015. The data include courses in which a male instructor and a female instructor co-taught the course.

Evaluations ranged from an observed low of 1.82 to an observed high of 5, with a 4.271 mean and a 0.443 standard deviation. Results in Table 4 indicate that female instructors were rated about 0.2 to 0.25 standard deviations lower than male instructors were rated, controlling for Caucasian status, whether the instructor was a course leader, and either the instructor age and age-squared (0.091 points, 0.05<p<0.10) or whether the instructor was new (0.110 points, 0.05<p<0.10). Some results in Table 5 include a control for publications, and some results in Table 6 interact instructor sex with the course type (e.g., Governance & development policy), but the Table 5 and Table 6 models don't control for instructor experience (age, or being new), even though sample sizes for the relevant base models in Tables 4 through 6 range from an N of 499 to an N of 688.
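For what it's worth, here is my back-of-the-envelope conversion of those coefficients into standard deviation units, assuming the quoted point estimates are on the same rating scale as the 0.443 standard deviation reported above:

# converting coefficients in rating points to standard deviation units
0.091 / 0.443  # roughly 0.21 standard deviations
0.110 / 0.443  # roughly 0.25 standard deviations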

---

71.

Rosen 2017 "Correlations, Trends, and Potential Biases among Publicly Accessible Web-Based Student Evaluations of Teaching: A Large-scale Study of RateMyProfessors.com Data" reported results from ratings on RateMyProfessors.com for instructors who had at least 20 ratings. Results indicated evidence at p<0.001 that male instructors were rated higher on clarity, helpfulness, and overall quality and were rated lower on easiness, but that these mean differences were trivial, respectively 0.05, 0.04, 0.04, and 0.03 units on a 1-to-5 scale.

Further results were reported controlling for the hot/not hot rating (p. 41):

Approximately, 22.7% of male faculty on RateMyProfessors are rated as 'hot', compared to 27.8% of female faculty. Since it has already been shown in Table 1 that perceived physical appearance correlates with evaluation scores, whether a professor is rated as 'hot' or 'not hot' should be controlled when analysing potential gender biases as well.

But I don't understand the reason for the "should be controlled" if the hot/not hot ratings are gender biased or at least could plausibly be gender biased.

---

72.

Drake et al. 2019 "Grading Teachers: Race and Gender Differences in Low Evaluation Ratings and Teacher Employment Outcomes" reports an analysis of performance ratings of preK-12 teachers in Michigan from 2011 to 2015. Results indicated that "male teachers and teachers of color were more likely to be labeled 'minimally effective' or 'ineffective' than their same-school peers even after conditioning on evaluators' prior judgments and value-added scores" (p. 1826).

I don't think that these were ratings made by students.

---

73.

Fan et al. 2019 "Gender and Cultural Bias in Student Evaluations: Why Representation Matters" reports results from student evaluations at a large public university in Australia. A discussion of results and limitations are indicated in this passage from the article (p. 14):

Throughout this paper, and in the title, we have used the term "bias" when describing the statistically significant effect females and non-English speaking teachers. It should be pointed out that one of the limitations of this study is that it is only able to show association, e.g., being female is associated with a lower SET score, we cannot say what really was the cause for a lower score. However, if SET is really measuring teaching quality, then the only plausible causes are either that females are generally bad teachers across a large population, or there's bias, the same argument can be made for teachers who have non-English speaking background. Since we find no credible support that females, or someone with an accent, should generally be bad teachers, we have chosen to use the term "bias".

The article indicates that "Around 80% of the scores are given at either 5 or 6 and our results suggest that bias comes in at this top level, between 'agree' and 'strongly agree'" (p. 8). Given that the difference occurs between two positive ratings, I don't think that it is correct for the authors to be concerned about finding credible support for the claim that women or persons with an accent are "bad" teachers.

---

I'll plan to use the next post to wrap up the review of the Holman et al. 2019 "finding bias" list.


Let's continue our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias". See here for the first entry in the series and here for other entries.

---

40.

Sidanius and Crane 1989 "Job Evaluation and Gender: The Case of University Faculty" reported on data for 9,005 undergraduates evaluating 254 male instructors and 147 female instructors. Controlling for factors such as the student's sex and GPA and the instructor's rank and broad field, male instructors received higher evaluations than did female instructors. However, the discussion provides several caveats (pp. 192-193):

...the fact that, in general, men were perceived as being more competent than women need not be a function of gender stereotyping or bias; it is quite possible that men are, in fact more competent in their teaching roles...[and]...Even if these differences [in evaluations] are a function of gender bias rather than perceptual accuracy, the differences are too small to play any major role in how men and women are evaluated.

Holman et al. 2019 lists this study as "finding bias".

---

41.

Feldman 1992 "College Students' Views of Male and Female College Teachers: Part I: Evidence from the Social Laboratory and Experiments" is a review of experimental studies of student evaluations of teaching. I'm not sure that Feldman 1992 should count as an independent publication finding bias. Moreover, the sense of the literature provided in Feldman 1992 is a bit in tension with that provided in the Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" list. Feldman 1992 includes studies up through 1990 and indicates the following (p. 342, emphasis in the original):

Most of the laboratory studies reviewed here (and summarized in the Appendix) found that male and female teachers did not differ in college student's overall evaluation of them as professionals (as indicated by students' perceptions of their overall teaching performance, their instructional ability, their effectiveness, and their competence and by whether or not students would take a course with them).

However, for studies from 1990 and earlier, the Holman et al. 2019 list includes zero "No Gender or Race Bias" entries and only one "Bias Favoring Women" entry, and the "Bias Favoring Women" entry isn't an experiment. Part of this tension might be due to the Holman et al. 2019 list being incomplete. Multiple studies listed in the Feldman 1992 references are not included in the Holman et al. 2019 list. For example, here is a passage from Mackie 1976 "Students' Perceptions of Female Professors" (p. 346, emphasis in the original):

Contrary to expectations, women teachers were perceived as more competent than male teachers in both task and socio-emotional spheres. Further, the males were not assigned a significantly higher prestige score.

And here is a passage from Basow and Distenfeld 1985 "Teacher Expressiveness: More Important for Males than Females?" (p. 51):

As other research has found (Elmore & LaPointe, 1974, 1975; Lombardo & Tocci, 1979), teacher sex did not have a main effect on student evaluations of teachers.

I have already discussed three relevant publications omitted from the Holman et al. 2019 list. The Feldman 1992 and 1993 reviews indicate that there are additional relevant studies omitted from the Holman et al. 2019 list.

---

42.

Feldman 1993 "College Students' Views of Male and Female College Teachers: Part II: Evidence from Students' Evaluations of Their Classroom Teachers" is a review of non-experimental studies of student evaluations of teaching. The abstract indicates (emphasis added):

Although a majority of studies have found that male and female college teachers do not differ in the global ratings they receive from their students, when statistically significant differences are found, more of them favor women than men. Across studies, the average association between gender and overall evaluation, while favoring women (average r = + .02), is so small as to be insignificant in practical terms.

This again raises the question of the representativeness of the Holman et al. 2019 list, at least for early studies. The Holman et al. 2019 list does not include Bausell and Magoon 1972 "Expected Grade in a Course, Grade Point Average, and Student Ratings of the Course and the Instructor". Here is the Feldman 1993 summary for the association between instructor sex and students' overall evaluation of the instructor in that study:

Bausell and Magoon (1972): 23 courses taught by women and 23 by men at the University of Delaware (excluding courses in the College of Economics, College of Nursing, and the Department of Secretarial Studies), academic year 1969-1970, matched on the semester course was taught, level of the course, and academic department within which the course was taught; single overall rating item ("Overall, how do you evaluate the instructor?").

The direction of the r of .03 (as derived from data on p. 171) and Z of 0.203 cannot be determined from information given.

The Holman et al. 2019 list does not include Brown 1976 "Faculty Ratings and Student Grades: A University-wide Multiple Regression Analysis". Here is the Feldman 1993 summary for the association between instructor sex and students' overall evaluation of the instructor:

Brown (1976): 2,360 course sections at the University of Connecticut, Spring semester of 1973; average score on the 8-item University of Connecticut Rating Scale for Instruction. r = + .04* (as given in Tables 2 and 3); Z = +1.943*; N = 2,360 section ratings.

For what it's worth, I think that it is acceptable and preferable to not include studies from the 20th century in a review of research on bias in student evaluations of teaching, if the purpose of the list is to be informative for assessing the handling of student evaluations of teaching in 2019 and beyond. But if Holman et al. 2019 includes 20th century studies, it would be nice to have some indication about whether the inclusion of studies is representative.

---

Comments are open if you disagree, but I don't think that there are novel data in these publications that indicate an unfair bias. Even if novel data in these publications did indicate an unfair bias, I think that the data would be too old to be relevant for discussions of whether student evaluations of teaching should be used in employment decisions made in 2019 or beyond. (I'm using "novel data" to refer to data reported for the first time, not to data presented again in reviews.)


Let's continue our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias". See here for the first entry in the series and here for other entries.

---

37.

Kierstead et al. 1988 "Sex Role Stereotyping of College Professors: Bias in Students' Ratings of Instructors" reported results for two experiments. The first experiment had 20 female college students and 20 male college students assigned to one of four conditions for a text description of a teaching situation: male/female professor and a professor who was/was not described a frequently spending time with students outside of class; the male professor received more favorable ratings than did the female professor (p<0.02).

The second experiment had 20 female college students and 20 male college students assigned to one of four conditions for a slide tape presentation of a lecture: male/female teacher and a teacher who was/was not smiling; the male teacher received more favorable ratings than did the female teacher (p<0.06).

Results in these two experiments were largely driven by the female target who was not sociable or who did not smile. Here are the means, on a scale from 1 for poor to 6 for outstanding:

5.3 for the sociable male professor

5.3 for the sociable female professor

5.4 for the unsociable male professor

4.4 for the unsociable female professor

---

4.1 for the smiling male teacher

4.2 for the smiling female teacher

4.5 for the unsmiling male teacher

3.3 for the unsmiling female teacher

Pooled standard deviations were about 0.65 for the "sociable" experiment conditions and about 0.9 for the "smiling" experiment conditions, so these are large differences between the sociable female and the unsociable female (about 1.4 standard deviations) and between the smiling female and the unsmiling female (about 1 standard deviation).
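Here is the arithmetic behind those standardized differences, using the means and pooled standard deviations quoted above (a quick R check, nothing more):

# sociable vs. unsociable female professor, in pooled standard deviation units
(5.3 - 4.4) / 0.65  # about 1.4 standard deviations
# smiling vs. unsmiling female teacher, in pooled standard deviation units
(4.2 - 3.3) / 0.9   # about 1.0 standard deviation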

---

38.

Buck and Tiene 1989 "The Impact of Physical Attractiveness, Gender, and Teaching Philosophy on Teacher Evaluations" reported results for 42 undergraduate education majors at a state university in the Midwest across eight conditions, with experimental manipulations for instructor sex, instructor attractiveness, and whether an authoritarian or humanistic teaching perspective was attributed to the instructor. I did not see an indication of whether the instructors were described as college instructors or K12 instructors, but the photographs were described as being for persons about 21 years old.

Students rated the instructors on eight items. Many of the interactions had a p-value under 0.05 for a given evaluation item, but there was no main effect for instructor attractiveness and only one main effect for instructor sex (female instructors rated higher than male instructors on overall effectiveness). There was a statistically significant difference for seven of the eight items for teaching perspective (see Tiene and Buck 1987 for a discussion of the teaching perspective results).

I'll quote in full the Holman et al. 2019 summary of Buck and Tiene 1989:

An experiment was conducted at a Midwestern state university. The sample size was composed of 42 undergraduate seniors, mostly Caucasian females; 10 of the 42 students were male and 3 were black (all females). The students were given 1 of 4 different photographs- attractive or unattractive teachers who were either male or female. Each photograph also included a description of the teachers' teaching style/philosophy, divided by either an authoritarian or humanistic style. The results of the study showed that attractiveness did not have an effect on the ratings of instructor effectiveness. The study also found that authoritarianism was strongly associated with negative evaluations. However, contradictingly attractive authoritarian females were rated significantly more positively than the other 3 possibilities of authoritarian instructor characteristics.

It's not clear to me why Buck and Tiene 1989 is in the Holman et al. 2019 list for "Finding Bias" when Holman et al. 2019 has separate lists for "Bias Favoring Women" and "No Gender or Race Bias". The bias in favor of attractive authoritarian females would, if anything, suggest filing under "Bias Favoring Women". Strictly speaking, it is a "bias" that students rated authoritarian teachers less favorably than humanistic teachers, but filing Buck and Tiene 1989 under "Finding Bias" for that reason would stretch the definition of "bias" to include legitimate reasons such as teaching philosophy for students to rate one instructor more favorably than another instructor. (And the implication of the three main Holman et al. 2019 categories is that the "Finding Bias" category is limited to race bias or gender bias disfavoring women).

---

39.

Dukes and Victoria 1989 "The Effects of Gender, Status, and Effective Teaching on the Evaluation of College Instruction" reported results from 144 undergraduates from four sociology courses and two political science courses. Each student was given a description of four scenarios of college teaching, with experimental manipulations that included the professor's sex (e.g., Carl Pierce or Carla Pierce), whether the professor was a department chair, and the presence or absence of a certain characteristic of the professor (knowledgeable, enthusiastic, rapport, and organized).

For predicting teacher effectiveness, results indicated no main effect of professor sex in any of the four scenarios (knowledgeable, enthusiastic, rapport, or organized). There were two reported interactions involving professor sex that, from what I can tell, were each limited to one of the four scenarios, such as instructor sex interacting with chair status in the organized scenario.

Dukes and Victoria 1989 is another publication that I'm not sure should be classified under "Finding Bias". The Feldman 1992 review of the literature (pp. 356-357) summarizes results from Dukes and Victoria 1989, indicating that only 2 of the 32 comparisons from Dukes and Victoria 1989 detected a statistically significant association and that neither of these 2 comparisons involved a main effect of instructor sex.

---

Comments are open if you disagree, but I don't think that data from the 1980s or earlier are relevant for discussions of whether student evaluations of teaching should be used in employment decisions made in 2019 or beyond.


Let's continue our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias". See here for the first entry in the series and here for other entries.

---

34.

Martin 1984 "Power and Authority in the Classroom: Sexist Stereotypes in Teaching Evaluations" reported on data from 240 female students and 154 male students across nine course at a large Midwestern university. The instructors were three male social scientists, three female social scientists, and three female women's studies faculty. From the article (p. 488):

I made no attempt to match up male and female faculty on the basis of equal teaching skills (an impossible task in any case). I assumed that, if there were statistically significant differences between the ratings of an instructor by male and female students, then I could attribute these differences to sex bias.

Here is a sample from the results section (p. 490):

Results showed that, when male students were evaluating female social science instructors, high ratings on teaching effectiveness were strongly associated with high ratings on friendliness (r2 = 51 percent), smiles (r2 = 61 percent), eye contact (r2 = 61.6 percent), confidence (r2 = 62.9 percent), and decisiveness (r2 = 72.6 percent). I found no similarly strong associations in male students' ratings of male instructors or of women's studies instructors. No strong associations between teaching effectiveness and personal traits appeared when I examined female students' ratings.

This is interpreted as a bias (p. 492):

...more than their male colleagues, female instructors are likely to have their competence judged by male students on the basis of personal characteristics associated with feminine behavior, such as friendliness, frequent eye contact, and regular smiles. The message to women seems clear: if your institution bases personnel decisions on student evaluations, make sure your colleagues are aware of the possibility of sex bias.

I'm not sure that this sort of bias favors men over women, if the results mean that women social science instructors can improve their student evaluations by smiling and being friendlier but men social science instructors can't improve their student evaluations by smiling and being friendlier.

---

35.

Schuster and Van Dyne 1985 "The Changing Classroom" is an essay that reports no novel data. I'm not sure that it is properly placed in a list of studies "finding bias".

---

36.

Basow and Silberg 1987 "Student Evaluations of College Professors: Are Female and Male Professors Rated Differently?" reported responses from 553 male students and 527 female students at a small private college. Of the 22 female professors at the college who had taught at the college full-time at least one year, 16 were matched to a male professor by rank, division, and years of experience at the college.

Results indicated, among other things, that "...male students gave female professors significantly (p < .05) less positive ratings than they gave male professors on all dependent measures" and that "female students rated female professors significantly more negatively than they rated male professors on Instructor-Individual Student Interaction (p < .01), Dynamism/Enthusiasm (p < .05), and overall teaching ability (p < .01)" (p. 310).

However, Basow and Silberg 1987 notes that "...male professors may in fact be better overall teachers than female professors" (p. 313). Moreover (p. 313):

...the magnitude of the effect sizes is quite small, indicating that sex of instructor and sex of student account for only a small percentage of the variance in student ratings. For example, the combination of knowledge of teacher sex and student sex can predict only about 4% of the variance in scores on overall teaching ability.

---

Comments are open if you disagree, but I don't think that data from the 1980s or earlier are relevant for discussions of whether student evaluations of teaching should be used in employment decisions made in 2019 or beyond.
