Tour of research on student evaluations of teaching [43-73]

Let's continue our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias". See here for the first entry in the series and here for other entries.

This post should take us through all but three of the remaining Holman et al. 2019 "finding bias" entries.

---

For my numbering, I had the Huston 2006 review listed as both #10 and #16, so I am replacing my #16 with another review/essay-type publication: Sandler 1991 "Women Faculty at Work in the Classroom: Or, Why It Still Hurts To Be a Woman in Labor". Here is a sample from Sandler 1991 (p. 13):

Humor is a good way to handle some issues. If students call you Ms. or Mrs. or Miss, you can jokingly say "Oops, I've lost my professorship (or doctorate) again."

The above quote is the full text of one of the bullet points from Sandler 1991, but Sandler has more nuanced advice on her website (footnote omitted):

Humor is a good way to handle some issues, partly because it indicates that you are not taking what is happening as a personal affront, i.e., humor can be a way of showing strength because it shows that you are in charge. For example, if students call you Ms. or Mrs. or Miss, you can jokingly say, "Oops, I've lost my professorship (or doctorate) again." Although this works well with people who are comfortable using humor, it carries the risk of backfiring by putting the faculty member into a "joking match" with students.

I'm not sure how "Oops, I've lost my professorship (or doctorate) again" indicates that you are not taking what is happening as a personal affront. For what it's worth, I think that instructors who want to have such a discussion about a student's non-use of a title such as "Dr." should start that discussion with "I think that you should refer to me as 'Dr.' because...".

Regardless, I don't see anything in Sandler 1991 that can be cited as novel evidence that student evaluations of teaching are biased.

---

43.

Kaschak 1981 "Another Look at Sex Bias in Students Evaluations of Professors" reports data from 40 male undergraduates and 40 female undergraduates in an experiment that had target instructors who were male or female and who were in one of six fields. Students rated instructors on six items. Crossing these six items with seven tests (involving the sex of the student, the sex of the instructor, the teaching field, and combinations of these factors) yields 42 potential comparisons, of which 6 (14%) produced a p-value less than 0.05.
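
As a rough benchmark for how surprising 6 of 42 is, here is a minimal Python sketch that computes the number of p<0.05 results expected by chance and the binomial probability of at least 6 such results. It assumes that the 42 tests are independent and that every null hypothesis is true; the tests in Kaschak 1981 share items and participants, so the independence assumption almost certainly does not hold, and the output is only a back-of-the-envelope figure. The function and variable names are mine, not from the article.

```python
from math import comb


def prob_at_least(k, n, alpha):
    """P(X >= k) for X ~ Binomial(n, alpha)."""
    return sum(comb(n, i) * alpha**i * (1 - alpha) ** (n - i) for i in range(k, n + 1))


n_tests = 42   # six items x seven tests
n_hits = 6     # comparisons reported with p < 0.05
alpha = 0.05   # significance threshold

# Assumption: independent tests, all nulls true (unlikely to hold exactly here)
expected_by_chance = n_tests * alpha
share_observed = n_hits / n_tests
p_six_or_more = prob_at_least(n_hits, n_tests, alpha)

print(f"Expected by chance: {expected_by_chance:.1f} of {n_tests}")
print(f"Observed share: {share_observed:.0%}")
print(f"P(at least {n_hits} by chance, if independent): {p_six_or_more:.3f}")
```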

Main effects for instructor sex appeared for "powerful" and "effective" (favoring the male instructor) but did not appear for "concerned", "likeable", "excellent", or "would take this course".

I don't think that a publication from 1981 should be used to inform policy about the use of student evaluations of teaching in 2020 or beyond, but I think it would be a good idea for student evaluations of teaching to not ask students to rate how "powerful" the instructor is or appeared to be.

---

44.

Statham et al. 1991 Gender and University Teaching: A Negotiated Difference is a book that reports on a study that involved interviews with professors, observations of professors, and surveys of students for 167 professors at a large Midwestern state university: 31 female professors and 57 male professors from male-dominated departments (80%+ men) and 40 female professors and 39 male professors from departments that were not male-dominated.

Results indicated several gender differences in teaching style: among other things and controlling for factors such as class size and course level, the women professors tended to spend more class time involving students than did the men professors (Table 3.2), the women professors offered more positive evaluations and more negative evaluations of students than did the men professors (Table 4.2), and the women professors offered more personalizations such as personal statements about themselves than did the men professors (Table 5.1). Moreover, there were no gender differences at p<0.05 in students challenging professors (Table 4.1).

Table 6.3 reports results from student evaluations for six competence-related items and five likability items, with the item scales ranging from 1 for strongly agree to 5 for strongly disagree. None of the 11 items had a p<0.05 difference between the mean for the men professors and the mean for the women professors.

Some differences between women professors and men professors appeared in associations of student evaluations by sex for instructional activities (Table 6.4), authority management techniques (Table 6.5), and personalizing activities (Table 6.6). For example, acknowledgement of student contributions positively associated at p<0.001 with likability among women professors but negatively associated at p<0.001 with likability among men professors.

Some of the patterns might be due to gender bias among students. For example, Table 6.4 indicates that students' solicitation of information negatively associated at p<0.001 with competence ratings of women professors but did not associate at p<0.05 with competence ratings for men professors. However, even presuming that the p-value for this difference is less than 0.05, for all we know students' solicitation of information truly did negatively associate with competence among women professors but not among men professors.

The finding of gender differences in teaching style, if correct, is relevant for studies of bias in student evaluations of teaching that do not sufficiently control for teaching style.

---

45.

Freeman 1994 "Student Evaluations of College Instructors: Effects of Type of Course Taught, Instructor Gender and Gender Role, and Student Gender" reports results from two experiments with undergraduates from introduction to psychology classes. In Experiment 1, each student responded to descriptions of three female instructors or descriptions of three male instructors; there was no main effect for instructor gender, but, with regard to gender role, androgynous instructors (mean of 5.99) were rated more effective than feminine instructors (4.47) or masculine instructors (4.40). Experiment 2 did not concern student evaluations of teaching.

---

46.

Basow 1995 "Student Evaluations of College Professors: When Gender Matters" reports on evaluations from students of 37 female faculty and 99 male faculty at a private undergraduate institution over four semesters, with a more detailed discussion of results from Fall 1986 (5,403 evaluations) and Spring 1990 (5,216 evaluations). Discussing the results, Basow 1995 indicates that (p. 622):

It can be argued that the effect sizes of the gender variables are so small, individually accounting for only .5% to 4% of the variance in the instructor ratings, as to be negligible.

I think that Holman et al. 2019 reversed the summaries for Basow and Silberg 1987 and Basow 1995.

---

47.

Burns-Glover and Veith 1995 "Revisiting Gender and Teaching Evaluations: Sex Still Makes a Difference" reported results from an experiment involving 78 undergraduates from a small liberal arts college who were asked to indicate which personality characteristics the university should look for in a tenure-track candidate referred to as [Sam/Sarah/Dr.] Larson. Students were asked to rate on a 0-to-6 scale 52 traits such as rational, daring, soft, and jolly. For the next page of the questionnaire, students were asked to rate on a 0-to-6 scale 25 behaviors such as "is an expert in field of study".

Table 1 lists the 20 traits that had a mean above the midpoint of the scale: 13 of these traits were male-typed and 7 were female-typed, but I don't see in the article an indication of the total number of male-typed traits or the total number of female-typed traits. Three of the traits were rated higher at p<0.05 for Sarah than for Sam or Dr.: self-confident, stable, and steady; there were no other traits listed in Table 1 with a main effect for the Sarah/Sam/Dr. manipulation.

Responses about behaviors were analyzed with a discriminant analysis. For instructor gender, the modal category limited to Sarah or Sam was 51 percent. However, knowing the full set of ratings for a given student permitted correct classification of the rated instructor 91 percent of the time (39 of 43). The authors make an argument that the discriminant analysis results for "Dr." suggest that students perceived "Dr. Lawson" to be male; that might be true, but "Sam" is an unfortunate choice to signal instructor sex, given that "Sam" can be short for Samantha.

The data are interesting, but I don't know that it would matter if student responses truly differed as indicated in this study, if these differences did not translate into a bias detectable in student evaluations of teaching.

---

48.

Andersen and Miller 1997 "Gender and Student Evaluations of Teaching" is a review. Here is a quote that Andersen and Miller 1997 relayed (p. 127) from Sandler 1991 (not the version published in Communication Education but another version; see this link for the quote):

One male student continually objected to many of the statements made by a woman faculty member in her class. He would call out comments such as "That doesn't make sense," "I disagree with that," and similar statements in response to the professor's substantive remarks. She recognized that his comments were not related to the substance of her statements when the following occurred: the faculty member, using her own experience as a teaching example, began to state that she had been at a supermarket and the male student immediately interrupted to call out "That's not true." (Sandler 1991, pp. 5).

Maybe that happened, and maybe that was due to gender bias.

---

49.

Baker and Copp 1997 "Gender Performance Matters Most: The Interaction of Gendered Expectations, Feminist Course Content and Pregnancy in Students' Course Evaluations" reports an analysis of data from 1992 (student evaluations N=245 across three terms) about "the complexities of students' gendered expectations by considering what happens when a woman professor (Dr. Phyllis Baker, coauthor) teaches a course with controversial (feminist) content and then becomes pregnant" (p. 29).

Maybe something here generalizes to other situations in 2020 and beyond, and, if so, it would be nice to have updated research on that.

---

50.

Greenwald and Gillmore 1997 "No Pain, No Gain? The Importance of Measuring Course Workload in Student Ratings of Instruction" concerns grading leniency and student evaluations, and presents no evidence "demonstrating gender and/or racial bias in student evaluations".

---

51.

Moore 1997 "Student Resistance to Course Content: Reactions to the Gender of the Messenger" discusses resistance that a female instructor received and her attempts to address this resistance. Here is a sample (p. 130):

I asked the class how they would have responded if the male professor was teaching this course with the same primary textbook with the word "feminist" in the title. A female student reported that a male professor then would be speaking against males, so no favoritism would be apparent, whereas a female relaying the same message would be favoring "girls." The recurrent theme was that I am necessarily self-interested when I teach about family, especially when I present any data or theory that may be interpreted as favorable to women or unfavorable to men. I sensed that students felt that the male professor was not only more objective and scientific in his approach, but also that he speaks for everyone. I, on the other hand, am only capable of proffering "the women's" perspective.

This does not appear to be a correct reading of the female student's statement, which appears to make the point that a member of Group X speaking against Group X is properly treated differently than a member of Group Y speaking against Group X. The question of fairness would thus involve whether a woman speaking against men is treated the same as a man speaking against women.

I'm not sure what evidence of bias in student evaluations this publication offers. The publication notes that "While women's studies courses earn high evaluations, instructors of those courses are often attacked personally in those evaluations, accused of bias and of hating men" (p. 132). But if student evaluations of teaching aren't used to compare instructors of women's studies courses to instructors of courses in different departments, then I'm not sure that this would be a bias that would influence employment outcomes.

---

52.

Basow 2000 "Best and Worst Professors: Gender Patterns in Students' Choices" reports responses from 61 female students and 47 male students to two open-ended items, in counterbalanced order: "Think of the best professor you've had in college...describe what made him or her the 'best', in your opinion" and the same item but with "worst" substituted for "best". Results included these findings: "...about twice as many male as female faculty were chosen as 'best' by this sample; this is proportional to the male/female ratio of faculty at the college", and "For choice of 'worst' professors, there was no student gender by faculty gender interaction. Students made their choice proportional to the number of male and female faculty they estimated they had had" (pp. 411-412).

This sounds similar to Basow et al. 2006 "Gender Patterns in College Students' Choices of Their Best and Worst Professors".

---

53.

Basow 2000 "Gender Dynamics in the Classroom" is from a book that isn't in my institutional library. Here's a description from Maricic et al. 2016 "Gender Bias in Student Assessment of Teaching Performance" (p. 138):

Interesting studies were conducted by Basow (2000b) and Sprague and Massoni (2005). Namely, both studies asked students to depict their "best" and "worst" male and female teachers. Their results were conclusive. Traits of "best" female teachers were caring and nurturing, while the traits of "best" male teachers were funny and entertaining. On the other hand, when it comes to "worst" teachers, common traits for both genders were unorganised, unclear, indifferent, and rude. "Worst" female teachers were just the opposite of the "best" female teacher; they were characterized as rigid, mean, and unfair. Interestingly, "worst" male teachers were self-centred and unenthusiastic.

But Maricic et al. 2016 might have confused the two Basow 2000 citations. The terms "best professor" and "best professors" aren't showing up in a Google Books search of the Basow 2000 "Gender Dynamics in the Classroom" chapter.

---

54.

Sinclair and Kunda 2000 "Motivated Stereotyping of Women: She's Fine If She Praised Me but Incompetent If She Criticized Me" reports results from three studies.

In Study 1, 83 male undergraduates and 97 female undergraduates provided ratings regarding courses that the student had taken the prior term. Results indicated that "students' evaluations of female instructors are more dependent on the grades they have received from them than are their evaluations of male instructors" (p. 1333); for example, using the first course that a student rated, the drop-off in mean evaluations from a high grade to a low grade was 6.22 for male instructors and was 20.54 for female instructors, with female instructors rated lower than male instructors in the low grade group but with no p<0.05 difference in the high grade group.

In Study 2, 54 male undergraduates watched a male confederate manager or a female confederate manager provide positive feedback or negative feedback on the participant's responses to an interpersonal skills questionnaire. The pattern from Study 1 conceptually replicated in Study 2: no p<0.05 gender difference in ratings of the manager's skill in the positive feedback condition, a p<0.05 difference in ratings of the manager's skill in the negative feedback condition, and a p<0.05 difference in the difference. For participants' rating of manager competence, only a main effect appeared at p<0.05.

The Study 2 p<0.05 differences in ratings of manager skill based on 54 participants spread across four conditions suggest a large effect size: the "penalty" for negative feedback was d=0.66 for the male managers and was d=2.05 for the female managers.

Study 3 involved participants from Study 2 paired with another participant, to assess whether the key patterns from Study 2 replicate when the positive feedback or the negative feedback is provided to another person. Results indicated no differences when participants rated the female manager or the male manager who had provided feedback to a person who was not the participant: "Male and female evaluators were given comparable ratings when providing negative feedback as well as when providing positive feedback" (p. 1339).

---

55.

Arbuckle and Williams 2003 "Students' Perceptions of Expressiveness: Age and Gender Effects on Teacher Evaluations" reports results from an experiment involving a computer-generated stick figure lecturer with a female voice that in prior research had been attributed by students about equally to a man or a woman. The 352 student participants in the present study watched the 35-minute video that had the stick figure lecturer interspersed, and the students then completed an evaluation that indicated to participants that the professor was a [male/female] [under age 35/over age 55].

The lecturer in the male condition was rated higher on 4 of the 9 dependent variables for which results were reported: enthusiasm, felt accepted, meaningful voice tone, and showed interest. The 5 remaining dependent variables were: precise teaching, logical and organized, seemed conscientious, used scientific terminology, and relaxed and confident. Table II indicated that the higher ratings in the male condition were nearly entirely attributable to the male under age 35 condition.

My calculations indicated that the difference between the mean for the male condition and the mean for the female condition was 0.31 on the 6-point scale and that the pooled standard deviation was about 1.33 for the male condition and 1.38 for the female condition, so the 0.31 difference was about 0.23 standard deviations. The p-value was 0.03 for a t-test for a difference in the mean rating in the male condition compared to the mean rating in the female condition, presuming that participants were equally assigned to the male condition and the female condition (in Stata: ttesti 176 0 1.33 176 0.31 1.38).
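
For readers without Stata, here is a minimal Python sketch of the same arithmetic from the summary statistics. It assumes 176 participants per condition (the equal split presumed above) and equal variances, and it uses a normal approximation for the p-value instead of the t distribution, which should make little difference with about 350 degrees of freedom. The function name is mine.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_from_summary(n1, mean1, sd1, n2, mean2, sd2):
    """Standardized difference, test statistic, and approximate two-sided p-value."""
    diff = mean2 - mean1
    # Pooled standard deviation under the equal-variance assumption
    pooled_sd = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = diff / pooled_sd
    t = diff / (pooled_sd * sqrt(1 / n1 + 1 / n2))
    # Normal approximation to the t distribution (assumption: large df)
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return d, t, p


# Same inputs as the ttesti command: n, mean, sd for each condition
d, t, p = two_sample_from_summary(176, 0, 1.33, 176, 0.31, 1.38)
print(f"d = {d:.2f}, t = {t:.2f}, p = {p:.2f}")
```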

The evaluations had 12 items, but results were not reported for the 3 of the 12 dependent variables that had "negative skews and a heterogeneity of variance" (p. 510). The appendix indicated that the three dropped items were: "Presented scientific principles rather than opinions", "Helped me to gain a broader understanding", and "Appeared to know and understand the subject well". Those seem like important items, and it's odd and a bit suspicious to have these 3 items dropped from the analysis; in real life, faculty evaluations do not set aside student evaluation items merely because the items have a heterogeneity of variance.

So we are left with no evidence of a gender difference on important items such as logical and organized and appeared to know and understand the subject well, but with evidence of a gender difference on the less important items of enthusiasm, felt accepted, meaningful voice tone, and showed interest. It would be fine with me if student evaluations of teaching did not include a "meaningful voice tone" item.

Based on this 1995 AP story, there appears to be an unpublished paper by the co-authors providing more experimental evidence of bias in student evaluations.

---

56.

Ewing et al. 2003 "Prejudice against Gay Male and Lesbian Lecturers" reports results from a study involving 261 introduction to psychology students. Student participants were given a curriculum vitae for the guest lecturer, with the curriculum vitae signaling or not signaling that the instructor was gay or lesbian. The assignment was not purely random, and was based on which side of the room the student sat; moreover, students had not been assigned to the courses, so the students were non-randomly placed into the female instructor or male instructor condition. The guest lecturer then gave a lecture that was intended to be strong (animated or direct) or weak (dry and indirect); the lecture topic concerned advanced studies and careers related to psychology.

Ratings were lower for the weak lecture than for the strong lecture, but results indicated no main effect for lecturer sex or sexual orientation. However, results did indicate an interaction that was the opposite of that which would be expected if students used the weak lecture to derogate gay and lesbian lecturers: "after a strong lecture, students rated acknowledged gay male and lesbian lecturers more negatively than lecturers of unspecified sexual orientation; but after a weak lecture, students rated acknowledged gay male and lesbian lecturers more positively than lecturers of unspecified sexual orientation" (p. 576).

---

57.

Anderson and Smith 2005 "Students' Preconceptions of Professors: Benefits and Barriers According to Ethnicity and Gender" reports on an experiment involving 633 undergraduate responses based on a syllabus for a Race, Gender and Inequality course, with variation in instructor gender, instructor ethnicity, and instructor teaching style. From what I can tell, there were no main effects detected for instructor gender or instructor ethnicity in terms of instructor warmth, instructor capability, or instructor political bias.

I'll put the other two Anderson publications next...

---

58.

Anderson 2010 "Students' Stereotypes of Professors: An Exploration of the Double Violations of Ethnicity and Gender" reports on an experiment involving 594 undergraduate responses based on a course syllabus with variation in instructor gender, instructor ethnicity, instructor teaching style, and course taught. The most relevant outcome variable is student responses about the instructor's professional competence, for which "there were no significant effects associated with this analysis" (p. 466). The abstract indicates that "Women professors were viewed as more warm than men professors even though their course syllabuses were identical", but that does not seem like an important finding for deciding whether to use student evaluations of teaching in employment decisions.

---

59.

Anderson and Kanner 2011 "Inventing a Gay Agenda: Students' Perceptions of Lesbian and Gay Professors" reports on an experiment involving 622 undergraduate responses based on a syllabus for a Psychology of Human Sexuality course with variation in instructor gender, instructor sexual orientation, instructor political ideology, and typographical errors. Results did not indicate a difference in perceived competence for gay/lesbian instructors compared to heterosexual instructors and did not indicate an interaction of instructor sexual orientation with the presence of errors in the syllabus. But results did indicate that students rated gay/lesbian instructors as more politically biased than students rated heterosexual instructors, with a difference equivalent to 0.24 standard deviations based on summary statistics on page 1548.

Anderson and Kanner 2011 reports a second study (N=545) that had the same design as the first study but included measures for homonegativity, with results indicating, among other things, that "Among modern homonegatives, lesbian/gay professors (M = 3.07, SD = 0.09) were viewed as more politically biased than were heterosexuals (M = 2.52, SD = 0.08)..." (p. 1556). About half of the students were classified as non-homonegative, and it's not clear to me from the article whether, overall, students rated gay/lesbian instructors as more politically biased than students rated heterosexual instructors, which would replicate a key finding in the first study.

It's possible that, for a Psychology of Human Sexuality course, students on average unfairly rate gay or lesbian instructors as more politically biased than students rate heterosexual professors. But unless there is a "political bias" item on student evaluations of teaching or unless this unfairness manifests in other ratings, I don't know how relevant that unfairness is for assessing whether to use student evaluations of teaching in employment decisions.

---

60.

Basow and Montgomery 2005 "Student Ratings and Professor Self-Ratings of College Teaching: Effects of Gender and Divisional Affiliation" reports on results from evaluations of 23 male professors and 20 female professors at a small liberal arts college, rated by 407 male students, 365 female students, and 31 students who did not report their gender. Professors were at the assistant professor rank or higher, taught 100- or 200-level courses, and were matched by division. Evaluations were conducted between weeks 7 and 12 of a 14-week semester.

Based on my read of the results of this study, it's not clear to me why Holman et al. 2019 would classify this study as "finding bias", in which the "finding bias" category is presented as finding bias against women or nonwhites (Holman et al. has a separate category for "bias favoring women"). Here is the Holman et al. 2019 summary of Basow and Montgomery 2005 related to gender bias in student evaluations:

Based on the results of the study, Basow and Montgomery concluded that professor gender and divisional affiliation (department/field of study) contributed to the results of student evaluations. Female professors were rated higher than male professors on two interpersonal factors and on scholarship, and natural science courses were rated the lowest for most factors. Humanities professors received the highest overall ratings however, male professors in the humanities received lower ratings than female professors.

---

61.

DiPietro and Faye 2005 "Online Student-Ratings-of-Instruction (SRI) Mechanisms for Maximal Feedback to Instructors" is an unpublished paper that I didn't locate. Here is a summary from Smith 2009 "Student Ratings of Teaching Effectiveness for Faculty Groups Based on Race and Gender" (p. 617):

Of the three groups of faculty (Hispanic, Asian-American, and White) included in the DiPietro and Faye (2005) study, Hispanic faculty received the lowest course evaluation ratings. Asian-American faculty received slightly better course evaluations than their Hispanic colleagues, but their scores were still lower than the scores of White faculty. The number of African-American faculty in DiPietro and Faye study was too small to draw any conclusions.

I'm not sure why Smith 2009 isn't in the Holman et al. 2019 list.

---

62.

Hamermesh and Parker 2005 "Beauty in the Classroom: Instructors' Pulchritude and Putative Pedagogical Productivity" reports an analysis of student evaluations of instructors at the University of Texas at Austin from 2000 to 2002. Results in Table 3 indicate that instructor beauty associates with higher evaluations, that male instructors are rated higher than female instructors, and that the association of instructor beauty with evaluations is stronger for male faculty than for female faculty. The analysis included only predictors for these instructor or course characteristics: beauty, sex, minority status, non-native English status, tenure-track status, lower division course, and one-credit course.

---

63.

Abel and Meltzer 2007 "Student Ratings of a Male and Female Professors' Lecture on Sex Discrimination in the Workforce" reports results from an experiment in which psychology students, among other things, responded to ten items about a written lecture and the lecturer, with a variation in whether the lecture was attributed to Dr. Michael Smith or Dr. Mary Smith. Mean responses differed at p<0.05 for 3 of the 5 items about the lecture and 3 of the 5 items about the instructor.

So the experiment has internal validity. But the experiment also suffers from issues common to research published prior to the replication crisis, such as a lack of preregistration and a small sample size: 43 in the male instructor condition and 44 in the female instructor condition, which gives about a 21 percent chance of detecting a 0.25 standard deviation difference. R code for that:

# Power of a two-sample t-test with unequal group sizes (requires the pwr package)
library(pwr)
pwr.t2n.test(n1=43, n2=44, d=0.25, sig.level=0.05)

But some observed differences were much larger than 0.25 standard deviations, such as the 0.77 standard deviation difference for the item of "The professor presented the information in a sexist light", so the experiment would be sufficiently powered (0.94) to detect a difference that large. And, again, it would be fine with me for student evaluations of teaching to not include an item such as "The professor presented the information in a sexist light".
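
The two power figures can also be roughly checked without the pwr package. Here is a minimal Python sketch that uses a normal approximation to the two-sample test rather than the noncentral t distribution that pwr uses, so the results may differ from the pwr output in the second decimal place, especially for the larger effect. The function name is mine.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n1, n2, d, sig_level=0.05):
    """Approximate two-sided power for a two-sample comparison (normal approximation)."""
    z = NormalDist()
    # Expected test statistic for a standardized difference d
    ncp = d * sqrt(n1 * n2 / (n1 + n2))
    z_crit = z.inv_cdf(1 - sig_level / 2)
    # Probability of landing beyond either critical value
    return z.cdf(ncp - z_crit) + z.cdf(-ncp - z_crit)


power_small = approx_power(43, 44, 0.25)  # the 0.25 SD benchmark
power_large = approx_power(43, 44, 0.77)  # the observed 0.77 SD difference
print(f"Approximate power for d=0.25: {power_small:.2f}")
print(f"Approximate power for d=0.77: {power_large:.2f}")
```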

The lecture concerned gender differences in occupation and ended with a claim that:

...current research shows that men and women equally employed in the same male-dominated position, with equal education, skills and credentials have different pay scales. Typically, men still receive higher salaries than women for doing the same things. This is largely the result of a historical male dominated society in the United States that still exists today.

So even accepting the key finding from the experiment, the finding is limited to psychology students rating a man instructor asserting that men have an unfair advantage higher than psychology students rated a woman instructor asserting that men have an unfair advantage. There is no parallel experiment reported here comparing men instructors asserting that women have an unfair advantage to women instructors asserting that women have an unfair advantage, and there is no experiment reported here comparing men instructors to women instructors for a lecture not directly related to gender.

It's worth noting that the unrepresentative context of the experiment—a man instructor and a woman instructor asserting that men have an unfair advantage—sometimes gets lost such that the experiment is described in general terms, such as in Nittrouer et al. 2017, with citation 18 referring to Abel and Meltzer 2007:

Additional studies similarly show that female (versus male) teachers are rated more negatively (14–17). For example, participants who read a lecture, which was posited as having been written and delivered by a male or female professor, rated the lecture by the male (versus the female) professor significantly more positively (18).

---

Here's a passage from the discussion in Abel and Meltzer 2007 (p. 179):

Finally, we are currently designing an experimental study comparing student ratings for male and female professors presenting more mundane lecture information versus the more emotionally charged lecture as used in this study which could then examine the effects for type of lecture information related to sex of student and sex of professor.

I'm not sure what happened to that study.

---

64.

McPherson et al. 2009 "What Determines Student Evaluation Scores? A Random Effects Analysis of Undergraduate Economics Classes" reports on student evaluation data from economics courses at the University of North Texas from 1994 through 2005, using the four items that the Department of Economics has chosen. Reporting results separately for the introductory principles course and upper-level courses, main effects appeared for instructor sex in both models, although differences were small (0.094 units and 0.066 units on a 4-point scale), and a main effect appeared for White instructors in the upper-level courses (0.120 units).

Models contained controls for factors such as instructor age and instructor experience, but nothing that would credibly measure instructor effectiveness.

---

65.

Boring 2015 "Gender Biases in Student Evaluations of Teachers and Their Impact on Teacher Incentives" reports results for 22,665 observations made by 4,423 undergraduate students of 372 different teachers at a university in France.

The abstract indicates that: "The results of generalized ordered logit regressions and fixed-effects models suggest that male teachers tend to receive higher SET scores because of students' gender biases". But the evidence indicates a bias only in the sense of a difference in ratings, not necessarily an unfair difference.

The abstract indicates that: "Men are perceived as being more knowledgeable (male gender stereotype) and obtain higher SET scores than women, but students appear to learn as much from women as from men, suggesting that female teachers are as knowledgeable as men". But teaching evaluations aren't intended as measures of teaching effectiveness ("Student ratings have never been intended to serve as a proxy for learning", Linse 2017: 95) and, even if they were, it is an inferential leap from student final exam scores to teacher knowledge.

---

66.

Boring et al. 2016 "Student Evaluations of Teaching (Mostly) Do Not Measure Teaching Effectiveness" analyzes data from MacNell et al. 2014 and Boring 2015.

The discussion of the Boring 2015 data in Boring et al. 2016 has added the term "natural experiment"; however, the research design did not hold the quality of instruction constant between female instructors and male instructors: "the natural experiment does not allow us to control for differences in teaching styles across instructors" (p. 7).

---

67.

I suspect that Boring 2017 "Gender Biases in Student Evaluations of Teaching" might be a substantially reworked version of Boring 2015 "Gender Biases in Student Evaluations of Teachers and Their Impact on Teacher Incentives", given that, among other things, the study setup and findings are similar, the acknowledgements are similar, and the funding grant number (612413) is the same.

---

68.

Let's transition from the "natural" experiment of the Boring 2015 data from France to the "quasi-experimental" data on 19,952 student evaluations from a School of Business at a university in the Netherlands between 2009 and 2013, reported on in Mengel et al. 2017 "Gender Bias in Teaching Evaluations". Students were randomly assigned to section instructors. Controlling for factors such as student grades, male students rated female instructors 0.21 standard deviations lower than male instructors on instructor-related items, and female students rated female instructors 0.08 standard deviations lower (pp. 552-553).

However, Table 7 indicates that these differences differed by instructor rank in a pattern that does not seem consistent with the differences being due to a straightforward student gender bias: male students rated female instructors lower than they rated male instructors for student instructors and Ph.D. student instructors but not for lecturers and professors; moreover, female students rated female instructors lower than they rated male instructors for student instructors but rated female instructors higher than they rated male instructors for lecturers and professors.

---

69.

I think the "Freiderike, M., J. Sauermann, & U. Zolitz" 2017 "Gender Bias in Teaching Evaluations" paper listed in Holman et al. 2019 became the Mengel et al. 2017 "Gender Bias in Teaching Evaluations" article: for example, data in the paper's Table 7 are identical to data in the article's Table 7. And "Freiderike" is the first name of the first author Friederike Mengel.

---

70.

Wagner et al. 2016 "Gender, Ethnicity and Teaching Evaluations: Evidence from Mixed Teaching Teams" reports data from student evaluations in a graduate school in the Netherlands from 2010 through 2015. The data include courses in which a male instructor and a female instructor co-taught the course.

Evaluations ranged from an observed low of 1.82 to an observed high of 5, with a 4.271 mean and a 0.443 standard deviation. Results in Table 4 indicate that female instructors were rated about 0.25 standard deviations lower than male instructors were rated, controlling for Caucasian status, whether the instructor was a course leader, and either the instructor age and age-squared (0.091 points, 0.05<p<0.10) or whether the instructor was new (0.110 points, 0.05<p<0.10). Some results in Table 5 include a control for publications, and some results in Table 6 interact instructor sex with the course type (e.g., Governance & development policy), but Table 5 and Table 6 models don't control for instructor experience (age, or being new), even though sample sizes for the relevant base models in Tables 4 through 6 range from an N of 499 to an N of 688.
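As a rough check on how the raw-point differences map onto standard deviations, the Table 4 point estimates can be divided by the 0.443 standard deviation reported above. This is only a minimal sketch: it assumes that the overall standard deviation of the evaluations is the appropriate denominator, and the function name `points_to_sd` is hypothetical.

```python
# Summary statistic quoted above from Wagner et al. 2016.
EVALUATION_SD = 0.443  # standard deviation of the evaluation scores


def points_to_sd(points: float, sd: float = EVALUATION_SD) -> float:
    """Express a raw-point difference in standard deviation units."""
    return points / sd


# Point estimates quoted above from Table 4.
age_model = points_to_sd(0.091)  # model controlling for age and age-squared
new_model = points_to_sd(0.110)  # model controlling for being a new instructor

print(f"age controls: {age_model:.2f} SD")
print(f"new-instructor control: {new_model:.2f} SD")
```

Under that assumption, the 0.110-point estimate works out to about 0.25 standard deviations and the 0.091-point estimate to about 0.21 standard deviations.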

---

71.

Rosen 2017 "Correlations, Trends, and Potential Biases among Publicly Accessible Web-Based Student Evaluations of Teaching: A Large-scale Study of RateMyProfessors.com Data" reported results from ratings on RateMyProfessors.com for instructors who had at least 20 ratings. Results indicated evidence at p<0.001 that male instructors were rated higher on clarity, helpfulness, and overall quality and were rated lower on easiness, but that these mean differences were trivial, respectively 0.05, 0.04, 0.04, and 0.03 units on a 1-to-5 scale.
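To illustrate how a trivial mean difference can still clear p<0.001 in a large sample, here is a minimal sketch using a normal-approximation two-sample test. The 0.05 difference is the largest one quoted above, but the ratings standard deviation of 0.8 and the group size of 20,000 are hypothetical values chosen for illustration, not figures from Rosen 2017, and the function `two_sample_z` is likewise hypothetical.

```python
import math


def two_sample_z(diff: float, sd: float, n_per_group: int) -> tuple[float, float, float]:
    """Return (standardized difference, z statistic, two-sided p-value)
    for a mean difference between two equal-sized groups with a common SD."""
    se = sd * math.sqrt(2.0 / n_per_group)  # standard error of the difference
    z = diff / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return diff / sd, z, p


# diff is quoted above; sd and n_per_group are HYPOTHETICAL illustration values.
d, z, p = two_sample_z(diff=0.05, sd=0.8, n_per_group=20_000)
print(f"standardized difference = {d:.3f}, z = {z:.2f}, p = {p:.1e}")
```

With these assumed inputs, the standardized difference is well under a tenth of a standard deviation even though the p-value is far below 0.001, which is the sense in which a difference can be statistically detectable yet substantively trivial.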

Further results were reported controlling for the hot/not hot rating (p. 41):

Approximately, 22.7% of male faculty on RateMyProfessors are rated as 'hot', compared to 27.8% of female faculty. Since it has already been shown in Table 1 that perceived physical appearance correlates with evaluation scores, whether a professor is rated as 'hot' or' not hot' should be controlled when analysing potential gender biases as well.

But I don't understand the reason for the "should be controlled" if the hot/not hot ratings are gender biased or at least could plausibly be gender biased.

---

72.

Drake et al. 2019 "Grading Teachers: Race and Gender Differences in Low Evaluation Ratings and Teacher Employment Outcomes" reports an analysis of performance ratings of preK-12 teachers in Michigan from 2011 to 2015. Results indicated that "male teachers and teachers of color were more likely to be labeled 'minimally effective' or 'ineffective' than their same-school peers even after conditioning on evaluators' prior judgments and value-added scores" (p. 1826).

I don't think that these were ratings made by students.

---

73.

Fan et al. 2019 "Gender and Cultural Bias in Student Evaluations: Why Representation Matters" reports results from student evaluations at a large public university in Australia. The article discusses its results and limitations in this passage (p. 14):

Throughout this paper, and in the title, we have used the term "bias" when describing the statistically significant effect females and non-English speaking teachers. It should be pointed out that one of the limitations of this study is that it is only able to show association, e.g., being female is associated with a lower SET score, we cannot say what really was the cause for a lower score. However, if SET is really measuring teaching quality, then the only plausible causes are either that females are generally bad teachers across a large population, or there's bias, the same argument can be made for teachers who have non-English speaking background. Since we find no credible support that females, or someone with an accent, should generally be bad teachers, we have chosen to use the term "bias".

The article indicates that "Around 80% of the scores are given at either 5 or 6 and our results suggest that bias comes in at this top level, between 'agree' and 'strongly agree'" (p. 8). Given that the difference occurs between two positive ratings, I don't think that it is correct for the authors to be concerned about finding credible support for the claim that women or persons with an accent are "bad" teachers.

---

I'll plan to use the next post to wrap up the review of the Holman et al. 2019 "finding bias" list.
