Let's continue our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias". See here for the first entry in the series and here for other entries.

---

25.

Elmore and LaPointe 1974 "Effects of Teacher Sex and Student Sex on the Evaluation of College Instructors" analyzed student evaluation data from courses in various departments of Southern Illinois University at Carbondale in 1971. Complete data were available from 1,259 students in 38 pairs of courses matched on course number and instructor sex. Of the 20 instructor evaluation items analyzed, only two had a statistically significant mean difference between female instructors and male instructors at a p=0.01 threshold: male instructors were rated higher for "spoke understandably", and female instructors were rated higher for "promptly returned homework and tests".

I'm not sure why Elmore and LaPointe 1974 is included in a list of studies finding bias in standard evaluations of teaching. No statistically significant difference was reported for 18 of the 20 instructor evaluation items, and, of the two items with a reported difference, one favored male instructors and the other favored female instructors. But, more importantly, the Elmore and LaPointe 1974 research design does not permit the inference that student ratings deviated from reality: for example, no evidence is reported indicating that the female instructors did not in fact return homework and tests more promptly on average than the male instructors did.

---

26.

Elmore and LaPointe 1975 "Effect of Teacher Sex, Student Sex, and Teacher Warmth on the Evaluation of College Instructors" analyzed student evaluation data from courses in various departments of Southern Illinois University at Carbondale in 1974. Data were available from 838 students in 22 pairs of courses matched on course and instructor sex. Twenty standard instructor evaluation items were used, plus instructor responses and student responses to an item about whether the instructor's primary interest lay in the course content or in the students, and a five-point measure of how warm a person the instructor is. The p-value threshold was 0.0025.

Results indicated that "When students rate their instructor's interest and warmth, teachers perceived as warmer or primarily interested in students receive higher ratings in effectiveness regardless of their sex", that "In general, female faculty receive significantly higher effectiveness ratings than do male faculty when they rate themselves low in warmth or interested in course content", and that "Male teachers who rate themselves high in warmth or primarily interested in students receive significantly higher ratings than male teachers who rate themselves low in warmth or primarily interested in course content, respectively" (p. 374).

I'm not sure how these data establish an unfair bias in student evaluations of teaching.

---

27.

Ferber and Huber 1975 "Sex of Student and Instructor: A Study of Student Bias" reported on responses to three items from students at the first class meeting of four large introductory economics or sociology courses at the University of Illinois Urbana in 1972.

The first item asked students to rate the men college teachers and the women college teachers that they had had, across seven academic areas. Results in Table 1 indicate that, across the seven academic areas, the mean rating for men college teachers was identical to the mean rating for women college teachers (2.24).

The second question asked about student preferences for men instructors or women instructors in various types of classroom situations. Results in Table 2 indicate that most students did not express a preference, but, of the students who did express a preference, the majority preferred a man instructor. For example, of 1,241 students, 39 percent expressed a preference for a man instructor in a large lecture and 2 percent expressed a preference for a woman instructor in a large lecture.

The third item asked students to rate their level of agreement with a statement, attributed to a man or to a woman. For one statement, the prompt was: "A well-known American economist [Mary Killingsworth/Charles Knight] proposes that compulsory military service be replaced by the requirement that all young people give one year of service for their country". Results in Table 6 indicate that the mean level of agreement did not differ between Mary and Charles at p<0.05 among male students, among female students, or among the full sample.

For the other statement, the prompt was: "According to the contemporary social theorist [Frank Merton/Alice Parsons], in order to achieve equal educational opportunity in the United States, no parents should be allowed to pay for their children's education; every college student should borrow from the federal government to pay for tuition and living expenses". Results in Table 6 indicate that, on a rating scale from 1 for strongly agree to 5 for strongly disagree, the mean level of agreement differed at p<0.05 among male students, among female students, and among the full sample, with more agreement when the statement was attributed to Alice than to Frank (respective overall means of 3.38 and 3.66).

I'm not sure why Ferber and Huber 1975 is included in a list of studies finding bias in standard evaluations of teaching. The first item is the only item directly on point for assessing bias in student evaluations of teaching, and that item showed no overall difference between male instructors and female instructors and no evidence that the lack of a difference was itself unfair.

---

Comments are open if you disagree, but I don't think that any of these three studies provide sufficient evidence to undercut the use of student evaluations in employment decisions.

And it's worth considering whether these data from the Nixon administration should be included in the main Holman et al. 2019 list, given that the tally of "76" studies "finding bias" in the Holman et al. 2019 list is being used to suggest inferences about the handling of student evaluations of teaching in contemporary times.


Let's continue our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias". See here for the first entry in the series and here for other entries.

---

22.

Heilman and Okimoto 2007 "Why Are Women Penalized for Success at Male Tasks?: The Implied Communality Deficit" reports on three experiments regarding evaluations of fictional vice presidents of financial affairs. The experiments do not concern student evaluations of teaching, so it's not clear to me that Holman et al. 2019 should classify this article under "Evidence of Bias in Standard Evaluations of Teaching".

---

23.

Punyanunt-Carter and Carter 2015 "Students' Gender Bias in Teaching Evaluations" indicated that 58 students in an introductory communication course were asked to complete a survey about a male professor or about a female professor. The article did not report inferential statistics, and, given the reported percentages and sample sizes, it's not clear to me that this study should be classified as finding bias.

For example, here are results for the first question, about instructor effectiveness, for which the article reported only the percentage of each student gender that agreed or strongly agreed that the instructor was effective:

For the female professor:
65% of 17 males, so 11 of 17
67% of 15 females, so 10 of 15

For the male professor:
69% of 13 males, so 9 of 13
69% of 13 females, so 9 of 13

Overall, that's 21 of 32 (66%) for the female professor and 18 of 26 (69%) for the male professor, producing a p-value of 0.77 in a test for the equality of proportions.
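As a check on that arithmetic, here is a minimal sketch in Python: it back-calculates the counts from the reported percentages (rounding to whole students) and runs a pooled two-proportion z-test via statsmodels, which reproduces the p-value above.

```python
from statsmodels.stats.proportion import proportions_ztest

# Back-calculate approximate counts from the reported percentages,
# rounding to the nearest whole student.
female_prof_agree = round(0.65 * 17) + round(0.67 * 15)  # 11 + 10 = 21 of 32
male_prof_agree = round(0.69 * 13) + round(0.69 * 13)    #  9 +  9 = 18 of 26

# Pooled two-proportion z-test for equality of the two agreement rates.
z, p = proportions_ztest(count=[female_prof_agree, male_prof_agree],
                         nobs=[32, 26])
print(f"z = {z:.2f}, p = {p:.2f}")  # z = -0.29, p = 0.77
```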

---

24.

Young et al. 2009 "Evaluating Gender Bias in Ratings of University Instructors' Teaching Effectiveness" had graduate students and undergraduate students evaluate on 25 items "a memorable college or university teacher of their choice" (p. 4). Results indicated that "Female students rated their female instructors significantly higher on pedagogical characteristics and course content characteristics than they rated their male instructors. Also, male students rated male instructors significantly higher on the same two factors. Interpersonal characteristics of male and female instructors were not rated differently by the male and female students" (p. 9).

I'm not sure how much to make of the finding quoted above based on this study, given results in Table 4 of the article. The p-value section of Table 4 has a column for each of the three factors (interpersonal characteristics, pedagogical characteristics, and course content characteristics) and has seven rows, for student gender (A), student level (B), instructor gender (C), AxB, AxC, BxC, and AxBxC. So the table has 21 p-values, only 2 of which are under 0.05; the average of the 21 p-values is 0.52.
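For a rough sense of how unremarkable that tally is, here is a minimal sketch that assumes, as an idealization, that the 21 tests were independent (they are not, strictly, since the factors and interactions share data); under a global null, at least 2 of 21 p-values would land under 0.05 about 28 percent of the time.

```python
from scipy.stats import binom

# Under a global null with 21 independent tests at the 0.05 level,
# the number of significant results is Binomial(n=21, p=0.05).
p_at_least_two = binom.sf(1, 21, 0.05)  # P(X >= 2)
print(f"{p_at_least_two:.2f}")  # about 0.28
```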

---

Comments are open if you disagree, but I don't think that any of these three studies provide sufficient evidence to undercut the use of student evaluations in employment decisions.


Let's pause our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias", to discuss three studies of student evaluations of teaching that are not in the Holman et al. 2019 list. I'll use the prefix "B" to refer to these bonus studies.

---

B1.

Meltzer and McNulty 2011 "Contrast Effects of Stereotypes: 'Nurturing' Male Professors Are Evaluated More Positively than 'Nurturing' Female Professors" reported on an experiment in which undergraduates rated a psychology job candidate, with variation in candidate gender (Dr. Michael Smith or Dr. Michelle Smith), variation in whether the candidate was described as "particularly nurturing", and variation in whether the candidate was described as "organized" or "disorganized". Participants responded to items such as "Do you think Dr. Smith's responses to students' questions in class would be helpful?" and "How do you think you would rate Dr. Smith's overall performance in this course?". Results indicated no main effect for gender, but the nurturing male candidate was rated higher than the control male candidate and the nurturing female candidate and marginally higher than the control female candidate.

For some reason, results for the "organized"/"disorganized" variation were not reported.

---

B2.

Basow et al. 2013 "The Effects of Professors' Race and Gender on Student Evaluations and Performance" reported on an experiment in which undergraduates from psychology, economics, and mathematics courses evaluated a three-minute engineering lecture given by an animated instructor who was Black or White and male or female; participants also took a quiz on the lecture content. Results indicated that "student evaluations did not vary by teacher gender", that "students rated the African American professor higher than the White professor on several teaching dimensions", and that students in the male instructor condition and in the White instructor condition did better on the quiz (p. 359).

---

B3.

I don't have access to Chisadza et al. 2019 "Race and Gender Biases in Student Evaluations of Teachers", but the highlights indicate that "We use an RCT to investigate race and gender bias in student evaluations of teachers" and that "We note biases in favor of female lecturers and against black lecturers". The abstract at Semantic Scholar indicates that the experiment was conducted in South Africa and that "Students are randomly assigned to follow video lectures with identical narrated slides and script but given by lecturers of different race and gender".

---

Comments are open if you disagree, but I don't think that there is much in B1 or B2 that would undercut the use of student evaluations in employment decisions. The experiments have high internal validity, but B1 had no main effect for gender, and the B2 results are neither strong nor consistent. Moreover, B1 and B2 used brief stimuli, so I don't know that the results are sufficiently informative about student evaluations at the end of a 15-week course.


Let's continue our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias". See here for the first entry in the series and here for other entries.

---

19.

Miller and Chamberlin 2000 "Women Are Teachers, Men Are Professors: A Study of Student Perceptions" reported on a study in which students in sociology courses were asked to indicate their familiarity with faculty members on a list and, for each faculty member that the student was familiar with, to indicate the highest educational degree that the student thought the faculty member had attained. The listed faculty members were the members of the sociology department, plus a fictitious man and a fictitious woman; footnote 6 indicates that no student claimed familiarity with either fictitious name. Results indicated that "controlling for faculty salary, seniority, rank, and award nomination rate, the level of educational attainment attributed to male classroom instructors is substantially and significantly higher than it is for women" (p. 294).

This study isn't about student evaluations of teaching and, from what I can tell, any implications of the study for student evaluations of teaching should be detectable in student evaluations of teaching.

---

20.

From what I can tell, the key finding mentioned above from Miller and Chamberlin 2000 did not replicate in Chamberlin and Hickey 2001 "Student Evaluations of Faculty Performance: The Role of Gender Expectations in Differential Evaluations", which indicated that: "Male versus female faculty credentials and expertise were also nonsignificant on items assessing student perceptions of the highest degree received by the faculty member, the rank of the faculty member, and whether the faculty member was tenured" (p. 10). Chamberlin and Hickey 2001 reported evidence of male faculty being rated differently than female faculty on certain items, but no analysis was reported that assessed whether these differences in ratings could be accounted for by plausible alternate explanations such as faculty performance.

---

21.

Sprague and Massoni 2005 "Student Evaluations and Gendered Expectations: What We Can't Count Can Hurt Us" analyzed data from 66 students at a public university on the East Coast and 223 students at a public university in the Midwest in 1999. Key data were student responses to a prompt to print up to four adjectives to describe the worst teacher that the student ever had and then to print up to four adjectives to describe the best teacher that the student ever had. Results were interpreted to indicate that "Men teachers are more likely to be held to an entertainer standard...[and]...Women teachers are held to a nurturer standard" (p. 791). Table V indicates that Caring is the most common factor for the best male teachers and that Uncaring is the second most common factor for the worst male teachers, so it's not obvious to me that the data permit a strong inference that men aren't also held to a nurturer standard.

---

Comments are open if you disagree, but I don't think that studies 19 and 20 report data indicating unfair sex or race bias in student evaluations of teaching using a research design with internal validity, with internal validity referring to an analysis that adequately addresses plausible alternate explanations. Study 21 (Sprague and Massoni 2005) reported results suggesting a difference in student expectations for male faculty and female faculty, but I don't know that there's enough in that study to undercut the use of student evaluations in employment decisions.


Let's continue our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias". See here for the first entry in the series and here for other entries.

---

16.

Huston 2006 "Race and Gender Bias in Higher Education: Could Faculty Course Evaluations Impede Further Progress toward Parity" is a review that, as far as I can tell, does not report novel data on unfair sex or race bias in student evaluations of teaching.

Sandler 1991 "Women Faculty at Work in the Classroom: Or, Why It Still Hurts To Be a Woman in Labor" is a review/essay-type publication.

---

17.

Miles and House 2015 "The Tail Wagging the Dog; An Overdue Examination of Student Teaching Evaluations" [sic for the semicolon] reported on an analysis of student evaluations from the College of Business at a southwestern university, with 30,571 cases from 2011 through 2013 covering 255 professors across 1,057 courses with class sizes from 10 to 190. The mean rating for the 774 male-instructed courses did not statistically differ from the mean rating for the 279 female-instructed courses (p=0.33), but Table 7 indicates that the 136 male-instructed large required courses had a higher mean rating than the 30 female-instructed large required courses (p=0.01). I don't see results reported for a gender difference in small courses.

For what it's worth, page 121 incorrectly notes that scores from male-instructed courses range from 4.96 to 4.26; the 4.96 should be 4.20 based on the lower bound of 4.196 in Table 4. Moreover, Hypothesis 6 is described as regarding a gender difference for "medium and large sections of required classes" (p. 119) but the results are for "large sections of required classes" (p. 122, 123) and the discussion of Hypothesis 6 included elective courses (p. 119), so it's not clear why medium classes and elective courses weren't included in the Table 7 analysis.
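For illustration only, here is a minimal sketch of one way such a course-level comparison of mean ratings could be run, as a Welch t-test; the Miles and House data aren't public, so the values below are invented stand-ins, not their data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Placeholder course-level mean ratings standing in for the 774
# male-instructed and 279 female-instructed course means.
male_courses = rng.normal(loc=4.23, scale=0.30, size=774)
female_courses = rng.normal(loc=4.21, scale=0.30, size=279)

# Welch's t-test for a difference in mean ratings, without assuming
# equal variances across the two groups of courses.
t, p = stats.ttest_ind(male_courses, female_courses, equal_var=False)
print(f"t = {t:.2f}, p = {p:.2f}")
```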

---

18.

Martin 2016 "Gender, Teaching Evaluations, and Professional Success in Political Science" reports on publicly available student evaluations for undergraduate political science courses from a southern R1 university from 2011 through 2014 and a western R1 university from 2007 through 2013. Results for the items, on a five-point scale, indicated little gender difference in small classes of 10 students, mean male instructor ratings 0.1 and 0.2 points higher than mean female instructor ratings for classes of 100, and a mean male instructor rating 0.5 points higher than the mean female instructor rating for classes of 200 or 400.

The statistical models had predictors only for instructor gender, class size, and an interaction term of instructor gender and class size. No analysis was reported that assessed whether ratings could be accounted for by plausible alternate explanations such as course or faculty performance.
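To make the sparseness of that specification concrete, here is a minimal sketch of a model with only those predictors; the data frame, variable names, and values are hypothetical, not Martin's.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical course-level data: a mean rating, an indicator for a
# male instructor, and the class size. Values are illustrative only.
df = pd.DataFrame({
    "rating":     [4.2, 4.1, 3.9, 3.7, 3.9, 3.4, 3.8, 3.3],
    "male":       [1,   0,   1,   0,   1,   0,   1,   0],
    "class_size": [10,  10,  100, 100, 200, 200, 400, 400],
})

# Gender, class size, and their interaction are the only predictors;
# nothing adjusts for course, department, or instructor performance,
# which is the internal-validity concern raised above.
fit = smf.ols("rating ~ male * class_size", data=df).fit()
print(fit.params)
```

Whatever coefficients such a model returns, they cannot distinguish a rating bias from a real difference in teaching quality across class sizes, which is the point of the paragraph above.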

---

Comments are open if you disagree, but I don't think that any of these three studies report a novel test for unfair sex or race bias in student evaluations of teaching using a research design with internal validity, with internal validity referring to an analysis that adequately addresses plausible alternate explanations. The interaction of instructor gender and class size reported in Miles and House 2015 and Martin 2016 appears to be worth further consideration in a research design that adequately addresses plausible alternate explanations.


Let's continue our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias". See here for the first entry in the series and here for other entries.

---

13.

Smith and Hawkins 2011 "Examining Student Evaluations of Black College Faculty: Does Race Matter" analyzed "undergraduate student ratings data for tenure-track faculty who used the 36-item student evaluation form adapted by the college" (p. 152), over a three-year period for the College of Education at a southeastern research university. Mean ratings, from lowest to highest, were for Black faculty, White faculty, and nonwhite nonblack faculty.

No analysis was reported that assessed whether ratings on the items could be accounted for by plausible alternate explanations such as course or faculty performance.

---

14.

Reid 2010 "The Role of Perceived Race and Gender in the Evaluation of College Teaching on RateMyProfessors.com" reported on RateMyProfessors data for faculty at the 25 highest ranked liberal arts colleges. Table 3 indicated that the mean overall quality ratings by race were: White (3.89), Other (3.88), Latino (3.87), Asian (3.75), and Black (3.48). Table 4 indicated that the mean overall quality ratings by gender were: male (3.87) and female (3.86).

No analysis was reported that assessed whether ratings on the overall quality item or on the more specific items could be accounted for by plausible alternate explanations such as faculty department, course, or faculty performance.

---

15.

Subtirelu 2015 "'She Does Have an Accent but…': Race and Language Ideology in Students' Evaluations of Mathematics Instructors on RateMyProfessors.com" reported that an analysis of data on RateMyProfessors indicated that "instructors with Chinese or Korean last names were rated significantly lower in Clarity and Helpfulness" than instructors with "US last names", that "RMP users commented on the language of their 'Asian' instructors frequently but were nearly entirely silent about the language of instructors with common US last names", and that "RMP users tended to withhold extreme positive evaluation from instructors who have Chinese or Korean last names, although this was frequently lavished on instructors with US last names" (pp. 55-56).

Discussing the question of whether this is unfair bias, Subtirelu 2015 indicated that "...a consensus about whether an instructor has 'legitimate' problems with his or her speech...would have to draw on some ideological framework of expectations for what or whose language will be legitimized [that] would almost certainly serve the interests of some by constructing their language as 'without problems' or 'normal'...while marginalizing others by constructing their language as 'containing problems' or 'being abnormal'" (p. 56).

In that spirit, I'll refrain from classifying as "containing problems" the difference in ratings that Subtirelu 2015 detected.

---

Comments are open if you disagree, but I don't think that any of these three studies report a novel test for unfair sex or race bias in student evaluations of teaching using a research design with internal validity, with internal validity referring to an analysis that adequately addresses plausible alternate explanations.


Let's continue our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias". See here for the first entry in the series and here for other entries.

---

10.

Huston 2006 "Race and Gender Bias in Higher Education: Could Faculty Course Evaluations Impede Further Progress toward Parity" is a review that, as far as I can tell, does not report novel data on unfair sex or race bias in student evaluations of teaching.

---

11.

Baldwin and Blattner 2003 "Guarding Against Potential Bias in Student Evaluations: What Every Faculty Member Needs to Know" doesn't report novel data about unfair bias in student evaluations. I don't know why that publication would be classified under "academic articles, book chapters, and working papers finding bias".

---

12.

Pittman 2010 "Race and Gender Oppression in the Classroom: The Experiences of Women Faculty of Color with White Male Students" discusses interviews with 17 nonwhite female faculty. Here is a sample from one of the interviews, from page 190:

Now I can't prove that these are racial events, OK. But I have some supposition that they may be racially motivated...the occurrence of...white males...much more predominantly white males, are coming into my class and questioning my expertise...whereas I don't believe, and I can't prove this, but I don't believe that they go into their chemistry class and challenge their chemistry white male,...now that may be gender as well as race. Because I just don't think that they'd go to some of their other classes and question or challenge their professors in ways that I've been questioned or challenged.

This study doesn't provide much evidence of unfair bias in student evaluations, although the article does note that "Several women faculty of color talked about low course evaluation ratings from race- and gender-privileged students and expressed their fear of how these might affect their departmental merit reviews" (p. 191).

---

Comments are open if you disagree, but I don't think that any of these three studies report a novel test for unfair sex or race bias in student evaluations of teaching using a research design with internal validity.
