Let's continue our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias". See here for the first entry in the series and here for other entries.

---

28.

I did not locate the text of Hogan 1978 "Review of the Literature: The Evaluation of Teaching in Higher Education", but I'm guessing from the "Review of the Literature" title that Hogan 1978 doesn't report novel data.

---

29.

Kaschak 1978 "Sex Bias in Student Evaluations of College Professors" had 100 seniors or first-year graduate students at San Jose State University (50 male and 50 female) rate fictional professors in business administration, chemistry, home economics, elementary education, psychology, and history, based on descriptions of the professors' teaching methods and practices; each professor had a male name or a female name, depending on the form that a student received. Ratings were reported for 1-to-10 scales for: effective/ineffective, concerned/unconcerned, likeable/not at all likeable, poor/excellent, powerless/powerful, and definitely would/would not take the course.

Male students' mean ratings on each item were more positive for the male professor than for the female professor. Female students' mean ratings differed statistically by professor sex on only two items: the male professor was rated as more powerful, and students indicated a greater preference for taking the female professor's course.

---

30.

Lombardo and Tocci 1979 "Attribution of Positive and Negative Characteristics of Instructors as a Function of Attractiveness and Sex of Instructor and Sex of Subject" had 120 introductory psychology students (60 male and 60 female) rate a person in a photograph, with the experimental manipulation that the person in the photograph was male or female and was attractive or unattractive. Students were told that the photograph was of Mary Dickson or Andrew Dickson and were told that the person had earned a Ph.D. and had just finished a second year of teaching. Ratings included nine scales (such as from intelligent to not intelligent) and the items "Compared with the faculty members at this college, how would you rate the over-all teaching performance of this instructor?" and "How much would you like to take a course from this faculty member?".

Results indicated that "Each of the dependent measures was analyzed by a 2 (attractive vs unattractive) X 2 (sex of pictured person) X 2 (sex of subject) analysis of variance...A significant main effect was found...for attractive-unattractiveness. The absence of other main effects or interactions indicated that the attractive pictures were rated significantly more attractive" (p. 493) and that "An interaction between the attractiveness of the picture and the sex of the instructor...on the question of how much they would like to take a course from this instructor indicated that all subjects preferred to take a course from a male" (p. 494).

---

Comments are open if you disagree, but I don't think that data from the 1970s is relevant for discussions of whether student evaluations of teaching should be used in employment decisions made in 2019 or beyond. For example, there has been a substantial increase since the 1970s in female representation among students and faculty, which can be plausibly expected to have reduced biases against female college faculty present during the Nixon administration.


Let's continue our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias". See here for the first entry in the series and here for other entries.

---

25.

Elmore and LaPointe 1974 "Effects of Teacher Sex and Student Sex on the Evaluation of College Instructors" analyzed student evaluation data from courses from various departments of the Southern Illinois University at Carbondale in 1971. Complete data were available from 1,259 students in 38 pairs of courses matched on course number and instructor sex. For the 20 instructor evaluation items analyzed, only two items had a statistically significant mean difference between female instructors and male instructors at the p=0.01 threshold: men instructors were rated higher for "spoke understandably", and women instructors were rated higher for "promptly returned homework and tests".

I'm not sure why Elmore and LaPointe 1974 is included in a list of studies finding bias in standard evaluations of teaching. No statistically significant difference was reported for 18 of the 20 instructor evaluation items, and, for the two items for which there was a reported difference, one difference favored male instructors and the other difference favored female instructors. But, more importantly, the Elmore and LaPointe 1974 research design does not permit the inference that student ratings were biased relative to reality; for example, no evidence is reported that indicates that the female instructors didn't return homework and tests more promptly on average than the male instructors did.

---

26.

Elmore and LaPointe 1975 "Effect of Teacher Sex, Student Sex, and Teacher Warmth on the Evaluation of College Instructors" analyzed student evaluation data from courses from various departments of the Southern Illinois University at Carbondale in 1974. Data were available from 838 students in 22 pairs of courses matched on course and instructor sex. Twenty standard instructor evaluation items were used, plus instructor responses and student responses to an item about whether the instructor's primary interest lay in the course content or in the students, and a five-point measure of how warm a person the instructor was. The p-value threshold was 0.0025.

Results indicated that "When students rate their instructor's interest and warmth, teachers perceived as warmer or primarily interested in students receive higher ratings in effectiveness regardless of their sex", that "In general, female faculty receive significantly higher effectiveness ratings than do male faculty when they rate themselves low in warmth or interested in course content", and that "Male teachers who rate themselves high in warmth or primarily interested in students receive significantly higher ratings than male teachers who rate themselves low in warmth or primarily interested in course content, respectively" (p. 374).

I'm not sure how these data establish an unfair bias in student evaluations of teaching.

---

27.

Ferber and Huber 1975 "Sex of Student and Instructor: A Study of Student Bias" reported on responses to three items from students in the first class meeting of four large introductory economics or sociology courses at the University of Illinois Urbana in 1972.

The first item asked students to rate men college teachers that they had had in seven academic areas and women college teachers that they had had in seven academic areas. Results in Table 1 indicate that, across the seven academic areas, the mean rating for men college teachers was identical to the mean rating for women college teachers (2.24).

The second question asked about student preferences for men instructors or women instructors in various types of classroom situations. Results in Table 2 indicate that most students did not express a preference, but, of the students who did express a preference, the majority preferred a man instructor. For example, of 1,241 students, 39 percent expressed a preference for a man instructor in a large lecture and 2 percent expressed a preference for a woman instructor in a large lecture.

The third item asked students to rate their level of agreement with a statement, attributed to a man or to a woman. For one statement, the prompt was: "A well-known American economist [Mary Killingsworth/Charles Knight] proposes that compulsory military service be replaced by the requirement that all young people give one year of service for their country". Results in Table 6 indicate that the mean level of agreement did not differ between Mary and Charles at p<0.05 among male students, among female students, or among the full sample.

For the other statement, the prompt was: "According to the contemporary social theorist [Frank Merton/Alice Parsons], in order to achieve equal educational opportunity in the United States, no parents should be allowed to pay for their children's education; every college student should borrow from the federal government to pay for tuition and living expenses". Results in Table 6 indicate that, on a rating scale from 1 for strongly agree to 5 for strongly disagree, the mean level of agreement differed at p<0.05 among male students, among female students, and among the full sample, with Alice favored over Frank (respective overall means of 3.38 and 3.66).

I'm not sure why Ferber and Huber 1975 is included in a list of studies finding bias in standard evaluations of teaching. The first item is the only item directly on point for assessing bias in student evaluations of teaching, and there was no overall difference in that item for male instructors and female instructors and no evidence that the lack of a difference was unfair.

---

Comments are open if you disagree, but I don't think that any of these three studies provide sufficient evidence to undercut the use of student evaluations in employment decisions.

And it's worth considering whether these data from the Nixon administration should be included in the main Holman et al. 2019 list, given that the sum of "76" studies "finding bias" in the Holman et al. 2019 list is being used to suggest inferences about the handling of student evaluations of teaching in contemporary times.


Let's continue our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias". See here for the first entry in the series and here for other entries.

---

22.

Heilman and Okimoto 2007 "Why Are Women Penalized for Success at Male Tasks?: The Implied Communality Deficit" reports on three experiments regarding evaluations of fictional vice presidents of financial affairs. The experiments do not concern student evaluations of teaching, so it's not clear to me that Holman et al. 2019 should classify this article under "Evidence of Bias in Standard Evaluations of Teaching".

---

23.

Punyanunt-Carter and Carter 2015 "Students' Gender Bias in Teaching Evaluations" indicated that 58 students in an introductory communication course were asked to complete a survey about a male professor or about a female professor. The article did not report inferential statistics, and, given the reported percentages and sample sizes, it's not clear to me that this study should be classified as finding bias.

For example, here are results from the first question, about instructor effectiveness, for which the article reported results only for the percentage of each student gender that agreed or strongly agreed that the instructor was effective:

For the female professor:
82% of 17 males, so 14 of 17
67% of 15 females, so 10 of 15

For the male professor:
69% of 13 males, so 9 of 13
69% of 13 females, so 9 of 13

Overall, that's 21 of 32 (66%) for the female professor and 18 of 26 (69%) for the male professor, producing a p-value of 0.77 in a test for the equality of proportions.
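For readers who want to verify that number, here is a minimal sketch of the pooled two-sample z-test for equality of proportions, using only the Python standard library; the counts are the ones derived above:

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """Pooled two-sample z-test for equality of proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# 21 of 32 agreed the female professor was effective; 18 of 26, the male professor
z, p = two_proportion_ztest(21, 32, 18, 26)
print(round(z, 2), round(p, 2))  # z ≈ -0.29, p ≈ 0.77
```

With so small a difference in proportions (66% versus 69%) and so small a sample, the test has essentially no evidence of a difference to report.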

---

24.

Young et al. 2009 "Evaluating Gender Bias in Ratings of University Instructors' Teaching Effectiveness" had graduate students and undergraduate students evaluate on 25 items "a memorable college or university teacher of their choice" (p. 4). Results indicated that "Female students rated their female instructors significantly higher on pedagogical characteristics and course content characteristics than they rated their male instructors. Also, male students rated male instructors significantly higher on the same two factors. Interpersonal characteristics of male and female instructors were not rated differently by the male and female students" (p. 9).

I'm not sure how much to make of the finding quoted above based on this study, given results in Table 4 of the article. The p-value section of Table 4 has a column for each of the three factors (interpersonal characteristics, pedagogical characteristics, and course content characteristics) and has seven rows, for student gender (A), student level (B), instructor gender (C), AxB, AxC, BxC, and AxBxC. So the table has 21 p-values, only 2 of which are under 0.05; the average of the 21 p-values is 0.52.

---

Comments are open if you disagree, but I don't think that any of these three studies provide sufficient evidence to undercut the use of student evaluations in employment decisions.


Let's pause our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias", to discuss three studies of student evaluations of teaching that are not in the Holman et al. 2019 list. I'll use the prefix "B" to refer to these bonus studies.

---

B1.

Meltzer and McNulty 2011 "Contrast Effects of Stereotypes: 'Nurturing' Male Professors Are Evaluated More Positively than 'Nurturing' Female Professors" reported on an experiment in which undergraduates rated a psychology job candidate, with variation in candidate gender (Dr. Michael Smith or Dr. Michelle Smith), variation in whether the candidate was described as "particularly nurturing", and variation in whether the candidate was described as "organized" or "disorganized". Participants responded to items such as "Do you think Dr. Smith's responses to students' questions in class would be helpful?" and "How do you think you would rate Dr. Smith's overall performance in this course?". Results indicated no main effect for gender, but the nurturing male candidate was rated higher than the control male candidate and the nurturing female candidate and marginally higher than the control female candidate.

For some reason, results for the "organized"/"disorganized" variation were not reported.

---

B2.

Basow et al. 2013 "The Effects of Professors' Race and Gender on Student Evaluations and Performance" reported on an experiment in which undergraduates from psychology, economics, and mathematics courses evaluated a three-minute engineering lecture from an animated instructor whose race (Black or White) and sex (male or female) were experimentally varied; participants also took a quiz on lecture content. Results indicated that "student evaluations did not vary by teacher gender", that "students rated the African American professor higher than the White professor on several teaching dimensions", and that students in the male instructor condition and in the White instructor condition did better on the quiz (p. 359).

---

B3.

I don't have access to Chisadza et al. 2019 "Race and Gender Biases in Student Evaluations of Teachers", but the highlights indicate that "We use an RCT to investigate race and gender bias in student evaluations of teachers" and that "We note biases in favor of female lecturers and against black lecturers". The abstract at Semantic Scholar indicates that the experiment was conducted in South Africa and that "Students are randomly assigned to follow video lectures with identical narrated slides and script but given by lecturers of different race and gender".

---

Comments are open if you disagree, but I don't think that there is much in B1 or B2 that would undercut the use of student evaluations in employment decisions. The experiments have high internal validity, but B1 had no main effect for gender, and B2's results aren't strong and consistent. Moreover, B1 and B2 use brief stimuli, so I don't know that the results are sufficiently informative about student evaluations at the end of a 15-week course.


Let's continue our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias". See here for the first entry in the series and here for other entries.

---

19.

Miller and Chamberlin 2000 "Women Are Teachers, Men Are Professors: A Study of Student Perceptions" reported on a study in which students in sociology courses were asked to indicate their familiarity with faculty members on a list, and, for faculty members that the student was familiar with, to indicate the highest education degree that the student thinks the faculty member has attained; listed faculty members were the faculty members in the sociology department, plus a fictitious man and a fictitious woman with whom, per footnote 6, no student indicated familiarity. Results indicated that "controlling for faculty salary, seniority, rank, and award nomination rate, the level of educational attainment attributed to male classroom instructors is substantially and significantly higher than it is for women" (p. 294).

This study isn't about student evaluations of teaching and, from what I can tell, any implications of the study for student evaluations of teaching should be detectable in student evaluations of teaching.

---

20.

From what I can tell, the key finding mentioned above from Miller and Chamberlin 2000 did not replicate in Chamberlin and Hickey 2001 "Student Evaluations of Faculty Performance: The Role of Gender Expectations in Differential Evaluations", which indicated that: "Male versus female faculty credentials and expertise were also nonsignificant on items assessing student perceptions of the highest degree received by the faculty member, the rank of the faculty member, and whether the faculty member was tenured" (p. 10). Chamberlin and Hickey 2001 reported evidence of male faculty being rated differently than female faculty on certain items, but no analysis was reported that assessed whether these differences in ratings could be accounted for by plausible alternate explanations such as faculty performance.

---

21.

Sprague and Massoni 2005 "Student Evaluations and Gendered Expectations: What We Can't Count Can Hurt Us" analyzed data from 66 students at a public university on the East Coast and 223 students at a public university in the Midwest in 1999. Key data were student responses to a prompt to print up to four adjectives to describe the worst teacher that the student ever had and then to print up to four adjectives to describe the best teacher that the student ever had. Results were interpreted to indicate that "Men teachers are more likely to be held to an entertainer standard...[and]...Women teachers are held to a nurturer standard" (p. 791). Table V indicates that Caring is the most common factor for the best male teachers and that Uncaring is the second most common factor for the worst male teachers, so it's not obvious to me that the data permit a strong inference that men aren't also held to a nurturer standard.

---

Comments are open if you disagree, but I don't think that studies 19 and 20 report data indicating unfair sex or race bias in student evaluations of teaching using a research design with internal validity, with internal validity referring to an analysis that adequately addresses plausible alternate explanations. Study 21 (Sprague and Massoni 2005) reported results suggesting a difference in student expectations for male faculty and female faculty, but I don't know that there's enough in that study to undercut the use of student evaluations in employment decisions.


Let's continue our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias". See here for the first entry in the series and here for other entries.

---

16.

Huston 2006 "Race and Gender Bias in Higher Education: Could Faculty Course Evaluations Impede Further Progress toward Parity" is a review that, as far as I can tell, does not report novel data on unfair sex or race bias in student evaluations of teaching.

Sandler 1991 "Women Faculty at Work in the Classroom: Or, Why It Still Hurts To Be a Woman in Labor" is a review/essay-type of publication.

---

17.

Miles and House 2015 "The Tail Wagging the Dog; An Overdue Examination of Student Teaching Evaluations" [sic for the semicolon] reported on an analysis of student evaluations from a southwestern university College of Business, with 30,571 cases from 2011 through 2013 for 255 professors across 1,057 courses with class sizes from 10 to 190. The mean rating for the 774 male-instructed courses did not statistically differ from the mean rating for the 279 female-instructed courses (p=0.33), but Table 7 indicates that the 136 male-instructed large required courses had a higher mean rating than the 30 female-instructed large required courses (p=0.01). I don't see results reported for a gender difference in small courses.

For what it's worth, page 121 incorrectly notes that scores from male-instructed courses range from 4.96 to 4.26; the 4.96 should be 4.20 based on the lower bound of 4.196 in Table 4. Moreover, Hypothesis 6 is described as regarding a gender difference for "medium and large sections of required classes" (p. 119) but the results are for "large sections of required classes" (p. 122, 123) and the discussion of Hypothesis 6 included elective courses (p. 119), so it's not clear why medium classes and elective courses weren't included in the Table 7 analysis.

---

18.

Martin 2016 "Gender, Teaching Evaluations, and Professional Success in Political Science" reports on publicly available student evaluations for undergraduate political science courses from a southern R1 university from 2011 through 2014 and a western R1 university from 2007 through 2013. Results for the items, on a five-point scale, indicated little gender difference in small classes of 10 students, a mean male instructor rating 0.1 and 0.2 points higher than the mean female instructor rating for classes of 100, and a mean male instructor rating 0.5 points higher than the mean female instructor rating for classes of 200 or 400.

The statistical models had predictors only for instructor gender, class size, and an interaction term of instructor gender and class size. No analysis was reported that assessed whether ratings could be accounted for by plausible alternate explanations such as course or faculty performance.
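For illustration, here is a minimal sketch of that kind of specification (gender, class size, and their interaction) fit by ordinary least squares; the data are simulated and the coefficient values are hypothetical, not Martin's:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: 1 = female instructor; class sizes like those discussed above
female = rng.integers(0, 2, n)
size = rng.choice([10, 100, 200, 400], n).astype(float)
# Simulate ratings in which the gender gap grows with class size
rating = 4.5 - 0.0004 * size - 0.0012 * female * size + rng.normal(0, 0.1, n)

# Design matrix: intercept, gender, size, and the gender x size interaction
X = np.column_stack([np.ones(n), female, size, female * size])
beta, *_ = np.linalg.lstsq(X, rating, rcond=None)

# Under this specification, the estimated female-minus-male gap at class size s
# is beta[1] + beta[3] * s: near zero in small classes, larger in big ones
print(beta)
```

Note what the specification leaves out: with only these three predictors, any course-level or instructor-level factor correlated with both gender and class size is absorbed into the estimated gap, which is the internal-validity concern raised above.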

---

Comments are open if you disagree, but I don't think that any of these three studies report a novel test for unfair sex or race bias in student evaluations of teaching using a research design with internal validity, with internal validity referring to an analysis that adequately addresses plausible alternate explanations. The interaction of instructor gender and class size that appeared in Miles and House 2015 and Martin 2016 appears to be worth further consideration in a research design that adequately addresses plausible alternate explanations.


Let's continue our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias". See here for the first entry in the series and here for other entries.

---

13.

Smith and Hawkins 2011 "Examining Student Evaluations of Black College Faculty: Does Race Matter" analyzed "undergraduate student ratings data for tenure-track faculty who used the 36-item student evaluation form adapted by the college" (p. 152), over a three-year period for the College of Education at a southeastern research university. Mean ratings, ordered from lowest to highest, were those for Black faculty, White faculty, and nonwhite nonblack faculty.

No analysis was reported that assessed whether ratings on the items could be explained by plausible alternate explanations such as course or faculty performance.

---

14.

Reid 2010 "The Role of Perceived Race and Gender in the Evaluation of College Teaching on RateMyProfessors.com" reported on RateMyProfessors data for faculty at the 25 highest ranked liberal arts colleges. Table 3 indicated that the mean overall quality ratings by race were: White (3.89), Other (3.88), Latino (3.87), Asian (3.75), and Black (3.48). Table 4 indicated that the mean overall quality ratings by gender were: male (3.87) and female (3.86).

No analysis was reported that assessed whether ratings on the overall quality item or the more specific items could be explained by plausible alternate explanations such as faculty department, course, or faculty performance.

---

15.

Subtirelu 2015 "'She Does Have an Accent but…': Race and Language Ideology in Students' Evaluations of Mathematics Instructors on RateMyProfessors.com" reported that an analysis of data on RateMyProfessors indicated that "instructors with Chinese or Korean last names were rated significantly lower in Clarity and Helpfulness" than instructors with "US last names", that "RMP users commented on the language of their 'Asian' instructors frequently but were nearly entirely silent about the language of instructors with common US last names", and that "RMP users tended to withhold extreme positive evaluation from instructors who have Chinese or Korean last names, although this was frequently lavished on instructors with US last names" (pp. 55-56).

Discussing the question of whether this is unfair bias, Subtirelu 2015 indicated that "...a consensus about whether an instructor has 'legitimate' problems with his or her speech...would have to draw on some ideological framework of expectations for what or whose language will be legitimized [that] would almost certainly serve the interests of some by constructing their language as 'without problems' or 'normal'...while marginalizing others by constructing their language as 'containing problems' or 'being abnormal'" (p. 56).

In that spirit, I'll refrain from classifying as "containing problems" the difference in ratings that Subtirelu 2015 detected.

---

Comments are open if you disagree, but I don't think that any of these three studies report a novel test for unfair sex or race bias in student evaluations of teaching using a research design with internal validity, with internal validity referring to an analysis that adequately addresses plausible alternate explanations.
