In this post, I discussed the possibility that "persons at the lower levels of hostile sexism are nontrivially persons who are sexist against men". Brian Schaffner provides more information on this possibility, in the paper "How Political Scientists Should Measure Sexist Attitudes". I'll place Figure 2 from the paper below:

From the paper's discussion of Figure 2 (p. 14):

The plot on the right shows the modest influence of hostile sexism on moderating the gender treatment in the politician conjoint. Subjects in the bottom third of the hostile sexism distribution were about 10 points more likely to select the female profile, a difference that is statistically significant (p=.005). However, the treatment effect was small and not statistically significant among those in the middle and top terciles.

From what I can tell, this evidence suggests that the proper interpretation of the hostile sexism scale is not as a measure of sexism against women but as a measure of male/female preference, with participants who prefer men sorted to high levels of the measure and participants who prefer women sorted to low levels of the measure. If hostile sexism were a proper linear measure of sexism against women, low values of hostile sexism would predict equal treatment of men and women and higher levels would predict favoritism of men over women.


This post discusses three unpublished studies that I don't expect to be working on in the foreseeable future.

---

1.

I posted at APSA Preprints "The Yuge Effect of Racist Resentment on Support for Donald Trump and…Attitudes about Automobile Fuel Efficiency Requirements?". This paper reports evidence indicating that a published measure of "racist resentment" does a remarkably good job predicting non-racial outcome variables such as environmental policy preferences. Sample results are at this prior post.

---

2.

I posted at OSF a write-up of results for a preregistered study "Belief in Genetic Differences and Support for Efforts to Reduce Inequality". I reported these data in an unaccepted proposal for a short study with the Time-sharing Experiments for the Social Sciences.

---

3.

Data for the second study and this third study are from a 2017 YouGov survey that I had conducted using funds from Illinois State University New Faculty Start-up Support and the Illinois State University College of Arts and Sciences. My initial plan was to run a version of my 2014 TESS proposal, but I saw the Carney and Enos paper (current version) in the 2015 MPSA program and realized that their experiments were similar to my plan, so I changed the survey.

Here is an early version of the planned survey.

One element of the new survey was an experiment involving attitudes about food stamps. I planned for the final three slides to each include an item about poor Americans, with the third item being randomly assigned to be about either poor White Americans or poor Black Americans. The third item was:

Most [randomize: poor Black Americans/poor White Americans] who receive government welfare could get along without it if they tried.

Carney and Enos had done something similar with the traditional racial resentment items, but these traditional items (such as "Over the past few years, blacks have gotten less than they deserve") aren't particularly good at measuring resentment. The "could get along without it if they tried" item is an old racial resentment item that wasn't included in the traditional four-item battery, but I think it does a nice job of capturing resentment.

I posted at OSF a write-up of results from this "unnecessary welfare experiment". I submitted to a journal a more extensive analysis and discussion, but the manuscript was rejected in peer review.

The "unnecessary welfare experiment" research design might have caused the estimated differences to be underestimates, given that the prior two items had the same response scale and were about poor Americans in general. Nonetheless, the results provide evidence that, in 2017, non-Hispanic White conservatives and non-Hispanic Whites in general reported more agreement that most poor Black Americans who receive government welfare could get along without it if they tried, compared to their reported agreement to the same statement about poor White Americans who receive government welfare.


The Morning et al. 2019 DuBois Review article "Socially Desirable Reporting and the Expression of Biological Concepts of Race" reports on an experiment from the Time-sharing Experiments for the Social Sciences. Documentation at the TESS link indicates that the survey was fielded between Oct 8 and Oct 14 of 2004, and the article was published online Oct 14 of 2019, so the data were about 15 years old, but I did not see anything in the article that indicated the year of data collection.

Here is a key result, discussed on page 11 of the article:

When respondents in the comparison group were asked directly whether they agreed with the statement on genetics and race, only 13% said they did. This figure is significantly lower than the 22% we estimated previously as "truly" supporting the race statement. As a result, we conclude that the social desirability effect for this item equals 9 percentage points (22 – 13).

That 22% estimate of support is for non-Black responses that are not weighted to reflect population characteristics, but my analysis indicated that the estimate of support falls to 14% when the weight variable in the TESS dataset is applied to the non-Black responses. The social desirability effect in the analysis with these weights is thus not statistically different from zero in the data. Nonetheless, the Morning et al. 2019 abstract generalizes the results to the population of non-Black Americans:

We show that one in five non-Black Americans attribute income inequality between Black and White people to unspecified genetic differences between the two groups. We also find that this number is substantially underestimated when using a direct question.
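As an aside for readers who want to run this sort of robustness check themselves, a weighted estimate requires nothing more than base R's weighted.mean; the toy data below are hypothetical stand-ins for illustration and are not the TESS responses or weights:

```r
# Hypothetical illustration of how weighting can shrink an estimate:
# "agree" is a 1/0 indicator for agreeing with the statement, and
# "weight" is a third-party survey weight (made up for this sketch).
agree  <- c(1, 0, 0, 1, 0, 0, 0, 0, 0, 0)
weight <- c(0.5, 1.2, 1.1, 0.4, 1.3, 1.0, 0.9, 1.2, 1.1, 1.3)

mean(agree)                  # unweighted estimate: 0.20
weighted.mean(agree, weight) # weighted estimate: 0.09, because the
                             # agreers happen to have small weights
```

The same logic applies to a treatment-group comparison: if the weights correlate with the outcome, the weighted and unweighted estimates can differ enough to change an inference.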

---

I would like for peer review to require [1] an indication of the year(s) of data collection and [2] a discussion of weighted results for an experiment when the data should be known or suspected to have included a third-party weight variable (such as data from TESS or a CCES module).

---

NOTES

1. This post is a follow-up of this tweet that tagged two of the Morning et al. 2019 co-authors.

2. In this tweet, I expressed doubt that a peer reviewer or editor would check these data to see if inferences are robust to weighting. Morning et al. 2019 indicates that a peer reviewer suggested that a weight be applied to account for an inequality between experimental groups (p. 8):

...the baseline group has a disproportionately large middle-income share and small lower-income share relative to the test and comparison groups. As suggested by one anonymous reviewer, we reran the analyses using a weight calculated such that the income distribution in the baseline group corresponds to that found in the treatment and comparison groups.

3. I am co-author in an article that discusses, among other things, variation in the use of weights for survey experiments in a political science literature.


Let's conclude our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias". See here for the first entry in the series and here for other entries.

This post focuses on what I consider to be promising research designs.

---

74.

Mitchell and Martin 2018 "Gender Bias in Student Evaluations" has two studies. The first study reports a comparison of comments about one female instructor to comments about one male instructor, drawn from official course evaluations and from RateMyProfessors, on dimensions such as competence, appearance, personality, and use of "teacher" or "professor". The study establishes little in the way of an all-else-equal comparison for instructor teaching style or effectiveness, and it didn't even involve the instructors teaching the same set of courses. Moreover, as I indicated here, the p-values for the reported results are biased downward in a way that undercuts inferences that there is a statistical difference.

The second study has a better research design, with Mitchell and Martin teaching different sections of the same course so that "all lectures, assignments, and content were exactly the same in all sections" (p. 650).

But this study also has errors in the p-values and lacks an all-else-equal element that could sufficiently eliminate all plausible explanations other than gender bias ("The only aspects of the course that varied between Dr. Mitchell's and Dr. Martin's sections were the course grader and contact with the instructor", p. 650). Moreover, Martin taught the higher-numbered sections, for which students plausibly differed from students in the lower-numbered sections (e.g., from what I can tell, response rates to the student evaluations were 17 percent for Mitchell and 12 percent for Martin); and Martin and Mitchell had taught before at the university and thus could have developed reputations that caused any difference in student evaluations for the course.

And the analysis for the second study excluded student evaluation data for sections 1 to 5 of the course; these data are not available on the university website.

---

75.

Rivera and Tilcsik 2019 "Scaling Down Inequality: Rating Scales, Gender Bias, and the Architecture of Evaluation" has a preregistered survey experiment and a quasi-natural experiment.

The quasi-natural experiment involved a university switching from a 10-point evaluation scale to a 6-point evaluation scale. Data from 105,034 student ratings of 369 instructors in 235 courses over 20 semesters with a 10-point scale and 9 semesters with a 6-point scale indicated that male instructors were more likely than were female instructors to get the highest rating on the 10-point scale (p. 257) but were not more likely than were female instructors to get the highest rating on the 6-point scale (p. 258); moreover, the 0.5-point male/female gap for the 10-point scale was reduced to a 0.1-point male/female gap for the 6-point scale. The authors reported results that addressed potential confounds, such as: "the change we observe is not driven by a general linear trend toward higher ratings for women (Models 8 and 11)" (p. 263).

The survey experiment was intended to build on these results with a research design that permitted a stronger causal inference. The experiment was conducted online with 400 students from 40 universities; the experimental manipulations involved the sex of the instructor and whether the rating scale had 6 points or 10 points. The teaching stimulus involved students receiving "identical excerpts from the transcript of a lecture and [being] randomly assigned either a male or a female name to the instructor who had ostensibly given the lecture" (p. 256).

For the 10-point scale, the mean rating was 7.8 for the male instructor and was 7.1 for the female instructor, a difference at p<0.05 of about 0.32 standard deviations, using the 0.64 gap mentioned on page 265. However, for the 6-point scale, the mean rating was 4.9 for the male instructor and was 4.8 for the female instructor, a difference at p>0.05 of about 0.1 standard deviations.
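For what it's worth, here is the back-of-the-envelope arithmetic for converting raw gaps into standard deviation units; the standard deviations below are my assumptions, back-solved to reproduce the reported standardized gaps, and are not values reported in the article:

```r
# Standardized difference = raw gap / standard deviation of the ratings.
# The assumed SDs (about 2 points on the 10-point scale and about 1 point
# on the 6-point scale) are my guesses, not figures from the article.
0.64 / 2.0 # about 0.32 SDs for the 10-point scale
0.1  / 1.0 # about 0.10 SDs for the 6-point scale
```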

Results from the survey experiment also indicated evidence that participants were more likely to use superlative words to describe the male instructor than to describe the female instructor.

I think that Rivera and Tilcsik 2019 is a great study that makes a convincing case that numeric student evaluations of teaching should have no more than 6 points. But I don't think that it provides convincing evidence that numeric student evaluations of teaching with 6 or fewer points should not be used in employment decisions, given that the experiment did not detect at p<0.05 a difference between the mean rating for the male instructor and the mean rating for the female instructor when using a 6-point scale and did not even detect a meaningfully large difference in the quasi-experimental data.

The survey experiment did provide evidence of gender bias in student use of superlatives, in items that the preregistration form indicated were for exploratory purposes, but I don't know that this sort of superlative use has a nontrivial effect on the employment decisions that student evaluations of teaching are properly used for.

Two notes and/or criticisms...

First, the topic of the lecture in the survey experiment was selected with gender in mind (p. 256):

All participants read an identical excerpt from the transcript of a lecture on the social and economic implications of technological change. We chose this topic because it has potentially broad appeal, and both technology and economics are traditionally male-dominated fields.

I don't know why it would be a good idea to select a topic that drew on two male-dominated fields instead of selecting a topic that was gender neutral or, preferably, adding a dimension to the survey experiment in which some participants received a topic from a male-dominated field and other participants received a topic from a female-dominated field. Limiting the topic to male-dominated fields undercuts the ability to generalize any bias detected in favor of the male instructors.

Second, the article suggests that teaching quality was held constant in the survey experiment ("...randomly vary the focal instructor's (perceived) gender and the rating scale while holding constant instructor quality", p. 256, emphasis in the original). But the only "teaching" that participants were exposed to was the teaching that could be inferred from reading lecture notes, which undercuts the ability to generalize results to in-person teaching, especially in-person teaching over a semester.

---

76.

The shortcoming of the survey experiment methodology in Rivera and Tilcsik 2019 is that the exposure to the instructor is too brief to permit the inference that any trivial-to-moderate bias detected in the survey experiment will survive an entire semester of exposure to the instructor, especially if the exposure is two or three days per week for 12 or more weeks. Relatedly, from Anderson and Kanner 2011: "Many studies have found that stereotypes decrease as individuals get to know out-group members (e.g., Anderssen, 2002)" (p. 1560).

MacNell et al. 2014 "What's in a Name: Exposing Gender Bias in Student Ratings of Teaching" addresses this shortcoming by having a male assistant instructor and a female assistant instructor for an introductory-level anthropology/sociology course act as themselves for one section and act as the other assistant instructor for another section; students were randomly assigned to section. If conducted properly and preregistered, this sort of study could provide strong evidence of gender bias. But as Benton and Li 2014 indicate, the study and the article have multiple important flaws, such as the assistant instructors not being blind to the sex that the instructor was acting as and the article not reporting results for the item that asked about the instructor's overall quality of teaching.

The experiment is also substantially underpowered, based on estimates from Rivera and Tilcsik 2019. The 6-point scales in the Rivera and Tilcsik 2019 survey experiment produced an estimate of the bias against female instructors of 0.10 standard deviations, and that estimate was not statistically significant. Using that estimate and the MacNell et al. 2014 sample sizes, the MacNell et al. 2014 study had 6 percent statistical power. The items in MacNell et al. 2014 had a 5-point scale, but the statistical power is only 25 percent even using the 0.40 standard deviation estimate from the Rivera and Tilcsik 2019 survey experiment conditions with the 10-point scale. R code:

library(pwr)

# Statistical power for the MacNell et al. 2014 group sizes (n1 = 20, n2 = 23),
# using the two bias estimates from the Rivera and Tilcsik 2019 survey experiment:
pwr.t2n.test(n1=20, n2=23, d=0.1) # 6-point-scale estimate: about 6 percent power
pwr.t2n.test(n1=20, n2=23, d=0.4) # 10-point-scale estimate: about 25 percent power

The MacNell et al. 2014 data are here. My analysis indicated that, comparing the perceived female instructor group to the perceived male instructor group, the p-value is p=0.15 for the overall evaluation item (a 0.46 standard deviation difference) and is p=0.07 for an index of all 15 items (a 0.58 standard deviation difference). The estimated bias on the index was 0.35 standard deviations (p=0.51) among male students and was 0.81 standard deviations (p=0.055) among female students, although the p-value is only p=0.45 for this 0.46 standard deviation difference. Stata code:

* Standardize the overall evaluation item and compare by perceived instructor gender
egen overall_std = std(overall)
ttest overall_std, by(taidgender) unpaired unequal

* Build and standardize an index of all 15 evaluation items
gen index = professional + respect + caring + enthusiastic + communicate + helpful + feedback + prompt + consistent + fair + responsive + praised + knowledgeable + clear + overall
egen index_std = std(index)
ttest index_std, by(taidgender) unpaired unequal

* Repeat the index comparison among male students and among female students,
* then test the student gender x perceived instructor gender interaction
ttest index_std if gender==1, by(taidgender) unpaired unequal
ttest index_std if gender==2, by(taidgender) unpaired unequal
reg index_std taidgender##gender

---

Discussion

Detecting gender bias in student evaluations of teaching is not simple. Non-experimental research that doesn't control for gender differences in teaching quality or teaching style and doesn't control for student differences isn't of much value unless the detected difference in evaluations is large enough to not be plausibly explained by a combination of gender differences in teaching quality or teaching style and non-random student selection to instructors and courses. And experimental research that involves only a brief exposure to a target instructor isn't informative about whether any detected gender bias will persist for an entire semester of exposure to an instructor.

I think that studies 74 through 76 have among the strongest research designs in the literature on bias in student evaluations of teaching, if not the strongest research designs, or at least have the potential for convincing research designs if aforementioned flaws are addressed. However, even accounting for the fact that online research designs in these studies produce inferences that might not apply to face-to-face courses, I don't know what about these studies or other studies should lead a department or university to not use student evaluations of teaching in employment decisions.

I think that the research design of the type in MacNell et al. 2014 could provide convincing evidence of bias in student evaluations of teaching, if preregistered, sufficiently powered, and with instructors blinded to condition. Even better would be for the research to be conducted or supervised by a team of researchers in an adversarial collaboration.

An interesting research design has been to analyze evaluations to assess whether the sub-items that predict an overall evaluation differ for female instructors and male instructors. However, it's not clear to me that such differences would be a bias that should lead a department or university to not use student evaluations of teaching in employment decisions, unless these biases manifest in the student responses to the items and not only in the correlations among responses to the items.

---

I think that student evaluations of teaching are a useful tool when used properly, but, if I were opposed to their use in employment decisions or otherwise, I don't know that I would focus on the claims of gender and race biases, such as "Mounting evidence of favouritism towards white male instructors doesn't dissuade universities". I think that it's correct that any gender or race bias would be smaller than a "beauty premium" favoring attractive instructors (e.g., Lombardo and Tocci 1979, Hamermesh and Parker 2005, Wallisch and Cachia 2018).

---

Let me end the post with a discussion of the Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" list of 76 studies "finding bias" that I have been working through pursuant to a suggestion by a Holman et al. coauthor. This list was presented by a Holman et al. 2019 coauthor in a widely-distributed tweet as a "list of 76 articles demonstrating gender and/or racial bias in student evaluations".

But the Holman et al. 2019 list has important flaws. The list is incomplete, and it is not clear that omitted studies are missing at random. For example, the Feldman 1992 review of experimental studies indicates that "Most of the laboratory studies reviewed here (and summarized in the Appendix) found that male and female teachers did not differ in college student's overall evaluation of them as professionals..." (emphasis in the original), and the Feldman 1993 review of non-experimental studies indicates that "a majority of studies have found that male and female college teachers do not differ in the global ratings they receive from their students" and that "when statistically significant differences are found, more of them favor women than men". This is not the impression provided by the Holman et al. 2019 list, which includes over fifteen pre-1991 "finding bias" studies, one pre-1991 "bias favoring women" study, and zero pre-1991 "no gender or race bias" studies.

Moreover, the Holman et al. 2019 list has at least nine studies that are essays or reviews that present no novel non-anecdotal data (Schuster and Van Dyne 1985, Feldman 1992, Feldman 1993, Andersen and Miller 1997, Baldwin and Blattner 2003, Huston 2006, Laube et al. 2007, Spooren et al. 2013, and Stark and Freishtat 2014). The list also has at least five studies that might not be properly classified as being about student evaluations of teaching (Brooks 1982, Heilman and Okimoto 2007, El-Alayli et al. 2018, Drake et al. 2019, and Piatak and Mohr 2019), at least three studies that might be better classified as duplicates of other entries (Boring 2015, Boring et al. 2016, and "Freiderike" et al. 2017), at least one study that might be better classified as "no bias" or as "bias favoring women" (Basow and Montgomery 2005), and at least three studies that might be better not listed as finding bias, if the bias is intended to be a bias in favor of straight white men or some combination thereof (Greenwald and Gillmore 1997, Uttl et al. 2017, and Hessler et al. 2018).

It's also worth noting that a nontrivial percentage of studies listed as "finding bias" are at least 20 years old: 6 studies from before 1980, 17 studies from before 1990, and 28 studies (37%) from before 1999.

I think that there is a public benefit to a list of studies assessing bias in student evaluations of teaching, but the list should at least be representative or be accompanied by an indication of evidence that the list is representative. For what it's worth, on November 26, I tweeted to two Holman et al. 2019 coauthors a link to a post listing errors in one of the 76 summaries; the errors were still there on December 16.


Let's continue our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias". See here for the first entry in the series and here for other entries.

This post should take us through all but three of the remaining Holman et al. 2019 "finding bias" entries.

---

For my numbering, I had the Huston 2006 review listed as both #10 and #16, so I am replacing my #16 with another review/essay-type publication: Sandler 1991 "Women Faculty at Work in the Classroom: Or, Why It Still Hurts To Be a Woman in Labor". Here is a sample from Sandler 1991 (p. 13):

Humor is a good way to handle some issues. If students call you Ms. or Mrs. or Miss, you can jokingly say "Oops, I've lost my professorship (or doctorate) again."

The above quote is the full text of one of the bullet points from Sandler 1991, but Sandler has more nuanced advice on her website (footnote omitted):

Humor is a good way to handle some issues, partly because it indicates that you are not taking what is happening as a personal affront, i.e., humor can be a way of showing strength because it shows that you are in charge. For example, if students call you Ms. or Mrs. or Miss, you can jokingly say, "Oops, I've lost my professorship (or doctorate) again." Although this works well with people who are comfortable using humor, it carries the risk of backfiring by putting the faculty member into a "joking match" with students.

I'm not sure how "Oops, I've lost my professorship (or doctorate) again" indicates that you are not taking what is happening as a personal affront. For what it's worth, I think that instructors who want to have such a discussion about a student's non-use of a title such as "Dr." should start that discussion with "I think that you should refer to me as 'Dr' because...".

Regardless, I don't see anything in Sandler 1991 that can be cited as novel evidence that student evaluations of teaching are biased.

---

43.

Kaschak 1981 "Another Look at Sex Bias in Students' Evaluations of Professors" reports data from 40 male undergraduates and 40 female undergraduates in an experiment in which the target instructors were male or female and were in one of six fields. Students rated instructors on six items; across seven tests involving the sex of the student, the sex of the instructor, the teaching field, and combinations of these factors, 6 of the 42 (14%) potential comparisons produced a p-value less than 0.05.
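For context on that 6-of-42 count, if all 42 comparisons were tested at the 0.05 level and every null hypothesis were true, about 2 significant results would be expected by chance. Treating the tests as independent (which tests sharing the same six items will not fully satisfy, so this is only a rough benchmark), the chance of 6 or more significant results can be sketched as:

```r
# Expected number of p < 0.05 results among 42 true-null tests
42 * 0.05 # 2.1

# Rough probability of 6 or more p < 0.05 results by chance alone,
# under the (unrealistic) assumption that the 42 tests are independent
pbinom(5, size = 42, prob = 0.05, lower.tail = FALSE)
```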

Main effects for instructor sex appeared for "powerful" and "effective" (favoring the male instructor) but did not appear for "concerned", "likeable", "excellent", or "would take this course".

I don't think that a publication from 1981 should be used to inform policy about the use of student evaluations of teaching in 2020 or beyond, but I think it would be a good idea for student evaluations of teaching to not ask students to rate how "powerful" the instructor is or appeared to be.

---

44.

Statham et al. 1991 Gender and University Teaching: A Negotiated Difference is a book that reports on a study that involved interviews with professors, observations of professors, and surveys of students for 167 professors at a large Midwestern state university: 31 female professors and 57 male professors from male-dominated departments (80%+ men) and 40 female professors and 39 male professors from departments that were not male-dominated.

Results indicated several gender differences in teaching style: among other things and controlling for factors such as class size and course level, the women professors tended to spend more class time involving students than did the men professors (Table 3.2), the women professors offered more positive evaluations and more negative evaluations of students than did the men professors (Table 4.2), and the women professors offered more personalizations such as personal statements about themselves than did the men professors (Table 5.1). Moreover, there were no gender differences at p<0.05 in students challenging professors (Table 4.1).

Table 6.3 reports results from student evaluations for six competence-related items and five likability items, with the item scales ranging from 1 for strongly agree to 5 for strongly disagree. None of the 11 items had a p<0.05 difference between the mean for the men professors and the mean for the women professors.

Some differences between women professors and men professors appeared in associations of student evaluations by sex for instructional activities (Table 6.4), authority management techniques (Table 6.5), and personalizing activities (Table 6.6). For example, acknowledgement of student contributions positively associated at p<0.001 with likability among women professors but negatively associated at p<0.001 with likability among men professors.

Some of the patterns might be due to gender bias among students. For example, Table 6.4 indicates that students' solicitation of information negatively associated at p<0.001 with competence ratings of women professors but did not associate at p<0.05 with competence ratings for men professors. However, even presuming that the p-value for this difference is less than 0.05, for all we know students' solicitation of information truly did negatively associate with competence among women professors but not among men professors.

The finding of gender differences in teaching style, if correct, is relevant for studies of bias in student evaluations of teaching that do not sufficiently control for teaching style.

---

45.

Freeman 1994 "Student Evaluations of College Instructors: Effects of Type of Course Taught, Instructor Gender and Gender Role, and Student Gender" reports results from two experiments with undergraduates from introduction to psychology classes. In Experiment 1, each student responded to descriptions of three female instructors or descriptions of three male instructors; there was no main effect for instructor gender, but, with regard to gender role, androgynous instructors (mean of 5.99) were rated more effective than feminine instructors (4.47) or masculine instructors (4.40). Experiment 2 did not concern student evaluations of teaching.

---

46.

Basow 1995 "Student Evaluations of College Professors: When Gender Matters" reports on evaluations from students of 37 female faculty and 99 male faculty at a private undergraduate institution over four semesters, with a more detailed discussion of results from Fall 1986 (5,403 evaluations) and Spring 1990 (5,216 evaluations). Discussing the results, Basow 1995 indicates that (p. 622):

It can be argued that the effect sizes of the gender variables are so small, individually accounting for only .5% to 4% of the variance in the instructor ratings, as to be negligible.

I think that Holman et al. 2019 reversed the summaries for Basow and Silberg 1987 and Basow 1995.

---

47.

Burns-Glover and Veith 1995 "Revisiting Gender and Teaching Evaluations: Sex Still Makes a Difference" reported results from an experiment involving 78 undergraduates from a small liberal arts college who were asked to indicate which personality characteristics the university should look for in a tenure-track candidate referred to as [Sam/Sarah/Dr.] Larson. Students were asked to rate on a 0-to-6 scale 52 traits such as rational, daring, soft, and jolly. For the next page of the questionnaire, students were asked to rate on a 0-to-6 scale 25 behaviors such as "is an expert in field of study".

Table 1 lists the 20 traits that had a mean above the midpoint of the scale: 13 of these traits were male-typed and 7 were female-typed, but I don't see in the article an indication of the total number of male-typed traits or the total number of female-typed traits. Three of the traits (self-confident, stable, and steady) were rated higher at p<0.05 for Sarah than for Sam or Dr.; there were no other traits listed in Table 1 with a main effect for the Sarah/Sam/Dr. manipulation.

Responses about behaviors were analyzed with a discriminant analysis. For instructor gender, the modal category limited to Sarah or Sam was 51 percent. However, knowing the full set of ratings for a given student permitted correct classification of the rated instructor 91 percent of the time (39 of 43). The authors argue that the discriminant analysis results for "Dr." suggest that students perceived "Dr. Larson" to be male; that might be true, but "Sam" is an unfortunate choice to signal instructor sex, given that "Sam" can be short for Samantha.

The data are interesting, but I don't know that it would matter if student responses truly differed as indicated in this study, if these differences did not translate into a bias detectable in student evaluations of teaching.

---

48.

Andersen and Miller 1997 "Gender and Student Evaluations of Teaching" is a review. Here is a quote that Andersen and Miller 1997 relayed (p. 127) from Sandler 1991 (not the version published in Communication Education but another version; see this link for the quote):

One male student continually objected to many of the statements made by a woman faculty member in her class. He would call out comments such as "That doesn't make sense," "I disagree with that," and similar statements in response to the professor's substantive remarks. She recognized that his comments were not related to the substance of her statements when the following occurred: the faculty member, using her own experience as a teaching example, began to state that she had been at a supermarket and the male student immediately interrupted to call out "That's not true." (Sandler 1991, p. 5).

Maybe that happened, and maybe that was due to gender bias.

---

49.

Baker and Copp 1997 "Gender Performance Matters Most: The Interaction of Gendered Expectations, Feminist Course Content and Pregnancy in Students' Course Evaluations" reports an analysis of data from 1992 (student evaluations N=245 across three terms) about "the complexities of students' gendered expectations by considering what happens when a woman professor (Dr. Phyllis Baker, coauthor) teaches a course with controversial (feminist) content and then becomes pregnant" (p. 29).

Maybe something here generalizes to other situations in 2020 and beyond, and, if so, it would be nice to have updated research on that.

---

50.

Greenwald and Gillmore 1997 "No Pain, No Gain? The Importance of Measuring Course Workload in Student Ratings of Instruction" concerns grading leniency and student evaluations, and presents no evidence "demonstrating gender and/or racial bias in student evaluations".

---

51.

Moore 1997 "Student Resistance to Course Content: Reactions to the Gender of the Messenger" discusses resistance that a female instructor received and her attempts to address this resistance. Here is a sample (p. 130):

I asked the class how they would have responded if the male professor was teaching this course with the same primary textbook with the word "feminist" in the title. A female student reported that a male professor then would be speaking against males, so no favoritism would be apparent, whereas a female relaying the same message would be favoring "girls." The recurrent theme was that I am necessarily self-interested when I teach about family, especially when I present any data or theory that may be interpreted as favorable to women or unfavorable to men. I sensed that students felt that the male professor was not only more objective and scientific in his approach, but also that he speaks for everyone. I, on the other hand, am only capable of proffering "the women's" perspective.

This does not appear to be a correct reading of the female student's statement, which appears to make the point that a member of Group X speaking against Group X is properly treated differently than a member of Group Y speaking against Group X. The question of fairness would thus involve whether a woman speaking against men is treated the same as a man speaking against women.

I'm not sure what evidence of bias in student evaluations this publication offers. The publication notes that "While women's studies courses earn high evaluations, instructors of those courses are often attacked personally in those evaluations, accused of bias and of hating men" (p. 132). But if student evaluations of teaching aren't used to compare instructors of women's studies courses to instructors of courses in different departments, then I'm not sure that this would be a bias that would influence employment outcomes.

---

52.

Basow 2000 "Best and Worst Professors: Gender Patterns in Students' Choices" reports responses from 61 female students and 47 male students to two open-ended items, in counterbalanced order: "Think of the best professor you've had in college...describe what made him or her the 'best', in your opinion" and the same item but with "worst" substituted for "best". Results included these findings: "...about twice as many male as female faculty were chosen as 'best' by this sample; this is proportional to the male/female ratio of faculty at the college", and "For choice of 'worst' professors, there was no student gender by faculty gender interaction. Students made their choice proportional to the number of male and female faculty they estimated they had had" (pp. 411-412).

This sounds similar to Basow et al. 2006 "Gender Patterns in College Students' Choices of Their Best and Worst Professors".

---

53.

Basow 2000 "Gender Dynamics in the Classroom" is from a book that isn't in my institutional library. Here's a description from Maricic et al. 2016 "Gender Bias in Student Assessment of Teaching Performance" (p. 138):

Interesting studies were conducted by Basow (2000b) and Sprague and Massoni (2005). Namely, both studies asked students to depict their "best" and "worst" male and female teachers. Their results were conclusive. Traits of "best" female teachers were caring and nurturing, while the traits of "best" male teachers were funny and entertaining. On the other hand, when it comes to "worst" teachers, common traits for both genders were unorganised, unclear, indifferent, and rude. "Worst" female teachers were just the opposite of the "best" female teacher; they were characterized as rigid, mean, and unfair. Interestingly, "worst" male teachers were self-centred and unenthusiastic.

But Maricic et al. 2016 might have confused the two Basow 2000 citations: neither "best professor" nor "best professors" shows up in a Google Books search of the Basow 2000 "Gender Dynamics in the Classroom" chapter.

---

54.

Sinclair and Kunda 2000 "Motivated Stereotyping of Women: She's Fine If She Praised Me but Incompetent If She Criticized Me" reports results from three studies.

In Study 1, 83 male undergraduates and 97 female undergraduates provided ratings regarding courses that the student had taken the prior term. Results indicated that "students' evaluations of female instructors are more dependent on the grades they have received from them than are their evaluations of male instructors" (p. 1333); for example, using the first course that a student rated, the drop-off in mean evaluations from a high grade to a low grade was 6.22 for male instructors and was 20.54 for female instructors, with female instructors rated lower than male instructors in the low grade group but with no p<0.05 difference in the high grade group.

In Study 2, 54 male undergraduates watched a male confederate manager or a female confederate manager provide positive feedback or negative feedback on the participant's responses to an interpersonal skills questionnaire. The pattern from Study 1 conceptually replicated in Study 2: no p<0.05 gender difference in ratings of the manager's skill in the positive feedback condition, a p<0.05 difference in ratings of the manager's skill in the negative feedback condition, and a p<0.05 difference in the difference. For participants' rating of manager competence, only a main effect appeared at p<0.05.

The Study 2 p<0.05 differences in ratings of manager skill based on 54 participants spread across four conditions suggest a large effect size: the "penalty" for negative feedback was d=0.66 for the male managers and was d=2.05 for the female managers.

Study 3 involved participants from Study 2 paired with another participant, to assess whether the key patterns from Study 2 replicate when the positive feedback or the negative feedback is provided to another person. Results indicated no differences when participants rated the female manager or the male manager who had provided feedback to a person who was not the participant: "Male and female evaluators were given comparable ratings when providing negative feedback as well as when providing positive feedback" (p. 1339).

---

55.

Arbuckle and Williams 2003 "Students' Perceptions of Expressiveness: Age and Gender Effects on Teacher Evaluations" reports results from an experiment involving a computer-generated stick-figure lecturer with a female voice that in prior research had been attributed by students about equally to a man or a woman. The 352 student participants watched a 35-minute video in which the stick-figure lecturer appeared, and the students then completed an evaluation form indicating that the professor was a [male/female] [under age 35/over age 55].

The lecturer in the male condition was rated higher on 4 of the 9 dependent variables for which results were reported: enthusiasm, felt accepted, meaningful voice tone, and showed interest. The 5 remaining dependent variables were: precise teaching, logical and organized, seemed conscientious, used scientific terminology, and relaxed and confident. Table II indicated that the higher ratings in the male condition were nearly entirely attributable to the male under age 35 condition.

My calculations indicated that the difference between the mean for the male condition and the mean for the female condition was 0.31 on the 6-point scale, with a standard deviation of about 1.33 in the male condition and 1.38 in the female condition, so the 0.31 difference was about 0.23 standard deviations. Presuming that participants were equally assigned to the male condition and the female condition, the p-value was 0.03 for a t-test of the difference in mean ratings between the male condition and the female condition (in Stata: ttesti 176 0 1.33 176 0.31 1.38).
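For reference, the same test can be sketched from summary statistics alone, using a normal approximation for the p-value (reasonable with roughly 350 degrees of freedom); the inputs are the values from my calculations above:

```python
import math

# Summary statistics from my calculations above (assumed equal assignment).
n1, n2 = 176, 176        # male condition, female condition
diff = 0.31              # difference in mean ratings on the 6-point scale
sd1, sd2 = 1.33, 1.38    # per-condition standard deviations

# Standard error of the difference in means, then the t statistic.
se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
t = diff / se

# Two-sided p-value via the normal approximation to the t distribution.
p = 2 * (1 - 0.5 * (1 + math.erf(t / math.sqrt(2))))

print(round(t, 2), round(p, 3))  # roughly t = 2.15, p = 0.03
```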

The evaluations had 12 items, but results were not reported for the 3 of the 12 dependent variables that had "negative skews and a heterogeneity of variance" (p. 510). The appendix indicated that the three dropped items were: "Presented scientific principles rather than opinions", "Helped me to gain a broader understanding", and "Appeared to know and understand the subject well". Those seem like important items, and it's odd and a bit suspicious that these 3 items were dropped from the analysis; real-life faculty evaluations don't exclude student evaluation items merely because the items have a heterogeneity of variance.

So we are left with no evidence of a gender difference on important items such as logical and organized and appeared to know and understand the subject well, but with evidence of a gender difference on the less important items of enthusiasm, felt accepted, meaningful voice tone, and showed interest. It would be fine with me if student evaluations of teaching did not include a "meaningful voice tone" item.

Based on this 1995 AP story, there appears to be an unpublished paper by the co-authors providing more experimental evidence of bias in student evaluations.

---

56.

Ewing et al. 2003 "Prejudice against Gay Male and Lesbian Lecturers" reports results from a study involving 261 introduction-to-psychology students. Student participants were given a curriculum vitae for the guest lecturer, with the curriculum vitae signaling or not signaling that the instructor was gay or lesbian. The assignment was not purely random: it was based on which side of the room the student sat on and, because students had not been assigned to the courses, students were non-randomly placed into the female instructor or male instructor condition. The guest lecturer then gave a lecture that was intended to be strong (animated and direct) or weak (dry and indirect); the lecture topic concerned advanced studies and careers related to psychology.

Ratings were lower for the weak lecture than for the strong lecture, but results indicated no main effect for lecturer sex or sexual orientation. However, results did indicate an interaction that was the opposite of that which would be expected if students used the weak lecture to derogate gay and lesbian lecturers: "after a strong lecture, students rated acknowledged gay male and lesbian lecturers more negatively than lecturers of unspecified sexual orientation; but after a weak lecture, students rated acknowledged gay male and lesbian lecturers more positively than lecturers of unspecified sexual orientation" (p. 576).

---

57.

Anderson and Smith 2005 "Students' Preconceptions of Professors: Benefits and Barriers According to Ethnicity and Gender" reports on an experiment involving 633 undergraduate responses based on a syllabus for a Race, Gender and Inequality course, with variation in instructor gender, instructor ethnicity, and instructor teaching style. From what I can tell, there were no main effects detected for instructor gender or instructor ethnicity in terms of instructor warmth, instructor capability, or instructor political bias.

I'll put the other two Anderson publications next...

---

58.

Anderson 2010 "Students' Stereotypes of Professors: An Exploration of the Double Violations of Ethnicity and Gender" reports on an experiment involving 594 undergraduate responses based on a course syllabus with variation in instructor gender, instructor ethnicity, instructor teaching style, and course taught. The most relevant outcome variable is student responses about the instructor's professional competence, for which "there were no significant effects associated with this analysis" (p. 466). The abstract indicates that "Women professors were viewed as more warm than men professors even though their course syllabuses were identical", but that does not seem like an important finding for deciding whether to use student evaluations of teaching in employment decisions.

---

59.

Anderson and Kanner 2011 "Inventing a Gay Agenda: Students' Perceptions of Lesbian and Gay Professors" reports on an experiment involving 622 undergraduate responses based on a syllabus for a Psychology of Human Sexuality course with variation in instructor gender, instructor sexual orientation, instructor political ideology, and typographical errors. Results did not indicate a difference in perceived competence for gay/lesbian instructors compared to heterosexual instructors and did not indicate an interaction of instructor sexual orientation with the presence of errors in the syllabus. But results did indicate that students rated gay/lesbian instructors as more politically biased than students rated heterosexual instructors, with a difference equivalent to 0.24 standard deviations based on summary statistics on page 1548.

Anderson and Kanner 2011 reports a second study (N=545) that had the same design as the first study but included measures for homonegativity, with results indicating, among other things, that "Among modern homonegatives, lesbian/gay professors (M = 3.07, SD = 0.09) were viewed as more politically biased than were heterosexuals (M = 2.52, SD = 0.08)..." (p. 1556). About half of the students were classified as non-homonegative, and it's not clear to me from the article whether, overall, students rated gay/lesbian instructors as more politically biased than students rated heterosexual instructors, which would replicate a key finding in the first study.

It's possible that, for a Psychology of Human Sexuality course, students on average unfairly rate gay or lesbian instructors as more politically biased than students rate heterosexual professors. But unless there is a "political bias" item on student evaluations of teaching or unless this unfairness manifests in other ratings, I don't know how relevant that unfairness is for assessing whether to use student evaluations of teaching in employment decisions.

---

60.

Basow and Montgomery 2005 "Student Ratings and Professor Self-Ratings of College Teaching: Effects of Gender and Divisional Affiliation" reports on results from evaluations of 23 male professors and 20 female professors at a small liberal arts college, rated by 407 male students, 365 female students, and 31 students who did not report their gender. Professors were at the assistant professor rank or higher, taught 100- or 200-level courses, and were matched by division. Evaluations were conducted between weeks 7 and 12 of a 14-week semester.

Based on my read of the results of this study, it's not clear to me why Holman et al. 2019 would classify this study as "finding bias", in which the "finding bias" category is presented as finding bias against women or nonwhites (Holman et al. has a separate category for "bias favoring women"). Here is the Holman et al. 2019 summary of Basow and Montgomery 2005 related to gender bias in student evaluations:

Based on the results of the study, Basow and Montgomery concluded that professor gender and divisional affiliation (department/field of study) contributed to the results of student evaluations. Female professors were rated higher than male professors on two interpersonal factors and on scholarship, and natural science courses were rated the lowest for most factors. Humanities professors received the highest overall ratings however, male professors in the humanities received lower ratings than female professors.

---

61.

DiPietro and Faye 2005 "Online Student-Ratings-of-Instruction (SRI) Mechanisms for Maximal Feedback to Instructors" is an unpublished paper that I didn't locate. Here is a summary from Smith 2009 "Student Ratings of Teaching Effectiveness for Faculty Groups Based on Race and Gender" (p. 617):

Of the three groups of faculty (Hispanic, Asian-American, and White) included in the DiPietro and Faye (2005) study, Hispanic faculty received the lowest course evaluation ratings. Asian-American faculty received slightly better course evaluations than their Hispanic colleagues, but their scores were still lower than the scores of White faculty. The number of African-American faculty in DiPietro and Faye study was too small to draw any conclusions.

I'm not sure why Smith 2009 isn't in the Holman et al. 2019 list.

---

62.

Hamermesh and Parker 2005 "Beauty in the Classroom: Instructors' Pulchritude and Putative Pedagogical Productivity" reports an analysis of student evaluations of instructors at the University of Texas at Austin from 2000 to 2002. Results in Table 3 indicate that instructor beauty associates with higher evaluations, that male instructors are rated higher than female instructors, and that the association of instructor beauty with evaluations is stronger for male faculty than for female faculty. The analysis included only predictors for these instructor or course characteristics: beauty, sex, minority status, non-native English status, tenure-track status, lower division course, and one-credit course.

---

63.

Abel and Meltzer 2007 "Student Ratings of a Male and Female Professors' Lecture on Sex Discrimination in the Workforce" reports results from an experiment in which psychology students, among other things, responded to ten items about a written lecture and the lecturer, with a variation in whether the lecture was attributed to Dr. Michael Smith or Dr. Mary Smith. Mean responses differed at p<0.05 for 3 of the 5 items about the lecture and 3 of the 5 items about the instructor.

So the experiment has internal validity. But the experiment also suffers from issues common to research published prior to the replication crisis, such as a lack of preregistration and a small sample size (43 in the male instructor condition and 44 in the female instructor condition) that had only a 21 percent chance of detecting a 0.25 standard deviation difference. R code for that:

library(pwr)
pwr.t2n.test(n1 = 43, n2 = 44, d = 0.25, sig.level = 0.05)  # power ≈ 0.21

But some observed differences were much larger than 0.25 standard deviations, such as the 0.77 standard deviation difference for the item of "The professor presented the information in a sexist light", so the experiment would be sufficiently powered (0.94) to detect a difference that large. And, again, it would be fine with me for student evaluations of teaching to not include an item such as "The professor presented the information in a sexist light".
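Those power figures can also be cross-checked with a quick normal-approximation sketch (the pwr package uses the noncentral t distribution, so the exact values differ slightly from this approximation):

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def approx_power(n1, n2, d, z_crit=1.959964):
    """Approximate power of a two-sided, two-sample t-test at alpha = 0.05,
    using the normal distribution in place of the noncentral t."""
    delta = d * math.sqrt(n1 * n2 / (n1 + n2))  # noncentrality parameter
    return norm_cdf(delta - z_crit) + norm_cdf(-delta - z_crit)

print(round(approx_power(43, 44, d=0.25), 2))  # about 0.21
print(round(approx_power(43, 44, d=0.77), 2))  # about 0.95
```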

The lecture concerned gender differences in occupation and ended with a claim that:

...current research shows that men and women equally employed in the same male-dominated position, with equal education, skills and credentials have different pay scales. Typically, men still receive higher salaries than women for doing the same things. This is largely the result of a historical male dominated society in the United States that still exists today.

So even accepting the key finding from the experiment, the finding is limited to psychology students rating a man instructor asserting that men have an unfair advantage higher than psychology students rated a woman instructor asserting that men have an unfair advantage. There is no parallel experiment reported here comparing men instructors asserting that women have an unfair advantage to women instructors asserting that women have an unfair advantage, and there is no experiment reported here comparing men instructors to women instructors for a lecture not directly related to gender.

It's worth noting that the unrepresentative context of the experiment—a man instructor and a woman instructor asserting that men have an unfair advantage—sometimes gets lost such that the experiment is described in general terms, such as in Nittrouer et al. 2017, with citation 18 referring to Abel and Meltzer 2007:

Additional studies similarly show that female (versus male) teachers are rated more negatively (14–17). For example, participants who read a lecture, which was posited as having been written and delivered by a male or female professor, rated the lecture by the male (versus the female) professor significantly more positively (18).

---

Here's a passage from the discussion in Abel and Meltzer 2007 (p. 179):

Finally, we are currently designing an experimental study comparing student ratings for male and female professors presenting more mundane lecture information versus the more emotionally charged lecture as used in this study which could then examine the effects for type of lecture information related to sex of student and sex of professor.

I'm not sure what happened to that study.

---

64.

McPherson et al. 2009 "What Determines Student Evaluation Scores? A Random Effects Analysis of Undergraduate Economics Classes" reports on student evaluation data from economics courses at the University of North Texas from 1994 through 2005, using the four items that the Department of Economics had chosen. Reporting results separately for the introductory principles course and for upper-level courses, main effects appeared for instructor sex in both models, although the differences were small (0.094 units and 0.066 units on a 4-point scale), and a main effect appeared for White instructors in the upper-level courses (0.120 units).

Models contained controls for factors such as instructor age and instructor experience, but nothing that would credibly measure instructor effectiveness.

---

65.

Boring 2015 "Gender Biases in Student Evaluations of Teachers and Their Impact on Teacher Incentives" reports results for 22,665 observations made by 4,423 undergraduate students of 372 different teachers at a university in France.

The abstract indicates that: "The results of generalized ordered logit regressions and fixed-effects models suggest that male teachers tend to receive higher SET scores because of students' gender biases". But the evidence indicates only that this is a bias in the sense of difference.

The abstract indicates that: "Men are perceived as being more knowledgeable (male gender stereotype) and obtain higher SET scores than women, but students appear to learn as much from women as from men, suggesting that female teachers are as knowledgeable as men". But teaching evaluations aren't intended as measures of teaching effectiveness ("Student ratings have never been intended to serve as a proxy for learning", Linse 2017: 95) and, even if they were, it is an inferential leap from student final exam scores to teacher knowledge.

---

66.

Boring et al. 2016 "Student Evaluations of Teaching (Mostly) Do Not Measure Teaching Effectiveness" analyze data from MacNell et al. 2014 and Boring 2015.

The discussion of the Boring 2015 data in Boring et al. 2016 has added the term "natural experiment"; however, the research design did not hold the quality of instruction constant between female instructors and male instructors: "the natural experiment does not allow us to control for differences in teaching styles across instructors" (p. 7).

---

67.

I suspect that Boring 2017 "Gender Biases in Student Evaluations of Teaching" might be a substantially reworked version of Boring 2015 "Gender Biases in Student Evaluations of Teachers and their Impact on Teacher Incentives", given that, among other things, the study setup and findings are similar, the acknowledgements are similar, and the funding grant number (612413) is the same.

---

68.

Let's transition from the "natural" experiment of the Boring 2015 data from France to the "quasi-experimental" data on 19,952 student evaluations from a School of Business at a university in the Netherlands between 2009 and 2013, reported on in Mengel et al. 2017 "Gender Bias in Teaching Evaluations". Students were randomly assigned to section instructors. Controlling for factors such as student grades, male students rated female instructors 0.21 standard deviations lower than male instructors on instructor-related items (pp. 552-553), and female students rated female instructors 0.08 standard deviations lower.

However, Table 7 indicates that these differences varied by instructor rank in a pattern that does not seem consistent with a straightforward student gender bias: male students rated female instructors lower than male instructors among student instructors and Ph.D. student instructors but not among lecturers and professors; moreover, female students rated female instructors lower than male instructors among student instructors but rated female instructors higher than male instructors among lecturers and professors.

---

69.

I think the "Freiderike, M., J. Sauermann, & U. Zolitz" 2017 "Gender Bias in Teaching Evaluations" paper listed in Holman et al. 2019 became the Mengel et al. 2017 "Gender Bias in Teaching Evaluations" article: for example, data in the paper's Table 7 are identical to data in the article's Table 7. And "Freiderike" is the first name of the first author Friederike Mengel.

---

70.

Wagner et al. 2016 "Gender, Ethnicity and Teaching Evaluations: Evidence from Mixed Teaching Teams" reports data from student evaluations in a graduate school in the Netherlands from 2010 through 2015. The data include courses in which a male instructor and a female instructor co-taught the course.

Evaluations ranged from an observed low of 1.82 to an observed high of 5, with a 4.271 mean and a 0.443 standard deviation. Results in Table 4 indicate that female instructors were rated about 0.25 standard deviations lower than male instructors were rated, controlling for Caucasian status, whether the instructor was a course leader, and either instructor age and age-squared (0.091 points, 0.05<p<0.10) or whether the instructor was new (0.110 points, 0.05<p<0.10). Some results in Table 5 include a control for publications, and some results in Table 6 interact instructor sex with the course type (e.g., Governance & development policy), but the Table 5 and Table 6 models don't control for instructor experience (age, or being new), even though sample sizes for the relevant base models in Tables 4 through 6 range from an N of 499 to an N of 688.
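For what it's worth, the standardization is back-of-the-envelope arithmetic from the summary statistics above (my calculation, not the paper's), dividing each point estimate by the 0.443 standard deviation of the evaluations:

```python
sd = 0.443  # standard deviation of the evaluation scores

# Point estimates for the female-instructor coefficient under the two
# experience controls reported in Table 4.
d_age = round(0.091 / sd, 2)  # model with age and age-squared
d_new = round(0.110 / sd, 2)  # model with a new-instructor indicator

print(d_age, d_new)  # 0.21 0.25
```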

---

71.

Rosen 2017 "Correlations, Trends, and Potential Biases among Publicly Accessible Web-Based Student Evaluations of Teaching: A Large-scale Study of RateMyProfessors.com Data" reported results from ratings on RateMyProfessors.com for instructors who had at least 20 ratings. Results indicated evidence at p<0.001 that male instructors were rated higher on clarity, helpfulness, and overall quality and were rated lower on easiness, but that these mean differences were trivial, respectively 0.05, 0.04, 0.04, and 0.03 units on a 1-to-5 scale.

Further results were reported controlling for the hot/not hot rating (p. 41):

Approximately, 22.7% of male faculty on RateMyProfessors are rated as 'hot', compared to 27.8% of female faculty. Since it has already been shown in Table 1 that perceived physical appearance correlates with evaluation scores, whether a professor is rated as 'hot' or 'not hot' should be controlled when analysing potential gender biases as well.

But I don't understand the reason for the "should be controlled" if the hot/not hot ratings are gender biased or at least could plausibly be gender biased.

---

72.

Drake et al. 2019 "Grading Teachers: Race and Gender Differences in Low Evaluation Ratings and Teacher Employment Outcomes" reports an analysis of performance ratings of preK-12 teachers in Michigan from 2011 to 2015. Results indicated that "male teachers and teachers of color were more likely to be labeled 'minimally effective' or 'ineffective' than their same-school peers even after conditioning on evaluators' prior judgments and value-added scores" (p. 1826).

I don't think that these were ratings made by students.

---

73.

Fan et al. 2019 "Gender and Cultural Bias in Student Evaluations: Why Representation Matters" reports results from student evaluations at a large public university in Australia. A discussion of results and limitations appears in this passage from the article (p. 14):

Throughout this paper, and in the title, we have used the term "bias" when describing the statistically significant effect females and non-English speaking teachers. It should be pointed out that one of the limitations of this study is that it is only able to show association, e.g., being female is associated with a lower SET score, we cannot say what really was the cause for a lower score. However, if SET is really measuring teaching quality, then the only plausible causes are either that females are generally bad teachers across a large population, or there's bias, the same argument can be made for teachers who have non-English speaking background. Since we find no credible support that females, or someone with an accent, should generally be bad teachers, we have chosen to use the term "bias".

The article indicates that "Around 80% of the scores are given at either 5 or 6 and our results suggest that bias comes in at this top level, between 'agree' and 'strongly agree'" (p. 8). Given that the difference occurs between two positive ratings, I don't think that it is correct for the authors to be concerned about finding credible support for the claim that women or persons with an accent are "bad" teachers.

---

I'll plan to use the next post to wrap up the review of the Holman et al. 2019 "finding bias" list.

Tagged with: , ,

Let's continue our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias". See here for the first entry in the series and here for other entries.

---

40.

Sidanius and Crane 1989 "Job Evaluation and Gender: The Case of University Faculty" reported on data for 9,005 undergraduates evaluating 254 male instructors and 147 female instructors. Controlling for factors such as the student's sex and GPA and the instructor's rank and broad field, male instructors received higher evaluations than did female instructors. However, the discussion provides several caveats (pp. 192-193):

...the fact that, in general, men were perceived as being more competent than women need not be a function of gender stereotyping or bias; it is quite possible that men are, in fact more competent in their teaching roles...[and]...Even if these differences [in evaluations] are a function of gender bias rather than perceptual accuracy, the differences are too small to play any major role in how men and women are evaluated.

Holman et al. 2019 lists this study as "finding bias".

---

41.

Feldman 1992 "College Students' Views of Male and Female College Teachers: Part I: Evidence from the Social Laboratory and Experiments" is a review of experimental studies of student evaluations of teaching. I'm not sure that Feldman 1992 should count as an independent publication finding bias. Moreover, the sense of the literature provided in Feldman 1992 is a bit in tension with that provided in the Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" list. Feldman 1992 includes studies up through 1990 and indicates the following (p. 342, emphasis in the original):

Most of the laboratory studies reviewed here (and summarized in the Appendix) found that male and female teachers did not differ in college student's overall evaluation of them as professionals (as indicated by students' perceptions of their overall teaching performance, their instructional ability, their effectiveness, and their competence and by whether or not students would take a course with them).

However, for studies from 1990 and earlier, the Holman et al. 2019 list includes zero "No Gender or Race Bias" entries and only one "Bias Favoring Women" entry, and the "Bias Favoring Women" entry isn't an experiment. Part of this tension might be due to the Holman et al. 2019 list being incomplete. Multiple studies listed in the Feldman 1992 references are not included in the Holman et al. 2019 list. For example, here is a passage from Mackie 1976 "Students' Perceptions of Female Professors" (p. 346, emphasis in the original):

Contrary to expectations, women teachers were perceived as more competent than male teachers in both task and socio-emotional spheres. Further, the males were not assigned a significantly higher prestige score.

And here is a passage from Basow and Distenfeld 1985 "Teacher Expressiveness: More Important for Males than Females?" (p. 51):

As other research has found (Elmore & LaPointe, 1974, 1975; Lombardo & Tocci, 1979), teacher sex did not have a main effect on student evaluations of teachers.

I have already discussed three relevant publications omitted from the Holman et al. 2019 list. The Feldman 1992 and 1993 reviews indicate that there are additional relevant studies omitted from the Holman et al. 2019 list.

---

42.

Feldman 1993 "College Students' Views of Male and Female College Teachers: Part II: Evidence from Students' Evaluations of Their Classroom Teachers" is a review of non-experimental studies of student evaluations of teaching. The abstract indicates (emphasis added):

Although a majority of studies have found that male and female college teachers do not differ in the global ratings they receive from their students, when statistically significant differences are found, more of them favor women than men. Across studies, the average association between gender and overall evaluation, while favoring women (average r = +.02), is so small as to be insignificant in practical terms.

This again raises the question of the representativeness of the Holman et al. 2019 list, at least for early studies. The Holman et al. 2019 list does not include Bausell and Magoon 1972 "Expected Grade in a Course, Grade Point Average, and Student Ratings of the Course and the Instructor". Here is the Feldman 1993 summary for the association between instructor sex and students' overall evaluation of the instructor in that study:

Bausell and Magoon (1972): 23 courses taught by women and 23 by men at the University of Delaware (excluding courses in the College of Economics, College of Nursing, and the Department of Secretarial Studies), academic year 1969-1970, matched on the semester course was taught, level of the course, and academic department within which the course was taught; single overall rating item ("Overall, how do you evaluate the instructor?").

The direction of the r of .03 (as derived from data on p. 171) and Z of 0.203 cannot be determined from information given.

The Holman et al. 2019 list does not include Brown 1976 "Faculty Ratings and Student Grades: A University-wide Multiple Regression Analysis". Here is the Feldman 1993 summary for the association between instructor sex and students' overall evaluation of the instructor:

Brown (1976): 2,360 course sections at the University of Connecticut, Spring semester of 1973; average score on the 8-item University of Connecticut Rating Scale for Instruction. r = +.04* (as given in Tables 2 and 3); Z = +1.943*; N = 2,360 section ratings.

For what it's worth, I think that it is acceptable and even preferable to exclude 20th-century studies from a review of research on bias in student evaluations of teaching, if the purpose of the list is to inform the handling of student evaluations of teaching in 2019 and beyond. But if Holman et al. 2019 includes 20th-century studies, it would be nice to have some indication of whether the included studies are representative.

---

Comments are open if you disagree, but I don't think that there are novel data in these publications indicating an unfair bias. Even if there were, I think that the data would be too old to be relevant for discussions of whether student evaluations of teaching should be used in employment decisions made in 2019 or beyond. (I'm using "novel data" to refer to data initially reported in a publication, not data presented again in reviews.)


Let's continue our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias". See here for the first entry in the series and here for other entries.

---

37.

Kierstead et al. 1988 "Sex Role Stereotyping of College Professors: Bias in Students' Ratings of Instructors" reported results for two experiments. The first experiment had 20 female college students and 20 male college students assigned to one of four conditions for a text description of a teaching situation: male/female professor and a professor who was/was not described as frequently spending time with students outside of class; the male professor received more favorable ratings than did the female professor (p<0.02).

The second experiment had 20 female college students and 20 male college students assigned to one of four conditions for a slide tape presentation of a lecture: male/female teacher and a teacher who was/was not smiling; the male teacher received more favorable ratings than did the female teacher (p<0.06).

Results in these two experiments were largely driven by the female target who was not sociable or who did not smile. Here are the means, on a scale from 1 for poor to 6 for outstanding:

5.3 for the sociable male professor

5.3 for the sociable female professor

5.4 for the unsociable male professor

4.4 for the unsociable female professor

---

4.1 for the smiling male teacher

4.2 for the smiling female teacher

4.5 for the unsmiling male teacher

3.3 for the unsmiling female teacher

Pooled standard deviations were about 0.65 for the "sociable" experiment conditions and about 0.9 for the "smiling" experiment conditions, so these are large differences between the sociable female and the unsociable female (about 1.4 standard deviations) and between the smiling female and the unsmiling female (about 1 standard deviation).
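As a quick check on that arithmetic, here is a minimal sketch (the means are as reported above; the pooled standard deviations are my approximations, so the results are back-of-the-envelope figures, not exact values from the paper):

```python
def standardized_diff(mean_a, mean_b, pooled_sd):
    """Difference in means expressed in pooled-SD units (a Cohen's d-style figure)."""
    return (mean_a - mean_b) / pooled_sd

# "Sociable" experiment: sociable female (5.3) vs. unsociable female (4.4),
# pooled SD about 0.65
d_sociable = standardized_diff(5.3, 4.4, 0.65)

# "Smiling" experiment: smiling female (4.2) vs. unsmiling female (3.3),
# pooled SD about 0.9
d_smiling = standardized_diff(4.2, 3.3, 0.9)

print(round(d_sociable, 2), round(d_smiling, 2))  # about 1.38 and 1.0
```

This reproduces the "about 1.4 standard deviations" and "about 1 standard deviation" gaps described above.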

---

38.

Buck and Tiene 1989 "The Impact of Physical Attractiveness, Gender, and Teaching Philosophy on Teacher Evaluations" reported results for 42 undergraduate education majors at a state university in the Midwest across eight conditions, with experimental manipulations for instructor sex, instructor attractiveness, and whether an authoritarian or humanistic teaching perspective was attributed to the instructor. I did not see an indication of whether the instructors were described as college instructors or K-12 instructors, but the photographs were described as depicting persons about 21 years old.

Students rated the instructors on eight items. Many of the interactions had a p-value under 0.05 for a given evaluation item, but there was no main effect for instructor attractiveness and only one main effect for instructor sex (female instructors rated higher than male instructors on overall effectiveness). There was a statistically significant difference for seven of the eight items for teaching perspective (see Tiene and Buck 1987 for a discussion of the teaching perspective results).

I'll quote in full the Holman et al. 2019 summary of Buck and Tiene 1989:

An experiment was conducted at a Midwestern state university. The sample size was composed of 42 undergraduate seniors, mostly Caucasian females; 10 of the 42 students were male and 3 were black (all females). The students were given 1 of 4 different photographs- attractive or unattractive teachers who were either male or female. Each photograph also included a description of the teachers' teaching style/philosophy, divided by either an authoritarian or humanistic style. The results of the study showed that attractiveness did not have an effect on the ratings of instructor effectiveness. The study also found that authoritarianism was strongly associated with negative evaluations. However, contradictingly attractive authoritarian females were rated significantly more positively than the other 3 possibilities of authoritarian instructor characteristics.

It's not clear to me why Buck and Tiene 1989 is in the Holman et al. 2019 list for "Finding Bias" when Holman et al. 2019 has separate lists for "Bias Favoring Women" and "No Gender or Race Bias". The bias in favor of attractive authoritarian females would, if anything, suggest filing under "Bias Favoring Women". Strictly speaking, it is a "bias" that students rated authoritarian teachers less favorably than humanistic teachers, but filing Buck and Tiene 1989 under "Finding Bias" for that reason would stretch the definition of "bias" to include legitimate reasons such as teaching philosophy for students to rate one instructor more favorably than another instructor. (And the implication of the three main Holman et al. 2019 categories is that the "Finding Bias" category is limited to race bias or gender bias disfavoring women).

---

39.

Dukes and Victoria 1989 "The Effects of Gender, Status, and Effective Teaching on the Evaluation of College Instruction" reported results from 144 undergraduates from four sociology courses and two political science courses. Each student was given a description of four scenarios of college teaching, with experimental manipulations that included the professor's sex (e.g., Carl Pierce or Carla Pierce), whether the professor was a department chair, and the presence or absence of a certain characteristic of the professor (knowledgeable, enthusiastic, rapport, and organized).

For predicting teacher effectiveness, results indicated no main effect of professor sex in any of the four scenarios (knowledgeable, enthusiastic, rapport, or organized). There were two reported interactions involving professor sex that, from what I can tell, were each limited to one of the four scenarios, such as instructor sex interacting with chair status in the organized scenario.

Dukes and Victoria 1989 is another publication that I'm not sure should be classified under "Finding Bias". The Feldman 1992 review of the literature (pp. 356-357) summarizes the Dukes and Victoria 1989 results, indicating that only 2 of the 32 comparisons detected a statistically significant association and that neither of these 2 comparisons was a main effect of instructor sex.

---

Comments are open if you disagree, but I don't think that data from the 1980s or earlier are relevant for discussions of whether student evaluations of teaching should be used in employment decisions made in 2019 or beyond.
