Comments on "Gender Bias in Student Evaluations"

Gronke et al. (2018) reported in Table 6 that "Gender Bias in Student Evaluations" (Mitchell and Martin 2018, hereafter MM) was, as of 25 July 2018, the PS: Political Science & Politics article with the highest Altmetric score, described as "a measure of attention an article receives" (p. 906, emphasis removed).

The MM research design compared student evaluations of and comments on Mitchell (a woman) to student evaluations of and comments on Martin (a man) in official university course evaluations and on the Rate My Professors website. MM reported evidence that "the language students use in evaluations regarding male professors is significantly different than language used in evaluating female professors" and that "a male instructor administering an identical online course as a female instructor receives higher ordinal scores in teaching evaluations, even when questions are not instructor-specific" (p. 648).

I think that there are errors in the MM article that warrant a correction. I mention or at least allude to some or all of these things in a forthcoming symposium piece in PS: Political Science & Politics, but I elaborate below. Comments are open if you see an error in my analyses or inferences.

---

1.

MM Table 1 reports on comparisons of official university course evaluations for Mitchell and for Martin. The table indicates that the sample size was 68, and the file that Dr. Mitchell sent me upon my request has 23 of these comments for Martin and 45 of these comments for Mitchell. Table 1's "Personality" row indicates 4.3% for Martin and 15.6% for Mitchell, which correspond to 1 personality-related comment of 23 comments for Martin and 7 personality-related comments of 45 comments for Mitchell. The table has three asterisks to indicate a p-value less than 0.01 for the comparison of the 4.3% and the 15.6%, but it is not clear how such a low p-value was derived.

I conducted a simulation in R to estimate, given 8 personality-related comments across 68 comments, how often a random distribution of these 8 personality-related comments would result in Martin's 23 comments having 1 or fewer personality-related comments. For the simulation, for 10 million trials, I started with eight 1s and sixty 0s, drew 23 of these 68 numbers to represent comments on Martin, and calculated the difference between the proportion of 1s for Martin and the proportion of 1s in the residual numbers (representing comments on Mitchell):

# Simulate 10 million random allocations of the 8 personality-related comments
# across the 68 comments
list <- rep_len(NA,10000000)
for (i in 1:10000000){
   # eight 1s (personality-related comments) and sixty 0s (other comments)
   comments <- c(rep_len(1,8),rep_len(0,60))
   # draw 23 of the 68 comments to represent the comments on Martin
   martin <- sample(comments,23,replace=FALSE)
   # difference in proportions: Martin's 23 comments minus Mitchell's 45 comments
   diff.prop <- sum(martin)/23 - (8-sum(martin))/45
   list[i] <- diff.prop
}
# tabulate how often each difference in proportions occurred
stack(table(list))

Here are results from the simulation:

   values                 ind
1  290952  -0.177777777777778
2 1412204   -0.11207729468599
3 2788608 -0.0463768115942029
4 2927564  0.0193236714975845
5 1782937   0.085024154589372
6  646247   0.150724637681159
7  135850   0.216425120772947
8   14975   0.282125603864734
9     663   0.347826086956522

The -0.1778 in line 1 represents 0 personality-related comments of 23 comments for Martin and 8 personality-related comments of 45 comments for Mitchell (0% to 17.78%), which occurred 290,952 times in the 10 million simulations (2.9 percent of the time). The -0.1121 in line 2 represents 1 personality-related comment of 23 comments for Martin and 7 personality-related comments of 45 comments for Mitchell (4.3% to 15.6%), which occurred 1,412,204 times in the 10 million simulations (14.1 percent of the time). So the simulation indicated that Martin receiving only 1 or fewer of the 8 personality-related comments would be expected to occur about 17 percent of the time if the 8 personality-related comments were distributed randomly. But recall that the MM Table 1 asterisks for this comparison indicate a p-value less than 0.01.
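Because, under random allocation, the number of personality-related comments among Martin's 23 comments follows a hypergeometric distribution, the simulation can be checked analytically. The one-liner below is my own cross-check, not a reproduction of MM's analysis, and should agree with the roughly 17 percent figure from the simulation:

# probability of 1 or fewer of the 8 personality-related comments among
# Martin's 23 comments, under random allocation: about 0.17
phyper(1, m=8, n=60, k=23)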

MM Table 2 reports on comparisons of Rate My Professors comments for Mitchell and for Martin, with a reported sample size of N=54, which is split into sample sizes of 9 for Martin and 45 for Mitchell in the file that Dr. Mitchell sent me upon my request; the nine comments for Martin are still available at the Rate My Professors website. I conducted another simulation in R for the incompetency-related comments, for which the corresponding proportions were 0 of 9 for Martin and 3 of 45 for Mitchell (0% to 6.67%).

# Simulate 10 million random allocations of the 3 incompetency-related comments
# across the 54 Rate My Professors comments
list <- rep_len(NA,10000000)
for (i in 1:10000000){
   # three 1s (incompetency-related comments) and fifty-one 0s (other comments)
   comments <- c(rep_len(1,3),rep_len(0,51))
   # draw 9 of the 54 comments to represent the comments on Martin
   martin <- sample(comments,9,replace=FALSE)
   # difference in proportions: Martin's 9 comments minus Mitchell's 45 comments
   diff.prop <- sum(martin)/9 - (3-sum(martin))/45
   list[i] <- diff.prop
}
# tabulate how often each difference in proportions occurred
stack(table(list))

Here are results from the simulation:

   values                 ind
1 5716882 -0.0666666666666667
2 3595302  0.0666666666666667
3  653505                 0.2
4   34311   0.333333333333333

The -0.0667 in line 1 represents 0 incompetency-related comments of 9 comments for Martin and 3 incompetency-related comments of 45 comments for Mitchell (0% to 6.67%), which occurred 5,716,882 times in 10 million simulations (57 percent of the time). So the simulation indicated that Martin's 9 comments having zero of the 3 incompetency-related comments would be expected to occur about 57 percent of the time if the 3 incompetency-related comments were distributed randomly. The MM Table 2 asterisk for this comparison indicates a p-value less than 0.1.
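The same kind of analytic cross-check (again mine, not MM's) applies here and should agree with the roughly 57 percent figure from the simulation:

# probability of 0 of the 3 incompetency-related comments among Martin's
# 9 comments, under random allocation: about 0.57
phyper(0, m=3, n=51, k=9)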

I have concerns about other p-value asterisks in MM Table 1 and MM Table 2, but I will not report simulations for those comparisons here.

---

2.

MM Table 4 inferential statistics appear to be unadjusted for the lack of independence of some observations. Click here, then Search by Course > Spring 2015 > College of Arts and Sciences > Political Science > POLS 2302 (or click here). Each "Total Summary" row at the bottom reflects 218 evaluations; for example, the first item, "Overall the instructor(s) was (were) effective", has 43 strongly agrees, 55 agrees, 75 neutrals, 24 disagrees, and 21 strongly disagrees, which suggests that 218 students completed these evaluations. But the total Ns reported in MM Table 4 are greater than 218. For example, the "Course" line in MM Table 4 has an N of 357 for Martin and an N of 1,169 for Mitchell, for a total N of 1,526. That 1,526 is exactly seven times 218, and the MM appendix indicates that the student evaluations had 7 "Course" items.
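For reference, here is the arithmetic behind that observation:

43 + 55 + 75 + 24 + 21   # 218 evaluations in the first item's "Total Summary" row
357 + 1169               # 1,526, the total "Course" N in MM Table 4
218 * 7                  # 1,526: 218 evaluations times the 7 "Course" items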

Using this code, I reproduced MM Table 4 t-scores closely or exactly by treating each observation as independent and conducting a t-test assuming equal variances, suggesting that MM Table 4 inferential statistics were not adjusted for the lack of independence of some observations. However, for the purpose of calculating inferential statistics, multiple ratings from the same student cannot be treated as if they were independent ratings.
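For readers without access to that code, below is a minimal sketch of the type of calculation I mean: a two-sample t-test assuming equal variances that treats every rating as an independent observation. The counts in the sketch are hypothetical and are included only to illustrate the calculation; they are not the Texas Tech counts.

# hypothetical counts of ratings (5 = strongly agree ... 1 = strongly disagree),
# for illustration only
martin.counts <- c(10, 15, 15, 6, 5)
mitchell.counts <- c(33, 40, 60, 18, 16)
# expand the counts into one observation per rating
martin.ratings <- rep(5:1, times=martin.counts)
mitchell.ratings <- rep(5:1, times=mitchell.counts)
# two-sample t-test assuming equal variances, treating each rating as independent
t.test(martin.ratings, mitchell.ratings, var.equal=TRUE)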

The aforementioned code reports p-values for individual-item comparisons of evaluations for Mitchell and for Martin, which avoids the problem of a lack of independence for some student responses. But I'm not sure that much should be made of any differences detected or not detected between evaluations for Mitchell and evaluations for Martin, given the lack of randomization of students to instructors, the absence of evidence that the students in Mitchell's sections were sufficiently similar before the course to the students in Martin's sections, and the possibility that students in these sections might have already had courses or interactions with Mitchell and/or Martin, with the evaluations reflecting these prior experiences.

---

3.

Corrected inferential statistics for MM Table 1 and MM Table 2 would ideally reflect consideration of whether non-integer counts of comments should be used, as MM appears to have done. Multiplying proportions in MM Table 1 and MM Table 2 by sample sizes from the MM data produces some non-integer counts of comments. For example, the 15.2% for Martin in the MM Table 1 "Referred to as 'Teacher'" row corresponds to 3.5 of 23 comments, and the 20.9% for Mitchell in the MM Table 2 "Personality" row corresponds to 9.4 of 45 comments. Based on the data that Dr. Mitchell sent me, it seems that a comment might have been discounted by the number of sentences in the comment. For example, four of the official university course evaluation comments for Martin contain the word "Teacher", but the percentage for Martin is not 4 of 23 comments (17.4%) but instead 3.5 of 23 comments (15.2%), presumably because one of the "teacher" comments had two sentences, only one of which referred to Martin as a teacher; the other three comments that referred to Martin as a teacher did not have multiple sentences.
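The back-calculations for those two examples are:

0.152 * 23   # 3.496, or about 3.5 "Teacher" comments for Martin in MM Table 1
0.209 * 45   # 9.405, or about 9.4 "Personality" comments for Mitchell in MM Table 2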

Corrected inferential statistics for MM Table 1 and MM Table 2 for the frequency of references to the instructors as a professor should reflect consideration of the instructors' titles (e.g., "Dr.") and job titles (e.g., Instructor, Visiting Professor). For instance, for MM Table 1, the course numbers in the MM data match course listings for the five courses that Mitchell or Martin taught face-to-face at Texas Tech University in Fall 2015 or Spring 2015 (see here):

Mitchell
POLS 3312 Game Theory [Fall 2015]
POLS 3361 International Politics: Honors [Spring 2015]
POLS 3366 International Political Economy [Spring 2015]

Martin
POLS 3371 Comparative Politics [Fall 2015]
POLS 3373 Governments of Western Europe [Spring 2015]

Online CVs indicated that Mitchell's CV listed her Texas Tech title in 2015 as Instructor and that Martin's CV listed his Texas Tech title in 2015 as Visiting Professor.

A correction could also discuss the fact that, while Mitchell is referred to as "Dr." 19 times across all MM Table 1 and MM Table 2 comments, none of these comments refer to Martin as "Dr.". Martin's CV indicated that he earned his Ph.D. in 2014, so I do not see how the non-reporting of references to Mitchell and Martin as "Dr." in the official student evaluations in MM Table 1 can be attributed to some comments being made before Martin received his Ph.D. Rate My Professors comments for Martin date to November 2014; however, even if the non-reporting of references to Mitchell and Martin as "Dr." in MM Table 2 can be attributed to some comments being made before Martin received his Ph.D., any use of "Professor" for Martin must be discounted because students presumably had more titles with which to refer to Mitchell (e.g., "Dr.", "Professor") than to refer to Martin (e.g., "Professor").

---

Other notes:

---

4.

PS: Political Science & Politics should require authors to upload data and code so that readers can more clearly assess what the authors did.

---

5.

MM Table 4 data appear to have large percentages of enrolled students who did not evaluate Mitchell or Martin. Texas Tech data for Spring 2015 courses here indicate that enrollment for Mitchell's four sections of the course used in the study was 247 (section D6), 247 (section D7), 243 (section D8), and 243 (section D9), and that enrollment for Martin's two sections of the course was 242 (section D10) and 199 (section D11). Mitchell's evaluations had ratings from 167 of the 980 students in her sections, for a 17.0 percent response rate, and Martin's evaluations had ratings from 51 of the 441 students in his sections, for an 11.6 percent response rate. It's possible that Mitchell's nearly 50 percent higher response rate did not affect differences in mean ratings between the instructors, but the difference in response rates would have been relevant information for the article to include.
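Here is the response-rate arithmetic:

247 + 247 + 243 + 243   # 980 students enrolled in Mitchell's four sections
242 + 199               # 441 students enrolled in Martin's two sections
167 / 980               # about 0.170, Mitchell's response rate
51 / 441                # about 0.116, Martin's response rate
0.170 / 0.116           # about 1.47, so nearly 50 percent higher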

---

6.

MM state (p. 652, emphasis in the original):

"To reiterate, of the 23 questions asked, there were none in which a female instructor received a higher rating."

My calculations indicate that Mitchell received a higher rating than Martin did on 3 of the 23 MM Table 4 items: items 17, 21, and 23. Moreover, MM Table 4 indicates that the mean for Mitchell was higher than the mean for Martin across the three Technology items. I think that the "there were none" statement is intended to indicate that Mitchell did not receive a higher rating than Martin did on any of the items for which the corresponding p-value was sufficiently low, but, if that's the case, then that should be stated clearly because the statement can otherwise be misleading.

But I'm curious how MM could have reported a difference in favor of Mitchell if MM were reporting results from one-tailed statistical tests designed to detect a difference in favor of Martin, which is how I read the MM Table 4 Technology line, given its t-score of 1.93 and p-value of 0.027.
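For what it's worth, a t-score of 1.93 with the large sample sizes in MM Table 4 corresponds to a one-tailed p-value of about 0.027 and a two-tailed p-value of about 0.054, which is why I read the Technology line as reflecting a one-tailed test; the quick check below uses the normal approximation to the t-distribution, which should be adequate given the sample sizes:

pnorm(-1.93)       # about 0.027, the one-tailed p-value for a t-score of 1.93
2 * pnorm(-1.93)   # about 0.054, the corresponding two-tailed p-value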

---

7.

MM reports that the study indicated that "a male instructor administering an identical online course as a female instructor receives higher ordinal scores in teaching evaluations, even when questions are not instructor-specific" (p. 648). But that was not always true: as indicated above, MM Table 4 even indicates that the mean for Mitchell was higher than the mean for Martin across the three not-instructor-specific Technology items.

---

8.

The MM appendix (p. 4) indicated that:

Students had a tendency to enroll in the sections with the lowest number initially (merely because those sections appeared first in the registration list). This means that section 1 tended to fill up earlier than section 3 or 4. It may also be likely that students who enroll in courses early are systematically different than those who enroll later in the registration period; for example, they may be seniors, athletes, or simply motivated students. For this reason, we examined sections in the mid- to high- numerical order: sections 6, 7, 8, 9, and 10.

The last sentence should indicate that data were from sections 6 to 11. See the sample sizes for Martin in the Texas Tech website data: item 1 for section D10 has student evaluation sample sizes of 6, 12, 10, 1, and 3, for a total of 32; adding the sample sizes for item 1 from section D11 (7, 5, 6, 1, and 0) raises that total to 51; and multiplying 51 by 7 produces 357, which is the sample size for Martin in the "Course" section of MM Table 4.
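That arithmetic, for reference:

sum(6, 12, 10, 1, 3)                  # 32 evaluations for item 1 of section D10
sum(6, 12, 10, 1, 3, 7, 5, 6, 1, 0)   # 51 after adding item 1 of section D11
51 * 7                                # 357, the Martin "Course" N in MM Table 4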

---

9.

I think that Blåsjö (2018) interpreted the statement that "For this reason, we examined sections in the mid- to high- numerical order: sections 6, 7, 8, 9, and 10" as if Mitchell and Martin collected data for other sections but did not analyze these data. Blåsjö: "Actually the researchers threw away at least half of the actual data". I think that that is a misreading of the (perhaps unclear) statement quoted above from the MM appendix. From what I can tell based on the data at the Texas Tech site, data were collected for only sections 6 to 11.

---

NOTE:

Thanks to representatives from the Texas Tech IRB and the Illinois State University IRB, respectively, for providing and forwarding the link to the Texas Tech student evaluations.
