Andrew Gelman linked to a story (see also here) about a Science article by Annie Franco, Neil Malhotra, and Gabor Simonovits on the file drawer problem in the Time-sharing Experiments for the Social Sciences (TESS). TESS fields social science survey experiments, and sometimes the results of these experiments are not published.

I have been writing up some of these unpublished results but haven't submitted anything yet. Neil Malhotra was kind enough to indicate that I'm not stepping on their toes, so I'll post what I have so far for comment. From what I have been able to determine, none of the studies discussed below has been published, but let me know if I am incorrect about that. I'll try to post a more detailed write-up of these results soon, but in the meantime feel free to contact me for details on the analyses.

I've been concentrating on bias studies because it's important to know when a large-scale, nationally representative sample produces little to no evidence of bias; such a study does not prove that there is no bias, but reporting these studies helps provide a better estimate of the magnitude of bias. It's also important to report evidence of bias in unexpected directions.

 

TESS 241

TESS study 241, based on a proposal from Stephen W. Benard, tested for race and sex bias in worker productivity ratings. Respondents received a vignette about the work behavior of a lawyer whose name was manipulated in the experimental conditions to signal the lawyer's sex and race: Kareem (black male), Brad (white male), Tamika (black female), and Kristen (white female). Respondents were asked how productive, valuable, hardworking, competent, respected, honorable, prestigious, capable, intelligent, and knowledgeable the lawyer was, and whether the lawyer deserved a raise.

Substantive responses to these eleven items were used to create a rating scale, with items standardized before summing and cases retained if there were substantive responses for at least three items; this scale had a Cronbach's alpha of 0.92. The scale was standardized so that its mean and standard deviation were respectively 0 and 1; higher values on the scale indicate more favorable evaluations.
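As a rough sketch of this kind of scale construction in R (the data frame d and the column names item1 through item11 are hypothetical placeholders, not the actual TESS variable names):

# standardize the eleven rating items before summing
items <- d[, paste0("item", 1:11)]
z <- scale(items)

# retain cases with substantive responses on at least three items
keep <- rowSums(!is.na(z)) >= 3
raw <- rowSums(z, na.rm=TRUE)
raw[!keep] <- NA

# standardize the scale to mean 0 and standard deviation 1
evaluation <- as.numeric(scale(raw))

# Cronbach's alpha could then be computed with, e.g., the psych package:
# library(psych); alpha(items)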

Here is a chart of the main results, with experimental targets on the left side:

[Figure]

The figure indicates point estimates and 95% confidence intervals for the mean level of evaluations in experimental conditions for all respondents and disaggregated groups; data were not weighted because the dataset did not contain a post-stratification weight variable.

The bias in this study is against Brad relative to Kareem, Kristen, and Tamika.

 

TESS 392

TESS study 392, based on a proposal from Lisa Rashotte and Murray Webster, tested for bias based on sex and age. Respondents were randomly assigned to receive a picture and text description of one of four target persons: Diane Williams, a 21-year-old woman; David Williams, a 21-year-old man; Diane Williams, a 45-year-old woman; and David Williams, a 45-year-old man. Respondents were asked to rate the target person on nine traits, drawn from Webster and Driskell (1983): intelligence, ability in situations in general, ability in things that the respondent thinks counts, capability at most tasks, reading ability, abstract abilities, high school grade point average, how well the person probably did on the Federal Aviation Administration exam for a private pilot license, and physical attractiveness. For the tenth item, respondents were shown their ratings for the previous nine items and given an opportunity to change their ratings.

The physical attractiveness item was used as a control variable in the analysis. Substantive responses to the other eight items were used to create a rating scale, with items standardized before summing and cases retained if the case had substantive responses for at least five items; this scale had a Cronbach's alpha of 0.91. The scale was standardized so that its mean and standard deviation were respectively 0 and 1; higher values on the scale indicate more favorable evaluations.

Here is a chart of the main results, with experimental targets on the left side:

[Figure]

The figure indicates point estimates and 95% confidence intervals for the mean level of evaluations in experimental conditions for all respondents and disaggregated groups; data were weighted. The bias in this study, among women, is in favor of older persons and, among men, is in favor of the older woman. Here's a table of 95% confidence intervals for mean rating differences for each comparison:

[Table]

 

TESS 012

TESS study 012, based on a proposal from Emily Shafer, tested for bias for or against married women based on the women's choice of last name after marriage. The study's six conditions manipulated a married woman's last name and the commitment that caused the woman to increase the burden on others. Conditions 1 and 4, 2 and 5, and 3 and 6 respectively reflected the woman keeping her last name, hyphenating her last name, or adopting her husband's last name; the vignette for conditions 1, 2, and 3 indicated that the woman's co-workers were burdened because of the woman's marital commitment, and the vignette for conditions 4, 5, and 6 indicated that the woman's husband was burdened because of the woman's work commitment.

Substantive responses to items 1, 2, 5A, and 6A were used to create an "employee evaluation" scale, with items standardized before summing and cases retained if there were substantive responses for at least three items; this scale had a Cronbach's alpha of 0.73. Substantive responses to items 3, 4, 5B, and 6B were used to create a "wife evaluation" scale, with items standardized before summing and cases retained if there were substantive responses for at least three items; this scale had a Cronbach's alpha of 0.74. Both scales were standardized so that their mean and standard deviation were respectively 0 and 1 and then reversed so that higher scores indicated a more positive evaluation.

Results are presented for the entire sample, for men, for women, for persons who indicated that they were currently married or once married and used traditional last name patterns (traditional respondents), and for persons who indicated that they were currently married or once married but did not use traditional last name patterns (non-traditional respondents); name patterns were considered traditional for female respondents who changed their last name to their spouse's last name (with no last name change by the spouse), and male respondents whose spouse changed their last name (with no respondent last name change).

Here is a chart of the main results, with experimental conditions on the left side:

[Figure]

The figure displays point estimates and 95% confidence intervals for weighted mean ratings for each condition, adjusted for physical attractiveness. Not much bias detected here, except for men's wife evaluations when the target woman kept her last name.

 

TESS 714

TESS study 714, based on a proposal from Kimberly Rios Morrison, tested whether asking whites to report their race as white had a different effect on multiculturalism attitudes and prejudice than asking whites to report their ethnicity as European American. See here for published research on this topic.

Respondents were randomly assigned to one of three groups: respondents in the European American prime group were asked to identify their race/ethnicity as European American, American Indian or Alaska Native, Asian American or Pacific Islander, Black or African American, Hispanic/Latino, or Other; respondents in the White prime group were asked to identify their race/ethnicity from the same list but with European American replaced with White; and respondents in the control group were not asked to identify their race/ethnicity.

Respondents were shown 15 items regarding ethnic minorities, divided into four sections that we'll call support for multiculturalism, support for pro-ethnic policies, resentment of ethnic minorities, and closeness to whites. Scales were made for the items from the first three sections; for the "closeness to whites" scale, responses to the item on closeness to ethnic minorities were subtracted from responses to the item on closeness to nonminorities, and the resulting difference score was then standardized.
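As a minimal sketch in R (close_nonmin and close_min are hypothetical variable names, not the actual TESS variable names):

# closeness to whites: closeness to nonminorities minus closeness to ethnic minorities
close.whites <- d$close_nonmin - d$close_min

# standardize the difference score to mean 0 and standard deviation 1
close.whites <- as.numeric(scale(close.whites))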

Here is a chart of the main results, with experimental conditions on the left side:

[Figure]

The figure displays weighted point estimates and 95% confidence intervals. The prime did not have much influence, except for the bottom right graph.

---

There are a LOT of interesting things in the TESS archives. Comparing reported results to my own analyses of the data (not for the above studies, but for other studies) has illustrated the inferential variation that researcher degrees of freedom can foster.

One of the ways to assess claims of liberal bias in social science is to comb through data such as the TESS archives, which let us see what a sample of researchers are interested in and what a sample of researchers place into their file drawer. A null result placed into a file drawer is ambiguous, because we cannot be sure whether the result was filed away because it was null or because of its political valence; a statistically significant result placed into a file drawer is much less ambiguous.

---

UPDATE (Sept 6, 2014)

Gábor Simonovits, one of the co-authors of the Science article, quickly and kindly sent me a Stata file of their dataset; those data and personal communication with Stephen W. Benard indicate that results from none of the four studies reported in this post have been published.


I have posted a working manuscript on symbolic racism here, with its appendix here. Comments are welcome and appreciated. I'll outline the manuscript below and give some background to the research.

---

On 27 October 2012, a Facebook friend posted a link to an Associated Press report "AP poll: Majority harbor prejudice against blacks." I posted this comment about the report:

[Image of the comment]

During the Facebook discussion, I noted that it is not obvious that the implicit measurements indicate racism, given the data on implicit preferences among blacks:

[Image of the comment]

Bob Somersby at the Daily Howler noticed that the AP report provided data disaggregated by political party but failed to provide data disaggregated by race:

Although Ross and Agiesta were eager to tell you how many Democrats, Republicans and independents were shown to hold "anti-black feelings," they never tell you how many black respondents “hold anti-black feelings” as well!

Why didn't our intrepid reporters give us that information? We can't answer that question. But even a mildly skeptical observer could imagine one possible answer:

If substantial percentages of black respondents were allegedly shown to "hold anti-black feelings," that would make almost anyone wonder how valid the AP's measures may be. It would undermine confidence in the professors—in those men of vast erudition, the orange-shoed fellows who still seem to think that Obama trailed in the national polling all through the summer of 2008.

David Moore at iMediaEthics posted data disaggregated by race that he retrieved from the lead author of the study: based on the same method used in the original report, 30 percent of white Americans implicitly held anti-white sentiments, and 43 percent of black Americans implicitly held anti-black sentiments. Moore discussed how this previously-unreported information alters interpretation of the study's findings:

It appears that racism, as measured by this process, is much more complicated than the news story would suggest. We cannot talk about the 56% of Americans with "anti-black" attitudes as being "racist," if we do not also admit that close to half of all blacks are also "racist" – against their own race.

If we accept the measures of anti-black attitudes as a valid indicator of racism, then we also have to accept the anti-white measures as racism.

Moore did not tell us the results for black respondents on the explicit measures of racism, so that's the impetus behind Study 2 of the working manuscript.

---

The explicit racism measure discussed in the AP report is symbolic racism, also known as racial resentment. Instead of explaining what symbolic racism is, I'll show how symbolic racism is typically measured; items below are from the American National Election Studies, but there were more items in the study discussed in the AP report.

Symbolic racism is measured in the ANES based on whether a survey respondent agrees strongly, agrees somewhat, neither agrees nor disagrees, disagrees somewhat, or disagrees strongly with these four items:

1. Irish, Italians, Jewish and many other minorities overcame prejudice and worked their way up. Blacks should do the same without any special favors.

2. Generations of slavery and discrimination have created conditions that make it difficult for blacks to work their way out of the lower class.

3. Over the past few years, blacks have gotten less than they deserve.

4. It's really a matter of some people not trying hard enough; if blacks would only try harder they could be just as well off as whites.

I hope that you can see why these are not really measures of explicit racism. Let's say that non-racist person A opposes special favors for all groups: that person would select the symbolic racist option for item 1, indicating a belief that blacks should work their way up without special favors. Person A is coded the same as a person B who opposes special favors for blacks because of person B's racism. So that's problem #1 with symbolic racism measures: the measures conflate racial attitudes and non-racial beliefs.

But notice that there is another problem. Let's say that person C underestimates the influence of slavery and discrimination on outcomes for contemporary blacks; person C will select a symbolic racism option for item 2, but is that racism? is that racial animosity? is that a reflection that a non-black person -- and even some black persons -- might not appreciate the legacy of slavery and discrimination? or is that something else? That's problem #2 with symbolic racism measures: it's not obvious how to interpret these measures.

---

Researchers typically address problem #1 with control variables; the hope is that placing partisanship, self-reported ideology, and a few conservative values items into a regression sufficiently dilutes the non-racial component of symbolic racism so that the effect of symbolic racism can be interpreted as its racial component only.

In the first part of the working manuscript, I test this hope by predicting non-racial dependent variables, such as opposition to gay marriage. The idea of this test is that -- if statistical control really does sufficiently dilute the non-racial component of symbolic racism -- then symbolic racism should not correlate with opposition to gay marriage, because racism should not be expected to correlate with opposition to gay marriage; but, if symbolic racism does correlate with opposition to gay marriage, then statistical control did not sufficiently dilute the non-racial component of symbolic racism.
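In R, a bare-bones version of this test might look like the following sketch; the variable names are hypothetical placeholders, and the manuscript's analyses are more involved:

# small set of controls
m.small <- lm(oppose.gay.marriage ~ symbolic.racism + party.id + ideology, data=anes)

# more extensive set of controls, adding conservative-values items
m.large <- lm(oppose.gay.marriage ~ symbolic.racism + party.id + ideology + moral.traditionalism + egalitarianism + authoritarianism, data=anes)

# if statistical control sufficiently dilutes the non-racial component of symbolic racism,
# the symbolic.racism coefficient should be close to zero
summary(m.small)
summary(m.large)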

The results indicate that a small set of controls often does not sufficiently dilute the non-racial component of symbolic racism, so results from symbolic racism research with a small set of controls should be treated skeptically. But a more extensive set of controls often does sufficiently dilute the non-racial component of symbolic racism, so we can place more -- but not complete -- confidence in results from symbolic racism research with an extensive set of controls.

---

The way that I addressed problem #2 -- about how to interpret symbolic racism measures -- was to assess the effect of symbolic racism among black respondents. Results indicate that among blacks -- and even among a set of black respondents with quite positive views of their own racial group -- symbolic racism sometimes positively correlates with opposition to policies to help blacks.

Study 2 suggests that it is not legitimate for researchers to interpret symbolic racism among whites differently than symbolic racism among blacks, without some other information that can permit us to state that symbolic racism means something different for blacks and whites. Study 3 assesses whether there is evidence that symbolic racism means something different for blacks and whites.


This R lesson is for confidence intervals on point estimates. See here for other lessons.

---

Here are the first three lines of code:

pe <- c(2.48, 1.56, 2.96)
y.axis <- c(1:3)
plot(pe, y.axis, type="p", axes=T, pch=19, xlim=c(1,4), ylim=c(1,3))

The first line places 2.48, 1.56, and 2.96 into a vector called "pe" for point estimates; you can call the vector anything that you want, as long as the name is a valid name in R.

The second line sends the integers from 1 to 3 into the vector "y.axis"; instead of y.axis <- c(1:3), you could have written y.axis <- c(1,2,3) to do the same thing.

The third line plots a graph with pe on the x-axis and y.axis on the y-axis; type="p" tells R to plot points, axes=T tells R to draw axes, pch=19 indicates what type of points to draw, xlim=c(1,4) indicates that the x-axis extends from 1 to 4, and ylim=c(1,3) indicates that the y-axis extends from 1 to 3.

Here's the graph so far:

[Figure]

---

Let's make the points a bit larger by adding cex=1.2 to the end of the plot command.

Let's also add a title, using a new line of code: title(main="Negative Stereotype Disagreement > 3").
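Putting those two changes together, the code so far should read something like this:

pe <- c(2.48, 1.56, 2.96)
y.axis <- c(1:3)
plot(pe, y.axis, type="p", axes=T, pch=19, xlim=c(1,4), ylim=c(1,3), cex=1.2)
title(main="Negative Stereotype Disagreement > 3")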

[Figure]

---

Let's add the 95% confidence interval lines.

lower <- c(2.26, 1.17, 2.64)
upper <- c(2.70, 1.94, 3.28)
segments(lower, y.axis, upper, y.axis, lwd= 1.3)

The first line indicates the lower ends of the confidence intervals; the second line indicates the upper ends of the confidence intervals; and the segments command draws line segments from the coordinate (lower, y.axis) to the coordinate (upper, y.axis), with lwd=1.3 indicating that the line should be slightly thicker than the default.

Here's what we have so far:

[Figure]

---

Let's replace the x-axis and y-axis. First, change axes=T to axes=F in the plot command; then add the code axis(1, at=seq(1,4,by=1)) to tell R to draw an axis at the bottom from 1 to 4 with tick marks every 1 unit. Here's what we get:

[Figure]

Let's get rid of the "pe" and "y.axis" labels. Add to the plot command: xlab="", ylab="". Here's the graph now:

[Figure]

---

Let's work on the y-axis now:

names <- c("Baseline", "Black\nFamily", "Affirmative\nAction")
axis(2, at=y.axis, label=names)

The first line sends three phrases to the vector "names"; the \n in the phrases tells R to place "Family" and "Action" on a new line. Here's the result:

[Figure]

Let's make the y-axis labels perpendicular to the y-axis by adding las=2 to the axis(2, ...) line. [las=0 would keep the labels parallel.]

[Figure]

Now we need to add a little more space on the left of the graph so the y-axis labels are visible. Add par(mar=c(4, 6, 2, 0)) above the plot command to set the bottom, left, top, and right margins to 4, 6, 2, and 0.

[Figure]

---

Let's say that I decided that I prefer to have the baseline on top of the graph and Affirmative Action at the bottom of the graph. I could use the rev() function to reverse the order of the points in the plot, segments, and axis functions to get:

[Figure]

---

Here is the whole code for the above graph. By the way, the graph above can be found in my article on social desirability in the list experiment, "You Wouldn't Like Me When I'm Angry."
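The original code file is linked rather than reproduced here, but assembling the steps above -- with rev() reversing the order so that Baseline sits at the top -- should give roughly the following:

pe <- c(2.48, 1.56, 2.96)
lower <- c(2.26, 1.17, 2.64)
upper <- c(2.70, 1.94, 3.28)
y.axis <- c(1:3)
names <- c("Baseline", "Black\nFamily", "Affirmative\nAction")

par(mar=c(4, 6, 2, 0))
plot(rev(pe), y.axis, type="p", axes=F, pch=19, cex=1.2, xlim=c(1,4), ylim=c(1,3), xlab="", ylab="")
title(main="Negative Stereotype Disagreement > 3")
segments(rev(lower), y.axis, rev(upper), y.axis, lwd=1.3)
axis(1, at=seq(1,4,by=1))
axis(2, at=y.axis, label=rev(names), las=2)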


This R lesson is for the plot command. See here for other lessons.

---

The start of this code is a bit complex. It's from R Commander, which is a way to use R through a graphical interface without having to write code.

library(foreign)

The library function loads the foreign package, which provides functions to import data from SPSS, Stata, and some other software.

DWHouse <- read.dta("C:/house_polarization46_113v9.dta", convert.dates=TRUE, convert.factors=TRUE, missing.type=TRUE, convert.underscore=TRUE, warn.missing.labels=TRUE)

The above command reads data from Stata (.dta extension) and places the data into DWHouse. The house_polarization46_113v9.dta dataset is from Voteview polarization data, located here. [The v9 on the end of the dataset indicates that I saved the dataset as Stata version 9.]

---

Here's the plot command:

plot(repmean1~year, type="p", xlim=c(1900,2012), ylim=c(-1,1), xlab="Year", ylab="Liberal - Conservative", pch=19, col="red", main="House", data=DWHouse)

Here are what the arguments mean: the tilde in repmean1~year plots repmean1 as a function of year, type="p" indicates to plot points, xlim=c(1900,2012) indicates the limits for the x-axis, ylim=c(-1,1) indicates the limits for the y-axis, xlab="Year" and ylab="Liberal - Conservative" respectively indicate labels for the x-axis and y-axis, pch=19 indicates plotting character 19 [see here for a list of pchs], col="red" indicates the color for the points [see here for a list of colors], main="House" indicates the main title, and data=DWHouse indicates the data to plot.

Here's what the graph looks like so far:

[Figure]

---

The repmean1 plotted above is the Republican Party mean for the first-dimension DW-Nominate scores among members of the House of Representatives. Let's add the Democrats. Instead of adding a new plot command, we just add points:

points(demmean1~year, type="p", pch=19, col="blue", data=DWHouse)

Now let's add some labels:

text(1960,0.4,labels="GOP mean", col="red")
text(1960,-0.4,labels="Dem mean", col="blue")

The first command adds text at the coordinate x=1960, y=0.4; the text itself is "GOP mean," and the color of the text is red. I picked x=1960 and y=0.4 through trial and error to see where the text would look the nicest.

Here's the graph now:

[Figure]

---

Notice that the x-axis is labeled in increments of 20 years (1900, 1920, 1940, ...). This can be changed as follows. First, add axes=F to the plot command to shut off the axes (you could also write axes=FALSE); then add these axis lines below the plot command:

axis(1, at=seq(1900, 2020, 10))
axis(2, at=seq(-1, 1, 0.5))

The above lines tell R to draw axes at the indicated intervals. In the first line, the 1 tells R to draw the axis below the graph [1=below, 2=left, 3=above, and 4=right], and seq(1900, 2020, 10) tells R to run the axis from 1900 to 2020 and place tick marks every 10 years. Here's the resulting graph:

[Figure]

---

Notice that the x-axis and y-axis do not touch in the graph above. There are also a few extra points plotted that I did not intend to plot: I meant to start the graph at 1900 so that the first point was 1901 (DW-Nominate scores are provided in the dataset every two years starting with 1879). To get the x-axis and y-axis to touch, add xaxs="i", yaxs="i" to the plot command. Let's also add box() to get a box around the graph, like we had in the first two graphs above.

[Figure]

---

Here is the whole code for the plot above.
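The linked code file is not reproduced here, but assembling the steps above should give roughly the following:

library(foreign)

DWHouse <- read.dta("C:/house_polarization46_113v9.dta", convert.dates=TRUE, convert.factors=TRUE, missing.type=TRUE, convert.underscore=TRUE, warn.missing.labels=TRUE)

plot(repmean1~year, type="p", axes=F, xaxs="i", yaxs="i", xlim=c(1900,2012), ylim=c(-1,1), xlab="Year", ylab="Liberal - Conservative", pch=19, col="red", main="House", data=DWHouse)
points(demmean1~year, type="p", pch=19, col="blue", data=DWHouse)
text(1960, 0.4, labels="GOP mean", col="red")
text(1960, -0.4, labels="Dem mean", col="blue")
axis(1, at=seq(1900, 2020, 10))
axis(2, at=seq(-1, 1, 0.5))
box()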


The first graph in this series is a barplot. This post will show how to add error bars to a barplot.

Here's the data that we want to plot, from a t-test conducted in Stata:

[Stata t-test output]

---

Here's the first part of the code:

library(Hmisc)

The code above loads the Hmisc library, which has the errbar function that we will use.

means <- c(2.96, 3.59)

The code above places 2.96 and 3.59 into the vector "means".

bp = barplot(means, ylim=c(0,6), names.arg=c("Black", "White"), ylab="Support for life in prison without parole", xlab="Race of the convicted teen", width=c(0.2,0.2), xlim=c(0,1), space=c(1,1), las=1, main="Black Non-Hispanic Respondents")

The code above is similar to the barplot code that we used before, but notice that in this case the output of barplot is assigned to bp; barplot returns the midpoints of the bars, which we will use below to position the error bars. The remainder of the arguments are: means indicates what data to plot, ylim=c(0,6) indicates that the limits of the y-axis are 0 and 6, names.arg=c("Black", "White") indicates the names for the bars, ylab="Support for life in prison without parole" indicates the label for the y-axis, xlab="Race of the convicted teen" indicates the label for the x-axis, width=c(0.2,0.2) indicates the width of the bars, xlim=c(0,1) indicates that the limits of the x-axis are 0 and 1, space=c(1,1) indicates the spacing between bars, las=1 makes the y-axis labels horizontal, and main="Black Non-Hispanic Respondents" indicates the main title for the graph.

Here's the graph so far:

[Figure]

---

Here's how to add the error bars:

se <- c(0.2346, 0.2022)
lower = means - 1.96*se
upper = means + 1.96*se
errbar(bp, means, upper, lower, add=T)

The first line sends the values for the standard errors into the vector "se". The second and third lines are used to calculate the ends of the error bars. The fourth line tells R to plot error bars; the add=T option tells R to keep the existing graph; without add=T, the graph will show only the error bars.

Finally, add the code box(bty="L") so that there is a line on the bottom of the graph. The bty="L" tells R to draw the box in the shape of the letter L. Other options include C, O, 7, and U.

Here is the graph now:

[Figure]

---

It's not necessary to use the 1.96 multiplier for the error bars. The following code plugs in the lower and upper limits directly from the Stata output.

library(Hmisc)

means <- c(2.96, 3.59)

bp = barplot(means, ylim=c(0,6), names.arg = c("Black", "White"), ylab="Support for life in prison without parole", xlab="Race of the convicted teen", xpd=T, width=c(0.2,0.2), xlim=c(0,1), space=c(1,1), main="Black Non-Hispanic Respondents")

se <- c(0.2346, 0.2022)
lower = c(2.48, 3.19)
upper = c(3.42, 4.00)
errbar(bp, means, upper, lower, add=T)

box(bty="O")

---

Here's what the graph looks like for the above, shortened code, with the bty="O":

[Figure]

---

Data from this post were drawn from here, with the article here. Click here for the graph code.


If I remember correctly, my first introduction to R came when fellow Pitt graduate student Hirokazu Kikuchi requested that R be installed on the polisci lab computers. I looked into R and found this webpage based on a 2007 Perspectives on Politics article by Jonathan Kastellec and Eduardo Leoni. That link is a good place to start, but in this post I'll introduce a few lines of code to illustrate how nice and easy R can be. (Not that R is always easy.)

I'll indicate lines of R code in bold.

---

less5 <- c(40.91, 7.67, 7.11, 6.19, 15.65, 6.4, 4.57, 4.43, 2.42, 4.66)

The above command assigns the ten numbers (from 40.91 to 4.66) to a vector called "less5." c() is a concatenation function. The following command does the same thing:

 c(40.91, 7.67, 7.11, 6.19, 15.65, 6.4, 4.57, 4.43, 2.42, 4.66) -> less5

---

barplot (less5, main="Countries with a mean < 5", ylab="Percent", ylim=c(0, 40), names=c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))

The barplot function tells R to plot a bar chart. These are the arguments: less5 indicates the vector to plot, main="Countries with a mean < 5" indicates the main plot title, ylab="Percent" indicates the label for the y-axis, ylim=c(0, 40) indicates that the y-axis should run from 0 to 40, and names=c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10") indicates the set of names that should be placed below the set of bars.

Here's what this graph looks like, based on the two lines of code:

[Figure]

---

Let's plot three graphs together. Here's the code for graphs 2 and 3:

from56 <- c(18.35, 4.41, 5.68, 4.61, 22.63, 9.31, 7.63, 8.65, 4.99, 13.75)

barplot (from56, main="Countries with a mean > 5 and < 6", ylab="Percent", ylim=c(0, 40), names=c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))

more6 <- c(7.99, 2.26, 3.37, 3.62, 17.29, 9.46, 8.95, 12.83, 8.93, 25.3)

barplot (more6, main="Countries with a mean > 6", ylab="Percent", ylim=c(0, 40), names=c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))

Let's put the par function at the top of the code to tell R how to plot these three graphs:

par(mfrow=c(1, 3))

The above line of code tells R to plot 1 row and 3 columns of plots. Here's the output:

[Figure]

This is the output for par(mfrow=c(3, 1)):

[Figure]

---

That's it for this post: here is a text file of the code. By the way, the graph above can be found in my article on midpoint misperceptions.


I have been trying to reproduce several studies and have noticed that the reporting of results often presents a much stronger impression than I get from investigating the data myself. I plan to report some of these reproduction attempts, so I have been reading literature on researcher degrees of freedom and the file drawer problem. Below I'll post and comment on some interesting passages that I have happened upon.

---

To put it another way: without modern statistics, we find it unlikely that people would take seriously a claim about the general population of women, based on two survey questions asked to 100 volunteers on the internet and 24 college students. But with the p-value, a result can be declared significant and deemed worth publishing in a leading journal in psychology. (Gelman and Loken, 2013, 14-15, emphasis in the original)

I wonder how many people in the general population take seriously general claims based on only small mTurk and college student samples, provided that these people are informed that these general claims are based only on small unrepresentative samples; I suspect that some of the "taking seriously" that leads to publication in leading psychology journals reflects professional courtesy among peer researchers whose work is also largely based on small unrepresentative samples.

---

Maybe it's because I haven't done much work with small unrepresentative samples, but I feel cheated when investing time in an article framed in general language that has conclusions based on small unrepresentative samples. Here's an article that I recently happened upon: "White Americans' opposition to affirmative action: Group interest and the harm to beneficiaries objection." The abstract:

We focused on a powerful objection to affirmative action – that affirmative action harms its intended beneficiaries by undermining their self-esteem. We tested whether White Americans would raise the harm to beneficiaries objection particularly when it is in their group interest. When led to believe that affirmative action harmed Whites, participants endorsed the harm to beneficiaries objection more than when led to believe that affirmative action did not harm Whites. Endorsement of a merit-based objection to affirmative action did not differ as a function of the policy’s impact on Whites. White Americans used a concern for the intended beneficiaries of affirmative action in a way that seems to further the interest of their own group.

So who were these white Americans?

Sixty White American students (37% female, mean age = 19.6) at the University of Kansas participated in exchange for partial course credit. One participant did not complete the dependent measure, leaving 59 participants in the final sample. (p. 898)

I won't argue that this sort of research should not be done, but I'd like to see this sort of exploratory research replicated with a more representative sample. One of the four co-authors listed her institutional affiliation at California State University San Bernardino, and two other co-authors listed their institutional affiliation at Tulane University, so I would have liked to have seen a second study among a different sample of students. At the very least, I'd like to see a description of the restricted nature of the sample in the abstract to let me and other readers make a more informed judgment about the value of investing time in the article.

---

The Gelman and Loken (2013) passage cited above reminded me of a recent controversy regarding a replication attempt of Schnall et al. (2008). I read about the controversy in a Nicole Janz post at Political Science Replication. The result of the replication (a perceived failure to replicate) was not shocking because Schnall et al. (2008) had reported only two experiments based on data from 40 and 43 University of Plymouth undergraduates.

---

Schnall in a post on the replication attempt:

My graduate students are worried about publishing their work out of fear that data detectives might come after them and try to find something wrong in their work. Doing research now involves anticipating a potential ethics or even criminal investigation.

I like the term "data detectives" a bit better than "replication police" (h/t Nicole Janz), so I think that I might adopt the label "data detective" for myself.

I can sympathize with the graduate students' fear, because someone might target my work and try to find an error in it, but that's a necessary occupational hazard for a scientist.

The best way to protect research from data detectives is to produce research that is reproducible and perceived as replicable; one of the worst ways is to publish low-powered studies in a high-profile journal, because the high profile draws attention and the low power increases suspicion that the finding was due to the non-reporting of failed experiments.

---

From McBee and Matthews (2014):

Researchers who try to serve the interests of science are going to find themselves out-competed by those who elect to “play the game,” because the ethical researcher will conduct a number of studies that will prove unpublishable because they lack statistically significant findings, whereas the careerist will find ways to achieve significance far more frequently. (p. 77)

This reflects part of the benefit produced by data detectives and the replication police: a more even playing field for researchers reluctant to take advantage of researcher degrees of freedom.

---

This Francis (2012) article is an example of a data detective targeting an article to detect non-reporting of experiments. Balcetis and Dunning (2010) reported five experiments rejecting the null hypothesis; the experiments had Ns, effect sizes, and powers as listed below in a table drawn from Francis (2012) p. 176.

[Table from Francis (2012), p. 176]

Francis summed the powers to get 3.11, which indicates the number of times that we should expect the null hypothesis to be rejected given the observed effect sizes and powers of the 5 experiments; Francis multiplied the powers to get 0.076, which indicates the probability that the null hypothesis will be rejected in all 5 experiments.
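In R, the two quantities are just a sum and a product of the per-experiment power values (the values below are placeholders, not the actual figures from the Francis table):

# power values for the five experiments (placeholders, not the Francis (2012) figures)
powers <- c(0.5, 0.6, 0.6, 0.7, 0.7)

sum(powers)   # expected number of rejections of the null hypothesis across the experiments
prod(powers)  # probability that all five experiments reject the null hypothesis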

---

Here is Francis again detecting more improbable results. And again. Here's a back-and-forth between Simonsohn and Francis on Francis' publication bias studies.

---

Here's the Galak and Meyvis (2012) reply to another Francis study, one that claimed to have detected non-reporting of experiments in Galak and Meyvis (2011). Galak and Meyvis admit to the non-reporting:

We reported eight successful demonstrations of this phenomenon in our paper, but we also conducted five additional studies whose results either did not reach conventional levels of significance or did reach significance but ended up being rhetorically redundant. (p. 595)

...but argue that it's not a problem because they weren't interested in effect sizes:

However, as is the case for many papers in experimental psychology, the goal was never to assess the exact size of the effect, but rather to test between competing theoretical predictions. (p. 595)

Even if it is true that the authors were unconcerned with effect size, I do not understand how that justifies not reporting results that fail to reach conventional levels of statistical significance.

So what about readers who *are* interested in effect sizes? Galak and Meyvis write:

If a researcher is interested in estimating the size of an effect reported in a published paper, we recommend asking the authors for their file drawer and conducting a meta-analysis. (p. 595-596)

That's an interesting solution: if you are reading an article and wonder about the effect size, put down the article, email the researchers, hope that the researchers respond, hope that the researchers send the data, and then -- if you receive the data -- conduct your own meta-analysis.
