This post reports on publication bias analyses for the Tara L. Mitchell et al. 2005 meta-analysis: "Racial Bias in Mock Juror Decision-Making: A Meta-Analytic Review of Defendant Treatment" [gated, ungated]. The appendices for the article contained a list of sample sizes and effect sizes, but the list did not match the reported results in at least one case. Dr. Mitchell emailed me a file of the correct data (here).

VERDICTS

Here is the funnel plot for the Mitchell et al. 2005 meta-analysis of verdicts:

[Figure: mitchell-et-al-2005-verdicts-funnel-plot]

Egger's test did not indicate at the conventional level of statistical significance the presence of funnel plot asymmetry in any of the four funnel plots, with p-values of p=0.80 (white participants, published studies), p=0.82 (white participants, all studies), p=0.10 (black participants, published studies), and p=0.63 (black participants, all studies).
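For readers unfamiliar with Egger's test, here is a minimal sketch of the classic regression formulation: each study's effect size divided by its standard error is regressed on the study's precision (1 / standard error), and an intercept far from zero suggests funnel plot asymmetry. The function name and the example numbers are hypothetical, and this is not the R code linked in the notes, which may use a different variant of the test.

```python
import math

def egger_regression(effects, std_errors):
    """Classic Egger regression: OLS of (effect / SE) on (1 / SE).

    Returns the intercept, its standard error, and the t statistic.
    The intercept is the asymmetry term; compare t to a t distribution
    with n - 2 degrees of freedom to get a p-value.
    """
    if len(effects) != len(std_errors) or len(effects) < 3:
        raise ValueError("need at least 3 studies with matching lengths")
    y = [e / s for e, s in zip(effects, std_errors)]   # standardized effects
    x = [1.0 / s for s in std_errors]                  # precisions
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(r * r for r in residuals) / (n - 2)       # residual variance
    se_intercept = math.sqrt(s2 * (1.0 / n + mean_x ** 2 / sxx))
    t_stat = intercept / se_intercept if se_intercept > 0 else float("nan")
    return intercept, se_intercept, t_stat

# Hypothetical effect sizes and standard errors, for illustration only.
example_effects = [0.10, 0.35, 0.22, 0.60, 0.48, 0.15]
example_ses = [0.08, 0.20, 0.12, 0.35, 0.30, 0.10]
b0, se_b0, t0 = egger_regression(example_effects, example_ses)
print(f"intercept={b0:.3f}, se={se_b0:.3f}, t={t0:.3f}")
```

An intercept near zero corresponds to a roughly symmetric funnel plot; the p-values reported above come from the linked R code, not from this sketch.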

Trim-and-fill with the L0 estimator imputed missing studies for all four funnel plots to the side of the funnel plot indicating same-race favoritism:

[Figure: mitchell-et-al-2005-verdicts-tf-l0]

Trim-and-fill with the R0 estimator imputed missing studies for only the funnel plots for published studies with black participants:

[Figure: mitchell-et-al-2005-verdicts-tf-r0]

---

SENTENCES

Here is the funnel plot for the Mitchell et al. 2005 meta-analysis of sentences:

[Figure: mitchell-et-al-2005-sentences-funnel-plot]

Egger's test did not indicate at the conventional level of statistical significance the presence of funnel plot asymmetry in any of the four funnel plots, with p-values of p=0.14 (white participants, published studies), p=0.41 (white participants, all studies), p=0.50 (black participants, published studies), and p=0.53 (black participants, all studies).

Trim-and-fill with the L0 estimator imputed missing studies for the funnel plots with white participants to the side of the funnel plot indicating same-race favoritism:

[Figure: mitchell-et-al-2005-sentences-tf-l0]

Trim-and-fill with the R0 estimator did not impute any missing studies:

[Figure: mitchell-et-al-2005-sentences-tf-r0]

---

I also attempted to retrieve and plot data for the Ojmarrh Mitchell 2005 meta-analysis ("A Meta-Analysis of Race and Sentencing Research: Explaining the Inconsistencies"), but the data were reportedly lost in a computer crash.

---

NOTES:

1. Data and code for the Mitchell et al. 2005 analyses are here: data file for verdicts, data file for sentences, R code for verdicts, and R code for sentences.


Researchers often have the flexibility to report only the results they want to report, so an important role for peer reviewers is to request that researchers report results that a reasonable skeptical reader might suspect have been strategically unreported. I'll discuss two publications where obvious peer review requests do not appear to have been made and, presuming these requests were not made, how requests might have helped readers better assess evidence in the publication.

---

Example 1. Ahlquist et al. 2014 "Alien Abduction and Voter Impersonation in the 2012 U.S. General Election: Evidence from a Survey List Experiment"

Ahlquist et al. 2014 reports on two list experiments: one list experiment is from December 2012 and has 1,000 cases, and another list experiment is from September 2013 and has 3,000 cases.

Figure 1 of Ahlquist et al. 2014 reports results for the 1,000-person list experiment estimating the prevalence of voter impersonation in the 2012 U.S. general election; the 95% confidence intervals for the full sample and for each reported subgroup cross zero. Figure 2 reports results for the full sample of the 3,000-person list experiment estimating the prevalence of voter impersonation in the 2012 U.S. general election, but Figure 2 did not include subgroup results. Readers are thus left to wonder why subgroup results were not reported for the larger sample that had more power to detect an effect among subgroups.

Moreover, the main voting irregularity list experiment reported in Ahlquist et al. 2014 concerned voter impersonation, but, in footnote 15, Ahlquist et al. discuss another voting irregularity list experiment that was part of the study, about whether political candidates or activists offered the participant money or a gift for their vote:

The other list experiment focused on vote buying and closely mimicked that described in Gonzalez-Ocantos et al. (2012). Although we did not anticipate discovering much vote buying in the USA we included this question as a check, since a similar question successfully discovered voting irregularities in Nicaragua. As expected we found no evidence of vote buying in the USA. We omit details here for space considerations, though results are available from the authors and in the online replication materials...

The footnote does not make clear whether the inference of "no evidence of vote buying in the USA" is restricted to an analysis of the full sample or also covers analyses of subgroups.

So the article leaves at least two questions unanswered for a skeptical reader:

  1. Why report subgroup analyses for only the smaller sample?
  2. Why not report the overall estimate and subgroup analyses for the vote buying list experiment?

Sure, for question 2, Ahlquist et al. indicate that the details of the vote buying list experiment were omitted for "space considerations"; however, the 16-page Ahlquist et al. 2014 article is shorter than the other two articles in the journal issue, which are 17 pages and 24 pages.

Peer reviewers could have helped readers by requesting a detailed report on the vote buying list experiment and a report of subgroup analyses for the 3,000-person sample.

---

Example 2. Sen 2014 "How Judicial Qualification Ratings May Disadvantage Minority and Female Candidates"

Sen 2014 reports logit regression results in Table 3 for four models predicting the ABA rating given to U.S. District Court nominees from 1962 to 2002, with ratings dichotomized into (1) well qualified or exceptionally well qualified and (2) not qualified or qualified.

Model 1 includes a set of variables such as the nominee's sex, race, partisanship, and professional experience (e.g., law clerk, state judge). Compared to model 1, model 2 omits the partisanship variable and adds year dummies. Compared to model 2, model 3 adds district dummies and interaction terms for female*African American and female*Hispanic. And compared to model 3, model 4 removes the year dummies and adds a variable for years of practice and a variable for the nominee's estimated ideology.

The first question raised by the table is the omission of the partisanship variable for models 2, 3, and 4, with no indication of the reason for that omission. The partisanship variable is not statistically significant in model 1, and Sen 2014 notes that the partisanship variable "is never statistically significant under any model specification" (p. 44), but it is not clear why the partisanship variable is dropped in the other models because other variables appear in all four models and never reach statistical significance.

The second question raised by the table is why years of practice appears in only the fourth model, in which roughly one-third of cases are lost due to the inclusion of estimated nominee ideology. Sen 2014 Table 2 indicates that male and white nominees had substantially more years of practice than female and black nominees: men (16.87 years), women (11.02 years), whites (16.76 years), and blacks (10.08 years); therefore, any model assessing whether ABA ratings are biased should account for sex and race differences in years of practice, under the reasonable expectation that nominees should receive higher ratings for more experience.

Peer reviewers could have helped readers by requesting a discussion of the absence of the partisanship variable from models 2, 3, and 4, and by requesting that years of practice be included in more of the models.

---

Does it matter?

Data for Ahlquist et al. 2014 are posted here. I reported on my analysis of the data in a manuscript rejected after peer review by the journal that published Ahlquist et al. 2014.

My analysis indicated that the weighted list experiment estimate of vote buying for the 3,000-person sample was 5 percent (p=0.387), with a 95% confidence interval of [-7%, 18%]. I'll echo my earlier criticism and note that a 25-percentage-point-wide confidence interval is not informative about the prevalence of voting irregularities in the United States because all plausible estimates of U.S. voting irregularities fall within 12.5 percentage points of zero.
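As a reminder of what a list experiment estimate is, here is a minimal sketch of the standard difference-in-means estimator: the control group reports how many items apply from a baseline list, the treatment group reports the count from the same list plus the sensitive item, and the difference in (weighted) mean counts estimates the prevalence of the sensitive item. The function names and the toy data are hypothetical; the estimates reported in this post come from the code linked in the notes, not from this sketch.

```python
def weighted_mean(values, weights):
    """Weighted mean of values; weights need not sum to 1."""
    if len(values) != len(weights) or not values:
        raise ValueError("values and weights must be non-empty and equal length")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def list_experiment_estimate(control_counts, treatment_counts,
                             control_weights=None, treatment_weights=None):
    """Estimated prevalence of the sensitive item.

    control_counts:   item counts from respondents shown the baseline list
    treatment_counts: item counts from respondents shown baseline + sensitive item
    Weights default to 1 for every respondent (unweighted estimate).
    """
    if control_weights is None:
        control_weights = [1.0] * len(control_counts)
    if treatment_weights is None:
        treatment_weights = [1.0] * len(treatment_counts)
    return (weighted_mean(treatment_counts, treatment_weights)
            - weighted_mean(control_counts, control_weights))

# Toy data for illustration only (not from Ahlquist et al. 2014).
toy_control = [1, 2, 2, 3]
toy_treatment = [2, 2, 3, 3]
print(list_experiment_estimate(toy_control, toy_treatment))
```

Because the estimate is a difference between two noisy means, its confidence interval tends to be wide unless the sample is very large, which is consistent with the 25-percentage-point-wide interval above.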

Ahlquist et al. 2014 footnote 14 suggests that imputed data on participant voter registration were available, so a peer reviewer could have requested reporting of the vote buying list experiments restricted to registered voters, given that only registered voters have a vote to trade. I did not see a variable for registration in the dataset for the 1,000-person sample, but the list experiment for the 3,000-person sample produced the weighted point estimate that 12 percent of persons listed as registered to vote were contacted by political candidates or activists around the 2012 U.S. general election with an offer to exchange money or gifts for a vote (p=0.018).

I don't believe that this estimate is close to correct, and, given sufficient subgroup analyses, some subgroup analyses would be expected to produce implausible or impossible results, but peer reviewers requesting these data might have produced a more tentative interpretation of the list experiments.

---

For Sen 2014, my analysis indicated that the estimate and standard error for the partisanship variable (coded 1 for nomination by a Republican president) inflate to unusually large values when that variable is included in models 2, 3, and 4: the coefficient and standard error for the partisanship variable are 0.02 and 0.11 in model 1, but inflate to 15.87 and 535.41 in model 2, 17.90 and 1,455.40 in model 3, and 18.21 and 2,399.54 in model 4.

The Sen 2014 dataset had variables named Bench.Years, Trial.Years, and Private.Practice.Years. The years of experience for these variables overlap (e.g., nominee James Gilstrap was born in 1957 and respectively has 13, 30, and 30 years for these variables); therefore, the variables cannot be summed to construct a variable for total years of legal experience that does not include double- or triple-counting for some cases. Bench.Years correlates with Trial.Years at -0.47 and with Private.Practice.Years at -0.39, but Trial.Years and Private.Practice.Years correlate at 0.93, so I'll include only Bench.Years and Trial.Years, given that Trial.Years appears more relevant for judicial ratings than Private.Practice.Years.

My analysis indicated that women and blacks had a higher Bench.Years average than men and whites: men (4.05 years), women (5.02 years), whites (4.02 years), and blacks (5.88 years). Restricting the analysis to nominees with nonmissing nonzero Bench.Years, men had slightly more experience than women (9.19 years to 8.36 years) and blacks had slightly more experience than whites (9.33 years to 9.13 years).

Adding Bench.Years and Trial.Years to the four Table 3 models did not produce any meaningful difference in results for the African American, Hispanic, and Female variables, but the p-value for the Hispanic main effect fell to 0.065 in model 4 with Bench.Years added.

---

I estimated a simplified model with the following variables predicting the dichotomous ABA rating variable for each nominee with available data: African American nominee, Hispanic nominee, female nominee, Republican nominee, nominee age, law clerk experience, law school tier (from 1 to 6), Bench0 and Trial0 (no bench or trial experience respectively), Bench.Years, and Trial.Years. These variables reflect demographics, nominee quality, and nominee experience, with a presumed penalty for nominees who lack bench and/or trial experience. Results are below:

[Figure: aba1]

The female coefficient was not statistically significant in the above model (p=0.789), but the coefficient was much closer to statistical significance when adding a control for the year of the nomination:

[Figure: aba2]

District.Court.Nomination.Year was positively related to the dichotomous ABA rating variable (r=0.16) and to the female variable (r=0.29), and the ABA rating increased faster over time for women than for men (but not at a statistically significant level: p=0.167), so I estimated a model that interacted District.Court.Nomination.Year with Female and with the race/ethnicity variables:

[Figure: aba3]

The model above provides some evidence for an over-time reduction of the sex gap (p=0.095) and the black/white gap (p=0.099).

The next model is the second model reported above, but with estimated nominee ideology added, coded with higher values indicating higher levels of conservatism:

[Figure: aba4]

So there is at least one reasonable model specification that produces evidence of bias against conservative nominees, at least to the extent that the models provide evidence of bias. After all, ABA ratings are based on three criteria (integrity, professional competence, and judicial temperament), but the models include information for only professional competence, so a sex, race, and ideological gap in the models could indicate bias and/or could indicate a sex, race, and ideological gap in nonbiased ABA evaluations of integrity and/or judicial temperament and/or elements of professional competence that are not reflected in the model measures. Sen addressed the possibility of gaps in these other criteria, starting on page 47 of the article.

For what it's worth, evidence of the bias against conservatives is stronger when excluding the partisanship control:

[Figure: aba5]

---

The above models for the Sen reanalysis should be interpreted to reflect the fact that there are many reasonable models that could be reported. My assessment from the models that I estimated is that the black/white gap is extremely if not completely robust, the Hispanic/white gap is less robust but still very robust, the female/male gap is less robust but still somewhat robust, and the ideology gap is the least robust of the group.

I'd have liked for the peer reviewers on Sen 2014 to have requested results for the peer reviewers' preferred model, with requested models based only on available data and results reported in at least an online supplement. This would provide reasonable robustness checks for an analysis for which there are many reasonable model specifications. Maybe that happened: the appendix table in the working paper version of Sen 2014 is somewhat different than the published logit regression table. In any event, indicating which models were suggested by peer reviewers might help reduce skepticism about the robustness of reported models, to the extent that models suggested by a peer reviewer have not been volunteered by the researchers.

---

NOTES FOR AHLQUIST ET AL. 2014:

1. Subgroup analyses might have been reported for only the smaller 1,000-person sample because the smaller sample was collected first. However, that does not mean that the earlier sample should be the only sample for which subgroup analyses are reported.

2. Non-disaggregated results for the 3,000-person vote buying list experiment and disaggregated results for the 1,000-person vote buying list experiment were reported in a prior version of Ahlquist et al. 2014, which Dr. Ahlquist sent me. However, a reader of Ahlquist et al. 2014 might not be aware of these results, so Ahlquist et al. 2014 might have been improved by including these results.

---

NOTES FOR SEN 2014:

1. Ideally, models would include a control for twelve years of experience, given that the ABA Standing Committee on the Federal Judiciary "...believes that a prospective nominee to the federal bench ordinarily should have at least twelve years' experience in the practice of law" (p. 3, here). Sen 2014 reports results for a matching analysis that reflects the 12 years threshold, at least for the Trial.Years variable, but I'm less confident in matching results, given the loss of cases (e.g., from 304 women in Table 1 to 65 women in Table 4) and the loss of information (e.g., cases appear to be matched so that nominees with anywhere from 0 to 12 years on Trial.Years are matched on Trial.Years).

2. I contacted the ABA and sent at least one email to the ABA liaison for the ABA committee that handles ratings for federal judicial nominations, asking whether data could be made available for nominee integrity and judicial temperament, such as a dichotomous indication whether an interviewee had raised concerns about the nominee's integrity or judicial temperament. The ABA Standing Committee on the Federal Judiciary prepares a written statement (e.g., here) that describes such concerns for nominees rated as not qualified, if the ABA committee is asked to testify at a Senate Judiciary Committee hearing for the nominee (see p. 8 here). I have not yet received a reply to my inquiries.

---

GENERAL NOTES

1. Data for Ahlquist et al. 2014 are here. Code for my additional analyses is here.

2. Dr. Sen sent me data and R code, but the Sen 2014 data and code do not appear to be online now. Maya Sen's Dataverse is available here. R code for the supplemental Sen models described above is here.


The above tweet links to this article discussing a study of hiring outcomes for 598 job finalists in finalist groups of 3 to 11 members.

The finalist groups in the sample ranged from 3 to 11 members, but the data in the figure are restricted to an unreported number of groups with exactly 4 members. The likelihoods in the figure of 0%, 50%, and 67% did not suggest large samples, so I emailed the faculty authors at Stefanie.Johnson [at] colorado.edu (on April 26) and david.hekman [at] colorado.edu (on May 2) asking for the data or for information on the sample sizes for the figure likelihoods. I also asked whether a woman was hired from a pool of any size in which only one finalist was a woman. I later tweeted a question to the faculty author whom I found on Twitter.

I have not yet received a reply from either of these faculty authors.

I acknowledge researchers who provide data, code, and/or information upon request, so I thought it would be a good idea to note the researchers who don't.


I happened across the Saucier et al. 2005 meta-analysis "Differences in Helping Whites and Blacks: A Meta-Analysis" (ungated), and I decided to plot the effect size against the standard error in a funnel plot to assess the possibility of publication bias. The funnel plot is below.

[Figure: Saucier et al. 2005 funnel plot]

Funnel plot asymmetry was not detected in Begg's test (p=0.486) but was detected in the higher-powered Egger's test (p=0.009).

---

NOTES:

1. Saucier et al. 2005 reported sample sizes but not effect size standard errors for each study, so I estimated the standard errors with formula 7.30 of Hunter and Schmidt (2004: 286).

2. Code here.
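To illustrate how a standard error can be approximated from only a study's total sample size and its d value (as in note 1), here is a minimal sketch. It assumes a commonly cited Hunter-Schmidt large-sample approximation for studies with roughly equal group sizes, Var(d) = [(N - 1)/(N - 3)] * (4/N) * (1 + d^2/8). This specific form is an assumption on my part and may not match formula 7.30 exactly; the function name and example values are hypothetical.

```python
import math

def approx_se_of_d(d, total_n):
    """Approximate standard error of a standardized mean difference d.

    ASSUMPTION: uses the large-sample approximation
        Var(d) = ((N - 1) / (N - 3)) * (4 / N) * (1 + d**2 / 8)
    for a two-group study with roughly equal group sizes and total
    sample size N. This may differ from the formula used in the post.
    """
    if total_n <= 3:
        raise ValueError("total_n must be greater than 3")
    variance = ((total_n - 1) / (total_n - 3)) * (4.0 / total_n) * (1.0 + d * d / 8.0)
    return math.sqrt(variance)

# Hypothetical studies: (d, total sample size), for illustration only.
for d_value, n_total in [(0.10, 60), (0.30, 120), (0.05, 400)]:
    print(d_value, n_total, round(approx_se_of_d(d_value, n_total), 3))
```

Under this approximation, the standard error is driven almost entirely by sample size, so the vertical axis of the funnel plot mostly reflects study size.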


I previously discussed Filindra and Kaplan 2016 in terms of the current state of political science research transparency, but this post will discuss the article more substantively.

Let's start with a re-quote regarding the purpose and research design of the Filindra and Kaplan 2016 experiment:

To determine whether racial prejudice depresses white support for gun control, we designed a priming experiment which exposed respondents to pictures of blacks and whites drawn from the IAT. Results show that exposure to the prime suppressed support for gun control compared to the control, conditional upon a respondent's level of racial resentment (p. 255).

Under the guise of a cognitive test, we exposed 600 survey participants who self-identified as white to three pictures of the faces of black individuals and another three of white individuals (p. 261).

For predicting the two gun-related outcome variable scales for the experiment, Table 1 indicates in separate models that the treatment alone, the treatment and a measure of symbolic racism alone, and the interaction of the treatment and symbolic racism all reach statistical significance at p<0.10 or better with a two-tailed test.

But the outcome variable scales are built from a subset of measured gun-related items. Filindra and Kaplan 2016 reported an exploratory factor analysis used to select items for outcome variable scales: 7 of 13 policy items about guns and 8 of 9 belief items about guns were selected for inclusion in the scales. The dataset for the article uploaded to the Dataverse did not contain data for the omitted policy and belief items, so I requested these data from Dr. Filindra. I did not receive access to these data.

It's reasonable to use factor analysis to decide which items to include in a scale, but this permits researcher flexibility about whether to perform the factor analysis in the first place and, if so, about whether to place all items in a single factor analysis or to, as in Filindra and Kaplan 2016, separate the items into groups and conduct a factor analysis for each group.

---

But the main problem with the experiment is not the flexibility in building the outcome variable scales. The main problem is that the research design does not permit an inference of racial prejudice.

The Filindra and Kaplan 2016 experimental design of a control and a single treatment of the black/white photo combination permits at most only the inference of a "causal relationship between racial considerations and gun policy preferences among whites" (p. 263, emphasis added). However, Filindra and Kaplan 2016 also discussed the experiment as if the treatment had been only photos of blacks (p. 263):

Our priming experiment shows that mere short exposure to pictures of blacks can drive opposition to gun control.

The Filindra and Kaplan experimental design does not permit assigning the measured effect to the photos of blacks isolated from the photos of whites, so I'm not sure why peer reviewers would have permitted that claim, which appeared in exactly the same form on page 9 of Filindra and Kaplan's 2015 MPSA paper.

---

Filindra and Kaplan 2016 supplement the experiment with a correlational study using symbolic racism to predict the ANES gun control item. But, as other researchers and I have noted, there is an inferential problem using symbolic racism in correlational studies, because symbolic racism conflates racial prejudice and nonracial attitudes; for example, knowing only that a person believes that blacks should not receive special favors cannot tell us whether that person's belief is motivated by antiblack bias, nonracial opposition to special favors, or some combination of the two.

My article here provides a sense of how strong a residual post-statistical-control correlation between symbolic racism and an outcome variable must be before one can confidently claim that the correlation is tapping antiblack bias. To illustrate this, I used linear regression on the 2012 ANES Time Series Study data, weighted and limited to white respondents, to predict responses to the gun control item, which was coded on a standardized scale so that the lowest value is the response that the federal government should make it more difficult to buy a gun, the middle response is that the rules should be kept the same, and the highest value is that the federal government should make it easier to buy a gun.

The standardized symbolic racism scale produced a 0.068 (p=0.012) residual correlation with the standardized gun control item, with the model including the full set of statistical controls as described in the note below. That was about the same residual correlation as for predicting a standardized scale measuring conservative attitudes toward women (0.108, p<0.001), about the same residual correlation as for predicting a standardized abortion scale (-0.087, p<0.001), and about the same residual correlation as for predicting a standardized item about whether people should be permitted to place Social Security payroll taxes into personal accounts (0.070, p=0.007).

So, based on these data alone, racial prejudice as measured with symbolic racism has about as much "effect" on attitudes about gun control as it does on attitudes about women, abortion, and private accounts for Social Security. I think it's unlikely that bias against blacks causes conservative attitudes toward women, so I don't think that the 2012 ANES data can resolve whether or the extent to which bias against blacks causes opposition to gun control.

I would bet that there is some connection between antiblack prejudice and gun control, but I wouldn't argue that Filindra and Kaplan 2016 provide convincing evidence of this. Of course, it looks like a version of the Filindra and Kaplan 2016 paper won a national award, so what do I know?

---

NOTES:

1. Code for my analysis reported above is here.

2. The full set of statistical controls includes: respondent sex, marital status, age group, education level, household income, employment status, Republican Party membership, Democratic Party membership, self-reported political ideology, and items measuring attitudes about whether jobs should be guaranteed, limited government, moral traditionalism, authoritarianism, and egalitarianism.

3. Filindra and Kaplan 2016 Table 2 reports a larger effect size for symbolic racism in the 2004 and 2008 ANES data than in the 2012 ANES data, with respective values for the maximum change in probability of support of -0.23, -0.25, and -0.16. The mean of the 2004 and 2008 estimates is 50% larger than the 2012 estimate, so increasing the 2012 residual correlation of 0.068 by 50% produces 0.102, which is still about the same residual correlation as for conservative attitudes about women. Based on Table 6 of my article, I would not be comfortable alleging an effect for racial bias with anything under a 0.15 residual correlation with a full set of statistical controls.


Journals requiring the posting of data and code for published articles is a major improvement in the conduct of social science because it increases the ability of researchers to assess the correctness and robustness of reported results and because it presumably produces more careful analyses by researchers aware that their data and code will be made public.

But the DA-RT agreement to "[r]equire authors to ensure that cited data are available at the time of publication through a trusted digital repository" does not address selective reporting. For example, the current replication policy for the journal Political Behavior requires only that "[a]uthors of accepted manuscripts will be required to deposit all of the data and script files needed to replicate the published results in a trusted data repository such as ICPSR or Dataverse" (emphasis added).

This permits researchers to selectively report experiments, experimental conditions, and potential outcome variables, and to then delete the corresponding data from the dataset that is made public. Readers thus often cannot be sure whether the reported research has been selectively reported.

---

Consider uncertainty about the survey experiment reported in Filindra and Kaplan 2016, described in the article's abstract as follows (p. 255):

To determine whether racial prejudice depresses white support for gun control, we designed a priming experiment which exposed respondents to pictures of blacks and whites drawn from the IAT. Results show that exposure to the prime suppressed support for gun control compared to the control, conditional upon a respondent's level of racial resentment.

But here is a description of the experimental treatment (p. 261):

Under the guise of a cognitive test, we exposed 600 survey participants who self-identified as white to three pictures of the faces of black individuals and another three of white individuals.

I wasn't sure why a survey experiment intended "[t]o determine whether racial prejudice depresses white support for gun control" would have as its only treatment a prime that consisted of photos of both blacks and whites. It seems more logical for a "racial prejudice" experiment to have one condition in which participants were shown photos of blacks and another condition in which participants were shown photos of whites; then responses to gun control items that followed the photo primes could be compared for the black photo and white photo conditions.

Readers of Filindra and Kaplan 2016 might suspect that there were unreported experimental conditions in which participants were shown photos of blacks or were shown photos of whites. But readers cannot know from the article whether there were unreported conditions.

---

I didn't know of an easier way to eliminate the uncertainty about whether there were unreported conditions in Filindra and Kaplan 2016 other than asking the researchers, so I sent the corresponding author an email asking about the presence of unreported experimental conditions involving items about guns and photos of blacks and/or whites. Dr. Filindra indicated that there were no unreported conditions involving photos of blacks and/or whites, but there were unreported conditions for non-photo conditions that are planned for forthcoming work.

---

My correspondence with Dr. Filindra made me more confident in their reported results, but such correspondence is a suboptimal way to increase confidence in reported results: it took time from Drs. Filindra and Kaplan and from me, and the information from our correspondence is, as far as I am aware, available only to persons reading this blog post.

There are multiple ways for journals and researchers to remove uncertainty about selective reporting and thus increase research transparency, such as journals requiring the posting of all collected data, journals requiring researchers to make disclosures about the lack of selective reporting, and researchers preregistering plans to collect and analyze data.


Pursuant to a request from Nathaniel Bechhofer, in this post I discuss the research reported in "The Effect of Gender Norms in Sitcoms on Support for Access to Abortion and Contraception", by Nathaniel Swigger. See here for a post about the study and here for the publication.

---

Disclosure: For what it's worth, I met Nathaniel Swigger when I was on the job market.

---

1. I agree with Nathaniel Bechhofer that the Limitations section of Swigger 2016 is good.

2. The article does a good job with disclosures, at least implied disclosures:

I don't think that there are omitted outcome variables because the bottom paragraph of page 9 and Table 1 report on multiple outcome variables that do not reach statistical significance (the first Results paragraph reports the lack of statistical significance for the items about federal insurance paying for abortion and spending on women's shelters). After reading the blog post, I thought it was odd to devote seven items to abortion and one item to contraception insurance, but in a prior publication Swigger used seven items for abortion, one item for contraception insurance, and items for government insurance for abortion.

I don't think that there are omitted conditions. The logic of the experiment does not suggest a missing condition (like here). Moreover, the article notes that results are "not quite in the way anticipated by the hypotheses" (p. 11), so I'm generally not skeptical about underreporting for this experiment, especially given the disclosure of items for which a difference was not detected.

3. I'm less certain that this was the only experiment ever conducted testing these hypotheses, but I'm basing this on underreporting in social science generally and not on any evidence regarding this experiment. I'd like for political science journals to adopt the requirement for—or for researchers to offer—disclosure regarding the completeness of the reporting of experimental conditions, potential outcome and explanatory variables, and stopping rules for data collection.

4. The estimated effect size for the abortion index is very large. Based on Table 1, the standard deviation for the abortion index was 4.82 (from a simple mean of the conditions because I did not see an indication of the number of cases per condition). For the full sample, the difference between the How I Met Your Mother and Parks and Recreation conditions was 5.57 for the abortion index, which corresponds to an estimate of d of 1.16, which, based on this source, falls between the effect size for men being heavier than women (d=1.04) and liberals liking Michelle Obama more than conservatives do (d=1.26). For another comparison, the How I Met Your Mother versus Parks and Recreation difference caused a 5.57 difference on the abortion index, which is greater than the 4.47 difference between Catholics and persons who are not Christian or Muslim.

The experiment had 87 participants after exclusions, across three conditions. A power calculation indicated that 29 participants per condition would permit detection of a relatively large d=0.74 effect size 80 percent of the time. Another way to think of the observed d=1.16 effect size is that, if the experiment were conducted over and over again with 29 participants per condition, 99 times of 100 the experiment would be expected to detect a difference on the abortion index between the How I Met Your Mother and Parks and Recreation conditions.
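The effect size and power figures above can be roughly checked with a short sketch. It uses a normal approximation to the power of a two-sided, two-group comparison at alpha = 0.05, which is an assumption on my part and is slightly optimistic relative to an exact t-test calculation; the function name is hypothetical.

```python
import math
from statistics import NormalDist

def approx_power_two_groups(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample comparison.

    Normal approximation: the test statistic is treated as normal with
    mean d * sqrt(n / 2) and unit variance, where n is the per-group size.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ncp = d * math.sqrt(n_per_group / 2)
    # Probability of landing in either rejection region.
    return NormalDist().cdf(ncp - z_crit) + NormalDist().cdf(-ncp - z_crit)

# Figures taken from the text: mean difference 5.57, standard deviation 4.82.
observed_d = 5.57 / 4.82
print(round(observed_d, 2))

# 29 participants per condition, as in the text.
print(round(approx_power_two_groups(0.74, 29), 2))
print(round(approx_power_two_groups(observed_d, 29), 2))
```

Under this approximation, power is roughly 80 percent for d=0.74 and roughly 99 percent for the observed d, in line with the figures in the text.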

Table 3 output for the dichotomous contraception insurance item is in logit coefficients, but Table 1 indicates the effect sizes more intuitively, with means for the How I Met Your Mother and Parks and Recreation conditions of 0.19 and 0.50, which is about a difference of a factor of 2.6. The control condition mean is 0.69, which corresponds to a factor of 3.6 difference compared to the How I Met Your Mother condition.

---

In conclusion, I don't see anything out of the ordinary in the reported analyses, but the effect sizes are larger than I would expect. Theoretically, the article notes on page 7 that the How I Met Your Mother and Parks and Recreation stimuli differ in many ways, so it's impossible to isolate the reason for any detected effect; it's therefore probably best to describe the results in more general terms about the effect of sitcoms, as Sean McElwee did.
