I discussed here some weird things that SPSS does with regard to weighting. Here's another weird thing, this time in Stata:

[Stata output: summarize results for Q1 and the truncated variable Q1trunc]

The variable Q1 has a minimum of 0 and a maximum of 99,999. For this particular survey question, 99,999 is not a believable response; so, instead of letting 99,999 and other unbelievable responses influence the results, I truncated Q1 at 100, so that all responses above 100 were set to 100. There are other ways of handling unbelievable responses, but truncation can work as a first pass to assess whether the unbelievable responses influenced results.

The command replace Q1trunc = 100 if Q1 > 100 tells Stata to replace all responses over 100 with a response of 100; but notice that this replacement increased the number of nonmissing observations from 2,008 to 2,065. That's because Stata treated the 57 missing values of Q1 as positive infinity and replaced these 57 missing values with 100.

Here's a line from Stata's help missing documentation:

all nonmissing numbers < . < .a < .b < ... < .z

Stata has a reason for treating missing values as positive infinity, as explained here. But -- unless users are told of this -- it is not obvious that Stata treats missing values as positive infinity, so this is a potential source of error in any code that uses a > sign on a variable with missing values.
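If you want to see this ordering directly, here is a minimal check; each comparison should evaluate to 1, meaning true, given the ordering quoted above:

display 99999 < .    // 1: the system missing value is larger than any nonmissing number
display . > 100      // 1: so a missing value satisfies a condition like Q1 > 100
display .a > .       // 1: extended missing values sort above the system missing value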

Here's how to recode the command so that missing values remain missing: replace Q1trunc = 100 if Q1 > 100 & Q1 < .
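If you want to reproduce this behavior without my survey data, here is a minimal sketch using Stata's bundled auto dataset, in which rep78 has a few missing values; the variable names rep78_unsafe and rep78_safe are my own:

sysuse auto, clear

* unsafe recode: missing rep78 compares as greater than 3, so the missing values are overwritten
generate rep78_unsafe = rep78
replace rep78_unsafe = 3 if rep78 > 3

* safe recode: restrict the replacement to nonmissing values of rep78
* an equivalent condition is: if rep78 > 3 & !missing(rep78)
generate rep78_safe = rep78
replace rep78_safe = 3 if rep78 > 3 & rep78 < .

* the unsafe version replaced the missing values with 3; the safe version kept them missing
count if missing(rep78) & rep78_unsafe == 3
count if missing(rep78) & missing(rep78_safe)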


This post presents selected excerpts from Jesper W. Schneider’s 2014 Scientometrics article, "Null hypothesis significance tests. A mix-up of two different theories: the basis for widespread confusion and numerous misinterpretations" [ungated version here]. For the following excerpts, most citations have been removed, and page number references have not been included because my copy of the article lacked page numbers.

The first excerpt notes that the common procedure followed in most social science research is a mishmash of two separate procedures:

What is generally misunderstood is that what today is known, taught and practiced as NHST [null hypothesis significance testing] is actually an anonymous hybrid or mix-up of two divergent classical statistical theories, R. A. Fisher’s 'significance test' and Neyman's and Pearson's 'hypothesis test'. Even though NHST is presented somewhat differently in statistical textbooks, most of them do present p values, null hypotheses (H0), alternative hypotheses (HA), Type I (α) and II (β) error rates as well as statistical power, as if these concepts belong to one coherent theory of statistical inference, but this is not the case. Only null hypotheses and p values are present in Fisher's model. In Neyman–Pearson's model, p values are absent, but contrary to Fisher, two hypotheses are present, as well as Type I and II error rates and statistical power.

The next two excerpts contrast the two procedures:

In Fisher's view, the p value is an epistemic measure of evidence from a single experiment and not a long-run error probability, and he also stressed that 'significance' depends strongly on the context of the experiment and whether prior knowledge about the phenomenon under study is available. To Fisher, a 'significant' result provides evidence against H0, whereas a non-significant result simply suspends judgment—nothing can be said about H0.

They [Neyman and Pearson] specifically rejected Fisher’s quasi-Bayesian interpretation of the 'evidential' p value, stressing that if we want to use only objective probability, we cannot infer from a single experiment anything about the truth of a hypothesis.

The next excerpt reports evidence that p-values overstate the evidence against the null hypothesis. I have retained the reference citations here:

Using both likelihood and Bayesian methods, more recent research have demonstrated that p values overstate the evidence against H0, especially in the interval between significance levels 0.01 and 0.05, and therefore can be highly misleading measures of evidence (e.g., Berger and Sellke 1987; Berger and Berry 1988; Goodman 1999a; Sellke et al. 2001; Hubbard and Lindsay 2008; Wetzels et al. 2011). What these studies show is that p values and true evidential measures only converge at very low p values. Goodman (1999a, p. 1008) suggests that only p values less than 0.001 represent strong to very strong evidence against H0.
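As a rough illustration of how far apart p values and evidential measures can be (this gloss and the code are mine, not Schneider's): Sellke et al. (2001) derive -e*p*ln(p), for p < 1/e, as a lower bound on the Bayes factor in favor of H0, which can be computed directly in Stata:

* Sellke et al. (2001) calibration: for p < 1/e, the Bayes factor in favor of H0
* is bounded below by -e*p*ln(p); at even prior odds, P(H0 | data) is at least B/(1+B)
scalar pval = 0.05
scalar bf_low = -exp(1)*pval*ln(pval)
display "lower bound on the Bayes factor in favor of H0: " bf_low
display "minimum P(H0 | data) at even prior odds: " bf_low/(1 + bf_low)

For p = 0.05, the bound works out to roughly 0.41, so even with prior odds of 1:1 the null hypothesis retains a posterior probability of roughly 0.29, consistent with the point that p values near 0.05 are weaker evidence than they appear.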

This next excerpt emphasizes the difference between p and alpha:

Hubbard (2004) has referred to p < α as an 'alphabet soup', that blurs the distinctions between evidence (p) and error (α), but the distinction is crucial as it reveals the basic differences underlying Fisher’s ideas on 'significance testing' and 'inductive inference', and Neyman–Pearson views on 'hypothesis testing' and 'inductive behavior'.

The next excerpt contains a caution against use of p-values in observational research:

In reality therefore, inferences from observational studies are very often based on single non-replicable results which at the same time no doubt also contain other biases besides potential sampling bias. In this respect, frequentist analyses of observational data seems to depend on unlikely assumptions that too often turn out to be so wrong as to deliver unreliable inferences, and hairsplitting interpretations of p values becomes even more problematic.

The next excerpt cautions against incorrect interpretation of p-values:

Many regard p values as a statement about the probability of a null hypothesis being true or conversely, 1 − p as the probability of the alternative hypothesis being true. But a p value cannot be a statement about the probability of the truth or falsity of any hypothesis because the calculation of p is based on the assumption that the null hypothesis is true in the population.
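To restate that point in notation (the notation is mine, not Schneider's): the p value is Pr(data at least as extreme as observed | H0 is true), which is a different conditional probability from Pr(H0 is true | observed data); moving from the first quantity to the second requires Bayes' theorem and a prior probability for H0.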

The final excerpt is a hopeful note that the importance attached to p-values will wane:

Once researchers recognize that most of their research questions are really ones of parameter estimation, the appeal of NHST will wane. It is argued that researchers will find it much more important to report estimates of effect sizes with CIs [confidence intervals] and to discuss in greater detail the sampling process and perhaps even other possible biases such as measurement errors.

The Schneider article is worthwhile for background and information on p-values. I'd also recommend this article on p-value misconceptions.


Jeremy Freese recently linked to a Jason Mitchell essay that discussed perceived problems with replications. Mitchell discussed many facets of replication, but I will restrict this post to Mitchell's claim that "[r]ecent hand-wringing over failed replications in social psychology is largely pointless, because unsuccessful experiments have no meaningful scientific value."

Mitchell's claim appears to be based on a perceived asymmetry between positive and negative findings: "When an experiment succeeds, we can celebrate that the phenomenon survived these all-too-frequent shortcomings. But when an experiment fails, we can only wallow in uncertainty about whether a phenomenon simply does not exist or, rather, whether we were just a bit too human that time around."

Mitchell is correct that a null finding can be caused by experimental error, but Mitchell appears to overlook the fact that positive findings can also be caused by experimental error.

---

Mitchell also appears to confront only the possible "ex post" value of replications, but there is a possible "ex ante" value to replications.

Ward Farnsworth discussed ex post and ex ante thinking using the example of a person who accidentally builds a house that extends onto a neighbor's property: ex post thinking concerns how to best resolve the situation at hand, but ex ante thinking concerns how to make this problem less likely to occur in the future; tearing down the house is a wasteful decision through the perspective of ex post thinking, but it is a good decision from the ex ante perspective because it incentivizes more careful construction in the future.

In a similar way, the threat of replication incentivizes more careful social science. Rational replicators should gravitate toward research for which the evidence appears to be relatively fragile: all else equal, the value of a replication is higher for replicating a study based on 83 undergraduates at one particular college than for replicating a study based on a nationally-representative sample of 1,000 persons; all else equal, a replicator should pass on replicating a stereotype threat study in which the dependent variable is percent correct in favor of replicating a study in which the stereotype effect was detected only using the more unusual measure of percent accuracy, measured as the percent correct of the problems that the respondent attempted.

Mitchell is correct that there is a real possibility that a researcher's positive finding will not be replicated because of error on the part of the replicator, but, as a silver lining, this negative possibility incentivizes researchers concerned about failed replications to produce higher-quality research that reduces the chance that a replicator targets their research in the first place.
