I agree with Andrew: your skepticism that you have a substantial effect rises with the p value unambiguously (holding the effect size constant).

From the abstract: “..clear reductions were evident in the intervention arm for concussion incidence (RR=0.71, 0.48 to 1.05)”. See: http://bjsm.bmj.com/content/early/2017/05/08/bjsports-2016-097434. Exploring this further, they claim to be using “magnitude based inference”, which, as far as I can see, is only used within sports medicine and seems to be a more permissive form of NHST. There is a commentary on the method with some responses here: https://www.ncbi.nlm.nih.gov/pubmed/25051387

Statistical Modeling, Causal Inference, and Social Science, “P-hacking” and the intention-to-cheat effect, https://twitter.com/jamesheathers/status/859284639600570368, http://fooledbyrandomness.com/pvalues.pdf, http://journals.sagepub.com/doi/abs/10.1111/j.1467-9280.1994.tb00281.x, https://www.ncbi.nlm.nih.gov/pubmed/12933636, http://www.stat.columbia.edu/~gelman/research/unpublished/notrump_falk_gelman_icml.pdf

Now if you can just work in "She sells sea shells on the sea shore…." into your next statistical paper!

Holy moly, that was twelve and a half years ago.

We've spent a lot of time during the past few years discussing the difficulty of interpreting "p less than .05" results from noisy studies.

A Type M error is an error of magnitude. I make a Type M error by claiming with confidence that theta is small in magnitude when it is in fact large, or by claiming with confidence that theta is large in magnitude when it is in fact small.
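To make the Type M idea concrete, here is a minimal simulation sketch. It is not taken from the thread: the true effect (0.1), the standard error (0.15), the normal sampling model, and the function name `simulate_type_m_s` are all illustrative assumptions. It repeatedly draws an estimate from a noisy study, keeps only the draws that clear the conventional significance threshold, and asks how much those surviving estimates overstate the true effect and how often they have the wrong sign.

```python
import random
import statistics


def simulate_type_m_s(true_effect, se, n_sims=100_000, z_crit=1.96, seed=0):
    """Simulate replications of a noisy study and summarize the significant ones.

    Assumes estimate ~ Normal(true_effect, se), which is an illustrative choice.
    Returns (power, exaggeration_ratio, sign_error_rate); the last two are
    computed only over estimates with |estimate| > z_crit * se.
    """
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, se)
        if abs(estimate) > z_crit * se:
            significant.append(estimate)

    power = len(significant) / n_sims
    # Average magnitude of significant estimates relative to the true effect.
    exaggeration = statistics.fmean([abs(e) for e in significant]) / abs(true_effect)
    # Share of significant estimates whose sign disagrees with the true effect.
    sign_error_rate = sum(1 for e in significant if e * true_effect < 0) / len(significant)
    return power, exaggeration, sign_error_rate


# Hypothetical numbers: a small true effect measured with a large standard error.
power, exaggeration, sign_error_rate = simulate_type_m_s(true_effect=0.1, se=0.15)
print(f"power={power:.3f}  exaggeration={exaggeration:.2f}  sign errors={sign_error_rate:.3f}")
```

Under these assumed numbers every significant estimate must exceed 1.96 × 0.15 ≈ 0.29 in magnitude, so the estimates that pass the filter overstate a true effect of 0.1 by construction. The function assumes at least one simulated estimate is significant; with very low power and few simulations that would need a guard.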
One could argue that the "true effect" of the supplements is the effect when the patients do indeed take their vitamins/calcium. And as soon as they start the subgroup analyses, the power takes a big hit.

"In many situations there's no real 'precise' null we can all agree on for comparison purposes." Don't we know it precisely for the measured effect size under the null?

This leaves simulation studies without programming errors….

Your second point about randomized studies seems (to me) to be about things that get wrapped up in the error term.

So 30% with p=.04, I'd be skeptical but p=.06 maybe I should not be as skeptical.

I don't think the experiment should be thrown away, but I see where the journal is coming from, in emphasizing that there's no clear evidence from these data alone.

My null hypothesis of interest is that this lake has the same distribution as previously seen in the other 100 lakes….

The problem, as I see it, is not that the journal made any mistakes in conveying the evidence; rather, the problem is with the attitude that a single noisy study should be considered as dispositive.
To put it another way: Had the original paper reported an effect size estimate of 30% with p=.04, I'd be skeptical: I'd say that I'd guess the 30% was an overestimate and that we should be aware that treatment effects can vary.

The p value is the probability that something more extreme than the observed data would come out of a particular random number generator.

You're absolutely correct that a lot of thought should go into which hypothesis is interesting to look at.

The point of my paper with Loken is not about intent-to-treat or anything like that; rather, it's a general issue that when noise is added to a study (for example, from noncompliance), this increases standard errors and thus increases the sense in which a statistically significant estimate (or, in this case, a nearly statistically significant estimate) gives an overestimate of the magnitude of the effect size.

You've described two very different hypotheses to test.

With p=.06, I'm just very slightly more skeptical (or maybe less skeptical; see discussion here).

Our findings indicate that the effect of vitamin and calcium supplements is probably no larger than 25% (or however much the equivalence test gives you).

Jeff points us to a recent example, presented in this letter from Elizabeth Hatch, Lauren Wise, and Kenneth Rothman.

I'm not so sure.

Acknowledging that there are really a bazillion possible meaningless hypotheses you could choose to compare your data to is another way of putting the "Garden Of Forking Paths" concept.

Cliff, the thing is that p values by themselves are unassailable mathematical facts: "If you generated random numbers using my chosen RNG program 'NullHypothesis(i)' you would rarely see a dataset stranger than the data D[i] as measured by test statistic t(Data), (p = 0.0122)".
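The "p value as a statement about a chosen random number generator" framing can be sketched directly as a Monte Carlo calculation. Everything named here is a hypothetical illustration rather than anything from the thread: the data values are made up, and `monte_carlo_p_value`, `standard_normal_null`, `wide_normal_null`, and `abs_mean` are invented names. The sketch draws fake datasets from a null RNG and counts how often the test statistic is at least as extreme as the observed one.

```python
import random
import statistics


def monte_carlo_p_value(data, null_sampler, statistic, n_sims=10_000, seed=0):
    """Estimate how often the null RNG produces a statistic >= the observed one."""
    rng = random.Random(seed)
    observed = statistic(data)
    at_least_as_extreme = 0
    for _ in range(n_sims):
        fake = null_sampler(rng, len(data))
        if statistic(fake) >= observed:
            at_least_as_extreme += 1
    # The +1 terms count the observed dataset itself, so the estimate is never 0.
    return (at_least_as_extreme + 1) / (n_sims + 1)


def standard_normal_null(rng, n):
    """One candidate null: n independent draws from Normal(0, 1)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def wide_normal_null(rng, n):
    """A different candidate null: n independent draws from Normal(0, 2)."""
    return [rng.gauss(0.0, 2.0) for _ in range(n)]


def abs_mean(values):
    """Test statistic: absolute value of the sample mean."""
    return abs(statistics.fmean(values))


# Made-up measurements, for illustration only.
data = [0.9, 1.4, -0.2, 0.7, 1.1, 0.3, 1.8, -0.5, 0.6, 1.2]
p_narrow = monte_carlo_p_value(data, standard_normal_null, abs_mean)
p_wide = monte_carlo_p_value(data, wide_normal_null, abs_mean)
print(f"p under Normal(0, 1) null: {p_narrow:.4f}")
print(f"p under Normal(0, 2) null: {p_wide:.4f}")
```

The same data give quite different p values under the two nulls. That is the point made in the thread: the number is a fact about the chosen generator and test statistic, not by itself a scientific conclusion.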
that 30% is not zero; on the other hand, I wouldn't expect to see an effect as large as 30% in a replication.

Or to put it a fourth way: we don't actually know the p value, but only an estimate of it.

I was thinking of the binary case in which the effect size implies the standard error.

Again, I'm having trouble with your point.

If you collect trillions of data points and find no correlation that would be interesting.

It's the slope of the regression when x and y have been standardized.

Looking forward, it seems to me that the next step is to explicitly include more information in the decision process.

Or are there more specific issues you are referring to?

What Andrew is saying (correct me if I'm wrong) is that "statistically significant" results anywhere near the boundary suffer from the Type M and Type S errors of the significance filter (I agree) but that "statistically insignificant" results just on the other side of the border suffer from the exact same problems, only slightly more so.

The problem is the choice of null hypothesis and designing studies focused on testing that choice.

I hope you're right, but am not convinced that you are.

You and others write about skepticism with respect to such p-values, but would it be better to switch more completely to an estimation framework?

Never mind… I take it back since you're measuring the standard error.
Yep, the p-value is just the current arbitrary meaningless pedantic calculation, you can substitute any other such calculation and get the same results (as long as it will yield "success" infrequently enough to seem like an achievement). You shouldn't calculate them.

Another person can come along and say "I've measured cadmium in 100 lakes in this state and I've found that cadmium content of a sample has a power law distribution with lots of near zero measurements, but long tails as most of the cadmium comes from small regions of each lake.

I attended many talks about oncology-themed trials during my forty-year career. Medication studies generally report results on two sub-groups of recruited subjects, Intent-to-Treat and Per-Protocol.

Just about everybody seems to think that a p-value of .01 is much different from a p-value of .10, but this difference is all in the noise too, as Hal Stern and I discussed in our paper.

There's not really a plausible reason why this intervention would increase the risk of cancer, as far as I know; we just don't have compelling evidence that it decreases it.

But p-values come from the population distribution (or perhaps hypothetical population data), and you only have an estimate of that.

I wonder how many impressive results are down to Type S and Type M error.

Just a small gloss on that: a former colleague argued that the reason 0.05 is used as a filter is that a result that can't be p-hacked to get below 0.05 shows that the effect can't possibly be there!

And that is the problem that Hatch and others are rightly highlighting, whether they mean to or not.

NHST starts with the opposite principle, that it would be somehow surprising if any two things were correlated at all… no it isn't.

Many roads become smooth in asymptopia but not those having bumps and curves from systematic errors and mis-specification.
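The remark above that the gap between p = .01 and p = .10 is "all in the noise" can be checked with a few lines of arithmetic. This is a sketch under stated assumptions, not a reproduction of the paper's own calculation: it assumes two independent estimates, each with standard error 1 and the same sign, and the helper names are invented here. It converts each two-sided p value back to a z-score and then tests the difference between the two estimates.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def z_from_two_sided_p(p):
    """z-score whose two-sided tail probability under Normal(0, 1) equals p."""
    return std_normal.inv_cdf(1 - p / 2)


def two_sided_p_from_z(z):
    """Two-sided tail probability of a z-score under Normal(0, 1)."""
    return 2 * (1 - std_normal.cdf(abs(z)))


# Assumption: both estimates are measured in units of a common standard error of 1.
z_a = z_from_two_sided_p(0.01)
z_b = z_from_two_sided_p(0.10)

# Independent estimates: the difference has standard error sqrt(1^2 + 1^2).
z_diff = (z_a - z_b) / math.sqrt(2)
p_diff = two_sided_p_from_z(z_diff)

print(f"z for p=.01: {z_a:.2f}, z for p=.10: {z_b:.2f}")
print(f"z for the difference: {z_diff:.2f}, two-sided p: {p_diff:.2f}")
```

Under these assumptions the difference between the two estimates is well under one standard error of that difference, so the comparison between a "clearly significant" and a "non-significant" result is not itself anywhere near significant.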
I make a Type S error by claiming with confidence that theta is positive when it is, in fact, negative, or by claiming with confidence that theta is negative when it is, in fact, positive.

The p < .05 framework would ask me to report these as mixed support for the hypothesis.

Kenneth Rothman has long advocated confidence intervals and disparaged hypothesis testing in epidemiological research.

Similarly skeptical of their statistical reasons for wanting a near-significant result to be viewed as better, but I think the point in this specific case is that there's value in considering the costs of treatment and in thinking more carefully about the actual problem domain.

So they want to say that without the Per-Protocol results, we can't say much about the impact of the drug.

For example, null = normal(0, sd): which sd should we choose?

"The p < .05 framework would ask me to report these as mixed support for the hypothesis." Have you actually had feedback from reviewers saying something like "please state you had 2 successful replications and 2 failures to replicate" in the above scenario?

Or maybe I should ask: skeptical of what?

I looked at the linked article, Keith, which I like a lot, but I'm not sure I get your point.

In many situations there's no real "precise" null we can all agree on for comparison purposes.

are so bothered by the journal's characterization of the results.

The confusion between “this is a mathematical fact” and “this is a scientific fact” is at the heart of everything that is wrong with current practice.

