MAGIC in reporting statistical results

Notes on Robert Abelson’s Statistics as Principled Argument[1]

MAGIC is an acronym for a set of principles for reporting and evaluating statistical results: Magnitude, Articulation, Generalizability, Interest, and Credibility.

Magnitude

Magnitude refers to the size of the effect and its practical importance. Reporting effect sizes is something we should routinely do, in addition to (or maybe even instead of?) reporting statistical significance. There are a variety of effect sizes available, and they can be converted from one to another (which helps a lot with meta-analysis).
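As a minimal sketch of what such conversions look like (using the standard formulas for two equal-sized groups; the numbers are purely illustrative):

```python
import math

def d_to_r(d: float) -> float:
    """Convert Cohen's d to a point-biserial r (assumes two equal-sized groups)."""
    return d / math.sqrt(d ** 2 + 4)

def r_to_d(r: float) -> float:
    """Convert a point-biserial r back to Cohen's d (same equal-n assumption)."""
    return 2 * r / math.sqrt(1 - r ** 2)

# A "medium" effect of d = 0.5 corresponds to r of roughly .24
print(round(d_to_r(0.5), 3))    # 0.243
print(round(r_to_d(0.243), 2))  # 0.5
```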

Cohen's rubric for small, medium, and large effect sizes is a start, but there are also alternative effect sizes worth considering. Evidence-based medicine developed the Number Needed to Treat (NNT), Number Needed to Harm (NNH), and similar effect sizes for categorical outcomes. Rosenthal's Binomial Effect Size Display (BESD) and Cohen's non-overlap measures are other ways of re-expressing effect sizes.
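A few of these re-expressions are simple enough to compute directly. The sketch below (with made-up rates, not data from any particular study) computes an NNT from two event rates, Rosenthal's BESD from a correlation, and Cohen's U3 non-overlap from a d:

```python
from scipy.stats import norm

def nnt(rate_control: float, rate_treatment: float) -> float:
    """Number Needed to Treat: 1 / absolute risk reduction."""
    return 1.0 / (rate_control - rate_treatment)

def besd(r: float) -> tuple:
    """Rosenthal's Binomial Effect Size Display: 'success' rates of .5 +/- r/2."""
    return 0.5 - r / 2, 0.5 + r / 2

def cohens_u3(d: float) -> float:
    """Cohen's U3 non-overlap: proportion of the treated group above the control mean."""
    return norm.cdf(d)

print(round(nnt(0.020, 0.010), 1))   # 100.0 -> treat 100 people to prevent one event
low, high = besd(0.30)
print(round(low, 2), round(high, 2)) # 0.35 0.65 -> 35% vs 65% "success"
print(round(cohens_u3(0.5), 2))      # 0.69
```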

The other key point is practical significance. If the outcome is life or death, then even small effects have clinical relevance. The protective effect of low-dose aspirin against heart attack is a frequently cited example: expressed as a point-biserial correlation, r is only about .03 (r² on the order of .001), yet it is statistically significant, and extrapolated across millions of people it translates into hundreds or thousands of lives saved.
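To make the "tiny correlation, large practical impact" point concrete, here is a back-of-the-envelope sketch with hypothetical event rates (not the actual trial figures):

```python
import math

# Hypothetical event rates (illustrative only)
rate_placebo = 0.017    # 1.7% have the event without treatment
rate_treated = 0.009    # 0.9% have the event with treatment

# Phi / point-biserial correlation for a 2x2 table with equal group sizes
p_bar = (rate_placebo + rate_treated) / 2
phi = (rate_placebo - rate_treated) / (2 * math.sqrt(p_bar * (1 - p_bar)))
print(round(phi, 3))  # about 0.035 -- "trivial" by the usual small/medium/large rubric

# Extrapolated to a large population, the absolute difference is far from trivial
population = 1_000_000
events_averted = (rate_placebo - rate_treated) * population
print(round(events_averted))  # about 8000 events averted per million people treated
```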

[Image: Employee showing what not to do in slicing the salami thinly]

Articulation

How detailed are the analyses? Is there sufficient supporting detail to contextualize the findings? Did the investigator look for possible interactions (with sex, or within other subgroups)? A good degree of articulation adds depth and nuance without undermining the narrative. The opposite would be "slicing the salami" thinly and writing a series of papers that each present a narrow subset of analyses or subgroups, the so-called "least publishable unit."
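For instance, checking for a treatment-by-sex interaction can be as simple as adding a product term to the model. The sketch below uses simulated data and the statsmodels formula interface; the variable names (treatment, female, outcome) are invented for the example:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 400

# Simulated data: a treatment effect that is larger for one subgroup
df = pd.DataFrame({
    "treatment": rng.integers(0, 2, n),
    "female": rng.integers(0, 2, n),
})
df["outcome"] = (
    0.3 * df["treatment"]
    + 0.2 * df["female"]
    + 0.4 * df["treatment"] * df["female"]   # the interaction of interest
    + rng.normal(0, 1, n)
)

# 'treatment * female' expands to both main effects plus the interaction term
model = smf.ols("outcome ~ treatment * female", data=df).fit()
print(model.summary().tables[1])  # coefficients, including treatment:female
```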

Generalizability

Are the results likely to replicate? Do they apply to a general population? Or are they likely to be limited to a particular sample? A common critique of a lot of psychology research (especially social psychology and personality work) is that it has been the study of undergraduate psychology majors. Ouch.

When researchers go to the effort of recruiting a sample that is more representative of the population they want to make inferences about, that enhances generalizability (also sometimes referred to as "external validity"). If one wants to talk about how personality relates to mood disorders, which sample would be more representative: college undergraduates taking a psychology course, or people coming to a clinic seeking counseling and therapy?

Generalizability also improves when researchers do "replications and extensions" in samples that differ in some way from those used in previous work. For example, if most projects have been done in the United States, then conducting a similar study (the "replication") in a different country (the "extension") would be a way of testing generalizability.

Interest

Is the topic intrinsically interesting? Is there a practical application? Do the authors do a good job of engaging the reader and conveying the value of the work? The ABT (And, But, Therefore) narrative structure[2] from Olson is one technique for clarifying the message. Abelson's idea of "interest" also emphasizes the practical importance of the finding.

Credibility

[Image: Friendly pufferfish asking itself how "fishy" the work is and how much puffery has been used]

Does the work look trustworthy, or is it "fishy"? Abelson describes a continuum from conservative to liberal approaches to analysis. The conservative extreme would be to write down the hypothesis and the analytic plan ahead of time (a priori), test only one primary outcome (or adjust conservatively for multiple testing, such as with a Bonferroni correction or an alpha of .01), and not run or report any additional analyses. The pre-registration movement (e.g., ClinicalTrials.gov and the replication projects at the OSF) is an example of the conservative approach.
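A Bonferroni correction itself is easy to apply; here is a minimal sketch using statsmodels' multipletests helper on some made-up p-values:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from five pre-specified outcomes
pvals = [0.003, 0.020, 0.041, 0.150, 0.600]

# Bonferroni: each p-value is effectively compared against alpha / number_of_tests
reject, p_adjusted, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")

for p, p_adj, sig in zip(pvals, p_adjusted, reject):
    print(f"p = {p:.3f} -> adjusted p = {p_adj:.3f}, reject H0: {sig}")
```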

The liberal approach is to do lots of exploration, sensitivity analyses, and subgroup analyses, and to report all of it. If framed as "exploratory analysis" or the "context of discovery," that can be okay. Machine learning algorithms and "data mining" usually take this more liberal approach.

Where it gets sketchy is when people run lots of analyses (liberal methodology) but report them as if they were a conservative analysis. Reporting only the most significant results as if they had been the a priori hypotheses, or failing to disclose how many analyses were run, is where "liberal" turns into p-hacking.

Some reflections

It is probably not possible to score high on all of the MAGIC facets with a single study. Every study has limitations, whether in sample size, representativeness of the sample, quality of measurement, or adequacy of the analyses and presentation, among many other possibilities. Many of the principles are in tension with each other. More rigorous methodology drives up the cost of the study and usually leads to a smaller sample size, whereas very large samples often must settle for less rigorous measurement. This can be expressed as a trade-off between the reliability and validity of measurement versus sample size, both of which affect the precision of estimates and thus the statistical power to detect effects. There is also a risk of sacrificing credibility and conservative analysis in a rush to hype the "interest" of findings.
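One way to make that trade-off concrete is the classic attenuation formula: the observed correlation shrinks with the square root of the measures' reliabilities, which lowers power unless the sample grows. A rough sketch (the reliability, effect-size, and sample-size values are arbitrary):

```python
import math
from scipy.stats import norm

def observed_r(true_r: float, rel_x: float, rel_y: float) -> float:
    """Spearman's attenuation: r_observed = r_true * sqrt(rel_x * rel_y)."""
    return true_r * math.sqrt(rel_x * rel_y)

def power_for_r(r: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power for a two-sided test of r = 0, via the Fisher z transform."""
    z_effect = math.atanh(r) * math.sqrt(n - 3)
    z_crit = norm.ppf(1 - alpha / 2)
    return 1 - norm.cdf(z_crit - z_effect)

true_r = 0.30
for reliability, n in [(0.90, 100), (0.60, 100), (0.60, 250)]:
    r_obs = observed_r(true_r, reliability, reliability)
    print(f"reliability={reliability}, n={n}: "
          f"observed r={r_obs:.2f}, power={power_for_r(r_obs, n):.2f}")
```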

Abelson was writing before the replication crisis and before the OSF existed. Sharing code, sharing data, and publishing the analyses make it possible to adopt a more liberal approach to analysis and yet do it responsibly, with transparency and the opportunity for others to work through the analysis and see whether they would arrive at similar conclusions.

Machine learning methods are another development in the field: the computer automates fitting a wide range of models and then examines both the fit (described as accuracy of prediction, or reducing "bias" in prediction) and the stability of the model across resamplings (often described as the variance of the estimates across k-fold cross-validation) as a way of balancing "discovery" with reproducibility.
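As an illustration of that bookkeeping, k-fold cross-validation in scikit-learn yields an accuracy for each fold; the mean speaks to fit and the spread across folds to stability. The sketch below uses a small synthetic dataset as a stand-in for real data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic classification data as a stand-in for a real dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation

print("accuracy per fold:", np.round(scores, 3))
print("mean (fit):", round(scores.mean(), 3))
print("std across folds (stability):", round(scores.std(), 3))
```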

External Links

References

  1. Abelson, Robert P. (1995). Statistics as Principled Argument. Hillsdale, NJ: Lawrence Erlbaum Associates. ISBN 0-8058-0527-3. OCLC 31011850. https://www.worldcat.org/oclc/31011850.
  2. Olson, Randy (2015). Houston, We Have a Narrative: Why Science Needs Story. Chicago: University of Chicago Press. ISBN 978-0-226-27070-8. OCLC 907206179. https://www.worldcat.org/oclc/907206179.