{"id":2866,"date":"2018-10-17T10:24:17","date_gmt":"2018-10-17T09:24:17","guid":{"rendered":"http:\/\/blogs.kcl.ac.uk\/editlab\/?p=2866"},"modified":"2018-10-17T11:35:16","modified_gmt":"2018-10-17T10:35:16","slug":"s-for-statistical-significance","status":"publish","type":"post","link":"https:\/\/blogs.kcl.ac.uk\/editlab\/2018\/10\/17\/s-for-statistical-significance\/","title":{"rendered":"S for Statistical Significance"},"content":{"rendered":"<div>\n<p><span style=\"font-family: Calibri,sans-serif;font-size: small\"><em><strong>This week for our S blog, we bring a post on the important issue of statistical significance, written by a guest blogger from the<a href=\"http:\/\/www.thedunnlab.com\/blog\/\"> Said &amp; Dunn blog<\/a>, led by Dr Erin Dunn. This post is by Khalil Zlaoui, a graduate student in the Dunn Lab. We are very grateful to him for sharing this content with us. The standard measure of statistical significance in science (i.e. whether you believe a result is true) is when your effect is associated with a p value of less than 0.05. What this means is that you would expect that finding by chance no more than one in twenty times. Here Khalil discuss various problems with that approach and suggestions that have been made to address them.<\/strong><\/em><br \/>\n<\/span><\/p>\n<\/div>\n<p><!--more--><a href=\"http:\/\/blogs.kcl.ac.uk\/editlab\/files\/2018\/10\/khalil.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-thumbnail wp-image-2867 alignright\" src=\"http:\/\/blogs.kcl.ac.uk\/editlab\/files\/2018\/10\/khalil-150x150.png\" alt=\"\" width=\"150\" height=\"150\" srcset=\"https:\/\/blogs.kcl.ac.uk\/editlab\/files\/2018\/10\/khalil-150x150.png 150w, https:\/\/blogs.kcl.ac.uk\/editlab\/files\/2018\/10\/khalil-50x50.png 50w, https:\/\/blogs.kcl.ac.uk\/editlab\/files\/2018\/10\/khalil-100x100.png 100w\" sizes=\"auto, (max-width: 150px) 100vw, 150px\" \/><\/a><\/p>\n<div>\n<div class=\"sqs-block-content\">\n<hr \/>\n<p>&nbsp;<\/p>\n<p>Scientific studies often begin with a hypothesis about the way the world works. The hope is that by analyzing data we can find empirical evidence to uncover the truth, by either supporting or refuting our original hypothesis.<\/p>\n<p>For decades, the scientific community has relied on the\u00a0<em>p-value\u00a0<\/em>as the sole indicator of that truth.\u00a0\u00a0But a significant\u00a0<em>p-value\u00a0<\/em>is not proof of strong evidence, and in fact, it was never intended to be used as such. Its misuse and misinterpretation has led to serious problems, including an interdisciplinary replication crisis that we describe in a previous\u00a0<a href=\"http:\/\/www.thedunnlab.com\/updates\/2018\/4\/11\/research-replication-revolution\">Said&amp;Dunn post\u00a0<\/a>.<\/p>\n<blockquote><p>&#8220;Should more journals ban\u00a0<em>p-values\u00a0<\/em>altogether?&#8221;<\/p><\/blockquote>\n<p><a href=\"https:\/\/www.nature.com\/articles\/s41562-017-0189-z\">Benjamin et al. 
---

Scientific studies often begin with a hypothesis about the way the world works. The hope is that by analyzing data we can find empirical evidence to uncover the truth, by either supporting or refuting our original hypothesis.

For decades, the scientific community has relied on the *p-value* as the sole indicator of that truth. But a significant *p-value* is not proof of strong evidence, and in fact it was never intended to be used as such. Its misuse and misinterpretation have led to serious problems, including an interdisciplinary replication crisis that we described in a previous [Said&Dunn post](http://www.thedunnlab.com/updates/2018/4/11/research-replication-revolution).

> "Should more journals ban *p-values* altogether?"

[Benjamin et al. (2017)](https://www.nature.com/articles/s41562-017-0189-z) recently proposed changing the default *p-value* threshold for claims of new discoveries from 0.05 to 0.005, a proposal met with some [controversy](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5749128/). Some journals, like *Basic and Applied Social Psychology*, have gone so far as to ban the use of *p-values* altogether.

![Jelly beans p-value cartoon](http://blogs.kcl.ac.uk/editlab/files/2018/10/Jelly-beans-p-value.png)

Is tightening the *p-value* threshold a good idea? Should more journals ban *p-values* altogether? After wading into the waters around the promises and pitfalls of *p-values*, here are four main conclusions we have drawn.

### 1) *P-VALUES* SHOULD BE INTERPRETED WITH CAUTION

In 2016, the [American Statistical Association](http://www.amstat.org/) (ASA) argued that while the *p-value* is a useful measure, it has been misused and misinterpreted in weighing evidence. The ASA board therefore released six principles to guard against common misconceptions of *p-values*:

1. ***P-values* can indicate how incompatible the data are with a specified statistical model.** As models are constructed under a set of assumptions, a small *p-value* indicates data that are incompatible with the null hypothesis, as long as those assumptions hold.
2. A common misuse of *p-values* is that they are often turned into statements about the truth of the null hypothesis. ***P-values* do not measure the probability that the studied hypothesis is true.** They also do not indicate the probability that the data were produced by random chance alone.
3. **Scientific conclusions and business or policy decisions should not be based only on whether a *p-value* passes a specific threshold.** Conclusions based solely on *p-values* can pose a threat to public health. In addition to model design and estimation, factors to be considered in decision-making include study design and measurement quality.
4. **Proper inference requires full reporting and transparency.** Conducting several tests of association in order to identify a significant *p-value* leads to spurious results (see the sketch after this list).
5. **A *p-value*, or statistical significance, does not measure the size of an effect or the importance of a result.** A smaller *p-value* is not an indicator of a larger effect.
6. **By itself, a *p-value* does not provide a good measure of evidence regarding a model or a hypothesis.** A *p-value* near 0.05 is only weak evidence against the null.
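Principle 4 is what the jelly-bean cartoon above lampoons. As an illustration (ours, not the ASA's), the following Python sketch runs twenty independent tests on data with no real effect anywhere; with a 0.05 threshold, the chance of at least one spurious "discovery" is about 64%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_subgroups, n_per_group = 20, 30  # e.g. 20 jelly-bean colours

# Every subgroup is tested under a true null: no real effect anywhere.
p_values = []
for _ in range(n_subgroups):
    treated = rng.normal(0.0, 1.0, n_per_group)
    control = rng.normal(0.0, 1.0, n_per_group)
    p_values.append(stats.ttest_ind(treated, control).pvalue)

significant = [p for p in p_values if p < 0.05]
print(f"{len(significant)} of {n_subgroups} tests 'significant' at 0.05")

# Chance of at least one false positive across 20 independent tests:
print(f"Family-wise error rate: {1 - 0.95 ** n_subgroups:.2f}")  # ~0.64
```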
### 2) THERE ARE ALTERNATIVES TO *P-VALUES*

An interesting alternative to *p-values* is the Bayesian approach, a method of statistical inference that incorporates a subjective "prior" belief about the hypothesis, based on Bayes' theorem.

From Bayes' theorem: *POST*(H1) = *PRIOR*(H1) × *BF*, where *POST*(H1) is the posterior odds in favor of H1 (the alternative hypothesis), *PRIOR*(H1) is the prior odds, and *BF* = (sampling density of the data under H1) / (sampling density of the data under H0) is the Bayes factor.

Interestingly, Bayes factors can be related to *p-values*. The correspondence between *p-values* in the frequentist world (the statistical inference framework that is most commonly used) and Bayes factors in the Bayesian world can [reshape the debate](http://www.pnas.org/content/pnas/110/48/19313.full.pdf) about *p-values* and help us reconsider how strongly they can support rejecting the null.

Under some reasonable assumptions, a *p-value* of 0.05 in the frequentist world corresponds to Bayes factors in favor of the alternative hypothesis ranging from 2.5 to 3.4. In the Bayesian world, this is considered weak evidence against the null. Based on this correspondence between *p-values* and Bayes factors, Benjamin et al. (2017) proposed to [redefine statistical significance at 0.005](https://www.nature.com/articles/s41562-017-0189-z).

### 3) TIGHTENING THE *P-VALUE* THRESHOLD DOESN'T FULLY SOLVE THE REPLICATION CRISIS AND COULD LEAD TO OTHER PROBLEMS

> "*p-values* are not the only cause of the lack of reproducibility in science"

A two-sided *p-value* of 0.005 corresponds to Bayes factors in favor of the alternative ranging from 14 to 26, which in Bayesian terms is substantial to strong evidence (see the sketch at the end of this section). Benjamin et al.'s proposal to change the *p-value* threshold was intended to help address the replication crisis, in which too few studies have been able to replicate the findings of the original study. But moving to this more stringent threshold comes at a price: the need for larger samples and possibly unacceptable false-negative rates. Is such a trade-off worth it?

It's important to keep in mind that *p-values* are not the only cause of the lack of reproducibility in science. While they may be an important contributor, there are other [real issues affecting replication](https://errorstatistics.com/), including selection effects, trends towards multiple testing, hunting for significance (p-hacking), violated statistical assumptions, and so on. Tightening the *p-value* threshold from 0.05 to 0.005 will not necessarily address these issues.
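Where do these numbers come from? One commonly cited result, the bound BF <= 1 / (-e * p * ln p) of Sellke, Bayarri and Berger (2001), which Benjamin et al. draw on, takes only a couple of lines of Python; it reproduces the lower ends of the 2.5 to 3.4 and 14 to 26 ranges quoted above. (This sketch is our addition, not from the original post.)

```python
import math

def bf_bound(p: float) -> float:
    """Upper bound on the Bayes factor in favour of H1 implied by a
    p-value: BF <= 1 / (-e * p * ln p), valid for p < 1/e
    (Sellke, Bayarri & Berger, 2001)."""
    assert 0 < p < 1 / math.e
    return 1.0 / (-math.e * p * math.log(p))

for p in (0.05, 0.005):
    print(f"p = {p}: Bayes factor at most {bf_bound(p):.1f}")

# p = 0.05:  Bayes factor at most 2.5  (weak evidence)
# p = 0.005: Bayes factor at most 13.9 (substantial evidence)
```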
### 4) BUT THERE ARE ALTERNATIVES WE SHOULD ALL BE USING

Some statisticians have argued in favor of estimation (putting the emphasis on estimating the size of an effect) over testing (putting the emphasis on rejecting or accepting a hypothesis based on *p-values*). If the interest lies in testing an effect, researchers could instead rely on confidence, credibility, or prediction intervals (a brief sketch follows at the end of the post).

So in the end, we think that rather than ditch the *p-value* altogether, we should shift our focus from *p-values* to study design, effect sizes and confidence intervals, which we hope can help us better understand the evidence for our hypotheses and ultimately uncover the truth about the way the world works.
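As a closing illustration of the estimation mindset (again our sketch, using simulated data and a simple normal approximation), here is what a report centered on the effect size and its 95% confidence interval, rather than a bare significance verdict, might look like.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
treated = rng.normal(0.5, 1.0, 50)  # simulated treatment group
control = rng.normal(0.0, 1.0, 50)  # simulated control group

# Point estimate of the effect and its standard error.
diff = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / len(treated)
             + control.var(ddof=1) / len(control))

# 95% confidence interval (normal approximation for simplicity).
z = stats.norm.ppf(0.975)
lo, hi = diff - z * se, diff + z * se
print(f"Estimated effect: {diff:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```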