Drs Bermon, Lindén Hirschberg, Kowalski & Eklund have responded to our, and others’, criticism of studies published in 2017 in the British Journal of Sports Medicine, 51:1309-1314 (Bermon et al., 2017). We continue to disagree with many of the statistical approaches used by the authors, including in their recent response (open-access) in BJSM, but we would like to avoid a detailed discussion of all of these issues here. Many of them are discussed in a comprehensive blog post here. We also note that the veracity of the underlying data has been called into question, and calls have been made for the authors to make the data publicly available, or to follow more open scientific procedures. Here, instead, we continue to argue that the results in the original paper, even if taken at face value, are extremely weak, and cannot sensibly be used to formulate policy. In particular, the IAAF policy of targeting some events, but not others, partly on the basis of the significant correlations in the study, is nonsensical.
The stakes in this debate are extremely high. The new IAAF Eligibility Regulations for Female Classification (IAAF, 2018) were published on April 23rd 2018 and are due to come into force on November 1st 2018. There is clear evidence that these regulations were developed, at least in part, on the basis of the results in Bermon & Garnier (2017) (see Explanatory Notes pp 2-3). The translation of scientific evidence into international regulations requires a considerably higher standard of analysis than was applied in this study.
We stand by our original contention that the results in the paper do not stand up to appropriate standards of statistical scrutiny. The authors claim that their original methods are justified because “we presented an exploratory study, without no attempt to claim confirmatory results”, but that the results in the study are strong. This cannot be right. If the purpose of a study is to test the hypothesis “that elevated testosterone levels enhance performance in certain events but not others” (as the authors say it is) then multiple hypothesis testing corrections are unambiguously necessary. The results cannot be confirmatory, nor can the exploratory evidence be said to be “strong”, without passing those tests. If these corrections are not performed, at least some significant effects are likely to be false positives. Any statistician would agree with this: many similar criticisms of the study have been made along these lines.
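To give a rough sense of scale, the sketch below computes how likely at least one false positive is when many tests are run without correction. It is an illustration only, not part of the original analysis: it uses the twenty-one tests reported in the study, and it assumes a conventional significance threshold of 0.05, that every null hypothesis is true, and that the tests are independent (the function names are hypothetical).

```python
# Illustration only. ALPHA and the independence of the tests are assumptions;
# NUM_TESTS is the number of tests reported in the study.

def prob_at_least_one_false_positive(num_tests: int, alpha: float) -> float:
    """Chance of one or more false positives when every null hypothesis
    is true and the tests are independent."""
    return 1.0 - (1.0 - alpha) ** num_tests


def expected_false_positives(num_tests: int, alpha: float) -> float:
    """Expected number of false positives when every null hypothesis is true."""
    return num_tests * alpha


NUM_TESTS = 21   # tests reported in the study
ALPHA = 0.05     # assumed per-test significance threshold

if __name__ == "__main__":
    p_any = prob_at_least_one_false_positive(NUM_TESTS, ALPHA)
    n_expected = expected_false_positives(NUM_TESTS, ALPHA)
    print(f"P(at least one false positive): {p_any:.3f}")
    print(f"Expected false positives:       {n_expected:.2f}")
```

Under these assumptions the chance of at least one spurious "significant" result is roughly two in three, and about one false positive is expected on average. Correlated tests would change the exact figures but not the need for a correction.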
The authors claim that it is unlikely to find five significant effects out of twenty-one tests by chance. Even if this were the case, it is still very likely that at least one (and possibly many) of the five significant results independently occurred by chance. Contrary to the authors’ claim, controlling for the false discovery rate, as we did in our paper, is not at all conservative. The more conservative family-wise error rate control (that is, controlling the probability of making one or more false discoveries) would actually be more appropriate. None of the significant correlations in the paper would survive either of these methods, as the authors concede. Hence, any statement about a correlation in any particular event rests on shaky ground.
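The difference between the two kinds of correction can be sketched in a few lines. The example below implements the Benjamini-Hochberg procedure (false discovery rate control) and the Holm procedure (one common form of family-wise error rate control; the source does not say which family-wise method is meant). The p-values are invented for illustration and are NOT the values from the study: five of twenty-one fall below an assumed threshold of 0.05, yet none survives either correction.

```python
from typing import List


def holm_reject(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Holm step-down procedure: controls the family-wise error rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):          # rank = 0 for the smallest p
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break                               # stop at the first failure
    return reject


def benjamini_hochberg_reject(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Benjamini-Hochberg step-up procedure: controls the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            largest_rank = rank                 # keep the largest passing rank
    reject = [False] * m
    for idx in order[:largest_rank]:
        reject[idx] = True
    return reject


# HYPOTHETICAL p-values, invented for illustration (not the study's data):
# five below 0.05, sixteen above it, twenty-one in total.
hypothetical_p = [0.010, 0.020, 0.030, 0.035, 0.045] + [0.06 + 0.05 * i for i in range(16)]

if __name__ == "__main__":
    uncorrected = sum(p < 0.05 for p in hypothetical_p)
    print("Significant without correction:", uncorrected)
    print("Significant after Benjamini-Hochberg:", sum(benjamini_hochberg_reject(hypothetical_p)))
    print("Significant after Holm:", sum(holm_reject(hypothetical_p)))
```

With these made-up inputs, five results look significant before correction and none after either procedure. In general, any result rejected by Holm is also rejected by Benjamini-Hochberg at the same threshold, which is the sense in which false discovery rate control is the less conservative of the two.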
We are making a distinction between two different types of claims: a claim that there is suggestive evidence of a correlation between androgen and performance in at least one event (even this claim we would disagree with), and the much stronger claim that there is a significant correlation in each of these specific five events. The authors insist on that second, stronger claim, despite the multiple hypothesis testing problem discussed above, which means that possibly some (more likely many) of those five correlations are false positives.
Like many other commentators, we do not agree with the logic behind the IAAF policy of targeting events if, and only if, there is a correlation between androgens and performance in those events. But if one did, false positives would surely be of great concern: the consequences for athletes wrongly targeted on the basis of these false positives are enormous.
We urge the authors, and those interpreting the results in Bermon and Garnier (2017) and its 2018 linked article, to be more cautious in their interpretation of the available evidence, and to acknowledge the high likelihood of false positives and large confidence intervals around each of their estimates. The stated conclusion to the original article and its 2018 linked article is clearly unfounded. To use these estimates to inform life-changing IAAF regulations is, we believe, highly misguided.
Simon Franklin is a Postdoctoral Research Economist at the London School of Economics and received his PhD in Economics from the University of Oxford. His primary research interests include urban labour markets and housing policy, though he has a keen interest in elite athletic performance, and methodological questions related to statistical inference more generally.
Jonathan Ospina Betancurt (@JonathanOspinaB) is a lecturer in Sport Science & Physical Activity at Universidad Isabel I, Spain. His most recent research examines sex differences in elite performance and hyperandrogenic athletes. His fields of research are DSD and transgender athletes, ethics and values in sport. He is a JHSE Associate Editor.
Dr Silvia Camporesi (@silviacamporesi) is a Senior Lecturer in Bioethics & Society at King’s College London, where she directs the MSc in Bioethics & Society. In the last ten years she has written extensively on the eligibility of women with hyperandrogenism to compete in the female category. Her latest book, “Bioethics, Genetics and Sport”, co-authored with Mike McNamee, was published by Routledge in March 2018.