Statistics of the Natasha test: response to concerns and questions
Skeptical Inquirer, Sept-Oct, 2005 by Ray Hyman
The television program The Girl with the X-ray Eyes has appeared on the Discovery Channel in Europe and Asia. Although the program has yet to appear in the United States, it has generated many reactions. These reactions focused on the test that Andrew Skolnick, Richard Wiseman, and I carried out on the seventeen-year-old girl in question, Natasha Demkina. The two reports of this test, the first by me, the second by Andrew Skolnick, both in the May/June 2005 SKEPTICAL INQUIRER, triggered many additional comments and criticisms [see Letters to the Editor, this issue, for examples].
Many of these reactions are based on misunderstandings. Obviously, the major question is whether Natasha's claim of diagnosis by X-ray vision is true. However, given the constraints of time and resources, we knew our test could not answer this question.
The test was designed and intended to be the first step in a potential sequence of tests. At each step, if she failed to pass the criterion, we would go no further. If she could pass the test at a given stage, this would tell us that it was worth continuing to the next stage. We wrote this into the protocol and made it clear to the Discovery Channel producer. We fully expected that the television program would make the test goals clear to the viewers. Although the television crew did an otherwise excellent job, they completely omitted our important comments about the limited goals for the test. This has aggravated the misconceptions and criticisms of our testing procedures.
The comments and criticisms about our test focused, for the most part, on these issues: 1) the choice of our criterion of five out of seven correct matches; 2) the alleged lack of power of our statistical test; and 3) the claim that we should have declared her four correct matches "significant." I do not have space to deal adequately with each of these issues in this brief response. I urge interested readers to consult my longer account (6,200 words) at www.csicop.org/specialarticles/natasha2.html.
Here I will deal with the issues succinctly.
1. The criterion: We set the criterion for "passing" the test as five or more correct matches out of seven. Many commentators said that this was too high. These critics rely on abstract ideas about what is a big or a meaningful "effect." Such abstract measures are misleading in our test. In her typical diagnosis, Natasha supposedly has no prior knowledge of the specific ailment that plagues her client. She has to scan the entire body. If her claim is true, this scan requires X-ray vision of extremely high resolution. She has to examine large organs for gross defects. She also has to look for subtle changes in color and texture plus look at processes at the cellular level.
By contrast, our test presented her with a greatly simplified task. We restricted the targets to unambiguous, easily detected deformities: a large hole in the head covered by a metal plate; metal staples in the chest; a large portion of one lung missing; an artificial hip; and the like. On each trial we told her both what to look for and where to look. Natasha herself informed the producer before the test that not having to scan the entire body would make her task much less demanding. In the report posted on the Web site listed above, I give additional reasons why this test is much simpler than her typical reading. If Natasha has anything like the ability she claims, she should have easily matched each condition to the appropriate subject. Her claims imply a type of X-ray vision of extremely high resolution. Our test required X-ray vision of very low resolution.
Consequently, if her claim is true, we should expect her to match all seven conditions correctly. We set the criterion at five, rather than at seven, because we wanted to give her some leeway. Getting only four correct matches is highly inconsistent with her claim.
2. The power of our test: Some critics claim that our test lacked sufficient power. By this they mean that even if Natasha's claim is true, our test had little chance to show it. In fact, the power of our test was adequate and much higher than the critics maintain. Based on calculations done for me by Professors Persi Diaconis and Susan Holmes of Stanford University's Department of Statistics, the odds of detecting the alternative hypothesis were better than 3:1.
Of course, we would have preferred to have even greater power. We could have increased the odds of detecting the alternative hypothesis by lowering our criterion to four. However, this would have increased the probability of falsely rejecting the null hypothesis. Any statistical test has to cope with two types of possible errors. The Type I error is that of falsely rejecting the null hypothesis (that the results are due to chance). The Type II error is that of failing to accept the alternative hypothesis when, in fact, it is true. For a given set of resources and sample size, anything one does to lower the probability of one error will increase the probability of the other. The investigator tries to choose the criterion to achieve an optimal balance between these two types of errors. Many critics seem more concerned about avoiding the Type II error than the Type I.
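The effect of the criterion on the Type I error can be illustrated with a short calculation. The sketch below is not taken from the test protocol. It assumes that the null hypothesis amounts to a uniformly random one-to-one assignment of the seven conditions to the seven subjects, and the function name tail_probability is hypothetical. Under that assumption it counts, over all 5,040 possible assignments, how often pure guessing would reach a given number of correct matches.

```python
from fractions import Fraction
from itertools import permutations


def tail_probability(n_subjects: int, criterion: int) -> Fraction:
    """Chance of at least `criterion` correct matches by guessing alone.

    Assumption (not stated in the article): guessing is modeled as a
    uniformly random one-to-one assignment of conditions to subjects.
    """
    total = 0
    passing = 0
    # Enumerate every possible assignment; position i holds the condition
    # assigned to subject i, and a match is correct when they coincide.
    for assignment in permutations(range(n_subjects)):
        total += 1
        correct = sum(1 for subject, condition in enumerate(assignment)
                      if subject == condition)
        if correct >= criterion:
            passing += 1
    return Fraction(passing, total)


if __name__ == "__main__":
    for criterion in (5, 4):
        p = tail_probability(7, criterion)
        print(f"criterion {criterion} of 7: P(Type I) = {p} ~ {float(p):.4f}")
```

Under this assumption, guessing alone passes a criterion of five about 0.4 percent of the time (11 in 2,520) and a criterion of four about 1.8 percent of the time (23 in 1,260). Lowering the criterion from five to four would therefore make a false "pass" roughly four times as likely. The corresponding gain in power depends on the alternative model used, which this sketch does not attempt to reproduce.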