Why does the mystical p-value sometimes not tell the truth?

P-values seem to carry an importance in scientific papers that the layman does not understand. Instead of going into the details of the definition, I will use coin tosses as an example. Most coins are balanced, meaning each side is equally likely to come up in a toss - which makes them perfect for use in random experiments.

The experiment

My experiment is predicting the outcome of a coin toss. Other experiments are less random, like measuring the height or intelligence of a group of people. The big question is: what is random and what is not?

The correct order of business would be:

  1. Stating the hypothesis - here: predicting the outcome to be heads.
  2. Carrying out the experiment - here: flipping the coin.
  3. Observing the result - here (just for the sake of the example): it was tails.
  4. Publishing the results - here: writing a paper in any case, even when the hypothesis was not confirmed.

What would we see in the scientific literature? About 50 percent of the papers would report heads, the other 50 percent tails, and everybody would understand that the outcome of flipping coins is random. This is something we seldom see in the scientific literature.

The successful scientist

If my career depends on the publication of successful papers, then papers with negative results are not desirable. My reputation would be damaged if I published a report where the experiment does not support my hypothesis. This pressure results in papers where the outcome is tails if the hypothesis was tails, or heads if the hypothesis was heads.

Another way to ensure a successful paper is to postpone forming the hypothesis until after the analysis has been carried out. This procedure is also known as the Texas sharpshooter fallacy. Obviously this is wrong. However, data dredging is quite common.

In my case, I made a hypothesis about the outcome of the coin flip. I predict heads, assuming that the coin will always land this way - that is my hypothesis. The null hypothesis assumes that the outcome is random, giving me a 50% chance of being right.

Making it significant

My paper only counts if my findings are statistically significant. Often a threshold of 0.05 is employed. In order to reach that level, I would have to predict the outcome of 5 consecutive coin tosses. Flipping the coin 5 times gives a probability of 1/2·1/2·1/2·1/2·1/2 = 1/32, or a chance of 3.125%. Hence, the more details I predict, the more significant a success would be.
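This arithmetic can be checked with a few lines of Python (a minimal sketch; the function name is my own):

```python
def p_value(n_tosses: int) -> float:
    """Chance of correctly predicting n consecutive tosses of a fair coin,
    i.e. the probability of the observation under the null hypothesis."""
    return 0.5 ** n_tosses

print(p_value(4))  # 0.0625  - not significant at the 0.05 level
print(p_value(5))  # 0.03125 - 1/32, below the 0.05 threshold
```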

Some scientific papers carry out an extensive data analysis, finding details never seen before. He who seeks, finds. If one looks for details shared by only 5% of the studied subjects, then any such detail is statistically significant (p<0.05). A detail shared by only 1% becomes statistically significant at p<0.01.
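The one-in-twenty rate can be illustrated with a short simulation. Under the null hypothesis, a correctly computed p-value is uniformly distributed on [0, 1], so pure noise produces "significant" details at the rate of the threshold (a sketch; the number of details is my own choice):

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

# Each "detail" is pure noise: its p-value is a uniform draw, and it
# counts as a false positive whenever it falls below the 0.05 threshold.
n_details = 10_000
false_positives = sum(random.random() < 0.05 for _ in range(n_details))
print(false_positives / n_details)  # roughly 0.05: one detail in twenty
```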

In my case, I would postpone forming the hypothesis until after my data analysis has been carried out. Many journals do not require me to state the order of business, so I get away with a vague description of how my hypothesis came into existence. As I continued the coin-flipping business, the result became tails-tails-heads-tails-heads (TTHTH). Publishing this would give me the desired publication, but 31 out of 32 replication trials would be unable to repeat my experiment.
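How badly the replications would fail can be simulated (a sketch, assuming a fair coin; the number of labs is arbitrary):

```python
import random

random.seed(0)  # fixed seed for reproducibility

def replicate(sequence: str, n_labs: int) -> int:
    """Count how many labs, flipping a fair coin, reproduce the sequence."""
    hits = 0
    for _ in range(n_labs):
        tosses = "".join(random.choice("HT") for _ in range(len(sequence)))
        hits += tosses == sequence
    return hits

hits = replicate("TTHTH", 3200)
print(hits, "of 3200 labs")  # close to 100, i.e. about 1 lab in 32
```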

Some experiments are expensive to carry out. Sharing lab results is a nice gesture among scientists. It is obvious that using my "lab results" will enable you to confirm my findings. If you insist on using your own lab results, it would end in disaster: since my results were statistically significant, everybody would assume that you are an inexperienced scientist. Yet on average only one in twenty details should turn out to be statistically significant at p<0.05.

Give me a hypothesis or lab-results, but not both!

P-values give an indication of how likely an outcome is, if it were random. A p-value is calculated like a probability, but using the correct formula does not ensure its correctness. How likely is a certain outcome of 5 coin tosses? It is 1/32. However, if I know that the first toss is tails, this changes: outcomes starting with heads are no longer possible, while all others increase in likelihood to 1/16. At this point, my results would no longer be statistically significant. This change of probabilities is the Achilles heel of conditional probabilities - they change when new information is obtained.
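Enumerating the sample space makes this concrete (a sketch; 'H'/'T' encode heads and tails):

```python
from itertools import product

# All equally likely outcomes of five fair coin tosses.
outcomes = ["".join(seq) for seq in product("HT", repeat=5)]
print(len(outcomes))  # 32 outcomes, each with probability 1/32

# Condition on the first toss being tails: outcomes starting with heads
# drop out, and each remaining outcome now has probability 1/16.
given_first_tails = [o for o in outcomes if o[0] == "T"]
print(len(given_first_tails))  # 16
```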

One could claim to have calculated the probabilities of all outcomes of 5 coin tosses before the experiment, hence the numbers should still be valid. However, I picked out one special sequence (TTHTH). By singling out this sequence, I applied an after-the-fact selection - which is an additional condition, so the value should change. In my case, the value increases to 1.0: the probability of seeing tails-tails-heads-tails-heads, given that my coin fell tails-tails-heads-tails-heads.
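The same enumeration shows the effect of the after-the-fact selection (a sketch):

```python
from itertools import product

outcomes = ["".join(seq) for seq in product("HT", repeat=5)]
observed = "TTHTH"

# Before the experiment: TTHTH is one outcome out of 32.
prior = sum(o == observed for o in outcomes) / len(outcomes)
print(prior)  # 0.03125

# After the fact: conditioning on the sequence that was singled out for
# publication leaves only that sequence, so its probability becomes 1.
selected = [o for o in outcomes if o == observed]
posterior = sum(o == observed for o in selected) / len(selected)
print(posterior)  # 1.0
```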

What is left?

As a reader of a scientific paper, I have to skip statements about statistical significance when the authors were too lazy to describe when exactly they came up with the hypothesis. Without the p-values, only the existence of the details can be acknowledged. A detail could be a property of the sample, or it could indeed indicate some correlation.

If a certain scientist only publishes papers on successful experiments, it could mean one of two things:
  • the scientist is cheating, making up the hypothesis after the fact, or
  • the outcome is less random than assumed.
An unlucky scientist studying a random phenomenon should publish results according to the random distribution: in the coin-flipping field, half the papers should report a failure. If all scientists are using the same coin and only 25% report a failure, then the assumption of a balanced coin seems to be wrong - the coin is likely heavier on one side. Even though the reports are pair-wise contradictory, together they give a more correct picture of the randomness.
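A small simulation of honest heads-predicting scientists shows how the published failure rate reveals the coin's bias (a sketch; the bias value 0.75 is my own assumption):

```python
import random

random.seed(7)  # fixed seed for reproducibility

def failure_rate(p_heads: float, n_papers: int) -> float:
    """Fraction of honest papers whose 'heads' prediction failed."""
    failures = sum(random.random() >= p_heads for _ in range(n_papers))
    return failures / n_papers

fair = failure_rate(0.5, 10_000)
biased = failure_rate(0.75, 10_000)
print(fair)    # about 0.50 - consistent with a balanced coin
print(biased)  # about 0.25 - the coin favours heads
```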

My checklist:
  • When was the hypothesis made?
  • Has the data used for forming the hypothesis been reused in the analysis?
  • How much data dredging has been carried out?
  • Has the author been picking out details from the examination?
  • How many similar details are there?
  • Where are the failures?


There are some scientific papers out there which try to characterize pedophiles as a group. It seems the authors are not aware of the pitfalls of data dredging and the Texas sharpshooter fallacy. As a result, some characteristics of pedophiles are reported as "statistically significant" which are at least doubtful. Since those results contribute to the stigmatization of pedophiles, such papers are in my opinion unethical.
