Reproducibility Crisis - The biggest problem no one is talking about

Foreword

I'm not anti-science. I have respect for the scientific method (when it's conducted properly). I've published papers and co-authored patents (check the IP section of my site). My concern isn't with science itself; it's with the system we've built around it, a system that has broken the trustworthiness of the research.

I also want to note that I made the title of this piece deliberately dramatic. People ARE definitely talking about it; I just don't feel it has hit the mainstream yet, and when I mention it to my friends they are completely unaware that this is going on. In fact, I'd go so far as to say that outside of academia it is completely ignored, and people just blindly assume that if 'science says it' then it must be true, which is scary af tbh. I grew up feeling exactly the same way, until I studied post-grad stats, wrote my own papers, and learned all this.

Finally, I recognize that I'm in a privileged position to even critique this system. I've benefited from it, and I have the security to question it because I build companies rather than work in academia; being published means nothing to me. Many people aware of this issue don't have that luxury: their next grant, their tenure, their entire career depends on playing by the current rules. That's precisely why the system needs to change.

Finally finally. Marc Andreessen did a podcast on this once like 2 years ago, I cannot for the life of me find the episode, I can't remember which show it was on or anything, but he broke it down really well. If you find it, please send it to me! The best I could find was this tweet of his here.

1. Modern Academia Has a Major Issue

There's something deeply broken in modern science. Most published research findings don't hold up when someone tries to repeat them. This is the reproducibility crisis, and it's everywhere: psychology, medicine, economics, and even physics.

Does this scare you? It should. In a 2016 Nature survey, over 70% of researchers said they'd tried and failed to reproduce someone else's results, and about half couldn't even reproduce their own findings. That's a credibility nightmare in a field built on trust.

We base policy, medicine, and billions in funding on "what the research shows." But if the research is wrong, or just fragile, we're building skyscrapers on wet sand. Terrible analogy, but you get the point: we grew up thinking science was the perfect, ultimate, objective truth-finding mechanism, only for it to be unveiled that it has been broken for decades.

I don't know if I have the energy to weave morality into this argument, but people who probably started out with noble intentions are trapped in a system that rewards bad habits (the improper use of statistics in their research, under pressure to find statistically significant results).


2. The Incentive to Mess with Stats to Get Published - how to p-hack, effect hack, and data dredge

If you're an academic, your career hinges on getting published. Grants, jobs, and tenure all depend on it.

But not just any publication counts: you need significant results (significance here meaning statistically significant, which hopefully you remember from high school stats). Journals don't want "we found nothing." They want "we discovered something new." So researchers face a subtle but powerful pressure: find significance or don't get published.

So, over time, researchers have increasingly been massaging their analyses until they hit that magic number: p < 0.05.

Maybe you remove a few outliers because they "look weird." Maybe you run five different models and pick the one that looks best. Maybe you collect data until the p-value dips below 0.05 and call it a day.

This is p-hacking, and it's everywhere.
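
As a quick illustration of the last trick, here's a toy simulation (my own setup, not drawn from any of the papers cited below): both groups are pure noise, but we keep adding participants and re-checking the p-value until it dips below 0.05 or we give up.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_simulations = 5_000
false_positives = 0

for _ in range(n_simulations):
    # Start with 20 people per group; both groups come from the SAME distribution.
    a, b = list(rng.normal(size=20)), list(rng.normal(size=20))
    while len(a) <= 100:
        _, p = stats.ttest_ind(a, b)
        if p < 0.05:                      # "significant" - stop and write it up
            false_positives += 1
            break
        a.extend(rng.normal(size=10))     # otherwise collect ten more per group and peek again
        b.extend(rng.normal(size=10))

print(f"False-positive rate with peeking: {false_positives / n_simulations:.0%}")
# A single pre-planned test would give about 5%; peeking after every batch gives far more.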

In their landmark 2011 paper False-Positive Psychology, Joseph Simmons, Leif Nelson, and Uri Simonsohn demonstrated how easy it is to "prove" nonsense. They literally showed that listening to certain songs could make people younger: a statistically significant result that was, of course, false. (Simmons et al., 2011, Psychological Science)

Their point was simple: the more flexibility you allow in how you analyze data, the more likely you are to find something "significant" even when nothing is real.

You don't have to cheat to fool yourself. You just have to keep tweaking until the numbers tell the story you want.

Run enough tests, and one will pop significant by luck. That's just math. A significance threshold of 0.05 means you'll get a false positive one in 20 times even if nothing's happening. But most studies test far more than 20 things: multiple measures, conditions, and models, and only report the winners.

Simmons et al. showed this can push false-positive rates from 5% to over 60%. And because null findings rarely see the light of day, the literature ends up stacked with chance results dressed up as discoveries.
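
You can see the same arithmetic with a few lines of simulation (a deliberately crude sketch, not the Simmons et al. procedure): test twenty independent noise-only outcomes per "study" and count how often at least one clears the bar.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies = 10_000   # simulated studies
n_tests = 20         # outcomes/measures tested within each study
lucky_studies = 0

for _ in range(n_studies):
    for _ in range(n_tests):
        a = rng.normal(size=30)           # group A: pure noise
        b = rng.normal(size=30)           # group B: pure noise, no real effect anywhere
        _, p = stats.ttest_ind(a, b)
        if p < 0.05:
            lucky_studies += 1            # at least one "discovery" by luck alone
            break

print(f"Studies with at least one significant result: {lucky_studies / n_studies:.0%}")
# roughly 1 - 0.95**20, i.e. around 64%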


b. Misunderstanding the p-value

Here's the big misconception:

p < 0.05 does not mean there's a 95% chance the finding is true.

It actually means that, if the null hypothesis were true, there would only be a 5% chance of observing results at least as extreme as the ones you got by random chance alone.

The American Statistical Association even issued a statement in 2016 warning that treating p < 0.05 as a stamp of truth "leads to considerable distortion of the scientific process."
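
A back-of-the-envelope calculation shows how far apart those two statements are. The numbers here are my own illustrative assumptions (10% of tested hypotheses are true, studies have 80% power), not anything from the ASA statement:

prior_true = 0.10   # assumed share of tested hypotheses that are actually true
power      = 0.80   # chance of detecting a real effect when it exists
alpha      = 0.05   # chance of a false positive when there is no effect

p_sig_and_true  = power * prior_true          # true effects that come up significant
p_sig_and_false = alpha * (1 - prior_true)    # noise that comes up significant anyway
p_true_given_sig = p_sig_and_true / (p_sig_and_true + p_sig_and_false)

print(f"P(effect is real | p < 0.05) = {p_true_given_sig:.0%}")   # about 64%, nowhere near 95%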


c. Neglecting effect size

Something can be statistically significant and utterly trivial. With big samples, even tiny effects become "significant." A medication that lowers blood pressure by 0.3 mmHg can hit p < 0.05 if you test enough people, but that doesn't mean it's useful.

Effect sizes and confidence intervals tell us whether something actually matters. For decades, journals ignored that nuance, turning significance into a false synonym for importance.
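
To make the blood-pressure example concrete, here's a small simulation on entirely made-up data: with 20,000 patients per arm, a true 0.3 mmHg drop usually clears p < 0.05, yet the effect size is negligible by any convention.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 20_000                                    # patients per arm
control   = rng.normal(140.0, 10.0, size=n)   # systolic BP in mmHg
treatment = rng.normal(139.7, 10.0, size=n)   # true drop of only 0.3 mmHg

_, p = stats.ttest_ind(control, treatment)
pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
cohens_d = (control.mean() - treatment.mean()) / pooled_sd

print(f"p = {p:.4f}")                 # typically well under 0.05
print(f"Cohen's d = {cohens_d:.3f}")  # around 0.03, i.e. practically nothing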


d. Underpowered studies and the sample size trap

Another major problem: tiny sample sizes.

Small-n studies produce noisy data, and that noise inflates apparent effects. When researchers do manage to hit "significance" with small samples, the effect size is usually exaggerated. Later replications, with proper power, tend to find much smaller effects or none at all.
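
Here's what that inflation looks like in a quick simulation (the numbers are mine, chosen for illustration): the true effect is a modest d = 0.3, samples are 20 per group, and we look only at the runs that happened to reach significance, which is exactly what ends up in journals.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
published_ds = []

for _ in range(20_000):
    a = rng.normal(0.3, 1.0, size=20)    # true standardized effect of 0.3
    b = rng.normal(0.0, 1.0, size=20)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:                         # only the "winners" get written up
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        published_ds.append((a.mean() - b.mean()) / pooled_sd)

print(f"Share of runs reaching significance (power): {len(published_ds) / 20_000:.0%}")
print(f"Average published effect size: d = {np.mean(published_ds):.2f}  (true d = 0.3)")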

So what's "small"? It depends on the effect size you expect, but there are real numbers here. To achieve the standard 80% power (an 80% chance of detecting a real effect if it exists) in a simple two-group comparison, you need roughly 26 participants per group for a large effect (d = 0.8), about 64 per group for a medium effect (d = 0.5), and close to 400 per group for a small effect (d = 0.2).

(Mindhacks, 2015)
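
If you want to see where numbers like these come from, a power calculation is only a few lines. This sketch uses statsmodels' two-sample t-test power solver with the conventional alpha of 0.05 and 80% power (effect sizes are Cohen's d):

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    n_per_group = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80)
    print(f"{label} effect (d = {d}): about {n_per_group:.0f} participants per group")
# small ≈ 394, medium ≈ 64, large ≈ 26 per group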

If your n is under 50 per condition and you're claiming a new psychological or biological effect, you're probably chasing noise.

Low power is the silent killer of reproducibility. It produces too many false negatives (missing real effects) and too many inflated positives (overstated results).


e. The 0.05 Problem

The p < 0.05 cutoff was never meant to be gospel. It was a convention. But somewhere along the way, "statistically significant" became "true." Also, if you've ever run a study you know how easy it is to hit .05; it should really be way, way lower. When you see a result that's truly significant, it's usually closer to .00001.

In 2017, a group of 72 statisticians and scientists proposed lowering the bar to p < 0.005 for new discoveries (Benjamin et al., 2017, Nature Human Behaviour). Why? Because p = 0.05 only gives about 3:1 odds that a finding is real under typical conditions. At p = 0.005, that rises to ~20:1 odds, far stronger evidence.
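
One way to build intuition for those odds is a well-known upper bound on the Bayes factor implied by a p-value, BF <= 1 / (-e * p * ln p), valid for p < 1/e. This is my illustration rather than the exact calculation in Benjamin et al. (2017), but it lands in the same ballpark:

import math

def max_bayes_factor(p: float) -> float:
    """Upper bound on the evidence for H1 over H0 implied by a p-value."""
    return 1.0 / (-math.e * p * math.log(p))

for p in (0.05, 0.005):
    print(f"p = {p}: evidence of at most {max_bayes_factor(p):.1f} : 1 in favour of a real effect")
# p = 0.05  -> at most about 2.5 : 1
# p = 0.005 -> at most about 14 : 1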

As John Ioannidis argued in Why Most Published Research Findings Are False (2005), when you combine low power, small effects, and publication bias, most "significant" findings are probably wrong. (Ioannidis, 2005, PLOS Medicine)


3. Famous Papers That Didn't Hold Up

Psychology's "Power Pose"

In 2010, Amy Cuddy, Dana Carney, and Andy Yap published a paper claiming that holding a "high power" pose for two minutes could boost testosterone, reduce stress hormones, and make you more confident. The idea exploded: TED Talks, business workshops, viral fame.

But replication studies found no hormonal changes. One of the original authors, Carney, later publicly disavowed the effect, admitting she no longer believed it was real. It probably wasn't deliberate fraud; just small samples (n = 42), flexible analyses, and wishful thinking. But it's a good example, because papers like this get used as the foundation for more papers. What if it had been a more serious study, such as cancer research? Well...


Medicine's Cancer Crisis

In 2012, C. Glenn Begley and Lee Ellis at Amgen tried to reproduce 53 landmark cancer studies. They could only confirm 6 of them, about 11% (Begley & Ellis, 2012, Nature). Bayer scientists found similar results: roughly 75% of "promising" preclinical findings didn't hold up in their labs.

These weren't obscure papers; they were the foundations of drug programs worth billions. The problem wasn't just bad luck; it was low sample sizes, unblinded analyses, and selective reporting. Why was this allowed to happen? Probably because the research was published, and the companies were able to get funding for new drugs based on it. I'm just guessing, but it's scary as hell to consider that the general population is being given drugs based on research that is not reproducible.


Physics' Faster-Than-Light Neutrinos

Even physics, the gold standard of rigor, had its moment. In 2011, the OPERA experiment in Italy reported that neutrinos had traveled faster than light. The story made global headlines: Einstein was wrong!

Then other labs couldn't replicate the result. Eventually, investigators found the cause: a loose fiber-optic cable skewing the timing system. The finding vanished overnight.

No fraud, no scandal. Just an honest mistake caught by replication. Physics self-corrected because replication is baked into its culture. That's what every field needs. Sadly, medicine does not have this luxury right now; we should change that.


4. How We Got Here — and How to Fix It

Tbh this is a major issue and we need to break the problem down into two different parts in order to solve it. There's the stats element and the culture element:


A. Statistical Fixes

1. Pre-registration

Write down your hypotheses, design, and analysis plan before collecting data, and make it public. This simple step kills most p-hacking before it starts. Psychology and medicine now use registries like OSF and ClinicalTrials.gov to timestamp studies before they run.

2. Report effect sizes and uncertainty

Stop pretending p < 0.05 is the goal. Report the size of your effect (Cohen's d, odds ratios, etc.) and confidence intervals so readers can judge practical importance. A statistically significant blip is not a meaningful discovery.
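
As a minimal sketch of what that reporting looks like (on made-up data), here's Cohen's d with a bootstrap confidence interval, which tells a reader far more than a bare p-value:

import numpy as np

rng = np.random.default_rng(2)
treatment = rng.normal(0.4, 1.0, size=80)   # hypothetical outcome scores
control   = rng.normal(0.0, 1.0, size=80)

def cohens_d(a, b):
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

# Resample both groups 5,000 times to get a rough 95% interval for the effect size.
boot = [cohens_d(rng.choice(treatment, treatment.size, replace=True),
                 rng.choice(control, control.size, replace=True))
        for _ in range(5_000)]
low, high = np.percentile(boot, [2.5, 97.5])

print(f"Cohen's d = {cohens_d(treatment, control):.2f}, 95% CI [{low:.2f}, {high:.2f}]")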

3. Use Bayesian methods

Bayesian statistics let you update your belief in a hypothesis as new evidence arrives. It shifts focus from "Did we hit 0.05?" to "How strong is our evidence?" It's more intuitive and less prone to false positives.
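
Here's a tiny illustration of the Bayesian framing with hypothetical A/B-style count data: a conjugate Beta-Binomial model gives a direct answer to "how probable is it that B really beats A?" instead of a yes/no significance verdict.

import numpy as np

rng = np.random.default_rng(3)

# Hypothetical observed successes out of trials for two conditions.
a_success, a_total = 48, 500
b_success, b_total = 63, 500

# A flat Beta(1, 1) prior updated with the counts gives the posterior directly.
posterior_a = rng.beta(1 + a_success, 1 + a_total - a_success, size=100_000)
posterior_b = rng.beta(1 + b_success, 1 + b_total - b_success, size=100_000)

print(f"P(B's true rate > A's true rate) = {(posterior_b > posterior_a).mean():.2f}")
# Collect more data and this probability updates smoothly; no arbitrary 0.05 cliff.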

4. Bigger samples and adequate power

Underpowered studies are dead ends. Aim for at least 70 per group for medium effects and hundreds per group for small ones. If that's not feasible, collaborate or rethink your question.

Publishing a 20-person study with a "significant" p-value should be treated with skepticism, not celebration. I admit this is hard af. I had close friends doing Alzheimer's research where they literally needed brain samples to run their tests, and finding enough samples was a major challenge. They settled on 35 samples, barely enough for a measure of central tendency, but I felt for them because this was all they could get. This leads to deeper questions: if samples are so hard to come by in this field, the paper will almost never be reproducible, because how can a team be expected to go and find another 35 brains in great condition for Alzheimer's research when they are SO rare in the first place?

5. Rethink the 0.05 threshold

As Benjamin et al. (2017) argued, we should raise the bar: p < 0.005 for new claims. That doesn't mean small studies are useless; it means extraordinary claims need extraordinary evidence. A p = 0.04 is barely significant; it shouldn't be a headline, it should be a hint to look closer.


B. Cultural Fixes

1. Reward rigor, not novelty

Universities and journals should value reproducibility over flash. Publish well-designed studies even if the result is null. I've long thought that it should be JUST AS POWERFUL to publish a paper showing that nothing was found as it is to publish one with a significant finding. You're doing the same work, and you're saving others time; it's just as important to know that nothing was found as it is to show that something was, imo. Reward scientists who verify others' work too! This should be a core part of research culture.

2. Open everything

Make data, code, and methods public by default. Transparency makes fraud harder and collaboration easier. Mistakes caught early are far cheaper than full-blown replication crises.

3. Registered Reports

Some journals now review and accept studies before results are known. That flips the incentive: publication is guaranteed for solid design, not surprising outcomes. I really like this.

4. Teach real statistical literacy

Most researchers get just enough training to run a study but don't really understand what they are doing under the hood. I actually remember when we learned undergrad stats - in our lab they just gave us a series of steps to follow to run a test, with NO explanation of why. Every research program should include modern data ethics, power analysis, and reproducibility. Understanding probability should be as basic as understanding your own lab equipment.

5. Make replication essential BEFORE publishing

Replication isn't an insult; it's maintenance. The Reproducibility Project: Psychology (Open Science Collaboration, 2015) proved how powerful coordinated replication can be. If every lab treated replication as part of the job, false findings would die quickly instead of lingering for decades.


5. Conclusion

I care deeply about this subject because I just don't think enough people are talking about it, and I'm tired of people blindly trusting 'science' and assuming anything 'scientifically proven' is true. Fortunately, things are improving: pre-registration and registered reports are spreading, open data is becoming more common, and large coordinated replication projects are underway.

As John Ioannidis put it, "Most published research findings are false, but that can change if we fix the system."


Sources