Placebo Calibration Tests

20 falsification tests across the pre-treatment period (2026-01-08 to 2026-03-15)

Generated 2026-07-02 19:36 · Model calibration check · Data Jan 08, 2026 to Jul 02, 2026 · Day 109 post-treatment

HRV (RMSSD) FPR

Watch

100% (20/20)

Liberal (FPR=100%, binom p<0.001)

Mann-Whitney false positive rate at alpha=0.05

Lowest HR FPR

Watch

100% (20/20)

Liberal (FPR=100%, binom p<0.001)

Mann-Whitney false positive rate at alpha=0.05

Average HR FPR

Watch

100% (20/20)

Liberal (FPR=100%, binom p<0.001)

Mann-Whitney false positive rate at alpha=0.05

Sleep Efficiency FPR

Watch

60% (12/20)

Liberal (FPR=60%, binom p<0.001)

Mann-Whitney false positive rate at alpha=0.05

Calibration Summary

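Mean false positive rate across the four metrics: 90.0%, far above the nominal 5% level. The Mann-Whitney tests are liberal, so real treatment-effect p-values should be interpreted very cautiously. A stricter significance threshold helps only partly: at p < 0.01 no Sleep Efficiency placebo test is significant (smallest p = 0.012), but HRV (RMSSD), Lowest HR, and Average HR have p < 0.001 at all 20 placebo dates, so no conventional threshold restores calibration for those metrics.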

Placebo Test Results (All Dates)

Placebo Date	Day #	HRV (RMSSD) p	Lowest HR p	Average HR p	Sleep Efficiency p
2026-01-23	15	p<0.001	p<0.001	p<0.001	p=0.066
2026-01-24	16	p<0.001	p<0.001	p<0.001	p=0.028
2026-01-26	18	p<0.001	p<0.001	p<0.001	p=0.023
2026-01-27	19	p<0.001	p<0.001	p<0.001	p=0.018
2026-02-01	24	p<0.001	p<0.001	p<0.001	p=0.039
2026-02-05	28	p<0.001	p<0.001	p<0.001	p=0.035
2026-02-06	29	p<0.001	p<0.001	p<0.001	p=0.027
2026-02-07	30	p<0.001	p<0.001	p<0.001	p=0.049
2026-02-08	31	p<0.001	p<0.001	p<0.001	p=0.084
2026-02-09	32	p<0.001	p<0.001	p<0.001	p=0.195
2026-02-12	35	p<0.001	p<0.001	p<0.001	p=0.101
2026-02-14	37	p<0.001	p<0.001	p<0.001	p=0.041
2026-02-15	38	p<0.001	p<0.001	p<0.001	p=0.017
2026-02-16	39	p<0.001	p<0.001	p<0.001	p=0.012
2026-02-18	41	p<0.001	p<0.001	p<0.001	p=0.044
2026-02-19	42	p<0.001	p<0.001	p<0.001	p=0.102
2026-02-21	44	p<0.001	p<0.001	p<0.001	p=0.026
2026-02-22	45	p<0.001	p<0.001	p<0.001	p=0.066
2026-02-26	49	p<0.001	p<0.001	p<0.001	p=0.076
2026-02-28	51	p<0.001	p<0.001	p<0.001	p=0.232
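The Sleep Efficiency column is the only one with exact p-values, so its false positive rate can be recomputed directly from the table. The sketch below is illustrative only; the function name `false_positive_rate` is a hypothetical helper, and the values are copied from the rows above. The other three metrics are reported only as p<0.001, so they count as significant at either threshold shown.

```python
# Sleep Efficiency p-values copied from the table above, in row order.
sleep_efficiency_p = [
    0.066, 0.028, 0.023, 0.018, 0.039, 0.035, 0.027, 0.049, 0.084, 0.195,
    0.101, 0.041, 0.017, 0.012, 0.044, 0.102, 0.026, 0.066, 0.076, 0.232,
]


def false_positive_rate(p_values, alpha):
    """Fraction of placebo tests with p < alpha (hypothetical helper)."""
    return sum(p < alpha for p in p_values) / len(p_values)


fpr_05 = false_positive_rate(sleep_efficiency_p, 0.05)
fpr_01 = false_positive_rate(sleep_efficiency_p, 0.01)

print(f"alpha=0.05: {fpr_05:.0%}")  # alpha=0.05: 60%
print(f"alpha=0.01: {fpr_01:.0%}")  # alpha=0.01: 0%
```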

False Positive Rate Summary

Metric	Significant	Total	FPR	Assessment
HRV (RMSSD)	20	20	100%	Liberal (FPR=100%, binom p<0.001)
Lowest HR	20	20	100%	Liberal (FPR=100%, binom p<0.001)
Average HR	20	20	100%	Liberal (FPR=100%, binom p<0.001)
Sleep Efficiency	12	20	60%	Liberal (FPR=60%, binom p<0.001)
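The "binom p" values in the Assessment column can be checked with an exact binomial tail probability. The sketch below assumes a one-sided test of the observed count against the nominal rate of 0.05; the report does not state which variant it used, and `binom_upper_tail` is a hypothetical helper, not the report's own code.

```python
from math import comb


def binom_upper_tail(k, n, p0=0.05):
    """P(X >= k) for X ~ Binomial(n, p0): chance of k or more false positives."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


# Counts taken from the table above: (significant, total).
observed = {
    "HRV (RMSSD)": (20, 20),
    "Lowest HR": (20, 20),
    "Average HR": (20, 20),
    "Sleep Efficiency": (12, 20),
}

tail_p = {name: binom_upper_tail(k, n) for name, (k, n) in observed.items()}
for name, p in tail_p.items():
    print(f"{name}: P(X >= k) = {p:.2e}")
```

Under this assumption every metric's tail probability is far below 0.001, which is why the rounded value printed as 0.000.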

P-Value Distribution Under Null

Under correct calibration, p-values should be approximately uniformly distributed (flat histogram). A spike near 0 suggests the model is liberal.
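One way to quantify departure from a flat histogram is the one-sample Kolmogorov-Smirnov distance between the observed p-values and the Uniform(0, 1) distribution. The sketch below is an illustration, not part of the report's pipeline: it uses the Sleep Efficiency p-values from the detail table (the only metric with exact values), and the critical value is the common large-sample approximation 1.36/sqrt(n) for alpha=0.05, which is only approximate at n=20.

```python
import math

# Sleep Efficiency placebo p-values from the detail table.
p_values = [
    0.066, 0.028, 0.023, 0.018, 0.039, 0.035, 0.027, 0.049, 0.084, 0.195,
    0.101, 0.041, 0.017, 0.012, 0.044, 0.102, 0.026, 0.066, 0.076, 0.232,
]


def ks_distance_uniform(values):
    """Largest gap between the empirical CDF and the Uniform(0, 1) CDF."""
    ordered = sorted(values)
    n = len(ordered)
    d_plus = max((i + 1) / n - p for i, p in enumerate(ordered))
    d_minus = max(p - i / n for i, p in enumerate(ordered))
    return max(d_plus, d_minus)


d = ks_distance_uniform(p_values)
critical = 1.36 / math.sqrt(len(p_values))  # approximate 5% critical value

print(f"KS distance D = {d:.3f}, approx. critical value = {critical:.3f}")
print("Spike near 0 (liberal)" if d > critical else "Consistent with uniform")
```

A distance well above the critical value, driven by p-values bunched near 0, matches the "liberal" reading described above.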

Methodology

Purpose: Placebo (falsification) tests check whether the statistical methods used for treatment-effect estimation produce false positives at the expected nominal rate. If they do, the p-values from the real analysis are trustworthy.

Method: 20 random dates were drawn (seed=42) from the pre-treatment period (2026-01-08 to 2026-03-15), each at least 14 days from the window edges. At each placebo date, the pre-treatment data was split and a two-sided Mann-Whitney U test was performed for each metric. CausalImpact was not available in this environment.
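The procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the report's actual code: the split rule (all days before the placebo date versus all days from it onward) is assumed, the daily series is synthetic noise, and the pure-Python `mann_whitney_p` uses a normal approximation in place of a library routine such as `scipy.stats.mannwhitneyu`. Python's `random` with seed 42 will not reproduce the specific dates listed in the detail table.

```python
import math
import random
from datetime import date, timedelta
from statistics import NormalDist

WINDOW_START = date(2026, 1, 8)
WINDOW_END = date(2026, 3, 15)
MARGIN_DAYS = 14  # minimum distance from either window edge
N_PLACEBOS = 20


def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney U p-value (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1  # average rank of the tied group
        group_size = j - i + 1
        tie_term += group_size**3 - group_size
        rank_sum_x += midrank * sum(1 for k in range(i, j + 1) if pooled[k][1] == 0)
        i = j + 1
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        return 1.0  # all values identical, no evidence of a shift
    z = max(abs(u - mean_u) - 0.5, 0.0) / math.sqrt(var_u)  # continuity correction
    return 2 * (1 - NormalDist().cdf(z))


def run_placebo_tests(series, seed=42):
    """Test each placebo date; series[0] is the value on WINDOW_START."""
    n_days = (WINDOW_END - WINDOW_START).days + 1
    eligible = range(MARGIN_DAYS, n_days - MARGIN_DAYS)
    offsets = sorted(random.Random(seed).sample(eligible, N_PLACEBOS))
    results = []
    for off in offsets:
        before, after = series[:off], series[off:]  # assumed split rule
        placebo_date = WINDOW_START + timedelta(days=off)
        results.append((placebo_date, off, mann_whitney_p(before, after)))
    return results


# Synthetic stand-in for one metric: independent daily noise with no real effect.
n_days = (WINDOW_END - WINDOW_START).days + 1
data_rng = random.Random(0)
series = [data_rng.gauss(50.0, 5.0) for _ in range(n_days)]

results = run_placebo_tests(series)
n_sig = sum(p < 0.05 for _, _, p in results)
print(f"{n_sig}/{len(results)} placebo tests significant at alpha=0.05")
```

Swapping the synthetic `series` for a real pre-treatment metric would give one column of the detail table under these assumptions.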

Expected result: ~5% of placebo tests should be significant (1 out of 20). If the observed FPR is much higher, the real treatment-effect p-values may be overconfident.

Interpretation:

Well-calibrated: FPR within ~2 percentage points of 5%
Conservative: FPR near 0% (tests are too strict, may miss real effects)
Liberal: FPR significantly above 5% (tests find "effects" where none exist)