Oura Ring Gen 4 sensor data, not clinical measurements
N=1 case study, not validated for clinical decisions
HEV diagnosed Mar 18; Day 109 post-ruxolitinib
Consumer wearable data can support exploratory review only. The HEV diagnosis fell close in time to the start of treatment and remains a material confounder.

Placebo Calibration Tests

20 falsification tests across the pre-treatment period (2026-01-08 to 2026-03-15). Each card reports the Mann-Whitney false positive rate (FPR) at alpha=0.05.

HRV (RMSSD) FPR
Watch
100% (20/20)
Liberal (FPR=100%, binom p<0.001)

Lowest HR FPR
Watch
100% (20/20)
Liberal (FPR=100%, binom p<0.001)

Average HR FPR
Watch
100% (20/20)
Liberal (FPR=100%, binom p<0.001)

Sleep Efficiency FPR
Watch
60% (12/20)
Liberal (FPR=60%, binom p<0.001)

Calibration Summary

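Mean false positive rate across metrics: 90.0%, far above the nominal 5% level. The tests are liberal: they flag "effects" at placebo dates where no treatment occurred. Real treatment-effect p-values should therefore be interpreted very cautiously. Stricter significance thresholds (e.g., p < 0.01) may be appropriate, although for HRV, lowest HR, and average HR every placebo p-value was already below 0.001, so a stricter threshold alone would not restore calibration for those metrics.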

Placebo Test Results (All Dates)

Placebo Date   Day #   HRV (RMSSD) p   Lowest HR p   Average HR p   Sleep Efficiency p
2026-01-23     15      p<0.001         p<0.001       p<0.001        p=0.066
2026-01-24     16      p<0.001         p<0.001       p<0.001        p=0.028
2026-01-26     18      p<0.001         p<0.001       p<0.001        p=0.023
2026-01-27     19      p<0.001         p<0.001       p<0.001        p=0.018
2026-02-01     24      p<0.001         p<0.001       p<0.001        p=0.039
2026-02-05     28      p<0.001         p<0.001       p<0.001        p=0.035
2026-02-06     29      p<0.001         p<0.001       p<0.001        p=0.027
2026-02-07     30      p<0.001         p<0.001       p<0.001        p=0.049
2026-02-08     31      p<0.001         p<0.001       p<0.001        p=0.084
2026-02-09     32      p<0.001         p<0.001       p<0.001        p=0.195
2026-02-12     35      p<0.001         p<0.001       p<0.001        p=0.101
2026-02-14     37      p<0.001         p<0.001       p<0.001        p=0.041
2026-02-15     38      p<0.001         p<0.001       p<0.001        p=0.017
2026-02-16     39      p<0.001         p<0.001       p<0.001        p=0.012
2026-02-18     41      p<0.001         p<0.001       p<0.001        p=0.044
2026-02-19     42      p<0.001         p<0.001       p<0.001        p=0.102
2026-02-21     44      p<0.001         p<0.001       p<0.001        p=0.026
2026-02-22     45      p<0.001         p<0.001       p<0.001        p=0.066
2026-02-26     49      p<0.001         p<0.001       p<0.001        p=0.076
2026-02-28     51      p<0.001         p<0.001       p<0.001        p=0.232

False Positive Rate Summary

Metric             Significant   Total   FPR    Assessment
HRV (RMSSD)        20            20      100%   Liberal (FPR=100%, binom p<0.001)
Lowest HR          20            20      100%   Liberal (FPR=100%, binom p<0.001)
Average HR         20            20      100%   Liberal (FPR=100%, binom p<0.001)
Sleep Efficiency   12            20      60%    Liberal (FPR=60%, binom p<0.001)
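
The binomial p-values in the Assessment column can be approximated with a short calculation. The sketch below is an illustration, not the report's own code: it assumes a one-sided exact binomial test (the report does not state which variant was used) and treats the 20 placebo tests as independent, which is optimistic because they all reuse the same pre-treatment nights. The function name `binom_upper_tail` is hypothetical.

```python
from math import comb


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): a one-sided exact binomial p-value."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Counts of significant placebo tests per metric, copied from the summary table.
significant = {
    "HRV (RMSSD)": 20,
    "Lowest HR": 20,
    "Average HR": 20,
    "Sleep Efficiency": 12,
}
N_TESTS = 20   # placebo dates per metric
ALPHA = 0.05   # nominal false positive rate

for metric, k in significant.items():
    fpr = k / N_TESTS
    p_value = binom_upper_tail(k, N_TESTS, ALPHA)
    print(f"{metric}: FPR={fpr:.0%}, one-sided binomial p={p_value:.3g}")

# Mean FPR across the four metrics (the figure quoted in the calibration summary).
mean_fpr = sum(significant.values()) / (N_TESTS * len(significant))
print(f"Mean FPR: {mean_fpr:.1%}")
```

Under these assumptions the tail probabilities are far smaller than 0.001, which is why a display rounded to three decimals shows them as zero.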

P-Value Distribution Under Null

Under correct calibration, p-values should be approximately uniformly distributed (flat histogram). A spike near 0 suggests the model is liberal.
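
One rough way to quantify the departure from uniformity is a Kolmogorov-Smirnov distance against Uniform(0, 1). The sketch below applies it to the 20 Sleep Efficiency p-values from the results table, the only metric with exact values reported. It is an illustrative check, not part of the report's pipeline: the function name is hypothetical, the critical value is the standard large-sample approximation 1.36/sqrt(n), and the placebo tests are treated as independent even though they share data.

```python
import math

# Sleep Efficiency placebo p-values, copied from the results table in date order.
sleep_eff_p = [
    0.066, 0.028, 0.023, 0.018, 0.039, 0.035, 0.027, 0.049, 0.084, 0.195,
    0.101, 0.041, 0.017, 0.012, 0.044, 0.102, 0.026, 0.066, 0.076, 0.232,
]


def ks_uniform_statistic(p_values):
    """Largest gap between the empirical CDF and the Uniform(0, 1) CDF."""
    ps = sorted(p_values)
    n = len(ps)
    d_plus = max((i + 1) / n - p for i, p in enumerate(ps))   # ECDF above the diagonal
    d_minus = max(p - i / n for i, p in enumerate(ps))        # ECDF below the diagonal
    return max(d_plus, d_minus)


d_stat = ks_uniform_statistic(sleep_eff_p)
critical_5pct = 1.36 / math.sqrt(len(sleep_eff_p))  # asymptotic approximation
share_below_alpha = sum(p < 0.05 for p in sleep_eff_p) / len(sleep_eff_p)

print(f"KS distance from uniform: {d_stat:.3f} (approx. 5% critical value {critical_5pct:.3f})")
print(f"Share of p-values below 0.05: {share_below_alpha:.0%}")
```

A distance well above the critical value is consistent with the spike near 0 described above.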

Methodology

Purpose: Placebo (falsification) tests check whether the statistical methods used for treatment-effect estimation produce false positives at the expected nominal rate. If they do, the p-values from the real analysis are trustworthy.

Method: 20 random dates were drawn (seed=42) from the pre-treatment period (2026-01-08 to 2026-03-15), each at least 14 days from the window edges. At each placebo date, the pre-treatment data was split into "before" and "after" segments, and a two-sided Mann-Whitney U test compared the two segments for each metric. CausalImpact was not available in this environment.
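
The procedure can be sketched in standard-library Python as follows. This is a minimal illustration under stated assumptions, not the report's implementation: the function names are hypothetical, the Mann-Whitney p-value uses the normal approximation with tie and continuity corrections rather than an exact method, the split assigns the placebo date itself to the "after" segment, and the nightly values are synthetic noise rather than the Oura data. Python's `random` module seeded with 42 will not necessarily reproduce the dates listed in the results table.

```python
import math
import random
from datetime import date, timedelta


def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney U p-value (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        return 1.0
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1                      # pooled[i:j] is one group of tied values
        avg_rank = (i + 1 + j) / 2      # mean of ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for _, g in pooled[i:j] if g == 0)
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var <= 0:
        return 1.0                      # all values identical: no evidence of a shift
    z = max(abs(u - n1 * n2 / 2) - 0.5, 0.0) / math.sqrt(var)  # continuity correction
    return math.erfc(z / math.sqrt(2))  # equals 2 * (1 - Phi(z))


def placebo_dates(start, end, n_dates=20, margin_days=14, seed=42):
    """Draw placebo dates at least margin_days away from both window edges."""
    first = start + timedelta(days=margin_days)
    last = end - timedelta(days=margin_days)
    candidates = [first + timedelta(days=k) for k in range((last - first).days + 1)]
    return sorted(random.Random(seed).sample(candidates, n_dates))


def placebo_p_values(series, start, end):
    """series maps date -> nightly value; returns {placebo_date: p_value}."""
    results = {}
    for d in placebo_dates(start, end):
        before = [v for day, v in series.items() if day < d]
        after = [v for day, v in series.items() if day >= d]
        results[d] = mann_whitney_p(before, after)
    return results


START, END = date(2026, 1, 8), date(2026, 3, 15)
noise = random.Random(0)
# Synthetic stand-in for one nightly metric (NOT the report's data).
synthetic = {
    START + timedelta(days=k): noise.gauss(50, 5)
    for k in range((END - START).days + 1)
}
p_values = placebo_p_values(synthetic, START, END)
fpr = sum(p < 0.05 for p in p_values.values()) / len(p_values)
print(f"{len(p_values)} placebo dates, FPR = {fpr:.0%}")
```

Because every placebo split reuses the same nights, the 20 p-values for a metric are correlated with one another, which is worth keeping in mind when reading the FPR as if it came from 20 separate experiments.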

Expected result: ~5% of placebo tests should be significant (1 out of 20). If the observed FPR is much higher, the real treatment-effect p-values may be overconfident.

Interpretation:

  • Well-calibrated: FPR within ~2 percentage points of 5%
  • Conservative: FPR near 0% (tests are too strict, may miss real effects)
  • Liberal: FPR significantly above 5% (tests find "effects" where none exist)