Leakage-Safe Research Design
The 50-company public package is designed as an auditable empirical finance workflow: filings are timestamped, features are fit only on training windows, models are tuned only on validation windows, and portfolio outputs are treated as diagnostics.
Method pipeline
From Filing Time to Claim Boundary
Controls
Research Controls
Event-Time Alignment
Filing timestamps are mapped to prediction times before labels are constructed, so future price windows cannot enter features.
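The alignment rule above can be sketched as follows. This is a minimal illustration, not the package's code: the 16:00 cutoff, the function names `prediction_day` and `label_window`, and the reading of the label as covering trading days 1 through `horizon` after the prediction day are all assumptions.

```python
from bisect import bisect_left
from datetime import date, datetime, time

MARKET_CLOSE = time(16, 0)  # assumed cutoff; the package's actual cutoff is not stated


def prediction_day(filing_ts: datetime, trading_days: list[date]) -> date:
    """Return the first trading day whose close falls at or after the filing timestamp."""
    i = bisect_left(trading_days, filing_ts.date())
    # A filing that arrives after the close is only usable on the next session.
    if (i < len(trading_days)
            and trading_days[i] == filing_ts.date()
            and filing_ts.time() > MARKET_CLOSE):
        i += 1
    if i >= len(trading_days):
        raise ValueError("filing is later than the last known trading day")
    return trading_days[i]


def label_window(pred_day: date, trading_days: list[date], horizon: int) -> tuple[date, date]:
    """Return the forward label window: trading days 1..horizon after the prediction day."""
    i = bisect_left(trading_days, pred_day)
    if i == len(trading_days) or trading_days[i] != pred_day:
        raise ValueError("prediction day is not a trading day")
    if i + horizon >= len(trading_days):
        raise ValueError("not enough future trading days for this horizon")
    return trading_days[i + 1], trading_days[i + horizon]
```

Because the label window starts strictly after the prediction day, any feature computed from data available at the prediction time cannot overlap the prices used to build the label.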
Rolling Splits
Experiments use rolling train, validation, and test windows, and keep purge records so that forward-label overlap between windows can be checked.
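One way to picture rolling windows with purging is the sketch below. It assumes observations are indexed by integer period and that a label at period t uses the next `horizon` periods; the `Split` record and `rolling_splits` function are hypothetical names, not the package's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    train: range
    valid: range
    test: range
    purged: tuple[int, ...]  # periods dropped because their labels reach into the next window


def rolling_splits(n_periods: int, train_len: int, valid_len: int,
                   test_len: int, horizon: int) -> list[Split]:
    """Build rolling train/validation/test windows, purging the tail of each fitted window."""
    if horizon >= min(train_len, valid_len):
        raise ValueError("horizon must be shorter than the train and validation windows")
    splits = []
    start = 0
    while start + train_len + valid_len + test_len <= n_periods:
        train_end = start + train_len
        valid_end = train_end + valid_len
        test_end = valid_end + test_len
        # The last `horizon` periods of a window have labels that look into the next one.
        purged = tuple(range(train_end - horizon, train_end)) + \
            tuple(range(valid_end - horizon, valid_end))
        splits.append(Split(
            train=range(start, train_end - horizon),
            valid=range(train_end, valid_end - horizon),
            test=range(valid_end, test_end),
            purged=purged,
        ))
        start += test_len  # roll forward by one test window (an assumed step size)
    return splits
```

Keeping the purged indices on each split gives an audit trail: a reviewer can confirm that the last retained training period plus the horizon ends before the validation window begins.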
Train-Window-Only TF-IDF
TF-IDF/SVD vocabularies are fit inside each training window and tracked with vocabulary manifests and hashes.
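The idea of fitting a vocabulary inside one training window and fingerprinting it can be illustrated with the standard library alone. This is a simplified stand-in under stated assumptions: whitespace tokenization, a common smoothed IDF form, no SVD step, and hypothetical function names (`fit_vocabulary`, `manifest_hash`, `transform`). The package's actual tokenizer and manifest format are not described in the source.

```python
import hashlib
import json
import math
from collections import Counter


def fit_vocabulary(train_docs: list[str], min_df: int = 1) -> dict[str, float]:
    """Fit term -> IDF weights using documents from the training window only."""
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.lower().split()))
    n = len(train_docs)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return {
        term: math.log((1 + n) / (1 + df)) + 1.0
        for term, df in sorted(doc_freq.items())
        if df >= min_df
    }


def manifest_hash(idf: dict[str, float]) -> str:
    """SHA-256 over the sorted term list, so any vocabulary change is detectable."""
    payload = json.dumps(sorted(idf), separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def transform(doc: str, idf: dict[str, float]) -> dict[str, float]:
    """Weight a document with the fitted vocabulary; unseen terms are dropped."""
    counts = Counter(t for t in doc.lower().split() if t in idf)
    return {term: count * idf[term] for term, count in counts.items()}
```

Terms that appear only in validation or test filings never enter the vocabulary, and the hash recorded at fit time lets a later audit confirm that the same vocabulary was used to transform out-of-window documents.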
Validation-Only Tuning
Ridge and XGBoost hyperparameters are selected on validation Rank IC; test metrics are not used for tuning.
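Rank IC is taken here to mean the Spearman rank correlation between predictions and realized values, which is the usual definition; the source does not spell it out. The sketch below shows selection that only ever sees validation data. The function names and the candidate labels are hypothetical.

```python
from statistics import mean


def _average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_ic(predicted: list[float], realized: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rp, rr = _average_ranks(predicted), _average_ranks(realized)
    mp, mr = mean(rp), mean(rr)
    cov = sum((a - mp) * (b - mr) for a, b in zip(rp, rr))
    var_p = sum((a - mp) ** 2 for a in rp)
    var_r = sum((b - mr) ** 2 for b in rr)
    if var_p == 0 or var_r == 0:
        return 0.0  # correlation is undefined for constant input; treated as 0 here (assumption)
    return cov / (var_p * var_r) ** 0.5


def select_on_validation(valid_predictions: dict[str, list[float]],
                         valid_realized: list[float]) -> str:
    """Pick the candidate with the highest validation Rank IC; test data is never passed in."""
    return max(valid_predictions, key=lambda name: rank_ic(valid_predictions[name], valid_realized))
```

Since `select_on_validation` has no parameter for test predictions or test labels, the separation between tuning and final evaluation is enforced by the function signature rather than by convention.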
Preregistered Rules
Primary prediction and portfolio specifications are separated from robustness and exploratory comparisons.
Multiple Testing Disclosure
The run reports 568 tested specifications with Bonferroni, Holm, and Benjamini-Hochberg FDR adjustments.
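For readers unfamiliar with the three adjustments, the following sketch computes adjusted p-values with the standard definitions. It is illustrative only; the function names are hypothetical and the toy inputs in the usage note are not the package's results.

```python
def bonferroni(p_values: list[float]) -> list[float]:
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values: list[float]) -> list[float]:
    """Holm step-down: scale the k-th smallest p-value by (m - k + 1), enforcing monotonicity."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Benjamini-Hochberg step-up: scale the k-th smallest p-value by m / k, enforcing monotonicity."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        running_min = min(running_min, p_values[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted
```

With toy inputs `[0.01, 0.04, 0.03, 0.005]`, Bonferroni gives `[0.04, 0.16, 0.12, 0.02]`, Holm gives `[0.03, 0.06, 0.06, 0.02]`, and Benjamini-Hochberg gives `[0.02, 0.04, 0.04, 0.02]`, up to floating-point rounding. Bonferroni and Holm control the family-wise error rate, while Benjamini-Hochberg controls the false discovery rate and is therefore less conservative.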
Audit chart
Coverage Waterfall
Raw label coverage includes labels that fall outside the configured out-of-sample (OOS) windows. Eligible OOS prediction coverage is the relevant model-completeness metric.
Prediction Boundary
The preregistered primary prediction provides exploratory volatility-forecasting evidence: Ridge on realized_volatility_1_20 reaches a Rank IC of 0.2606 with a raw p-value of 0.00017.
Trading Boundary
Portfolio results remain diagnostic. The current package does not establish formal tradable alpha, does not constitute investment advice, and does not provide CRSP/WRDS-equivalent asset-pricing evidence.