Draft:Purged Cross-Validation
Purged and Embargoed Cross-Validation is a time-series-aware model validation technique designed to address information leakage in financial machine learning. It is especially applicable when labels span time intervals rather than individual points in time. The method modifies standard k-fold cross-validation by incorporating two critical mechanisms: purging and embargoing.
Motivation
Traditional cross-validation assumes independent and identically distributed (i.i.d.) observations, an assumption that is violated in financial time series. In many financial labeling schemes, each observation represents an event with a starting time and an ending time (e.g., a position held over several days). If a model is trained on data whose label intervals overlap with those in the test set, then future information is inadvertently included in training—this is known as look-ahead bias or information leakage.[1][2]
Purged cross-validation was introduced to ensure that the training set is uncontaminated by information from the test set.[3][4]
Purging
Purging removes from the training set any observation whose label ends after the test set begins, since such a label interval overlaps the test period. To handle overlapping labels in financial time series, the following notation is used:
- t1: A pandas.Series that maps each observation to the end time of its label.
- [i, j): The index range of the test set in a given fold.
- t0: The start time of the test set, i.e., the timestamp at index i.
- test_max: The latest end time reached by any label in the test set, defined as max(t1[k]) for k in [i, j).
To prevent information leakage, training samples must satisfy:
t1[k] ≤ t0 OR k > index.searchsorted(test_max) + embargo
This ensures two things:
- No training label ends after the test set starts (purging).
- No training sample falls within an embargo period after the test set ends.
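The selection rule above can be sketched as a small Python helper. This is a minimal illustration, not a standard library function: the name purged_train_indices and the use of integer positions as both index and timestamps are assumptions for the sake of the example.

```python
import numpy as np
import pandas as pd

def purged_train_indices(t1, i, j, embargo):
    """Training positions for one fold whose test range is [i, j).

    t1 maps each observation to the end time of its label. Purging keeps
    only observations whose labels end before the test set starts; the
    embargo additionally skips observations just after the test set.
    """
    t0 = t1.index[i]                         # start time of the test set
    test_max = t1.iloc[i:j].max()            # latest label end time in the test set
    cutoff = t1.index.searchsorted(test_max) + embargo
    return np.array([k for k in range(len(t1))
                     if t1.iloc[k] <= t0     # purging: label ends before test starts
                     or k > cutoff])         # or lies beyond the embargo period

# 10 observations; each label ends two time steps after it starts.
t1 = pd.Series(np.arange(10) + 2, index=np.arange(10))
train = purged_train_indices(t1, i=4, j=7, embargo=1)
# Only observations 0-2 survive: labels 3 and 4 spill into the test
# window, and everything after the test set falls inside the embargo.
```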
Embargoing
Embargoing addresses a more subtle form of leakage: even if an observation does not directly overlap the test set, it may still be affected by test events due to market reaction lag or downstream dependencies. To guard against this, a percentage-based embargo is imposed after each test fold. For example, with a 5% embargo and 1000 observations, the 50 observations following each test fold are excluded from training.
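The arithmetic of the 5% example works out as follows (the position 600 for the end of a test fold is an illustrative assumption):

```python
# Embargo width for the example in the text: 5% of 1,000 observations.
n_obs = 1000
embargo_pct = 0.05
embargo = int(n_obs * embargo_pct)

# If a test fold ends just before position 600, the embargo additionally
# withholds the next `embargo` observations (600 through 649) from training.
test_end = 600
embargoed = list(range(test_end, test_end + embargo))
```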
Example
To illustrate the effect of purging and embargoing, consider the figures below. Both diagrams show the structure of 5-fold cross-validation over a 20-day period. In each row, blue squares indicate training samples and red squares denote test samples. Each label is defined based on the value of the next two observations, creating an overlap between adjacent labels. If this overlap is left untreated, test set information leaks into the training set.
[Figure 1: Standard 5-fold cross-validation with overlapping labels]

[Figure 2: 5-fold cross-validation with purging and embargoing applied]
The second figure applies the purged and embargoed procedure. Notice how purging removes overlapping observations from the training set and the embargo widens the gap between test and training data. This approach ensures that the evaluation more closely resembles a true out-of-sample test and reduces the risk of backtest overfitting.
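The 20-day, 5-fold setting described above can be reproduced with a short script. This is a sketch under stated assumptions: integer day indices, contiguous equal-sized folds, and labels that end two days after they start, mirroring the overlap in the figures.

```python
import numpy as np
import pandas as pd

n = 20                                    # 20 daily observations
t1 = pd.Series(np.arange(n) + 2,          # each label ends two days later,
               index=np.arange(n))        # creating the overlap in the figures

n_splits = 5
embargo = int(n * 0.05)                   # 5% embargo -> 1 observation here
fold_size = n // n_splits

trains = []
for fold in range(n_splits):
    i, j = fold * fold_size, (fold + 1) * fold_size   # test range [i, j)
    t0 = t1.index[i]                                  # test start time
    test_max = t1.iloc[i:j].max()                     # latest label end in test
    cutoff = t1.index.searchsorted(test_max) + embargo
    trains.append([k for k in range(n)
                   if t1.iloc[k] <= t0 or k > cutoff])

# For the first fold, purging plus the embargo leave only days 7-19 in
# training; for the last fold, day 15 is purged because its label spills
# into the test window.
```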
Comparison to Standard K-Fold
Feature | Standard K-Fold | Purged & Embargoed CV
---|---|---
Assumes i.i.d. | Yes | No
Handles overlapping labels | No | Yes
Prevents information leakage | No | Yes
Suitable for financial time series | Poorly | Well
Applications
Purged and embargoed cross-validation is especially useful in:
- Backtesting of trading strategies[5]
- Validation of classifiers on labeled event-driven returns[4]
- Any machine learning task with overlapping label horizons[1][6]
See also
- Cross-validation (statistics)
- Sharpe ratio
- Backtesting
- Information leakage
- Machine learning in finance
References
- ^ a b López de Prado, M. (2018). "The 10 Reasons Most Machine Learning Funds Fail". The Journal of Portfolio Management, 44(6), pp. 120–133. DOI: 10.3905/jpm.2018.44.6.120
- ^ Bailey, D. H., Borwein, J. M., López de Prado, M., & Zhu, Q. J. (2014). "The Probability of Backtest Overfitting". Journal of Computational Finance, 20(4).
- ^ López de Prado, M. (2018). Advances in Financial Machine Learning. John Wiley & Sons. ISBN 978-1-119-48208-6.
- ^ a b López de Prado, M. (2020). Machine Learning for Asset Managers. Cambridge University Press. https://www.amazon.com/Machine-Learning-Managers-Elements-Quantitative/dp/1108792898
- ^ Joubert, J., Sestovic, D., Barziy, I., Distaso, W., & López de Prado, M. (2024). "Enhanced Backtesting for Practitioners". The Journal of Portfolio Management, 51(2), pp. 12–27. DOI: 10.3905/jpm.2024.1.637
- ^ López de Prado, M. & Zoonekynd, V. (2025). "Correcting the Factor Mirage: A Research Protocol for Causal Factor Investing". Available at SSRN: https://ssrn.com/abstract=4697929 or http://dx.doi.org/10.2139/ssrn.4697929