Data Leakage

What is Data Leakage?

Data leakage occurs when a machine learning model is trained on data that should not have been part of the training set. It is a serious problem: the model's evaluation metrics look deceptively strong, yet its predictions on genuinely unseen data suffer, leaving you with an overly optimistic model or outright invalid results.

These issues typically arise with more complex data, so if you aren't already familiar with this concept, it is probably because the datasets you've worked with so far have been relatively simple. Detecting and preventing data leakage is crucial nonetheless, since a leaky model is a biased model, and a biased model is effectively unusable.

Types of Data Leakage

Target Leakage

The first type of data leakage occurs when information from your target variable unintentionally makes its way into your training features. In other words, if data that will not be available at prediction time is used to train the model, you have target leakage. The most obvious example would be using the target variable itself as an input feature: clearly leakage, since you would not have the target in hand when making predictions in a real-world scenario.

Let's consider another example: a fraud detection problem. Suppose one of the features is "number of chargebacks initiated by the customer." If this feature was calculated using knowledge of whether each transaction was fraudulent, it acts as a proxy variable for the target and thus introduces data leakage. Again, any feature based on information that won't be available at prediction time cannot be used for model training.
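To make this concrete, here is a minimal sketch in pandas. The dataset and column names are entirely hypothetical; the point is simply that a feature derived from the label must be dropped before training:

```python
import pandas as pd

# Hypothetical fraud data; column names are illustrative only.
df = pd.DataFrame({
    "amount": [120.0, 85.5, 4000.0, 15.0],
    # Suppose this count was computed using the fraud label itself,
    # making it a proxy for the target.
    "chargebacks_after_fraud_review": [0, 0, 3, 0],
    "is_fraud": [0, 0, 1, 0],
})

# Drop both the target and the leaky proxy feature before training.
X = df.drop(columns=["is_fraud", "chargebacks_after_fraud_review"])
y = df["is_fraud"]
```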

Train-Test Contamination

The other main type of data leakage is when information from the validation or test set ends up in the training data. A common way this happens is when statistical calculations are used to adjust or create features, but instead of computing those statistics on the training data alone, they are computed over the training data plus the validation/test data. The model then learns information from the validation/test set that, in a real-world scenario, it would not have at prediction time.

For a more concrete example, consider predicting housing prices. One of the features is the square footage of the house, and you preprocess it by standardizing, but accidentally compute the mean and standard deviation over the entire dataset rather than just the training set. This causes data leakage: the model benefits from information about the test distribution, which typically shows up as strong validation performance followed by poor performance on truly unseen data.
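Here is a minimal sketch of the mistake and the fix, using scikit-learn's StandardScaler on some synthetic square-footage data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic square footage values, purely for illustration.
X = np.random.default_rng(42).normal(loc=1800, scale=400, size=(1000, 1))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Leaky: the mean and std are computed over ALL rows, test set included.
scaler_leaky = StandardScaler().fit(X)

# Correct: fit on the training set only, then apply to both splits.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```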

How to Detect Data Leakage

There are numerous ways to detect data leakage, and depending on your situation, you will likely need more than one. Let's run down some of the more common ones:

Cross-Validation: This well-known technique can help detect data leakage. A large discrepancy in model performance across different cross-validation folds, or suspiciously high scores overall, is something to look out for here.
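As a rough sketch (using a toy scikit-learn dataset purely for illustration), you can compare the fold scores and their spread:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(random_state=42)

# Near-perfect scores or large gaps between folds are red flags.
scores = cross_val_score(model, X, y, cv=5)
print("fold scores:", scores)
print("std between folds:", scores.std())
```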

Hold-out Dataset: While cross-validation is great, it can be a good idea to set aside an additional validation set just in case. Note this is not the same as the test dataset: the test set is meant for comparing multiple models, whereas the hold-out set is for evaluating the chosen model's performance one last time before moving to production.
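One common way to get such a split is to carve off the test set first and then split the remainder; a quick sketch, with the exact proportions being a judgment call:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Carve off the final test set first, then split the remainder.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)  # roughly 60% train, 20% hold-out, 20% test
```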

Double-Check Preprocessing: To prevent data leakage, make sure no data from the test set got into the training set and vice versa; a simple double-check can often uncover the source of leakage. As a best practice, always separate your test set before you start any data analysis or preprocessing.
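A simple sanity check along these lines (with hypothetical DataFrames) is to assert that the splits share no indices and no identical rows:

```python
import pandas as pd

# Hypothetical train/test splits sharing an index.
train_df = pd.DataFrame({"sqft": [1500, 2100]}, index=[0, 1])
test_df = pd.DataFrame({"sqft": [1750]}, index=[2])

# No row should appear in both splits by index...
overlap = train_df.index.intersection(test_df.index)
assert overlap.empty, f"{len(overlap)} rows appear in both splits!"

# ...and no identical rows should appear under different indices.
dupes = pd.merge(train_df, test_df, how="inner")
print(f"{len(dupes)} identical rows found in both splits")
```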

Correlation Analysis: Check feature importances and examine how strongly each feature correlates with the target variable. It doesn't have to be the target variable either; it could be another feature you already know to be susceptible to leakage. If a feature is suspiciously highly correlated with either of these, think twice before using it.
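As a quick sketch with a hypothetical dataset, pandas makes this a one-liner; a correlation close to 1 with the target is a classic leakage smell:

```python
import pandas as pd

# Hypothetical data; values are illustrative only.
df = pd.DataFrame({
    "amount": [120.0, 85.5, 4000.0, 15.0, 980.0],
    "chargebacks": [0, 0, 3, 0, 1],
    "is_fraud": [0, 0, 1, 0, 1],
})

# Absolute correlation of every feature with the target, strongest first.
print(df.corr()["is_fraud"].drop("is_fraud").abs().sort_values(ascending=False))
```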

Visualize: Where possible, visualization can help; explainability methods like SHAP or LIME can surface features whose influence on predictions is suspiciously large.
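A minimal SHAP sketch might look like the following (toy regression dataset for illustration; exact plotting behavior can vary between SHAP versions). If a single feature dominates the summary plot, inspect it for leakage:

```python
import shap
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
model = RandomForestRegressor(random_state=42).fit(X, y)

# One dominant feature in the summary plot is worth a second look.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)
```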

Time-Series: This one is specific to time-series problems. A common mistake is to split time-series data into training and testing sets randomly; instead, the training data should come from past time periods, with later periods reserved for testing. This ensures you are always predicting future outcomes from past data.
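scikit-learn's TimeSeriesSplit does exactly this: every fold trains on the past and tests on the future. A quick sketch, assuming rows are sorted chronologically:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Stand-in data; rows are assumed to be in chronological order.
X = np.arange(20).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    print(f"train: {train_idx[0]}..{train_idx[-1]}  test: {test_idx[0]}..{test_idx[-1]}")
```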

Domain Knowledge: If model performance seems too good to be true, or you suspect leakage for any other reason, sometimes it's best to consult the experts. Applying domain knowledge, or asking domain experts directly, can help discern the root cause. Code reviews are also standard practice in software engineering, so take a page from that book and have a fellow data scientist check your work; they may spot the issue!

Preventing Data Leakage

Of course, it's best to prevent data leakage in the first place. The points below may look similar to the detection methods above, but they are more about guidelines and best practices worth adhering to:

Strict Data Separation: Keep your train, validation, and test sets strictly separate so that none of them can get mixed together.

Feature Engineering: Make sure all your features will realistically be available at prediction time. Also, extending the point above, make sure no information from the validation or test sets sneaks into the training dataset during feature engineering.

Preprocessing: Similar to the points above, but make sure that when handling missing values or performing data transformations, you fit those steps using information from the training set only, never the validation or test sets. Also, remember to set a random seed for reproducibility!
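One way to bake this in, sketched below with scikit-learn, is to put every preprocessing step inside a Pipeline; during cross-validation, each fold then fits the imputer and scaler on its own training portion only:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Preprocessing lives inside the pipeline, so each CV fold fits the
# imputer and scaler on its training portion only, preventing leakage.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000, random_state=42)),
])
print(cross_val_score(pipe, X, y, cv=5))
```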

Domain Expertise: It also helps to let domain experts guide you, as they can spot potential bias or leakage during the EDA and preprocessing phases.

Conclusion

While data leakage is not commonly covered in beginner data science or machine learning courses, likely because of the limited scope of the practice projects they come with, it is essential to understand if you want models that keep performing well over time. Awareness of data leakage ensures that models are built on sound principles and maintain their performance in real-world scenarios.

This article covered detection and prevention methods at only an introductory level. The exact implementation will vary from problem to problem, but hopefully this article can serve as a handy supplementary reference.
