Cross-validation in scikit-learn is important because it gives us a way not only to train a model but also to estimate how well it will perform on unseen data. Without cross-validation we would score the model on the very same data we used to train it, producing optimistic results that mask overfitting.
The process for cross-validation involves splitting the data into unequal parts, using the larger set to train the model and the smaller set to test and score. Thus, the algorithm isn’t testing results on the very same data it was trained with.
Random Splitting with train_test_split
The simplest way to split our data would be this:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.3)
Here df holds the features, y is the target, and test_size is the fraction of the data reserved for the test set. This randomly divides our data into a 70/30 split of training vs. testing data, which is sufficient for most cases.
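As a minimal, self-contained sketch (substituting a synthetic dataset for the df and y above), the 70/30 split can be verified by checking the resulting shapes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the article's df and y: 100 rows, 4 features
X = np.random.rand(100, 4)
y = np.random.rand(100)

# 70/30 split; random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)  # (70, 4) (30, 4)
```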
The underlying assumption of a random split like this one is that the test data set (and future data, for that matter) is exchangeable with the training data. Unfortunately that isn’t always the case.
One notable case where train_test_split is not appropriate is time-series data. For example, if you are using past sales data to predict future sales, a random split is the last thing you want. Instead, the best practice is to train the model on at least a year of past sales in chronological order, then test the resulting model on the following six months to gauge its effectiveness. For time-series data, splitting at a point in time is the only correct method.
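For completeness, scikit-learn ships a splitter built for exactly this situation, TimeSeriesSplit, which always places the test indices after the training indices. A short sketch with dummy chronological data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve observations in chronological order (e.g. monthly sales)
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # Each test fold comes strictly after every training index
    print("train:", train_idx, "test:", test_idx)
```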
For cases that do benefit from random splits, a more thorough (and more involved) way to use cross-validation is cross_val_score. This function divides the data into a specified number of ‘folds’, holds out each fold in turn as the test set, trains on the remaining folds, and records a score for each iteration. Visually represented, the process looks like this:
This process is called K-fold cross-validation, with ‘K’ being the number of folds.
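The fold assignment can be inspected directly with scikit-learn's KFold splitter; a short sketch with ten samples shows each one landing in the test set exactly once:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)

# K = 5 folds: each sample is held out exactly once
kf = KFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"fold {fold}: test indices = {test_idx}")
```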
Coding cross_val_score is a bit more involved than train_test_split because we also have to supply the algorithm we’ll be using to analyze the data. Assuming we have loaded a dataset called df, the code would look as follows:
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
# Create variables X and y for our features and target
# (assumes df has a "target" column; every other column is a feature)
X = df.drop(columns="target")
y = df["target"]
# Use a form of regression called Lasso, and apply cross_val_score
lasso = linear_model.Lasso()
scores = cross_val_score(lasso, X, y, cv=5)
“cv” is the number of folds. cross_val_score returns one score per fold; if you want to see them, just print the variable ‘scores.’
Again, this would work in most situations but not time-series.
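Putting it all together as a runnable sketch (substituting scikit-learn's built-in diabetes dataset for the article's df), with a summary of the per-fold scores:

```python
from sklearn import linear_model
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score

# Built-in regression dataset stands in for the article's df
X, y = load_diabetes(return_X_y=True)

lasso = linear_model.Lasso()
scores = cross_val_score(lasso, X, y, cv=5)

# One R^2 score per fold, plus a summary across folds
print(scores)
print(f"mean R^2: {scores.mean():.3f} (+/- {scores.std():.3f})")
```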