It’s worth keeping in mind that train-and-test is common, but it is not the only widely used approach in machine learning. Two of the more common alternatives are the hold-out approach and the statistical approach.
- Hold-out approach: The hold-out approach is like train-and-test, but instead of splitting a dataset into two, we split it into three: training, test (also known as validation), and hold-out. The training and test datasets are as we’ve described before. The hold-out dataset is a kind of test set that is used only once, when we are ready to deploy our model for real-world use.
- Statistical approach: Simpler models that originated in statistics often don’t need test datasets. Instead, we can assess directly how likely the model is to be overfit, using a measure of statistical significance: the p-value.
These statistical methods are powerful, well established, and form the foundation of modern science. The advantage is that the data never needs to be split, and we get a much more precise understanding of how confident we can be about a model. For example, a p-value of 0.01 means that, if no real relationship existed, there would be only a 1% chance of seeing a pattern at least this strong in our data, so a spurious finding is unlikely. By contrast, a p-value of 0.5 means a pattern this strong would turn up half the time by chance alone, so however good the model looks on our training data, we have little evidence that it will hold up in the real world.
The downside to these methods is that they’re only easily applied to certain model types, such as linear regression. For all but the simplest models, the calculations can be extremely complex to perform properly.
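To make the p-value idea concrete, here is a minimal sketch that fits a simple linear regression with SciPy's `linregress`, which reports a p-value for the hypothesis that the slope is zero. The synthetic data, the variable names, and the choice of SciPy are assumptions made for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y depends linearly on x, plus noise.
x = rng.normal(size=200)
y_related = 2.0 * x + rng.normal(size=200)

# A second target that is pure noise, with no real relationship to x.
y_unrelated = rng.normal(size=200)

# linregress fits y = slope * x + intercept and reports a p-value for the
# null hypothesis that the true slope is zero.
fit_related = stats.linregress(x, y_related)
fit_unrelated = stats.linregress(x, y_unrelated)

print(f"related:   slope={fit_related.slope:.2f}, p-value={fit_related.pvalue:.3g}")
print(f"unrelated: slope={fit_unrelated.slope:.2f}, p-value={fit_unrelated.pvalue:.3g}")
```

The model fitted to the related data should report a very small p-value, while the one fitted to pure noise should typically report a much larger one. No test set was needed in either case.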
When building machine learning models, we should try evaluating several different train/test splits. Generally, splits that give more of the data to the training set yield better results, provided the test set stays large enough to give a reliable estimate.
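One way to compare splits is to loop over several test-set sizes and score the same kind of model on each. The sketch below is only an illustration: the synthetic dataset from `make_regression`, the `LinearRegression` model, and the particular split sizes are all assumptions, not part of the original discussion.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data, used here only as a stand-in for a real dataset.
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=2)

scores = {}
for test_size in (0.1, 0.2, 0.3, 0.5):
    # Hold back `test_size` of the rows for testing; train on the rest.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=2
    )
    model = LinearRegression().fit(X_train, y_train)
    # score() returns the R^2 of the model on the held-back test rows.
    scores[test_size] = model.score(X_test, y_test)
    print(f"test_size={test_size:.1f}  train rows={len(X_train)}  R^2={scores[test_size]:.3f}")
```

On a small, easy dataset like this one the differences between splits may be slight. The effect of the split tends to matter more when data is scarce or the model is complex.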
Scikit-Learn is a free machine learning library for Python. It supports both supervised and unsupervised machine learning, providing diverse algorithms for classification, regression, clustering, and dimensionality reduction.
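# Import the splitting helper from scikit-learn
from sklearn.model_selection import train_test_split
# Create the train and test datasets: 30% of the rows go to test, 70% to train.
# `data` is assumed to be an already loaded dataset; random_state makes the split reproducible.
train, test = train_test_split(data, test_size=0.3, random_state=2)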
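The same function can also produce the three-way split used by the hold-out approach described earlier, by calling it twice. This is a minimal sketch: the NumPy array standing in for `data`, the 60/20/20 proportions, and the `remaining` and `hold_out` names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset standing in for `data`: 100 rows, with a row id
# in the first column and one random feature in the second.
rng = np.random.default_rng(2)
data = np.column_stack([np.arange(100), rng.normal(size=100)])

# Step 1: set aside 20% of the rows as the hold-out set, to be used only once.
remaining, hold_out = train_test_split(data, test_size=0.2, random_state=2)

# Step 2: split the remaining 80% into train and test.
# 25% of the remaining rows equals 20% of the original dataset.
train, test = train_test_split(remaining, test_size=0.25, random_state=2)

print(len(train), len(test), len(hold_out))  # 60 20 20
```

Splitting off the hold-out set first keeps it completely separate from any later experimentation with the train and test sets.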