
Understanding Train-Test Split Model Validation


Train-Test Split

Train-test split model validation is a simple and common technique in machine learning for evaluating the performance of a model. The idea is to split the dataset into two subsets: one for training and one for testing. However, it’s important to note that the performance estimate obtained this way can vary depending on how the data is split, especially when dealing with small datasets. The exact split differs from project to project, but very often the data is divided into 80% training and 20% testing, or 70% training and 30% testing. For more reliable performance estimates, we recommend cross-validation techniques.
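For concreteness, here is a minimal sketch of an 80/20 split using scikit-learn’s train_test_split. The arrays X and y are random placeholder data, not a real dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples, 3 features each, one target value per sample.
X = np.random.rand(100, 3)
y = np.random.rand(100)

# Hold out 20% of the rows for testing; random_state fixes the shuffle
# so the split is reproducible across runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
```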

How the method works

  1. Data Preparation – you start by splitting your dataset into two parts: a training set and a test set. The training set is used to train the model, while the test set is used to evaluate the model’s performance.
  2. Model Training – you train your machine learning model on the training set. This involves feeding the features and corresponding labels (or target variable) to the model so that it can learn the underlying patterns in the data.
  3. Model Evaluation – after training the model, you use the test set to evaluate its performance. The model makes predictions on the test set, and you compare these predictions with the actual labels. The goal is to assess how well the model generalises to new, unseen data.
  4. Performance Metrics – depending on the type of problem (classification or regression), you use different metrics to evaluate the model on the test set. In classification tasks, you might use metrics like accuracy, precision, recall, and F1 score. In regression tasks, common metrics include mean squared error (MSE), mean absolute error (MAE), and R-squared. Steps 1–4 are shown in the code sketch after this list.
  5. Iterative Process – train-test split validation can be an iterative process. You may need to adjust the model hyperparameters, feature selection, or other aspects of the model. However, it’s essential to avoid “peeking” at the test set too many times or making decisions based solely on its performance, as this can lead to overfitting to the test set.
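These steps map onto only a few lines of code. The sketch below shows one possible classification workflow, using scikit-learn’s bundled Iris dataset and a logistic regression model purely as stand-ins; any estimator and dataset would follow the same pattern:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Step 1: data preparation - split the features and labels 80/20.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: model training - fit on the training set only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 3: model evaluation - predict on the held-out test set.
y_pred = model.predict(X_test)

# Step 4: performance metrics - compare predictions with the true labels.
print("accuracy:", accuracy_score(y_test, y_pred))
print("F1 (macro):", f1_score(y_test, y_pred, average="macro"))
```

Note that the model is fitted on the training split only; the test split is touched once, at evaluation time.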

Example

We have a dataset of housing prices with features such as the number of bedrooms, square footage, and neighborhood, along with the target variable: the price of the house. We want to build a machine learning model that predicts a house’s price from the other features.

  1. Data preparation – we split the dataset into two subsets: a training set (80% of the data) and a test set (20% of the data).
  2. Model training – we feed the features (bedrooms, size, etc.) and target variable (price) into the machine learning model.
  3. Model evaluation – we use the test set to evaluate the performance of the model. The test set predictions are compared against the actual prices.
  4. Performance metrics – we use metrics such as MSE, MAE, and R-squared to evaluate the model’s performance on the test set (see the sketch after this list).
  5. Iterative process – we interpret the model’s performance; if we are not satisfied with the result, we adjust the model and iterate again.
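A hypothetical version of this pipeline in code, with synthetic data standing in for the housing dataset (the feature names, coefficients, and noise level below are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the housing data: columns are bedrooms and
# square footage; price is a noisy linear function of both.
n = 200
bedrooms = rng.integers(1, 6, size=n)
sqft = rng.uniform(500, 3500, size=n)
X = np.column_stack([bedrooms, sqft])
price = 20_000 * bedrooms + 150 * sqft + rng.normal(0, 10_000, size=n)

# Steps 1-3: split 80/20, train on the training set, predict on the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Step 4: regression metrics on the held-out test set.
print("MSE:", mean_squared_error(y_test, y_pred))
print("MAE:", mean_absolute_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))
```

If the R-squared or error values are unsatisfactory, step 5 kicks in: adjust the model or features and repeat, while being careful not to tune against the test set itself.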