Unraveling the Mysteries of Underfitting and Overfitting in Data Science Projects

Serdar Tafralı
5 min read · Aug 24, 2023

In a data science project, after creating a suitable model and implementing it, we often face certain problems when evaluating the model’s success. Two such issues are underfitting and overfitting.

The goal of the model is to extract the relationship and meaning between dependent and independent variables. In this case, we expect the model to learn the structure of the data rather than memorizing it.

📌 Underfitting: When the model doesn’t learn the data well enough. It has high bias and low variance, and its generalization ability is weak.

📌 Overfitting: When the model memorizes the data. It has high variance and low bias. Because the prediction function follows the training observations too closely, its accuracy drops when it faces new data in the test set.

The ideal model has low bias and low variance. It represents the structure, pattern, or relationship in the dataset without memorizing or misrepresenting the data.

How Can We Detect Overfitting?

We can identify overfitting by evaluating the training and test sets together in terms of model complexity and prediction error: track how the error in each set changes as the model grows more complex.

When the two errors start to diverge (where the fork begins), overfitting has begun.
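As a rough illustration of this, the minimal sketch below (my own example, built on scikit-learn with a synthetic noisy sine-wave dataset and polynomial degree standing in for model complexity) prints training and validation error side by side; the degree at which the two values fork apart is where memorization sets in.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Small, noisy synthetic dataset: easy for a complex model to memorize.
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=60)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=42)

for degree in [1, 3, 5, 10, 15]:
    # Higher polynomial degree = higher model complexity.
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    val_err = mean_squared_error(y_val, model.predict(X_val))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  validation MSE={val_err:.3f}")

# Training error keeps falling as the degree grows, while validation error
# bottoms out and then rises: the point where the two curves fork apart
# is where overfitting begins.
```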

How Do We Prevent Overfitting?

To make a model more accurate and capable of more detailed predictions, we typically enrich it; the specifics differ for linear models, tree methods, and neural networks, but in every case this means increasing model complexity and, with it, the likelihood of overfitting. The main philosophy behind preventing overfitting is therefore to limit model complexity (training duration, iteration count, and so on). Increasing complexity reduces the error up to a certain point, but past the point of optimum performance the model starts memorizing the training set; when it is then evaluated on the test set, the error rises, and overfitting occurs.

It’s also crucial to analyze correlations, missing values, and outliers to prevent overfitting. For instance, highly correlated independent variables carry essentially the same information, which can introduce bias and encourage overfitting, so variables that correlate strongly with one another should be re-examined.
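For the correlation check, here is a hedged sketch of one possible approach with pandas; the 0.90 threshold and the DataFrame name `df` are illustrative assumptions rather than fixed rules.

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(df: pd.DataFrame, threshold: float = 0.90) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so that each pair is reported once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return upper.stack().loc[lambda s: s > threshold].sort_values(ascending=False)

# Example usage, assuming `df` holds the independent variables:
# print(high_correlation_pairs(df, threshold=0.90))
# Columns appearing in these pairs are candidates to drop, combine, or re-engineer.
```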

To address the overfitting issue, various methods can be employed. Some of these methods include:

  • Regularization: A technique that helps prevent overfitting by penalizing model complexity. Common variants are L1 (Lasso) and L2 (Ridge) regularization: L1 can shrink the weights of insignificant features all the way to zero, effectively removing them, while L2 shrinks all weights toward zero, both improving the model’s generalization capability (a short sketch combining regularization with cross-validation follows this list).
  • Bagging (bootstrap aggregating): This method trains multiple base learners on bootstrap samples (random samples drawn from the data with replacement) and aggregates their predictions, which increases the model’s generalization capability. It’s especially effective for high-variance models like decision trees; random forest is one of the best-known examples of this approach.
  • Data Augmentation: By expanding the dataset, this method offers the model more learning opportunities. Techniques like transforming and rotating samples aim to enable the model to generalize better to new data. This method is frequently employed in areas like deep learning and image recognition.
  • Early Stopping: This technique prevents overfitting by stopping training as soon as the error on a held-out validation set starts to increase, before the model loses its ability to generalize to unseen data.
  • Cross-validation: This method divides the dataset into multiple parts and evaluates the model’s performance by using each segment as a test set in turn. By assessing the model’s performance on different subsets of the data, the risk of overfitting is minimized. K-fold and stratified k-fold cross-validation are the most commonly used methods.
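To make the first and last items concrete, here is a minimal sketch (with a synthetic dataset and alpha values chosen purely for illustration) that compares an unpenalized linear model against Ridge (L2) and Lasso (L1) regression under 5-fold cross-validation.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Many features relative to the sample count makes plain least squares prone to overfitting.
X, y = make_regression(n_samples=100, n_features=60, n_informative=10,
                       noise=10.0, random_state=0)

models = {
    "LinearRegression": LinearRegression(),
    "Ridge (L2, alpha=1.0)": Ridge(alpha=1.0),
    "Lasso (L1, alpha=1.0)": Lasso(alpha=1.0),
}

for name, model in models.items():
    # 5-fold cross-validation: every fold serves as the test set once.
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name:22s}  mean R^2 = {scores.mean():.3f} +/- {scores.std():.3f}")

# The regularized models typically generalize better than the unpenalized fit,
# and Lasso additionally zeroes out the weights of uninformative features.
```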

Different machine learning models have various variables that could lead to overfitting. Let’s provide techniques specific to some models:

  • Linear Methods: Adding higher-order terms (for example, polynomial features) refines the model and lets it make more detailed predictions; in other words, it increases the model’s complexity.
  • Tree Methods: Deeper branching is what makes the model more complex. In optimization-based tree methods (like LightGBM), the iteration count acts as a complexity parameter: raising it to 100, 500, or 1,000 may keep reducing the error on the training set while increasing it on the test set (see the sketch after this list).
  • Artificial Neural Networks: Increasing the number of layers, the number of neurons per layer, or the iteration count, or tuning parameters such as the learning rate, will reduce the training error up to a point, after which the error on the test set starts to rise.
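Below is a hedged sketch of the iteration-count effect described for tree methods. It uses scikit-learn's GradientBoostingRegressor on synthetic data rather than LightGBM so that it stays self-contained, but the pattern is the same: training error keeps falling while test error eventually turns back up.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=20, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.1,
                                  max_depth=3, random_state=0)
model.fit(X_train, y_train)

# staged_predict yields predictions after each boosting iteration, so we can
# watch the training error fall while the test error eventually turns upward.
train_errors = [mean_squared_error(y_train, p) for p in model.staged_predict(X_train)]
test_errors = [mean_squared_error(y_test, p) for p in model.staged_predict(X_test)]

for i in [10, 100, 500, 1000]:
    print(f"iterations={i:4d}  train MSE={train_errors[i-1]:.1f}  test MSE={test_errors[i-1]:.1f}")

best_iter = int(np.argmin(test_errors)) + 1
print(f"Test error is lowest around iteration {best_iter}; beyond that, extra trees mostly memorize.")
```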

To illustrate the overfitting problem with a real-life business example:

In facial recognition applications, preventing overfitting is crucial. For instance, a security camera system might use facial recognition to automatically control who may enter a building. If the facial images in the training set cover only a limited range of lighting conditions, angles, and facial expressions, the model memorizing the training set and generalizing poorly to new, unseen faces becomes a significant problem.

To solve this problem, data augmentation and cross-validation can be used. With data augmentation, the facial images in the training set are enriched by rotating and resizing them and by simulating different lighting conditions, so the model learns to generalize across a wider range of conditions. With cross-validation, the model’s performance is continuously monitored on held-out folds, reducing the risk of overfitting.
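As a minimal sketch of the augmentation idea (not a prescription), the snippet below uses torchvision; the folder layout "faces/train" and the specific transform settings are assumptions for illustration.

```python
from torchvision import datasets, transforms

train_transforms = transforms.Compose([
    transforms.RandomRotation(degrees=15),                 # simulate tilted camera angles
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # vary framing and apparent distance
    transforms.ColorJitter(brightness=0.4, contrast=0.3),  # vary lighting conditions
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Each epoch sees a freshly perturbed version of every face image, so the model
# cannot simply memorize one pose under one lighting setup.
# Assumes an image folder structured as faces/train/<person_id>/*.jpg.
train_dataset = datasets.ImageFolder("faces/train", transform=train_transforms)
```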

In conclusion, various techniques and methods can be employed to address the overfitting problem. The ideal solution can vary depending on the project and the machine learning model used. Implementing these techniques to prevent overfitting enhances the model’s performance, enabling it to make more reliable and accurate predictions in real life.

For training on overfitting issues and machine learning, you can check out the content offered by Miuul. With the expert team and support of Miuul, you can progress confidently in your data science career.


