Fitting Your Model

Underfitting and Overfitting



In predictive modeling, you build one or more models that, you hope, tell you something useful about future events. You do this using data that represents the outcomes of past events, along with features associated with those events.

When you build your model, it technically knows nothing about the future. However, if we can reasonably assume that features that have been informative in the past will remain informative, we may find that we can build a model that generalizes well to future data.

Saying that our model generalizes well to future data just means that, having been trained on information from the past, it can take in new information and describe it with reasonable accuracy.

Once we have all our data and understand it well, we can begin model selection: testing different models on our data to see how well each one generalizes.
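A common way to compare how well candidate models generalize is cross-validation, which scores each model on data it was not trained on. The sketch below (using scikit-learn, with made-up data and an arbitrary pair of candidate models) illustrates the idea:

```python
# A minimal model-selection sketch: compare two candidate models on the
# same data using 5-fold cross-validation and see which generalizes better.
# The data and model choices here are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)  # noisy sine wave

candidates = {
    "linear": LinearRegression(),
    "tree(depth=4)": DecisionTreeRegressor(max_depth=4, random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)  # R^2 on held-out folds
    print(f"{name}: mean R^2 = {scores.mean():.3f}")
```

Each model is scored only on folds it never saw during fitting, so the comparison reflects generalization rather than memorization.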

Some models won’t describe our data well; this is known as underfitting. In this case your model exhibits high bias and low variance: it fails to capture the underlying structure of the data. An example of this might be trying to fit a linear model to data that is not linear. A common cause of underfitting is not thoroughly understanding the data, which is part of the reason why we often need to do exploratory data analysis first!
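The linear-model-on-nonlinear-data case is easy to demonstrate. In this small sketch (scikit-learn assumed, with made-up quadratic data), a straight line scores poorly even on the data it was trained on, the hallmark of high bias:

```python
# Underfitting sketch: a straight line fit to clearly quadratic data.
# The model cannot capture the curve, so even its training score is low.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.2, size=200)  # parabola + noise

model = LinearRegression().fit(X, y)
print(f"training R^2: {model.score(X, y):.3f}")  # low: the line misses the curve
```

No amount of extra training data fixes this; the model family itself is too simple for the structure in the data.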

On the other hand, our model might describe our data too closely, to the point where it does not generalize. This is known as overfitting. Overfitted models exhibit high variance and low bias. The red line below demonstrates a model that may overfit the data; future data is more likely to be better described by the dashed line, which represents a simple ordinary least squares linear regression model.

[Figure: an overfit model (red line) compared with an ordinary least squares fit (dashed line)]

In practice the red line might actually do an okay job of fitting the data, but the important thing to remember is that when we build a model that describes our past data very accurately, we run the risk of overfitting, causing the model to describe future data poorly. In some cases overfitting will not be as forgiving as in the image above: if the variance here were much higher, it would be much less likely that your model would generalize. Another example is failing to limit the depth or number of leaves in a decision tree; in the extreme, you could end up with nearly as many leaves as data points.
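The decision tree case is easy to reproduce. In this sketch (scikit-learn assumed, with made-up noisy data), an unconstrained tree grows one leaf per training point and memorizes the noise, while capping max_depth trades some training accuracy for better generalization:

```python
# Overfitting sketch: an unconstrained decision tree vs. a depth-limited one.
# The unconstrained tree ends up with as many leaves as training points.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)          # no limit
shallow = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_tr, y_tr)

print("leaves (unconstrained):", deep.get_n_leaves())  # one per training point
print(f"deep    train/test R^2: {deep.score(X_tr, y_tr):.2f} / {deep.score(X_te, y_te):.2f}")
print(f"shallow train/test R^2: {shallow.score(X_tr, y_tr):.2f} / {shallow.score(X_te, y_te):.2f}")
```

The deep tree fits the training set perfectly, but that perfection comes from memorizing noise; the shallow tree's lower training score is the price of describing future data better.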

This should give you a basic intuition for how underfitting and overfitting occur. It’s important to understand the cases in which your chosen model may underfit or overfit the data. Doing so will help you recognize which models not to use and how to improve the performance of models that are overfitting your data.