Imagine that you’re developing a deep learning model. You’re in the prototyping stage and, as a good data scientist, you’re following the practical design process proposed by Goodfellow. You’re totally focused on having a working end-to-end pipeline as soon as you can.
In your dungeon, the sound of the keyboard breaks the silence. You keep hitting it violently, key after key, looking for the secret code. It’s dark outside. The night goes on and the smell of coffee denounces the sin of sleepiness. But you did it.
You run the code using Shift-Enter and your first deep learning model comes to life. You’re now a creator.
As you lean back in your chair, another coffee comes to you. It’s now time to see how well your creation is performing. Following what an obscure blog says, you start by plotting some learning curves. They look more or less like this:
What about now? How well is your model doing? Is it a useful model? Is it wrong? Can you improve it?
This is the time when you realize that you’re free to create, but you’re also responsible for everything you do. You’re living the curse of Existentialism and now you need to read this blog post to learn how to interpret learning curves (creepy laugh).
Learning curves show the performance of our models on training and validation sets, as a function of the number of training iterations. This means that learning depends on the number of iterations. In general, the more iterations, the better (at least until the model starts to overfit). Just like anything else in life.
In this post, we will go through the basics of two types of learning curves:
- Loss curves, which compare the value of the loss function on the training and validation sets.
- Accuracy curves, which compare the performance of the model according to a specific metric (accuracy) on training and validation sets.
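If you’re wondering where those curves come from, here is a minimal sketch. It assumes a toy one-weight logistic regression and a made-up 1-D dataset (none of this is the model from the story above), and it simply records loss and accuracy on both sets after every epoch. The names `train`, `evaluate`, and `history` are mine, chosen for illustration.

```python
import math

# Toy 1-D dataset (an assumption for illustration): the label is 1 when x > 0.
X_TRAIN = [-1.9 + 0.2 * i for i in range(20)]
Y_TRAIN = [1 if x > 0 else 0 for x in X_TRAIN]
X_VAL = [-1.8 + 0.4 * i for i in range(10)]
Y_VAL = [1 if x > 0 else 0 for x in X_VAL]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def evaluate(w, b, xs, ys):
    """Return (mean cross-entropy loss, accuracy) on one dataset."""
    eps = 1e-12  # keeps log() away from zero
    loss, correct = 0.0, 0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        loss -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        correct += int((p >= 0.5) == (y == 1))
    return loss / len(xs), correct / len(xs)


def train(epochs=50, learning_rate=0.5):
    """Train with full-batch gradient descent and record the learning curves."""
    w, b = 0.0, 0.0
    n = len(X_TRAIN)
    history = {"train_loss": [], "val_loss": [], "train_acc": [], "val_acc": []}
    for _ in range(epochs):
        # One gradient descent step on the training set.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(X_TRAIN, Y_TRAIN)) / n
        grad_b = sum(sigmoid(w * x + b) - y for x, y in zip(X_TRAIN, Y_TRAIN)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        # Measure both sets after each epoch: these lists are the learning curves.
        train_loss, train_acc = evaluate(w, b, X_TRAIN, Y_TRAIN)
        val_loss, val_acc = evaluate(w, b, X_VAL, Y_VAL)
        history["train_loss"].append(train_loss)
        history["val_loss"].append(val_loss)
        history["train_acc"].append(train_acc)
        history["val_acc"].append(val_acc)
    return history


history = train()
```

Plot `history["train_loss"]` against `history["val_loss"]` and you get a loss curve; do the same with the two accuracy lists and you get an accuracy curve.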
As you know, machine learning is a sausage machine. The only difference is that you input data instead of meat, and you output predictions instead of sausages. Figure 3 illustrates this idea of sausage machine learning.
According to this perspective, your machine learning model works well when it is able to make accurate predictions with the data that you input. This means that your machine learning model is perfect when there’s no difference between the actual output and the predicted output.
We call the difference between the actual output and the predicted output the ‘error’. The error is a measure of a model’s performance and it depends on several parameters (e.g. the network’s weights).
When we have a quantity depending on other quantities, we have a function. Let’s call the function that relates the error to those other quantities the ‘loss function’.
At this point, we have a function – loss function – that gives us the error of the model. Moreover, we know that we want to minimize the loss function because the best model is the one that has the least error. So, what’s missing? The only thing that’s missing to get to heaven is a way to minimize (optimize) the loss function.
That’s what machine learning algorithms are for. They have clever mechanisms to minimize (optimize) the loss function.
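To make ‘loss as a function of the weights’ a bit more concrete, here is a small sketch. It assumes the simplest model I could think of (a single weight `w` and the prediction `w * x`) and it assumes mean squared error as the loss; both are my choices for illustration, not the only options.

```python
def predict(w, x):
    """A one-weight 'model': the prediction is just w times the input."""
    return w * x


def loss(w, xs, ys):
    """Mean squared error: the average of the squared prediction errors."""
    errors = [predict(w, x) - y for x, y in zip(xs, ys)]
    return sum(e * e for e in errors) / len(errors)


# Made-up data in which the actual output is always twice the input.
XS = [1.0, 2.0, 3.0]
YS = [2.0, 4.0, 6.0]

# The same data gives a different loss for each value of the weight.
loss_bad_weight = loss(1.0, XS, YS)
loss_perfect_weight = loss(2.0, XS, YS)
```

Change `w` and the loss changes with it. Here the perfect model (`w = 2.0`) has a loss of exactly zero, and minimizing the loss means finding that weight.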
Now that I feel that I revealed a Freemason’s secret, let’s see what happens when we are minimizing the loss function. Pardon me for the lack of rigor that characterizes the following lines, but I’m optimizing the text to give you the intuition.
So, imagine that you have a function that has a minimum (Figure 4).
Now, we will consecutively try to find that minimum. We start at a random point of the function (Figure 5)…
… and every iteration we get closer to the minimum (Figure 6).
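In code, that walk towards the minimum can be sketched like this. I’m assuming plain gradient descent on a made-up function, `f(x) = (x - 3)^2`, whose minimum sits at `x = 3`. I also fixed the starting point instead of drawing it at random, so that the result is reproducible.

```python
def f(x):
    """A function with a single minimum at x = 3 (chosen for illustration)."""
    return (x - 3.0) ** 2


def f_prime(x):
    """Derivative of f: it tells us which way is downhill."""
    return 2.0 * (x - 3.0)


def gradient_descent(start, learning_rate=0.1, iterations=50):
    """Step against the derivative and return every point visited."""
    x = start
    path = [x]
    for _ in range(iterations):
        x -= learning_rate * f_prime(x)
        path.append(x)
    return path


path = gradient_descent(start=-4.0)
```

Each entry of `path` is a bit closer to 3 than the previous one, which is the sequence of dots getting closer to the bottom of the curve.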
Since you’re a mighty creator (and not a creature), you decide how fast you want to search for the minimum:
- You can go crawling.
- Or you can drive a Ferrari.
We call the search speed the ‘learning rate’. When defining the learning rate, you should be aware that:
- If you go too slowly, it takes you ages to find the minimum.
- If you go too fast, you’ll pass the minimum without noticing it.
In a nutshell, this is all you need to know about error, loss function, and learning rates.
But wait! Why are we interested in this? We are interested in this because now we have the basics to understand the logic of any loss curve. Sounds powerful, doesn’t it? Let’s see if that’s true.
In Figure 11, we have what I affectionately call the ‘fundamental picture of loss function’.
If you know the original figure, you noticed that I hid the labels of the curves. Yes, I’m a nasty guy. But it’s for your own benefit. I read somewhere that pretesting improves subsequent learning of the pretested information. If you want, we can talk more about that one of these days.
Coming back to what brings us here, tell me. Which curve do you think corresponds to:
- A low learning rate? Hint: Remember that low learning rates take ages to find the minimum.
- A high learning rate? Hint: There are two curves that correspond to a high learning rate.
- A good learning rate? Hint: Do the math. There’s just one curve missing.
After 30 seconds you’re free to check the answer here. Don’t open the link before the 30 seconds are over. Otherwise, you’ll break an old Tibetan chain of trust and you’ll be cursed with unsexiness for 7 years.
Ok, now we all know the answer. And do we understand it? Sure.
- When our learning rate is low, we approach the minimum of the loss function slowly, which means that the value of the loss goes slowly towards zero.
- When our learning rate is high, we will probably not find the minimum because we move so fast that we pass by it without noticing.
- Finally, we know that our learning rate is good when the loss converges towards zero in a reasonable amount of time.
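You can reproduce the flavor of these three behaviors with a toy experiment. The sketch below assumes gradient descent on `loss(x) = x^2` and three learning rates that I picked by hand (0.01, 0.3 and 1.1). Note that this toy only shows the ‘loss explodes’ version of a high learning rate, not both of the high-learning-rate curves in the figure.

```python
def loss(x):
    """Toy loss with its minimum (zero) at x = 0."""
    return x ** 2


def loss_curve(learning_rate, start=1.0, iterations=20):
    """Run gradient descent and return the loss after every iteration."""
    x = start
    curve = [loss(x)]
    for _ in range(iterations):
        x -= learning_rate * 2.0 * x  # 2x is the derivative of x^2
        curve.append(loss(x))
    return curve


low = loss_curve(0.01)       # crawling: the loss barely moves
good = loss_curve(0.3)       # reaches (almost) zero quickly
too_high = loss_curve(1.1)   # overshoots further every step, so the loss grows
```

After 20 iterations, `good` is practically at zero, `low` is still far from it, and `too_high` has ended up with a larger loss than it started with.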
Great! We now know how to read loss curves. Considering that our main goal is to understand what’s going on with our model, this is a nice super-power to have.
But wait a minute… I know a guy that knows a guy that is dating his cousin, that got a model in which the loss was decreasing but the accuracy wasn’t increasing. If the loss is decreasing, it means that we are getting closer to the perfect model. Accordingly, shouldn’t accuracy increase?
Maybe not. Let’s move to the accuracy chapter and talk about that.
Hey! Welcome to the other side of the moon. In the last section, we were discussing the case in which the loss decreases but the accuracy doesn’t. That’s a good starting point to discuss the difference between loss and accuracy.
Let’s start with an example. Imagine that we define a man taller than 1.75m as tall. Now, we will try to predict the height of a 32-year-old Portuguese guy who supports Benfica (me). We feed our model with these data and it tells us ‘this guy must be 1.90m, so he’s tall’. Well… I’m 1.77m, so the model completely missed my height. However, the model got the right result. For the threshold that we defined (1.75m), I’m a tall person.
This is the difference between loss and accuracy. While loss gives us the ______________ (filling in the blank is another way to boost learning), accuracy gives us the number of correct predictions in relation to the total number of predictions. Loss is about how right the model is; accuracy is just about the model being right or wrong.
Thus, we can tell the guy that knows a guy that is dating his cousin, that loss and accuracy are different concepts and don’t necessarily need to move in the same direction (although they usually do, I confess).
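Here is the height example as a sketch. The 1.75m threshold and my 1.77m come from the story above; the other heights, the two ‘models’, and the choice of mean absolute error as the loss are assumptions I made up to illustrate the point.

```python
TALL_THRESHOLD = 1.75  # metres, as defined in the example above


def is_tall(height):
    return height > TALL_THRESHOLD


def mean_absolute_error(predicted, actual):
    """Loss (assumed here): average distance between prediction and reality."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def accuracy(predicted, actual):
    """Fraction of people whose tall / not tall label was predicted correctly."""
    hits = sum(is_tall(p) == is_tall(a) for p, a in zip(predicted, actual))
    return hits / len(actual)


actual_heights = [1.77, 1.60, 1.85]   # the first one is me
model_a = [1.90, 1.70, 1.76]          # misses the heights by a lot
model_b = [1.80, 1.65, 1.80]          # much closer to the real heights

loss_a = mean_absolute_error(model_a, actual_heights)
loss_b = mean_absolute_error(model_b, actual_heights)
acc_a = accuracy(model_a, actual_heights)
acc_b = accuracy(model_b, actual_heights)
```

Going from `model_a` to `model_b`, the loss drops while the accuracy stays exactly where it was, because every prediction was already on the right side of the threshold. That’s the situation of the guy that knows a guy.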
Now that we are clear about the difference between loss and accuracy, let’s focus on accuracy. In Figure 12 we have the ‘fundamental picture of accuracy’.
This figure is fundamental because it explains almost everything that you need to know about accuracy. In the end, it’s all about two concepts:
- Overfitting. A model is said to overfit when the training accuracy is high and the validation accuracy is low (red and blue lines).
- Underfitting. A model is said to underfit when the training accuracy is low and the validation accuracy is low as well (blue line and an imaginary red line slightly above it).
As you can deduce, a model is said to have a good fit if both training and validation accuracy are high and the gap between them is small.
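If you like rules of thumb written as code, these definitions can be sketched as a tiny function. Be warned that the thresholds (`min_acc` and `max_gap`) are numbers I invented for illustration. What counts as ‘low’ accuracy or a ‘small’ gap depends entirely on your problem.

```python
def diagnose_fit(train_acc, val_acc, min_acc=0.8, max_gap=0.05):
    """Label a (training accuracy, validation accuracy) pair.

    min_acc and max_gap are hypothetical thresholds, not universal constants.
    """
    if train_acc < min_acc and val_acc < min_acc:
        return "underfitting"   # both accuracies are low
    if train_acc - val_acc > max_gap:
        return "overfitting"    # training is well above validation
    return "good fit"           # small gap between the two


overfit_case = diagnose_fit(train_acc=0.99, val_acc=0.70)
underfit_case = diagnose_fit(train_acc=0.60, val_acc=0.58)
good_case = diagnose_fit(train_acc=0.92, val_acc=0.90)
```

Treat the output as a first guess to check against the actual curves, not as a verdict.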
In general, you want to see something like what is shown in the red and green curves:
- Accuracy increasing with the number of iterations (epochs). This means that the model is getting better and better (learning).
- Training accuracy above the validation accuracy. Validation accuracy measures the model’s performance on unseen data, which is harder.
- A small gap between the _____ and the _____ sets. That’s what we defined as a good fit above (filling in the blanks is great for lazy writers).
Eventually, our beloved model will find a plateau and we will be nostalgic about those days when it was growing up too fast. Fortunately, not as nostalgic as we are now that we realize that this post is reaching the end.
Putting it together
Interpreting learning curves is more an art than a science. That’s a cliché, I know. But that’s exactly how it feels. Even the curves look like a beautiful form of art (I’m with Karpathy!).
Every time I enter into this art vs. science dichotomy, I remember my Bridges teacher (I was a Civil Engineer in a previous life). He used to have a theorem: ‘What looks wrong, usually is wrong.’ (there was also a corollary, but only applicable to students: ‘what looks right, usually is also wrong’).
So, why do we say that interpreting curves is like an art? I think it’s because we can’t precisely explain why something looks right or wrong. It just looks. It’s our intuition working and, although intuition is a huge source of complex knowledge (see António Damásio), we always see it as a minor product of emotions and feelings.
The million-dollar question is then: how do we develop intuition? And the million-dollar answer is: practice deliberately. Look at the learning curves, draft an explanation, confront it with reality, revise your explanation if necessary, and go for the next iteration. Just like you did to train your machine learning model.
That’s the irony of this post. We end up realizing that to learn how to read learning curves, we need to operate as they do. That’s funny. We all are a machine (learning).