Writing a Custom Metric for the fastai Library
The fastai library provides several standard evaluation metrics for classification problems. But what if we want to use a custom metric instead? In this tutorial I show how to do this.
In an earlier blog post I described how to classify images of distracted drivers using the fastai library. The dataset I used to train my classification model came from the Distracted Driver Kaggle competition. During model training I used the accuracy metric to report the performance of my model. However, if we take a closer look at the description of the Distracted Driver Kaggle competition, we can see that accuracy is actually not the evaluation metric that is supposed to be used for this competition. It is the multi-class logarithmic loss (log loss) instead! Wait! Don't we already use the log loss for optimizing our model? Yes, we do! But we can also use it as an evaluation metric. Kaggle actually does this for many of its competitions. Okay, but what's the difference compared to accuracy? Well, accuracy simply counts the predictions that were correct. You can find more information about it here and here. However, accuracy is not always a good evaluation metric because of its all-or-nothing nature: a prediction is either counted as correct or it isn't. The log loss, on the other hand, takes the uncertainty of a prediction into account, i.e. how far the predicted probability is from the actual target. You can find out more about it here and here.
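To make that difference concrete, here is a tiny hand-calculated example (my own addition, plain Python without any libraries): two predictions that both pick the correct category and therefore both get full marks from accuracy, but that the log loss scores very differently depending on how confident they are.
import math
# the true category is predicted correctly in both cases, so accuracy is 1.0 for both
confident_prob = 0.90  # predicted probability for the true category
hesitant_prob = 0.55   # predicted probability for the true category
# the log loss of a single sample is -log(probability assigned to the true category)
print('log loss (confident): {:.3f}'.format(-math.log(confident_prob)))  # ~0.105
print('log loss (hesitant): {:.3f}'.format(-math.log(hesitant_prob)))    # ~0.598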
So, let's use the log loss as our evaluation metric. However, when we take a look into fastai's documentation, we can see that fastai doesn't offer a predefined function for the log loss metric. Before, when we were using the accuracy, we could simply use the predefined accuracy metric function from fastai. Now, we have to implement the log loss metric ourselves and add it to our learner. Well, actually we don't have to add the log loss as an evaluation metric at all, since the log loss is also used for optimizing our model and thus fastai already shows it to us. Nevertheless, I still want to show how we can write a custom metric for fastai using the example of the log loss. Later we can use the same approach to write other kinds of custom metrics for fastai.
I'm using Google Colab here. If you have your own machine with a CUDA-capable GPU, you can of course use that instead. Okay, let's load the libraries.
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import torch
import torchvision
import fastai
import platform
print('python version: {}'.format(platform.python_version()))
print('torch version: {}'.format(torch.__version__))
print('torchvision version: {}'.format(torchvision.__version__))
print('fastai version: {}'.format(fastai.__version__))
print('cuda available: {}'.format(torch.cuda.is_available()))
print('num gpus: {}'.format(torch.cuda.device_count()))
print('gpu: {}'.format(torch.cuda.get_device_name(0)))
from fastai.vision import *
from fastai.metrics import accuracy
from torch.nn.functional import softmax
Okay, let's implement our metric. The fastai documentation shows how this can be done. At first I was a bit confused but they actually describe it quite clearly. First of all, we need to think about whether our metric is an average over all the elements in our dataset. Is this the case for the log loss? Let's take a look at the log loss formula.
$$\mathrm{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log(p_{ij})$$
$N$ is the total number of samples (in our case images), $M$ is the number of categories, $\log$ is the natural logarithm, $y_{ij}$ is an indicator function that returns 1 if sample $i$ belongs to category $j$ (0 otherwise) and $p_{ij}$ is the predicted probability that sample $i$ belongs to category $j$. Note that for each sample $i$ the indicator $y_{ij}$ is 1 only for its true category, so the inner sum simply picks out the log of the probability the model assigns to that true category. As we can see, we sum over all samples $i$ and divide that sum by the total number of samples $N$. This means the log loss metric is indeed an average over all elements of our dataset. However, we usually don't show all samples of our dataset to our learner at once. We show them in batches instead. So, let's write a function that takes the final network predictions of each sample of a batch as well as their corresponding target categories as input and that outputs the log loss of that batch (the average of the log losses of all samples of the batch).
def log_loss(preds: Tensor, targs: Tensor) -> Rank0Tensor:
    "Computes the log loss with `targs` when `preds` is bs * n_classes."
    epsilon = 1e-15
    # calculate probs
    p = softmax(preds, dim=1)
    p = p / p.sum(dim=1).reshape(-1, 1)
    # apply min max rule
    p = torch.clamp(p, min=epsilon)
    p = torch.clamp(p, max=(1 - epsilon))
    # calculate log probs
    p = torch.log(p)
    # convert targs to one-hot vectors
    n = len(targs)
    m = p.shape[1]
    zeros = torch.zeros(n, m)
    if preds.is_cuda:
        zeros = zeros.cuda()
    y = zeros.scatter(1, targs.unsqueeze(1), 1.).float()
    # calculate and return log loss
    return (-1 / n) * (y * p).sum()
preds are the network predictions for each sample of the batch (a tensor of size batch size x number of categories). targs are the target category labels for each sample of the batch (a tensor of size batch size x 1). Since our network predictions preds are the raw outputs before applying the Softmax function (in PyTorch the Softmax is part of the CrossEntropy loss function, as alluded to here and here), we need to apply Softmax manually to our predictions preds to obtain our probabilities $p_{ij}$. Furthermore, due to potential numerical inaccuracies we normalize all probabilities for each sample again (i.e. we divide each probability of a category by the sum of the probabilities of all categories) to make sure they really add up to 1 (as suggested by Kaggle). I'm actually not sure whether this is absolutely necessary, since I believe PyTorch's Softmax function should already take care of it, but it doesn't hurt either. Then, for numerical reasons we also apply the MinMax rule to our probabilities, i.e. we clamp them to the interval $[\epsilon, 1-\epsilon]$ so that we never take the logarithm of 0. The scatter call then simply turns the category indices in targs into the one-hot vectors $y_{ij}$ from the formula above.
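Since the log loss of the softmaxed predictions is exactly what PyTorch's cross entropy loss computes, we can do a quick sanity check on random inputs (this snippet is my own addition, not part of the original notebook); both values should match almost exactly, up to the tiny clamping epsilon.
preds = torch.randn(8, 10)          # fake raw network outputs: 8 samples, 10 categories
targs = torch.randint(0, 10, (8,))  # fake target category for each sample
print(log_loss(preds, targs))                           # our custom metric
print(torch.nn.functional.cross_entropy(preds, targs))  # PyTorch's built-in loss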
Now, we can use this function as a metric for our learner. But why do we only have to define our function on a batch and not on the whole dataset? Well, we only have to define it for the batch, because fastai applies our function to the whole dataset automatically. For every batch fastai calls our function and receives the log loss for that batch in return. As a result, if we have e.g. 50 batches, fastai generates 50 log loss values, one for each batch. Finally, fastai takes the average of these 50 log loss values and receives the final log loss value for the whole dataset. This makes sense, since the average of a bunch of averages is the overall average, as long as each batch average is weighted by the batch size (as far as I can tell, fastai does exactly that, so even a smaller last batch is handled correctly).
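Here is a tiny standalone illustration of that averaging step (my own example, not fastai code): with unequal batch sizes, a plain mean of the per-batch means drifts away from the true overall mean, while the size-weighted mean does not.
batch1 = [0.2, 0.4, 0.6, 0.8]  # per-sample losses of a batch with 4 samples, mean 0.5
batch2 = [1.0, 2.0]            # per-sample losses of a batch with 2 samples, mean 1.5
overall = sum(batch1 + batch2) / 6        # 0.8333...
plain_mean_of_means = (0.5 + 1.5) / 2     # 1.0, does not match the overall mean
weighted_mean = (4 * 0.5 + 2 * 1.5) / 6   # 0.8333..., matches the overall mean
print(overall, plain_mean_of_means, weighted_mean)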
However, not every metric is an average over the whole dataset. Precision, for example, is not a simple per-sample average. Or we could even come up with totally different metrics, e.g. a metric that calculates the maximum RAM usage during training. To implement metrics that are not an average over all samples of a dataset, we need to use a callback. Check out the fastai documentation to find out how this can be implemented (I also sketch a precision callback below, after the explanation of LogLoss2). Our custom log loss metric could be written using a callback as follows.
class LogLoss2(Callback):
    def on_epoch_begin(self, **kwargs):
        self.sum, self.total = 0., 0.
    def on_batch_end(self, last_output, last_target, **kwargs):
        epsilon = 1e-15
        # calculate probs
        p = softmax(last_output, dim=1)
        p = p / p.sum(dim=1).reshape(-1, 1)
        # apply min max rule
        p = torch.clamp(p, min=epsilon)
        p = torch.clamp(p, max=(1 - epsilon))
        # calculate log probs
        p = torch.log(p)
        # convert last_target to one-hot vectors
        n = len(last_target)
        m = p.shape[1]
        zeros = torch.zeros(n, m)
        if last_output.is_cuda:
            zeros = zeros.cuda()
        y = zeros.scatter(1, last_target.unsqueeze(1), 1.).float()
        # update sum and total count
        self.sum += -1 * (y * p).sum()
        self.total += n
    def on_epoch_end(self, last_metrics, **kwargs):
        return add_metrics(last_metrics, self.sum/self.total)
The function on_batch_end contains almost the same code as our metric function from above. However, now we have to take care ourselves that the final log loss is calculated over all batches at the end of each epoch. We do this in the on_epoch_end method.
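As mentioned above, precision is an example of a metric that is not a per-sample average: it is the number of true positives divided by the number of all positive predictions, so we have to accumulate counts over the batches and only divide at the end of the epoch. Here is a rough sketch of how that could look using the same Callback pattern (the class name Precision2 and the pos_class parameter are made up for this illustration; fastai also ships its own precision metric):
class Precision2(Callback):
    "Precision for one chosen category, accumulated over all batches of an epoch."
    def __init__(self, pos_class=1):
        self.pos_class = pos_class
    def on_epoch_begin(self, **kwargs):
        self.true_pos, self.pred_pos = 0, 0
    def on_batch_end(self, last_output, last_target, **kwargs):
        preds = last_output.argmax(dim=1)
        # count how often we predicted the positive class and how often that was correct
        self.pred_pos += (preds == self.pos_class).sum().item()
        self.true_pos += ((preds == self.pos_class) & (last_target == self.pos_class)).sum().item()
    def on_epoch_end(self, last_metrics, **kwargs):
        precision = self.true_pos / self.pred_pos if self.pred_pos > 0 else 0.
        return add_metrics(last_metrics, precision)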
Okay, now let's test our custom log loss metric. Let's put our two versions of it in a list and let's also add the accuracy for completeness.
metrics = [log_loss, LogLoss2(), accuracy]
Let's use a sample of the MNIST dataset for testing. First, we need to download the dataset.
path = untar_data(URLs.MNIST_SAMPLE); path
Then, we load the dataset into a DataBunch object.
tfms = get_transforms(do_flip=False)
data = ImageDataBunch.from_folder(path, ds_tfms=tfms, size=26, num_workers=1, bs=64)
Let's look at a few samples.
data.show_batch(rows=3, figsize=(5,5))
It only shows samples of the digit 7 and the digit 3, because the dataset we use here is just a subset of the real MNIST dataset. However, a subset is enough for a first test of our custom metric. Now, let's create a learner object, add our metrics to it and start training a model for two epochs.
learn = cnn_learner(data, models.resnet18, metrics=metrics)
learn.fit(2)
What we can see is that our custom log loss metric log_loss as well as its callback version log_loss2 produced the same results as the loss on the validation set valid_loss after each epoch. This means that our implementation is probably correct.
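If you want to double-check these numbers outside of the training loop, fastai's learn.validate() recomputes the loss together with all attached metrics on the validation set (shown here as a hedged sketch; the returned values should be the validation loss followed by our metrics in the order of the metrics list):
results = learn.validate()  # validation loss followed by log_loss, LogLoss2 and accuracy
print(results)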
Next, let's try a more complicated dataset and more training epochs. Let's choose the PETs dataset. First, we need to download it and load it into a DataBunch object as we did for MNIST.
path = untar_data(URLs.PETS)
path_anno = path/'annotations'
path_img = path/'images'
fnames = get_image_files(path_img)
pat = r'/([^/]+)_\d+.jpg$'
data = ImageDataBunch.from_name_re(
    path_img, fnames, pat, ds_tfms=get_transforms(), size=224, bs=64
).normalize(imagenet_stats)
Let's look at a few samples.
data.show_batch(rows=3, figsize=(7,6))
Now, let's create a learner object, add our metrics to it and start training a model for eight epochs.
learn = cnn_learner(data, models.resnet50, metrics=metrics)
learn.fit_one_cycle(8, max_lr=2e-02)
As we can see, the loss on the validation set valid_loss matches our two versions of the log loss metric exactly after almost every epoch. Only after epoch 1 is there a small difference. However, the values still roughly match. I assume the difference stems from a slight variation in the implementation that calculates the valid_loss compared to our own implementation. However, I haven't been able to find out yet what this variation is. Nevertheless, since the difference is small, I believe my implementation of the log loss metric is correct. So, we are done!