In an earlier blog post I described how to classify images of distracted drivers using the fastai library. The dataset that I used to train my classification model came from the Distracted Driver Kaggle competition. During model training I used the accuracy metric to report the performance of my model. However, if we take a closer look at the description of the Distracted Driver Kaggle competition, we can see that accuracy is actually not the evaluation metric that is supposed to be used for this competition. It is the multi-class logarithmic loss (log loss) instead! Wait! Don't we already use the log loss for optimizing our model? Yes, we do! But we can also use it as an evaluation metric. Kaggle actually does this for many of its competitions. Okay, but what's the difference compared to accuracy? Well, accuracy simply counts the predictions that were correct. You can find more information about it here and here. However, accuracy is not always a good evaluation metric because of its yes-or-no nature: a prediction is either right or wrong, no matter how confident it was. The log loss, on the other hand, takes the uncertainty of the prediction into account, i.e. how far the predicted probability is away from the actual target. You can find out more about it here and here.
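
To make the difference a bit more tangible, here is a tiny, made-up example (the probabilities are invented): two classifiers both predict the correct class, so their accuracy is identical, but the confident one gets a much lower (better) log loss than the unsure one.

import math

p_confident = 0.95   # predicted probability for the true class
p_unsure    = 0.55   # also correct, but far less certain

print(-math.log(p_confident))   # ~0.05 -> low log loss
print(-math.log(p_unsure))      # ~0.60 -> much higher log loss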

Important: The code shown in the blog post uses the old fastai v1 version. For the current fastai version check out the fastai docs.

So, let's use the log loss as our evaluation metric. However, when we take a look at fastai's documentation, we can see that fastai doesn't offer a predefined function for the log loss metric. Before, when we were using the accuracy, we could simply use the predefined accuracy metric function from fastai. Now, we have to implement the log loss metric ourselves and add it to our learner. Well, strictly speaking we don't have to add the log loss as an evaluation metric at all, since the log loss is also used for optimizing our model and fastai therefore already shows it to us as the validation loss. Nevertheless, I still want to show how we can write a custom metric for fastai using the example of the log loss. Later, we can follow the same approach to write other kinds of custom metrics for fastai.

I'm using Google Colab here. If you have your own machine with a CUDA-capable GPU, you can of course also use that one. Okay, let's load the libraries.

%reload_ext autoreload
%autoreload 2
%matplotlib inline

import torch
import torchvision
import fastai
import platform

print('python version:      {}'.format(platform.python_version()))
print('torch version:       {}'.format(torch.__version__))
print('torchvision version: {}'.format(torchvision.__version__))
print('fastai version:      {}'.format(fastai.__version__))
print('cuda available:      {}'.format(torch.cuda.is_available()))
print('num gpus:            {}'.format(torch.cuda.device_count()))
print('gpu:                 {}'.format(torch.cuda.get_device_name(0)))

from fastai.vision import *
from fastai.metrics import accuracy
from torch.nn.functional import softmax

python version:      3.6.9
torch version:       1.4.0
torchvision version: 0.5.0
fastai version:      1.0.61
cuda available:      True
num gpus:            1
gpu:                 Tesla P100-PCIE-16GB

Note: If you also use Google Colab, keep in mind that the GPU could be different for you, since Google Colab assigns you a random GPU every time you start a new session. However, for our example here it doesn't really matter what kind of GPU we get from Google Colab.

Okay, let's implement our metric. The fastai documentation shows how this can be done. At first I was a bit confused but they actually describe it quite clearly. First of all, we need to think about whether our metric is an average over all the elements in our dataset. Is this the case for the log loss? Let's take a look at the log loss formula.

$$\mathrm{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M}y_{ij}\log(p_{ij})$$

$N$ is the total number of samples (in our case images), $M$ is the number of categories, $\log$ is the natural logarithm, $y_{ij}$ is an indicator function that returns 1 if sample $i$ belongs to category $j$ (0 otherwise) and $p_{ij}$ is the predicted probability that sample $i$ belongs to category $j$. As we can see, we sum over all samples $i$ and divide that sum by the total number of samples $N$. This means the log loss metric is indeed an average over all elements of our dataset. However, we usually don't show all samples of our dataset to our learner at once; we show them in batches instead. So, let's write a function that takes the final network predictions for each sample of a batch as well as their corresponding target categories as input and that outputs the log loss of that batch (i.e. the average of the log losses of all samples in the batch).

def log_loss(preds: Tensor, targs: Tensor) -> Rank0Tensor:
    "Computes accuracy with `targs` when `preds` is bs * n_classes."  
    epsilon = 1e-15

    # calculate probs
    p = softmax(preds, dim=1)
    p = p / p.sum(dim=1).reshape(-1,1)

    # apply min max rule
    p = torch.clamp(p, min=epsilon)
    p = torch.clamp(p, max=(1-epsilon))

    # calculate log probs
    p = torch.log(p)

    # convert targs to one-hot vectors
    n = len(targs)
    m = p.shape[1]
    zeros = torch.zeros(n, m)
    if preds.is_cuda:
        zeros = zeros.cuda()
    y = zeros.scatter(1, targs.unsqueeze(1),1.).float()

    # calculate and return log loss
    return (-1 / n) * (y * p).sum()

preds are the network predictions for each sample of the batch (a tensor of size batch size x number of categories). targs are the target category labels for each sample of the batch (a tensor of size batch size x 1). Since our network predictions preds are the raw outputs before the Softmax function is applied (in PyTorch the Softmax is part of the CrossEntropy loss function, as mentioned here and here), we need to apply Softmax manually to our predictions preds to obtain our probabilities $p_{ij}$. Furthermore, due to potential numerical inaccuracies we normalize the probabilities for each sample again (i.e. we divide each category's probability by the sum of the probabilities of all categories) to make sure they really add up to 1 (as suggested by Kaggle). I'm actually not sure whether this is absolutely necessary, since I believe PyTorch's Softmax function should already take care of it, but it doesn't hurt either. Then, we also need to apply the MinMax rule to our probabilities, i.e. clamp them to the range $[\epsilon, 1-\epsilon]$, so that we never take the logarithm of exactly 0.
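
As a quick sanity check (with made-up tensors), the result of our log_loss function should match PyTorch's cross_entropy function, which computes the same quantity (Softmax followed by the negative log likelihood, averaged over the batch):

import torch
from torch.nn.functional import cross_entropy

preds = torch.randn(64, 10)           # fake "network outputs" for a batch of 64 samples and 10 categories
targs = torch.randint(0, 10, (64,))   # random target categories

print(log_loss(preds, targs))         # our custom metric
print(cross_entropy(preds, targs))    # PyTorch's reference value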

Now, we can use this function as a metric for our learner. But why do we only have to define our function for a batch and not for the whole dataset? Well, we only have to define it for a batch because fastai applies our function to the whole dataset automatically. For every batch fastai calls our function and receives the log loss for that batch in return. As a result, if we have e.g. 50 batches, fastai generates 50 log loss values, one for each batch. Finally, fastai takes the average of these 50 log loss values and receives the final log loss value for the whole dataset. This makes sense, since the average of the per-batch averages (weighted by the batch sizes, because the last batch can be smaller) is the overall average.
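
Here is a tiny, made-up illustration of that averaging idea: the per-batch means, weighted by their batch sizes, give the same result as the mean over all samples at once.

import torch

losses = torch.rand(130)                    # per-sample log losses for a small fake "dataset"
batches = losses.split(64)                  # three batches of size 64, 64 and 2

batch_means = torch.stack([b.mean() for b in batches])
batch_sizes = torch.tensor([len(b) for b in batches], dtype=torch.float)

weighted_mean = (batch_means * batch_sizes).sum() / batch_sizes.sum()
print(weighted_mean, losses.mean())         # both values are the same (up to floating point precision)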

However, not every metric is an average over the whole dataset. This is the case for e.g. the Precision metric. Or we could even come up with totally different metrics, like e.g. a metric that calculates the maximal RAM usage during training. To implement metrics that are not an average over all samples of a dataset, we need to use a callback. Check out the fastai documentation to find out how this can be done. Our custom log loss metric could be written using a callback as follows.

class LogLoss2(Callback):
    
    def on_epoch_begin(self, **kwargs):
        self.sum, self.total = 0., 0.
    
    def on_batch_end(self, last_output, last_target, **kwargs):
        epsilon = 1e-15

        # calculate probs
        p = softmax(last_output, dim=1)
        p = p / p.sum(dim=1).reshape(-1,1)

        # apply min max rule
        p = torch.clamp(p, min=epsilon)
        p = torch.clamp(p, max=(1-epsilon))

        # calculate log probs
        p = torch.log(p)

        # convert targs to one-hot vectors
        n = len(last_target)
        m = p.shape[1]
        zeros = torch.zeros(n, m)
        if last_output.is_cuda:
            zeros = zeros.cuda()
        y = zeros.scatter(1, last_target.unsqueeze(1),1.).float()

        # update sum and total count
        self.sum += -1 * (y * p).sum()
        self.total += n
    
    def on_epoch_end(self, last_metrics, **kwargs):
        return add_metrics(last_metrics, self.sum/self.total)

The on_batch_end method actually contains almost the same code as our metric function from above. However, now we have to take care ourselves that the final log loss is calculated over all batches at the end of each epoch. We do this in the on_epoch_end method.

Okay, now let's test our custom log loss metric. Let's put our two versions of it in a list and let's also add the accuracy for completeness.

metrics = [log_loss, LogLoss2(), accuracy]

Let's use a sample of the MNIST dataset for testing. First, we need to download the dataset.

path = untar_data(URLs.MNIST_SAMPLE); path
PosixPath('/root/.fastai/data/mnist_sample')

Then, we load the dataset into a DataBunch object.

tfms = get_transforms(do_flip=False)
data = ImageDataBunch.from_folder(path, ds_tfms=tfms, size=26, num_workers=1, bs=64)

Let's look at a few samples.

data.show_batch(rows=3, figsize=(5,5))

It only shows samples of the digit 7 and the digit 3, because the dataset we use here is just a subset of the real MNIST dataset. However, a subset is enough for a first test of our custom metric. Now, let's create a learner object, add our metrics to it and start training a model for two epochs.

learn = cnn_learner(data, models.resnet18, metrics=metrics)
learn.fit(2)
epoch  train_loss  valid_loss  log_loss  log_loss2  accuracy  time
0      0.203924    0.107662    0.107662  0.107662   0.959764  00:19
1      0.117651    0.054363    0.054363  0.054363   0.980864  00:19

What we can see is that our custom log loss metric log_loss as well as its callback version log_loss2 produced the same results as the loss on the validation set (valid_loss) after each epoch. This means that our implementation is probably correct.

Next, let's try a more complicated dataset and more training epochs. Let's choose the PETs dataset. First, we need to download it and load it into a DataBunch object as we did for MNIST.

path = untar_data(URLs.PETS)

path_anno = path/'annotations'
path_img = path/'images'

fnames = get_image_files(path_img)
pat = r'/([^/]+)_\d+.jpg$'

data = ImageDataBunch.from_name_re(
    path_img, fnames, pat, ds_tfms=get_transforms(), size=224, bs=64
).normalize(imagenet_stats)

Let's look at a few samples.

data.show_batch(rows=3, figsize=(7,6))

Now, let's create a learner object, add our metrics to it and start training a model for eight epochs.

learn = cnn_learner(data, models.resnet50, metrics=metrics)
learn.fit_one_cycle(8, max_lr=2e-02)
epoch  train_loss  valid_loss  log_loss  log_loss2  accuracy  time
0      0.692790    0.730190    0.730190  0.730190   0.813261  01:21
1      1.482390    4.922840    4.921004  4.921004   0.430311  01:19
2      1.097990    1.269556    1.269556  1.269555   0.695535  01:20
3      0.734669    0.812989    0.812989  0.812989   0.758457  01:20
4      0.563721    0.491822    0.491822  0.491822   0.859946  01:19
5      0.393704    0.355768    0.355768  0.355768   0.880920  01:20
6      0.265045    0.235767    0.235767  0.235767   0.926928  01:19
7      0.187495    0.227037    0.227037  0.227037   0.930988  01:19

As we can see, the loss on the validation set (valid_loss) matches our two versions of the log loss metric exactly after almost every epoch. Only after epoch 1 is there a small difference, and even then the values are still very close. I assume the difference comes from a slight variation between the implementation that calculates valid_loss and our own implementation. However, I haven't been able to find out yet what this variation is. Nevertheless, since the difference is so small, I believe my implementation of the log loss metric is correct. So, we are done!
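
As promised at the beginning, the same pattern can be reused for other custom metrics. As a small, hypothetical example (not needed for the Kaggle competition, it just illustrates the approach), here is a sketch of a top-3 accuracy metric that follows exactly the same batch-wise signature as our log_loss function:

def top3_accuracy(preds: Tensor, targs: Tensor) -> Rank0Tensor:
    "Fraction of samples whose target is among the 3 highest scoring categories."
    top3 = preds.topk(3, dim=1)[1]                    # indices of the 3 largest predictions per sample
    hits = (top3 == targs.unsqueeze(1)).any(dim=1)    # True if the target is among them
    return hits.float().mean()

It could then be added to the learner just like our log loss metric, e.g. metrics=[log_loss, top3_accuracy].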