In my last blog post I showed how to train a CNN-based model for classifying images of distracted drivers using the fastai library. The images came from the training set of the State Farm Distracted Driver Detection competition on Kaggle. In this blog post I want to show how to use the trained model to create predictions for the images of the competition's test set and how to submit these predictions to Kaggle.

Neural-network-based models usually need to be trained on a CUDA-capable GPU. Unfortunately, my local machine doesn't have one. However, there are cloud providers that offer machines with such GPUs, and Google Colab even offers free GPU access. As a result, I used Google Colab to train my model. You can read in my last blog post how I did this.

Now, I need to use my trained model to make predictions for the images of the test set. In theory I could do this on Google Colab as well. In practice, however, I ran into a problem: the test set folder of the State Farm Distracted Driver Detection competition contains a huge number of images, and Google Colab seems to have trouble accessing folders with that many files. Unfortunately, I haven't found a good solution to this problem yet. The only options that came to my mind were:

  1. Split the test set into multiple groups and store each group in its own folder
  2. Download my trained model from Google Colab to my local machine and create the predictions there

The first option has the disadvantage that I would need to do the splitting on my local machine and then upload each folder to Google Colab, since I can't even access the test set folder on Google Colab to split it there. This seems like a lot of work, and uploading the folders could take a long time. The second option has the disadvantage that creating the predictions will take a long time, because I can only use the CPU and not a GPU. To avoid time-consuming experiments to find out how many files per folder Google Colab can handle, I went with the second option. It isn't a great option either, but let's try it.
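For reference, the first option could look roughly like the sketch below. It is not what I ended up doing, and the function name, the part_ prefix and the chunk size of 10000 are assumptions chosen for illustration; I don't know how many files per folder Google Colab actually handles well.

```python
import shutil
from pathlib import Path


def split_into_subfolders(src_dir, chunk_size=10000, prefix="part_"):
    """Move the files in src_dir into numbered subfolders of at most chunk_size files."""
    src = Path(src_dir)
    # Collect the files first (sorted for a deterministic assignment),
    # so the subfolders created below are never part of the listing.
    files = sorted(p for p in src.iterdir() if p.is_file())
    subfolders = []
    for start in range(0, len(files), chunk_size):
        target = src / f"{prefix}{start // chunk_size:03d}"
        target.mkdir(exist_ok=True)
        for f in files[start:start + chunk_size]:
            shutil.move(str(f), str(target / f.name))
        subfolders.append(target)
    return subfolders
```

Calling split_into_subfolders('../data/state-farm-ddd/imgs/test') would move the images in place, so it is worth trying on a copy of the folder first.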

Important: In this tutorial I switch from Google Colab to my local machine. As a result, I need to download my trained model from Google Colab as well as the dataset of the Kaggle competition to my local machine before running any code.

This tutorial covers:

  1. How to load the test set using fastai
  2. How to create the predictions for the test set images using fastai
  3. How to submit the predictions to Kaggle using the Kaggle API

Warning: Using the CPU to create predictions for such a large number of test set images can take several hours.

Note: If you trained your model on your local machine or on a remote machine with a CUDA-capable GPU instead of Google Colab, you can keep using that machine. There is no need to switch, since opening the test set folder shouldn't be a problem there. The following instructions still apply, and they will even run faster for you, since you have access to a GPU.

Important: The code shown in the blog post uses the old fastai v1 version. For the current fastai version check out the fastai docs.

Load the Test Set

I described how to load the training data using the fastai library in my last blog post. Now, I want to show how to load the test data. However, since loading the test data is strongly coupled to loading the training data when using fastai, I'm going to briefly review how to load the training data as well.

First, we need to run some magics and check the versions of PyTorch and fastai as always. We should make sure that we use the same versions of PyTorch and fastai that we also used for training the model on Google Colab. I recommend installing the libraries using conda on your local machine. You can read more about how to do that here.

%reload_ext autoreload
%autoreload 2
%matplotlib inline

import torch
import torchvision
import fastai

print('torch version:       {}'.format(torch.__version__))
print('torchvision version: {}'.format(torchvision.__version__))
print('fastai version:      {}'.format(fastai.__version__))
torch version:       1.4.0
torchvision version: 0.5.0
fastai version:      1.0.60

We can also check whether we have GPU access. On my local machine I don't. If you do, creating the predictions will run a lot faster for you later.

torch.cuda.is_available()
False

Then, we need to load some libraries.

from fastai.vision import *
from fastai.metrics import accuracy
import pandas as pd
import random

Creating the validation set and initializing the model parameters both involve randomness. However, we can fix the random seeds so that the same random numbers are generated on every run. In Python/PyTorch we can achieve that with the following code.

torch.manual_seed(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
random.seed(0)

You can read more about reproducibility in PyTorch here.

After downloading the dataset and the model that I trained on Google Colab, I made a project folder. In this project folder I created a data and an nbs subfolder. In the nbs subfolder I put the Jupyter notebook that I used to run the code shown in this tutorial. In the data subfolder I put the downloaded dataset. Furthermore, I needed to create a folder named models in the training folder of the downloaded dataset and put my model in there. This is necessary, since fastai looks for models in that folder. When we saved our model on Google Colab, fastai created that folder and stored the model file in there automatically.

Let's specify the paths to the training set and test set folders.

base_dir = '../data/state-farm-ddd/'

train_ds_path = Path(base_dir + 'imgs/train')
test_ds_path  = Path(base_dir + 'imgs/test')

Let's take a look into the training folder.

train_ds_path.ls()
[PosixPath('../data/state-farm-ddd/imgs/train/c7'),
 PosixPath('../data/state-farm-ddd/imgs/train/c0'),
 PosixPath('../data/state-farm-ddd/imgs/train/c9'),
 PosixPath('../data/state-farm-ddd/imgs/train/c8'),
 PosixPath('../data/state-farm-ddd/imgs/train/c1'),
 PosixPath('../data/state-farm-ddd/imgs/train/c6'),
 PosixPath('../data/state-farm-ddd/imgs/train/models'),
 PosixPath('../data/state-farm-ddd/imgs/train/c3'),
 PosixPath('../data/state-farm-ddd/imgs/train/c4'),
 PosixPath('../data/state-farm-ddd/imgs/train/c5'),
 PosixPath('../data/state-farm-ddd/imgs/train/c2')]

As we can see, the training folder contains the 10 subfolders c0 to c9 as well as the models subfolder. In the subfolders c0 to c9 we can find the training images of the 10 classes of driver distraction. The models subfolder contains our model that we trained on Google Colab.

Let's also take a look into the test folder.

test_ds_path.ls()
[PosixPath('../data/state-farm-ddd/imgs/test/img_60161.jpg'),
 PosixPath('../data/state-farm-ddd/imgs/test/img_94786.jpg'),
 PosixPath('../data/state-farm-ddd/imgs/test/img_85853.jpg'),
 PosixPath('../data/state-farm-ddd/imgs/test/img_36327.jpg'),
 PosixPath('../data/state-farm-ddd/imgs/test/img_39014.jpg'),
 ...]

We have many images here. Notice that, unlike the training set, the test set is not split into multiple folders representing the image classes. The Kaggle competition asks us to predict these classes.

Then, we need to set aside a certain number of images from the training set to create our validation set. To be able to train a proper model we need to make sure that no driver appears in both the training set and the validation set. See my last blog post for more details on why this is important. To find out which images show which drivers (also called subjects) we need to load the driver_imgs_list.csv file into a pandas data frame.

df = pd.read_csv(base_dir + 'driver_imgs_list.csv'); df
subject classname img
0 p002 c0 img_44733.jpg
1 p002 c0 img_72999.jpg
2 p002 c0 img_25094.jpg
3 p002 c0 img_69092.jpg
4 p002 c0 img_92629.jpg
... ... ... ...
22419 p081 c9 img_56936.jpg
22420 p081 c9 img_46218.jpg
22421 p081 c9 img_25946.jpg
22422 p081 c9 img_67850.jpg
22423 p081 c9 img_9684.jpg

22424 rows × 3 columns

Let's see how many images our training set contains.

n = df.shape[0]
print('total: {}'.format(n))
total: 22424

Next, let's check how many different drivers our training set contains.

df_by_subject = df.groupby('subject')
unique_subjects = list(df_by_subject.groups.keys())
num_subjects = len(unique_subjects)

print('number of subjects: {}'.format(num_subjects))
print('subjects: {}'.format(unique_subjects))
number of subjects: 26
subjects: ['p002', 'p012', 'p014', 'p015', 'p016', 'p021', 'p022', 'p024', 'p026', 'p035', 'p039', 'p041', 'p042', 'p045', 'p047', 'p049', 'p050', 'p051', 'p052', 'p056', 'p061', 'p064', 'p066', 'p072', 'p075', 'p081']

Let's choose 20% of the drivers for our validation set.

valid_frac = 0.2
random.shuffle(unique_subjects)

valid_subjects = unique_subjects[:int(num_subjects * valid_frac)]

print('valid subjects: {}'.format(valid_subjects))
valid subjects: ['p047', 'p002', 'p072', 'p052', 'p022']
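The idea behind this subject-wise split can be illustrated in isolation. The following is a minimal sketch on made-up toy data (the subject IDs and file names are hypothetical, and split_by_subject is a helper I introduce only for this illustration); it checks that no subject ends up on both sides of the split.

```python
import random


def split_by_subject(rows, valid_frac=0.2, seed=0):
    """Split (subject, image) rows so that each subject lands in exactly one set."""
    subjects = sorted({subject for subject, _ in rows})
    rng = random.Random(seed)  # local generator, leaves the global seed untouched
    rng.shuffle(subjects)
    valid_subjects = set(subjects[:int(len(subjects) * valid_frac)])
    train = [r for r in rows if r[0] not in valid_subjects]
    valid = [r for r in rows if r[0] in valid_subjects]
    return train, valid


# Toy data: 10 hypothetical subjects with 3 images each.
rows = [(f"p{s:03d}", f"img_{s}_{i}.jpg") for s in range(10) for i in range(3)]
train, valid = split_by_subject(rows)

# No subject may appear on both sides of the split.
assert {s for s, _ in train}.isdisjoint({s for s, _ in valid})
```

Because whole subjects are moved, the validation share of the images depends on how many images each chosen subject has, which is why it needs to be checked separately.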

However, we should also make sure that:

  • The number of images showing the selected drivers is roughly 20% of all training images
  • The set of images showing the selected drivers contains a similar number of images for each class

df_valid = df.loc[df['subject'].isin(valid_subjects)]
n_valid = df_valid.shape[0]

print('valid total: {} ({}%)'.format(n_valid, round(100. * n_valid / n, 2)))
valid total: 3879 (17.3%)

It's not exactly 20%, but 17.3% should be okay.

df_valid.groupby(['classname']).size().reset_index(name='counts')
classname counts
0 c0 420
1 c1 427
2 c2 415
3 c3 400
4 c4 402
5 c5 371
6 c6 407
7 c7 325
8 c8 316
9 c9 396

Furthermore, our set of selected images indeed contains a similar number of images for each class. As a result, we can use this set as our validation set.

Then we need to adjust the data frame in the following way to be able to use it for loading the training data:

  • The img column needs to contain the file paths relative to the training folder, not just the image file names
  • We need an additional column is_valid that specifies whether an image belongs to the validation set or not (if not, it belongs to the training set)

df['img'] =  df['classname'] + '/' + df['img']

# add is_valid column to indicate which images belong to the valid set
df['is_valid'] = df['subject'].isin(valid_subjects)

# remove subjects column
df = df.drop(columns=['subject']); df
classname img is_valid
0 c0 c0/img_44733.jpg True
1 c0 c0/img_72999.jpg True
2 c0 c0/img_25094.jpg True
3 c0 c0/img_69092.jpg True
4 c0 c0/img_92629.jpg True
... ... ... ...
22419 c9 c9/img_56936.jpg False
22420 c9 c9/img_46218.jpg False
22421 c9 c9/img_25946.jpg False
22422 c9 c9/img_67850.jpg False
22423 c9 c9/img_9684.jpg False

22424 rows × 3 columns

Next, we need to load the test data. We can load it as an ImageList using the from_folder method.

test_ds = ImageList.from_folder(test_ds_path); test_ds
ImageList (79726 items)
Image (3, 480, 640),Image (3, 480, 640),Image (3, 480, 640),Image (3, 480, 640),Image (3, 480, 640)
Path: ../data/state-farm-ddd/imgs/test

Finally, we need to create a data bunch object using the data block API from fastai. The data bunch object contains the training, validation and test set as well as various data transformations. I created the data bunch object in my last blog post in almost the same way. This time, however, I also need to call add_test to specify our test set.

bs = 32
tfms = get_transforms(do_flip=False)

data = (ImageList
  .from_df(df, train_ds_path, cols=1)
  .split_from_df(col=2)
  .label_from_df(cols=0)
  .add_test(test_ds) 
  .transform(tfms=tfms, size=299)
  .databunch(bs=bs)
).normalize(imagenet_stats)

Let's check which classes we have.

data.classes
['c0', 'c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9']

These are the 10 classes of driver distraction as expected. Let's also take a look at a few images of our loaded data.

data.show_batch(rows=3, figsize=(9,8))

Finally, let's check how many images our training, validation and test set contain.

len(data.train_ds)
18545
len(data.valid_ds)
3879
len(data.test_ds)
79726

Create Predictions

After loading the data we have to create the learner. We need to make sure that we use the same model architecture here that we already used for training our model on Google Colab. Otherwise we won't be able to load our model that we trained on Google Colab. I used a ResNet50.

learn = cnn_learner(data, models.resnet50, metrics=accuracy)

Now, we can load our model.

learn.load('stage-2');

To make sure that we loaded the model successfully we can check its performance on the validation set using the validate method. Since I could only use the CPU on my local machine, this took about 30 minutes to run through.

learn.validate()
[0.33628428, tensor(0.9330)]

As a result we get two numbers. The first number is the loss reached on the validation set. The second number is the accuracy of our model on the validation set. Since our final model also reached an accuracy of more than 93% last time, we can presume that the model was loaded successfully.

Now, let's use the model to make predictions for the test set images. To do so we need to use the get_preds method. On my CPU it took approximately 12 hours to run through.

probs, _ = learn.get_preds(ds_type=DatasetType.Test)
probs.shape
torch.Size([79726, 10])

As a result we obtain the 10 class probabilities and the target (not the prediction!) for each image. However, since we don't have targets for our test images, we only get a zero target for each image. Thus, I discard the targets by assigning them to _.

To obtain the predictions we simply need to check for each image which of the 10 classes reached the highest probability.

probs_npy = probs.detach().numpy()
np.argmax(probs_npy, axis=1)
array([5, 6, 1, 1, ..., 4, 8, 1, 0])
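The indices returned by argmax refer to positions in data.classes. As a small sketch on toy numbers (three made-up classes and probabilities, not the real predictions), the indices can be mapped back to class names like this:

```python
import numpy as np

# Toy stand-ins for data.classes and the probability matrix.
classes = ['c0', 'c1', 'c2']
probs = np.array([
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.2, 0.3, 0.5],
])

pred_idx = probs.argmax(axis=1)               # index of the most likely class per image
pred_labels = [classes[i] for i in pred_idx]  # map the indices to class names
confidence = probs.max(axis=1)                # probability of the predicted class

print(pred_labels)
```

Note that the competition asks for the full probability matrix and not for these hard labels, so this mapping is only useful for inspecting the predictions.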

To be on the safe side I also decided to store the predictions in a file. Since it took 12 hours to compute them, I wouldn't want to compute them again!

np.save('statefarm_probs.npy', probs_npy)

Submit to Kaggle

Now we are ready to submit our predictions to Kaggle. Let's check in which format the competition expects our predictions. To do so we can check the competition's website or the sample submission file that is part of the downloaded dataset.

pd.read_csv('../data/state-farm-ddd/sample_submission.csv')
img c0 c1 c2 c3 c4 c5 c6 c7 c8 c9
0 img_1.jpg 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
1 img_10.jpg 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
2 img_100.jpg 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
3 img_1000.jpg 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
4 img_100000.jpg 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
... ... ... ... ... ... ... ... ... ... ... ...
79721 img_99994.jpg 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
79722 img_99995.jpg 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
79723 img_99996.jpg 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
79724 img_99998.jpg 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
79725 img_99999.jpg 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1

79726 rows × 11 columns

As we can see, we need to submit a CSV file to Kaggle containing the 10 probabilities for each test image as well as the test image file name. First, if necessary, let's load our prediction probabilities again.

probabilities = np.load('statefarm_probs.npy')
probabilities.shape
(79726, 10)

Next, we need to get the file names of the test images. The data bunch object contains the paths of the test images.

data.test_ds.items
array([PosixPath('../data/state-farm-ddd/imgs/test/img_60161.jpg'),
       PosixPath('../data/state-farm-ddd/imgs/test/img_94786.jpg'),
       PosixPath('../data/state-farm-ddd/imgs/test/img_85853.jpg'),
       PosixPath('../data/state-farm-ddd/imgs/test/img_36327.jpg'), ...,
       PosixPath('../data/state-farm-ddd/imgs/test/img_68524.jpg'),
       PosixPath('../data/state-farm-ddd/imgs/test/img_67617.jpg'),
       PosixPath('../data/state-farm-ddd/imgs/test/img_30997.jpg'),
       PosixPath('../data/state-farm-ddd/imgs/test/img_21642.jpg')], dtype=object)

However, we only need the file names. Let's extract them from the paths.

get_file_name = lambda p: p.name
vfunc = np.vectorize(get_file_name) 

img_names = pd.Series(vfunc(data.test_ds.items)); img_names
0        img_60161.jpg
1        img_94786.jpg
2        img_85853.jpg
3        img_36327.jpg
4        img_39014.jpg
             ...      
79721    img_77404.jpg
79722    img_68524.jpg
79723    img_67617.jpg
79724    img_30997.jpg
79725    img_21642.jpg
Length: 79726, dtype: object

Next, let's put the probabilities into a pandas data frame.

df = pd.DataFrame(probabilities)
df.columns = data.classes

Then, let's add an additional column containing the image file names.

df['img'] = img_names; df
c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 img
0 9.064603e-08 9.556512e-10 1.208677e-08 6.626375e-06 4.161629e-07 9.980783e-01 1.425728e-07 2.311186e-06 1.048160e-03 8.639126e-04 img_60161.jpg
1 1.341862e-11 1.325500e-11 2.737421e-09 1.233930e-09 2.569607e-10 2.989499e-08 1.000000e+00 2.460269e-11 8.222596e-11 1.081464e-11 img_94786.jpg
2 1.278042e-07 9.999648e-01 1.621474e-07 1.249684e-06 2.125959e-06 1.716223e-08 3.776129e-09 3.145694e-05 1.170369e-07 9.460091e-09 img_85853.jpg
3 1.983368e-07 9.999930e-01 6.280829e-06 1.208340e-08 4.862862e-08 5.451320e-09 7.702554e-08 2.440847e-09 2.714422e-07 1.117997e-07 img_36327.jpg
4 5.995354e-04 2.111001e-07 1.837754e-07 1.218117e-03 3.950001e-07 9.981633e-01 1.178386e-06 1.751037e-07 1.668345e-05 3.223242e-07 img_39014.jpg
... ... ... ... ... ... ... ... ... ... ... ...
79721 4.511207e-01 5.139207e-02 3.446462e-03 2.381562e-03 2.746080e-04 2.520865e-01 2.002078e-01 4.145120e-03 1.319973e-02 2.174546e-02 img_77404.jpg
79722 7.303311e-06 1.809456e-05 2.658817e-06 1.319719e-03 9.974797e-01 3.963906e-07 1.893191e-05 8.105447e-05 9.921284e-04 7.999525e-05 img_68524.jpg
79723 3.712073e-07 5.125133e-07 1.103094e-05 1.346865e-06 9.106328e-07 4.559149e-07 9.792532e-02 8.688941e-06 9.020439e-01 7.445801e-06 img_67617.jpg
79724 1.119303e-02 9.558834e-01 7.588323e-04 2.264025e-02 1.162238e-03 9.801583e-04 7.104538e-03 3.813615e-05 1.180822e-05 2.274428e-04 img_30997.jpg
79725 9.999969e-01 2.245683e-09 2.918841e-08 1.102728e-07 3.436966e-08 1.067264e-08 8.381981e-11 1.447656e-09 9.670017e-08 2.855863e-06 img_21642.jpg

79726 rows × 11 columns

However, the column containing the image file names must be the first column. Let's change the column order.

cols = df.columns.tolist()
cols = [cols[-1]] + cols[:-1]; cols
['img', 'c0', 'c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9']
df = df.reindex(columns=cols); df
img c0 c1 c2 c3 c4 c5 c6 c7 c8 c9
0 img_60161.jpg 9.064603e-08 9.556512e-10 1.208677e-08 6.626375e-06 4.161629e-07 9.980783e-01 1.425728e-07 2.311186e-06 1.048160e-03 8.639126e-04
1 img_94786.jpg 1.341862e-11 1.325500e-11 2.737421e-09 1.233930e-09 2.569607e-10 2.989499e-08 1.000000e+00 2.460269e-11 8.222596e-11 1.081464e-11
2 img_85853.jpg 1.278042e-07 9.999648e-01 1.621474e-07 1.249684e-06 2.125959e-06 1.716223e-08 3.776129e-09 3.145694e-05 1.170369e-07 9.460091e-09
3 img_36327.jpg 1.983368e-07 9.999930e-01 6.280829e-06 1.208340e-08 4.862862e-08 5.451320e-09 7.702554e-08 2.440847e-09 2.714422e-07 1.117997e-07
4 img_39014.jpg 5.995354e-04 2.111001e-07 1.837754e-07 1.218117e-03 3.950001e-07 9.981633e-01 1.178386e-06 1.751037e-07 1.668345e-05 3.223242e-07
... ... ... ... ... ... ... ... ... ... ... ...
79721 img_77404.jpg 4.511207e-01 5.139207e-02 3.446462e-03 2.381562e-03 2.746080e-04 2.520865e-01 2.002078e-01 4.145120e-03 1.319973e-02 2.174546e-02
79722 img_68524.jpg 7.303311e-06 1.809456e-05 2.658817e-06 1.319719e-03 9.974797e-01 3.963906e-07 1.893191e-05 8.105447e-05 9.921284e-04 7.999525e-05
79723 img_67617.jpg 3.712073e-07 5.125133e-07 1.103094e-05 1.346865e-06 9.106328e-07 4.559149e-07 9.792532e-02 8.688941e-06 9.020439e-01 7.445801e-06
79724 img_30997.jpg 1.119303e-02 9.558834e-01 7.588323e-04 2.264025e-02 1.162238e-03 9.801583e-04 7.104538e-03 3.813615e-05 1.180822e-05 2.274428e-04
79725 img_21642.jpg 9.999969e-01 2.245683e-09 2.918841e-08 1.102728e-07 3.436966e-08 1.067264e-08 8.381981e-11 1.447656e-09 9.670017e-08 2.855863e-06

79726 rows × 11 columns
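As an alternative to appending the column and reordering afterwards, pandas can place a new column at a given position directly with DataFrame.insert. The following sketch uses a tiny made-up data frame rather than the real predictions:

```python
import pandas as pd

# Toy stand-ins for the class probabilities and the image file names.
classes = ['c0', 'c1', 'c2']
df = pd.DataFrame([[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]], columns=classes)
img_names = pd.Series(['img_1.jpg', 'img_2.jpg'])

# Insert the file names as the first column (this modifies df in place).
df.insert(0, 'img', img_names)

print(df.columns.tolist())
```

Both approaches lead to the same column order, so which one to use is mostly a matter of taste.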

Okay. Finally, our data frame looks as expected. Let's save it as a CSV file.

df.to_csv('submission.csv', index=False)

Let's load it again to make sure everything still looks fine.

submission = pd.read_csv('submission.csv'); submission
img c0 c1 c2 c3 c4 c5 c6 c7 c8 c9
0 img_60161.jpg 9.064603e-08 9.556512e-10 1.208677e-08 6.626375e-06 4.161629e-07 9.980784e-01 1.425728e-07 2.311186e-06 1.048160e-03 8.639126e-04
1 img_94786.jpg 1.341862e-11 1.325500e-11 2.737421e-09 1.233930e-09 2.569607e-10 2.989499e-08 1.000000e+00 2.460269e-11 8.222596e-11 1.081464e-11
2 img_85853.jpg 1.278042e-07 9.999648e-01 1.621474e-07 1.249684e-06 2.125959e-06 1.716222e-08 3.776129e-09 3.145694e-05 1.170369e-07 9.460091e-09
3 img_36327.jpg 1.983368e-07 9.999930e-01 6.280829e-06 1.208340e-08 4.862863e-08 5.451320e-09 7.702554e-08 2.440847e-09 2.714422e-07 1.117997e-07
4 img_39014.jpg 5.995354e-04 2.111001e-07 1.837754e-07 1.218117e-03 3.950001e-07 9.981633e-01 1.178386e-06 1.751037e-07 1.668344e-05 3.223242e-07
... ... ... ... ... ... ... ... ... ... ... ...
79721 img_77404.jpg 4.511207e-01 5.139207e-02 3.446462e-03 2.381562e-03 2.746080e-04 2.520865e-01 2.002078e-01 4.145120e-03 1.319973e-02 2.174546e-02
79722 img_68524.jpg 7.303311e-06 1.809456e-05 2.658817e-06 1.319719e-03 9.974797e-01 3.963906e-07 1.893191e-05 8.105447e-05 9.921284e-04 7.999525e-05
79723 img_67617.jpg 3.712073e-07 5.125133e-07 1.103094e-05 1.346866e-06 9.106328e-07 4.559149e-07 9.792532e-02 8.688941e-06 9.020439e-01 7.445801e-06
79724 img_30997.jpg 1.119303e-02 9.558834e-01 7.588323e-04 2.264025e-02 1.162238e-03 9.801583e-04 7.104538e-03 3.813615e-05 1.180822e-05 2.274428e-04
79725 img_21642.jpg 9.999969e-01 2.245683e-09 2.918841e-08 1.102728e-07 3.436966e-08 1.067264e-08 8.381980e-11 1.447656e-09 9.670017e-08 2.855863e-06

79726 rows × 11 columns
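Beyond eyeballing the table, a few properties of the file can be checked automatically before submitting. The helper below is a hypothetical sketch using only the standard library; the function name and the tolerance are my own choices. It verifies the header, looks for duplicate image names and checks that each row of probabilities sums to roughly 1.

```python
import csv
import math


def check_submission(csv_path, classes, tol=1e-3):
    """Return the number of rows after validating header, duplicates and row sums."""
    seen = set()
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        if header != ["img"] + list(classes):
            raise ValueError(f"unexpected header: {header}")
        for row in reader:
            name, probs = row[0], [float(x) for x in row[1:]]
            if name in seen:
                raise ValueError(f"duplicate image name: {name}")
            seen.add(name)
            if not math.isclose(sum(probs), 1.0, abs_tol=tol):
                raise ValueError(f"probabilities of {name} sum to {sum(probs)}")
    return len(seen)
```

For this tutorial the call would be check_submission('submission.csv', data.classes), and the returned row count should match the number of test images.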

Now, we can submit the created CSV file containing our predictions to Kaggle. To do this we usually have two options. We could either upload it manually via the website or use the Kaggle API. For some reason the manual upload via the website didn't work for me. Thus, I used the Kaggle API.

! kaggle competitions submit state-farm-distracted-driver-detection -f submission.csv -m "first submission"
100%|███████████████████████████████████████| 11.0M/11.0M [00:47<00:00, 242kB/s]
Successfully submitted to State Farm Distracted Driver Detection

I reached a score of 0.57751. This number is not the accuracy but the multi-class logarithmic loss (lower is better), which is the evaluation metric used for this competition. Since the competition is already closed, my submission doesn't appear on the competition's leaderboard. However, if we check the private leaderboard on the competition website, we can see that our model would rank among the top 360 submissions. That's not a bad result for such a quick solution! There are a few options to improve our model. You can find a few ideas here, here and here.
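To get a feeling for this metric, here is a minimal sketch of the multi-class logarithmic loss in plain Python. The clipping value eps is an assumption on my side, and Kaggle's exact implementation may differ in such details. The toy numbers are made up.

```python
import math


def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)


# Two images with true classes 0 and 1; both models are wrong on the second image.
y_true = [0, 1]
overconfident = [[0.999, 0.001], [0.999, 0.001]]
hedged = [[0.9, 0.1], [0.9, 0.1]]

print(multiclass_log_loss(y_true, overconfident))
print(multiclass_log_loss(y_true, hedged))
```

In this toy example the overconfident predictions get the higher (worse) loss, although both sets of predictions have the same accuracy. Under this definition, a uniform guess of 0.1 for each of the 10 classes, as in the sample submission, would give a loss of ln(10), which is about 2.30.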