Submitting to Kaggle
This tutorial shows how to create predictions for test set images using the fastai library and how to submit these predictions to Kaggle using the Kaggle API.
In my last blog post I showed how to train a CNN-based model for classifying images of distracted drivers using the fastai library. The images came from the training set of the State Farm Distracted Driver Detection competition on Kaggle. In this blog post I want to show how to create predictions for the images of the competition's test set using the trained model and how to submit these predictions to Kaggle.
Neural network models usually need to be trained on a CUDA-capable GPU. Unfortunately, my local machine doesn't have one. However, there are cloud providers that offer machines with such GPUs. Google Colab is one of them and even offers free GPU access, so I used it to train my model. You can read in my last blog post how I did this.
Now I need to use my trained model to make predictions for the images of the test set. In theory I could do this on Google Colab as well. In practice, however, I ran into a problem: the test set folder of the State Farm Distracted Driver Detection competition contains a huge number of images, and Google Colab seems to have trouble accessing folders with many files. Unfortunately, I haven't found a good solution to this problem yet. The only options that came to mind were: 1) splitting the test set into multiple groups and storing each group in its own folder, or 2) downloading my trained model from Google Colab to my local machine and creating the predictions there. The first option has the disadvantage that I would need to do the splitting on my local machine and then upload each folder to Google Colab, since I can't even access the test set folder on Google Colab to split it there. This seems like a lot of work, and uploading the folders could take a long time. The second option has the disadvantage that creating the predictions will take a long time, because I can only use the CPU and not a GPU. To avoid time-consuming experiments to find out how many files per folder Google Colab can handle, I chose the second option. It isn't a great option either, but let's try it.
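For completeness, here is a rough sketch of what the first option (splitting a folder into smaller subfolders) could look like on a local machine. It is not what this tutorial does; the helper split_into_subfolders, the part_000 naming scheme and the chunk size are hypothetical, and I don't know which folder size Google Colab would actually handle.

```python
import shutil
import tempfile
from pathlib import Path


def split_into_subfolders(src_dir, chunk_size):
    """Move the files in src_dir into subfolders part_000, part_001, ...

    Each subfolder receives at most chunk_size files. Returns the list of
    created subfolders.
    """
    src_dir = Path(src_dir)
    # Sort for a deterministic assignment of files to subfolders.
    files = sorted(p for p in src_dir.iterdir() if p.is_file())
    subfolders = []
    for start in range(0, len(files), chunk_size):
        sub = src_dir / 'part_{:03d}'.format(start // chunk_size)
        sub.mkdir()
        for f in files[start:start + chunk_size]:
            shutil.move(str(f), str(sub / f.name))
        subfolders.append(sub)
    return subfolders


# Toy demonstration on a temporary folder with 10 empty placeholder files.
demo_dir = Path(tempfile.mkdtemp())
for i in range(10):
    (demo_dir / 'img_{}.jpg'.format(i)).touch()

parts = split_into_subfolders(demo_dir, chunk_size=4)
sizes = [len(list(p.iterdir())) for p in parts]
print(sizes)  # [4, 4, 2]
```

Note that this moves the files rather than copying them, so it should be tried on a copy of the data first.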
The rest of this tutorial describes:
- How to load the test set using fastai
- How to create the predictions for the test set images using fastai
- How to submit the predictions to Kaggle using the Kaggle API
I described how to load the training data using the fastai library in my last blog post. Now, I want to show how to load the test data. However, since loading the test data is strongly coupled to loading the training data when using fastai, I'm going to briefly review how to load the training data as well.
First, we need to run some magics and check the versions of PyTorch and fastai as always. We should make sure that we use the same versions of PyTorch and fastai that we also used for training the model on Google Colab. I recommend installing the libraries using conda on your local machine. You can read more about how to do that here.
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import torch
import torchvision
import fastai
print('torch version: {}'.format(torch.__version__))
print('torchvision version: {}'.format(torchvision.__version__))
print('fastai version: {}'.format(fastai.__version__))
We can also check whether we have GPU access. On my local machine I don't. If you do, creating the predictions will run a lot faster for you later.
torch.cuda.is_available()
Then, we need to load some libraries.
from fastai.vision import *
from fastai.metrics import accuracy
import pandas as pd
import random
Steps such as creating the validation set or initializing the model parameters involve randomness. However, we can seed the random number generators so that they produce the same numbers on every run. In Python and PyTorch we can achieve that with the following code.
torch.manual_seed(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
random.seed(0)
You can read more about reproducibility in PyTorch here.
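To illustrate why the seed matters for the validation split later on, here is a minimal sketch using only Python's random module. The subject IDs and the helper shuffled_subjects are made up for illustration; the point is only that the same seed yields the same shuffle order, and therefore the same selection of drivers.

```python
import random


def shuffled_subjects(seed):
    """Return a shuffled toy subject list for the given seed."""
    subjects = ['p{:03d}'.format(i) for i in range(10)]  # hypothetical IDs
    random.seed(seed)
    random.shuffle(subjects)
    return subjects


first = shuffled_subjects(0)
second = shuffled_subjects(0)
other = shuffled_subjects(1)  # a different seed may give a different order

print(first == second)  # True: same seed, same order
```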
After downloading the dataset and the model that I trained on Google Colab, I made a project folder. In this project folder I created a data and an nbs subfolder. In the nbs subfolder I put the Jupyter notebook that I used to run the code shown in this tutorial here. In the data subfolder I put the downloaded dataset. Furthermore, I needed to create a folder named models in the training folder of the downloaded dataset and put my model in there. This is necessary, since fastai looks for models in that folder. When we saved our model on Google Colab, fastai created that folder and stored the model file in there automatically.
Let's specify the path to the training and the path to the test set folder.
base_dir = '../data/state-farm-ddd/'
train_ds_path = Path(base_dir + 'imgs/train')
test_ds_path = Path(base_dir + 'imgs/test')
Let's take a look into the training folder.
train_ds_path.ls()
As we can see, the training folder contains the 10 subfolders c0 to c9 as well as the models subfolder. In the subfolders c0 to c9 we can find the training images of the 10 classes of driver distraction. The models subfolder contains our model that we trained on Google Colab.
Let's also take a look into the test folder.
test_ds_path.ls()
We have many images here. Note that, unlike the training set, the test set is not split into folders representing the image classes. The Kaggle competition asks us to predict these classes.
Then, we need to take a certain number of images from the training set to create our validation set. To be able to train a proper model we need to make sure that no driver appears in both the training set and the validation set. See my last blog post for more details on why this is important. To find out which images show which drivers (also called subjects), we need to load the driver_imgs_list.csv file into a pandas data frame.
df = pd.read_csv(base_dir + 'driver_imgs_list.csv'); df
Let's see how many images our training set contains.
n = df.shape[0]
print('total: {}'.format(n))
Next, let's check how many different drivers our training set contains.
df_by_subject = df.groupby('subject')
unique_subjects = list(df_by_subject.groups.keys())
num_subjects = len(unique_subjects)
print('number of subjects: {}'.format(num_subjects))
print('subjects: {}'.format(unique_subjects))
Let's choose 20% of the drivers for our validation set.
valid_frac = 0.2
random.shuffle(unique_subjects)
valid_subjects = unique_subjects[:int(num_subjects * valid_frac)]
print('valid subjects: {}'.format(valid_subjects))
However, we also should make sure that:
- The number of images showing the selected drivers is approximately 20% of all training images
- The set of images showing the selected drivers contains images of all classes in similar proportions
df_valid = df.loc[df['subject'].isin(valid_subjects)]
n_valid = df_valid.shape[0]
print('valid total: {} ({}%)'.format(n_valid, round(100. * n_valid / n, 2)))
It's not exactly 20%, but 17.3% should be okay.
df_valid.groupby(['classname']).size().reset_index(name='counts')
Furthermore, our set of selected images indeed contains a similar number of images of each class. As a result, we can use this set as our validation set.
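The requirement that no driver appears in both sets can also be checked explicitly. The following is a minimal sketch on made-up rows that mimic the subject, classname and img columns of driver_imgs_list.csv; the toy data and the variable names are assumptions for illustration only.

```python
import random

# Toy stand-in for driver_imgs_list.csv: (subject, classname, img) rows.
# The subject IDs, class names and file names here are made up.
rows = [('p{:02d}'.format(s), 'c{}'.format(c), 'img_{}_{}.jpg'.format(s, c))
        for s in range(10) for c in range(3)]

# Select 20% of the drivers for validation, as in the tutorial.
subjects = sorted({subject for subject, _, _ in rows})
random.seed(0)
random.shuffle(subjects)
valid_subjects = set(subjects[:int(len(subjects) * 0.2)])

train_rows = [r for r in rows if r[0] not in valid_subjects]
valid_rows = [r for r in rows if r[0] in valid_subjects]

# Leakage check: the two sets must not share any driver.
overlap = {r[0] for r in train_rows} & {r[0] for r in valid_rows}
print('overlapping drivers:', overlap)

# Share of images that ended up in the validation set.
valid_share = len(valid_rows) / len(rows)
print('validation share: {:.1%}'.format(valid_share))
```

Because the split is made on drivers rather than on images, the overlap is empty by construction; the check mainly guards against mistakes when the split is changed later.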
Then we need to adjust the dataframe in the following way to be able to use it for loading the training data:
- The img column needs to contain not only the image file names but also the file paths
- We need an additional column is_valid that specifies whether an image belongs to the validation set or not (if not, it belongs to the training set)
df['img'] = df['classname'] + '/' + df['img']
# add is_valid column to indicate which images belong to the valid set
df['is_valid'] = df['subject'].isin(valid_subjects)
# remove subjects column
df = df.drop(columns=['subject']); df
Next, we need to load the test data. We can load it as an ImageList using the from_folder method.
test_ds = ImageList.from_folder(test_ds_path); test_ds
Finally, we need to create a data bunch object using the data block API from fastai. The data bunch object contains the training, validation and test set as well as various data transformations. I created the data bunch object in almost the same way in my last blog post. The only difference is that I now also call add_test to specify our test set.
bs = 32
tfms = get_transforms(do_flip=False)
data = (ImageList
.from_df(df, train_ds_path, cols=1)
.split_from_df(col=2)
.label_from_df(cols=0)
.add_test(test_ds)
.transform(tfms=tfms, size=299)
.databunch(bs=bs)
).normalize(imagenet_stats)
Let's check which classes we have.
data.classes
These are the 10 classes of driver distraction as expected. Let's also take a look at a few images of our loaded data.
data.show_batch(rows=3, figsize=(9,8))
Finally, let's check how many images our training, validation and test set contain.
len(data.train_ds)
len(data.valid_ds)
len(data.test_ds)
After loading the data we have to create the learner. We need to make sure that we use the same model architecture here that we already used for training our model on Google Colab. Otherwise we won't be able to load our model that we trained on Google Colab. I used a ResNet50.
learn = cnn_learner(data, models.resnet50, metrics=accuracy)
Now, we can load our model.
learn.load('stage-2');
To make sure that we loaded the model successfully we can check its performance on the validation set using the validate method. Since I could only use the CPU on my local machine, this took about 30 minutes.
learn.validate()
As a result we get two numbers. The first number is the loss reached on the validation set. The second number is the accuracy of our model on the validation set. Since our final model also reached an accuracy of more than 93% last time, we can presume that the model was loaded successfully.
Now, let's use the model to make predictions for the test set images. For this we use the get_preds method. On my CPU it took approximately 12 hours to run through.
probs, _ = learn.get_preds(ds_type=DatasetType.Test)
probs.shape
As a result we obtain the 10 class probabilities and the target (not the prediction!) for each image. However, since we don't have targets for our test images, we only get a zero target for each image. Thus, I discard the targets by assigning them to _.
To obtain the predictions we simply need to check for each image which of the 10 classes reached the highest probability.
probs_npy = probs.detach().numpy()
np.argmax(probs_npy, axis=1)
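The argmax only returns class indices. As a small sketch on toy data, this is how the indices could be mapped back to class names; the three-class list stands in for data.classes, and the probabilities are made up.

```python
import numpy as np

# Hypothetical class list in the same style as data.classes.
classes = ['c0', 'c1', 'c2']

# Toy probabilities for four images (each row sums to 1).
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.4, 0.3],
    [0.5, 0.3, 0.2],
])

pred_idx = probs.argmax(axis=1)               # index of the most likely class
pred_labels = [classes[i] for i in pred_idx]  # index -> class name
confidence = probs.max(axis=1)                # probability of that class

print(pred_labels)  # ['c0', 'c2', 'c1', 'c0']
```

For the Kaggle submission itself the class labels are not needed, since the competition expects the full probabilities.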
To be on the safe side I also decided to store the predictions in a file. Since it took 12 hours to compute them, I wouldn't want to compute them again!
np.save('statefarm_probs.npy', probs_npy)
Now we are ready to submit our predictions to Kaggle. Let's check in which format the competition expects our predictions. To find out, we can check the competition's website or the sample submission file that is part of the downloaded dataset.
pd.read_csv('../data/state-farm-ddd/sample_submission.csv')
As we can see, we need to submit a CSV file to Kaggle containing the 10 probabilities for each test image as well as the test image file name. First, if necessary, let's load our prediction probabilities again.
probabilities = np.load('statefarm_probs.npy')
probabilities.shape
Next, we need to get the file names of the test images. The data bunch object contains the paths of the test images.
data.test_ds.items
However, we only need the file names. Let's extract them from the paths.
get_file_name = lambda p: p.name
vfunc = np.vectorize(get_file_name)
img_names = pd.Series(vfunc(data.test_ds.items)); img_names
Next, let's put the probabilities into a pandas data frame.
df = pd.DataFrame(probabilities)
df.columns = data.classes
Then, let's add an additional column containing the image file names.
df['img'] = img_names; df
However, the column containing the image file names must be the first column. Let's change the column order.
cols = df.columns.tolist()
cols = [cols[-1]] + cols[:-1]; cols
df = df.reindex(columns=cols); df
Okay. Finally, our data frame looks as expected. Let's save it as a CSV file.
df.to_csv('submission.csv', index=False)
Let's load it again to make sure everything still looks fine.
submission = pd.read_csv('submission.csv'); submission
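Besides looking at the data frame, a few properties can be checked programmatically before submitting. The following is a minimal sketch; the helper check_submission, its tolerance and the toy data are hypothetical and only introduced for illustration. It assumes the file name column is called img, as above, and that the expected column list would be taken from the sample submission file.

```python
import numpy as np
import pandas as pd


def check_submission(submission, expected_columns, tol=1e-4):
    """Return a list of problems found in a submission data frame."""
    problems = []
    if list(submission.columns) != list(expected_columns):
        problems.append('unexpected columns or column order')
        return problems
    if submission['img'].duplicated().any():
        problems.append('duplicate image names')
    probs = submission.drop(columns=['img']).to_numpy(dtype=float)
    if np.isnan(probs).any():
        problems.append('missing probabilities')
    if ((probs < 0) | (probs > 1)).any():
        problems.append('probabilities outside [0, 1]')
    if not np.allclose(probs.sum(axis=1), 1.0, atol=tol):
        problems.append('rows do not sum to 1')
    return problems


# Toy data with three classes instead of ten.
expected = ['img', 'c0', 'c1', 'c2']
good = pd.DataFrame({'img': ['img_1.jpg', 'img_2.jpg'],
                     'c0': [0.7, 0.1], 'c1': [0.2, 0.1], 'c2': [0.1, 0.8]})
bad = good[['c0', 'c1', 'c2', 'img']]  # image column in the wrong position

print(check_submission(good, expected))  # []
print(check_submission(bad, expected))   # ['unexpected columns or column order']
```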
Now, we can submit the created CSV file containing our predictions to Kaggle. To do this we usually have two options: we can either upload it manually via the website or use the Kaggle API. For some reason the manual upload via the website didn't work for me. Thus, I used the Kaggle API.
! kaggle competitions submit state-farm-distracted-driver-detection -f submission.csv -m "first submission"
I reached a score of 0.57751. This number doesn't express the accuracy but the multi-class logarithmic loss, which is the evaluation metric used for this competition. Since the competition is already closed, my submission doesn't appear on the competition's leaderboard. However, if we check the private leaderboard on the competition website, we can see that our model would be among the top 360 submissions. This is not a bad result for such a quick solution! There are a few options to improve our model. You can find a few ideas here, here and here.
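To get a feeling for the multi-class logarithmic loss, here is a minimal sketch in plain Python. The function multiclass_log_loss and the toy data are made up for illustration; the per-row rescaling and the clipping with eps reflect my understanding of how such a metric is typically evaluated and should be treated as assumptions, not as Kaggle's exact implementation.

```python
import math


def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log probability assigned to the true class.

    y_true: list of true class indices, one per image.
    probs:  list of per-image probability lists.
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        p = row[label] / sum(row)      # rescale the row to sum to 1
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)


# Toy example with three images and three classes.
y_true = [0, 2, 1]
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.4, 0.3],
]
loss = multiclass_log_loss(y_true, probs)
print(round(loss, 4))  # 0.4987

# Baseline: predicting 0.1 for each of 10 classes gives ln(10).
baseline = multiclass_log_loss([0], [[0.1] * 10])
print(round(baseline, 4))  # 2.3026
```

Lower is better: a model that always predicts a uniform 0.1 for each of the 10 classes would score about 2.30 under this definition, so 0.57751 is well below that baseline.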