Google Colab is a platform on which you can run GPU- (and TPU-) accelerated programs in a Jupyter-notebook-like environment. As a result it is ideal for machine learning education and basic research. The platform is free to use and has TensorFlow and fastai pre-installed.

However, before we can train any machine learning models we need data. Kaggle is a platform that hosts a large number of datasets that can be used for machine learning. In this blog post I want to show how to download data from Kaggle on Google Colab. It consists of the following steps:

  1. Set up the Kaggle API
  2. Download the data

Set up the Kaggle API

There are several ways to download data from Kaggle. An easy way is to use the Kaggle API. To set up the Kaggle API on Google Colab we need to run several steps. First of all we need a Kaggle API token. If you already have one, you can simply use it. However, if you do not have one or you want to create a new one, you need to do the following:

  1. You need to log into Kaggle and go to My Account. Then scroll down to the API section and click on Expire API Token to remove previous tokens.
  2. Then click on Create New API Token. This downloads a kaggle.json file to your local machine.

When you have the kaggle.json file, you can set up Kaggle on Google Colab. To do so, log into Google Colab and create a new notebook there. Then execute the following steps.

First of all we need to install the kaggle package on Google Colab. To do so, run the following code in a Google Colab cell.

! pip install -q kaggle

Next we need to upload the kaggle.json file. We can do this by running the following code, which triggers a prompt that lets you upload a file.

from google.colab import files
files.upload()
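
To double-check that the upload worked, you can list the file. This is just an optional sanity check and assumes you kept the default file name kaggle.json.

! ls -l kaggle.json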

Since all the data we upload to Google Colab is lost after closing Google Colab, we need to save all data to our Google Drive. To mount our Google Drive space we need to run the following code. Google Colab will ask you to enter an authorization code. You can get one by clicking the corresponding link that appears after running the code.

from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
Go to this URL in a browser: <URL>

Enter your authorization code:
··········
Mounted at /content/gdrive

The home folder of your Google Drive is located under /content/gdrive/My Drive. We create a folder named .kaggle in that home folder. This is where we want to store the kaggle.json file.

! mkdir /content/gdrive/My\ Drive/.kaggle/

After creating the .kaggle folder we can move the kaggle.json file there.

! mv kaggle.json /content/gdrive/My\ Drive/.kaggle/

Change the permissions of the file so that only you can read and write it.

! chmod 600 /content/gdrive/My\ Drive/.kaggle/kaggle.json

However, the kaggle.json file is actually not in the correct location yet. The kaggle package looks for it under /root/.kaggle. I just wanted to store it in the home folder of our Google Drive first, so that it is not lost after we close Google Colab. As a result we do not need to upload the kaggle.json file again the next time we want to download data from Kaggle. We can simply copy it from the .kaggle folder in our Google Drive home folder instead.

Let's copy the .kaggle folder (including the kaggle.json) from there to /root now.

! cp -r /content/gdrive/My\ Drive/.kaggle/ /root/

Now everything should be ready. However, I had some issues with the kaggle package later on; it did not seem to be installed correctly. If you run into the same problem, you can usually solve it by re-installing the package. I used version 1.5.6 of the package here. If you need another version, you can look up which one you need here. Then simply adjust the following command by replacing 1.5.6 with the version number you picked. If you do not run into any problems, you can skip this step.

! pip uninstall -y kaggle
! pip install --upgrade pip
! pip install kaggle==1.5.6
! kaggle -v
Kaggle API 1.5.6

That's it! To check if we set up the Kaggle API correctly, we can run the following command. You should see a list of Kaggle datasets as output. If you have a problem with the kaggle package as mentioned above, you need to re-install it as described there.

! kaggle datasets list
ref                                                         title                                                size  lastUpdated          downloadCount  voteCount  usabilityRating  
----------------------------------------------------------  --------------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
sudalairajkumar/novel-corona-virus-2019-dataset             Novel Corona Virus 2019 Dataset                     338KB  2020-03-09 05:24:05          27345       1140  0.9705882        
kimjihoo/coronavirusdataset                                 Coronavirus-Dataset                                  29KB  2020-03-09 15:20:21           8220        405  1.0              
rupals/gpu-runtime                                          Segmentation GPU Kernel Performance Dataset           4MB  2020-03-01 10:04:27            115          8  0.8235294        
anlthms/dfdc-video-faces                                    DFDC video face crops - parts 4-8                     2GB  2020-03-03 01:22:58             43          7  0.5              
prakrutchauhan/indian-candidates-for-general-election-2019  Indian Candidates for General Election 2019         133KB  2020-03-03 07:01:53            471         33  0.7058824        
brunotly/foreign-exchange-rates-per-dollar-20002019         Foreign Exchange Rates 2000-2019                      1MB  2020-03-03 17:43:07            642         27  1.0              
shivamb/real-or-fake-fake-jobposting-prediction             [Real or Fake] Fake JobPosting Prediction            16MB  2020-02-29 08:23:34            797         50  1.0              
tapakah68/yandextoloka-water-meters-dataset                 Water Meters Dataset                                982MB  2020-02-29 10:59:49            113         11  0.9411765        
shank885/knife-dataset                                      Knife Dataset                                         1MB  2020-03-02 06:43:53            136          8  0.8125           
imdevskp/sars-outbreak-2003-complete-dataset                SARS 2003 Outbreak Complete Dataset                  10KB  2020-02-26 10:25:22            650         27  1.0              
imdevskp/ebola-outbreak-20142016-complete-dataset           Ebola 2014-2016 Outbreak Complete Dataset           101KB  2020-02-26 14:36:31            739         33  1.0              
gpiosenka/100-bird-species                                  130 Bird Species                                    899MB  2020-03-08 23:01:45            365         34  0.6875           
umangjpatel/pap-smear-datasets                              Pap Smear Datasets                                    6GB  2020-03-07 11:04:23             91          9  0.875            
jessemostipak/hotel-booking-demand                          Hotel booking demand                                  1MB  2020-02-13 01:27:20           7259        333  1.0              
tunguz/big-five-personality-test                            Big Five Personality Test                           159MB  2020-02-17 15:59:37           2145        161  0.9705882        
arindam235/startup-investments-crunchbase                   StartUp Investments (Crunchbase)                      3MB  2020-02-17 21:54:42           1704        106  0.88235295       
brendaso/2019-coronavirus-dataset-01212020-01262020         2019 Coronavirus dataset (January - February 2020)   53KB  2020-02-06 18:09:28           7602        395  0.7352941        
jamzing/sars-coronavirus-accession                          SARS CORONAVIRUS ACCESSION                            2MB  2020-02-18 15:49:34           2107        111  0.9411765        
timoboz/data-science-cheat-sheets                           Data Science Cheat Sheets                           596MB  2020-02-04 19:42:27           3730        240  0.875            
brandenciranni/democratic-debate-transcripts-2020           Democratic Debate Transcripts 2020                  565KB  2020-02-27 00:07:40            536         51  1.0              
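
If you are looking for something specific, the kaggle CLI also lets you filter this list with a search term via the -s option (at least in the version used here). For example, to search for datasets related to "titanic":

! kaggle datasets list -s titanic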

Download the Data

Now we can download data from Kaggle. As an example I chose the distracted driver detection dataset. It has the following URL:

https://www.kaggle.com/c/state-farm-distracted-driver-detection

The name of the competition is the last part of that URL: state-farm-distracted-driver-detection. We need this name for the command that downloads the data, which is shown below.

! kaggle competitions download -c state-farm-distracted-driver-detection
Downloading state-farm-distracted-driver-detection.zip to /content
100% 3.99G/4.00G [00:51<00:00, 55.9MB/s]
100% 4.00G/4.00G [00:51<00:00, 83.9MB/s]
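
Note that state-farm-distracted-driver-detection is a competition, which is why the command above uses kaggle competitions download -c. For a regular dataset, like the ones in the list further up, the command is slightly different. For example, downloading the hotel booking demand dataset (using its ref from the list) would look roughly like this:

! kaggle datasets download -d jessemostipak/hotel-booking-demand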

A file named state-farm-distracted-driver-detection.zip should now have been downloaded. Next let's create a folder in our Google Drive in which we want to put the data. The following path already existed in my Google Drive from previous projects: /content/gdrive/My Drive/fastai-v3/data. I decided to store the distracted driver detection dataset under this data folder as well. However, you can put the data wherever you want as long as it is in your Google Drive.

! mkdir /content/gdrive/My\ Drive/fastai-v3/data/state-farm-ddd
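
If the parent folders (fastai-v3/data in my case) do not exist in your Google Drive yet, you can create the whole path in one go with the -p flag:

! mkdir -p /content/gdrive/My\ Drive/fastai-v3/data/state-farm-ddd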

Move the zip file to the created folder.

! mv state-farm-distracted-driver-detection.zip /content/gdrive/My\ Drive/fastai-v3/data/state-farm-ddd

Go to that folder. !cd did not work for me here, but %cd did (you can read about it here): each ! command runs in its own subshell, so a directory change made with !cd is lost right away, while %cd changes the working directory of the notebook itself.

%cd /content/gdrive/My\ Drive/fastai-v3/data/state-farm-ddd
/content/gdrive/My Drive/fastai-v3/data/state-farm-ddd
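
To convince yourself of this, you can run a throwaway command like the one below: the cd only lives inside its own subshell, and the pwd check that follows shows that the notebook is still in the folder we switched to with %cd.

! cd /tmp && pwd  # prints /tmp, but the notebook's working directory is unchanged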

Make sure we are in the correct folder.

! pwd
/content/gdrive/My Drive/fastai-v3/data/state-farm-ddd

Now unzip the data in that folder.

! unzip state-farm-distracted-driver-detection.zip -d .
Streaming output truncated to the last 5 lines.
  inflating: ./imgs/train/c9/img_99801.jpg  
  inflating: ./imgs/train/c9/img_99927.jpg  
  inflating: ./imgs/train/c9/img_9993.jpg  
  inflating: ./imgs/train/c9/img_99949.jpg  
  inflating: ./sample_submission.csv  

The data should be there now. However, when I did this, it was not immediately available after the unzipping finished. Apparently, Google Drive sometimes needs some time until the files show up, so there is nothing to do but wait. To check whether all the files are there, go to Google Drive from time to time and look into the folder. When I tried it, the sample_submission.csv file was always the last file to appear, so once it showed up I knew all the files were there. It is important to check this directly in Google Drive and not in the folder view in Google Colab, because Google Colab seems to show the files even though Google Drive does not have them yet.

Note: You probably need to wait until all the files are available in Google Drive.

When all the files are there, we can remove the zip file.

! rm state-farm-distracted-driver-detection.zip

And we are finished!