Tutorials for accessing data.

Quick-start

Click here for a complete repo of Kaggle tutorials for the BLASTNet datasets.

Browsing Data

Click here for a quick-start tutorial on browsing BLASTNet data. Make sure to click “Copy & Edit” to run the code on Kaggle’s cloud computing platform. This is the quickest way to interact with our datasets.

Training and Testing ML Models

Click here when you’re ready to train on multiple GPUs with TensorFlow. Make sure to click “Copy & Edit” to run the code on Kaggle’s cloud computing platform.

Using your own workstation

Click here for a Google Colab quick-start tutorial on how to use Kaggle, as well as reading and writing basic data formats from BLASTNet. This is useful when you want to use your own workstation with BLASTNet.

Kaggle Command Line API

In BLASTNet, we share our data via Kaggle. Kaggle provides a terminal interface that lets you upload and download data in a manner suited to most scientific clusters. Go to the Kaggle API GitHub repository for detailed instructions in its README.

Kaggle API Installation

Here are quick-start instructions for sharing data on Kaggle. Prerequisite: python3.

  1. Install the Kaggle API for python3:
     pip install kaggle
    
  2. Create a Kaggle account here with a valid <username>.
  3. Go to your account page https://www.kaggle.com/<username>/account and click on ‘Create API Token’ to download a kaggle.json file.
  4. Move the file to the default location and change its permissions:
     mkdir ~/.kaggle
     mv kaggle.json ~/.kaggle/kaggle.json
     chmod 600 ~/.kaggle/kaggle.json
    
  5. Now you’re ready to download and upload.
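Once the token is in place, a quick way to confirm everything works is to query Kaggle from the terminal; this is a sketch, and the search term blastnet is just an example:

```shell
# Print the installed CLI version; this fails if the pip install did not succeed.
kaggle --version

# List datasets matching a search term; any listing output (rather than an
# authentication error) means kaggle.json is set up correctly.
kaggle datasets list -s blastnet
```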

Kaggle Download

To download a single file from Kaggle:

kaggle datasets download <username>/<datasetname> -f <filename> 

To download all files from a dataset from Kaggle:

kaggle datasets download <username>/<datasetname> 

Here <username> is the username of the contributor and <datasetname> is the name of the dataset.

Kaggle Upload

To upload an entire dataset:

  1. Initialize the uploading process by:
     kaggle datasets init -p <path/to/dataset>
    
  2. This creates a dataset-metadata.json file, in which you must fill in your dataset title under "title" and its URL slug under "id", e.g. via:
     vi /path/to/dataset/dataset-metadata.json
    

    Note that only alphanumeric characters and hyphens (-) are allowed.

  3. Put your files into 3 folders (e.g. data, grid, and chem_thermo_tran), since Kaggle’s API can display at most 20 directories/files.
  4. Upload your dataset (Kaggle automatically packs each folder into a .tar file) via:
     kaggle datasets create -u -p <path/to/dataset> --dir-mode 'tar'
    

    or update a previously created dataset with:

     kaggle datasets version -p <path/to/dataset> -m "<version notes>"

    (run kaggle datasets version -h for the full list of options).
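Putting steps 1–4 together, an upload session might look like the sketch below; the dataset path, title, and id are placeholder values you must replace with your own:

```shell
# Step 1: generate a metadata template in the dataset folder;
# ./my-dataset is a placeholder path.
kaggle datasets init -p ./my-dataset

# Step 2: fill in "title" and "id" (shown here as a heredoc instead of vi).
# Both values are placeholders; the id must be <username>/<slug>, using
# only alphanumeric characters and hyphens.
cat > ./my-dataset/dataset-metadata.json <<'EOF'
{
  "title": "My Example Dataset",
  "id": "myusername/my-example-dataset",
  "licenses": [{"name": "CC0-1.0"}]
}
EOF

# Step 4: create the dataset, packing each folder as a .tar file.
kaggle datasets create -u -p ./my-dataset --dir-mode tar

# Later revisions go up as new versions with a change message:
kaggle datasets version -p ./my-dataset -m "Added new cases" --dir-mode tar
```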