Deconstructing Deep Learning + δeviations


Dataset Bindings

Reading time : ~9 mins

by Subhaditya Mukherjee

Tiny post on datasets and a unified downloader for standard ones.

I wanted an easy way to download the standard datasets. I guess Metalhead does this already, but its support is limited to around 5 datasets. Boooringg.


The objective is to have a list of standard datasets from Computer Vision, Natural Language Processing and maybe Graphs at some point. Nicely enough, fastai has already done this for us, although it is for Python. So I will take that list and add some of my own.

File Downloader

Now for actually downloading the files. We follow a simple workflow. Get a URL -> Check if it exists -> Download it -> Choose a name -> Write to file

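using HTTP
function downloader(url::String, dest::String, fname::String)
    fpath = joinpath(dest, fname)  # works with or without a trailing slash in dest
    HTTP.open(:GET, url) do http
        open(fpath, "w") do file
            write(file, http)  # stream the response body to disk
        end
    end
    return fpath
end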
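One reading of the "check if it exists" step is to skip the download when the target file is already on disk. Here is a minimal sketch of that idea. It assumes a hypothetical `download_once` helper (not from the original post) that takes the fetching function as an argument, so it can be tried without a network connection:

```julia
# Hypothetical helper: `download_once` and `fetch` are illustrative names,
# not part of the original post.
function download_once(fetch::Function, url::String, dest::String, fname::String)
    fpath = joinpath(dest, fname)
    if isfile(fpath)
        @info "Already downloaded, skipping" fpath
        return fpath
    end
    mkpath(dest)          # create the destination directory if needed
    fetch(url, fpath)     # any function that writes the content of url to fpath
    return fpath
end

# Usage with a stand-in fetcher that just writes a placeholder file.
calls = Ref(0)
fake_fetch(url, fpath) = (calls[] += 1; write(fpath, "placeholder for $url"))

dir = mktempdir()
first_path  = download_once(fake_fetch, "https://example.com/data.tgz", dir, "data.tgz")
second_path = download_once(fake_fetch, "https://example.com/data.tgz", dir, "data.tgz")
```

The second call finds the file and returns early, so the stand-in fetcher runs only once. In real use the fetcher would be whatever actually downloads the file.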

Get all supported

There are so many of them that I need a way to get a list of all existing datasets so I can pop it into functions later. Fortunately, the list is a dictionary, so I can just take its keys.

alldatasets() = keys(dataset_list)

Get extensions

I need a way to get the extension of the file so I can add it to the downloaded name. This is simple because our files are mostly archives with extensions like .tgz or .tar.gz.

This will be a bit complex, but I am basically just chaining a few things together. Wait. Why not pipe it? Then it should be easier to explain. Let me try.

function getext(st::String)
    # last path segment -> split on "." -> drop the base name -> rejoin with a leading "."
    return "."*join(split(split(st, "/")[end],".")[2:end],".")
end

I will blatantly ignore the previous explanation because, turns out... piping... WORKS.

using Pipe: @pipe
getext(st::String) = @pipe split(st, "/")[end] |> split(_, ".")[2:end] |> "."*join(_, ".")

Just look at this beauty. Now let me explain. I take the string as st and split it on “/”. Then I take the last piece (the file name, which carries the extension) and split it again on “.”. Dropping the first part leaves all the extension parts (e.g. tar, gz). After that I join the parts with “.” and put one more “.” in front. PS. The _ stands for the output of the previous stage, so you can use it directly in the next one.
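To make the chain concrete, here is the same logic written step by step in plain Base Julia, with the intermediate values for the mnist URL shown as comments. This is only a sketch: `getext_steps` is an illustrative name and the two example.com URLs are made up.

```julia
# Stepwise version of the extension logic, using only Base Julia.
# `getext_steps` is an illustrative name, not from the original post.
function getext_steps(st::AbstractString)
    fname = split(st, "/")[end]      # "mnist_png.tgz"
    parts = split(fname, ".")        # ["mnist_png", "tgz"]
    exts  = parts[2:end]             # ["tgz"]
    return "." * join(exts, ".")     # ".tgz"
end

ext_tgz   = getext_steps("https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz")
ext_targz = getext_steps("https://example.com/archive.tar.gz")   # made-up URL
ext_dots  = getext_steps("https://example.com/data.v2.zip")      # made-up URL
```

One thing to keep in mind with this approach: everything after the first “.” in the file name counts as extension, so the last example gives ".v2.zip" rather than ".zip".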

Downloading the standard datasets

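Now we can use all these functions to get a dataset by its short name instead of searching for it.

I will follow this workflow: check that the name is supported, look up its URL, download it under the short name plus extension, and otherwise list the valid choices.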

function get_data(name::String, path::String)
    if name in alldatasets()
        @info "Downloading $name"
        url = dataset_list[name]
        finfile = downloader(url, path, name*getext(url))
        # e.g. finfile = "/tmp/mnist.tgz"
        @info "Done downloading"
        return finfile
    else
        # unknown name: show what is available
        @info "Please choose something from here"
        @info alldatasets()
    end
end
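To see what the lookup does without touching the network, here is a dry-run sketch that only resolves a name to its URL and target file. The names `demo_list`, `demo_getext` and `plan_download` are illustrative and not part of the original code. The two URLs are copied from the dataset dictionary in this post.

```julia
# Dry-run sketch: resolve a dataset name to (url, target path) without downloading.
# `demo_list`, `demo_getext`, and `plan_download` are illustrative names.
demo_list = Dict(
    "mnist"   => "https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz",
    "cifar10" => "https://s3.amazonaws.com/fast-ai-imageclas/cifar10.tgz",
)

# Same extension logic as getext, repeated so this block runs on its own.
demo_getext(st) = "." * join(split(split(st, "/")[end], ".")[2:end], ".")

function plan_download(name::String, path::String)
    if name in keys(demo_list)
        url = demo_list[name]
        return (url = url, target = joinpath(path, name * demo_getext(url)))
    else
        return nothing   # unknown name: the caller can show keys(demo_list)
    end
end

plan = plan_download("mnist", "/tmp")
missing_plan = plan_download("not-a-dataset", "/tmp")
```

Here `plan.target` is the short name plus the extension taken from the URL, inside the given directory, and an unsupported name gives back `nothing`.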

Supported Datasets

I add them in a dictionary like this

dataset_list = Dict(
    # Computer Vision
    "mnist" => "https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz",
    "cifar10" => "https://s3.amazonaws.com/fast-ai-imageclas/cifar10.tgz",
    "cifar100" => "https://s3.amazonaws.com/fast-ai-imageclas/cifar100.tgz",
    # ... remaining entries omitted here
)

Here are the datasets I am currently supporting. I will keep this as a table for future reference too lol.

Dataset Link
mnist Source
cifar10 Source
cifar100 Source
birds Source
caltech101 Source
pets Source
flowers Source
food Source
cars Source
imagenette Source
imagenette320 Source
imagenette160 Source
imagewoof Source
imagewoof320 Source
imagewoof160 Source
imdb Source
wikitext103 Source
wikitext2 Source
wmt Source
ag Source
amazon Source
p-amazon Source
dbpedia Source
sogou Source
yahoo Source
yelp Source
p-yelp Source
camvid Source
pascal Source
hmbd Source
ucf Source
kinetics700 Source
kinetics600 Source
kinetics400 Source