Deconstructing Deep Learning + δeviations

Dataset Bindings

Tiny post on datasets and a unified downloader for standard ones.

I wanted an easy way to download the standard datasets. I guess Metalhead does this already, but its support is limited: there are only around 5 datasets. Boooring.


The objective is to have a list of standard datasets from Computer Vision, Natural Language Processing and maybe Graphs at some point. Nicely enough, fastai has already done this for us, although it is for Python. So I will take that list and add some of my own.

File Downloader

Now for actually downloading the files. We follow a simple workflow. Get a URL -> Check if it exists -> Download it -> Choose a name -> Write to file

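using HTTP
function downloader(url::String, dest::String, fname::String)
    filepath = joinpath(dest, fname)  # works with or without a trailing "/" on dest
    HTTP.open(:GET, url) do http
        open(filepath, "w") do file
            write(file, http)  # stream the response body into the file
        end
    end
    return filepath
end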

Get all supported

There are so many of them that I need a way to get a list of all existing datasets so I can pop it into functions later. Fortunately, dataset_list is a dictionary, so I can just take its keys.

alldatasets() = keys(dataset_list)
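To see what this gives you, here is a tiny sketch with a made-up two-entry dictionary. `toy_list`, `toy_datasets` and the example.com URLs are placeholders I invented for illustration, not the real dataset_list. `keys` returns a view over the dictionary's keys, so `in` works on it directly, and `collect` turns it into a plain vector when you want one.

```julia
# Hypothetical stand-in for dataset_list; names and URLs are placeholders.
toy_list = Dict(
    "alpha" => "https://example.com/alpha.zip",
    "beta"  => "https://example.com/beta.tar.gz",
)

toy_datasets() = keys(toy_list)

# Membership tests work directly on the key set.
has_alpha = "alpha" in toy_datasets()
has_gamma = "gamma" in toy_datasets()

# Dict ordering is not guaranteed, so sort for a stable listing.
sorted_names = sort(collect(toy_datasets()))
println(sorted_names)
```

Sorting is only there to make the printed list predictable; for a plain membership check the key set is enough.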

Get extensions

I need a way to get the extension of the file so I can add it to the downloaded name. This is also very simply done because our files are primarily of the following types:

  • *.* eg: name.zip
  • *.*.* eg: name.tar.gz

This will look a bit complex, but I am basically just chaining a few things together. Wait. Why not pipe it? That should be easier to explain. Let me try.

function getext(st::String)
    return "."*join(split(split(st, "/")[end],".")[2:end],".")
end

I will blatantly ignore the previous explanation because, turns out... piping... WORKS.

using Pipe  # provides the @pipe macro
@pipe split(st, "/")[end] |> split(_, ".")[2:end] |> "."*join(_, ".")

Just look at this beauty. Now let me explain. I take the string as st and split it on "/". Then I take the last piece (the file name, with the extension) and split it again on ".". Dropping the first piece (the name itself) with [2:end] leaves all the extension parts (eg tar, gz). After that I join the parts with "." and stick a "." in front. PS. The _ stands for the output of the previous step in the chain, so you can use it directly in the next one.
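If you want to poke at each step, here is the same chain unrolled into named intermediate values. This is only a sketch: `getext_steps` is a hypothetical helper name and the example.com URLs are made up. It uses plain Base functions, so no piping package is needed.

```julia
# Same logic as getext, with each intermediate value kept for inspection.
function getext_steps(st::AbstractString)
    fname = split(st, "/")[end]      # last path segment, i.e. the file name
    parts = split(fname, ".")        # file stem followed by the extension parts
    exts  = parts[2:end]             # drop the stem, keep the extension parts
    return "." * join(exts, ".")     # glue them back together with a leading dot
end

println(getext_steps("https://example.com/data/name.zip"))     # .zip
println(getext_steps("https://example.com/data/name.tar.gz"))  # .tar.gz
println(getext_steps("https://example.com/data/v1.2.zip"))     # .2.zip
```

The last call shows the one catch with this approach: a dot inside the file name itself is treated as the start of the extension. For the two file types listed above that is not a problem.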

Downloading the standard datasets

Now we can use all these functions to get a dataset by its short name instead of searching for it.

I will follow this workflow.

  • Get the name of the dataset and the path to save to
  • Check if the name is in the list of supported datasets
    • If it is -> Get the url from the dictionary -> Get extension -> Use the name -> Download it
    • If not -> Show all datasets

function get_data(name::String, path::String)
    if name in alldatasets()
        @info "Downloading $name"
        url = dataset_list[name]
        finfile = downloader(url, path, name*getext(url))
        # finfile = "/tmp/mnist.tgz"
        @info "Done downloading"
        return finfile
    else
        @info "Please choose something from here"; @info alldatasets()
    end
end
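The lookup half of this workflow can be tried offline, without downloading anything. The sketch below is an illustration under assumptions: `demo_list`, `demo_ext` and `target_path` are hypothetical names, the URL is a placeholder, and joining the directory and file name with `joinpath` is my choice here. It only computes where the file would be saved.

```julia
# Hypothetical stand-in for the dataset dictionary; the URL is a placeholder.
demo_list = Dict("mnist" => "https://example.com/files/mnist_png.tgz")

# Same extension logic as getext: everything after the first "." of the file name.
demo_ext(url::AbstractString) = "." * join(split(split(url, "/")[end], ".")[2:end], ".")

# Returns the path the file would be saved to, or nothing for an unknown name.
function target_path(name::AbstractString, path::AbstractString)
    haskey(demo_list, name) || return nothing
    return joinpath(path, name * demo_ext(demo_list[name]))
end

println(target_path("mnist", "/tmp"))  # /tmp/mnist.tgz on Linux/macOS
println(target_path("nope", "/tmp"))   # nothing
```

Note that the saved file takes the short dataset name plus the URL's extension, so `mnist_png.tgz` ends up as `mnist.tgz`.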

Supported Datasets

I add them in a dictionary like this

dataset_list = Dict(
    # Computer Vision
    "mnist" => "https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz",
    "cifar10" => "https://s3.amazonaws.com/fast-ai-imageclas/cifar10.tgz",
    "cifar100" => "https://s3.amazonaws.com/fast-ai-imageclas/cifar100.tgz",
    # ... remaining entries not shown here
)

Here are the datasets I am currently supporting. I will keep this as a table for future reference too lol.

| Dataset | Link |
|---|---|
| mnist | Source |
| cifar10 | Source |
| cifar100 | Source |
| birds | Source |
| caltech101 | Source |
| pets | Source |
| flowers | Source |
| food | Source |
| cars | Source |
| imagenette | Source |
| imagenette320 | Source |
| imagenette160 | Source |
| imagewoof | Source |
| imagewoof320 | Source |
| imagewoof160 | Source |
| imdb | Source |
| wikitext103 | Source |
| wikitext2 | Source |
| wmt | Source |
| ag | Source |
| amazon | Source |
| p-amazon | Source |
| dbpedia | Source |
| sogou | Source |
| yahoo | Source |
| yelp | Source |
| p-yelp | Source |
| camvid | Source |
| pascal | Source |
| hmbd | Source |
| ucf | Source |
| kinetics700 | Source |
| kinetics600 | Source |
| kinetics400 | Source |