Deconstructing Deep Learning + δeviations


# Dataset Bindings

Tiny post on datasets and a unified downloader for standard ones.

I wanted an easy way to download the standard datasets. I guess Metalhead does it already, but its support is limited to around 5 datasets. Boooringg

## Objective

The objective is to have a list of standard datasets from Computer Vision, Natural Language Processing and maybe Graphs at some point. Nicely enough, fastai has already done this for us, although it is for Python. So I will take that list and add some of my own.

Now for actually downloading the files. We follow a simple workflow: get a URL -> check if it exists -> download it -> choose a name -> write to file.

    using HTTP

    function downloader(url::String, dest::String, fname::String)
        target = joinpath(dest, fname)
        # Stream the response body straight into the target file.
        HTTP.open(:GET, url) do http
            open(target, "w") do file
                write(file, http)
            end
        end
        return target
    end


## Get all supported

There are so many of them that I need a way to get a list of all existing datasets so I can pop it into functions later. Fortunately, the datasets live in a dictionary (`dataset_list`, shown further down) and I can just take out its keys.

    alldatasets() = keys(dataset_list)
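As a quick illustration of what taking the keys gives you, here is a minimal sketch. `demo_list` and `demo_datasets` are hypothetical stand-ins (a two-entry dictionary instead of the full list), so treat it as an example of the idea rather than the actual code in the package.

```julia
# Hypothetical two-entry stand-in for the real dataset_list.
demo_list = Dict(
    "mnist" => "https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz",
    "cifar10" => "https://s3.amazonaws.com/fast-ai-imageclas/cifar10.tgz",
)

# Same idea as alldatasets(): the supported names are the dictionary keys.
demo_datasets() = keys(demo_list)

# keys() returns a lazy view with no guaranteed order, so collect and sort for display.
names_sorted = sort(collect(demo_datasets()))
println(names_sorted)                  # ["cifar10", "mnist"]

# Membership checks work directly on the key set.
println("mnist" in demo_datasets())    # true
println("birds" in demo_datasets())    # false
```

Since a `Dict` has no guaranteed ordering, sorting before printing keeps the listing stable between runs.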


## Get extensions

I need a way to get the extension of the file so I can add it to the downloaded name. This is also very simply done because our files are primarily of the following types:

• `*.*`, e.g. name.zip
• `*.*.*`, e.g. name.tar.gz

This will be a bit complex, but I am basically just chaining a few things together. Wait. Why not pipe it? Then it should be easier to explain. Let me try.

    function getext(st::String)
        return "." * join(split(split(st, "/")[end], ".")[2:end], ".")
    end


I will blatantly ignore the previous explanation because it turns out.... piping ... WORKS.

    using Pipe: @pipe  # the @pipe macro comes from the Pipe.jl package

    @pipe split(st, "/")[end] |> split(_, ".")[2:end] |> "." * join(_, ".")


Just look at this beauty. Now let me explain. I take the string as `st` and split it on "/". Then I take the last piece (the file name, with the extension) and split it again on ".". Dropping the first part leaves all the extension parts (e.g. tar, gz). After that I join the parts with a "." and put a "." in front. PS: the `_` takes the output of the previous step in the chain and lets you use it directly in the next one.
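To make the chain concrete, here is the same logic unrolled into named steps, using only Base Julia. `getext_steps` is a hypothetical name for this illustration, and the comments show the intermediate values for the mnist URL.

```julia
# Step-by-step version of the extension chain (no Pipe.jl needed).
function getext_steps(st::String)
    fname = split(st, "/")[end]        # "mnist_png.tgz"
    parts = split(fname, ".")[2:end]   # ["tgz"]
    return "." * join(parts, ".")      # ".tgz"
end

url = "https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz"
println(getext_steps(url))             # .tgz
println(getext_steps("name.tar.gz"))   # .tar.gz
```

One thing to keep in mind: everything after the first "." in the file name counts as extension, so a name with no "." at all comes back as just ".".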

Now we can use all these functions to get a dataset by its tiny name instead of searching for its URL.

I will follow this workflow.

• Get name of dataset and path to save
• Check if name entered
• If name in dataset -> Get the url from the dictionary -> Get extension -> Use the name -> Download it
• If not -> Show all datasets

    function get_data(name::String, path::String)
        if name in alldatasets()
            url = dataset_list[name]
            # e.g. name "mnist" and path "/tmp" give "/tmp/mnist.tgz"
            return downloader(url, path, name * getext(url))
        else
            @info "Please choose something from here"
            @info alldatasets()
        end
    end
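The lookup and naming part of this workflow can be tried without touching the network. The sketch below is only an illustration: `demo_list` and `resolve_target` are hypothetical names, and instead of downloading, it just returns the URL together with the path the file would be written to.

```julia
# Hypothetical one-entry stand-in for the full dataset_list.
demo_list = Dict(
    "mnist" => "https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz",
)

# Same chain as getext in the post.
getext(st::String) = "." * join(split(split(st, "/")[end], ".")[2:end], ".")

# Hypothetical helper: returns (url, target path), or nothing for an unknown name.
function resolve_target(name::String, path::String)
    haskey(demo_list, name) || return nothing
    url = demo_list[name]
    return url, joinpath(path, name * getext(url))
end

println(resolve_target("mnist", "/tmp"))   # the URL and the path ending in mnist.tgz
println(resolve_target("nope", "/tmp"))    # nothing
```

Keeping the name resolution separate from the download makes it easy to check what would be written where before pulling a few hundred megabytes.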


## Supported Datasets

I add them in a dictionary like this

    dataset_list = Dict(
        # Computer Vision
        "mnist" => "https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz",
        "cifar10" => "https://s3.amazonaws.com/fast-ai-imageclas/cifar10.tgz",
        "cifar100" => "https://s3.amazonaws.com/fast-ai-imageclas/cifar100.tgz",
        # ... the remaining entries follow the same pattern
    )


Here are the datasets I am currently supporting. I will keep this as a table for future reference too lol.

| Dataset | Link |
|---|---|
| mnist | Source |
| cifar10 | Source |
| cifar100 | Source |
| birds | Source |
| caltech101 | Source |
| pets | Source |
| flowers | Source |
| food | Source |
| cars | Source |
| imagenette | Source |
| imagenette320 | Source |
| imagenette160 | Source |
| imagewoof | Source |
| imagewoof320 | Source |
| imagewoof160 | Source |
| imdb | Source |
| wikitext103 | Source |
| wikitext2 | Source |
| wmt | Source |
| ag | Source |
| amazon | Source |
| p-amazon | Source |
| dbpedia | Source |
| sogou | Source |
| yahoo | Source |
| yelp | Source |
| p-yelp | Source |
| camvid | Source |
| pascal | Source |
| hmbd | Source |
| ucf | Source |
| kinetics700 | Source |
| kinetics600 | Source |
| kinetics400 | Source |