Tiny post on datasets and a unified downloader for standard ones.
I wanted an easy way to download the standard datasets. I guess Metalhead does it already, but it has limited support and there are only around 5 datasets. Boooringg
The objective is to have a list of standard datasets from Computer Vision, Natural Language Processing and maybe Graphs at some point. Nicely enough, fastai has already done this for us, although it is for Python. So I will take that list and add some of my own.
Now for actually downloading the files. We follow a simple workflow. Get a URL -> Check if it exists -> Download it -> Choose a name -> Write to file
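    using HTTP

    # Stream the response body of `url` into the file `fname` inside `dest`.
    function downloader(url::String, dest::String, fname::String)
        target = joinpath(dest, fname)  # works with or without a trailing slash on dest
        HTTP.open(:GET, url) do http
            open(target, "w") do file
                write(file, http)
            end
        end
        return target
    end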
There are so many of them that I need a way to get a list of all existing datasets so I can pop it into functions later. Fortunately, they live in a dictionary, so I can just take out its keys.
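Something like the following would do the listing. This is only a sketch: it assumes the dictionary is the `dataset_list` from this post (three of its entries are repeated here so the snippet runs on its own), and the body of `alldatasets` is a guess at the simplest version, not necessarily the exact one used.

```julia
# Three entries copied from the dataset dictionary in this post.
dataset_list = Dict(
    "mnist" => "https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz",
    "cifar10" => "https://s3.amazonaws.com/fast-ai-imageclas/cifar10.tgz",
    "cifar100" => "https://s3.amazonaws.com/fast-ai-imageclas/cifar100.tgz",
)

# Assumed implementation: take the keys and sort them so the listing is stable.
alldatasets() = sort(collect(keys(dataset_list)))

println(alldatasets())
```

Sorting is optional, but a `Dict` has no guaranteed key order, so it keeps the printed list the same from run to run.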
I need a way to get the extension of the file so I can add it to the downloaded name. This is also very simply done because our files are primarily of the following types:
- `*.*` eg: name.zip
- `*.*.*` eg: name.tar.gz
This will be a bit complex, but I am basically just chaining a few things together. Wait. Why not pipe it? Then it should be easier to explain. Let me try.
    # Return everything after the first "." in the last path segment, e.g. ".tar.gz".
    function getext(st::String)
        return "." * join(split(split(st, "/")[end], ".")[2:end], ".")
    end
I will blatantly ignore the previous explanation because it turns out.... piping ... WORKS.
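    using Pipe: @pipe

    # Same logic as the nested version, written as a left-to-right chain.
    getext(st::String) = @pipe split(st, "/")[end] |> split(_, ".")[2:end] |> "." * join(_, ".")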
Just look at this beauty. Now let me explain. I take the string as st. Then I split it on "/". Then I take the last piece (the file name with the extension) and split it again on ".". The first piece of that is the name itself, so I drop it with [2:end], which leaves me with all the extension parts (eg tar, gz). After that I can join all the parts with a "." and stick one more "." in front. PS. The _ basically takes the output of the previous step in the chain and lets you use it directly in the next step.
Now we can use all these functions to get a dataset by its tiny name instead of searching for its URL.
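To make the chain concrete, here is a small sketch that exposes each intermediate value. The helper name `getext_steps` is hypothetical and only exists for this illustration; it uses plain Base functions so it runs without the Pipe package.

```julia
# Hypothetical helper: returns every intermediate value of the chain for inspection.
function getext_steps(st::AbstractString)
    filename = split(st, "/")[end]     # last path segment, e.g. "mnist_png.tgz"
    parts = split(filename, ".")       # the name followed by the extension parts
    extparts = parts[2:end]            # drop the name, keep the extension parts
    ext = "." * join(extparts, ".")    # glue them back together with a leading dot
    return (filename = filename, parts = parts, extparts = extparts, ext = ext)
end

steps = getext_steps("https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz")
println(steps.filename)                     # mnist_png.tgz
println(steps.ext)                          # .tgz
println(getext_steps("name.tar.gz").ext)    # .tar.gz
```

One thing to keep in mind: a file name with a dot in the stem, say `v1.2.zip`, would come out as `.2.zip`, and a name with no dot at all would come out as just `.`. That is fine for the two file shapes listed earlier, but worth knowing before reusing this elsewhere.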
I will follow this workflow.
    function get_data(name::String, path::String)
        if name in alldatasets()
            @info "Downloading $name"
            url = dataset_list[name]
            finfile = downloader(url, path, name * getext(url))  # e.g. finfile = "/tmp/mnist.tgz"
            @info "Done downloading"
            return finfile  # hand the saved path back to the caller
        else
            @info "Please choose something from here"
            @info alldatasets()
        end
    end
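If you want to see where a file would land without actually downloading anything, a dry run like this can help. It is only a sketch: `target_path` is a hypothetical helper that is not part of the code above, it repeats two dictionary entries and the extension logic so it runs on its own, and it assumes the directory and file name are combined with `joinpath`.

```julia
# Two entries copied from the dataset dictionary in this post.
dataset_list = Dict(
    "mnist" => "https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz",
    "cifar10" => "https://s3.amazonaws.com/fast-ai-imageclas/cifar10.tgz",
)

# Same extension logic as getext in this post.
getext(st::AbstractString) = "." * join(split(split(st, "/")[end], ".")[2:end], ".")

# Hypothetical dry-run helper: the path a download would be written to,
# or `nothing` when the name is not a known dataset.
function target_path(name::AbstractString, path::AbstractString)
    haskey(dataset_list, name) || return nothing
    return joinpath(path, name * getext(dataset_list[name]))
end

println(target_path("mnist", "/tmp"))
println(target_path("not-a-dataset", "/tmp"))  # nothing
```

Note that the saved file is named after the short key (`mnist`), not after the file name in the URL (`mnist_png`); only the extension is taken from the URL.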
I add them in a dictionary like this
    dataset_list = Dict(
        # Computer Vision
        "mnist" => "https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz",
        "cifar10" => "https://s3.amazonaws.com/fast-ai-imageclas/cifar10.tgz",
        "cifar100" => "https://s3.amazonaws.com/fast-ai-imageclas/cifar100.tgz",
        # ...
    )
Here are the datasets I am currently supporting. I will keep this as a table for future reference too lol.
| Dataset | Link |
|---|---|
| mnist | Source |
| cifar10 | Source |
| cifar100 | Source |
| birds | Source |
| caltech101 | Source |
| pets | Source |
| flowers | Source |
| food | Source |
| cars | Source |
| imagenette | Source |
| imagenette320 | Source |
| imagenette160 | Source |
| imagewoof | Source |
| imagewoof320 | Source |
| imagewoof160 | Source |
| imdb | Source |
| wikitext103 | Source |
| wikitext2 | Source |
| wmt | Source |
| ag | Source |
| amazon | Source |
| p-amazon | Source |
| dbpedia | Source |
| sogou | Source |
| yahoo | Source |
| yelp | Source |
| p-yelp | Source |
| camvid | Source |
| pascal | Source |
| hmbd | Source |
| ucf | Source |
| kinetics700 | Source |
| kinetics600 | Source |
| kinetics400 | Source |