by Subhaditya Mukherjee
Implementing batching for large data.
Onehot and data batching. This is a precursor to everything from optimizers to the actual training loop itself.
First let us look at one hot encoding. Simply put it is encoding the labels as numbers. Should not be particularly challenging. What do we intend to achieve? Well lets say we have “cat” “dog”. All we need to do is label these as 0,1 and have a way to convert them back to “cat” “dog”.
We first identify the unique elements. Then we pop them into a dictionary of numbers from 1 to number of unique.
labels = unique(y_enc) encodedlabels = Dict(labels .=> collect(1:length(labels)))
Now that we have that out of the way, we have to map the entire list from before with these values. I thought it would be as simple as using a map but apparently not.
So it finally worked! I had to use a global function and a fancy replace function to get it to work.
function onecold(y_enc) labels = unique(y_enc) encodedlabels = Dict(labels .=> collect(1:length(labels))) global ytrain for a in keys(encodedlabels) ytrain = replace(ytrain, a=>encodedlabels[a]) end end
Okay now for images. The hardest part and something I am not looking forward to. Hold on. Now that I think about it. The index to the image array which we made before was literally just this.
X[:, :, :, idx]
Since idx is the index, cant we just do idx:final_idx? OH MY! It works!!
X[:, :, :, 1:batch_size]
This would give me a batch. But the question is. How do I make it into a generator :/ I need to it give me the “next” batch when I call a function. So I do not have to store an index value. Because it is dumb and boring.
So I decided to go by a generator approach. What this means is that we have a function (called a co-routine) which basically stores a state. For example if we want to increment a number everytime we call the function till a point, we can just call it directly. As an example.
function testGen(c::Channel) for n=1:4 put!(c,n) end put!(c,"stop") end
This is a function. If we want to increment it, all we do is call.
test_yi = Cha take!(test_yi)
Now for the actual batching. Okay so there is a problem. The batch cannot return the whole array :/ I dont know why but maybe I just dont know enough Julia.
How about returning the indexes instead though. I think that should be enough. Lets see. Okay that took a bit of modification.
We make the function to yield the next index taking into account batch size.
function datagen(c::Channel) global rep_len for n=1:bs:rep_len put!(c,n); end put!(c,"stop"); end bs = 64 rep_len = length(ytest) bunchData = Channel(datagen);
Now we use this to define the iteration over the batches. We index into the images as well as the labels with the yield of the generator function. As a test, let us check the size of the outputs to see if the batch size is correct. The try catch is to identify the end of the batch.
bs = 64 rep_len = length(ytest) bunchData = Channel(datagen); for i in collect(1:round(rep_len/bs)+1) current_index = take!(bunchData) try x_batch,y_batch = Xtest[:, :, :, current_index:current_index+bs-1],ytest[current_index:current_index+bs-1] @info size(x_batch),size(y_batch) catch e x_batch,y_batch = Xtest[:, :, :, rep_len-bs:rep_len],ytest[rep_len-bs:rep_len] @info size(x_batch),size(y_batch) end end
And now we have a dataloader which takes into account batch size. Guess we can go ahead and implement optimizers and the rest of the loop now :)