Deconstructing Deep Learning + δeviations

SGD

by Subhaditya Mukherjee

In this post we will try to implement SGD and read a bit about what it is.

Before that, let us review what we have so far.

  1. A way to read images and identify classes
  2. A generator to give us this data in batches
  3. A dense layer
  4. A bunch of activation functions
  5. A basic backprop loop.

So what is SGD? Well, it is a tiny addition to the loop: we calculate the gradient and update the weight matrix batch by batch instead of doing it for the whole dataset at once. That's about it.

Why? Because we have limited memory, and we are "streaming" the data, so to speak. SGD is one of the older techniques. It works. It works pretty well, and it is the first step towards making a proper DL pipeline.

So compared to plain gradient descent, which uses all the data for every update, SGD estimates the gradient from randomly picked samples (or batches) instead. That randomness is the "Stochastic" part.
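
To make the idea concrete, here is a minimal sketch of mini-batch SGD on a toy problem. It is not the image pipeline from this post: the data (y = 3x + 2), the function name sgd_fit, and the hyperparameters are all assumptions chosen for illustration, and the gradients are written by hand instead of using automatic differentiation. Each step applies the update W ← W − α · ∇L(W), where the loss L is computed on one batch only.

```julia
using Random

# Hypothetical toy data: y = 3x + 2 with no noise (not the post's image data)
Random.seed!(0)
x = rand(256)
y = 3 .* x .+ 2

# Fit y ≈ w*x + b with mini-batch SGD; α is the learning rate, bs the batch size
function sgd_fit(x, y; α = 0.1, bs = 32, n_epochs = 200)
    w, b = 0.0, 0.0
    n = length(x)
    for epoch in 1:n_epochs
        idx = randperm(n)                      # reshuffle every epoch: the "stochastic" part
        for start in 1:bs:n
            batch = idx[start:min(start + bs - 1, n)]
            xb, yb = x[batch], y[batch]
            err = (w .* xb .+ b) .- yb         # prediction error on this batch only
            # hand-derived gradients of the mean squared error w.r.t. w and b
            gw = 2 * sum(err .* xb) / length(batch)
            gb = 2 * sum(err) / length(batch)
            w -= α * gw                        # SGD update for the weight
            b -= α * gb                        # SGD update for the bias
        end
    end
    return w, b
end

w, b = sgd_fit(x, y)
```

On this noiseless toy data, w and b should end up close to 3 and 2. Only one batch is ever needed to compute a step, which is the whole point when the full dataset does not fit in memory.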

How do we implement it? Let us modify our training loop a bit now and add what we had discussed for “Linear Model”. Let's see if it works. Also, as a note: I am using Xtest/ytest because it is smaller and we are not using a GPU right now.

To be honest, I actually gave up on this for a few days. For some reason it felt like I could not do this. But now I am back after 5 days and I will try it again. It is a bottleneck, I think. Once this is done I can actually do a lot more directly. Ugh. I am probably doing something horribly wrong.

It's been hours and I am still trying to understand how to do this. Wow. I have no idea why I wanted to do this. It is way harder than I thought. But anyway, the show will go on. Fingers crossed.

I think it worked :) It is not perfect and we will get back to it later. Before I write the code for SGD, let us modify our dataloader to normalize the images while loading them. This centers each image around zero and puts the pixels on a common scale, which reduces the variation between images and helps the model train faster. Normalization is defined as pixels = (pixels .- μ) ./ σ, where μ is the mean and σ is the standard deviation of the pixels.

Basically we add this bit.

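        # needs `using Statistics` at the top of the file for mean and std
        img = convert(Array{Float64}, img)
        μ, σ = mean(img), std(img)
        # subtract the mean first, then divide; without the parentheses ./ binds tighter than .-
        img = (img .- μ) ./ σ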
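
As a quick sanity check of the normalization step, here is a small sketch on a made-up 2×2 "image" (the pixel values are an assumption, not data from this post). It also shows why the parentheses matter: in Julia, ./ has higher precedence than .-, so img .- μ ./ σ only subtracts the single number μ/σ.

```julia
using Statistics

# Hypothetical 2×2 grayscale "image" with pixel values in 0..255
img = Float64[0 64; 128 255]

μ, σ = mean(img), std(img)

# Correct: center first, then scale
normalized = (img .- μ) ./ σ

# Precedence trap: this computes img .- (μ ./ σ), which is just a constant shift
shifted_only = img .- μ ./ σ
```

After the correct version, the pixels have mean 0 and standard deviation 1. The second version still has the original spread, so it does not normalize anything.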

Now for our SGD. We first repurpose our MSE loss function from before.

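# Layer used by the loss: elementwise weights plus a bias
Dense(x) = W .* x .+ b

function loss(x, y)
    ŷ = Dense(x)                  # predictions for this batch
    lo = sum((y .- ŷ) .^ 2)       # squared error, summed over the batch
    global loss_pl
    loss_pl *= string(lo) * " "   # append the value to the text log
    @show lo
    return lo
end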

We compute the squared-error loss (summed over the batch, so strictly speaking it is not divided down to a mean yet) and save the values to a string. Why a string? I tried saving them in an array, but because I'm using automatic differentiation, mutating an array while the gradient is being computed is not allowed. So I just saved them to a string and wrote a tiny function to convert it into an array of floats after the training is done (to plot the losses).

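bs = 64                      # batch size
rep_len = length(ytest)      # number of samples

# He-initialized weights and biases
W = rand(he_uniform(64), 64, 64)
b = rand(he_uniform(64), 64)
loss_pl = ""                 # loss log, filled in by loss()
n_epochs = 2
# α is the learning rate; it is not defined in this snippet and must be set beforehand

for epoch in 1:n_epochs
    bunchData = Channel(datagen)
    @info "Epoch" epoch
    for i in 1:(round(Int, rep_len / bs) + 1)
        current_index = take!(bunchData)
        # declared before the try block so that both branches assign the globals
        global x_batch, y_batch
        try
            x_batch, y_batch = Xtest[:, :, :, current_index:current_index+bs-1], ytest[current_index:current_index+bs-1]
        catch e
            # the slice ran past the end of the data: use a full-size window near the end instead
            x_batch, y_batch = Xtest[:, :, :, rep_len-bs:rep_len-1], ytest[rep_len-bs:rep_len-1]
        end
        gs = gradient(() -> loss(x_batch, y_batch), Params([W, b]))

        W̄ = gs[W]            # gradient of the loss with respect to W
        W .-= α .* W̄         # gradient descent step; the bias b is not updated here
    end
end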

Let us take a batch size of 64. (Repurposing the previous batching code.) The biggest challenge was the dimensions of the weight and bias matrices. This is because Julia's image array is not like Python's ): and I could not just modify the code directly to reflect my Python implementation of the same. Which is sad.
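
For anyone hitting the same dimension trouble, here is a small sketch of how a batch is cut out when the sample axis comes last, which is the layout the indexing Xtest[:, :, :, ...] in the loop implies. The array sizes are made up, and the comparison is only a rough one: many Python pipelines put the batch axis first, so the slices do not translate one to one.

```julia
# Hypothetical dataset: 10 RGB images of 8×8 pixels, stored height × width × channel × sample
X = rand(Float32, 8, 8, 3, 10)
y = collect(1:10)

bs = 4
start = 5
stop = min(start + bs - 1, size(X, 4))   # clamp so the last batch cannot run past the end

x_batch = X[:, :, :, start:stop]         # slice along the 4th (sample) axis
y_batch = y[start:stop]

# Flatten every image into one column: features × batch
x_flat = reshape(x_batch, :, size(x_batch, 4))
```

Here x_batch has size (8, 8, 3, 4) and x_flat has size (192, 4). Clamping the end index with min is one possible alternative to catching the out-of-bounds error, at the cost of a shorter final batch.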

Anyway, I also added He initialization to the weights and biases (which is coooool). Now the gradient is computed for batches and not for the whole array together. I also added the epochs loop! The rest of the code is modified from the linear regression example.
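
The he_uniform helper itself is not shown in this post, so here is a rough sketch of what He uniform initialization does, under the usual definition: sample uniformly from [-limit, limit] with limit = sqrt(6 / fan_in). The function name he_uniform_sketch is hypothetical and is not the helper used in the training loop.

```julia
using Random

# Hypothetical helper (an assumption, not the post's he_uniform):
# draws values uniformly from [-limit, limit] with limit = sqrt(6 / fan_in)
function he_uniform_sketch(fan_in::Integer, dims::Integer...)
    limit = sqrt(6 / fan_in)
    return (2 .* rand(dims...) .- 1) .* limit   # rescale [0, 1) to [-limit, limit)
end

Random.seed!(1)
W_init = he_uniform_sketch(64, 64, 64)   # 64×64 weight matrix
b_init = he_uniform_sketch(64, 64)       # 64-element bias vector
```

With fan_in = 64 the limit is about 0.306, so the starting weights stay small enough that the first few updates do not blow up.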

This code works but I feel like there is something wrong here and I will probably fix it when I realize what.

The last thing in this segment is a tiny function to plot the losses. (Well, it's MSE and it probably isn't a good idea to plot it, but anyway.)

using Plots

loss_pl = split(loss_pl, " ");
# the trailing space leaves an empty last entry, so drop it before parsing
Plots.plot([parse(Float32, x) for x in loss_pl[1:end-1]])

Since I saved them to a string separated by spaces, I just split it and then convert the pieces to Float32. I tried Float16, but it is squared error after all and the values were too large for it (the largest finite Float16 is 65504). Then it is just a simple plot call using the Plots library. I guess I will have to move this inside the loop if I want it to show me the loss every epoch (or at least modify the loop itself). But well. Baby steps, right?
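
Here is the string-to-floats step on its own, as a small sketch with a made-up log string in the same space-separated format. It uses split without a delimiter, which splits on whitespace and drops empty pieces, so the trailing space needs no special handling.

```julia
# Hypothetical log in the same format the loss function writes: numbers separated by spaces
loss_log = "12.5 8.25 3.0 "

# split with no delimiter splits on whitespace and discards empty pieces
losses = parse.(Float32, split(loss_log))

# Float16 cannot hold large loss values: anything far above 65504 becomes Inf
too_big = Float16(1.0f5)
```

The resulting losses vector can be passed straight to a plotting call.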
