Deconstructing Deep Learning + δeviations

Format : Date | Title
TL; DR

Index page

# VGG

Here we will talk about VGG networks and how to implement VGG16 and VGG19.

## Community

I took a break for a few days because I was Seriously burnt out. But I am back! So here we go. In the meanwhile I decided to fix Flux.jl model zoo because it is an absolute disaster. I hope it will get accepted. I spoke to one of the maintainers - Dhairya Gandhi who helped me out a bit. PR

I also could not figure out the ... operator which I got to know from the community. I would encourage everyone to participate there because it truly is an awesome place to learn and share stuff regarding Julia and Scientific Machine Learning in general.

## Intro

Without further delay, what is VGG?? Simply put, it is a really popular Deep learning architecture. It is widely used for image recognition and does pretty well. There are two major variants of it - VGG16 and VGG19 both of which differ slightly in their architectures.

## What I learnt from the paper

Do read the paper if you get a chance - Link But an executive summary of sorts and some things I learnt from the paper are as follows.

• Adding 1x1 layers increases non linearity ( this is called network in network and is used in many architectures as it not only reduces computation but also sometimes converges better)
• learning rate decay was used. This helps a lot for larger networks
• Greater depth, smaller filter provides Implicit regularization
• Adding small amounts of noise to image increases accuracy in the long run
• Averaging best soft max parts of multiple performing models is HAX. (come on I am sleepy)

## Bla bla.. code time

As usual let us get all our packages

using Flux, Statistics
using Flux: onehotbatch, onecold, crossentropy, throttle
using Base.Iterators: repeated, partition
using Images


We will also define a function to make the images into an array of Float32 and the order required by flux for every image dimension.

getarray(X) = Float32.(permutedims(channelview(X), (2, 3, 1)))


### Data

Now let us load CIFAR10. Then we make a list of all the images in it. And make them into batches. We also one hot encode the labels and push the batches to the GPU. This is our train set.

Rinse and repeat for test set.

X = trainimgs(CIFAR10)
imgs = [getarray(X[i].img) for i in 1:50000];
labels = onehotbatch([X[i].ground_truth.class for i in 1:50000],1:10);
train = gpu.([(cat(imgs[i]..., dims = 4), labels[:,i]) for i in partition(1:49000, 100)]);

valset = collect(49001:50000)
valX = cat(imgs[valset]..., dims = 4) |> gpu
valY = labels[:, valset] |> gpu


### Model helpers

Since a lot of this model has repeats, we will define blocks that will reduce the redundancy in the model. We first define a block of [Conv -> ReLU -> Batchnorm]. Note that we take into account the input and output channels.

conv_block(in_channels, out_channels) = (
Conv((3,3), in_channels => out_channels, relu, pad = (1,1), stride = (1,1)),
BatchNorm(out_channels))


Once we have that, we can define a second block of [Conv -> ReLU -> Batchnorm -> Conv -> ReLU -> Batchnorm -> Max Pool].

double_conv(in_channels, out_channels) = (
conv_block(in_channels, out_channels)...,
conv_block(out_channels, out_channels)...,
MaxPool((2,2)))


And finally we can define another bigger block of [Conv -> ReLU -> Batchnorm -> Conv -> ReLU -> Batchnorm -> Conv -> ReLU -> MaxPool]

triple_conv(in_channels, out_channels) = (
conv_block(in_channels, out_channels),
conv_block(out_channels, out_channels),
Conv((3,3), out_channels => out_channels, relu, pad = (1,1), stride = (1,1)),
MaxPool((2,2)))


The above blocks save a looot of redundancy while taking into account the input and output, so we can now directly go to the architecture. There is nothing special here except the way the number of channels increase from 3 to 4096. The last few layers are linear which we are using for classification. Of course softmax is also needed. The ... is the splat operator. Yes its called that xD All it does is unrolls a structure like our tuple here.

We finally pop these into the GPU.

### VGG16

vgg16(initial_channels, num_classes) = Chain(
double_conv(initial_channels, 64)...,
double_conv(64,128)...,
conv_block(128, 256)...,
double_conv(256, 256)...,
conv_block(256, 512)...,
double_conv(512, 512)...,
conv_block(512, 512)...,
double_conv(512, 512)...,
x -> reshape(x, :, size(x, 4)),
Dense(512, 4096, relu),
Dropout(0.5),
Dense(4096, 4096, relu),
Dropout(0.5),
Dense(4096, num_classes),
softmax
) |> gpu



### VGG19

vgg19(initial_channels, num_classes) = Chain(
double_conv(initial_channels, 64)...,
double_conv(64, 128)...,
conv_block(128,256),
triple_conv(256,256)...,
conv_block(256,512),
triple_conv(512,512)...,
conv_block(512,512),
triple_conv(512,512)...,
x -> reshape(x, :, size(x, 4)),
Dense(512, 4096, relu),
Dropout(0.5),
Dense(4096, 4096, relu),
Dropout(0.5),
Dense(4096, num_classes),
softmax) |> gpu



Yes. That's it. Of course, please dont be fooled by the size. We have hidden away the layers in the functions we defined before. But it is that simple to define these things.

Now let us initialize our model.

m = vgg19(3, 10)
# OR
m = vgg16(3, 10)



Let us use cross entropy loss. And we will also define an accuracy function which is basically a sum of the number of labels that match from our test set on which we predict using our model.

loss(x, y) = crossentropy(m(x), y)

accuracy(x, y) = mean(onecold(m(x)) .== onecold(y))


Now that we have that, we can define the number of epochs(10), the optimizer (Adam) and finally run our models.

evalcb = throttle(() -> @show(accuracy(valX, valY)), 10)