Deconstructing Deep Learning + δeviations

Loss Functions

Reading time : ~21 mins

by Subhaditya Mukherjee

In this post we shall explore as many loss functions as I can find.

Loss functions are arguably one of the most important parts of a machine learning model. They tell the model how well it did, which is basically what allows it to learn. Simply put, a loss is a measure of the difference between the required result and the produced one. Quite obviously, what that means differs from task to task. For example, in a Generative Adversarial Network (GAN) the loss function is completely different. In WGAN, it is a distance metric called the Wasserstein distance. In a U-Net, the loss is the difference between two images, and so on and so forth.

Anyway, let us explore everything we can about loss functions. I first made a list of all the loss functions offered by Keras. It seems to be pretty comprehensive, and I have not heard of many of them so far, so let's see. Edit: Maybe this isn't a fully comprehensive list, but I will add to it if I find more later. I realized that most of these seem to just be small modifications of previous ones. Some are beyond my understanding right now, but I will come back to them when I get it. (I added a tiny list of those I don't understand yet at the bottom.)

Since I am arbitrarily hooking together loss functions from every library I can find, if you feel something is wrong do let me know :) Also note that the examples used are not necessarily the ones that will be used while training; they are random values used to test whether the code works.

Log softmax

\[\log\left( \frac{e^{ŷ}}{\mathrm{sum}\left( e^{ŷ} \right)} \right)\]
logsoftmax(ŷ) = log.(exp.(ŷ) ./ sum(exp.(ŷ)))
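
One thing worth knowing: exponentiating large scores overflows, so a common trick is to subtract the maximum score before exponentiating, which leaves the result mathematically unchanged. Here is a minimal sketch of that idea. The names `stable_logsoftmax` and `naive_logsoftmax` are my own and not from any library.

```julia
# Hypothetical helper: log softmax with the usual max-subtraction trick.
function stable_logsoftmax(ŷ)
    shifted = ŷ .- maximum(ŷ)                  # largest entry becomes 0, so exp cannot overflow
    return shifted .- log(sum(exp.(shifted)))
end

# The direct formula, for comparison.
naive_logsoftmax(ŷ) = log.(exp.(ŷ) ./ sum(exp.(ŷ)))

scores = [1.0, 2.0, 3.0]
println(stable_logsoftmax(scores))             # agrees with the naive version on small inputs
println(naive_logsoftmax(scores))
println(stable_logsoftmax([1000.0, 1001.0]))   # stays finite
println(naive_logsoftmax([1000.0, 1001.0]))    # NaN, because exp(1000.0) is Inf
```

For everyday values both versions give the same numbers, so the simple one-liner is fine for experimenting.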

BCE Logits

Despite the name, the softmax makes this a categorical cross entropy computed directly from logits. Here size(y, 2) is the number of samples (columns).

\[- \mathrm{sum}\left( y \cdot \mathrm{logsoftmax}\left( ŷ \right) \cdot weight \right) \cdot \frac{1}{\mathrm{size}\left( y, 2 \right)}\]
bcelogits(y,ŷ,weight) = -sum(y .* logsoftmax(ŷ) .* weight) * 1 // size(y, 2)

Margin Ranking

\[\frac{1}{\mathrm{length}\left( y \right)} \cdot \mathrm{sum}\left( \mathrm{max}\left( 0, - y \cdot \left( x1 - x2 \right) + margin \right) \right)\]
marginranking(x1,x2,y,margin=0.0) = (1/length(y))*sum(max.(0, -y.*(x1.-x2).+margin))
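
To get a feel for it, here is a small sketch with made-up scores. A label of 1 means the first input should rank higher and -1 means the second should. The function name `marginranking_sketch` and the values are purely illustrative.

```julia
# Same formula as above, with the margin as a keyword argument for readability.
marginranking_sketch(x1, x2, y; margin = 0.0) =
    sum(max.(0, -y .* (x1 .- x2) .+ margin)) / length(y)

x1 = [0.8, 0.2]
x2 = [0.3, 0.6]
agree    = [1, -1]    # labels that match the actual ordering of x1 and x2
disagree = [-1, 1]    # labels that contradict it

println(marginranking_sketch(x1, x2, agree))                 # no penalty
println(marginranking_sketch(x1, x2, disagree))              # penalised by the size of the gaps
println(marginranking_sketch(x1, x2, agree; margin = 1.0))   # a margin demands a bigger gap
```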

Huber/Smooth L1/Smooth MAE

Elements with \(\left\|y - ŷ\right\| \lt 1.0\) contribute the quadratic term

\[\frac{1}{\mathrm{length}\left( y \right)} \cdot \mathrm{sum}\left( 0.5 \cdot \left( y - ŷ \right)^{2} \right)\]

and all other elements contribute the linear term

\[\frac{1}{\mathrm{length}\left( y \right)} \cdot \mathrm{sum}\left( \left\|y - ŷ\right\| - 0.5 \right)\]
function huber(y,ŷ)
    d = abs.(y .- ŷ)
    # choose the quadratic or the linear branch separately for every element
    return (1/length(y))*sum(ifelse.(d .< 1.0, 0.5 .* d.^2, d .- 0.5))
end
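
The point of Huber is that large errors are penalised linearly instead of quadratically, so a single outlier does not dominate. Below is a rough sketch comparing it with a plain mean squared error on made-up values. The names `huber_sketch` and `mse_sketch` are mine, and the threshold is fixed at 1.0 as in the formulas above.

```julia
# Per-element Huber with a threshold of 1.0.
function huber_sketch(y, ŷ)
    d = abs.(y .- ŷ)
    return sum(ifelse.(d .< 1.0, 0.5 .* d .^ 2, d .- 0.5)) / length(y)
end

# Plain mean squared error, for comparison.
mse_sketch(y, ŷ) = sum((y .- ŷ) .^ 2) / length(y)

y         = [1.0, 2.0, 3.0, 4.0]
ŷ_good    = [1.1, 1.9, 3.2, 3.8]
ŷ_outlier = [1.1, 1.9, 3.2, 14.0]   # the last prediction is off by 10

println(huber_sketch(y, ŷ_good), "  ", mse_sketch(y, ŷ_good))
println(huber_sketch(y, ŷ_outlier), "  ", mse_sketch(y, ŷ_outlier))   # MSE grows far more than Huber
```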

Negative log likelihood

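Here y is assumed to hold the probabilities the model gave to the correct classes.

\[- \mathrm{sum}\left( \log\left( y \right) \right)\]
nll(y) = -sum(log.(y))
nll(x,y) = -sum(log.(y))  # note: x is not used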


Binary cross entropy

\[- \frac{1}{\mathrm{length}\left( y \right)} \cdot \mathrm{sum}\left( y \cdot \log\left( ŷ \right) + \left( 1 - y \right) \cdot \log\left( 1 - ŷ \right) \right)\]
bce(y,ŷ) = -(1/length(y))*sum(y.*log.(ŷ).+((1 .-y).*log.(1 .-ŷ)))
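
The logs blow up when a prediction is exactly 0 or 1, so in practice the predictions are usually clamped a tiny distance away from both ends. Here is a minimal sketch of that. The name `bce_clamped` and the default `ϵ = 1e-7` are my own choices, not taken from a particular library.

```julia
# Hypothetical variant: binary cross entropy with predictions clamped into [ϵ, 1 - ϵ].
function bce_clamped(y, ŷ; ϵ = 1e-7)
    p = clamp.(ŷ, ϵ, 1 - ϵ)
    return -sum(y .* log.(p) .+ (1 .- y) .* log.(1 .- p)) / length(y)
end

y = [1.0, 0.0, 1.0]
println(bce_clamped(y, [0.9, 0.1, 0.8]))   # ordinary case
println(bce_clamped(y, [1.0, 0.0, 1.0]))   # perfect predictions: tiny but finite, no 0 * log(0)
```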


Categorical cross entropy

\[- \mathrm{sum}\left( y \cdot \log\left( ŷ \right) \right)\]
cce(y,ŷ) = -sum(y.*log.(ŷ))

Cosine similarity

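L2 norm is \(\sqrt{\mathrm{sum}\left( \left( \left\|x\right\| \right)^{2} \right)}\)

Cosine similarity is \(- \mathrm{sum}\left( \frac{y}{\mathrm{l2norm}\left( y \right)} \cdot \frac{ŷ}{\mathrm{l2norm}\left( ŷ \right)} \right)\), that is, both vectors are scaled to unit length before the element-wise product is summed.

l2_norm(x) = sqrt(sum(abs.(x).^2))
function cosinesimilarity(y,ŷ)
    # divide each vector by its L2 norm, then take the negative dot product
    return -sum((y ./ l2_norm(y)).*(ŷ ./ l2_norm(ŷ)))
end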
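
As a sanity check, a cosine similarity loss should give -1 for vectors pointing the same way, 0 for perpendicular ones and 1 for opposite ones, regardless of their lengths. A small sketch with illustrative names (`unit` and `cosine_loss` are mine):

```julia
# Scale a vector to unit L2 length.
unit(x) = x ./ sqrt(sum(abs2, x))

# Negative cosine similarity, so that minimising it aligns the two vectors.
cosine_loss(y, ŷ) = -sum(unit(y) .* unit(ŷ))

println(cosine_loss([1.0, 0.0], [5.0, 0.0]))    # same direction, different length
println(cosine_loss([1.0, 0.0], [0.0, 2.0]))    # perpendicular
println(cosine_loss([1.0, 0.0], [-3.0, 0.0]))   # opposite direction
```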

KL Divergence

We first define xlogx, \(x \cdot \log\left( x \right)\), which returns 0 for the weird edge case \(x = 0\)

Then entropy \(\mathrm{sum}\left( \mathrm{xlogx}\left( y \right) \right)\)

Then cce as defined before \(- \mathrm{sum}\left( y \cdot \log\left( ŷ \right) \right)\)

Finally KLD \(\left( entropy + crossentropyloss \right) \cdot \frac{1}{\mathrm{size}\left( y, 2 \right)}\)

function xlogx(x)
  result = x * log(x)
  ifelse(iszero(x), zero(result), result)
end

function kldivergence(y,ŷ)
  entropy = sum(xlogx.(y))
  cross_entropy = cce(y, ŷ)
  # average over the samples (columns) so both terms are scaled the same way
  return (entropy + cross_entropy) * 1 // size(y,2)
end
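
A quick way to check a KL divergence implementation is that it should be zero when both distributions are identical, positive otherwise, and not symmetric. Here is a self-contained sketch for single probability vectors. The names `xlogx_sketch` and `kld_sketch` and the example distributions are my own.

```julia
# x * log(x), defined as 0 at x = 0 to avoid 0 * -Inf = NaN.
xlogx_sketch(x) = iszero(x) ? zero(float(x)) : x * log(x)

# KL(y || ŷ) = sum(y log y) - sum(y log ŷ), for one probability vector each.
kld_sketch(y, ŷ) = sum(xlogx_sketch.(y)) - sum(y .* log.(ŷ))

p = [0.1, 0.6, 0.3]
q = [0.3, 0.3, 0.4]

println(kld_sketch(p, p))                 # identical distributions
println(kld_sketch(p, q))                 # positive
println(kld_sketch(q, p))                 # a different value: KL is not symmetric
println(kld_sketch([0.0, 1.0, 0.0], q))   # one-hot target works thanks to the xlogx edge case
```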

Log Cosh

We first define the softplus function \(\log\left( e^{x} + 1 \right)\)

Then, \(x = ŷ - y\)

logcosh = \(\mathrm{mean}\left( x + \mathrm{softplus}\left( -2 \cdot x \right) - \log\left( 2.0 \right) \right)\)

using Statistics  # provides mean
softplus(x) = log.(exp.(x).+1)
function logcosh(y,ŷ)
    x = ŷ .- y
    return mean(x.+softplus(-2 .*x) .- log(2.))
end
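
Why the softplus? Since \(\cosh(x) = e^{x} \cdot \frac{1 + e^{-2x}}{2}\), taking the log gives \(x + \mathrm{softplus}(-2x) - \log(2)\), and this form avoids computing cosh directly, which overflows for large x. A small sketch to convince myself, using scalar helpers with names of my own choosing:

```julia
# Softplus written exactly as in the formula above (scalar version).
softplus_sketch(x) = log(exp(x) + 1)

# log(cosh(x)) rewritten through softplus.
logcosh_via_softplus(x) = x + softplus_sketch(-2x) - log(2.0)

println(logcosh_via_softplus(0.5), "  ", log(cosh(0.5)))   # the two agree for moderate x
println(logcosh_via_softplus(1000.0))                       # finite, roughly 1000 - log(2)
println(log(cosh(1000.0)))                                  # Inf, because cosh overflows
```

Note that this simple softplus would itself overflow for very large negative x, so the sketch only shows the benefit on the positive side.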

MAE == L1

There are two ways of doing this, mean and sum. For mean,

\[\frac{1}{\mathrm{length}\left( y \right)} \cdot \mathrm{sum}\left( \left\|y - ŷ\right\| \right)\]

For sum,

\[\mathrm{sum}\left( \left\|y - ŷ\right\| \right)\]
function mae(y,ŷ,reduction="mean")
    if reduction=="mean"
        return (1/length(y))*sum(abs.(y .- ŷ))
    elseif reduction=="sum"
        return sum(abs.(y .- ŷ))
    end
end


Mean absolute percentage error (MAPE)

\[\frac{1}{\mathrm{length}\left( y \right)} \cdot \mathrm{sum}\left( \left\|\frac{y - ŷ}{y}\right\| \right)\]
mape(y,ŷ) = (1/length(y))*sum(abs.((y .- ŷ) ./ y))


Mean squared logarithmic error (MSLE)

\[\frac{1}{\mathrm{length}\left( y \right)} \cdot \mathrm{sum}\left( \left( \log\left( y + 1 \right) - \log\left( ŷ + 1 \right) \right)^{2} \right)\]
msle(y,ŷ) = (1/length(y))*sum((log.(y.+1).-log.(ŷ.+1)).^2)


Mean squared error (MSE)

function mse(y,ŷ,reduction="mean")
    if reduction=="mean"
        return (1/length(y))*sum((y .- ŷ).^2)
    elseif reduction=="sum"
        return sum((y .- ŷ).^2)
    end
end




Poisson

\[\frac{1}{\mathrm{length}\left( y \right)} \cdot \mathrm{sum}\left( ŷ - y \cdot \log\left( ŷ \right) \right)\]
poisson(y,ŷ) = (1/length(y))*sum(ŷ.-y.*log.(ŷ))

Sparse CE

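Here y holds integer class labels instead of one-hot vectors, so we simply pick out the predicted probability of the correct class for each sample. This assumes ŷ is a classes × samples matrix of probabilities.

\[- \mathrm{sum}\left( \log\left( ŷ_{y_{i}, i} \right) \right)\]
sparsece(y,ŷ) = -sum(log(ŷ[y[i], i]) for i in eachindex(y))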
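
The "sparse" version should give exactly the same number as categorical cross entropy on the equivalent one-hot targets. The only difference is how the labels are stored. Here is a sketch that checks this, assuming predictions are a classes × samples matrix and labels are 1-based class indices. All names and values are illustrative.

```julia
# Categorical cross entropy on one-hot targets.
cce_sketch(y, ŷ) = -sum(y .* log.(ŷ))

# Sparse version: look up the probability of the labelled class in each column.
sparsece_sketch(labels, ŷ) = -sum(log(ŷ[labels[i], i]) for i in eachindex(labels))

ŷ = [0.7 0.2;
     0.2 0.5;
     0.1 0.3]            # 3 classes × 2 samples, each column sums to 1
labels = [1, 2]          # sample 1 is class 1, sample 2 is class 2
onehot = [1.0 0.0;
          0.0 1.0;
          0.0 0.0]       # the same labels written as one-hot columns

println(sparsece_sketch(labels, ŷ))
println(cce_sketch(onehot, ŷ))    # same value
```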

Squared Hinge

\[\mathrm{sum}\left( \left( \mathrm{max}\left( 0, 1 - y \cdot ŷ \right) \right)^{2} \right)\]
squaredhinge(y,ŷ) = sum(max.(0,1 .-(y.*ŷ)).^2)

Triplet margin

We first find the positive distance \(posdistance = \mathrm{sum}\left( \left( anchor - positive \right)^{2} \right)\), summed over the feature dimension

Then the negative distance \(negdistance = \mathrm{sum}\left( \left( anchor - negative \right)^{2} \right)\)

Then the temporary loss \(loss_{1} = posdistance - negdistance + \alpha\)

And the final loss \(\mathrm{sum}\left( \mathrm{max}\left( loss_{1}, 0.0 \right) \right)\)

function tripletloss(anchor, positive, negative, α = 0.3)
    # squared distances per sample, assuming features run along the first dimension
    pos_distance = sum((anchor.-positive).^2, dims=1)
    neg_distance = sum((anchor.-negative).^2, dims=1)
    loss_1 = (pos_distance.-neg_distance).+α
    return sum(max.(loss_1, 0.0))
end
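
To see the margin at work, here is a sketch on a tiny made-up batch. It assumes each column is one sample and uses squared Euclidean distances summed over the rows (features). The name `triplet_sketch` and the embeddings are illustrative only.

```julia
# Triplet loss with squared Euclidean distances, one sample per column.
function triplet_sketch(anchor, positive, negative; α = 0.3)
    pos = sum((anchor .- positive) .^ 2, dims = 1)
    neg = sum((anchor .- negative) .^ 2, dims = 1)
    return sum(max.(pos .- neg .+ α, 0.0))
end

anchor   = [0.0 1.0;
            0.0 1.0]     # sample 1 = (0, 0), sample 2 = (1, 1)
positive = [0.1 1.0;
            0.0 0.9]     # both positives sit close to their anchors
negative = [1.0 1.1;
            1.0 1.0]     # negative 1 is far away, negative 2 is as close as the positive

# Sample 1 is already separated by more than α and contributes nothing.
# Sample 2 has equal distances, so it contributes the full margin α.
println(triplet_sketch(anchor, positive, negative))
```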


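Hinge

Here \(w_{t}\) is the weight for the target class and \(w_{y}\) holds the weights for the other classes.

\[\mathrm{max}\left( 0, 1 + \mathrm{max}\left( w_{y} \cdot x - w_{t} \cdot x \right) \right)\]
hinge(x,w_y,w_t) = max(0, 1 + maximum(w_y.*x .- w_t.*x))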

Loss functions I can’t make sense of right now
