# swish

### Reading time: ~1 min

# Notes

by Subhaditya Mukherjee

Paper notes for:

**[24]** swish

- Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for activation functions. arXiv preprint arXiv:1710.05941.

- Supposedly does better than SELU and ReLU in some cases (nope)
- Slower than ReLU
- f(x) = x · sigmoid(βx)
- Found with reinforcement-learning-based search over "core units"
- The search space is simple, inspired by the optimizer search space of Bello et al. (2017): it composes unary and binary functions to construct the activation function
- The activation function is constructed by repeatedly composing the "core unit", defined as b(u1(x1), u2(x2)): it takes two scalar inputs, passes each independently through a unary function, and combines the two unary outputs with a binary function that outputs a scalar (see the sketch after this list)
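A minimal sketch of swish as a trainable module, assuming PyTorch (the class name, the choice of a learnable scalar β, and the demo values are mine; the paper also evaluates a fixed β = 1):

```python
import torch
import torch.nn as nn


class Swish(nn.Module):
    """Swish: f(x) = x * sigmoid(beta * x).

    Read as a core unit b(u1(x), u2(x)):
    u1 = identity, u2 = sigmoid(beta * x), b = multiplication.
    """

    def __init__(self, beta: float = 1.0):
        super().__init__()
        # Learnable scalar beta; the paper also uses a fixed beta = 1.
        self.beta = nn.Parameter(torch.tensor(beta))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(self.beta * x)


# Quick check: ~0 for very negative inputs, ~x for large positive inputs.
print(Swish()(torch.linspace(-4.0, 4.0, 5)))
```

Note the limiting behaviour: at β = 0 swish is the linear function x/2, and as β → ∞ the sigmoid approaches a 0-1 step, so swish approaches ReLU.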

## Tips for activation fns

- Complicated activation functions consistently underperform simpler activation functions, potentially due to an increased difficulty in optimization.
- A common structure shared by the top activation functions is the use of the raw preactivation x as input to the final binary function (swish itself does this: the raw x multiplies sigmoid(βx))
- The searches discovered activation functions that utilize periodic functions, such as sin and cos.
- Functions that use division tend to perform poorly because the output explodes when the denominator is near 0 (a quick numeric illustration below)
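
To make the last point concrete, a toy check; the division-based candidate cos(x)/x here is my own illustrative example, not one from the paper:

```python
import numpy as np

# A division-based candidate such as f(x) = cos(x) / x:
# the numerator stays bounded, but the output blows up as x -> 0.
x = np.array([1.0, 0.1, 0.01, 0.001])
print(np.cos(x) / x)  # roughly [0.54, 9.95, 100.0, 1000.0]
```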