Deconstructing Deep Learning + δeviations

# 100PageMLBook

Notes from 100 Page ML Book

I decided to add notes to this blog too. All such notes will be tagged with "book" for easier search. These are my notes from reading Andriy Burkov's "The Hundred-Page Machine Learning Book" (Amazon). Do support the author if you can.

A quick note on how I make notes. I first annotate the PDF of the book and then type up the text to make it searchable. Yes, I could probably use OCR, but typing helps me remember more. Also, this is not meant to be a comprehensive review, only what I find interesting from the book. I read a lot about Deep Learning, so those topics will keep popping up.

Okay now let us get to it :)

## Initial thoughts from the content

• Seems like a book which summarizes ML and a tiny bit of DL
• Not in depth but more of an executive summary of sorts
• Most of the major algorithms explained in brief
• Bits of extra information scattered here and there
• I skipped making notes of anything I already knew. So these points are things that I wanted to read again or just found interesting while reading the book.
• I skipped things like linear regression while making notes, so if you don't know what those are, better read the book :)
• Why ML -> Solve practical problems

### SVM

• SVM sees each feature vector as a point in an n-dimensional space and separates the classes with a hyperplane
• minimize the Euclidean norm of the weight vector
• kernels make the decision boundary non-linear
• look for the largest margin
• Hinge loss -> used when the data is not linearly separable. Penalizes examples on the wrong side of the decision boundary
• SVMs with hinge loss -> soft margin. Original formulation -> hard margin
• large margin -> better generalization
• kernel trick -> implicitly transform the original space into a higher-dimensional space
• Lagrange multipliers -> rewrite the optimization problem in an equivalent form -> can be solved by quadratic programming algorithms
• RBF kernel is the most widely used
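
To make the hinge loss and soft-margin points concrete, here is a minimal sketch in plain Python. The function names, the `w·x - b` sign convention and the `C` weighting are my assumptions for illustration, not something copied from the book or from any library.

```python
def hinge_loss(w, b, x, y):
    """Hinge loss max(0, 1 - y * (w.x - b)) for one example, label y in {-1, +1}."""
    score = sum(wi * xi for wi, xi in zip(w, x)) - b
    return max(0.0, 1.0 - y * score)


def soft_margin_objective(w, b, xs, ys, C=1.0):
    """C * ||w||^2 plus the average hinge loss over the dataset.

    A small ||w|| corresponds to a large margin; the hinge term only
    penalizes examples that are misclassified or inside the margin.
    """
    norm_sq = sum(wi * wi for wi in w)
    avg_hinge = sum(hinge_loss(w, b, x, y) for x, y in zip(xs, ys)) / len(xs)
    return C * norm_sq + avg_hinge
```

An example that sits comfortably on the correct side contributes zero loss, which is why only the points near or across the boundary end up shaping the solution.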

### Random variable

• Prob distribution of a discrete random variable -> list of probabilities associated with each possible value -> prob mass function
• continuous random variable -> infinitely many possible values in an interval -> prob density function
• expectation -> mean of the random variable
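
A tiny sketch of the discrete case: a probability mass function stored as a dictionary, and the expectation computed as the probability-weighted sum of values. The `expectation` helper and the fair-die example are mine, just for illustration.

```python
def expectation(pmf):
    """Expected value of a discrete random variable.

    pmf maps each possible value to its probability; probabilities must sum to 1.
    """
    total = sum(pmf.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(value * p for value, p in pmf.items())


# Fair six-sided die: every face has probability 1/6.
die = {face: 1 / 6 for face in range(1, 7)}
print(expectation(die))  # close to 3.5
```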

### Unbiased estimator

• Unbiased estimator -> averaged over an unlimited number of samples, its expected value equals the true value of the statistic.

### Shallow learning

• Learns parameters directly from the features of the training examples.
• Vs DL -> most parameters are learnt from the outputs of previous layers

### Cost func

• avg loss -> empirical risk

### Decision tree

• acyclic graph
• at each branching node, a specific feature is examined
• choose the next branch based on a threshold
• ID3 optimizes its criterion approximately by constructing a non-parametric model
• recursively continue splitting
• Entropy is an uncertainty measure -> max when all values of the random variable have equal probability
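
The entropy point is easy to check numerically. This is a small sketch using base-2 logarithms (so the result is in bits); the `entropy` helper is my own naming.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution given as probabilities."""
    # Terms with p == 0 contribute nothing (the limit of p * log p is 0).
    return -sum(p * math.log2(p) for p in probs if p > 0)


print(entropy([0.5, 0.5]))  # uniform over two outcomes: maximum uncertainty
print(entropy([0.9, 0.1]))  # skewed: lower entropy
print(entropy([1.0]))       # a certain outcome: no uncertainty
```

A split in a decision tree is good when the labels in each resulting subset have low entropy, which is the intuition behind choosing the feature and threshold at each node.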

### GD

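• SGD -> approximates the gradient using small batches (subsets) of the training data
• momentum -> accelerates SGD by keeping it moving in the relevant direction and reducing oscillations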
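
Here is a rough sketch of minibatch SGD with classical momentum, fitting a 1-D linear model `y = w*x + b` with squared error. Everything in it (the function name, the learning rate, `beta`, batch size and the toy data) is an assumption chosen for illustration, not a recommended setting.

```python
import random


def sgd_momentum(xs, ys, lr=0.05, beta=0.9, batch_size=5, epochs=300, seed=0):
    """Fit y = w*x + b by minibatch SGD with momentum on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    vw, vb = 0.0, 0.0  # velocity terms: a running accumulation of past gradients
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the squared error, averaged over the batch only.
            gw = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            gb = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # Momentum: blend the previous step with the new gradient step.
            vw = beta * vw + lr * gw
            vb = beta * vb + lr * gb
            w -= vw
            b -= vb
    return w, b


# Toy, noise-free data generated from y = 3x + 1.
xs = [i / 20 for i in range(20)]
ys = [3 * x + 1 for x in xs]
w, b = sgd_momentum(xs, ys)
print(w, b)
```

Setting `beta=0.0` turns this back into plain minibatch SGD, which makes it easy to compare the two.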

### Techniques

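• Binning -> convert a continuous feature into multiple binary ones
• Normalization -> rescale a feature into a fixed range such as [0, 1] -> can increase learning speed
• Standardization -> rescale so that the feature has mean μ = 0 and standard deviation σ = 1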
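
A minimal sketch of the three techniques using only the standard library. The helper names and the age bins in the usage line are made up for illustration.

```python
import statistics


def normalize(values):
    """Min-max normalization: rescale values into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Standardization: subtract the mean and divide by the standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]


def bin_one_hot(value, edges):
    """Binning: map a continuous value to a one-hot vector over len(edges) + 1 bins.

    edges must be sorted in increasing order.
    """
    index = sum(value >= edge for edge in edges)
    return [1 if i == index else 0 for i in range(len(edges) + 1)]


print(normalize([2, 4, 6]))
print(standardize([1, 2, 3]))
print(bin_one_hot(25, [18, 40, 65]))  # hypothetical age bins
```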

### Data imputation

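• replace missing values with the same value outside the normal range of the feature
• replace with the avg value of the feature
• use regression to predict the missing value from the other features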
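
The first two strategies can be sketched in a few lines. Missing values are represented as `None` here, and the sentinel `-1.0` is an arbitrary assumption; in practice it should be a value that cannot occur in the real data.

```python
def impute_sentinel(values, sentinel=-1.0):
    """Replace missing entries (None) with one fixed out-of-range value."""
    return [sentinel if v is None else v for v in values]


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


print(impute_sentinel([1.0, None, 3.0]))
print(impute_mean([1.0, None, 3.0]))
```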

### Regularization

• L1 -> sparse model, feature selection, lasso reg
• L2 -> shrinks weights without zeroing them, differentiable, ridge reg
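
One way to see why L1 gives sparse models while L2 only shrinks weights is a one-dimensional toy problem that has a closed-form answer. This is my own illustration, assuming the objectives written in the docstrings; `a` is the unregularized weight and `lam` the regularization strength.

```python
def l1_shrink(a, lam):
    """Minimizer of 0.5 * (w - a)**2 + lam * |w| (soft thresholding)."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0  # small weights are set exactly to zero


def l2_shrink(a, lam):
    """Minimizer of 0.5 * (w - a)**2 + lam * w**2."""
    return a / (1.0 + 2.0 * lam)  # scaled towards zero, never exactly zero for a != 0


for a in (0.3, 2.0, -2.0):
    print(a, l1_shrink(a, 0.5), l2_shrink(a, 0.5))
```

With `lam = 0.5`, the small weight `0.3` becomes exactly `0.0` under L1 but only shrinks to `0.15` under L2.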

### Hyper param

• Grid search
• Bayesian optimization
• Evolutionary optimization
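
Grid search is the simplest of the three: try every combination and keep the best. A minimal sketch, where `score_fn` stands in for "train a model with these hyperparameters and return its validation score"; the toy scoring function and parameter names are invented for the example.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Hypothetical stand-in for a validation score, peaking at lr=0.1, depth=3.
    return -(params["lr"] - 0.1) ** 2 - (params["depth"] - 3) ** 2


grid = {"lr": [0.01, 0.1, 1.0], "depth": [1, 3, 5]}
print(grid_search(grid, toy_score))
```

The cost grows multiplicatively with every hyperparameter added to the grid, which is the usual motivation for the Bayesian and evolutionary alternatives.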

### RNN

• Sequence
• not feed forward -> loops
• each unit gets 2 inputs -> vector of outputs from the prev layer, vector of states from the same layer at the prev time step
• backprop through time
• gated RNN -> forget gate
• store info for future use
• read write and erase info stored in units
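
The "two inputs per unit" idea looks like this for a plain (ungated) recurrent layer. This is a sketch under my own notation: `W` multiplies the current input, `U` multiplies the previous state, and `tanh` is assumed as the activation. Gated units add more machinery on top of this.

```python
import math


def rnn_step(x, h_prev, W, U, b):
    """One time step: h_t = tanh(W x_t + U h_{t-1} + b)."""
    h = []
    for i in range(len(b)):
        z = b[i]
        z += sum(W[i][j] * x[j] for j in range(len(x)))            # input from prev layer
        z += sum(U[i][k] * h_prev[k] for k in range(len(h_prev)))  # state from prev time step
        h.append(math.tanh(z))
    return h


def run_rnn(xs, h0, W, U, b):
    """Feed a sequence through the layer, carrying the state across time steps."""
    h, states = h0, []
    for x in xs:
        h = rnn_step(x, h, W, U, b)
        states.append(h)
    return states


# One hidden unit, one input feature, three time steps.
print(run_rnn([[1.0], [0.0], [0.0]], [0.0], W=[[1.0]], U=[[1.0]], b=[0.0]))
```

In the usage line the input is non-zero only at the first step, yet the later states are still non-zero because the state loops back in.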

### Seq2seq

• Encoder -> generate state with meaning representation -> embedding
• decoder -> take embedding and give output
• best results with attention

### Ensemble

• Train many low accuracy models and combine

### Other learnings

• Active learning -> add labels to the examples which contribute most to the model. Pick them by either density (how many examples are around x) or uncertainty (how uncertain the model's prediction is)
• SVM-based -> use an SVM to find the examples closest to the decision boundary and get them annotated
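
The uncertainty criterion can be sketched for a binary classifier: the examples whose predicted probability is closest to 0.5 are the ones the model is least sure about, so they go to the annotator first. The helper name and the probabilities are illustrative assumptions.

```python
def most_uncertain(probs, k=1):
    """Indices of the k unlabeled examples whose predicted P(y=1) is closest to 0.5."""
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:k]


# Predicted probabilities for four unlabeled examples (made-up numbers).
print(most_uncertain([0.9, 0.52, 0.1, 0.45], k=2))
```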

### Semi supervised

• self-learning
• autoencoder
• bottleneck layer -> embedding
• denoising autoencoder -> corrupts the input (left-hand side) with a random perturbation, e.g. Gaussian noise

### Zero shot

• use embeddings to represent input x and also output y

### Combine models

• Average
• majority vote
• Stack -> train a meta-model that takes the outputs of the base models as input. Tune the stacked model's hyper params with cross-validation
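
The first two combination strategies fit in a few lines. A minimal sketch with my own helper names: averaging for regression outputs or class scores, majority vote for predicted labels.

```python
from collections import Counter


def average(predictions):
    """Combine numeric predictions from several models by averaging."""
    return sum(predictions) / len(predictions)


def majority_vote(labels):
    """Combine class predictions by picking the most frequent label."""
    return Counter(labels).most_common(1)[0][0]


print(average([0.2, 0.4, 0.9]))
print(majority_vote(["cat", "dog", "cat"]))
```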

### Other stuff

• regularization for neural nets -> dropout, batch norm, early stopping
• avoid loops in code -> prefer vectorized operations
• density estimation -> model the probability density fn -> novelty detection
• DBSCAN -> build clusters with arbitrary shape
• Gaussian mixture model -> an example can be a member of several clusters with diff membership scores
• UMAP seems to be better than t-SNE :o
• Ranking -> LambdaMART -> optimize lists directly on a metric, e.g. Mean Average Precision (MAP)