Deconstructing Deep Learning + δeviations


# 100 Page ML Book

## Initial thoughts from the content

## Notes

### SVM

### Random variable

### Unbiased estimator

### Shallow learning

### Cost func

### Decision tree

### GD

### Techniques

### Data imputation

### Regularization

### Hyper param

### RNN

### Seq2seq

### Ensemble

### Other learnings

### Semi supervised

### Zero shot

### Combine models

### Other stuff

Notes from 100 Page ML Book

I decided to add notes to this blog too. All such notes will be tagged with "book" for easier search. These are my notes from reading "The Hundred-Page Machine Learning Book" by Andriy Burkov (available on Amazon). Do support the author if you can.

A quick note on how I make notes: I first annotate the PDF of the book, and then type out the text to make it searchable. Yes, I could probably use OCR, but this helps me remember more. Also, this is not meant to be a comprehensive review, only what I find interesting from the book. I read a lot about Deep Learning, so those topics will keep popping up.

Okay now let us get to it :)

- Seems like a book which summarizes ML and a tiny bit of DL
- Not in-depth, but more of an executive summary of sorts
- Most of the major algorithms explained in brief
- Bits of extra information scattered here and there

- I skipped making notes of anything I knew prior. So these points are things that I wanted to read again or just found interesting while I was reading the book.
- I skipped things like linear regression while making notes, so if you don't know what those are, better read the book :)
- Why ML -> Solve practical problems

- SVM sees feature vectors as points in a high-dimensional space and separates the classes with an n-dimensional hyperplane
- minimize euclidean norm
- kernels that make boundaries non linear
- look for largest margin
- Hinge loss -> used when data is not linearly separable; penalizes examples that fall on the wrong side of the decision boundary
- SVM with hinge loss -> soft margin; without -> hard margin
- large margin -> better generalization
- kernel trick -> implicitly transform original space into a higher dimensional space
- Lagrange multipliers -> reformulate the optimization problem into an equivalent representation -> can be solved by quadratic programming algorithms
- RBF most widely used
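The hinge loss is easy to write down; here is a minimal sketch in Python (the function name is mine, not from the book):

```python
def hinge_loss(y_true, score):
    """Hinge loss for one example with label y_true in {-1, +1}.

    Zero when the example sits on the correct side of the margin,
    and grows linearly the further it strays to the wrong side.
    """
    return max(0.0, 1.0 - y_true * score)

print(hinge_loss(+1, 2.5))  # 0.0 -> correct side, outside the margin
print(hinge_loss(+1, 0.3))  # 0.7 -> correct side but inside the margin
print(hinge_loss(-1, 1.0))  # 2.0 -> wrong side of the boundary
```

This is exactly why hinge-loss SVMs are "soft margin": a few points inside the margin just add a finite penalty instead of making the problem infeasible.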

- Prob distribution -> list of probabilities associated with each possible value -> probability mass function
- continuous random variable -> inf possible values in interval -> prob density function
- expectation -> mean of random variable
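For a discrete random variable the expectation is just a probability-weighted sum over the pmf; a quick sketch (my own toy example, not from the book):

```python
def expectation(pmf):
    """Expectation (mean) of a discrete random variable, given its
    probability mass function as {value: probability} pairs."""
    return sum(value * prob for value, prob in pmf.items())

# A fair six-sided die: each face has probability 1/6
die = {face: 1 / 6 for face in range(1, 7)}
print(expectation(die))  # close to 3.5
```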

- Given an unlimited number of unbiased estimators, their mean gives the actual value.
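A quick numerical check of that point (my own sketch, assuming Gaussian samples): each sample mean is an unbiased estimator of the true mean, and averaging many of them recovers the actual value.

```python
import random

random.seed(0)
true_mean = 5.0

# Draw many small samples; each sample mean is one unbiased estimate
estimates = []
for _ in range(2000):
    sample = [random.gauss(true_mean, 2.0) for _ in range(10)]
    estimates.append(sum(sample) / len(sample))

# The mean of the estimates lands close to the true value
print(sum(estimates) / len(estimates))
```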

- Learns parameters directly from features.
- vs DL -> parameters are learnt from the outputs of previous layers

- avg loss -> empirical risk

- acyclic graph
- at each branch, a specific feature is examined
- the next branch is chosen by comparing the feature against a threshold
- ID3 optimizes only approximately, greedily constructing a non-parametric model
- recursively continue
- Entropy is an uncertainty measure -> max when all random values have equal probability
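That entropy is maximal under equal probabilities is easy to verify numerically; a small sketch:

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 -> maximal for two outcomes
print(entropy([0.9, 0.1]))  # ~0.47 -> less uncertain
```

A deterministic outcome like `[1.0, 0.0]` gives zero entropy: no uncertainty left to measure.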

- SGD -> uses mini-batches to compute the gradient
- AdaGrad -> scales the learning rate α per parameter based on gradient history
- momentum -> accelerate SGD
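Momentum is a one-line change to plain gradient descent; a toy scalar sketch (the learning rate and β values are my own picks, not from the book):

```python
def gd_momentum(grad, w, lr=0.1, beta=0.9, steps=200):
    """Gradient descent with momentum: the velocity v accumulates past
    gradients, accelerating movement along consistent directions."""
    v = 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(w)
        w = w + v
    return w

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w_star = gd_momentum(lambda w: 2 * (w - 3), w=0.0)
print(w_star)  # converges close to 3.0
```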

- Binning -> convert a continuous feature into multiple binary ones
- Normalization -> speeds up convergence
- Standardization -> rescale so the feature has mean μ = 0 and standard deviation σ = 1
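Standardization in code, to make the μ/σ point concrete (a plain-Python sketch, no library needed):

```python
def standardize(values):
    """Rescale a feature so it has mean mu = 0 and std sigma = 1."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

scaled = standardize([10.0, 20.0, 30.0, 40.0])
print(scaled)  # symmetric around 0, with unit standard deviation
```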

- replace with a constant value outside the normal range
- replace with the average value
- use regression to predict the missing value
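The second option (fill with the average) takes a couple of lines; a sketch using `None` for missing entries:

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

print(impute_mean([1.0, None, 3.0, None, 5.0]))  # [1.0, 3.0, 3.0, 3.0, 5.0]
```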

- L1 -> sparse model, feature selection; lasso regression
- L2 -> shrinks weights without zeroing them out; ridge regression
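The two penalties differ only in how they measure the size of the weight vector; a sketch of the terms that get added to the loss:

```python
def l1_penalty(weights, lam):
    """L1 (lasso) term: lam * sum of absolute weights.
    Pushes weights to exactly zero -> sparse model, feature selection."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2 (ridge) term: lam * sum of squared weights.
    Shrinks weights towards zero without zeroing them out."""
    return lam * sum(w * w for w in weights)

w = [0.5, -2.0, 0.0]
print(l1_penalty(w, 0.1))  # 0.1 * (0.5 + 2.0 + 0.0)
print(l2_penalty(w, 0.1))  # 0.1 * (0.25 + 4.0 + 0.0)
```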

- Grid search
- Bayesian optimization
- Evolutionary optimization
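Grid search is the brute-force one of the three: try every combination and keep the best. A minimal sketch (the score function is a made-up toy, not a real model):

```python
import itertools

def grid_search(score_fn, grid):
    """Evaluate every combination of hyperparameter values and return
    the best-scoring combination along with its score."""
    best_params, best_score = None, float("-inf")
    for combo in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy score that peaks at lr=0.1, depth=3
score = lambda p: -((p["lr"] - 0.1) ** 2) - (p["depth"] - 3) ** 2
best, _ = grid_search(score, {"lr": [0.01, 0.1, 1.0], "depth": [1, 3, 5]})
print(best)  # {'lr': 0.1, 'depth': 3}
```

The grid grows multiplicatively with each hyperparameter, which is exactly why Bayesian and evolutionary methods exist.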

- Sequence
- not feed forward -> loops
- each unit gets 2 inputs -> a vector of outputs from the previous layer and a vector of states from the previous time step
- backprop through time
- gated RNN -> forget gate
- store info for future use
- read, write and erase info stored in the units
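The "two inputs per unit" idea fits in one line; a scalar toy sketch of a single recurrent step (the weight values are arbitrary picks of mine):

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One recurrent step: combine the current input with the state
    carried over from the previous time step."""
    return math.tanh(w_x * x + w_h * h_prev + b)

# Unrolling over a sequence: the state loops back in at every step
h = 0.0
for x in [1.0, 0.5, -0.3]:
    h = rnn_step(x, h, w_x=0.8, w_h=0.5, b=0.0)
print(h)  # final hidden state, squashed into (-1, 1) by tanh
```

Backprop through time is just backprop applied to this unrolled loop.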

- Encoder -> generates a state that captures the meaning of the input -> embedding
- decoder -> takes the embedding and produces the output
- best results with attention

- Train many low accuracy models and combine

- Active learning -> get labels for the examples that would contribute most to the model. Either by density (how many examples lie around x) or by uncertainty (how uncertain the model's prediction for x is)
- SVM -> Use svm to predict differences and get them annotated

- self learning
- autoencoder
- bottleneck layer -> embedding
- denoising -> corrupt the input with random perturbation / Gaussian noise
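The corruption step of a denoising autoencoder is just added noise; a sketch (the network itself is omitted here):

```python
import random

def corrupt(x, sigma=0.1, seed=0):
    """Corrupt an input vector with Gaussian noise, as a denoising
    autoencoder does before learning to reconstruct the clean input."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sigma) for v in x]

clean = [0.2, 0.5, 0.9]
noisy = corrupt(clean)
print(noisy)  # the clean values plus small random perturbations
```

The autoencoder is then trained to map `noisy` back to `clean`, which forces the bottleneck embedding to capture structure rather than memorize inputs.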

- use embeddings to represent both the input x and the output y

- Average
- majority vote
- Stacking -> train a higher-level model on the outputs of the base models, and tune its hyperparameters
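The first two combination rules in code (a sketch):

```python
from collections import Counter

def majority_vote(predictions):
    """Classification: pick the label predicted most often."""
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    """Regression: average the predicted values."""
    return sum(predictions) / len(predictions)

print(majority_vote(["cat", "dog", "cat"]))  # cat
print(average([1.0, 2.0, 3.0]))              # 2.0
```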

- regularization -> dropout, batch norm, early stop
- avoid loops
- density estimation -> model the probability density fn -> novelty detection
- DBSCAN -> builds clusters of arbitrary shape
- Gaussian mixture model -> an example can belong to several clusters, with different membership scores
- UMAP seems to be better than t-SNE :o
- Ranking -> LambdaMART -> optimizes lists on a ranking metric, e.g. Mean Average Precision (MAP)