- .. is a repository
- .. is a community

- Statistics
- Machine learning
- Computer science
- Data Engineering
- Visualisation

- New ideas from academia and industry practitioners
- Connect ideas to problems

- Validate your own actions
- boundaries (does it even make sense to apply this…)
- Wrong / right
- Do not reinvent square wheels

- Research `->` practice
- Practice `->` publish / capture ideas

- “But the number of papers generated is humongous … how do I keep up?”

(Matt Might’s illustrated guide to a PhD)

- Conferences
- Journals
- By Topic
- By people / groups / organisations (companies, e.g. Google, Facebook, for specialised interests)

**Survey papers**

… not exhaustive or prescriptive …

- Name a concept
- Explain the concept
- Explain how it works
- Explain how it is an improvement
- Where to use it. (Upside)
- Where not to use it. (Downside)

“Statistical modeling: The two cultures” by Leo Breiman, 2001.

- Talks about experience and experience is king
- Erudition and storytelling are appealing
- I do this for fun, not to write an exam.
- Not dry and mathy.

“Stochastic Data Model” vs “Algorithmic Model”

`response variables = f(predictor variables, random noise, params)`

“… Assume that the data are generated by the following model: ….”

Find a function \(f(x)\), an algorithm that operates on \(x\) to predict the responses \(y\).

The statistician, by imagination and by looking at the data, can invent a reasonably good model to explain the complex mechanism of nature.

- The conclusions drawn from the model fit are about the model’s mechanism, and not about nature’s mechanism
- If the model is a poor emulation of nature, the conclusions may be wrong
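
A minimal sketch of the two points above, using only the Python standard library. The quadratic "nature", the noise level and the helper `fit_line` are assumptions made for illustration; none of them come from Breiman's paper. A straight-line data model fitted to this data reports a slope close to zero, which is a statement about the fitted model and not about how the data was actually generated.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

rng = random.Random(0)                       # fixed seed for repeatability
xs = [i / 10 for i in range(-10, 11)]        # 21 points on [-1, 1]
# Assumed "nature": y depends strongly on x, but not linearly.
ys = [x ** 2 + rng.gauss(0, 0.05) for x in xs]

intercept, slope = fit_line(xs, ys)
print(f"fitted slope = {slope:.3f}")
# A near-zero slope would suggest "x has no effect on y".
# That conclusion holds for the linear model only, not for nature.
```

Here the data model is a poor emulation of the (assumed) generating mechanism, so reading the slope as a fact about nature would be wrong.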

Data modeling produces a “simple” and “understandable” picture of the relationship between input and output.

But the more complicated data models that have since appeared suggest that, as data becomes more complex, data models lose the advantage of presenting a clear picture of nature’s mechanism.

- Fitting equations to data
- Predictive accuracy is the goal
- Applied in: speech and image recognition, handwriting recognition, predicting financial markets, etc.

Approach: “nature produces data in a black box whose insides are complex, mysterious, and, at least, partly unknowable”.

Shifts the focus from data models to properties of algorithms:

- strength as predictors
- convergence (for iterative algorithms)
- whatever gives them good predictive accuracy

- Rashomon: the multiplicity of good models
- Occam: the conflict between simplicity and accuracy
- Bellman: dimensionality – curse or blessing

“there are often multiple equations \(f(x)\) giving about the same minimum error rate”

i.e.,

“They all state the same facts, but their stories of what happened are very different”
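
The Rashomon effect can be sketched on synthetic data, under stated assumptions: two nearly identical predictors `x1` and `x2`, a response built from both, and a hypothetical helper `fit_line`. None of this is from the paper. Two one-variable models reach almost the same error, yet one attributes the response to `x1` and the other to `x2`.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def mse(xs, ys, a, b):
    """Mean squared error of the line a + b*x on (xs, ys)."""
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(1)
n = 2000
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [v + rng.gauss(0, 0.1) for v in x1]      # x2 is almost a copy of x1
y = [u + v + rng.gauss(0, 0.5) for u, v in zip(x1, x2)]

a1, b1 = fit_line(x1, y)                      # story A: "x1 drives y"
a2, b2 = fit_line(x2, y)                      # story B: "x2 drives y"
mse_a = mse(x1, y, a1, b1)
mse_b = mse(x2, y, a2, b2)
print(f"model on x1: slope={b1:.2f}, mse={mse_a:.3f}")
print(f"model on x2: slope={b2:.2f}, mse={mse_b:.3f}")
```

Both fits are about equally good, so error alone cannot say which story is the right one.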

- usually interpreted as “Simpler is better”.

- linear regression vs neural networks

“Accuracy generally requires more complex prediction methods. Simple and interpretable functions do not make the most accurate predictors”.

So, the soundest path is to go for predictive accuracy first, and then try to understand why.

- The conventional advice was to reduce the number of features.
- Algorithmic modeling has tried to go in the opposite direction

“Instead of reducing dimensionality, increase it by adding many functions of predictor variables”

- Shape recognition forest
- Support vector machines
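
A toy sketch of "increase the dimensionality", assuming a one-dimensional problem invented for illustration (it is not a shape recognition forest or a support vector machine). The label depends on \(|x| > 1\), so no single cut on \(x\) separates the classes, but after adding the function \(x^2\) of the predictor a single cut is enough. The helper `best_threshold_accuracy` is hypothetical.

```python
def best_threshold_accuracy(values, labels):
    """Best accuracy over rules 'predict 1 if v > t' and their reverses,
    trying every observed value as the threshold t."""
    n = len(values)
    best = 0.0
    for t in values:
        hits = sum((v > t) == bool(y) for v, y in zip(values, labels))
        # (n - hits) / n is the accuracy of the reversed rule.
        best = max(best, hits / n, (n - hits) / n)
    return best

xs = [i / 10 for i in range(-20, 21)]          # 41 points on [-2, 2]
labels = [1 if abs(x) > 1 else 0 for x in xs]  # class 1 lies on both sides

acc_x = best_threshold_accuracy(xs, labels)    # original feature only
squares = [x ** 2 for x in xs]                 # added function of x
acc_sq = best_threshold_accuracy(squares, labels)

print(f"best single cut on x:   {acc_x:.3f}")
print(f"best single cut on x^2: {acc_sq:.3f}")
```

The extra feature adds nothing that was not already in \(x\); it only re-expresses it in a space where a simple rule works.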

The goal is to get accurate information, and not interpretability.

- Focus on finding good solutions (that’s what you are paid for)
- Live with the data before you plunge into modeling
- Search for a model that gives a good solution (Algorithmic or Data modeling)
- Predictive accuracy on the test set is the criterion for how good the model is
- Computers are an indispensable partner
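
The test-set criterion from the list above can be sketched as follows. Everything here is an assumption for illustration: the synthetic `sin(3x)` data, the 150/50 split, and the two candidate models (a straight line standing in for a data model, 1-nearest-neighbour standing in for an algorithmic one). Both are scored on held-out data they never saw during fitting.

```python
import math
import random

def fit_line(xs, ys):
    """Least-squares line; returns a predictor function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x

def fit_1nn(xs, ys):
    """1-nearest-neighbour; returns a predictor function."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mse(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(42)
xs = [rng.uniform(-2, 2) for _ in range(200)]
ys = [math.sin(3 * x) + rng.gauss(0, 0.1) for x in xs]
# Samples are i.i.d., so a simple slice gives a fair train/test split.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

results = {}
for name, fit in [("line", fit_line), ("1-NN", fit_1nn)]:
    model = fit(train_x, train_y)
    results[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
    print(f"{name}: train mse={results[name][0]:.3f}, test mse={results[name][1]:.3f}")
```

The 1-NN model has zero training error by construction, which says nothing about its quality; only the test-set column is comparable across the two models.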

Presentation is available here:

The source for the slides is available on GitHub: