Papers we love – Datascience Indy Kickoff
Feb 10, 2016
What is Papers We Love ?
The data science case for “Papers we love”
Data science is multi-disciplinary.
Data science is at the intersection of …
- Machine learning
- Computer science
- Data Engineering
Get the Upside
- New ideas from academia and industry practitioners
- Connect ideas to problems
Avoid the downside
- Validate your own actions
- boundaries (does it even make sense to apply this…)
- Wrong / right
- Do not reinvent square wheels
What we can get out of such a thing?
-> publish / capture ideas
Should we always read the latest papers?
- “But, The amount of papers generated is humongous … how do I keep up?”
The purpose of research is …
(Matt Might’s illustrated guide to a PhD)
- By Topic
- By people / group / organisations (companies.. eg: goog, fb, for specialised interests)
- Survey papers
Outcomes of a papers we love session
… not exhaustive or prescriptive …
- Name a concept
- Explain the concept
- Explain how it works
- Explain how it is an improvement
- Where to use it. (Upside)
- Where not to use it. (Downside)
The Paper for today’s session
“Statistical modeling: The two cultures” by Leo Breiman, 2001.
Why I liked this paper
- Talks about experience and experience is king
- Erudition and story telling is appealing
- I do this for fun. not to write an exam.
- Not dry and mathy.
“Stochastic Data Model” vs “Algorithmic Model”
Data modeling (Statistics)
response variables = f(predictor variables, random noise, params)
“… Assume that the data are generated by the following model: ….”
Algorithm modeling (Machine Learning)
Find a function \(f(x)\) – an algorithm that operates on \(x\) to predict the responses to \(y\).
The issue with Data modeling
The statistician, by imagination and by looking at the data, can invent a reasonably good model to explain the complex mechanism of nature.
- The conclusion about the model fit are about the model’s mechanism, and not about nature’s mechanism
- If the model is a poor emulation of nature, the conclusions may be wrong
Data modeling produces a “simple” and “understandable” picture of the relationship between input and output.
But more complicated data models that have appeared suggests that as data becomes complex, the data models lose the advantage of presenting a clear picture of nature’s mechanism
- Fitting equations to data
- Predictive accuracy is the goal
- Applied in: speech and image recognition, handwriting recognition, predicting financial markets etc.,
Approach: – “nature produces data in a black box whose insides are complex, mysterious, and at least, partly unknowable”.
Shifts the focus from data models to properties of algorithms:
- strength as predictors
- convergence (for iterative algorithms)
- whatever gives them good predictive accuracy
Lessons in Algorithmic modeling
- Roshoman: the multiplicity of good models
- Occam: the conflict between simplicity and accuracy
- Bellman: dimensionality – curse or blessing
“there are often a multiple equations of \(f(x)\) giving the same minimum error rate”
“They all state the same facts, but their stories of what happened are very different”
- usually interpreted as “Simpler is better”.
- linear regression vs neural networks
“Accuracy generally requires more complex prediction methods. Simple and interpretable functions do not make the most accurate predictors”.
So, the soundest path is to go for predictive accuracy first, and then try to understand why.
Bellman and the curse of dimensionality
- The conventional advice was to reduce the number of features.
- Algorithmic modeling has tried to go in the opposite direction
“Instead of reducing dimensionality, increase it by adding many functions of predictor variables”
- Shape recognition forest
- Support vector machines
The goal is to get accurate information, and not interpretability.
- Focus on finding good solutions (that’s what you are paid for)
- Live with the data before you plunge into modeling
- Search for a model that gives a good solution (Algorithmic or Data modeling)
- Predictive accuracy on the test set is the criterion for how good the model is
- Computers are an indespensible partner
Presentation is available here:
The source to the slides are available on github: