# Doing Data Science

This is an O’Reilly book I received as part of the O’Reilly User Group program.

The authors, Cathy O’Neil and Rachel Schutt, have based this book on a Data Science class they taught at Columbia.

The source code used in the book is available here: https://github.com/oreillymedia/doing_data_science

At the outset, the book appears to be a survey of popular topics that are considered part of “Data Science”. The introduction features Drew Conway’s now-famous Venn diagram, which puts Data Science at the intersection of Programming Skills, Math/Stat knowledge and domain expertise.

Though the authors claim the book is targeted at Statisticians, Quants, PhDs and Programmers alike, the flavour of the book is solidly in favour of programmers, with sufficient math formulae to appeal to the adventurous. The provided “supplemental reading” is a familiar, yet comprehensive list of textbooks in the areas of Linear Algebra, Visualization, Machine Learning, R and Python Programming, and Data Analysis. Many of these resources are freely available online, with some of them, like Elements of Statistical Learning and Linear Algebra, having MOOCs to augment the learning.

A data scientist is expected to have skills in the following areas: Computer Science, Mathematics, Statistics, Machine Learning, Domain expertise, Communication and presentation skills, Data Visualization.

As one can imagine, it is going to be hard for any practitioner to claim equal expertise in all these areas. Depending on their educational and professional background, a data scientist can wear one of these stripes: Data businessperson, Data creative, Data developer or Data researcher. Interestingly, a data developer is expected to give hard-science skills equal consideration.

## Statistical inference, EDA and the DS process

Statistical inference is the discipline that concerns itself with the development of procedures, methods, and theorems that allow us to extract meaning and information from data that has been generated using stochastic (random) processes.

We take a *sample*, a subset (of size *n*) of the total *N*
observations, in order to draw conclusions and make inferences about
the population. Even if you select, say, 1/10 of the total observations
at random, the underlying *distribution* can still distort your
conclusions. So, it is important to take into account the process that
got the data into your hands.
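
To make the point about the collection process concrete, here is a minimal sketch. The population, its long-tailed shape and the "only small values get recorded" process are all made up for illustration; none of it comes from the book.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: N = 10,000 observations with a long right tail.
population = [random.expovariate(1 / 50) for _ in range(10_000)]

# A simple random sample of size n = N / 10.
n = len(population) // 10
random_sample = random.sample(population, n)

# A sample produced by a biased collection process (an assumption for this
# sketch): only the smallest values were ever recorded.
biased_sample = sorted(population)[:n]

population_mean = statistics.mean(population)
random_mean = statistics.mean(random_sample)
biased_mean = statistics.mean(biased_sample)

print(f"population mean:    {population_mean:.1f}")
print(f"random sample mean: {random_mean:.1f}")
print(f"biased sample mean: {biased_mean:.1f}")
```

Both samples have the same size, but only the random one gives a mean close to the population mean. The size of the sample says little if you do not know how the data reached you.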

The kinds of data that are routinely used for data analysis have also increased in scope and variety. Current-day data scientists should be able to work with:

- traditional (numerical, categorical, boolean data),
- text,
- records (timestamped event data, json),
- geo-based location data,
- network data,
- sensor logs,
- images, etc.

However, the biggest challenge appears to be handling real-time streaming data. How will methods like sampling change in the face of an ever-growing, ever-changing data stream?

Big data = Volume, variety, velocity and value. Rest is all hype-y.

According to Cukier and Mayer-Schoenberger in their article “The rise of big data”, the big data phenomenon consists of

- collecting and using a lot of data rather than small samples
- accepting the messiness in your data
- giving up on knowing the causes

We can understand the above as,

- you don’t get to plan or define the data to be collected for your experiments; you assume that the data has already been collected before you start your work.
- which means there may be little or no quality control on the data collection process.
- this may reflect an anti-statistical-modeling streak in people from non-statistical backgrounds.

Data is not objective. This is an important warning for those who want to believe that data has all the answers, i.e., “let’s put in everything and let the data speak for itself”.

Ignoring causation in a model can be a flaw rather than a feature. Such models can add to historical problems instead of addressing them. Data is just a quantitative, pale echo of the events of our society.

A model is our attempt to understand and represent the nature of reality through a particular lens; a model is an artificial construction where all the extraneous detail has been removed or abstracted.

Building a model is part art and part science. When building a model, you have to make a lot of assumptions about the underlying structure of reality, and we should have standards for how we make those choices and how we explain them. But we may not have global standards, so we make them up as we go along and make corrections as we learn more.

Exploratory Data Analysis is a good place to start building a model. An important part of this exercise is to write down the general feel you have for the data. Simple models are a good place to start. Simple models may give you 80% of the benefit with little effort (80/20 principle).

Probability distributions are the building blocks of potential models. For continuous random variables they are described by probability density functions (PDFs). Some examples of continuous distributions:

- Normal distribution
- Uniform distribution
- Cauchy distribution
- t distribution
- F distribution
- Chi-square distribution
- Weibull distribution, etc.

A random variable $x$ can be assumed to have a corresponding probability distribution $p(x)$, which maps $x$ to a non-negative real number. The area under the curve of a probability distribution is 1, so areas under it can be interpreted as probabilities.

Example:

If the waiting time $x$ (in minutes) for the next bus has the density $p(x) = 2e^{-2x}$, then the probability of the next bus arriving between 12 and 13 minutes from now is the area under the curve between 12 and 13, that is, $\int_{12}^{13} 2e^{-2x}\,dx$.
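
The area in the bus example can be worked out directly. This is a small sketch, assuming $x$ is measured in minutes; the function names are my own. It computes the area in closed form, $e^{-24} - e^{-26}$, and cross-checks it with a simple numerical integration.

```python
import math

RATE = 2.0  # the constant in p(x) = 2 * exp(-2x)

def density(x):
    """The density p(x) = 2 * exp(-2x) from the example."""
    return RATE * math.exp(-RATE * x)

def prob_between(a, b):
    """Exact area under the density between a and b (antiderivative)."""
    return math.exp(-RATE * a) - math.exp(-RATE * b)

def prob_between_numeric(a, b, steps=10_000):
    """Approximate the same area with the midpoint rule."""
    width = (b - a) / steps
    return sum(density(a + (i + 0.5) * width) for i in range(steps)) * width

p_exact = prob_between(12, 13)
p_numeric = prob_between_numeric(12, 13)
print(p_exact, p_numeric)

# Sanity check: the total area from 0 to infinity should be 1.
print(prob_between(0, math.inf))
```

The probability is tiny, on the order of 1e-11. That makes sense because a rate of 2 per minute means the average wait is only half a minute.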

The distributions are chosen by the process of:

- conduct an experiment
- look at the measurements
- plot them
- approximate the function.
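
The four steps above can be sketched as follows. The measurements here are simulated waiting times, and the bin width, the number of bins and the choice of an exponential shape are assumptions made for this sketch; real data would come from an actual experiment.

```python
import math
import random

random.seed(1)

# Step 1: "conduct the experiment". Simulated waiting times stand in for
# real observations here.
measurements = [random.expovariate(2.0) for _ in range(5_000)]

# Steps 2 and 3: look at the measurements and plot them (a text histogram).
BIN_WIDTH = 0.25
NUM_BINS = 8
counts = [0] * NUM_BINS
for value in measurements:
    index = int(value / BIN_WIDTH)
    if index < NUM_BINS:
        counts[index] += 1

# Convert counts to an empirical density (fraction of data per unit of x).
empirical = [c / (len(measurements) * BIN_WIDTH) for c in counts]

# Step 4: approximate the function. An exponential with rate 1 / mean is
# one candidate shape for data that decays like this.
rate = 1 / (sum(measurements) / len(measurements))

print("  x   data   fit")
for i, height in enumerate(empirical):
    midpoint = (i + 0.5) * BIN_WIDTH
    fitted = rate * math.exp(-rate * midpoint)
    bar = "#" * round(height * 20)
    print(f"{midpoint:4.2f}  {height:5.2f}  {fitted:5.2f}  {bar}")
```

If the "data" and "fit" columns roughly agree, the candidate function is a reasonable approximation. If they do not, you try another shape.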

Multivariate functions called *joint distributions* describe the
distribution of more than one random variable. A joint distribution of
two variables, denoted by $p(x,y)$, takes values on the plane $(x,y)$
and gives non-negative values. The double integral over the whole plane
is 1.

A conditional distribution is denoted as $p(x|y)$, which is to say the density function of $x$ given the value of $y$.
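
Written out, the statements about joint and conditional distributions look like this. These are the standard textbook definitions, with $p(y)$ standing for the marginal density of $y$ (a symbol not used in the notes above):

$$
p(x,y) \ge 0, \qquad \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} p(x,y)\,dx\,dy = 1
$$

$$
p(y) = \int_{-\infty}^{\infty} p(x,y)\,dx, \qquad p(x|y) = \frac{p(x,y)}{p(y)} \quad \text{for } p(y) > 0
$$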

### Fitting a model

Fitting a model is the process of estimating the parameters of the
model using observed data. Fitting the model often involves
optimization methods and algorithms such as *maximum likelihood
estimation* to get the parameters.

In practice, you have a functional form that you think fits the data best and you write code (in, say, Python or R) to figure out the values of the parameters. The programming environment might use many optimization techniques in arriving at the parameters. As you grow more experienced, you start tweaking these optimization methods. You should watch out for overfitting.
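
Here is a minimal sketch of maximum likelihood estimation, assuming an exponential functional form $p(x) = \lambda e^{-\lambda x}$ and simulated data. The golden-section search is just one simple optimizer picked for this sketch; it is not what any particular R or Python library uses under the hood. For this model the answer is also known in closed form (1 / sample mean), which gives a handy check.

```python
import math
import random

random.seed(2)

# Hypothetical observed data: waiting times simulated with a true rate of 2.0.
data = [random.expovariate(2.0) for _ in range(2_000)]

def negative_log_likelihood(rate, observations):
    """Negative log-likelihood of p(x) = rate * exp(-rate * x)."""
    n = len(observations)
    return -(n * math.log(rate) - rate * sum(observations))

def golden_section_minimize(func, low, high, tolerance=1e-6):
    """Minimize a function with a single minimum on [low, high]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    while high - low > tolerance:
        left = high - inv_phi * (high - low)
        right = low + inv_phi * (high - low)
        if func(left) < func(right):
            high = right  # the minimum lies in [low, right]
        else:
            low = left    # the minimum lies in [left, high]
    return (low + high) / 2

# Fit by numerical optimization of the likelihood.
rate_numeric = golden_section_minimize(
    lambda r: negative_log_likelihood(r, data), 0.01, 20.0
)

# For the exponential model the estimate is also available in closed form.
rate_closed_form = len(data) / sum(data)

print(f"numeric MLE:     {rate_numeric:.4f}")
print(f"closed-form MLE: {rate_closed_form:.4f}")
```

Both estimates should agree closely and land near the rate used to simulate the data. With a one-parameter model like this there is little room to overfit; the risk grows as you add parameters.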

## Exploratory Data Analysis

John Tukey developed EDA. In EDA, there is no hypothesis and no model, as opposed to “confirmatory data analysis”. Exploratory = your understanding of the problem is changing as you go.

The basic tools of EDA are plots, graphs and summary statistics. With EDA, you use these tools to understand the data: to gain intuition, understand its shape and figure out the process that generated it.
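
As a small illustration of the summary-statistics side of EDA, here is a sketch using only Python's standard library. The values are made up, and the 1.5 × IQR fence is a common rule of thumb for flagging unusual points, not something prescribed in these notes.

```python
import statistics

# Hypothetical measurements; one value looks suspiciously large.
values = [12, 15, 14, 10, 13, 16, 14, 11, 15, 95]

# Quartiles split the sorted data into four parts.
q1, _, q3 = statistics.quantiles(values, n=4)

summary = {
    "count": len(values),
    "min": min(values),
    "q1": q1,
    "median": statistics.median(values),
    "q3": q3,
    "max": max(values),
    "mean": statistics.mean(values),
    "stdev": statistics.stdev(values),
}

# Flag points far outside the interquartile range (IQR).
iqr = q3 - q1
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

for name, value in summary.items():
    print(f"{name:>6}: {value}")
print("possible outliers:", outliers)
```

A mean well above the median is a hint that something is pulling the distribution to the right. That is exactly the kind of observation worth writing down before any modelling starts.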

EDA is applicable even at “Google” scale. Even though EDA also uses visualization, it is distinct from Data Visualization in that EDA is applied at the beginning of analysis and DV towards the end to communicate the findings.

### Exercise: EDA

[At this point, I realised that I didn’t have enough R chops. So, I started learning R.] [I will be back to continue this…]

## Recommended reading

- A first course in probability by Sheldon Ross
- Data Science: an action plan for expanding the technical areas of the field of statistics by Bill Cleveland
- Exploratory Data Analysis [1994] by John Tukey
- Elements of Graphing data [1994] by William Cleveland
- Bayesian data analysis [1995, 2004] by Andrew Gelman
- The future of data analysis [pdf] by John Tukey

## People

- Cathy O’Neil (author #1).
- Rachel Schutt (author #2).
- John Tukey
- Andrew Gelman
- William Cleveland
- Edward Tufte

## Impressions

It is nice to see syntax highlighted code. However, the maths is typeset using something “Not-LaTeX”, which is disappointing.

The tone of the book is conversational. The citations are primarily recent online articles and blog posts.

## Reviews of the book

- Andrew Gelman’s Review.