# Machine Learning

Updated: Jul 31, 2019 by Pradeep Gowda.

## Learning Machine Learning

Machine learning is an interdisciplinary topic that requires mastery of many topics from Computer Science, Engineering, Mathematics and Statistics.

Preparation:

Before heading off into specific topics, one needs to learn how to think mathematically and rigorously. Learn proof techniques first.

• How to Prove It by Velleman is an introduction to mathematical proof.

Mathematics:

• A Logical Approach to Discrete Math by Gries and Schneider discusses principles and heuristics for developing proofs and works toward giving you skill in formal manipulation. The authors then use that logic to give rigorous introductions to set theory, mathematical induction, a theory of sequences, a theory of integers, functions and relations, combinatorics, solving recurrence relations, and modern algebra. The book also has chapters on group theory and on applications of proof techniques to computer programs.

• Calculus by Gilbert Strang (light)

• Calculus by Spivak (heavy)
• Principles of Mathematical Analysis by Walter Rudin (heavy)
• Introduction to Analysis by Maxwell Rosenlicht

Linear Algebra:

• Linear Algebra by Axler
• Introduction to Linear Algebra by Gilbert Strang

Probability:

• Introduction to Probability by Dimitri P. Bertsekas
• A Course in Probability Theory by Kai Lai Chung
• First Look at Rigorous Probability Theory by Jeffrey S. Rosenthal

Statistics:

• Statistics: The Exploration & Analysis of Data by Peck and Devore \$\$!

Information Theory:

• Information Theory by David MacKay is available online.

Machine Learning:

• Pattern Recognition and Machine Learning by Christopher Bishop
• Machine Learning by Tom Mitchell (lighter)
• The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series in Statistics) by Trevor Hastie, Robert Tibshirani, and Jerome Friedman (denser) (Free Download).

Neural Networks:

• Neural Network Design by Hagan, Demuth and Beale
• Neural Networks: A Comprehensive Foundation by Haykin

Search and optimisation:

• Introduction to Stochastic Search and Optimization by James C. Spall

Reinforcement Learning:

• Reinforcement Learning: An Introduction by Sutton and Barto (complete book online).
• Recent Advances in Hierarchical Reinforcement Learning by Barto and Mahadevan (PDF), a great introduction to hierarchical reinforcement learning.
• Neuro-Dynamic Programming by Bertsekas and Tsitsiklis

Probabilistic Graphical Models:

• Probabilistic Graphical Models: Principles and Techniques by Daphne Koller and Nir Friedman – website. There is a MOOC taught by Prof. Daphne Koller which might help in understanding the material better.

Artificial Intelligence:

Discrete Mathematics:

• Discrete Mathematics and Its Applications by Kenneth Rosen

Algorithms:

• Introduction to Algorithms by Cormen et al. (aka CLRS) is the standard textbook in many algorithms classes. It can take a while to work through. However, a lot of material (videos, class notes, etc.) is based on this textbook, which makes it very valuable.

## Resources

### Introductory material

• Top 10 algorithms in data mining
• MLRG, i.e., “Machine Learning Reading Group”, is the keyword to search for if you want to learn what others in the field are reading. A list of such MLRGs is here.

### Books

#### General machine learning books

1. Mining of Massive Datasets (downloadable PDF) by Rajaraman, Leskovec and Ullman is a collection of lecture notes on the topic. The material touches on the currently popular topics in mining datasets, such as MapReduce, similarity search, data-stream processing, search engine technology (e.g., PageRank), frequent itemset mining, and Apriori, along with applications in retail and clustering of high-dimensional data. This book is a nice “map” to get a lay of the land. The entire book is freely available on the web.
2. Information Theory, Inference, and Learning Algorithms by David J.C. MacKay [Freely available online]
3. The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman [Freely available online]
4. Bayesian Reasoning and Machine Learning by David Barber [Freely available online]
5. Model-Based Machine Learning (Early Access): an online book by John Winn and Christopher Bishop with Thomas Diethe, free online in early access.

Recommendations 2-4 via Harvard CS281.

#### Specialized machine learning books

Recommended by Harvard’s advanced machine learning class:

• Gaussian Processes for Machine Learning by Carl Edward Rasmussen and Christopher K.I. Williams [freely available online].
• Non-Uniform Random Variate Generation by Luc Devroye [freely available online].
• Daphne Koller and Nir Friedman, Probabilistic Graphical Models: Principles and Techniques, MIT Press.
• Jorge Nocedal and Stephen J. Wright, Numerical Optimization, Springer.
• Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. Bayesian Data Analysis, CRC.
• Thomas M. Cover and Joy A. Thomas, Elements of Information Theory, Wiley.
• Christian P. Robert and George Casella, Monte Carlo Statistical Methods, Springer.

Free online books:

### Blogs

• Shape of Data: this blog talks about the role that geometry plays in the analysis of large, high-dimensional data sets.

### Cheatsheets

Set of illustrated Machine Learning cheatsheets covering the content of Stanford’s CS 229 class:

## Researchers

• Ameet Talwalkar, co-author of “Foundations of Machine Learning” (I own a copy).

## Thursday, 11 September 2014

MJ recommends these textbooks for learning (PhD student level):

I think that you should read all of them three times—the first time you barely understand, the second time you start to get it, and the third time it all seems obvious.

• A. Tsybakov’s book “Introduction to Nonparametric Estimation” as a very readable source for the tools for obtaining lower bounds on estimators

• Y. Nesterov’s very readable “Introductory Lectures on Convex Optimization” as a way to start to understand lower bounds in optimization.

• A. van der Vaart’s “Asymptotic Statistics”, a book that we often teach from at Berkeley, as a book that shows how many ideas in inference (M estimation—which includes maximum likelihood and empirical risk minimization—the bootstrap, semiparametrics, etc) repose on top of empirical process theory.

• B. Efron’s “Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction”, as a thought-provoking book.

On deep learning:

OK, I guess that I have to say something about “deep learning”. This seems like as good a place as any (apologies, though, for not responding directly to your question).

“Deep” breath.

My first and main reaction is that I’m totally happy that any area of machine learning (aka, statistical inference and decision-making; see my other post :-) is beginning to make impact on real-world problems. I’m in particular happy that the work of my long-time friend Yann LeCun is being recognized, promoted and built upon. Convolutional neural networks are just a plain good idea.

I’m also overall happy with the rebranding associated with the usage of the term “deep learning” instead of “neural networks”. In other engineering areas, the idea of using pipelines, flow diagrams and layered architectures to build complex systems is quite well entrenched, and our field should be working (inter alia) on principles for building such systems. The word “deep” just means that to me—layering (and I hope that the language eventually evolves toward such drier words…). I hope and expect to see more people developing architectures that use other kinds of modules and pipelines, not restricting themselves to layers of “neurons”.

With all due respect to neuroscience, one of the major scientific areas for the next several hundred years, I don’t think that we’re at the point where we understand very much at all about how thought arises in networks of neurons, and I still don’t see neuroscience as a major generator for ideas on how to build inference and decision-making systems in detail. Notions like “parallel is good” and “layering is good” could well (and have) been developed entirely independently of thinking about brains.

I might add that I was a PhD student in the early days of neural networks, before backpropagation had been (re)-invented, where the focus was on the Hebb rule and other “neurally plausible” algorithms. Anything that the brain couldn’t do was to be avoided; we needed to be pure in order to find our way to new styles of thinking. And then Dave Rumelhart started exploring backpropagation—clearly leaving behind the neurally-plausible constraint—and suddenly the systems became much more powerful. This made an impact on me. Let’s not impose artificial constraints based on cartoon models of topics in science that we don’t yet understand.

My understanding is that many if not most of the “deep learning success stories” involve supervised learning (i.e., backpropagation) and massive amounts of data. Layered architectures involving lots of linearity, some smooth nonlinearities, and stochastic gradient descent seem to be able to memorize huge numbers of patterns while interpolating smoothly (not oscillating) “between” the patterns; moreover, there seems to be an ability to discard irrelevant details, particularly if aided by weight-sharing in domains like vision where it’s appropriate. There’s also some of the advantages of ensembling. Overall an appealing mix. But this mix doesn’t feel singularly “neural” (particularly the need for large amounts of labeled data).
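(Editor's illustration, not part of the quoted remarks.) The recipe named above, linear layers plus a smooth nonlinearity trained by stochastic gradient descent with backpropagation, can be sketched in a few lines of plain Python. Everything specific here is an assumption chosen for illustration: the function name `train_layered_network`, the target `sin(pi * x)`, the layer width, the learning rate, and the step count. It is a toy, not a reproduction of any system discussed in the text.

```python
import math
import random


def train_layered_network(steps=5000, hidden=8, lr=0.05, seed=0):
    """Fit sin(pi * x) on [-1, 1] with a tiny two-layer network and plain SGD.

    Returns (mean squared error before training, mean squared error after).
    All sizes and rates are illustrative assumptions.
    """
    rng = random.Random(seed)

    # Labeled training data: 41 evenly spaced inputs and their targets.
    xs = [i / 20.0 - 1.0 for i in range(41)]
    ys = [math.sin(math.pi * x) for x in xs]

    # Layer 1: linear map followed by tanh. Layer 2: linear read-out.
    w1 = [rng.uniform(-1.0, 1.0) for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.1, 0.1) for _ in range(hidden)]
    b2 = 0.0

    def predict(x):
        h = [math.tanh(w1[j] * x + b1[j]) for j in range(hidden)]
        return sum(w2[j] * h[j] for j in range(hidden)) + b2, h

    def mse():
        return sum((predict(x)[0] - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

    initial_loss = mse()
    for _ in range(steps):
        # Stochastic step: one randomly chosen labeled example.
        i = rng.randrange(len(xs))
        x, y = xs[i], ys[i]
        out, h = predict(x)
        err = out - y  # derivative of 0.5 * err**2 with respect to the output
        for j in range(hidden):
            # Backpropagate through the read-out weight and the tanh unit.
            grad_pre = err * w2[j] * (1.0 - h[j] ** 2)
            w2[j] -= lr * err * h[j]
            w1[j] -= lr * grad_pre * x
            b1[j] -= lr * grad_pre
        b2 -= lr * err
    return initial_loss, mse()


if __name__ == "__main__":
    before, after = train_layered_network()
    print(f"MSE before training: {before:.4f}, after: {after:.4f}")
```

Note that nothing in the sketch is specifically "neural": it is a composition of differentiable functions fitted to labeled data, which matches the point made in the paragraph above.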

Indeed, it’s unsupervised learning that has always been viewed as the Holy Grail; it’s presumably what the brain excels at and what’s really going to be needed to build real “brain-inspired computers”. But here I have some trouble distinguishing the real progress from the hype. It’s my understanding that in vision at least, the unsupervised learning ideas are not responsible for some of the recent results; it’s the supervised training based on large data sets.

One way to approach unsupervised learning is to write down various formal characterizations of what good “features” or “representations” should look like and tie them to various assumptions that seem to be of real-world relevance. This has long been done in the neural network literature (but also far beyond). I’ve seen yet more work in this vein in the deep learning work and I think that that’s great. But I personally think that the way to go is to put those formal characterizations into optimization functionals or Bayesian priors, and then develop procedures that explicitly try to optimize (or integrate) with respect to them. This will be hard and it’s an ongoing problem to approximate. In some of the deep learning work that I’ve seen recently, there’s a different tack: one uses one’s favorite neural network architecture, analyses some data and says “Look, it embodies those desired characterizations without having them built in”. That’s the old-style neural network reasoning, where it was assumed that just because it was “neural” it embodied some kind of special sauce. That logic didn’t work for me then, nor does it work for me now.
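(Editor's illustration, not part of the quoted remarks.) One common way to read "optimization functionals or Bayesian priors" is sketched below. The notation is assumed for illustration only: $\theta$ for the representation's parameters, $\ell$ for a per-example loss, $R$ for a penalty encoding the desired characterization (for example sparsity or smoothness), and $\lambda$ for its weight.

```latex
% Optimization view: the desired characterization enters as a penalty R.
\hat{\theta}_{\mathrm{pen}}
  = \arg\min_{\theta} \; \sum_{i=1}^{n} \ell(x_i; \theta) + \lambda R(\theta)

% Bayesian view: the same characterization enters as a prior.
p(\theta \mid x_{1:n}) \;\propto\; p(\theta) \prod_{i=1}^{n} p(x_i \mid \theta),
  \qquad p(\theta) \propto \exp\bigl(-\lambda R(\theta)\bigr)

% "Integrate" rather than optimize: average predictions over the posterior.
p(x^{*} \mid x_{1:n}) = \int p(x^{*} \mid \theta)\, p(\theta \mid x_{1:n})\, d\theta
```

Under the assumption $\ell(x_i; \theta) = -\log p(x_i \mid \theta)$, the maximizer of the posterior coincides with $\hat{\theta}_{\mathrm{pen}}$, so the two views encode the same characterization and differ in whether one optimizes or integrates.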

Lastly, and on a less philosophical level, while I do think of neural networks as one important tool in the toolbox, I find myself surprisingly rarely going to that tool when I’m consulting out in industry. I find that industry people are often looking to solve a range of other problems, often not involving “pattern recognition” problems of the kind I associate with neural networks. E.g., (1) How can I build and serve models within a certain time budget so that I get answers with a desired level of accuracy, no matter how much data I have? (2) How can I get meaningful error bars or other measures of performance on all of the queries to my database? (3) How do I merge statistical thinking with database thinking (e.g., joins) so that I can clean data effectively and merge heterogeneous data sources? (4) How do I visualize data, and in general how do I reduce my data and present my inferences so that humans can understand what’s going on? (5) How can I do diagnostics so that I don’t roll out a system that’s flawed or figure out that an existing system is now broken? (6) How do I deal with non-stationarity? (7) How do I do some targeted experiments, merged with my huge existing datasets, so that I can assert that some variables have a causal effect?

Although I could possibly investigate such issues in the context of deep learning ideas, I generally find it a whole lot more transparent to investigate them in the context of simpler building blocks.

Based on seeing the kinds of questions I’ve discussed above arising again and again over the years I’ve concluded that statistics/ML needs a deeper engagement with people in CS systems and databases, not just with AI people, which has been the main kind of engagement going on in previous decades (and still remains the focus of “deep learning”). I’ve personally been doing exactly that at Berkeley, in the context of the “RAD Lab” from 2006 to 2011 and in the current context of the “AMP Lab”.