Machine Learning
See also: Statistics.
Learning Machine Learning
Machine learning is an interdisciplinary topic that requires mastery of many subjects from Computer Science, Engineering, Mathematics, and Statistics.
Preparation:
Before heading off into specific topics, one needs to learn how to think mathematically and rigorously. Learn proof techniques first.
 How to Prove It by Velleman is a good mathematical introduction to proof techniques.
Mathematics:

A Logical Approach to Discrete Math by Gries and Schneider discusses principles and heuristics for developing proofs and works toward giving you skill in formal manipulation. Thereafter, the authors use this logic to give rigorous introductions to set theory, mathematical induction, a theory of sequences, a theory of integers, functions and relations, combinatorics, solving recurrence relations, and modern algebra. The book also has chapters on group theory and on applying proof techniques to computer programs.

Calculus by Gilbert Strang (light)
 Calculus by Spivak (heavy)
 Principles of Mathematical Analysis by Walter Rudin (heavy)
 Introduction to Analysis by Maxwell Rosenlicht
Linear Algebra:
 Linear Algebra by Axler
 Introduction to Linear Algebra by Gilbert Strang
Probability:
 Introduction to Probability by Dimitri P. Bertsekas and John N. Tsitsiklis
 A Course in Probability Theory by Kai Lai Chung
 A First Look at Rigorous Probability Theory by Jeffrey S. Rosenthal
Statistics:
 Statistics: The Exploration & Analysis of Data by Peck and Devore ($$)
Information Theory:
 Information Theory by David MacKay is available online.
Machine Learning:
 Pattern Recognition and Machine Learning by Christopher Bishop
 Machine Learning by Tom Mitchell (lighter)
 The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series in Statistics) by Hastie, Tibshirani, and Friedman (denser; free download).
Neural Networks:
 Neural Network Design by Hagan, Demuth, and Beale
 Neural Networks: A Comprehensive Foundation by Haykin
Search and optimisation:
 Introduction to Stochastic Search and Optimization by James C. Spall
Reinforcement Learning:
 Reinforcement Learning: An Introduction by Sutton and Barto (complete book available online)
 Recent Advances in Reinforcement Learning by Barto and Mahadevan (PDF) – a great introduction to hierarchical reinforcement learning.
 Neuro-Dynamic Programming by Bertsekas and Tsitsiklis
Probabilistic Graphical Models:
 Probabilistic Graphical Models: Principles and Techniques by Daphne Koller and Nir Friedman – website. There is a MOOC taught by Prof. Daphne Koller that may help in understanding the material.
Artificial Intelligence:
 Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig
Discrete Mathematics:
 Discrete Mathematics and its applications by Kenneth Rosen
Algorithms:
 Introduction to Algorithms by Cormen et al. (aka CLRS) – the standard textbook in many algorithms classes. It can take a while to work through; however, a lot of material (videos, class notes, etc.) is based on this textbook, which makes it very valuable.
Links:
Resources
Introductory material
 Visual Introduction to Machine Learning [circa 2015]
 Explained Visually
 Machine Learning for developers – code examples in Scala.
Reading Lists
 Top 10 algorithms in data mining

MLRG, i.e., “Machine Learning Reading Group”, is the keyword to look for if you want to learn what others in the field are reading. A list of such MLRGs is here.
Books
General machine learning books
 Mining of Massive Datasets (downloadable PDF) by Rajaraman, Leskovec, and Ullman is a collection of lecture notes on the topic. The material touches on all the currently popular topics in mining datasets: MapReduce, similarity search, data-stream processing, search-engine technology (i.e., PageRank), frequent-itemset mining (Apriori), and more, with applications in retail and in clustering high-dimensional data. This book is a nice “map” to get a lay of the land. The entire book is freely available on the web.
 Information Theory, Inference, and Learning Algorithms by David J.C. MacKay [freely available online]
 The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman [Freely available online]
 Bayesian Reasoning and Machine Learning by David Barber [Freely available online]
 Model-Based Machine Learning (Early Access): an online book by John Winn and Christopher Bishop, with Thomas Diethe. Early access (free) online.
Recommendations 24 via Harvard CS281.
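Several of the books above (notably Mining of Massive Datasets) build up to link-analysis algorithms such as PageRank. As a taste of what they cover, here is a minimal power-iteration sketch in plain Python; the three-page link graph and the damping factor are made up for illustration, not taken from any of the books:

```python
# PageRank by power iteration on a tiny made-up link graph.
# Each round redistributes a damped share of every page's rank
# along its outgoing links, plus a uniform "teleport" term.
damping = 0.85
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}  # start uniform

for _ in range(50):  # power iteration; 50 rounds is plenty here
    new = {p: (1 - damping) / len(pages) for p in pages}
    for p, outs in links.items():
        share = damping * rank[p] / len(outs)
        for q in outs:
            new[q] += share
    rank = new

print({p: round(r, 3) for p, r in rank.items()})
```

Because every page here has at least one outgoing link, the total rank stays at 1 each round, and the scores settle quickly on a graph this small.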
Specialized machine learning books
Recommended on Harvard’s advanced machine learning class
 Gaussian Processes for Machine Learning by Carl Edward Rasmussen and Christopher K.I. Williams [freely available online].
 Non-Uniform Random Variate Generation by Luc Devroye [freely available online].
 Daphne Koller and Nir Friedman, Probabilistic Graphical Models: Principles and Techniques, MIT Press.
 Jorge Nocedal and Stephen J. Wright, Numerical Optimization, Springer.
 Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. Bayesian Data Analysis, CRC.
 Thomas M. Cover and Joy A. Thomas, Elements of Information Theory, Wiley.
 Christian P. Robert and George Casella, Monte Carlo Statistical Methods, Springer.
Free online books:

Understanding Machine Learning: From Theory to Algorithms by Shai Shalev-Shwartz and Shai Ben-David, Cambridge University Press, 2014. Video lectures on YouTube by one of the authors.

Algebra, Topology, Differential Calculus, and Optimization Theory for Computer Science and Machine Learning by Jean Gallier and Jocelyn Quaintance, Department of Computer and Information Science, University of Pennsylvania
Blogs
 Shape of Data: this blog discusses the role that geometry plays in the analysis of large, high-dimensional data sets.
Courseware
 Statistical Machine Learning, Spring 2015

Undergraduate Advanced Data Analysis, 36-402 by Shalizi, Spring 2011: book – Extending the Linear Model with R by Julian Faraway.

Learning from Data: based on the textbook of the same name; has lecture videos.

Popular Algorithms in Data Mining and Machine Learning: a seminar-plus-coding-homework course based on the “top 10 algorithms in data mining” set of papers.

Machine Learning and Probabilistic Graphical Models by Sargur Srihari at Buffalo: has lecture videos; books – Bishop’s PRML and Koller’s PGM; uses MATLAB.

Scalable Machine Learning taught by Alex Smola: has lecture videos and datasets.

Advanced Machine Learning at Harvard: has lecture videos; books – Murphy, Bishop, MacKay, Hastie, Barber.

Data Mining 36-350 taught by Shalizi, Fall 2009: books – Principles of Data Mining by Hand, Mannila, and Smyth; Statistical Learning from a Regression Perspective by Berk; uses R for programming.

Statistical Computing 36-350 taught by Shalizi, Fall 2013: a newer version of “Data Mining 36-350”.

Data-Intensive Information Processing Applications (Spring 2010) at UMD: Hadoop, HMMs, MapReduce and databases, Bigtable, Hive & Pig; free book – Data-Intensive Text Processing with MapReduce.

Statistical Machine Learning from Data (2005–06): has lab exercises; books – Hastie, Vapnik.

Machine Learning by Andrew Ng: one of the most popular courses on machine learning.

Deep Learning and Unsupervised Feature Learning by Andrew Ng

Machine Learning at Columbia by Prof. Jebara: books – Introduction to Graphical Models by Jordan and Bishop, and PRML by Bishop.

Advanced Machine Learning at Columbia by Prof. Jebara: paper reading.

Practical Machine Learning at Berkeley by Michael Jordan: books – The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Hastie et al. and Data Mining: Practical Machine Learning Tools and Techniques by Frank and Witten; uses R, RapidMiner, libsvm, Weka.

Large Scale Machine Learning at Columbia by Sanjiv Kumar: emphasis on algebra and kernel methods; feels more “mathy” in a good way.
 Machine Learning (Module) at Glasgow by Prof. Girolami: detailed course notes; books – Hastie, Duda, PRML; the Matrix Cookbook is a good resource on matrices.
 Machine Learning for Natural Language Processing (Spring 2012)
 Jaakkola & Collins, 6.867: http://courses.csail.mit.edu/6.867/lectures.html
 Hamilton, CS831: http://www2.cs.uregina.ca/~hamilton/courses/831/
 McGill, CS599: http://www.cs.mcgill.ca/~prakash/Courses/Comp599/comp599.html
Cheatsheets
Set of illustrated Machine Learning cheatsheets covering the content of Stanford’s CS 229 class:
 Deep Learning: http://stanford.io/2BsQ91Q
 Supervised Learning: http://stanford.io/2nRlxxp
 Unsupervised Learning: http://stanford.io/2MmP6FN

 Tips and tricks: http://stanford.io/2MEHwFM
 soulmachine/machine-learning-cheat-sheet: classical equations and diagrams in machine learning
 Machine learning classifier gallery – shows the results of numerous experiments on ML algorithms applied to two-dimensional patterns.
Libraries
 KeystoneML
 H2O by 0xdata. Blurb: “The open source math and in-memory Prediction engine for Big Data Science.”
 HLearn – a machine learning library in Haskell. TFP 2013 paper.
 memoiry/LightML.jl: minimal and clean examples of machine learning algorithms implemented in Julia (2017).
Blog posts
 Neglected machine learning ideas – Locklin on Science
 The Unreasonable Effectiveness of Random Forests
Researchers
 Ameet Talwalkar, co-author of “Foundations of Machine Learning” (I own a copy).
Papers to read
Thursday, 11 September 2014
AMA: Michael I. Jordan on r/MachineLearning
Jordan recommends these textbooks for learning (PhD-student level):
I think that you should read all of them three times—the first time you barely understand, the second time you start to get it, and the third time it all seems obvious.

A. Tsybakov’s book “Introduction to Nonparametric Estimation”, as a very readable source for the tools for obtaining lower bounds on estimators.

Y. Nesterov’s very readable “Introductory Lectures on Convex Optimization” as a way to start to understand lower bounds in optimization.

A. van der Vaart’s “Asymptotic Statistics”, a book that we often teach from at Berkeley, as a book that shows how many ideas in inference (M-estimation—which includes maximum likelihood and empirical risk minimization—the bootstrap, semiparametrics, etc.) repose on top of empirical process theory.

B. Efron’s “Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction”, as a thought-provoking book.
On deep learning:
OK, I guess that I have to say something about “deep learning”. This seems like as good a place as any (apologies, though, for not responding directly to your question).
“Deep” breath.
My first and main reaction is that I’m totally happy that any area of machine learning (aka, statistical inference and decision-making; see my other post :) is beginning to make impact on real-world problems. I’m in particular happy that the work of my longtime friend Yann LeCun is being recognized, promoted and built upon. Convolutional neural networks are just a plain good idea.
I’m also overall happy with the rebranding associated with the usage of the term “deep learning” instead of “neural networks”. In other engineering areas, the idea of using pipelines, flow diagrams and layered architectures to build complex systems is quite well entrenched, and our field should be working (inter alia) on principles for building such systems. The word “deep” just means that to me—layering (and I hope that the language eventually evolves toward such drier words…). I hope and expect to see more people developing architectures that use other kinds of modules and pipelines, not restricting themselves to layers of “neurons”.
With all due respect to neuroscience, one of the major scientific areas for the next several hundred years, I don’t think that we’re at the point where we understand very much at all about how thought arises in networks of neurons, and I still don’t see neuroscience as a major generator for ideas on how to build inference and decisionmaking systems in detail. Notions like “parallel is good” and “layering is good” could well (and have) been developed entirely independently of thinking about brains.
I might add that I was a PhD student in the early days of neural networks, before backpropagation had been (re)invented, where the focus was on the Hebb rule and other “neurally plausible” algorithms. Anything that the brain couldn’t do was to be avoided; we needed to be pure in order to find our way to new styles of thinking. And then Dave Rumelhart started exploring backpropagation—clearly leaving behind the neurally-plausible constraint—and suddenly the systems became much more powerful. This made an impact on me. Let’s not impose artificial constraints based on cartoon models of topics in science that we don’t yet understand.
My understanding is that many if not most of the “deep learning success stories” involve supervised learning (i.e., backpropagation) and massive amounts of data. Layered architectures involving lots of linearity, some smooth nonlinearities, and stochastic gradient descent seem to be able to memorize huge numbers of patterns while interpolating smoothly (not oscillating) “between” the patterns; moreover, there seems to be an ability to discard irrelevant details, particularly if aided by weight sharing in domains like vision where it’s appropriate. There’s also some of the advantages of ensembling. Overall an appealing mix. But this mix doesn’t feel singularly “neural” (particularly the need for large amounts of labeled data).
Indeed, it’s unsupervised learning that has always been viewed as the Holy Grail; it’s presumably what the brain excels at and what’s really going to be needed to build real “brain-inspired computers”. But here I have some trouble distinguishing the real progress from the hype. It’s my understanding that in vision at least, the unsupervised learning ideas are not responsible for some of the recent results; it’s the supervised training based on large data sets.
One way to approach unsupervised learning is to write down various formal characterizations of what good “features” or “representations” should look like and tie them to various assumptions that seem to be of real-world relevance. This has long been done in the neural network literature (but also far beyond). I’ve seen yet more work in this vein in the deep learning work and I think that that’s great. But I personally think that the way to go is to put those formal characterizations into optimization functionals or Bayesian priors, and then develop procedures that explicitly try to optimize (or integrate) with respect to them. This will be hard and it’s an ongoing problem to approximate. In some of the deep learning work that I’ve seen recently, there’s a different tack—one uses one’s favorite neural network architecture, analyses some data and says “Look, it embodies those desired characterizations without having them built in”. That’s the old-style neural network reasoning, where it was assumed that just because it was “neural” it embodied some kind of special sauce. That logic didn’t work for me then, nor does it work for me now.
Lastly, and on a less philosophical level, while I do think of neural networks as one important tool in the toolbox, I find myself surprisingly rarely going to that tool when I’m consulting out in industry. I find that industry people are often looking to solve a range of other problems, often not involving “pattern recognition” problems of the kind I associate with neural networks. E.g., (1) How can I build and serve models within a certain time budget so that I get answers with a desired level of accuracy, no matter how much data I have? (2) How can I get meaningful error bars or other measures of performance on all of the queries to my database? (3) How do I merge statistical thinking with database thinking (e.g., joins) so that I can clean data effectively and merge heterogeneous data sources? (4) How do I visualize data, and in general how do I reduce my data and present my inferences so that humans can understand what’s going on? (5) How can I do diagnostics so that I don’t roll out a system that’s flawed or figure out that an existing system is now broken? (6) How do I deal with nonstationarity? (7) How do I do some targeted experiments, merged with my huge existing datasets, so that I can assert that some variables have a causal effect?
Although I could possibly investigate such issues in the context of deep learning ideas, I generally find it a whole lot more transparent to investigate them in the context of simpler building blocks.
Based on seeing the kinds of questions I’ve discussed above arising again and again over the years I’ve concluded that statistics/ML needs a deeper engagement with people in CS systems and databases, not just with AI people, which has been the main kind of engagement going on in previous decades (and still remains the focus of “deep learning”). I’ve personally been doing exactly that at Berkeley, in the context of the “RAD Lab” from 2006 to 2011 and in the current context of the “AMP Lab”.
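The recipe Jordan describes (layered linear maps, smooth nonlinearities, and stochastic gradient descent) can be sketched in a few lines of dependency-free Python. This is a toy illustration, not from the AMA: a one-hidden-layer network trained by backpropagation on XOR, with the hidden-layer size, learning rate, and step count chosen arbitrarily:

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy dataset: XOR, the classic pattern a single linear layer cannot fit.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]

H = 4  # hidden units (arbitrary)
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
b1 = [0.0] * H
W2 = [random.uniform(-1, 1) for _ in range(H)]
b2 = 0.0

def forward(x):
    # Layered linear maps with a smooth nonlinearity between them.
    h = [sigmoid(W1[j][0] * x[0] + W1[j][1] * x[1] + b1[j]) for j in range(H)]
    y = sigmoid(sum(W2[j] * h[j] for j in range(H)) + b2)
    return h, y

def mean_squared_loss():
    return sum((forward(x)[1] - t) ** 2 for x, t in data) / len(data)

lr = 0.5
loss_before = mean_squared_loss()
for _ in range(20000):
    x, t = random.choice(data)        # "stochastic": one example at a time
    h, y = forward(x)
    dy = (y - t) * y * (1.0 - y)      # backprop through the output sigmoid
    for j in range(H):
        dh = dy * W2[j] * h[j] * (1.0 - h[j])  # use W2 before updating it
        W2[j] -= lr * dy * h[j]
        W1[j][0] -= lr * dh * x[0]
        W1[j][1] -= lr * dh * x[1]
        b1[j] -= lr * dh
    b2 -= lr * dy
loss_after = mean_squared_loss()
print(round(loss_before, 3), round(loss_after, 3))
```

The squared loss decreases from its starting value over training; the point is only to show how little machinery the basic supervised recipe requires.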