Old Machine Learning notes
“Sooner or later, you will have to derive it”
This is a work in progress and not ready for public consumption (if it ever will be!).
Introduction
Prerequisites
 Algebra
 Optimisation
 Probability and Randomness
Text books and references
 Pattern Recognition and Machine Learning by Christopher Bishop
 The Elements of Statistical Learning by T. Hastie, R. Tibshirani and J. Friedman
Journals and conferences
Lecture notes and videos
Probability
Random variables
Probability Distributions
Terminology^{1}
As probability theory is used in quite diverse applications, terminology is not uniform and sometimes confusing. The following terms are used for non-cumulative probability distribution functions:
 Probability mass, probability mass function, p.m.f.: for discrete random variables.
 Categorical distribution: for discrete random variables with a finite set of values.
 Probability density, probability density function, p.d.f.: most often reserved for continuous random variables.
The following terms are somewhat ambiguous, as they can refer to non-cumulative or cumulative distributions depending on the author's preference:
 Probability distribution function: continuous or discrete, non-cumulative or cumulative.
 Probability function: even more ambiguous; can mean any of the above, or anything else.
Finally,
 Probability distribution: either the same as a probability distribution function, or understood as something more fundamental underlying an actual mass or density function.
Basic terms
 Mode: most frequently occurring value in a distribution
 Tail: region of least frequently occurring values in a distribution
Conditional probability
Marginal Probability
Joint Probability
Discrete probability Distribution
Examples: the Poisson, Bernoulli, binomial, geometric, and negative binomial distributions.
A discrete probability distribution is often represented as a generalized probability density function involving Dirac delta functions which substantially unifies the treatment of continuous and discrete distributions. This is especially useful when dealing with probability distributions involving both a continuous and a discrete part.
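As a minimal sketch (plain Python, no libraries beyond the standard library), the mass functions of a few of the discrete distributions listed above can be written directly from their definitions:

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p): k successes in n Bernoulli(p) trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def geometric_pmf(k, p):
    """P(X = k) for X ~ Geometric(p), k = 1, 2, ... (trials until first success)."""
    return (1 - p)**(k - 1) * p

# A pmf sums to 1 over its support (here checked over the full/truncated range):
print(sum(binomial_pmf(k, 10, 0.3) for k in range(11)))  # ≈ 1.0
```

The Bernoulli pmf is the n = 1 special case of the binomial one, so it is omitted.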
Normal Distribution or Gaussian Distribution
is a continuous probability distribution that has a bell-shaped probability density function, known as the Gaussian function or, informally, the bell curve:
$f(x; \mu,\sigma^{2}) = \frac{1}{\sqrt{2\pi}\sigma}e^{-(x-\mu)^2/(2\sigma^2)}$
where μ is the mean or expectation (the location of the peak) and σ^{2} is the variance.
Ref: http://en.wikipedia.org/wiki/Normal_distribution
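A quick sketch of the density above, transcribed term by term (plain Python, parameterised by the variance σ^{2} rather than the standard deviation):

```python
import math

def normal_pdf(x, mu=0.0, sigma2=1.0):
    """Density of N(mu, sigma2) at x, following f(x; mu, sigma^2) above."""
    return math.exp(-(x - mu)**2 / (2 * sigma2)) / (math.sqrt(2 * math.pi * sigma2))

# The peak sits at the mean x = mu, with height 1/(sqrt(2*pi)*sigma):
print(normal_pdf(0.0))  # ≈ 0.3989
```

The density is symmetric about μ, so normal_pdf(mu + d) == normal_pdf(mu - d) for any d.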
Probability Distribution Function
of a random variable describes the relative frequencies of different values for that random variable.
Joint Distribution Function
Covariance Matrix
Precision matrix (Σ^{ − 1} ?)
Multivariate Distribution
Multivariate Normal Distribution
Some definitions
Exponentiating a quadratic function f(x) = ax^{2} + bx + c gives e^{f(x)} = e^{ax^{2} + bx + c}.
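To see why exponentiated quadratics keep appearing in these notes: completing the square (for a < 0) turns one into an unnormalised Gaussian density,

$$e^{ax^{2} + bx + c} = e^{a\left(x + \frac{b}{2a}\right)^{2} + c - \frac{b^{2}}{4a}} \propto e^{-\frac{(x - \mu)^{2}}{2\sigma^{2}}}, \qquad \mu = -\frac{b}{2a}, \quad \sigma^{2} = -\frac{1}{2a}.$$

So matching the coefficients of x^{2} and x in an exponent is enough to read off the mean and variance of a Gaussian.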
Parameterisation is the process of deciding and defining the parameters necessary for a complete or relevant specification of a model or geometric object.
Natural parameters of the normal distribution: written in exponential-family form, the normal distribution has natural parameters η_{1} = μ/σ^{2} and η_{2} =  − 1/(2σ^{2}).
Mathematics notes
Posterior probability
The Wikipedia page has a good explanation.
The posterior probability of a random event or an uncertain proposition is the conditional probability that is assigned after the relevant evidence is taken into account. Similarly, the posterior probability distribution is the distribution of an unknown quantity, treated as a random variable, conditional on the evidence obtained from an experiment or survey.
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
The posterior probability distribution of one random variable given the value of another can be calculated with Bayes’ theorem by multiplying the prior probability distribution by the likelihood function, and then dividing by the normalizing constant, as follows:
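The recipe above (prior × likelihood, then divide by the normaliser) can be sketched for a finite set of hypotheses; the medical-test numbers below are illustrative assumptions, not from the source:

```python
def posterior(prior, likelihood):
    """Bayes' theorem over a finite set of hypotheses.

    prior: dict hypothesis -> P(A)
    likelihood: dict hypothesis -> P(B | A) for the observed evidence B
    Returns: dict hypothesis -> P(A | B)
    """
    evidence = sum(prior[h] * likelihood[h] for h in prior)  # P(B), the normalizer
    return {h: prior[h] * likelihood[h] / evidence for h in prior}

# Assumed illustrative numbers: 1% base rate of illness, a test with
# 95% sensitivity and 90% specificity, and a positive test result observed.
post = posterior({"ill": 0.01, "healthy": 0.99},
                 {"ill": 0.95, "healthy": 0.10})
print(post["ill"])  # ≈ 0.0876
```

Despite the positive test, the posterior probability of illness stays below 9% because the prior P(ill) = 0.01 is so small.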
Conjugate Prior
If the posterior distributions p(θ|x) are in the same family as the prior probability distribution p(θ), the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior for the likelihood.
A Compendium of Conjugate Priors (pdf) has a good explanation.
Conjugate prior relationships.
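A minimal sketch of one standard conjugate pair, the Beta prior with a Bernoulli likelihood: the posterior is again a Beta, so the update is just counting, with no integration needed.

```python
def beta_bernoulli_update(alpha, beta, data):
    """Conjugate update for a Beta(alpha, beta) prior on a Bernoulli parameter.

    data is a list of 0/1 outcomes. The posterior is
    Beta(alpha + #successes, beta + #failures).
    """
    successes = sum(data)
    failures = len(data) - successes
    return alpha + successes, beta + failures

# Uniform prior Beta(1, 1); observe three successes and one failure:
a, b = beta_bernoulli_update(1, 1, [1, 1, 0, 1])
print(a, b)         # 4 2
print(a / (a + b))  # posterior mean ≈ 0.667
```

The same counting pattern holds for other conjugate pairs (e.g. Gamma/Poisson, Dirichlet/categorical): the prior's parameters absorb sufficient statistics of the data.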
Closed form
In mathematics, an expression is said to be a closed-form expression if it can be expressed analytically in terms of a bounded number of certain “well-known” functions. Typically, these well-known functions are defined to be elementary functions: constants, one variable x, the elementary operations of arithmetic (+ − × ÷), nth roots, exponentials and logarithms (which thus also include trigonometric functions and inverse trigonometric functions).
An equation is said to be a closed-form solution if it solves a given problem in terms of functions and mathematical operations from a given, generally accepted set. For example, an infinite sum would generally not be considered closed-form. However, the choice of what to call closed-form and what not is rather arbitrary, since a new “closed-form” function could simply be defined in terms of the infinite sum.^{2}
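A concrete instance of the infinite-sum example: the geometric series 1 + r + r^{2} + … has the closed form 1/(1 − r) for |r| < 1, which a term-by-term partial sum approaches but never literally is.

```python
def geometric_partial_sum(r, n):
    """Partial sum 1 + r + r^2 + ... + r^(n-1), computed term by term."""
    return sum(r**k for k in range(n))

r = 0.5
closed_form = 1 / (1 - r)  # closed-form limit of the infinite sum, |r| < 1
print(geometric_partial_sum(r, 50), closed_form)  # both ≈ 2.0
```

The partial sum requires unboundedly many elementary operations as n grows, whereas the closed form needs a fixed, bounded number, which is exactly the distinction drawn above.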