Outline
- Regression
- Least squares [implemented]
- Ridge regression [implemented]
- Gaussian process [implemented]
- Classification
- Gaussian Bayes [implemented]
- Naive Bayes (Poisson) [implemented]
- Decision tree [implemented]
- Support vector machine [implemented]
- kNN
- Bootstrap aggregation: random forest
- Adaptive boosting
- Logistic regression
- Clustering
- k-means [implemented]
- Gaussian mixture model [implemented]
- Recommendation system (PMF)
- Topic modeling (NMF)
- Principal component analysis
- Probabilistic PCA
- Kernel PCA
Code
I came across this wonderful course on Machine Learning by ColumbiaX [https://courses.edx.org/courses/course-v1:ColumbiaX+CSMM.102x+2T2017/info]. It is the first course I have found rigorously satisfying, and it inspired me to implement most of the algorithms it covers from scratch. The implementations are not that difficult because the lecturer explains the math of the algorithms so well that you hardly need external references to clear up any doubts.
The unifying mantra for inventing all these algorithms follows two steps:
- Make some modelling assumptions. Sometimes it is a generative model with a set of parameters to determine, as in ridge regression and the Bayes classifiers. Other times the objective is simply an optimization problem with no probabilistic interpretation, as in SVM and decision trees. The common feature of all these formulations is that we have a set of parameters to be determined from the observed data.
- Use the observed data to find the parameters that best explain the observations. This step always reduces to maximizing or minimizing some objective function. Sometimes we can only obtain a local maximum or minimum, as in K-means.
Regression¶
Least squares¶
Find the parameters that minimize the sum of squares of errors:
$$w_{LS} = \underset{w}{\operatorname{argmin}} \sum_{i=1}^{n}(y_i - w^T x_i)^2$$
Least squares also has a probabilistic interpretation. It is equivalent to finding the parameters that maximize the likelihood of the data:
$$w_{LS} = w_{ML} = \underset{w}{\operatorname{argmax}} \ln p(y | \mu=Xw, \sigma^2)$$
The closed-form solution of least squares is:
$$w_{LS} = (X^T X)^{-1}X^Ty$$
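As a quick illustration, here is a minimal NumPy sketch of this closed-form solution (the function name is my own, and it assumes $X$ already contains a constant column if an intercept is wanted):

```python
import numpy as np

def least_squares(X, y):
    """Closed-form least squares: w = (X^T X)^{-1} X^T y."""
    # Solve the normal equations; np.linalg.solve is preferred over an explicit inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)
```

Ridge regression¶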
Find the parameters that minimize the regularized sum of squares of errors:
$$w_{RR} = \underset{w}{\operatorname{argmin}} \sum_{i=1}^{n}(y_i - w^T x_i)^2 + \frac{\lambda}{2}w^Tw$$
The probabilistic interpretation of ridge regression is:
$$w_{RR} = w_{MAP} = \underset{w}{\operatorname{argmax}} \ln \big[ p(y | \mu=Xw, \sigma^2)\, p(w) \big]$$
The closed-form solution of ridge regression is:
$$w_{RR} = (\lambda\sigma^2 I + X^T X)^{-1}X^Ty$$
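A similarly minimal sketch of the ridge solution, with $\lambda$ and $\sigma^2$ passed in explicitly (again, the function name is mine):

```python
import numpy as np

def ridge_regression(X, y, lam, sigma2=1.0):
    """Closed-form ridge/MAP solution: w = (lambda*sigma^2*I + X^T X)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(lam * sigma2 * np.eye(d) + X.T @ X, X.T @ y)
```

Gaussian process¶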
Suppose that $f \sim GP(0,K)$ and $y|f \sim N(f,\sigma^2 I)$. Given the measured data $D_n = \{(x_1,y_1), (x_2,y_2), ..., (x_n,y_n)\}$, we have the following predictive distribution $y_0 \sim N(\mu_0, \sigma_0^2)$ at a new data point $x_0$:
$$\mu_0 = K(x_0, D_n) K_n^{-1}y$$
$$\sigma_0^2 = \sigma^2 + K(x_0,x_0) - K(x_0,D_n) K_n^{-1}K(x_0,D_n)^T$$
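Below is a rough sketch of these predictive equations with an RBF kernel, taking $K_n$ to be the kernel matrix of the training inputs plus the noise term $\sigma^2 I$ implied by $y|f \sim N(f,\sigma^2 I)$ (the function names, bandwidth $b$, and defaults are my choices):

```python
import numpy as np

def rbf_kernel(A, B, b=1.0):
    """k(x, x') = exp(-||x - x'||^2 / b)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / b)

def gp_predict(X, y, X_new, sigma2=0.1, b=1.0):
    """Predictive mean and variance at each row of X_new."""
    Kn = rbf_kernel(X, X, b) + sigma2 * np.eye(len(X))  # K_n, including the noise term
    Ks = rbf_kernel(X_new, X, b)                        # K(x_0, D_n)
    solved = np.linalg.solve(Kn, Ks.T)                  # K_n^{-1} K(x_0, D_n)^T
    mu = Ks @ np.linalg.solve(Kn, y)
    # k(x_0, x_0) = 1 for this kernel, hence the constant 1.0 below.
    var = sigma2 + 1.0 - np.sum(Ks.T * solved, axis=0)
    return mu, var
```

Classification¶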
Gaussian Bayes¶
Given $P(X,Y)$ as the distribution that generates (x,y), the Bayesian classifier $f$ is defined as:
$$f(x) = \underset{y}{\operatorname{argmax}} P(Y=y|X=x)$$
$$P(Y=y|X=x) = \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)}$$
If $P$ is unknown, some probabilistic assumptions are necessary to estimate $P$ from the data:
$$P(Y) \sim Discrete(\pi)$$
$$P(X|Y=y) \sim Normal(\mu_y, \Sigma_y)$$
Other options, like a Poisson likelihood (in naive Bayes), are perfectly fine in certain circumstances. Using the maximum likelihood principle, $Discrete(\pi)$ and $Normal(\mu_y, \Sigma_y)$ can be estimated from the data $\{(x_i,y_i)\}_{i=1}^n$. Specifically:
$$\hat{\pi}_y = \frac{1}{n}\sum_{i=1}^{n}1(y_i=y)$$
$$\hat{\mu}_y = \frac{1}{n_y}\sum_{i=1}^{n}1(y_i=y)x_i$$
$$\hat{\Sigma}_y = \frac{1}{n_y}\sum_{i=1}^{n}1(y_i=y)(x_i-\hat{\mu}_y)(x_i-\hat{\mu}_y)^\top$$
Now $\hat{P}$ is an estimate of $P$ and is used to make predictions:
$$\hat{f}(x) = \underset{y}{\operatorname{argmax}} \hat{P}(Y=y|X=x)$$
$$\hat{f}(x) = \underset{y}{\operatorname{argmax}} \hat{\pi}_y |\hat{\Sigma}_y|^{-\frac{1}{2}}\exp{\big\{-\frac{1}{2}(x-\hat{\mu}_y)^\top{\hat{\Sigma}_y}^{-1}(x-\hat{\mu}_y)\big\}}$$
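A possible NumPy sketch of this plug-in classifier, working in log space for numerical stability (function names are mine; it assumes every class covariance is invertible):

```python
import numpy as np

def fit_gaussian_bayes(X, y):
    """MLE of the class priors, means and covariances."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        Sigma = (Xc - mu).T @ (Xc - mu) / len(Xc)
        params[c] = (len(Xc) / len(X), mu, Sigma)
    return params

def predict_gaussian_bayes(params, x):
    """argmax_y  pi_y |Sigma_y|^{-1/2} exp(-0.5 (x-mu_y)^T Sigma_y^{-1} (x-mu_y))."""
    def log_score(p):
        pi, mu, Sigma = p
        diff = x - mu
        return (np.log(pi) - 0.5 * np.linalg.slogdet(Sigma)[1]
                - 0.5 * diff @ np.linalg.solve(Sigma, diff))
    return max(params, key=lambda c: log_score(params[c]))
```

Naive Bayes¶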
The derivation of the naive Bayes classifier is identical to that of the Gaussian Bayes classifier, except that a different assumption is imposed on $P(X=x|Y=y)$. Specifically, the coordinates $x(j)$ are assumed to be conditionally independent given the class:
$$P(X=x|Y=y) = \prod_{j=1}^d P_j\big(X(j)=x(j)|Y=y\big)$$
$P_j\big(x(j)|Y=y\big)$ may take any well-defined distribution such as Poisson or Gaussian. In the case that $P_j\big(x(j)|Y=y\big)$ is a Poisson distribution, we have:
$$P_j\big(X(j)=x(j)|Y=y\big) = e^{-\lambda_{y,j}} \frac{{\lambda_{y,j}}^{x(j)}}{x(j)!}$$
The Poisson distribution is suitable for modelling the number of times a word appears in a document. Maximum likelihood estimation gives:
$$\hat{\pi}_y = \frac{1}{n}\sum_{i=1}^{n}1(y_i=y)$$
$$ \hat{\lambda}_{y,j} = \frac{\sum_{i=1}^{n}1(y_i=y)\,x_i(j)}{n_y} = \frac{\text{total count of word $j$ in observations from class $y$}}{\text{number of observations from class $y$}} $$
Predictions are made by:
$$\hat{f}(x) = \underset{y}{\operatorname{argmax}} \hat{\pi}_y \prod_{j=1}^d e^{-\hat{\lambda}_{y,j}} \frac{{\hat{\lambda}_{y,j}}^{x(j)}}{x(j)!}$$
$$\hat{f}(x) = \underset{y}{\operatorname{argmax}} \ln\hat{\pi}_y - \sum_{j=1}^d\hat{\lambda}_{y,j} + \sum_{j=1}^d x(j) \ln\hat{\lambda}_{y,j}$$
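A corresponding sketch for the Poisson naive Bayes classifier, with a small constant added to the rates to avoid $\ln(0)$ (the function names and the smoothing constant are my choices):

```python
import numpy as np

def fit_poisson_nb(X, y, eps=1e-9):
    """X: (n, d) matrix of word counts; returns per-class priors and Poisson rates."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X), Xc.mean(axis=0) + eps)  # eps avoids log(0)
    return params

def predict_poisson_nb(params, x):
    """argmax_y  ln(pi_y) - sum_j lambda_{y,j} + sum_j x(j) ln(lambda_{y,j})."""
    def log_score(p):
        pi, lam = p
        return np.log(pi) - lam.sum() + (x * np.log(lam)).sum()
    return max(params, key=lambda c: log_score(params[c]))
```

Decision tree¶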
The basic idea is to partition the feature space into regions and minimize the weighted sum of the impurity measures of the regions:
$$\sum_i p_{R_i}u(R_i)$$
Specifically, we will partition a region if that partitioning helps to reduce the impurity measure of that region. To this end, we need to know how to make two important decisions:
- Which variable should be used for the split?
The simplest idea is to try all candidate variables and split points and choose the one that reduces the impurity measure the most, as sketched below.
- When to stop splitting?
The simplest idea is to stop splitting when the impurity measure of a region is low enough. Driving the impurity to zero is not a good idea, as it would lead to overfitting.
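As a small illustration of the exhaustive split search, here is a sketch that uses the Gini index as the impurity measure $u(R)$ (one choice among several; function names are mine):

```python
import numpy as np

def gini(y):
    """Impurity u(R) = 1 - sum_k p_k^2 for the labels falling in a region."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def best_split(X, y):
    """Exhaustively search (feature, threshold) minimizing the weighted child impurity."""
    n, d = X.shape
    best = (None, None, gini(y))  # keep a split only if it beats the parent impurity
    for j in range(d):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue
            score = left.mean() * gini(y[left]) + (~left).mean() * gini(y[~left])
            if score < best[2]:
                best = (j, t, score)
    return best
```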
Support vector machine¶
The primal problem is a standard quadratic program:
$$\underset{w,w_0,\zeta_1,..., \zeta_n}{\operatorname{argmin}} \frac{1}{2}\|w\|^2 + \lambda\sum_{i=1}^{n}\zeta_i$$
$$\text{subject to } y_i(x_i^T w + w_0) \ge 1 - \zeta_i \text{ and } \zeta_i \ge 0 \text{ for $i = 1,..., n$}$$
The dual problem is also a standard quadratic program:
$$\underset{\alpha_1,..., \alpha_n}{\operatorname{argmax}} \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j x_i^Tx_j$$
$$\text{subject to } \sum_{i=1}^n \alpha_i y_i =0 \text{ and } 0 \le \alpha_i \le \lambda \text{ for $i = 1,..., n$}$$
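One way to prototype the dual problem is to hand it to a general-purpose solver rather than a dedicated QP routine. The sketch below uses scipy.optimize.minimize with SLSQP and recovers $w$ from the optimal $\alpha$; recovering the offset $w_0$ from the support vectors is omitted, and this is only an illustrative setup, not how the course necessarily implements it:

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual(X, y, lam=1.0):
    """Solve the dual QP with a general-purpose solver; y must be in {-1, +1}."""
    n = len(y)
    Q = (y[:, None] * X) @ (y[:, None] * X).T            # Q_ij = y_i y_j x_i^T x_j
    objective = lambda a: -(a.sum() - 0.5 * a @ Q @ a)   # negate: maximize -> minimize
    constraint = {"type": "eq", "fun": lambda a: a @ y}  # sum_i alpha_i y_i = 0
    res = minimize(objective, np.zeros(n), method="SLSQP",
                   bounds=[(0, lam)] * n, constraints=[constraint])
    alpha = res.x
    w = ((alpha * y)[:, None] * X).sum(axis=0)           # w = sum_i alpha_i y_i x_i
    return alpha, w
```

Clustering¶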
K-means¶
The objective function of K-means:
$$L = \sum_{i=1}^{n} \sum_{k=1}^{K} 1\{c_i=k\}\|x_i-\mu_k\|^2$$
This is a non-convex function, which can be iteratively reduced using coordinate descent (both steps are sketched after the list below):
- Update $c_i$ to reduce $L$
- Update $\mu_k$ to reduce $L$
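A minimal sketch of the two coordinate-descent steps (initializing the means at K random data points is my choice; K-means++ or multiple restarts would be more robust):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Coordinate descent on L: assign clusters, then recompute means."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)].astype(float)
    for _ in range(n_iters):
        # Step 1: update c_i -- assign each point to its nearest mean.
        c = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1), axis=1)
        # Step 2: update mu_k -- mean of the points currently assigned to cluster k.
        for k in range(K):
            if (c == k).any():
                mu[k] = X[c == k].mean(axis=0)
    return c, mu
```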
Gaussian mixture model¶
The log-likelihood function:
$$L = \sum_{i=1}^{n} \ln \sum_{k=1}^{K} p(x_i, c_i=k|\pi, \mu, \Sigma)$$
$$L = \sum_{i=1}^{n} \ln \sum_{k=1}^{K} \pi_k N(x_i|\mu_k, \Sigma_k)$$
This is also a non-convex function, which can be iteratively increased using the EM algorithm:
E-step: update the posterior distribution of the cluster assignments:
$$\phi_i(k) = q(c_i=k) = \frac{\pi_k N(x_i|\mu_k, \Sigma_k)}{\sum_j\pi_j N(x_i|\mu_j, \Sigma_j)}$$
M-step: update the model parameters:
$$n_k = \sum_{i=1}^{n}\phi_i(k), \quad \hat{\pi}_k = \frac{n_k}{n}$$
$$\hat{\mu}_k = \frac{1}{n_k}\sum_{i=1}^{n}\phi_i(k)\,x_i, \quad \hat{\Sigma}_k = \frac{1}{n_k}\sum_{i=1}^{n}\phi_i(k)(x_i-\hat{\mu}_k)(x_i-\hat{\mu}_k)^\top$$
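Putting the E-step and M-step together, a compact sketch of the EM loop might look as follows (the initialization choices and the use of scipy.stats.multivariate_normal are mine):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iters=100, seed=0):
    """EM for a Gaussian mixture: alternate the E-step and M-step above."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, K, replace=False)].astype(float)
    Sigma = np.array([np.eye(d)] * K)
    for _ in range(n_iters):
        # E-step: phi_i(k) proportional to pi_k N(x_i | mu_k, Sigma_k).
        phi = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                               for k in range(K)])
        phi /= phi.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi, mu, Sigma from the soft assignments.
        nk = phi.sum(axis=0)
        pi = nk / n
        mu = (phi.T @ X) / nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (phi[:, k, None] * diff).T @ diff / nk[k]
    return pi, mu, Sigma
```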
Recommendation System (PMF)¶
The objective is to maximize the posterior probability of the component matrices, given the observed values:
$$U_{MAP}, V_{MAP} = \underset{U,V}{\operatorname{argmax}} P(U,V|M_0)$$
$$U_{MAP}, V_{MAP} = \underset{U,V}{\operatorname{argmax}} P(M_0|U,V)P(U)P(V)$$
$$U_{MAP}, V_{MAP} = \underset{U,V}{\operatorname{argmax}} \prod_{i,j}P(M_{ij}|u_i,v_j)\prod_i P(u_i)\prod_j P(v_j)$$
with the generative model $P(u_i) = N(0,\lambda^{-1}I)$, $P(v_j) = N(0,\lambda^{-1}I)$ and $P(M_{ij}|u_i,v_j) = N(u_i^Tv_j,\sigma^2)$, where the product over $(i,j)$ runs over the observed entries of $M_0$.
Setting the derivatives to zero, we arrive at the following update equations:
$$u_i = \Big(\lambda \sigma^2 I + \sum_{j}v_jv_j^T\Big)^{-1} \sum_j M_{ij}v_j$$
$$v_j = \Big(\lambda \sigma^2 I + \sum_{i}u_iu_i^T\Big)^{-1} \sum_i M_{ij}u_i$$
where the sums run over the observed entries of $M_0$.
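A sketch of these alternating updates, using a binary mask to mark the observed entries of $M_0$ (the default values of $\lambda$, $\sigma^2$ and the rank are arbitrary choices of mine):

```python
import numpy as np

def pmf(M, mask, rank, lam=2.0, sigma2=0.25, n_iters=50, seed=0):
    """Alternate the MAP updates for the rows of U and V; mask marks observed entries."""
    n1, n2 = M.shape
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=1.0 / lam, size=(n1, rank))
    V = rng.normal(scale=1.0 / lam, size=(n2, rank))
    for _ in range(n_iters):
        for i in range(n1):
            obs = mask[i] > 0                    # columns observed for row i
            Vo = V[obs]
            A = lam * sigma2 * np.eye(rank) + Vo.T @ Vo
            U[i] = np.linalg.solve(A, Vo.T @ M[i, obs])
        for j in range(n2):
            obs = mask[:, j] > 0                 # rows observed for column j
            Uo = U[obs]
            A = lam * sigma2 * np.eye(rank) + Uo.T @ Uo
            V[j] = np.linalg.solve(A, Uo.T @ M[obs, j])
    return U, V
```

Topic Modelling (NMF)¶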
Least squares objective:
$$\text{minimize } \sum_{ij}(X_{ij} - (WH)_{ij})^2 \text{ subject to } W_{ik} \ge 0 \text{ and } H_{kj} \ge 0$$
Using $G(h,h^t) = F(h^t) + (h-h^t)^T \nabla F(h^t) + \frac{1}{2}(h - h^t)^T K(h^t) (h-h^t)$ as an auxiliary function, Lee and Seung (https://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf) prove that the following multiplicative updates monotonically decrease the sum of squares:
$$H_{kj} \leftarrow H_{kj}\frac{(W^T X)_{kj}}{(W^T WH)_{kj}}$$
$$W_{ik} \leftarrow W_{ik}\frac{(XH^T)_{ik}}{(WHH^T)_{ik}}$$
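Finally, a sketch of the multiplicative updates, with a small $\epsilon$ added to the denominators to keep them strictly positive (a common practical tweak, not part of the derivation above):

```python
import numpy as np

def nmf(X, rank, n_iters=200, eps=1e-9, seed=0):
    """Multiplicative updates for the least squares NMF objective."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.uniform(size=(n, rank))
    H = rng.uniform(size=(rank, d))
    for _ in range(n_iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H
```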