Machine Learning Notes

Supervised vs Unsupervised Learning

Supervised learning trains on labeled data, where every training example comes with its correct output (a spam e-mail filter is an example)
Unsupervised learning looks at data that has no correct outputs attached and finds structure in it (clustering similar data is an example)

Classification vs Regression Problem

Classification predicts a discrete result (e.g. true or false)
Regression predicts a continuous value, interpolated or extrapolated from existing data (e.g. predicting the amount of rainfall)

Univariate Linear Regression

$m$ = number of training examples
$x$ 's = "input" variables / features
$y$ 's = "output" variable / "target" variable

$(x, y)$ = one training example
$(x^{(i)}, y^{(i)})$ = $i$ th training example

Example

Input ( $x$ )	Output ( $y$ )
0	10
2	19
5	38
7	45

Number of training examples is 4: $m=4$
$x^{(2)}$ is 2
$(x^{(3)}, y^{(3)})$ is the third training example, so $(5, 38)$

Hypothesis

$h_\theta(x)=\theta_0+\theta_1 x$
The hypothesis maps $x$ 's to $y$ 's

Cost Function

The cost function is used to choose $\theta_0$ and $\theta_1$ by minimizing the difference between the hypothesized output $h_\theta(x^{(i)})$ and the actual output $y^{(i)}$ over all training examples.

$J(\theta_0, \theta_1)=\frac{1}{2m}\sum^m_{i=1}{(h_\theta(x^{(i)})-y^{(i)})^2}$
We need to minimize $J(\theta_0, \theta_1)$ by varying $\theta_0$ and $\theta_1$ , where $h_\theta(x^{(i)})=\theta_0+\theta_1 x^{(i)}$ is the hypothesis evaluated at the $i$ th example
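As a rough sketch, the cost function can be evaluated in Octave on the four training examples from the table above. The names `h` and `J` and the two parameter choices are only picked for illustration.

```octave
% Training data from the example table
x = [0; 2; 5; 7];
y = [10; 19; 38; 45];
m = length(y);

% Hypothesis and cost as anonymous functions of theta0 and theta1
h = @(theta0, theta1) theta0 + theta1 * x;
J = @(theta0, theta1) sum((h(theta0, theta1) - y) .^ 2) / (2 * m);

J(0, 0)     % all-zero hypothesis: (100 + 361 + 1444 + 2025) / 8 = 491.25
J(10, 5)    % hand-picked line 10 + 5x: errors 0, 1, -3, 0 give 10 / 8 = 1.25
```

The second choice has a much lower cost, which is what "a better fit" means here.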

Gradient Descent

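Repeat the following update for every $j$ until the cost function is minimized:
$\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\vec\theta)$
For univariate linear regression this becomes:
$\theta_0:=\theta_0-\alpha\frac{1}{m}\sum^m_{i=1}(h_\theta(x^{(i)})-y^{(i)})x_0^{(i)}$
$\theta_1:=\theta_1-\alpha\frac{1}{m}\sum^m_{i=1}(h_\theta(x^{(i)})-y^{(i)})x_1^{(i)}$

Where $x_0^{(i)} = 1$ , $x_1^{(i)}=x^{(i)}$ and $\alpha$ is the learning rate. The two statements are computed simultaneously: both right-hand sides are evaluated with the old values before either $\theta$ is overwritten. Repeat until gradient descent converges. The linear regression cost function is convex, so the minimum it converges to is the global one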
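A minimal Octave sketch of this loop on the example table. The learning rate and iteration count are assumptions chosen so that this small data set converges; they are not values from the notes.

```octave
x = [0; 2; 5; 7];
y = [10; 19; 38; 45];
m = length(y);

alpha = 0.02;          % assumed learning rate
theta0 = 0;
theta1 = 0;

for iter = 1:20000     % assumed iteration count
    err = (theta0 + theta1 * x) - y;    % h_theta(x^(i)) - y^(i) for all i
    grad0 = sum(err) / m;               % x_0 = 1
    grad1 = sum(err .* x) / m;
    % simultaneous update: both gradients were computed from the old thetas
    theta0 = theta0 - alpha * grad0;
    theta1 = theta1 - alpha * grad1;
end

theta0, theta1
```

For this data the least-squares line is $\theta_0=1134/116\approx9.78$ and $\theta_1=604/116\approx5.21$ , which is where the loop should end up.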

Multivariate Linear Regression

$n$ = number of features
$\vec\theta$ contains all theta parameters including $\theta_0,\theta_1...\theta_n$

Gradient Descent

Repeat the following simultaneously until the minimum is reached:
$\theta_0:=\theta_0-\alpha\frac{1}{m}\sum^m_{i=1}(h_\theta(x^{(i)})-y^{(i)})x_0^{(i)}$
$\theta_1:=\theta_1-\alpha\frac{1}{m}\sum^m_{i=1}(h_\theta(x^{(i)})-y^{(i)})x_1^{(i)}$
$\theta_2:=\theta_2-\alpha\frac{1}{m}\sum^m_{i=1}(h_\theta(x^{(i)})-y^{(i)})x_2^{(i)}$
$...$
$\theta_n:=\theta_n-\alpha\frac{1}{m}\sum^m_{i=1}(h_\theta(x^{(i)})-y^{(i)})x_n^{(i)}$

$\theta_0, \theta_1...$ can be stored as the elements of a vector $\vec\theta$ . Likewise $x_0^{(i)}, x_1^{(i)}...$ are the elements of the vector $x^{(i)}$

Feature Scaling

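Feature scaling makes gradient descent converge faster by bringing all features into a similar range
Example
$x_1$ is between 0 and 100,000
$x_2$ is between 0 and 5
$x_3$ is between 0 and 0.0001

Running gradient descent without scaling is very slow because the ranges differ by many orders of magnitude. Appropriate scaling in this case divides each feature by its maximum so that all of them fall between 0 and 1:
$x_{1(scaled)}=\frac{x_1}{100000}$
$x_{2(scaled)}=\frac{x_2}{5}$
$x_{3(scaled)}=10000x_3$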

Mean Normalization

Used to normalize the values so they are centered around 0
$x_{normalized}=\frac{x-\mu}{x_{max}-x_{min}}$

Where $x$ is the original non-normalized value, $\mu$ is the average of the feature set, $x_{max}$ is the maximum value of the feature set, and $x_{min}$ is the minimum value of the feature set.
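Mean normalization can be sketched in Octave as below. The feature matrix is made up for illustration (its columns follow the three ranges of the scaling example), and the `(X - mu) ./ span` line relies on automatic broadcasting, which assumes Octave or MATLAB R2016b and later.

```octave
% Hypothetical feature matrix: one row per example, one column per feature
X = [100000, 5, 0.0001;
      50000, 1, 0.00005;
          0, 0, 0];

mu = mean(X);                  % per-feature average (row vector)
span = max(X) - min(X);        % per-feature x_max - x_min
X_norm = (X - mu) ./ span;     % broadcast over the rows

mean(X_norm)                   % each column is now centered around 0
max(X_norm) - min(X_norm)      % each column now spans a range of 1
```

The same `mu` and `span` have to be reused on any new input before it is fed to the hypothesis.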

Polynomial Regression

Basically multivariate linear regression, except $x_2, x_3...$ are the square, cube... of $x_1$ . The hypothesis then traces a polynomial curve instead of a straight line
Example
$h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_1^2+\theta_3x_1^3$

Where $x_0=1$ , $x_1=x_1$ , $x_2 = x_1^2$ , and $x_3=x_1^3$
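In Octave the extra features are just extra columns of the design matrix. A small sketch, where the inputs and the parameter vector are hypothetical values picked so the result is easy to check by hand:

```octave
x1 = [1; 2; 3; 4];                             % hypothetical inputs
X = [ones(size(x1)), x1, x1 .^ 2, x1 .^ 3];    % columns x_0, x_1, x_1^2, x_1^3

theta = [1; 0; 2; 0];    % hypothetical parameters: h = 1 + 2 * x1^2
h = X * theta            % gives [3; 9; 19; 33]
```

Since the powers of $x_1$ grow quickly, feature scaling matters even more here.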

Normal Equation

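The normal equation is an alternative to gradient descent that solves for $\theta$ analytically in one step, with no iterations and no learning rate $\alpha$ to choose. It is much faster unless there are a lot of features (e.g. $n = 100000$ ), because inverting the $(n+1)\times(n+1)$ matrix $X^TX$ becomes expensive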

$\theta = (X^TX)^{-1}X^Ty$

Example

$x_1$	$x_2$	$x_3$	$y$
64	25	59	1600
48	11	100	800
5	88	76	450

$n = 3$
$m = 3$

All feature values are stored in an $m\times (n+1)$ design matrix $X$ with an added column of 1 in front that accounts for $x_0$ , which is always 1. The outputs $y$ are not part of $X$
$X = \begin{bmatrix}1 & 64 & 25 & 59 \\1 & 48 & 11 & 100 \\1 & 5 & 88 & 76 \\\end{bmatrix}$
All output example values are stored in vector $y$
$y = \begin{bmatrix}1600 \\ 800 \\ 450\end{bmatrix}$
We then plug these into the formula $\theta = (X^TX)^{-1}X^Ty$ which will return theta in the form of a vector
Note: Rows of design matrix $X$ are the training examples $x^{(i)}$ transposed. In this case, $x^{(1)}=\begin{bmatrix}1\\64\\25\\59\end{bmatrix}$

To compute the inverse in MATLAB/Octave, use the pinv() function (pseudo-inverse) instead of inv(), as $X^TX$ may not be invertible
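A sketch of the example in Octave. Here `X` holds only the leading 1 and the three features, so it is $3\times4$ . With $m=3$ and $n=3$ the matrix $X^TX$ is singular, which is exactly the case where pinv() is needed.

```octave
X = [1 64 25  59;
     1 48 11 100;
     1  5 88  76];
y = [1600; 800; 450];

theta = pinv(X' * X) * X' * y;    % normal equation with the pseudo-inverse
residual = X * theta - y;         % how far the fitted hypothesis is from y

size(theta)       % 4 x 1: theta_0 ... theta_3
norm(residual)
```

With as many parameters as this and only three examples, the fit should reproduce $y$ up to rounding, so the residual is expected to be close to 0. That is a sign of too little data, not of a good model.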

Linear Regression Vectorization

Hypothesis

$h_\theta(x^{(i)})=\theta^Tx^{(i)}$

Where the hypothesis for one example equals the transposed vector $\theta$ multiplied by the vector $x^{(i)}$
Alternatively, for all examples at once:

$h_\theta(X)=X\theta$

Where the hypothesis is the matrix product of $X$ , the design matrix containing all features for all training examples, and the $\theta$ vector. The result is a vector with one prediction per training example

Cost Function

The vectorized form for multivariate cases can be written as:

$J(\theta)=\frac{1}{2m}(X\theta-\vec{y})^T(X\theta-\vec{y})$

Where $X$ is the matrix containing the feature sets, $\theta$ is a vector containing $\theta_0, \theta_1...$ and $\vec{y}$ is a vector containing all the $y$ values

Alternatively, this is the same thing as:

$J(\theta)=\frac{1}{2m}\sum(h_\theta(X)-\vec y)^2=\frac{1}{2m}\sum(X\theta-\vec y)^2$

Where the squaring is element-wise and the sum runs over the $m$ elements of the resulting vector

Gradient Descent

$\theta:=\theta-\alpha\delta$
$\delta=\frac{1}{m}\sum^m_{i=1}(h_\theta(x^{(i)})-y^{(i)})x^{(i)}$

Where $\theta$ is the vector updated during gradient descent and $\delta$ is the gradient vector, which has the same size as $\theta$ . Each $x^{(i)}$ is a vector containing the feature values of one training example. The hypothesis may be vectorized as well (see above)

Alternatively:

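$\delta=\frac{1}{m}X^T(h_\theta(X)-y)=\frac{1}{m}X^T(X\theta-y)$

This is basically the same thing as above except we changed vector $x^{(i)}$ to matrix $X$ that contains all values, and changed variable $y^{(i)}$ to a vector $y$ that holds all the output values. Multiplying by $X^T$ performs the sum over the training examples and gives $\delta$ as a column vector with the same shape as $\theta$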
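A quick Octave check that the vectorized forms match the per-example sums, using the univariate example table and an arbitrary $\theta$ . The gradient is written as a column vector, `X' * (X * theta - y) / m`, so that it can be subtracted from $\theta$ directly.

```octave
x = [0; 2; 5; 7];
y = [10; 19; 38; 45];
m = length(y);
X = [ones(m, 1), x];       % design matrix with the x_0 = 1 column

theta = [10; 5];           % arbitrary point at which to evaluate

% Vectorized cost and gradient
J_vec = (X * theta - y)' * (X * theta - y) / (2 * m);
delta_vec = X' * (X * theta - y) / m;

% The same quantities from the sums over training examples
J_sum = 0;
delta_sum = zeros(2, 1);
for i = 1:m
    xi = X(i, :)';                  % x^(i) as a column vector
    err = theta' * xi - y(i);       % h_theta(x^(i)) - y^(i)
    J_sum = J_sum + err ^ 2;
    delta_sum = delta_sum + err * xi;
end
J_sum = J_sum / (2 * m);
delta_sum = delta_sum / m;

J_vec, delta_vec    % 1.25 and [-0.5; -3.25], same as J_sum and delta_sum
```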

Logistic Regression

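Logistic regression is used for classification problems and (despite the name) not for regression problems. Using linear regression isn't a good idea for classification problems: its output is not limited to the range between 0 and 1, and a few outlying examples can shift the fitted line and with it the threshold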

Binary Classification: $y\in\{0,1\}$

Where the output has only two outcomes (eg. 0 or 1, true or false)

Multiclass Classification: $y\in\{0, 1, 2, ...\}$

Hypothesis

Sigmoid / Logistic function:

$h_\theta(x)=g(\theta^Tx) \text{ and }g(z)=\frac{1}{1+e^{-z}}$

$h_\theta(x)=\frac{1}{1+e^{-(\theta^Tx)}}$

$h_\theta(x)$ outputs a value between 0 and 1, interpreted as the probability that $y=1$ for input $x$ , which is what we want for binary classification

Example

$h_\theta(x)=P(y=1|x; \theta)=0.3$

This means that the probability of $y=1$ given features $x$ and parameterized by $\theta$ is 30%. This also means that the probability (in binary classification) of $y=0$ is 70%

If we predict $y=1$ when $h_\theta(x)\geq0.5$ , this is the same as predicting $y=1$ when $\theta^Tx\geq0$ , because $g(z)\geq0.5$ exactly when $z\geq0$
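The sigmoid and the 0.5 threshold can be sketched in Octave as follows. The parameter vector and the input are hypothetical values, only there to have something to plug in.

```octave
g = @(z) 1 ./ (1 + exp(-z));    % sigmoid, works element-wise on vectors

g(0)              % exactly 0.5
g([-10 0 10])     % close to 0, 0.5, close to 1

theta = [2; 1];   % hypothetical parameters
x = [1; -3];      % hypothetical input with x_0 = 1
p = g(theta' * x);        % theta' * x = -1, so p = 1 / (1 + e), about 0.269
predict_one = p >= 0.5    % false, because theta' * x < 0
```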

Decision Boundary

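The decision boundary is the set of points where $h_\theta(x)=0.5$ . It is a property of the hypothesis $h_\theta(x)$ (not of the training set) and separates the regions where different classes are predicted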

Example

$h_\theta(x)=g(\theta_0+\theta_1x_1)$ and $\theta=\begin{bmatrix}2\\1\end{bmatrix}$

So plugging in $\theta$ and solving the inequality for when we predict $y=1$ (that is, $h_\theta(x)\geq0.5$ ), we get $2+x_1\geq0$ or $x_1\geq-2$ . This means that we predict $y=1$ when $x_1\geq-2$
Also, $x_1=-2$ is the decision boundary between the two outcomes. The points on it correspond directly to where $h_\theta(x)=0.5$

The decision boundary is not limited to straight lines. Adding polynomial features, for example, allows for more complex boundaries including ellipses and other curves

Example

$h_\theta(x)=g(\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_1^2+\theta_4x_2^2)$ and $\theta=\begin{bmatrix}-1\\0\\0\\1\\1\end{bmatrix}$

This makes the boundary $x_1^2+x_2^2=1$ , which is a circle about the origin with radius $=1$ . We predict $y=1$ on and outside the circle, where $-1+x_1^2+x_2^2\geq0$
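A short Octave sketch of the circular boundary. It assumes $\theta_0=-1$ , which is the value that puts the boundary at $x_1^2+x_2^2=1$ ; the helper names `features` and `h` are only for illustration.

```octave
g = @(z) 1 ./ (1 + exp(-z));

theta = [-1; 0; 0; 1; 1];                        % assumed theta_0 = -1
features = @(x1, x2) [1; x1; x2; x1^2; x2^2];    % x_0 plus polynomial terms
h = @(x1, x2) g(theta' * features(x1, x2));

h(0, 0)    % inside the circle:  g(-1), below 0.5, predict y = 0
h(1, 0)    % on the circle:      g(0) = 0.5, the decision boundary
h(2, 0)    % outside the circle: g(3), above 0.5, predict y = 1
```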

Cost Function

Recall cost function for linear regression with $\frac{1}{2}$ moved inside the summation:

$J(\theta)=\frac{1}{m}\sum^m_{i=1}\frac{1}{2}(h_\theta(x^{(i)})-y^{(i)})^2$

We replace $\frac{1}{2}(h_\theta(x)-y)^2$ with $\text{Cost}(h_\theta(x), y)$ so we have:
$J(\theta)=\frac{1}{m}\sum^m_{i=1}\text{Cost}(h_\theta(x^{(i)}), y^{(i)})$

Using the squared-error cost from linear regression with the logistic regression hypothesis $h_\theta(x)$ results in a non-convex $J(\theta)$ that has many local optima, so gradient descent is not guaranteed to reach the global minimum. Instead we use this new cost function:

$\text{Cost}(h_\theta(x), y)=\begin{cases}-\log(h_\theta(x))&\text{if }y=1\\-\log(1-h_\theta(x))&\text{if }y=0\end{cases}$

Or simply:

$\text{Cost}(h_\theta(x),y)=-y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))$

We plug this into the overall logistic regression cost function $J(\theta)$ , minimize $J(\theta)$ to fit the parameters $\theta$ , and then use the fitted $h_\theta(x)$ to make predictions for new $x$

$J(\theta)=-\frac{1}{m}\sum^m_{i=1}[y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))]$
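A vectorized Octave sketch of this cost function. The four training examples are made up for illustration (one feature plus the $x_0=1$ column).

```octave
g = @(z) 1 ./ (1 + exp(-z));

% Hypothetical training set: first column is x_0 = 1
X = [1 0; 1 1; 1 2; 1 3];
y = [0; 0; 1; 1];
m = length(y);

% y' * log(h) sums the y = 1 terms, (1 - y)' * log(1 - h) sums the y = 0 terms
J = @(theta) -(y' * log(g(X * theta)) + (1 - y)' * log(1 - g(X * theta))) / m;

J([0; 0])    % h = 0.5 for every example, so J = log(2), about 0.693
```

The value $\log 2$ at $\theta=0$ is a handy sanity check for any implementation, since it does not depend on the data.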

Gradient Descent

Recall gradient descent from linear regression: repeat $\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)$ while simultaneously updating all $\theta_j$ . Working out the partial derivatives we get:

$\theta_j:=\theta_j-\alpha\frac{1}{m}\sum^m_{i=1}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$

This looks identical to the gradient descent update for linear regression, except that the hypothesis is now $h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}$ instead of $\theta^Tx$

Vectorization

The gradient descent can be vectorized by using vectors $\theta$ and $x^{(i)}$ :
$\theta:=\theta-\alpha\frac{1}{m}\sum^m_{i=1}(h_\theta(x^{(i)})-y^{(i)})x^{(i)}$
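A sketch of the vectorized update in Octave, where the sum over examples is written as `X' * (g(X * theta) - y)`. The data set, learning rate and iteration count are assumptions for illustration.

```octave
g = @(z) 1 ./ (1 + exp(-z));

% Hypothetical training set: first column is x_0 = 1
X = [1 0; 1 1; 1 2; 1 3];
y = [0; 0; 1; 1];
m = length(y);

J = @(theta) -(y' * log(g(X * theta)) + (1 - y)' * log(1 - g(X * theta))) / m;

alpha = 0.1;              % assumed learning rate
theta = zeros(2, 1);
J_start = J(theta);

for iter = 1:5000         % assumed iteration count
    theta = theta - alpha * X' * (g(X * theta) - y) / m;
end

J_end = J(theta);
predictions = g(X * theta) >= 0.5;    % compare against y
```

The cost should drop from `J_start` to `J_end`, and on this small separable set the thresholded predictions are expected to match $y$ .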

Multiclass Classification

$y$ can take more than two values ( $y\in\{0, 1, 2, ...\}$ )

One vs. All method trains a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y=i$ . For predictions with new input $x$ , choose class that maximizes $h_\theta^{(i)}(x)$

$h_\theta^{(i)}(x)=P(y=i|x;\theta)\quad(i=0,1,2,3, ...)$

Where the superscript $(i)$ indexes which class it is.

For a multiclass classification problem with $k$ classes, $k$ different logistic regression classifiers are trained, one per class, each with its own decision boundary separating that class from all the others
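The prediction step of One vs. All can be sketched in Octave as below. The matrix `Theta` stands in for three already-trained classifiers (one row per class); its values are hypothetical and not the result of any training.

```octave
g = @(z) 1 ./ (1 + exp(-z));

% Hypothetical trained parameters, one row per class (classes 0, 1, 2)
Theta = [0 -1 -1;
         0  1  0;
         0  0  1];

x = [1; 3; 1];                  % new input with x_0 = 1

probs = g(Theta * x);           % h^(i)(x) for every class i
[~, idx] = max(probs);          % row with the largest probability
predicted_class = idx - 1       % rows are 1-based, classes start at 0
```

Here the three scores are $g(-4)$ , $g(3)$ and $g(1)$ , so class 1 is chosen.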

Advanced Optimization

Gradient descent is not always the most efficient algorithm for optimizing / minimizing $J(\theta)$ . Others include Conjugate Gradient, BFGS and L-BFGS. They are more complex than gradient descent, but often run much faster

Example
$\theta=\begin{bmatrix}\theta_1\\\theta_2\end{bmatrix}$
$J(\theta)=(\theta_1-5)^2+(\theta_2-5)^2$
$\frac{\partial}{\partial\theta_1}J(\theta)=2(\theta_1-5)$
$\frac{\partial}{\partial\theta_2}J(\theta)=2(\theta_2-5)$

$J(\theta)$ is minimized at $\theta_1=5, \theta_2=5$

The following code shows how to run this minimization with fminunc in MATLAB/Octave

function [jVal, gradient] = costFunction(theta)
    % Cost J(theta) and its gradient for the example above
    jVal = (theta(1)-5)^2 + (theta(2)-5)^2;
    gradient = zeros(2,1);
    gradient(1) = 2*(theta(1)-5);   % partial derivative w.r.t. theta_1
    gradient(2) = 2*(theta(2)-5);   % partial derivative w.r.t. theta_2
end

% 'GradObj' 'on' tells fminunc that costFunction also returns the gradient
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);

We implement the costFunction function that returns two things: jVal is the cost and gradient is a $2\times1$ vector that corresponds to the two partial derivatives
The advanced optimization function fminunc is then called after costFunction is set up. It takes a function handle to the cost function (@costFunction), the initial $\theta$ and the options
fminunc should return the optimal $\theta$ in optTheta, here $\theta_1=5, \theta_2=5$

Overfitting & Underfitting

Underfitting, or high bias, is when the hypothesis doesn't fit the training set very well. Usually caused by too few features

Overfitting, or high variance, is when the hypothesis fits the training examples very well but fails to generalize to new samples. Usually caused by too many features

To address overfitting:

Reduce number of features (reducing features discards information about the problem)
- Manually select features to remove
- Model selection algorithm
Regularization
- Keep all the features, but reduce the magnitude of the parameters $\theta_j$

Regularization

Regularization reduces the magnitude of the parameters $\theta_j$ , which gives a "simpler" hypothesis that is less prone to overfitting. So we modify the cost function:
$J(\theta)=\frac{1}{2m}[\sum^m_{i=1}(h_\theta(x^{(i)})-y^{(i)})^2+\lambda\sum^n_{j=1}\theta_j^2]$

The term added at the end penalizes large $\theta$ magnitudes, where $\lambda$ is the regularization parameter that controls how strongly the $\theta$ 's are kept small. If $\lambda$ is too large, the algorithm will result in underfitting
Note: $j$ starts from 1 because by convention, we don't need to regularize $\theta_0$

Regularized Linear Regression

Regularized Gradient Descent

Modified gradient descent for linear regression:

Repeat {

$\theta_0:=\theta_0-\alpha\frac{1}{m}\sum^m_{i=1}(h_\theta(x^{(i)})-y^{(i)})x_0^{(i)}$

$\theta_j:=\theta_j-\alpha\left[\frac{1}{m}\sum^m_{i=1}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}+\frac{\lambda}{m}\theta_j\right]\quad(j=1,2,...,n)$

}

Since we don't regularize $\theta_0$ , its update is written separately. For every other $\theta_j$ , we take the partial derivative of the new regularized $J(\theta)$ , which adds the $\frac{\lambda}{m}\theta_j$ term

Grouping the terms together, we can write:

$\theta_j:=\theta_j(1-\alpha\frac{\lambda}{m})-\alpha\frac{1}{m}\sum^m_{i=1}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$

$(1-\alpha\frac{\lambda}{m}) < 1$ for $\alpha, \lambda, m>0$ , so multiplying it with $\theta_j$ shrinks $\theta_j$ a little on every iteration. The second part of the equation is identical to the unregularized update
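A small Octave check that one regularized step gives the same result in both forms: subtracting $\alpha$ times the gradient plus the $\frac{\lambda}{m}\theta_j$ term, or shrinking $\theta_j$ first and then taking the unregularized step. Data is the univariate example table; $\alpha$ , $\lambda$ and the starting $\theta$ are assumed values.

```octave
x = [0; 2; 5; 7];
y = [10; 19; 38; 45];
m = length(y);
X = [ones(m, 1), x];

alpha = 0.02;       % assumed learning rate
lambda = 1;         % assumed regularization parameter
theta = [1; 1];     % arbitrary starting point

grad = X' * (X * theta - y) / m;    % unregularized gradient

% Form 1: gradient plus (lambda / m) * theta_j, with no term for theta_0
step1 = theta - alpha * (grad + (lambda / m) * [0; theta(2)]);

% Form 2: shrink theta_j by (1 - alpha * lambda / m), then the usual step
shrink = [1; 1 - alpha * lambda / m];
step2 = theta .* shrink - alpha * grad;

max(abs(step1 - step2))    % the two forms agree up to rounding
```

With these values the shrink factor is $1-0.02\cdot\frac{1}{4}=0.995$ .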

Regularized Normal Equation

Recall normal equation:

$X=\begin{bmatrix}(x^{(1)})^T\\\vdots\\(x^{(m)})^T\end{bmatrix}$

$y=\begin{bmatrix}y^{(1)}\\\vdots\\y^{(m)}\end{bmatrix}$

$\theta=(X^TX)^{-1}X^Ty$

The modified normal equation adds an extra term: $\lambda$ multiplied by an $(n+1)\times(n+1)$ identity matrix, except with 0 for the element in the first row and first column (so that $\theta_0$ is not regularized):

$\theta=\left(X^TX+\lambda\begin{bmatrix}0&0&0&\cdots&0\\0&1&0&\cdots&0\\0&0&1&\cdots&0\\\vdots&\vdots&\vdots&\ddots&\vdots\\0&0&0&\cdots&1\end{bmatrix}\right)^{-1}X^Ty$

If $m\leq n$ , then $X^TX$ is non-invertible. Regularization fixes this as long as $\lambda>0$ : the matrix $X^TX+\lambda A$ , where $A$ is our "modified identity matrix", is invertible
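A sketch of the regularized normal equation in Octave on the earlier three-example table ( $m=3$ , $n=3$ , so $m\leq n$ ). The value of `lambda` is an assumption. The backslash operator solves the linear system instead of forming the inverse explicitly.

```octave
X = [1 64 25  59;
     1 48 11 100;
     1  5 88  76];
y = [1600; 800; 450];
n = size(X, 2) - 1;

lambda = 1;                % assumed regularization parameter
A = eye(n + 1);
A(1, 1) = 0;               % do not regularize theta_0

rank(X' * X)               % 3: fewer than n + 1 = 4, so X'X is singular
rank(X' * X + lambda * A)  % 4: the regularized matrix has full rank

theta = (X' * X + lambda * A) \ (X' * y);
```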

Regularized Logistic Regression

Recall the cost function for logistic regression. We add the term that applies the regularization:

$J(\theta)=-\frac{1}{m}\sum^m_{i=1}[y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))]+\frac{\lambda}{2m}\sum^n_{j=1}\theta_j^2$

Recall gradient descent for logistic regression. Same as in regularized linear regression gradient descent, we separate the update for $\theta_0$ from the updates for the other $\theta_j$ . The updates are the same as for regularized linear regression except that $h_\theta(x)$ is the logistic regression hypothesis
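A sketch of the regularized logistic cost and its gradient in Octave, with a numerical check of the gradient by central differences. The data, `lambda`, the test point and the helper `reg_mask` (which zeroes out the $\theta_0$ entry) are all assumptions for illustration.

```octave
g = @(z) 1 ./ (1 + exp(-z));

% Hypothetical training set: first column is x_0 = 1
X = [1 0; 1 1; 1 2; 1 3];
y = [0; 0; 1; 1];
m = length(y);
lambda = 1;                 % assumed regularization parameter
reg_mask = [0; 1];          % 0 for theta_0, 1 for every regularized theta_j

J = @(theta) -(y' * log(g(X * theta)) + (1 - y)' * log(1 - g(X * theta))) / m ...
             + lambda / (2 * m) * sum((reg_mask .* theta) .^ 2);
grad = @(theta) X' * (g(X * theta) - y) / m + (lambda / m) * (reg_mask .* theta);

% Numerical gradient at an arbitrary point
theta = [0.5; -0.25];
step = 1e-6;
num_grad = zeros(2, 1);
for j = 1:2
    e = zeros(2, 1);
    e(j) = step;
    num_grad(j) = (J(theta + e) - J(theta - e)) / (2 * step);
end

max(abs(num_grad - grad(theta)))    % should be tiny if grad matches J
```

This kind of check is a cheap way to catch a sign or indexing mistake in the gradient before handing the cost function to an optimizer.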

Regularized Advanced Optimization

Recall advanced optimization

function [jVal, gradient] = costFunction(theta)
    jVal = % code to compute the regularized J(theta)
    gradient(1) = % partial derivative of J(theta) with respect to theta_0
    gradient(2) = % partial derivative of J(theta) with respect to theta_1
    gradient(3) = % partial derivative of J(theta) with respect to theta_2
    % ...
    gradient(n+1) = % partial derivative of J(theta) with respect to theta_n
end

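The code is very similar to the unregularized version, except that the code to compute the cost function and its partial derivatives now includes the regularization terms. Note that MATLAB/Octave indexes from 1, so gradient(1) corresponds to $\theta_0$
This costFunction can be passed to fminunc to get the regularized parameters $\theta$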