# Supervised vs Unsupervised Learning

• Supervised learning trains on labeled data, where each example comes with the correct answer (an e-mail spam filter is an example)
• Unsupervised learning looks at sets of data without any "correct" answers given (clustering similar data would be an example)

# Classification vs Regression Problem

• Classification looks at discrete data sets and predicts a discrete result (e.g. true or false)
• Regression looks at continuous sets of data and predicts a value by interpolating/extrapolating from existing data (e.g. predicting the amount of rainfall)

# Univariate Linear Regression

$m$ = number of training examples
$x$'s = "input" variables / features
$y$'s = "output" variable / "target" variable

$(x, y)$ = one training example
$(x^{(i)}, y^{(i)})$ = the $i$th training example

Example

| Input ($x$) | Output ($y$) |
| --- | --- |
| 0 | 10 |
| 2 | 19 |
| 5 | 38 |
| 7 | 45 |

The number of training examples is 4, so $m=4$. $x^{(2)}$ is 2, and $(x^{(3)}, y^{(3)})$ is the third training example, $(5, 38)$.
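As a sketch (not part of the original notes), the same notation can be mirrored in Python, with plain lists standing in for the training set:

```python
# Training set from the table above: m = 4 examples (x, y)
training_set = [(0, 10), (2, 19), (5, 38), (7, 45)]

m = len(training_set)               # number of training examples -> 4
x = [ex[0] for ex in training_set]  # input values
y = [ex[1] for ex in training_set]  # output values

# The superscript (i) is 1-indexed in the notes, so x^(2) is x[1] here
print(m)             # 4
print(x[1])          # 2, i.e. x^(2)
print((x[2], y[2]))  # (5, 38), the third training example
```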

## Hypothesis

The hypothesis maps $x$'s to $y$'s:

$$h_\theta(x) = \theta_0 + \theta_1 x$$

## Cost Function

The cost function is used to choose $\theta_0$ and $\theta_1$ by minimizing the difference between the hypothesized output and the actual $y$:

$$J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

We minimize $J(\theta_0, \theta_1)$ by varying $\theta_0$ and $\theta_1$, where $h_\theta(x^{(i)}) = \theta_0 + \theta_1 x^{(i)}$ is the hypothesis.

## Gradient Descent

Repeat the following operation until the cost function is minimized:

$$\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1)$$

Expanding the partial derivatives:

$$\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$$

$$\theta_1 := \theta_1 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}$$

Where $x_0 = 1$ and $\alpha$ is the learning rate. The two statements are computed simultaneously. Repeat until gradient descent reaches a local minimum.
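A minimal Python sketch of this update rule (my own illustration, not from the course), run on the four-example training set above; the learning rate and iteration count are hand-picked assumptions:

```python
# Univariate linear regression via batch gradient descent
xs = [0, 2, 5, 7]
ys = [10, 19, 38, 45]
m = len(xs)

theta0, theta1 = 0.0, 0.0
alpha = 0.05  # learning rate (hand-picked)

for _ in range(5000):
    # errors h_theta(x^(i)) - y^(i) for all m examples
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    # compute both gradients first, then update simultaneously
    grad0 = sum(errors) / m
    grad1 = sum(e * x for e, x in zip(errors, xs)) / m
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

print(theta0, theta1)  # converges near the least-squares fit
```

With enough iterations this settles at roughly $\theta_0 \approx 9.78$, $\theta_1 \approx 5.21$ for this data.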

# Multivariate Linear Regression

$n$ = number of features
$\vec\theta$ contains all the theta parameters $\theta_0, \theta_1, \ldots, \theta_n$

The hypothesis becomes $h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n = \theta^T x$, where $x_0 = 1$.

Repeat the following simultaneously for every $j = 0, \ldots, n$ until the minimum is reached:

$$\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

$\theta_0, \theta_1, \ldots$ can be treated as elements of $\vec\theta$. Likewise for $x_0, x_1, \ldots$

## Feature Scaling

Feature scaling makes gradient descent converge faster.

Example: $x_1$ is between 0 and 100,000, $x_2$ is between 0 and 5, and $x_3$ is between 0 and 0.0001.

Running gradient descent without scaling is very slow. Appropriate scaling in this case:

$$x_1 := \frac{x_1}{100000},\quad x_2 := \frac{x_2}{5},\quad x_3 := \frac{x_3}{0.0001}$$

This brings every feature into approximately the range $0 \leq x_j \leq 1$.

## Mean Normalization

Mean normalization is used to center the values around 0:

$$x := \frac{x - \mu}{x_{max} - x_{min}}$$

Where $x$ is the original non-normalized value, $\mu$ is the average of the feature set, $x_{max}$ is the maximum value of the feature set, and $x_{min}$ is the minimum value of the feature set.
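A quick Python sketch of mean normalization on a made-up feature column (the values are arbitrary, chosen so the result is easy to verify by hand):

```python
# Mean-normalize one feature column: x := (x - mu) / (x_max - x_min)
feature = [10.0, 20.0, 40.0, 50.0]

mu = sum(feature) / len(feature)  # 30.0
x_max, x_min = max(feature), min(feature)

normalized = [(x - mu) / (x_max - x_min) for x in feature]
print(normalized)  # [-0.5, -0.25, 0.25, 0.5] -> centered around 0
```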

# Polynomial Regression

Polynomial regression is basically multivariate linear regression, except $x_2, x_3, \ldots$ are the square, cube, ... of $x_1$. This represents a graph that is not linear but polynomial (duh).

Example:

$$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3$$

Where $x_0=1$, $x_1=x_1$, $x_2 = x_1^2$, and $x_3=x_1^3$.
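A small Python sketch of this feature expansion (my own example; the $\theta$ values are hypothetical), showing that the multivariate machinery applies unchanged:

```python
# Expand a single input x1 into features [1, x1, x1^2, x1^3] so the
# multivariate linear-regression hypothesis theta^T x applies unchanged
def poly_features(x1, degree=3):
    # x_0 = 1 (bias), then x1^1 .. x1^degree
    return [1] + [x1 ** d for d in range(1, degree + 1)]

def hypothesis(theta, features):
    # h_theta(x) = theta^T x
    return sum(t * f for t, f in zip(theta, features))

feats = poly_features(2)  # [1, 2, 4, 8]
theta = [1, 0, 0.5, 0]    # hypothetical parameters
print(hypothesis(theta, feats))  # 1 + 0 + 0.5*4 + 0 = 3.0
```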

# Normal Equation

The normal equation is an alternative method to gradient descent; it is much faster unless there are a lot of features (e.g. $n = 100000$).

Example

| $x_1$ | $x_2$ | $x_3$ | $y$ |
| --- | --- | --- | --- |
| 64 | 25 | 59 | 1600 |
| 48 | 11 | 10 | 800 |
| 58 | 87 | 64 | 50 |

$n = 3$, $m = 3$

All feature values are stored in an $m\times (n+1)$ design matrix $X$, with a column of 1s prepended to account for $x_0$, which is always 1. All output values are stored in the vector $y$. We then plug these into the formula

$$\theta = (X^TX)^{-1}X^Ty$$

which returns theta in the form of a vector.

Note: the rows of the design matrix $X$ are the training examples $x^{(i)}$ transposed. In this case, $x^{(1)}=\begin{bmatrix}1\\64\\25\\59\end{bmatrix}$

To invert matrices, use the pinv() function, as certain matrices inside the calculation may not be invertible.
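A pure-Python sketch of the normal equation for a 2-parameter univariate fit (my own illustration; real code would call a library inverse such as pinv, as noted above). The closed-form inverse of the $2\times2$ matrix $X^TX$ is used so no library is needed:

```python
# Normal equation theta = (X^T X)^{-1} X^T y for a 2-parameter fit,
# using the closed-form inverse of the 2x2 matrix X^T X
xs = [0, 2, 5, 7]
ys = [10, 19, 38, 45]
X = [[1, x] for x in xs]  # design matrix with the x_0 = 1 column

# X^T X (2x2, entries a, b, b, d) and X^T y (2x1, entries v0, v1)
a = sum(row[0] * row[0] for row in X)
b = sum(row[0] * row[1] for row in X)
d = sum(row[1] * row[1] for row in X)
v0 = sum(row[0] * y for row, y in zip(X, ys))
v1 = sum(row[1] * y for row, y in zip(X, ys))

det = a * d - b * b  # assumes X^T X is invertible
theta0 = (d * v0 - b * v1) / det
theta1 = (a * v1 - b * v0) / det
print(theta0, theta1)  # exact least-squares solution, no iteration
```

Unlike gradient descent, this gives the exact minimizer in one shot ($\theta_0 \approx 9.78$, $\theta_1 \approx 5.21$ for this data).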

# Linear Regression Vectorization

## Hypothesis

$$h_\theta(x^{(i)}) = \theta^T x^{(i)}$$

Where the hypothesis equals the transposed vector $\theta$ multiplied by the vector $x^{(i)}$. Alternatively:

$$h = X\theta$$

Where the vector of hypotheses equals the matrix product of $X$, the matrix containing all features and sample sets, and the vector $\theta$.

## Cost Function

The vectorized form for the multivariate case can be written as:

$$J(\theta) = \frac{1}{2m}(X\theta - \vec{y})^T(X\theta - \vec{y})$$

Where $X$ is the matrix containing the feature sets, $\theta$ is a vector containing $\theta_0, \theta_1, \ldots$ and $\vec{y}$ is a vector containing all the $y$ values.

Alternatively, this is the same thing as:

$$J(\theta) = \frac{1}{2m}\,\text{sum}\!\left((X\theta - \vec{y})^2\right)$$

Where the squaring is element-wise and the sum runs over the elements.

## Gradient Descent

$$\theta := \theta - \alpha\delta$$

Where $\theta$ is the vector changed during gradient descent, and $\delta$ is a vector calculated from $\frac{1}{m}\sum^m_{i=1}(h_\theta(x^{(i)})-y^{(i)})x^{(i)}$, where $x^{(i)}$ is a vector containing one set of feature values. The hypothesis may be vectorized (see above).

Alternatively:

$$\theta := \theta - \frac{\alpha}{m}X^T(X\theta - \vec{y})$$

This is basically the same thing as above, except the vector $x^{(i)}$ is replaced by the matrix $X$ that contains all values, and the variable $y^{(i)}$ by the vector $\vec{y}$ that holds all the outputs.
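A Python sketch of the fully vectorized update $\theta := \theta - \frac{\alpha}{m}X^T(X\theta - \vec{y})$ (my own illustration; tiny list-based helpers stand in for a matrix library):

```python
# Vectorized gradient descent: theta := theta - (alpha/m) X^T (X theta - y)
def matvec(M, v):
    # matrix-vector product
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

xs = [0, 2, 5, 7]
ys = [10, 19, 38, 45]
X = [[1, x] for x in xs]  # design matrix with x_0 = 1
m, alpha = len(xs), 0.05
theta = [0.0, 0.0]

for _ in range(5000):
    errors = [h - y for h, y in zip(matvec(X, theta), ys)]  # X theta - y
    grad = matvec(transpose(X), errors)                     # X^T (X theta - y)
    theta = [t - alpha * g / m for t, g in zip(theta, grad)]

print(theta)  # same answer as the element-wise updates
```

This updates every $\theta_j$ simultaneously in one matrix expression, with no per-parameter loop.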

# Logistic Regression

Logistic regression is used for classification problems and is (despite the name) not a regression method. Using linear regression for classification problems isn't a good idea.

Binary Classification: $y\in\{0,1\}$

Where the output has only two outcomes (eg. 0 or 1, true or false)

Multiclass Classification: $y\in\{0, 1, 2, ...\}$

## Hypothesis

Sigmoid / logistic function:

$$h_\theta(x) = g(\theta^Tx), \qquad g(z) = \frac{1}{1+e^{-z}}$$

$h_\theta(x)$ outputs the probability that $y=1$ on input $x$, a value between 0 and 1, which is what we want for binary classification.
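The sigmoid is one line of Python (a sketch, not part of the original notes):

```python
import math

# Sigmoid / logistic function g(z) = 1 / (1 + e^{-z})
def g(z):
    return 1.0 / (1.0 + math.exp(-z))

print(g(0))    # 0.5 -> exactly on the decision boundary
print(g(10))   # close to 1 -> predict y = 1
print(g(-10))  # close to 0 -> predict y = 0
```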

Example:

$$h_\theta(x) = P(y=1\,|\,x;\theta) = 0.3$$

This means that the probability of $y=1$, given features $x$ and parameterized by $\theta$, is 30%. This also means that the probability (in binary classification) of $y=0$ is 70%.

If we predict $y=1$ when $h_\theta(x)\geq0.5$, then equivalently $\theta^Tx\geq0$, since $g(z)\geq0.5$ exactly when $z\geq0$.

## Decision Boundary

The decision boundary is a property of $h_\theta(x)$: it separates the different classification outcomes.

Example:

$$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$$

and

$$\theta = \begin{bmatrix}2\\1\\0\end{bmatrix}$$

Plugging in $\theta$ and solving the inequality for predicting $y=1$, i.e. $h_\theta(x)\geq0.5$, we get $2+x_1\geq0$, or $x_1\geq-2$. This means that we predict $y=1$ when $x_1\geq-2$.

Also, $x_1=-2$ is a line that acts as the decision boundary between the two outcomes. The points on the line correspond directly to where $h_\theta(x)=0.5$.

The decision boundary is not limited to lines. Polynomial terms, for example, allow for more complex boundaries, including ellipses and other curves.

Example:

$$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)$$

and

$$\theta = \begin{bmatrix}-1\\0\\0\\1\\1\end{bmatrix}$$

Predicting $y=1$ gives $-1 + x_1^2 + x_2^2 \geq 0$. This makes the boundary $x_1^2+x_2^2=1$, which is a circle about the origin with radius $1$.
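A quick numeric check of this circular boundary in Python (a sketch; the $\theta$ values are the ones from the example above):

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))

# theta = [-1, 0, 0, 1, 1] over features [1, x1, x2, x1^2, x2^2]
def predict(x1, x2):
    z = -1 + x1 ** 2 + x2 ** 2  # theta^T x for this theta
    return 1 if g(z) >= 0.5 else 0

print(predict(0, 0))  # inside the unit circle  -> 0
print(predict(2, 0))  # outside the unit circle -> 1
print(predict(1, 0))  # on the circle, g(0) = 0.5 -> 1
```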

## Cost Function

Recall the cost function for linear regression with the $\frac{1}{2}$ moved inside the summation:

$$J(\theta) = \frac{1}{m}\sum^m_{i=1}\frac{1}{2}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$$

We replace $\frac{1}{2}(h_\theta(x)-y)^2$ with $\text{Cost}(h_\theta(x), y)$ so we have:

$$J(\theta) = \frac{1}{m}\sum^m_{i=1}\text{Cost}(h_\theta(x^{(i)}), y^{(i)})$$

Running this linear regression cost function with the logistic regression hypothesis ($h_\theta(x)$) results in a non-convex $J(\theta)$ with many local optima, so gradient descent is not guaranteed to reach the global minimum. Instead we use this new cost function:

$$\text{Cost}(h_\theta(x), y) = \begin{cases}-\log(h_\theta(x)) & \text{if } y=1\\-\log(1-h_\theta(x)) & \text{if } y=0\end{cases}$$

Or simply:

$$\text{Cost}(h_\theta(x), y) = -y\log(h_\theta(x)) - (1-y)\log(1-h_\theta(x))$$

We can then plug this into the overall logistic regression cost function:

$$J(\theta) = -\frac{1}{m}\sum^m_{i=1}\left[y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]$$

To make predictions given new $x$, we fit the parameters $\theta$ by minimizing $J(\theta)$.

Recall gradient descent from linear regression: repeat $\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)$ while simultaneously updating all $\theta_j$. Doing the partial derivatives we get:

$$\theta_j := \theta_j - \alpha\frac{1}{m}\sum^m_{i=1}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}$$

This is identical to the gradient descent algorithm for linear regression, except that here $h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}$.
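A Python sketch of this descent on hypothetical one-feature data (my own example; alpha, the iteration count, and the data are all arbitrary choices):

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy 1-feature binary data (hypothetical): small x -> y=0, large x -> y=1
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 1, 1, 1]
m = len(xs)

theta0, theta1 = 0.0, 0.0
alpha = 0.1

for _ in range(20000):
    # same update as linear regression, but h is the sigmoid hypothesis
    errors = [g(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
    grad0 = sum(errors) / m
    grad1 = sum(e * x for e, x in zip(errors, xs)) / m
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

# Decision boundary theta0 + theta1*x = 0 should fall between x=3 and x=6
boundary = -theta0 / theta1
print(boundary)
```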

## Vectorization

Gradient descent can be vectorized using $\theta$ and the design matrix $X$:

$$\theta := \theta - \frac{\alpha}{m}X^T\left(g(X\theta) - \vec{y}\right)$$

## Multiclass Classification

$y$ can be more than two outcomes ($y\in\{0, 1, 2, ...\}$)

The one-vs-all method trains a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y=i$. To predict on a new input $x$, choose the class that maximizes $h_\theta^{(i)}(x)$.

Where the superscript $(i)$ indexes which class it is.

For a multiclass classification problem with $n$ classes, $n$ different logistic regression classifiers will be trained ($n$ decision boundaries, one per class).
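A Python sketch of the one-vs-all prediction step (the $\theta$ values are hypothetical stand-ins for already-trained classifiers):

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))

# One theta vector per class over features [1, x] (hypothetical values,
# standing in for classifiers already trained by gradient descent)
thetas = {
    0: [1.0, -2.0],  # classifier for class 0
    1: [-1.0, 0.5],  # classifier for class 1
    2: [-3.0, 1.0],  # classifier for class 2
}

def predict(x):
    # Pick the class whose classifier outputs the highest probability
    scores = {c: g(t[0] + t[1] * x) for c, t in thetas.items()}
    return max(scores, key=scores.get)

print(predict(0.0))  # class 0: g(1) is the largest score
print(predict(2.0))  # class 1
print(predict(5.0))  # class 2
```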

Gradient descent is not always the most efficient algorithm for optimizing/minimizing $J(\theta)$. Others include Conjugate Gradient, BFGS, and L-BFGS. They are more complex than gradient descent, but often run much faster.

Example:

$$J(\theta) = (\theta_1 - 5)^2 + (\theta_2 - 5)^2$$

$J(\theta)$ is minimized at $\theta_1=5, \theta_2=5$.

The following code implements the descent in MATLAB/Octave:

```matlab
function [jVal, gradient] = costFunction(theta)
    jVal = (theta(1)-5)^2 + (theta(2)-5)^2;
    gradient = zeros(2,1);
    gradient(1) = 2*(theta(1)-5);
    gradient(2) = 2*(theta(2)-5);
end

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
```

We implement the costFunction function, which returns two things: jVal is the cost, and gradient is a $2\times1$ vector containing the two partial derivatives.

The advanced optimization function fminunc is then called after costFunction is set up. It takes a function handle to the cost function (@costFunction) and some other options.

fminunc should return the optimal $\theta$
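Python has no fminunc, so as a rough stand-in (my own sketch), plain gradient descent on the same $J(\theta)$ reaches the same minimum:

```python
# Minimize J(theta) = (theta1-5)^2 + (theta2-5)^2 by gradient descent,
# using the same gradients as the costFunction example above
theta = [0.0, 0.0]
alpha = 0.1

for _ in range(1000):
    gradient = [2 * (theta[0] - 5), 2 * (theta[1] - 5)]
    theta = [t - alpha * g for t, g in zip(theta, gradient)]

print(theta)  # approaches [5.0, 5.0], the minimizer
```

A library minimizer would find the same point; the advantage of fminunc-style routines is that they pick step sizes automatically and usually converge in far fewer iterations.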

# Overfitting & Underfitting

Underfitting, or high bias, is when the hypothesis doesn't fit the training set very well. It is usually caused by too few features.

Overfitting, or high variance, fits the training examples very well but fails to generalize enough to predict future samples. It is usually caused by too many features.

There are two main ways to address overfitting:

1. Reduce the number of features (reducing features discards information about the problem)

• Manually select features to remove
• Use a model selection algorithm
2. Regularization

• Keep all the features, but reduce the magnitude of the parameters $\theta_j$

# Regularization

Regularization reduces the magnitudes of the parameters $\theta_j$, which gives a "simpler" hypothesis that is less prone to overfitting. So we modify the cost function:

$$J(\theta) = \frac{1}{2m}\left[\sum^m_{i=1}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2 + \lambda\sum^n_{j=1}\theta_j^2\right]$$

A term is added at the end to regularize all the $\theta$ magnitudes, where $\lambda$ is the regularization parameter that keeps the $\theta_j$ small. If $\lambda$ is too large, the algorithm will underfit.

Note: $j$ starts from 1 because, by convention, we don't regularize $\theta_0$.

## Regularized Linear Regression

Modified gradient descent for linear regression:

Repeat {

$$\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum^m_{i=1}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_0^{(i)}$$

$$\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum^m_{i=1}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right]$$

}

Since we don't regularize $\theta_0$, its update is separated out. For every other $\theta_j$, we take the partial derivative of the new regularized $J(\theta)$.

Grouping the terms together, we can write:

$$\theta_j := \theta_j\left(1-\alpha\frac{\lambda}{m}\right) - \alpha\frac{1}{m}\sum^m_{i=1}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}$$

$(1-\alpha\frac{\lambda}{m}) < 1$ for $\alpha, \lambda, m>0$, so the factor multiplied onto $\theta_j$ shrinks it on every iteration. The second part of the equation is identical to the unregularized update.
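A one-step numeric check in Python (my own sketch; all the numbers are arbitrary) that the grouped form equals the ungrouped regularized update:

```python
# One regularized update for a single theta_j, comparing the ungrouped form
# theta_j - alpha*(grad + (lambda/m)*theta_j) with the grouped form
# theta_j*(1 - alpha*lambda/m) - alpha*grad. Numbers are arbitrary.
alpha, lam, m = 0.1, 2.0, 4
theta_j = 3.0
grad_unreg = 0.5  # stands in for (1/m) * sum((h - y) * x_j)

ungrouped = theta_j - alpha * (grad_unreg + (lam / m) * theta_j)
grouped = theta_j * (1 - alpha * lam / m) - alpha * grad_unreg

print(ungrouped, grouped)  # both 2.8 (up to floating point)
```

The shrink factor here is $(1 - 0.1 \cdot 2/4) = 0.95$, so $\theta_j$ is pulled toward 0 before the usual gradient step is applied.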

### Regularized Normal Equation

Recall the normal equation:

$$\theta = (X^TX)^{-1}X^Ty$$

The modified normal equation adds an extra term: $\lambda$ multiplied by a $(n+1)\times(n+1)$ identity matrix, except with 0 for the element in the first row and first column:

$$\theta = \left(X^TX + \lambda\begin{bmatrix}0&&&&\\&1&&&\\&&1&&\\&&&\ddots&\\&&&&1\end{bmatrix}\right)^{-1}X^Ty$$

If $m\leq n$, then $X^TX$ is non-invertible; regularization fixes this as long as $\lambda>0$, so $X^TX+\lambda A$, where $A$ is our "modified identity matrix", is invertible.

## Regularized Logistic Regression

Recall the cost function for logistic regression; we add the term that applies the regularization:

$$J(\theta) = -\frac{1}{m}\sum^m_{i=1}\left[y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m}\sum^n_{j=1}\theta_j^2$$

Recall gradient descent for logistic regression. As with regularized linear regression, we separate the updates for $\theta_0$ and the other $\theta_j$. The gradient descent is the same as regularized linear regression, except that $h_\theta(x)$ is the logistic hypothesis.

Recall advanced optimization:

```matlab
function [jVal, gradient] = costFunction(theta)
    jVal = % code to compute J(theta)
    gradient(1) = % partial derivative of J(theta) w.r.t. theta_0
    gradient(2) = % partial derivative of J(theta) w.r.t. theta_1
    gradient(3) = % partial derivative of J(theta) w.r.t. theta_2
    % ...
    gradient(n+1) = % partial derivative of J(theta) w.r.t. theta_n
end
```

This costFunction can be passed to fminunc to fit the regularized hypothesis.