# Machine Learning Notes

- Supervised vs. Unsupervised Learning
- Classification vs. Regression Problems
- Univariate Linear Regression
  - Hypothesis
  - Cost Function
  - Gradient Descent
- Multivariate Linear Regression
  - Gradient Descent
  - Feature Scaling
  - Mean Normalization
  - Polynomial Regression
  - Normal Equation
- Linear Regression Vectorization
  - Hypothesis
  - Cost Function
  - Gradient Descent
- Logistic Regression
  - Hypothesis
  - Decision Boundary
  - Cost Function
  - Gradient Descent
  - Vectorization
  - Multiclass Classification
  - Advanced Optimization
- Overfitting & Underfitting
- Regularization
  - Regularized Linear Regression
    - Regularized Gradient Descent
    - Regularized Normal Equation
  - Regularized Logistic Regression
    - Regularized Advanced Optimization

**Supervised** learning uses sets of correct (labeled) training data (an e-mail spam filter is an example). **Unsupervised** learning looks at sets of data without any absolutely correct answers (clustering similar data would be an example)

**Classification** looks at discrete data sets and predicts a discrete result (i.e. true or false). **Regression** looks at a continuous set of data and predicts a value by interpolating/extrapolating from existing data (e.g. predicting the amount of rainfall)

$m$ = number of training examples; $x$'s = "input" variables / features; $y$'s = "output" / "target" variables

$(x, y)$ = one training example; $(x^{(i)}, y^{(i)})$ = the $i$-th training example

`Example`

Input ($x$) | Output ($y$)
---|---
0 | 10
2 | 19
5 | 38
7 | 45

Number of training examples $m$ is 4; $x^{(2)}$ is 2; $(5, 38)$ is the third training example, so $x^{(3)} = 5$ and $y^{(3)} = 38$

**Hypothesis** $h_\theta(x) = \theta_0 + \theta_1 x$ maps $x$'s to $y$'s

**Cost function** $J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$ is used to choose $\theta_0$ and $\theta_1$ by minimizing the squared difference between the hypothesized output $h_\theta(x^{(i)})$ and the actual output $y^{(i)}$

We need to minimize $J(\theta_0, \theta_1)$ by varying $\theta_0$ and $\theta_1$, where $h_\theta(x) = \theta_0 + \theta_1 x$ is the hypothesis

Repeat the following operation until the cost function is minimized:

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$

`Example`

$\theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$

$\theta_1 := \theta_1 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}$

Where $j \in \{0, 1\}$ and $\alpha$ is the learning rate; the two statements are computed simultaneously. Repeat until gradient descent reaches a local minimum
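
A minimal Octave sketch of this loop (the function name and the fixed iteration count are illustrative choices, not from the notes):

```matlab
% Batch gradient descent for univariate linear regression.
% x, y: column vectors of training data; alpha: learning rate.
function [theta0, theta1] = gradientDescentUni(x, y, alpha, numIters)
  m = length(y);
  theta0 = 0; theta1 = 0;
  for iter = 1:numIters
    h = theta0 + theta1 * x;                         % hypothesis for all examples
    temp0 = theta0 - alpha * (1/m) * sum(h - y);
    temp1 = theta1 - alpha * (1/m) * sum((h - y) .* x);
    theta0 = temp0; theta1 = temp1;                  % simultaneous update
  end
end
```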

$n$ = number of features; the vector $\theta$ contains all the theta parameters, including $\theta_0$. The hypothesis becomes $h_\theta(x) = \theta_0x_0 + \theta_1x_1 + \cdots + \theta_nx_n$, where $x_0 = 1$

Repeat the following **simultaneously** for every $j$ until the minimum is reached:

$\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$

$x_j^{(i)}$ can be thought of as elements of the vector $x^{(i)}$; likewise $\theta_j$ for the vector $\theta$

**Feature scaling** makes gradient descent faster
`Example`

$x_1$ is between 0 and 100,000
$x_2$ is between 0 and 5
$x_3$ is between 0 and 0.0001

Running gradient descent without scaling is very slow. Appropriate scaling in this case: $x_1 := \frac{x_1}{100000}$, $x_2 := \frac{x_2}{5}$, $x_3 := \frac{x_3}{0.0001}$, so that every feature lands roughly in the range $0 \leq x_j \leq 1$

**Mean normalization** is used to normalize the values to be centered around 0:

$x_i := \frac{x_i - \mu_i}{\max_i - \min_i}$

Where $x_i$ is the original non-normalized value, $\mu_i$ is the average of the feature set, $\max_i$ is the maximum value of the feature set, and $\min_i$ is the minimum value of the feature set
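
A short Octave sketch of mean normalization over a whole design matrix (variable names are illustrative; relies on automatic broadcasting):

```matlab
% Mean-normalize each feature column of X (m examples x n features).
mu = mean(X);                  % row vector of per-feature averages
range = max(X) - min(X);       % per-feature ranges (max - min)
X_norm = (X - mu) ./ range;    % center each feature, then scale by its range
```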

Basically multivariate linear regression, except the features are the square, cube, ... of $x$. This represents a graph that is not linear but instead polynomial (duh)
`Example`

Where $h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3$, $x_1 = x$, $x_2 = x^2$, and $x_3 = x^3$
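
In Octave, the polynomial features can simply be added as extra columns before running the same gradient descent (a sketch; `x` is assumed to be a column vector):

```matlab
% Build polynomial features from a single input column x.
X_poly = [x, x.^2, x.^3];   % x1 = x, x2 = x^2, x3 = x^3
% Feature scaling matters even more here, since x.^3 dwarfs x.
```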

**Normal Equation** is an alternative method to gradient descent; it is much faster unless there are a lot of features (e.g. $n > 10000$):

$\theta = \left(X^TX\right)^{-1}X^Ty$

`Example`

$x_1$ | $x_2$ | $x_3$ | $y$
---|---|---|---
64 | 25 | 59 | 1600
48 | 11 | 100 | 800
5 | 88 | 76 | 450

All feature and example values are stored in a design matrix $X$, with an added column of 1s in front because that accounts for $x_0$, which is always 1. All output example values are stored in a vector $y$. We then plug these into the formula $\theta = (X^TX)^{-1}X^Ty$, which will return $\theta$ in the form of a vector

Note: rows of the design matrix are the training examples transposed. In this case,

$X = \begin{bmatrix} 1 & 64 & 25 & 59 \\ 1 & 48 & 11 & 100 \\ 1 & 5 & 88 & 76 \end{bmatrix}, \quad y = \begin{bmatrix} 1600 \\ 800 \\ 450 \end{bmatrix}$

To compute the matrix inverse, use the `pinv()` function, as certain matrices inside the calculation may not be invertible
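
In Octave, the whole computation is one line (assuming `X` already has the leading column of 1s):

```matlab
theta = pinv(X' * X) * X' * y;   % normal equation; pinv tolerates a non-invertible X'X
```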

$h_\theta(x) = \theta^Tx$

Where the hypothesis equals the transposed parameter vector $\theta$ multiplied by the feature vector $x$

Alternatively:

$h = X\theta$

Where the hypothesis vector equals the matrix multiplication of $X$, the matrix containing all features and sample sets, and the vector $\theta$

The vectorized cost function for the multivariate case can be written as:

$J(\theta) = \frac{1}{2m}\left(X\theta - y\right)^T\left(X\theta - y\right)$

Where $X$ is the matrix containing the feature sets, $\theta$ is a vector containing $\theta_0, \ldots, \theta_n$, and $y$ is a vector containing all the output values

*Alternatively, this is the same thing as:*

$J(\theta) = \frac{1}{2m}\operatorname{sum}\left(\left(X\theta - y\right)^{2}\right)$

Where the squaring part is actually computed element-wise (`.^` in Octave)

$\theta := \theta - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}$

Where $\theta$ is the vector changed during gradient descent, and $h_\theta(x^{(i)})$ is a value calculated from $\theta^Tx^{(i)}$, where $x^{(i)}$ is a vector containing one set of feature values. The hypothesis may be vectorized (see above)

*Alternatively:*

$\theta := \theta - \frac{\alpha}{m}X^T\left(X\theta - y\right)$

This is basically the same thing as above, except we changed the vector $x^{(i)}$ to the matrix $X$ that contains all the values, and changed the hypothesis to a vector $X\theta$ that holds all the predictions
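
Putting the vectorized cost and update together in Octave (a sketch; the surrounding loop and convergence check are omitted):

```matlab
% Vectorized cost and one gradient descent step for linear regression.
J = (1 / (2*m)) * sum((X*theta - y) .^ 2);         % cost J(theta)
theta = theta - (alpha / m) * X' * (X*theta - y);  % simultaneous update of all theta_j
```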

**Logistic regression** is used for classification problems and is (despite the name) not associated with regression problems. Using linear regression isn't a good idea for classification problems

**Binary Classification:** $y \in \{0, 1\}$

Where the output has only two outcomes (e.g. 0 or 1, true or false)

**Multiclass Classification:** $y \in \{0, 1, 2, \ldots\}$

Where the output can take more than two discrete values (covered in Multiclass Classification below)

**Sigmoid / Logistic function:**

$h_\theta(x) = g(\theta^Tx), \quad g(z) = \frac{1}{1 + e^{-z}}$

$h_\theta(x)$ outputs the probability of $y = 1$ on input $x$ as a value between 0 and 1, which is what we want for binary classification
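
A sigmoid helper in Octave (element-wise, so it also accepts vectors and matrices):

```matlab
% Logistic (sigmoid) function, applied element-wise.
function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));
end
```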

`Example`

$h_\theta(x) = 0.3$ means that the probability of $y = 1$, given features $x$ and parameterized by $\theta$, is 30%: $P(y = 1 \mid x; \theta) = 0.3$. This also means that the probability (in binary classification) of $y = 0$ is 70%

If we predict $y = 1$ when $h_\theta(x) \geq 0.5$, then we predict $y = 1$ exactly when $\theta^Tx \geq 0$

**Decision Boundary** is a property of the hypothesis $h_\theta(x)$ where it separates the different outcomes of the classification

`Example`

$h_\theta(x) = g(\theta_0 + \theta_1x_1 + \theta_2x_2)$ and $\theta = \begin{bmatrix} -3 \\ 1 \\ 1 \end{bmatrix}$

So plugging in and doing the inequality to predict when $y = 1$ or $y = 0$, we get $-3 + x_1 + x_2 \geq 0$, or $x_1 + x_2 \geq 3$. This means that we predict $y = 1$ when $x_1 + x_2 \geq 3$

Also, $x_1 + x_2 = 3$ is a line that acts as the decision boundary between the two outcomes. The points on the line correspond directly to where $h_\theta(x) = 0.5$

The decision boundary is not limited to lines. Polynomial terms, for example, allow for more complex boundaries, including ellipses and curves

`Example`

$h_\theta(x) = g(\theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_1^2 + \theta_4x_2^2)$ and $\theta = \begin{bmatrix} -1 \\ 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}$

This makes the boundary $x_1^2 + x_2^2 = 1$, which is a circle about the origin with radius 1

`Recall`

the cost function for linear regression with $\frac{1}{2}$ moved inside the summation:

$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$

We replace $\frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$ with $\mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)})$ so we have:

$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)})$

Running this linear regression cost function with the logistic regression hypothesis ($h_\theta(x) = g(\theta^Tx)$) results in a **non-convex** function of $\theta$ that has many local optima, so running gradient descent is not guaranteed to reach the global minimum. Instead we use this new cost function:

$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$

*Or simply:*

$\mathrm{Cost}(h_\theta(x), y) = -y\log(h_\theta(x)) - (1 - y)\log(1 - h_\theta(x))$

We can then plug this into the overall logistic regression cost function

$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))\right]$

to fit the parameters $\theta$ by minimizing $J(\theta)$, and to make predictions given a new $x$

`Recall`

gradient descent from linear regression: repeat $\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)$ while simultaneously updating all $\theta_j$. Doing the partial derivatives we get:

$\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$

This is identical to the gradient descent algorithm for linear regression, except that here $h_\theta(x) = \frac{1}{1 + e^{-\theta^Tx}}$ instead of $h_\theta(x) = \theta^Tx$

The gradient descent can be vectorized by using the vectors $\theta$ and $y$ and the design matrix $X$:

$\theta := \theta - \frac{\alpha}{m}X^T\left(g(X\theta) - y\right)$
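
A vectorized Octave sketch of the logistic cost and one gradient step (assuming the `sigmoid` helper above and a design matrix `X` with the leading column of 1s):

```matlab
% Vectorized logistic regression cost and one gradient descent step.
h = sigmoid(X * theta);                              % predictions, each in (0, 1)
J = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h));  % logistic cost J(theta)
theta = theta - (alpha / m) * X' * (h - y);          % simultaneous update
```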

$y$ can have more than two outcomes ($y \in \{0, 1, \ldots, n\}$)

**One vs. All** method trains a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$. For predictions with a new input $x$, choose the class $i$ that maximizes $h_\theta^{(i)}(x)$

Where the superscript $(i)$ indexes which class it is.

For a multiclass classification problem with $k$ classes, $k$ different logistic regression classifiers will be trained (one decision boundary per class)
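
The prediction step can be vectorized too. A sketch, assuming a hypothetical `allTheta` matrix whose $i$-th row holds the trained parameters of classifier $i$:

```matlab
% Predict a class for each example in X using the k trained classifiers.
probs = sigmoid(X * allTheta');        % m x k matrix of per-class probabilities
[~, predictions] = max(probs, [], 2);  % choose the class with the highest probability
```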

Gradient descent is not always the most efficient algorithm for optimizing / minimizing $J(\theta)$. Others include *Conjugate Gradient*, *BFGS*, and *L-BFGS*. They are more complex than gradient descent, but often run much faster

`Example`

To minimize $J(\theta) = (\theta_1 - 5)^2 + (\theta_2 - 5)^2$,

the following code shows how to implement the optimization in MATLAB/Octave

```matlab
% --- costFunction.m ---
function [jVal, gradient] = costFunction(theta)
  jVal = (theta(1)-5)^2 + (theta(2)-5)^2;   % J(theta)
  gradient = zeros(2,1);
  gradient(1) = 2*(theta(1)-5);             % dJ/dtheta_1
  gradient(2) = 2*(theta(2)-5);             % dJ/dtheta_2
end

% --- at the Octave prompt ---
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
```

We implement the `costFunction` function that returns two things: `jVal` is the cost, and `gradient` is a vector that corresponds to the two partial derivatives. The advanced optimization function `fminunc` is then called after `costFunction` is set up. It takes a pointer to the cost function (`@costFunction`) and some other options. `fminunc` should return the optimal $\theta$

**Underfitting** or **high bias** is when the hypothesis doesn't fit the training set very well. Usually caused by too few features

**Overfitting** or **high variance** fits the training examples very well, but fails to generalize enough to predict future samples. Usually caused by too many features

To address overfitting:

Reduce the number of features (reducing features discards information about the problem)

- Manually select features to remove
- Model selection algorithm

Regularization

- Keep all the features, but reduce the magnitude of the parameters $\theta_j$

**Regularization** reduces the magnitude of the parameters $\theta_j$, which allows for a "simpler" hypothesis that is less prone to overfitting. So we modify the cost function:

$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$

We added a term at the end to regularize all the magnitudes, where $\lambda$ is the regularization parameter that keeps the $\theta_j$ small. If $\lambda$ is too large, the algorithm will result in underfitting

Note: $j$ starts from 1 because, by convention, we don't need to regularize $\theta_0$

Modified gradient descent for linear regression:

Repeat {

$\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$

$\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right] \quad (j = 1, 2, \ldots, n)$

}

Since we don't need to regularize $\theta_0$, we separate its update. For every other $\theta_j$, we take the partial derivative of the new regularized $J(\theta)$

*Grouping the $\theta_j$ terms together, we can write:*

$\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$

$\left(1 - \alpha\frac{\lambda}{m}\right) < 1$, which is multiplied into $\theta_j$ and makes it smaller. The second part of the equation is identical to the unregularized update
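
The vectorized form of this regularized update in Octave might look like the following sketch (note that $\theta_0$, stored as `theta(1)`, is excluded from the shrinking term):

```matlab
% One step of regularized gradient descent for linear regression.
grad = (1/m) * X' * (X*theta - y);   % unregularized gradient
reg = (lambda/m) * theta;
reg(1) = 0;                          % by convention, don't regularize theta_0
theta = theta - alpha * (grad + reg);
```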

`Recall`

the normal equation:

$\theta = \left(X^TX\right)^{-1}X^Ty$

The modified normal equation adds an extra term: $\lambda$ multiplied by an identity matrix, except with 0 for the element in the first column and first row:

$\theta = \left(X^TX + \lambda L\right)^{-1}X^Ty, \quad L = \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix}$

If $m \leq n$, then $X^TX$ is non-invertible; regularization fixes this as long as $\lambda > 0$. So $X^TX + \lambda L$, where $L$ is our "modified identity matrix", is invertible
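
In Octave (a sketch; `n` is the number of features, so `L` is $(n+1) \times (n+1)$):

```matlab
% Regularized normal equation.
L = eye(n + 1);
L(1, 1) = 0;                                 % don't regularize theta_0
theta = pinv(X' * X + lambda * L) * X' * y;
```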

`Recall`

the cost function for logistic regression; we add the term $\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$ that applies the regularization:

$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$

`Recall`

gradient descent for logistic regression. As with regularized linear regression gradient descent, we separate the operations for $\theta_0$ and the other $\theta_j$. The gradient descent update is the same as for regularized linear regression, except that $h_\theta(x)$ is the logistic regression hypothesis

`Recall`

advanced optimization

```matlab
function [jVal, gradient] = costFunction(theta)
  jVal = % code to compute the regularized J(theta)
  gradient(1) = % partial derivative of J(theta) w.r.t. theta_0 (no regularization term)
  gradient(2) = % partial derivative of J(theta) w.r.t. theta_1
  gradient(3) = % partial derivative of J(theta) w.r.t. theta_2
  % ...
  gradient(n+1) = % partial derivative of J(theta) w.r.t. theta_n
end
```
The code is very similar to the unregularized version, except that the code to compute the cost function and its partial derivatives has changed

This `costFunction` can be used in the `fminunc` function to get the regularized hypothesis
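
As a concrete sketch of that template for regularized logistic regression (assuming the `sigmoid` helper from earlier; the function name `costFunctionReg` and passing the data via an anonymous function are my own choices):

```matlab
% Regularized logistic regression cost and gradient, suitable for fminunc.
function [jVal, gradient] = costFunctionReg(theta, X, y, lambda)
  m = length(y);
  h = sigmoid(X * theta);
  jVal = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h)) ...
         + (lambda / (2*m)) * sum(theta(2:end) .^ 2);    % skip theta_0
  gradient = (1/m) * X' * (h - y);
  gradient(2:end) = gradient(2:end) + (lambda/m) * theta(2:end);
end

% Usage: wrap it so fminunc sees a function of theta alone.
% options = optimset('GradObj', 'on', 'MaxIter', 100);
% optTheta = fminunc(@(t) costFunctionReg(t, X, y, lambda), initialTheta, options);
```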