Machine Learning Notes

Machine Learning NotesSupervised vs Unsupervised LearningClassification vs Regression ProblemMonovarious Linear RegressionHypothesisCost FunctionGradient DescentMultivarious Linear RegressionGradient DescentFeature ScalingMean NormalizationPolynomial RegressionNormal EquationLinear Regression VectorizationHypothesisCost FunctionGradient DescentLogistic RegressionHypothesisDecision BoundaryCost FunctionGradient DescentVectorizationMulticlass ClassificationAdvanced OptimizationOverfitting & UnderfitingRegularizationRegularized Linear RegressionRegularized Gradient DescentRegularized Normal EquationRegularized Logistic RegressionRegularized Advanced Optimization

Supervised vs Unsupervised Learning

Classification vs Regression Problem

Monovarious Linear Regression

= number of traning examples 's = "input" variable / features 's = "output" variable / "target" variable

= one training example = th training example


Input ()Output ()

Number of training examples is 4: is 2 is the third training example, so


Hypothesis maps 's to 's

Cost Function

Cost function used to choose and by minimizing the difference of the hypothesized output variable and .

We need to minimize by varying and , where is the hypothesis

Gradient Descent

Repeat the following operation until the cost function is minimized: Example

Where and is the learning rate, two statements are computed simultaneously. Repeat until gradient descent reaches local minimum

Multivarious Linear Regression

= number of features contains all theta parameters including

Gradient Descent

Repeat the following simulataneously until reached minimum:

can be elements of . Likewise for

Feature Scaling

Feature scaling makes gradient descent faster Example is between 0 and 100,000 is between 0 and 5 is between 0 and 0.0001

Running gradient descent without scaling is very slow Appropriate scaling in this case:

Mean Normalization

Used to normalize the values to be centered around 0

Where is the original non-normalized value, is the average of the feature set, is the maximum value of the feature set, and is the minimum value of the feature set.

Polynomial Regression

Basically multivarious linear regression except are square, cube... of . Which will represent a graph that is non linear, but instead polynomial (duh) Example

Where , , , and

Normal Equation

Normal Equation is an alternative method to gradient descent, it is much faster unless there are a lot of features (eg. )



All features and example values are stored in a design matrix with an added column of 1 in front because that accounts for which is always 1 All output example values are stored in vector We then plug these into the formula which will return theta in the form of a vector

Note: Rows of design matrix are training sets transposed. In this case,

To operate inverse on matrices, use pinv() function as certain matrices inside the calculation may not be invertable

Linear Regression Vectorization


Where hypothesis equals to transposed vector multiplied by vector Alternatively:

Where hypothesis equals to the matrix multiplication of , the matrix containing all features and sample sets, and vector

Cost Function

Vectorized form for multivarious cases can written as:

Where is the matrix containing the feature sets, is a vector containing and is a vector containing all the values

Alternatively, this is the same thing as:

Where the squaring part is actually element-wise

Gradient Descent

Where is a vector changed during gradient descent. is a vector calculated from where is a vector containing one set of feature values. Hypothesis may be vectorized (reference above)


This is basically the same thing as above except we changed vector to matrix that contains all values, and changed variable to a vector that holds all

Logistic Regression

Logistic regression is used for classification problems, and (despite the name) not associated with regression problems. Using linear regression isn't a good idea for classification problems

Binary Classification:

Where the output has only two outcomes (eg. 0 or 1, true or false)

Multiclass Classfication:


Sigmoid / Logistic function:

outputs probability of on input between 0 and 1, which is what we want for binary classification


This means that the probability of given features and parameterized by is 30%. This also means that the probablity (in binary classification) of is 70%

If we predict when , then

Decision Boundary

Decision Boundary is a property of where it separates different outcomes of classifications



So pluggin in and doing the inequality to predict when or , we get or . This means that we predict when

Also, is a line that act as the decision boundary between the two outcomes. The points on the line corresponds directly to where

Decision boundary is not limited to lines. Polynomial regressions, for example allows for more complex boundaries including elipses and curves



This makes the boundary , which is a circle about the origin with radius

Cost Function

Recall cost function for linear regression with moved inside the summation:

We replace with so we have:

Running this linear regression cost functions with logistic regression hypothesis () will result in non-convex function that has many local optima for . So running gradient descent does not guarentee to reach the global minimum. Instead we use this new cost function:

Or simply:

We can then plug this into the overall logistic regression cost function , to fit parameters and minimize , to make prediction given new

Gradient Descent

Recall Gradient descent from linear regression: repeat while simultaneously update all . Doing the partial derivatives we get:

This is identical to the gradient descent algorithm for linear regression, except where instead


The gradient descent can be vectorized by using vectors and :

Multiclass Classification

can be more than two outcomes ()

One vs. All method trains a logistic regression classifier for each class to predict the probability that . For predictions with new input , choose class that maximizes

Where the superscript indexes which class it is.

For a multiclass classification problem with amount of classes, different logistic regression classifier will be trained ( decision boundaries for each class)

Advanced Optimization

Gradient descent is not always the most efficient algorithm for optimizing / minimizing . Others include Conjugate Descent, BFGS, L-BFGS. They are more complex than gradient descent, but often run much faster


To minimize ,

The following code is how to implement the descent in MATLAB/Octave

function [jVal, gradient] = costFunction(theta)
    jVal = (theta(1)-5)^2 + (theta(2)-5)^2;
    gradient = zeros(2,1);
    gradient(1) = 2*(theta(1)-5);
    gradient(2) = 2*(theta(2)-5);
options = optimset('GradObj', 'on', 'MaxIter', '100');
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);

We implements the costFunction function that returns two things: jVal is the cost and gradient is a vector that corresponds to the two partial derivates

Advanced optimization function fminunc is then called after costFunction is setup. It takes a pointer to the cost function (@costFunction) and some other options

fminunc should return the optimal

Overfitting & Underfiting

Underfitting or high bias is when hypothesis doesn't fit training set very well. Usually too few features

Overfitting or high variance fits the traning example very well, but fails to generalize enough to predict future samples. Usually too many features

To address overfitting:

  1. Reduce number of features (reducing features discards information about the problem)

    • Manually select features to remove
    • Model selection algorithm
  2. Regularization

    • Keep all the features, but reduce the magnitude of parameter


Regularization reduces the magnitude of parameters which allows for "simpler" hypothesis and less prone to overfitting. So we modify the cost function:

Added a term at the end to regularize all the magnitudes, where is the regularization parameter that keeps the small. If is too large, the algorithm will result in underfitting

Note: starts from 1 because by convention, we don't need to regularize

Regularized Linear Regression

Regularized Gradient Descent

Modified gradient descent for linear regression:

Repeat {


Since we don't need to regularize we can separate it. For every other , we take the partial derivative of the new regularized

Grouping the terms together, we can write:

for which is multiplied to and makes it smaller. The second part of the equation is identical to unregularized

Regularized Normal Equation

Recall normal equation:

Modified normal equation adds an extra term that is multiplied by a identity matrix except with 0 for the element with first column and first row:

If , then is non-invertable, regularization fixes this as long as . So , where is our "modified indentity matrix", is invertable

Regularized Logistic Regression

Recall cost function for logistic regression, we add the term that applies the regularization

Recall gradient descent for logic regression. Same as regularized linear regression gradient descent, we separate the operations for and other . The gradient descent is the same as regularized linear regression except is for logistic regression

Regularized Advanced Optimization

Recall advanced optimization

function [jVal, gradient] = costFunction(theta)
    jVal = %code to compute J(theta)
    gradient(1) = %p.deriv of J(theta) to theta_0
    gradient(2) = %p.deriv of J(theta) to theta_1
    gradient(3) = %p.deriv of J(theta) to theta_2
    gradient(n+1) = %p.deriv of J(theta) to theta_n

The code is very similar to the unregularized, except the code to computer the cost function and the derivative of the cost function changed

This costFunction can be used in fminunc function to get the regularized hypothesis