Multiple Linear Regression
$n$ = number of features
$m$ = number of training examples
$x^{(i)}$ = input features of the $i^{th}$ training example
$x_j^{(i)}$ = value of feature $j$ in $i^{th}$ training example
$$ h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 +...+ \theta_n x_n $$ For convenience of notation, define $x_0 = 1$. In vector notation $$h_\theta (x) = \theta^T x $$ where $$ x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_{n} \\ \end{bmatrix} \in \mathbb{R}^{n+1} $$ and $$ \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_{n} \\ \end{bmatrix} \in \mathbb{R}^{n+1} $$
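As a quick sketch in Python (rather than the Matlab used later in these notes), here is the vectorized hypothesis evaluated on a toy training set, together with the gradient descent update covered in the next section. All data values are made up; the target was generated from $y = 1 + 2x_1$.

```python
import numpy as np

# Hypothetical training set: m = 4 examples, with x0 = 1 already prepended
# as the first column so that h_theta(x) = theta^T x.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])   # generated from y = 1 + 2*x1
m, n1 = X.shape

theta = np.zeros(n1)
alpha = 0.1                          # learning rate
for _ in range(1000):
    h = X @ theta                    # theta^T x for every example at once
    # simultaneous update of every theta_j: theta_j -= alpha * dJ/dtheta_j
    theta = theta - alpha * (X.T @ (h - y)) / m
```

After the loop, `theta` should be close to the generating parameters $(1, 2)$.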
Gradient Descent
Cost function: $$ J(\theta) = \frac{1}{2m} \sum_{i=1}^m ( h_{\theta} ( x^{(i)} ) - y^{(i)} )^2 $$ Gradient descent: repeat $$ \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J (\theta) $$ (simultaneously updating for every $j=0,...,n$), where $\alpha$ is the learning rate.
Feature Scaling
We want the features to be on a similar scale. Features on very different scales create problems for the optimizer. The goal of feature scaling is to get every feature into approximately a $-1 \leq x_i \leq 1$ range. We normalize the features by subtracting the mean and dividing by some measure of dispersion (range or standard deviation): $$ \tilde{x}_1 = \frac{x_1 - \mu_1}{s_1} $$ where $\mu_1$ is the mean of feature $x_1$ and $s_1$ is a dispersion estimate.
Polynomial Regression
We can learn nonlinear relationships by including polynomial features in the regression equation. For example, instead of $$ h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4 $$ we could have $$ h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2^3 + \theta_3 x_3^2 + \theta_4 \sqrt{x_4} $$
Classification
Each training example is labeled as positive or negative: $y \in \{0,1\}$, where 0 is the negative class and 1 is the positive class. The aim of a classification algorithm is to predict the class for an out-of-sample input vector. Logistic regression is one popular classification algorithm. For standard regression $h_\theta (x)$ can be $> 1$ or $< 0$; for logistic regression, $0 \leq h_\theta (x) \leq 1$. Here $h_\theta (x) = g (\theta^T x)$, where $g$ is the sigmoid function, defined as $g(z) = \frac{1}{1+e^{-z}}$. $h_\theta (x)$ is interpreted as an estimate of the probability that $y=1$ on input $x$. The parameters of the model define a decision boundary which separates positive from negative examples. Complex nonlinear decision boundaries can be obtained by including polynomial features in the logistic regression equation. Cost function: $$ \text{Cost} ( h_\theta (x),y) = -y \log (h_\theta (x)) - (1-y) \log ( 1 - h_\theta (x) ) $$ Logistic regression is designed for binary classification problems. For multi-class problems one may adopt a multinomial version of the logistic regression model. Alternatively, the simple one-versus-all technique can be applied: train a logistic regression classifier $h_\theta^{(i)} (x)$ for each class $i$ to predict the probability that $y=i$. To make a prediction on a new input $x$, pick the class $i$ that maximizes $h_\theta^{(i)} (x)$.
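A minimal Python sketch of the sigmoid and the per-example cost above, with made-up values, showing that the cost is small when a confident prediction is correct and large when it is wrong:

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def cost(h, y):
    # Cost(h_theta(x), y) = -y*log(h) - (1-y)*log(1-h)
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

g0 = sigmoid(0.0)                # g(0) = 0.5, the usual decision threshold
confident_right = cost(0.9, 1)   # small cost: h near 1 when y = 1
confident_wrong = cost(0.9, 0)   # large cost: h near 1 when y = 0
```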
Overfitting
If we have too many features, the learned hypothesis may fit the training set very well but fail to generalize to new examples. We can address overfitting by reducing the number of features, either manually or using a model selection algorithm. A second possible remedy is to apply regularization, where we keep all features but reduce the magnitude of the parameters $\theta_j$. Regularization works well when we have lots of features, each of which contributes a bit to predicting $y$. For standard linear regression, an example regularized cost function is $$ J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^m ( h_{\theta} ( x^{(i)} ) - y^{(i)} )^2 + \lambda \sum_{j=1}^n \theta_j^2 \right] $$ where $\lambda$ is the regularization parameter. Setting $\lambda$ too large results in underfitting.
Debugging a Learning Algorithm
run diagnostics to identify the problem
fix the problem
Problem:
Overfitting (high variance), e.g. model too complex
Underfitting (high bias), e.g. model too simple
Diagnostic (Bias vs. variance is one common diagnostic)
Variance: training error much lower than test error
Bias: training error is also high
Bias vs. Variance
plot learning curves.
x-axis (training set size), y-axis (error) … plot test and training error
typical learning curve for high variance:
test error still decreasing as the number of examples increases; suggests a larger training set will help
large gap between training and test error
typical learning curve for high bias:
both training and test error are high
small gap between training and test error
Fixes to try:
get more training examples. Fixes high variance
smaller set of features. Fixes high variance
larger set of features. Fixes high bias
change features. Fixes high bias
run the optimization algorithm for more iterations. Fixes optimization algorithm
try a different learning algorithm (gradient descent, stochastic gradient descent, numerical method e.g. Newton's method). Fixes optimization algorithm
change the learning rate. Fixes optimization objective
try a different algorithm e.g. SVM instead of logistic regression. Fixes optimization objective
try increasing $\lambda$ (fixes high variance) or decreasing $\lambda$ (fixes high bias)
try adding polynomial features. Fixes high bias
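The bias/variance diagnostics above can be turned into a rough rule of thumb. This is only a sketch: the thresholds here are arbitrary assumptions, not part of the notes.

```python
def diagnose(train_error, test_error, gap_ratio=2.0, high_error=1.0):
    """Crude bias/variance diagnostic.

    gap_ratio and high_error are made-up thresholds; in practice you would
    judge the learning curves by eye rather than hard-code cutoffs."""
    if test_error > gap_ratio * train_error:
        return "high variance"   # training error much lower than test error
    if train_error > high_error:
        return "high bias"       # training error itself is high
    return "ok"
```

For example, a training error of 0.1 with a test error of 0.5 would be flagged as high variance, while train/test errors of 2.0 and 2.2 would be flagged as high bias.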
Training/Testing Procedure
Learn parameters $\theta$ from the training data (minimizing the training error $J (\theta)$). Then compute the test set error which, for linear regression, is defined as $$ J_{\text{test}} (\theta) = \frac{1}{2m_{\text{test}}} \sum_{i=1}^{m_{\text{test}}} ( h_{\theta} ( x^{(i)}_{\text{test}} ) - y^{(i)}_{\text{test}} )^2 $$ For logistic regression the test error is given by $$ J_{\text{test}} (\theta) = - \frac{1}{m_{\text{test}}} \sum_{i=1}^{m_{\text{test}}} \left[ y^{(i)}_{\text{test}} \log h_\theta (x^{(i)}_{\text{test}}) + (1- y^{(i)}_{\text{test}}) \log ( 1 - h_\theta (x^{(i)}_{\text{test}}) ) \right] $$ For classification problems we can also use the misclassification error (0/1 misclassification error), defined as $$ \text{Test error} = \frac{1}{m_{\text{test}}} \sum_{i=1}^{m_{\text{test}}} err( h_\theta (x^{(i)}_{\text{test}} ),y^{(i)}_{\text{test}} ) $$ where the error function $err(h_\theta (x),y)$ is given by $$ err(h_\theta (x),y) = \begin{cases} 1 &\mbox{if } h_\theta (x) \geq 0.5, \text{ given } y=0 \\ 1 &\mbox{if } h_\theta (x) < 0.5, \text{ given } y=1 \\ 0 & \text{ otherwise } \end{cases} $$
Model Selection
For example, suppose we want to choose the degree of polynomial to use in a regression equation. Fit a model to the training set for each degree in a range by minimizing the training error $J_{\text{train}}(\theta)$, storing the test error $J_{\text{test}}(\theta)$ for each model. Choose the model with the lowest test error. The problem with this approach is that the degree parameter is now fit to the test set, so the reported test error is likely to be optimistic. A better approach is to split the examples into training, cross validation, and test sets. Fit each model to the training set and store the cross validation error $J_{\text{cv}}(\theta)$. Select the model with the lowest validation error and, finally, report the test error for this model.
Diagnosing Bias versus Variance
Plot the training and cross validation error as a function of the parameter you want to select (e.g. the degree of polynomial as in the previous section). Select the parameter value at the trough of the cross validation error curve; at this point either a lower or higher parameter value results in an increase in the cross validation error. A bias (underfit) problem is typically characterized by a high $J_{\text{train}}(\theta)$ and a small difference between $J_{\text{cv}}(\theta)$ and $J_{\text{train}}(\theta)$. Conversely, a variance (overfit) problem results in a low $J_{\text{train}}(\theta)$ and a $J_{\text{cv}}(\theta)$ significantly higher than $J_{\text{train}}(\theta)$.
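As a sketch of choosing the polynomial degree by cross validation error: the synthetic data below is entirely made up, generated from a quadratic plus a little noise, so degree 1 should underfit badly while higher degrees fit well.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 200)
# true relationship is quadratic (hypothetical data)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.1 * rng.standard_normal(200)

# split into training and cross validation sets
x_tr, y_tr = x[:120], y[:120]
x_cv, y_cv = x[120:], y[120:]

cv_error = {}
for d in range(1, 7):
    coeffs = np.polyfit(x_tr, y_tr, d)       # minimize training error at degree d
    resid = np.polyval(coeffs, x_cv) - y_cv
    cv_error[d] = np.mean(resid**2) / 2      # J_cv for this degree

best_degree = min(cv_error, key=cv_error.get)
```

Here the degree-1 model sits far up the left of the cross validation curve (high bias), and `best_degree` lands at the trough.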
Regularization and Bias/Variance
A large $\lambda$ compresses the parameters toward zero, resulting in a biased model which underfits the data. A small $\lambda$ puts little cost on the size of the parameters, resulting in a high variance model that overfits the data. Our aim is to choose a value of $\lambda$ that achieves a reasonable bias/variance tradeoff. We apply the train/validate/test procedure outlined in the Model Selection section to choose the regularization parameter $\lambda$.
Learning Curves
Learning curves can be used to identify a bias/variance problem. We plot $J_{\text{cv}}(\theta)$ and $J_{\text{train}}(\theta)$ as a function of the training set size $m$. A small gap between the validation and training error curves, with both errors high, suggests a high bias problem. If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much. Conversely, a high variance problem is characterized by a big gap between $J_{\text{cv}}(\theta)$ and $J_{\text{train}}(\theta)$. If a learning algorithm is suffering from high variance, getting more training data is likely to help.
Error Analysis
Recommended approach:
- Start with a simple algorithm that you can implement quickly. Implement it and test it on your cross-validation data.
- Plot learning curves to decide if more data, more features, etc. are likely to help.
- Error analysis: Manually examine the examples (in cross validation set) that your algorithm made errors on. See if you spot any systematic trend in what type of examples it is making errors on.
Error Metrics for Skewed Classes
If 99% of the examples belong to the negative class, a naive model that always outputs $y=0$ would yield excellent performance as measured by accuracy. Thus, measures such as accuracy are not suitable for skewed classes. Precision and recall are two alternatives.
Precision $$ \text{Precision } = \frac{\text{True Positives}}{\text{# Predicted Positive}} = \frac{\text{True Positives}}{\text{True Positives + False Positives}} $$
Recall $$ \text{Recall } = \frac{\text{True Positives}}{\text{# Actual Positives}} = \frac{\text{True Positives}}{\text{True Positives + False Negatives}} $$ For a classification algorithm which predicts 1 if $h_\theta (x) \ge \text{threshold}$, increasing the threshold (which was set to 0.5 in our previous example) increases precision but lowers recall. The $F_1$ score is a useful single measure, defined as the harmonic mean of precision (P) and recall (R): $$ F_1 \text{ Score} = 2 \frac{\text{PR}}{\text{P} + \text{R}} $$
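A Python sketch of these metrics, plus the 0/1 misclassification error from the Training/Testing Procedure section, on five made-up predictions; raising the threshold illustrates the precision/recall tradeoff:

```python
import numpy as np

def metrics(h, y, threshold=0.5):
    pred = (h >= threshold).astype(int)     # predict 1 if h_theta(x) >= threshold
    tp = np.sum((pred == 1) & (y == 1))     # true positives
    fp = np.sum((pred == 1) & (y == 0))     # false positives
    fn = np.sum((pred == 0) & (y == 1))     # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    error = np.mean(pred != y)              # 0/1 misclassification error
    return precision, recall, f1, error

h = np.array([0.9, 0.6, 0.4, 0.3, 0.8])    # hypothetical h_theta outputs
y = np.array([1, 0, 0, 1, 1])
p, r, f1, err = metrics(h, y)

# raising the threshold trades recall for precision
p_hi, r_hi, _, _ = metrics(h, y, threshold=0.75)
```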
Data for Machine Learning
"It's not who has the best algorithm that wins. It's who has the most data." A landmark study showed that there wasn't much difference in performance between a number of different classification algorithms once they were trained on a large number of training examples.
Large data rationale: use a learning algorithm with many parameters (e.g. logistic/linear regression with many features; a neural network with many hidden units). This gives a low bias algorithm, so $J_{\text{train}} (\theta)$ will be small. Then use a very large training set, which makes overfitting unlikely, i.e. low variance. The aim is that $J_{\text{test}} (\theta)$ is also small.
Clustering
K-means algorithm
Input:
- K (number of clusters)
- Training set $x^{(1)} ,x^{(2)},...,x^{(m)} $
$x^{(i)} \in \mathbb{R}^{n}$ (we drop the $x_0 = 1$ convention)
Randomly initialize K cluster centroids $\mu_1,\mu_2,...,\mu_K \in \mathbb{R}^{n}$
Repeat {
for i=1 to m
$c^{(i)}$ := index (from 1 to K) of cluster centroid closest to $x^{(i)}$
for k=1 to K
$\mu_k$ := average (mean) of points assigned to cluster k
}
Optimization objective: $$ J(c^{(1)},...,c^{(m)},\mu_1,...,\mu_K ) = \frac{1}{m} \sum_{i=1}^m || x^{(i)} - \mu_{c^{(i)}} ||^2 $$ Minimize the objective function: $$ \displaystyle \min_{\substack{ c^{(1)},...,c^{(m)} \\ \mu_1,...,\mu_K }} J(c^{(1)},...,c^{(m)},\mu_1,...,\mu_K ) $$
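The repeat loop above can be sketched in Python (the toy data, iteration count, and seed are all arbitrary choices for illustration):

```python
import numpy as np

def kmeans(X, K, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    # randomly initialize the K cluster centroids to K distinct training examples
    mu = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        # cluster assignment step: c[i] = index of centroid closest to x^(i)
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        c = d.argmin(axis=1)
        # move centroid step: mean of the points assigned to each cluster
        for k in range(K):
            if np.any(c == k):
                mu[k] = X[c == k].mean(axis=0)
    return c, mu

# two well-separated hypothetical clusters
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
c, mu = kmeans(X, K=2)
```

On this toy data the two points near the origin end up in one cluster and the two points near (5, 5) in the other, which minimizes the objective $J$ above.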
PCA
Our aim is to reduce the dimensionality of a feature set. For example, to reduce from 2 dimensions to 1 dimension we find a direction (a vector $u^{(1)} \in \mathbb{R}^{n}$) onto which to project the data so as to minimize the projection error. More generally, to reduce from n dimensions to k dimensions we find k vectors $u^{(1)},u^{(2)},...,u^{(k)}$ onto which to project the data, so as to minimize the projection error. PCA is a useful algorithm for achieving this.
Algorithm Steps:
- Normalize the features (subtract mean and divide by standard deviation)
- Compute the covariance matrix $$ \Sigma = \frac{1}{m} \sum_{i=1}^m (x^{(i)})(x^{(i)})^T $$ - Compute the eigenvectors of the matrix $\Sigma$. The Matlab code for this is $$ [U,S,V] = \text{svd}(Sigma); $$ where svd is the Matlab function for carrying out a singular value decomposition of a matrix. U is an $n\times n$ matrix with each column being an eigenvector. We define $U_\text{reduce}$ as the $n\times k$ matrix containing the first k eigenvectors. The Matlab code for this is $$ Ureduce = U(:,1:k); $$ - Compute the scores (the original features projected onto the new axes). The Matlab code is $$ z = Ureduce'*x $$ where z is the matrix of scores, x is the feature matrix, and ' is the transpose operator.
So, how do you choose k (the number of principal components)?
Average squared projection error: $$ \displaystyle \frac{1}{m} \sum_{i=1}^{m} || x^{(i)} - x_{\text{approx}}^{(i)} ||^2 $$ Total variation in the data: $$ \displaystyle \frac{1}{m} \sum_{i=1}^{m} || x^{(i)}||^2 $$ Typically, choose k to be the smallest value such that $$ \displaystyle \frac{ \frac{1}{m} \sum_{i=1}^{m} || x^{(i)} - x_{\text{approx}}^{(i)} ||^2 }{\frac{1}{m} \sum_{i=1}^{m} || x^{(i)}||^2} \leq 0.01 $$ so that 99% of the variance is retained. We can change the threshold from 0.01 to some other value. We apply PCA to compress the data (reducing the memory/disk needed to store it and speeding up learning algorithms) and also to visualize it.
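The whole PCA pipeline can be sketched in Python, using the fact that for the covariance matrix the variance-retained ratio equals the cumulative sum of the singular values over their total. The 2-D data is synthetic and mostly varies along one direction, so one component should retain 99% of the variance.

```python
import numpy as np

# hypothetical 2-D data that mostly varies along a single direction
rng = np.random.default_rng(0)
t = rng.standard_normal(100)
X = np.column_stack([t, t + 0.1 * rng.standard_normal(100)])

X = (X - X.mean(axis=0)) / X.std(axis=0)   # normalize the features
Sigma = (X.T @ X) / len(X)                 # covariance matrix
U, S, Vt = np.linalg.svd(Sigma)            # columns of U are the eigenvectors

# smallest k retaining >= 99% of the variance; for the covariance matrix,
# retained variance = sum of the first k singular values over their total
variance_retained = np.cumsum(S) / np.sum(S)
k = int(np.searchsorted(variance_retained, 0.99) + 1)

U_reduce = U[:, :k]                        # n x k matrix of eigenvectors
Z = X @ U_reduce                           # scores: data projected onto k axes
```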