• Machine Learning

    1. Three Parametric Approaches to Classification
      1. Discriminant Functions
        1. Fisher's
        2. Perceptron
        3. SVM
      2. Probabilistic Discriminative Models
        1. Logistic Regression
        2. CRF
      3. Probabilistic Generative Models
        1. Naive Bayes
        2. HMM
    2. Regression
      1. Polynomial Curve Fitting: fit a polynomial curve to the data by minimizing the error between the fit and the ground truth. But there is a problem here: what order should we use? The higher the order, the smaller the objective function on the training data. However, this strategy leads to overfitting.
      2. Regularization. We add a term that penalizes the weights of the polynomial curve, but this term has a parameter. How do we choose its value? Split the data into training, validation, and test sets.
      3. How to handle overfitting? 1. More data. 2. Regularization. 3. A probabilistic model in a Bayesian setting.
      4. Besides the polynomial basis, we can also use Gaussian basis functions, sigmoid basis functions, etc.
      5. How to understand Least Squares and Regularized Least Squares? Least Squares is maximum likelihood; Regularized Least Squares is MAP. Maximum likelihood maximizes p(t|w), so we only care about fitting t as well as possible, which is why it cannot avoid overfitting. Maximum a posteriori maximizes p(w|t), so we also care about w itself (through the prior), which keeps the weights under control.

      6. How do we minimize these objective functions? Gradient descent. There are two types, batch and stochastic: if we have all the data at hand, we can use batch gradient descent, which updates the parameters once per pass over the whole dataset; stochastic gradient descent updates the parameters for every sample. A small sketch of (regularized) least squares on a polynomial basis follows below.
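      To make items 1, 2, and 5 concrete, here is a minimal NumPy sketch that fits a degree-9 polynomial both by plain least squares (maximum likelihood) and by regularized least squares (MAP with a Gaussian prior on w). The noisy sine data and the value of `lam` are illustrative assumptions, not anything fixed by these notes.

      ```python
      import numpy as np

      # A minimal sketch of least squares vs. regularized least squares
      # on a polynomial basis; data and hyperparameters are illustrative.
      rng = np.random.default_rng(0)
      x = np.linspace(0, 1, 10)
      t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)  # noisy targets

      degree = 9      # high order: prone to overfitting without regularization
      lam = 1e-3      # regularization strength (would be chosen on a validation set)

      Phi = np.vander(x, degree + 1, increasing=True)   # design matrix [1, x, x^2, ...]

      # Plain least squares (maximum likelihood)
      w_ml = np.linalg.lstsq(Phi, t, rcond=None)[0]

      # Regularized least squares (MAP with a Gaussian prior on w):
      # w = (lam*I + Phi^T Phi)^-1 Phi^T t
      w_map = np.linalg.solve(lam * np.eye(degree + 1) + Phi.T @ Phi, Phi.T @ t)

      print("ML weight norm: ", np.linalg.norm(w_ml))    # typically large (overfit)
      print("MAP weight norm:", np.linalg.norm(w_map))   # shrunk by the penalty
      ```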

    3. Fisher Linear Discriminant (Discriminant Function)
      1. The idea of discriminant functions: y = w*x + w0; y > 0 means one class, y < 0 means the other. Then how do we learn w?
      2. We can understand y = w*x + w0 like this: project the data x onto one dimension, then use -w0 as the decision threshold. So we want to project x well, meaning the gap between the two classes after projection should be large.
      3. So now we have a criterion for a good w. Fisher's criterion: maximize the between-class separation (after the projection w*x, the distance between the two class means is large) while minimizing the within-class variance (samples of the same class stay close together).
      4. With this method we can find w. How about w0? Assume that after projection the samples of each class follow a Gaussian distribution; we can estimate the means and variances from the training samples, and then place the threshold -w0 at the point y where the two projected class densities are equal, p(y|C1) = p(y|C2). A small sketch follows below.
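      A minimal sketch of Fisher's criterion, assuming two classes and synthetic Gaussian data; placing the threshold midway between the projected means is the simplest case of the equal-density rule above (it holds when the projected variances and priors are equal).

      ```python
      import numpy as np

      # A minimal sketch of Fisher's linear discriminant for two classes.
      # X1 and X2 are illustrative synthetic samples (rows = samples, cols = features).
      rng = np.random.default_rng(0)
      X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
      X2 = rng.normal(loc=[3.0, 2.0], scale=1.0, size=(50, 2))

      m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

      # Within-class scatter S_W = sum over classes of sum_n (x_n - m)(x_n - m)^T
      S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

      # Fisher direction: w proportional to S_W^{-1} (m2 - m1)
      w = np.linalg.solve(S_W, m2 - m1)

      # Threshold: here simply midway between the projected class means,
      # i.e. the equal-density point when the projected Gaussians match in variance.
      w0 = -0.5 * (w @ m1 + w @ m2)

      y = X2 @ w + w0                 # projections of class-2 samples; mostly > 0
      print((y > 0).mean())
      ```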
    4. Perceptron
      1. Again y = w*x + w0: y > 0 means one class, y < 0 means the other. What does the perceptron do to find w?
      2. We want the prediction and the ground truth to agree, so (w*x + w0)t > 0. If a sample is misclassified, then (w*x + w0)t < 0. So we minimize -(w*x + w0)t over the misclassified samples only. The gradient with respect to w is -xt, and the update is w minus the gradient, i.e. w + xt.
      3. Set the two class labels to +1 and -1. If the prediction and the ground truth agree, then y*t > 0 and we are done. If not, y*t < 0 and we want to push it toward being positive, i.e. minimize -y*t. Because -y*t is differentiable in w, we can use gradient descent to solve it.
      4. Kernel version. We no longer compute y = w*x + w0 directly. Instead we compute y(x) = sum_i w_i * (phi(x_i)^T phi(x)) + w0, so each prediction involves the feature-space interaction of x with every training sample. Note that phi(x_i)^T phi(x) is a scalar, and the new w has one component per training sample. Since everything is expressed through inner products, we can replace them with a kernel k(x_i, x).
      5. Proving that a kernel is valid: the Gram matrix it produces must be symmetric and positive semi-definite.
      6. The perceptron solution depends on: 1. the initial values of w and b; 2. the order in which the data points are processed. A small sketch of the update loop follows below.
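      A minimal sketch of the perceptron update rule under the assumptions above (labels in {-1, +1}, linearly separable synthetic data); the learning rate and epoch count are arbitrary choices.

      ```python
      import numpy as np

      # A minimal sketch of the perceptron update rule for labels t in {-1, +1}.
      # Only misclassified samples (where t * (w.x + w0) <= 0) trigger an update.
      def perceptron_train(X, t, epochs=20, lr=1.0):
          w = np.zeros(X.shape[1])
          w0 = 0.0
          for _ in range(epochs):
              for x_n, t_n in zip(X, t):          # order of processing matters
                  if t_n * (x_n @ w + w0) <= 0:   # misclassified
                      w += lr * t_n * x_n         # w <- w + eta * t * x
                      w0 += lr * t_n
          return w, w0

      # Illustrative linearly separable data (an assumption of this sketch).
      rng = np.random.default_rng(0)
      X = np.vstack([rng.normal(-2, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
      t = np.array([-1] * 30 + [1] * 30)
      w, w0 = perceptron_train(X, t)
      print(np.mean(np.sign(X @ w + w0) == t))    # training accuracy
      ```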
    5. SVM
      1. SVM is also built on y = w*x + w0. There are many lines that separate the two classes, but the best one is the one with the largest margin between them. The distance from a point to the plane is |y(x)|/|w|; the margin is the minimum such distance, and we want the margin to be as large as possible.
      2. The absolute scale of y(x) is arbitrary (we can rescale w and w0 freely), so we can fix y(x) = 1 for the closest points. Maximizing the margin 1/|w| then amounts to minimizing |w| (in practice, (1/2)|w|^2).
      3. Dual Problem. We can solve the problem with a quadratic programming package, or we can solve the Lagrangian dual problem. The primal problem is: minimize |w| subject to every sample satisfying t(w*x + w0) >= 1. The Lagrangian is L = (1/2)|w|^2 - sum_n alpha_n (t_n (w*x_n + w0) - 1); the dual is obtained by minimizing L with respect to w and w0 (so the gradients of L with respect to w and w0 are set to zero) and maximizing with respect to alpha. Using those gradient equations we can eliminate w and w0 from the Lagrangian, and the constraint t(w*x + w0) >= 1 turns into alpha_n >= 0 (plus sum_n alpha_n t_n = 0).
      4. KKT conditions: alpha_n >= 0, t_n y(x_n) - 1 >= 0, and alpha_n (t_n y(x_n) - 1) = 0 for every sample.
      5. Kernel version. Because the dual only involves inner products between samples, it is easy to extend to a kernel version.
      6. The point is that once we have the support vectors, the decision function depends only on the test point x and the support samples; the remaining training samples are not needed. And the decision function is in kernel form (kernel values between the test x and the support samples, weighted by their alphas).
      7. Relaxed (soft-margin) version: introduce a slack variable xi_n >= 0 for each sample and add a penalty C * sum_n xi_n to the objective, so some points are allowed to violate the margin.
      8. SVM for Regression
        1. w*x + w0 - epsilon <= ground truth t <= w*x + w0 + epsilon'
        2. To my understanding, the objective is to minimize epsilon + epsilon' (how far the targets fall outside the tube). But why do we still need to minimize |w|? It acts like regularization: it keeps the regression function flat.
      9. SVM for Ranking.
        1. w*x_i > w*x_j when item i should be ranked above item j.
      10. Data Scaling
        1. Different features should be scaled into the same range, e.g. [0, 1] or [-1, +1] (see the sketch below, which scales features before fitting the SVM).
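      A minimal sketch of a soft-margin kernel SVM with feature scaling, using scikit-learn (assumed available); the dataset and the values of C and gamma are illustrative only.

      ```python
      # Soft-margin RBF-kernel SVM with min-max feature scaling.
      from sklearn.datasets import load_breast_cancer
      from sklearn.model_selection import train_test_split
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import MinMaxScaler
      from sklearn.svm import SVC

      X, y = load_breast_cancer(return_X_y=True)
      X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

      # MinMaxScaler maps every feature into [0, 1] (the data-scaling point above);
      # C controls the soft-margin relaxation, gamma the RBF kernel width.
      model = make_pipeline(MinMaxScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
      model.fit(X_tr, y_tr)

      print("test accuracy:", model.score(X_te, y_te))
      print("support vectors per class:", model.named_steps["svc"].n_support_)
      ```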
    6. K-Nearest Neighbor
      1. Based on distance and on the assumption that the label changes smoothly. So there are different versions of KNN for different distance measures, such as Euclidean distance, Hamming distance, and Mahalanobis distance.
      2. Distance-weighted KNN: consider not only how many of the nearest neighbours belong to each class but also how far away they are. So if the 3 closest neighbours of one class tie with 3 neighbours of the other class, compare the total (or inverse-weighted) distances to break the tie.
      3. Kernel version of KNN: as with the perceptron, express the computation through inner products between samples; the kernel value then plays the role of a similarity, roughly 1/distance (a larger kernel value means the samples are closer).
      4. KNN for regression: the predicted value at a point is the average of the targets of its k nearest neighbours. A small sketch of both the classifier and the regressor follows below.
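      A minimal sketch of distance-weighted KNN classification and plain KNN regression with Euclidean distance; k and the toy data are illustrative.

      ```python
      import numpy as np

      # Distance-weighted k-nearest-neighbour classification: weighting by
      # 1/distance breaks ties between classes with the same neighbour count.
      def knn_predict(X_train, y_train, x_query, k=3, eps=1e-12):
          d = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances
          idx = np.argsort(d)[:k]                         # k closest training points
          weights = 1.0 / (d[idx] + eps)                  # inverse-distance weights
          classes = np.unique(y_train[idx])
          scores = [weights[y_train[idx] == c].sum() for c in classes]
          return classes[int(np.argmax(scores))]

      # For regression, the prediction is simply the average of the k targets.
      def knn_regress(X_train, y_train, x_query, k=3):
          d = np.linalg.norm(X_train - x_query, axis=1)
          idx = np.argsort(d)[:k]
          return y_train[idx].mean()

      X_train = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]])
      y_train = np.array([0, 0, 1, 1])
      print(knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3))   # -> 0
      ```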
    7. Feature Selection
      1. Wrapper methods: based on the classification performance of the model itself
        1. Forward method: start from an empty feature set and repeatedly add the feature that improves the model performance the most.
        2. Backward method: start from the full feature set and each time remove the feature whose removal hurts the model the least.
        3. Mixed version: two steps forward, one step backward.
        4. SVM for feature selection: rank features by the magnitude of their weights |w_i|; the features with the smallest |w_i| matter least (so in backward elimination they are removed first).
      2. Filter methods: based on the relation between each feature and the label (see the sketch after this list)
        1. Mutual information
        2. Chi Square
        3. Pearson Correlation Coefficient
        4. Signal to Noise Ratio
        5. T-test
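      A minimal sketch of the filter approach using two of the criteria above (mutual information and chi-square) as implemented in scikit-learn; keeping the top 5 features is an arbitrary choice.

      ```python
      # Score each feature against the label and keep the top k (filter method).
      from sklearn.datasets import load_breast_cancer
      from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

      X, y = load_breast_cancer(return_X_y=True)

      mi_selector = SelectKBest(mutual_info_classif, k=5).fit(X, y)
      chi_selector = SelectKBest(chi2, k=5).fit(X, y)   # chi2 needs non-negative features

      print("top-5 by mutual information:", mi_selector.get_support(indices=True))
      print("top-5 by chi-square:        ", chi_selector.get_support(indices=True))
      ```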
    8. Decision Tree Model
      1. We prefer a shorter tree over a larger one; that is the idea of the ID3 algorithm: at each node, split on the feature with the largest information gain. This is a greedy method (a sketch of the gain computation follows after this section).
      2. The generated tree fits the training data perfectly, which may lead to overfitting.
        1. Methods to reduce overfitting: 1) stop growing the tree before it classifies the training data perfectly; 2) grow the full tree and then prune it.
        2. The problem now is how to decide the right size. Three methods: 1) a validation set; 2) a statistical test; 3) Minimum Description Length.
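      A minimal sketch of the entropy / information-gain computation that ID3 uses to pick a split; the toy weather-style data is illustrative.

      ```python
      import numpy as np

      # gain(feature) = H(labels) - sum_v p(v) * H(labels | feature = v),
      # where H is the entropy over label frequencies.
      def entropy(labels):
          _, counts = np.unique(labels, return_counts=True)
          p = counts / counts.sum()
          return -np.sum(p * np.log2(p))

      def information_gain(feature_column, labels):
          gain = entropy(labels)
          for v in np.unique(feature_column):
              mask = feature_column == v
              gain -= mask.mean() * entropy(labels[mask])
          return gain

      # Toy data: ID3 would split on the feature with the largest gain.
      outlook = np.array(["sunny", "sunny", "rain", "rain", "overcast", "overcast"])
      play    = np.array(["no",    "no",    "yes",  "yes",  "yes",      "yes"])
      print(information_gain(outlook, play))   # ~0.918: the split removes all label uncertainty
      ```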
    9. Naive Bayes
      1. Probabilistic Generative Model: we want p(c|x), but we model p(x|c) and p(c) first because those are what we can estimate, then p(c|x) = p(x|c)*p(c) / p(x), where p(x) = sum_c p(x|c)*p(c). For two classes this posterior can be written as a sigmoid, p(C1|x) = 1/(1+exp(-a)) with a = ln[p(x|C1)p(C1) / (p(x|C2)p(C2))]; if a is a linear function of x, this is a generalized linear model.
      2. Naive Bayes assumes the features are conditionally independent given the class, so p(x|c) is the product of the p(x_i|c).
      3. Re-reading item 1: p(c|x) = p(x|c)*p(c)/p(x). We look for the class c_i that maximizes p(c_i|x), and p(x) is the same for every c_i, so we only have to compare p(x|c)*p(c).
      4. Then how do we train? For discrete features, just count how often each value appears per class in the training set. For continuous features, assume each feature is Gaussian within each class and estimate its mean and variance. A small Gaussian naive Bayes sketch follows below.
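      A minimal sketch of Gaussian naive Bayes as described in item 4: per class, estimate a mean and variance for each feature and compare log p(x|c) + log p(c); the synthetic data is illustrative.

      ```python
      import numpy as np

      # Fit: per class, store the feature means, feature variances, and the prior.
      def fit_gnb(X, y):
          params = {}
          for c in np.unique(y):
              Xc = X[y == c]
              params[c] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-9, len(Xc) / len(X))
          return params

      # Predict: compare log p(x|c) + log p(c) across classes (p(x) cancels out).
      def predict_gnb(params, x):
          best_c, best_score = None, -np.inf
          for c, (mean, var, prior) in params.items():
              # log of the product of independent Gaussians, plus the log prior
              log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
              score = log_lik + np.log(prior)
              if score > best_score:
                  best_c, best_score = c, score
          return best_c

      rng = np.random.default_rng(0)
      X = np.vstack([rng.normal(0, 1, (40, 3)), rng.normal(3, 1, (40, 3))])
      y = np.array([0] * 40 + [1] * 40)
      print(predict_gnb(fit_gnb(X, y), np.array([2.8, 3.1, 2.9])))   # -> 1
      ```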
    10. Logistic Regression
      1. From the probabilistic generative model we know that p(c|x) can be written as a sigmoid; logistic regression models this probability directly for classification: p(c|x) = 1/(1+exp(-a(x))) with a(x) = w*x. The objective function is then optimized with respect to w.
      2. The maximum-likelihood objective is p(t|w), taken over all the targets, so it is a product over the samples (usually turned into a sum by taking the log).
      3. Comparing Naive Bayes with Logistic Regression: they share the same form p(c|x) = 1/(1+exp(-a(x))), but a(x) is obtained differently.
        1. For Naive Bayes, a(x) = ln[p(x|C1)p(C1) / (p(x|C2)p(C2))], so we model p(x|c) directly. But x is a vector and p(x|c) is hard to estimate, so we assume its elements are conditionally independent: p(x|c) = prod_i p(x_i|c).
        2. For Logistic Regression, a(x) = w*x: we only use w to map x to a probability, without modelling p(x|c) at all.
      4. So there are two ways to train logistic regression. One is maximum likelihood: the objective is arg max p(t|w). The other is MAP: arg max p(w|t), where p(w|t) is proportional to p(t|w)p(w). We already know the likelihood p(t|w); for the prior p(w) we assume a Gaussian, which amounts to adding a regularization term on w (see the sketch below).
      5. Maximum likelihood for this model is equivalent to maximum entropy (the MaxEnt view of logistic regression).
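      A minimal sketch of logistic regression trained by (batch) gradient descent; with lam = 0 it is maximum likelihood, with lam > 0 it is the MAP / regularized version. The data, learning rate, and epoch count are illustrative, and for simplicity the bias term is penalized as well.

      ```python
      import numpy as np

      def sigmoid(a):
          return 1.0 / (1.0 + np.exp(-a))

      # Gradient descent on the negative log posterior; labels are in {0, 1}.
      def train_logreg(X, t, lam=0.1, lr=0.1, epochs=500):
          Phi = np.hstack([np.ones((len(X), 1)), X])      # prepend a bias column
          w = np.zeros(Phi.shape[1])
          for _ in range(epochs):
              y = sigmoid(Phi @ w)                        # p(c=1 | x)
              grad = Phi.T @ (y - t) + lam * w            # data term + Gaussian-prior term
              w -= lr * grad / len(X)
          return w

      rng = np.random.default_rng(0)
      X = np.vstack([rng.normal(-1, 1, (40, 2)), rng.normal(1, 1, (40, 2))])
      t = np.array([0] * 40 + [1] * 40)
      w = train_logreg(X, t)
      pred = sigmoid(np.hstack([np.ones((len(X), 1)), X]) @ w) > 0.5
      print((pred == t).mean())   # training accuracy
      ```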
    11. Unsupervised Learning
      1. Clustering
        1. Basically, start with several clusters, define a distance between clusters, and then merge the closest ones (agglomerative clustering).
        2. K-means problem. EM algorithm.
      2. PCA
        1. Choose the projection so that the variance of the projected samples is maximized.
        2. Subtract the mean from each feature, compute the covariance matrix, then its eigenvalues and eigenvectors. Select the n largest eigenvalues; the n corresponding eigenvectors are the projection directions (see the sketch below).
        3. Two perspectives on PCA: project the data to a lower dimension so that the variance of the projection is maximized, or so that the reconstruction error between the original data and its projection is minimized.
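      A minimal sketch of PCA via the eigendecomposition of the covariance matrix, following the recipe in item 2; n_components and the random data are illustrative.

      ```python
      import numpy as np

      # PCA: center the data, eigendecompose the covariance matrix,
      # and keep the eigenvectors of the largest eigenvalues as projection directions.
      def pca(X, n_components=2):
          X_centered = X - X.mean(axis=0)                    # subtract each feature's mean
          cov = np.cov(X_centered, rowvar=False)             # covariance matrix
          eigvals, eigvecs = np.linalg.eigh(cov)             # eigh: symmetric matrix
          order = np.argsort(eigvals)[::-1][:n_components]   # largest eigenvalues first
          components = eigvecs[:, order]                     # projection directions
          return X_centered @ components, components

      rng = np.random.default_rng(0)
      X = rng.normal(size=(100, 5))
      Z, W = pca(X, n_components=2)
      print(Z.shape, W.shape)   # (100, 2) (5, 2)
      ```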
