Introduction to Statistical Learning: lecture notes

created : 2021-10-07T19:46:27+00:00
modified : 2021-12-07T18:16:51+00:00

1. Introduction to Statistical Learning

2. Linear regression


2.1 Review - Linear regression

2.2 Comparison of Linear regression with KNN


3. Resampling Method

3.1 Cross-Validation

3.1.1 The Validation Set approach

3.1.2 K-fold Cross-validation

3.2 The Bootstrap

4. Linear Model Selection and Regularization

4.1 Subset Selection

Best Subset Selection

  1. Let $M_0$ denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.
  2. For $k = 1, …, p$:
    • Fit all $\binom{p}{k}$ models that contain exactly $k$ predictors.
    • Pick the best among these $\binom{p}{k}$ models, and call it $M_k$. Here best is defined as having the smallest RSS, or equivalently the largest $R^2$.
  3. Select a single best model from among $M_0, …, M_p$ using cross-validated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$.
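
The steps above can be sketched directly in code. This is a minimal illustration with made-up data, assuming numpy is available; it is not the book's lab code, and the helper names are my own:

```python
# Best subset selection: for each size k, fit all C(p, k) models and keep
# the one with the smallest training RSS.
import itertools
import numpy as np

def rss(X, y):
    """Residual sum of squares of an OLS fit of y on X (with an intercept)."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return float(resid @ resid)

def best_subset(X, y):
    p = X.shape[1]
    # M_0: the null model predicts the sample mean for every observation
    models = {0: ((), float(((y - y.mean()) ** 2).sum()))}
    for k in range(1, p + 1):
        # fit all C(p, k) models with exactly k predictors, keep the smallest RSS
        cols = min(itertools.combinations(range(p), k),
                   key=lambda c: rss(X[:, list(c)], y))
        models[k] = (cols, rss(X[:, list(cols)], y))
    return models  # choose among M_0, ..., M_p with CV error, C_p, BIC, etc.

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = 2.0 * X[:, 0] - X[:, 2] + rng.normal(scale=0.1, size=60)
models = best_subset(X, y)
```

Note that training RSS can only compare models of the same size $k$; the final choice among $M_0, …, M_p$ must use cross-validation, $C_p$, BIC, or adjusted $R^2$, as step 3 says.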

Forward Stepwise Selection

  1. Let $M_0$ denote the null model, which contains no predictors.
  2. For $k = 0, …, p-1$:
    • Consider all $p-k$ models that augment the predictors in $M_k$ with one additional predictor.
    • Choose the best among these $p-k$ models, and call it $M_{k+1}$. Here best is defined as having the smallest RSS or highest $R^2$.
  3. Select a single best model from among $M_0, …, M_p$ using cross-validated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$.
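
A minimal sketch of the greedy forward pass, again with hypothetical data and numpy assumed; only $1 + p(p+1)/2$ models are fit, versus $2^p$ for best subset:

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of an OLS fit of y on X (with an intercept)."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return float(resid @ resid)

def forward_stepwise(X, y):
    p = X.shape[1]
    selected = []                        # M_0: no predictors
    path = [tuple(selected)]
    for _ in range(p):
        remaining = [j for j in range(p) if j not in selected]
        # augment M_k with the single predictor giving the largest RSS reduction
        best_j = min(remaining, key=lambda j: rss(X[:, selected + [j]], y))
        selected.append(best_j)
        path.append(tuple(selected))
    return path  # p + 1 nested models M_0, ..., M_p; pick among them with CV etc.

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = 2.0 * X[:, 0] - X[:, 2] + rng.normal(scale=0.1, size=60)
path = forward_stepwise(X, y)
```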

Backward Stepwise Selection

  1. Let $M_p$ denote the full model, which contains all $p$ predictors.
  2. For $k = p, p-1, …, 1$:
    • Consider all $k$ models that contain all but one of the predictors in $M_k$, for a total of $k-1$ predictors.
    • Choose the best among these $k$ models, and call it $M_{k-1}$. Here best is defined as having the smallest RSS or highest $R^2$.
  3. Select a single best model from among $M_0, …, M_p$ using cross-validated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$.
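
The backward pass mirrors the forward one: start from the full model and greedily drop the least useful predictor. A minimal sketch under the same assumptions (hypothetical data, numpy available):

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of an OLS fit of y on X (with an intercept)."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return float(resid @ resid)

def backward_stepwise(X, y):
    p = X.shape[1]
    selected = list(range(p))            # M_p: the full model
    path = [tuple(selected)]
    while selected:
        # drop the predictor whose removal increases RSS the least
        drop = min(selected,
                   key=lambda j: rss(X[:, [c for c in selected if c != j]], y))
        selected.remove(drop)
        path.append(tuple(selected))
    return path  # M_p down to M_0; pick among them with CV error, C_p, BIC, etc.

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = 2.0 * X[:, 0] - X[:, 2] + rng.normal(scale=0.1, size=60)
path = backward_stepwise(X, y)
```

Unlike forward stepwise, this requires $n > p$ so that the full model can be fit at the start.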

4.2 Shrinkage

4.2.1 Ridge regression

4.2.2 The Lasso regression

5 Moving Beyond Linearity

5.1 Polynomial Regression

$ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + … + \beta_d x_i^d + \epsilon_i$.
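
This model is just ordinary least squares on a polynomial design matrix with columns $1, x, x^2, …, x^d$. A minimal sketch with made-up data, assuming numpy:

```python
import numpy as np

def poly_fit(x, y, d):
    """OLS fit of y = b0 + b1 x + ... + bd x^d; returns (b0, ..., bd)."""
    X = np.vander(x, d + 1, increasing=True)   # columns 1, x, x^2, ..., x^d
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def poly_predict(beta, x):
    return np.vander(x, len(beta), increasing=True) @ beta

# synthetic quadratic data: y = 1 + 0.5 x - 2 x^2 + noise
rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 2.0, size=200)
y = 1.0 + 0.5 * x - 2.0 * x ** 2 + rng.normal(scale=0.1, size=200)
beta = poly_fit(x, y, d=2)
```

In practice $d$ is rarely taken larger than 3 or 4, since high-degree polynomials are wiggly and behave badly at the boundaries.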

5.2 Step functions

5.3 Regression Splines

5.4 Smoothing Splines

5.5 Local regression

Algorithm

  1. Gather the fraction $s= k/n$ of training points whose $x_i$ are closest to $x_0$.
  2. Assign a weight $K_{i0} = K(x_i, x_0)$ to each point in this neighborhood, so that the point furthest from $x_0$ has weight zero, and the closest has the highest weight. All but these $k$ nearest neighbors get weight zero.
  3. Fit a weighted least squares regression of the $y_i$ on the $x_i$ using the aforementioned weights, by finding $\hat \beta_0$ and $\hat \beta_1$ that minimize:
    • $\sum_{i=1}^n K_{i0}(y_i - \beta_0 - \beta_1 x_i)^2$.
  4. The fitted value at $x_0$ is given by $\hat f(x_0) = \hat \beta_0 + \hat \beta_1 x_0$.
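
The algorithm above, at a single target point $x_0$, can be sketched as follows. The tricube kernel used here is an assumption (the algorithm only requires weights that decay to zero at the $k$-th nearest neighbor), and numpy is assumed available:

```python
import numpy as np

def local_linear(x, y, x0, s):
    n = len(x)
    k = max(2, int(np.ceil(s * n)))      # neighborhood: fraction s of the data
    d = np.abs(x - x0)
    h = np.sort(d)[k - 1]                # distance to the k-th nearest x_i
    # tricube weights: highest at x0, exactly zero at and beyond the k-th neighbor
    w = np.clip(1.0 - (d / h) ** 3, 0.0, None) ** 3
    # weighted least squares: minimize sum_i K_i0 (y_i - b0 - b1 x_i)^2
    X = np.column_stack([np.ones(n), x])
    A = X.T @ (w[:, None] * X)
    b = X.T @ (w * y)
    b0, b1 = np.linalg.solve(A, b)
    return b0 + b1 * x0                  # fitted value f_hat(x0)

# smooth test function: the local linear fit should track sin(x) closely
x = np.linspace(0.0, 3.0, 200)
y = np.sin(x)
fit = local_linear(x, y, x0=1.0, s=0.2)
```

To trace out the whole fitted curve, the same computation is repeated at a grid of target points $x_0$.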

5.5.1

5.6 Generalized Additive Models

5.6.1 Backfitting Algorithm

6. Support Vector Machines

6.1 Separating Hyperplane

6.2 Quadratic Programming and Dual Problem

Dual problem

6.3 Linearly Nonseparable Case

Primal problem for the soft margin SVM

7. Principal component analysis

7.1 Principal Components Analysis

7.2 Principal Components Regression

8. Classification

8.1. Logistic Regression

Interpretation of $\beta$

Estimation

Logistic regression with several variables

8.2 KNN classifier

8.3 Comparison methods

9. Clustering

9.1. Hierarchical Clustering

9.2 Nonhierarchical Clustering

9.2.1 K-means clustering

9.2.2 K-medoids - Partitioning Around Medoids (PAM) clustering

10. Quantile regression

10.1 Motivation

10.2 Quantile

10.3 Conditional quantile and quantile regression

10.4 Real data analysis

11. Tree-based methods

11.1 Regression Trees

How do we build the regression tree

  1. We divide the predictor space - that is, the set of possible values of $X_1, …, X_p$ - into $J$ distinct and non-overlapping regions, $R_1, …, R_J$:
    • Find $R_1, …, R_J$ that minimize the RSS given by:
    • $RSS = \sum_{j=1}^J \sum_{i \in R_j} (y_i - \hat y_{R_j})^2$, where $\hat y_{R_j}$ is the mean response for the training observations within $R_j$.
  2. For every observation that falls into the region $R_j$, we make the same prediction, which is simply the mean of the response values for the training observations in $R_j$.
    • It is computationally infeasible to consider every possible partition of the feature space into $J$ boxes. Thus, we take a top-down, greedy approach called recursive binary splitting.
  3. We first select the predictor $X_j$ and the cutpoint $s$ such that splitting the predictor space into the regions $R_1 = \{ X \mid X_j < s \}$ and $R_2 = \{ X \mid X_j \ge s \}$ leads to the greatest possible reduction in RSS:
    • $RSS = \sum_{i: x_i \in R_1} (y_i - \hat y_{R_1})^2 + \sum_{i: x_i \in R_2} (y_i - \hat y_{R_2})^2$
  4. Next, we repeat the process, looking for the best predictor and best cutpoint in order to split the data further so as to minimize the RSS within each of the resulting regions. However, this time, instead of splitting the entire predictor space, we split one of the two previously identified regions.
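
One step of recursive binary splitting is an exhaustive scan over predictors and cutpoints. A minimal sketch on toy data (the data and function names are hypothetical; numpy is assumed):

```python
import numpy as np

def best_split(X, y):
    """Scan every predictor j and cutpoint s; return the (j, s) whose split
    minimizes the summed RSS of the two resulting regions."""
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue                 # both regions must be non-empty
            split_rss = (((left - left.mean()) ** 2).sum()
                         + ((right - right.mean()) ** 2).sum())
            if split_rss < best_rss:
                best_j, best_s, best_rss = j, s, split_rss
    return best_j, best_s, best_rss

# toy data: y steps from 1 to 5 as X[:, 0] crosses 0; the second column is noise
X = np.array([[-2.0, 1.0], [-1.0, -1.0], [1.0, 1.0], [2.0, -1.0]])
y = np.array([1.0, 1.0, 5.0, 5.0])
j, s, r = best_split(X, y)
```

Growing the full tree repeats `best_split` inside each new region until a stopping rule (e.g. a minimum region size) is met; the pruning step below then cuts the tree back.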

Pruning a tree

11.2 Classification Trees