Regression Analysis

created : 2021-06-06T05:56:14+00:00
modified : 2021-06-20T14:05:24+00:00
statistics lectures

Chapter 1. Introduction

1.1 What is regression analysis?

1.2 History

1.4 Procedures of regression analysis

  1. Statement of the problem
  2. Selection of potentially relevant variables
  3. Experimental design and data collection
  4. Model specification
  5. Choice of fitting method
  6. Model fitting
  7. Model validation and criticism
  8. Using the model for the intended purpose

Various Models

Chapter 2. Correlation analysis and simple linear regression

2.1 Covariance and correlation

Anscombe’s quartet

2.2 Simple linear regression

2.3 Least Squares Estimation (LSE)

2.4 Properties of the LSE

2.5 Quality of fit

  1. (objective) The greater the absolute value of the t-test statistic for $H_0 : \beta_1 = 0$ (or the smaller the p-value), the stronger the linear relationship between X and Y.
  2. (subjective) The scatter plot may be used to discover the strength of the linear relationship.
  3. Examine the scatter plot of Y versus $\hat Y$. The closer the points are to a straight line, the stronger the linear relationship between Y and X. One can measure the strength of the linear relationship in this plot by computing the correlation coefficient between Y and $\hat Y$:
    • $Cor(Y, \hat Y) = | Cor(Y, X) |$
  4. Furthermore, in both simple and multiple regression, $Cor(Y, \hat Y)$ is related to another useful measure of the quality (goodness) of fit of the linear model to the observed data, the coefficient of determination: $R^2 = [Cor(Y, \hat Y)]^2$.
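
The relations in items 3 and 4 can be checked numerically. The following is a minimal sketch in plain Python on made-up data with a negative slope (the data and the helper names `cor` and `fit_simple` are illustrative assumptions, not part of the lecture notes). It fits the simple linear regression by least squares and compares $Cor(Y, \hat Y)$, $|Cor(Y, X)|$, and $R^2 = 1 - SSE/SST$.

```python
import math

def mean(v):
    return sum(v) / len(v)

def cor(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    mu, mv = mean(u), mean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def fit_simple(x, y):
    """Least squares estimates (b0, b1) for y = b0 + b1 * x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx
    return my - b1 * mx, b1

# Made-up data with a negative slope, so Cor(Y, X) < 0.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [9.8, 8.1, 6.9, 4.2, 3.1, 0.9]

b0, b1 = fit_simple(x, y)
y_hat = [b0 + b1 * a for a in x]

sse = sum((b - f) ** 2 for b, f in zip(y, y_hat))   # residual sum of squares
sst = sum((b - mean(y)) ** 2 for b in y)            # total sum of squares
r_squared = 1 - sse / sst

print("Cor(Y, X)     =", cor(y, x))
print("Cor(Y, Y_hat) =", cor(y, y_hat))
print("R^2           =", r_squared)
```

Because the slope is negative here, $Cor(Y, X)$ is negative while $Cor(Y, \hat Y)$ is positive with the same magnitude, which is why the absolute value appears in the identity.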

2.6 Simple linear regression with no intercept

2.7 Trivial regression and one sample t test

2.8 Hypothesis tests about a population correlation coefficient


Chapter 3. Multiple linear regression

Chapter 3.1 Parameter Estimation: Least Squares Estimation (LSE)

Chapter 3.2 Interpretation of the regression coefficients

Chapter 3.3. Centering and scaling

  1. Unit-Length scaling:
    • $Z_y = \frac{Y - \bar y}{L_y}$
    • $Z_j = \frac{X_j - \bar x_j}{L_j}$
    • where $L_y = \sqrt{\sum_{i=1}^n (y_i - \bar y)^2}$ and $L_j = \sqrt{\sum_{i=1}^n (x_{ij} - \bar x_j)^2}$
  2. Standardizing:
    • $\tilde Y = \frac{Y - \bar y}{s_y}$
    • $\tilde X_j = \frac{X_j - \bar x_j}{s_j}$
    • where $s_y$ and $s_j$ are the sample standard deviations of $Y$ and $X_j$
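
The two scalings above can be sketched in plain Python as follows. The data and the function names `unit_length_scale` and `standardize` are illustrative assumptions, and the standard deviation is assumed to use the $n - 1$ divisor.

```python
import math

def unit_length_scale(v):
    """Center v and divide by L = sqrt(sum of squared deviations)."""
    m = sum(v) / len(v)
    length = math.sqrt(sum((a - m) ** 2 for a in v))
    return [(a - m) / length for a in v]

def standardize(v):
    """Center v and divide by the sample standard deviation (n - 1 divisor)."""
    n = len(v)
    m = sum(v) / n
    s = math.sqrt(sum((a - m) ** 2 for a in v) / (n - 1))
    return [(a - m) / s for a in v]

x = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]   # made-up predictor values
y = [1.0, 3.0, 2.0, 5.0, 6.0, 9.0]   # made-up response values

zx, zy = unit_length_scale(x), unit_length_scale(y)
sx = standardize(x)

# Unit-length scaling: each variable has mean 0 and length 1, so the inner
# product of two scaled variables equals their correlation coefficient.
print("squared length of zx:", sum(a * a for a in zx))
print("inner product zx.zy :", sum(a * b for a, b in zip(zx, zy)))

# Standardizing: each variable has mean 0 and sample variance 1.
print("sample variance of sx:", sum(a * a for a in sx) / (len(sx) - 1))
```

The two versions differ only by the constant factor $\sqrt{n - 1}$, so they lead to the same fitted model up to a rescaling of the coefficients.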

Chapter 3.4 Properties of LSEs

Chapter 3.5 Multiple correlation coefficient

Chapter 3.6 Inference for individual regression coefficients

Chapter 3.7 Tests of hypotheses in a linear model

Source | SS | df | MS | F | P-value
Regression | SSR | p | MSR = SSR/p | F = MSR/MSE | $P(F_{p, n-p-1} \ge F)$
Residuals | SSE | n-p-1 | MSE = SSE/(n-p-1) | |
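
As a minimal sketch of how the entries of this ANOVA table are computed, the following uses NumPy on simulated data (the data-generating model, the seed, and all variable names are assumptions made for illustration). It fits the model by least squares and forms SSR, SSE, MSR, MSE, and F for the overall test that all slope coefficients are zero.

```python
import numpy as np

# Simulated data: n observations, p predictors (illustrative values only).
rng = np.random.default_rng(0)
n, p = 30, 2
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Least squares fit with an intercept column.
A = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ beta_hat

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares, df = p
sse = np.sum((y - y_hat) ** 2)         # residual sum of squares, df = n - p - 1

msr = ssr / p
mse = sse / (n - p - 1)
f_stat = msr / mse

print(f"SSR = {ssr:.3f}, df = {p}, MSR = {msr:.3f}")
print(f"SSE = {sse:.3f}, df = {n - p - 1}, MSE = {mse:.3f}")
print(f"F   = {f_stat:.3f}")
```

The p-value is the upper tail probability of an F distribution with p and n-p-1 degrees of freedom evaluated at `f_stat`; if SciPy is available, `scipy.stats.f.sf(f_stat, p, n - p - 1)` is one way to obtain it.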

Chapter 4. Diagnostics

4.1 Standard Regression Assumptions

  1. Assumption about the form of the model (linearity of Y and $X_1, \ldots, X_p$): $Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon$
  2. Assumptions about the random errors: $\epsilon_1, \ldots, \epsilon_n \overset{iid}{\sim} N(0, \sigma^2)$
    • Normality assumption
    • Mean zero
    • Constant variance (homogeneity or homoscedasticity) assumption: when violated, it is called the heterogeneity (or heteroscedasticity) problem
    • Independence assumption: when violated, it is called the autocorrelation problem
  3. Assumptions about the predictors:
    • Nonrandom
    • No measurement errors
    • Linearly independent: when violated, it is called the multicollinearity problem
  4. Assumption about the observations (equally reliable)

4.2 Residuals

4.3 Graphical Methods

  1. Detect errors in the data (e.g., an outlying point may be a result of a typographical error)
  2. Recognize patterns in the data (e.g., clusters, outliers, gaps, etc.)
  3. Explore relationships among variables
  4. Discover new phenomena
  5. Confirm or negate assumptions
  6. Assess the adequacy of a fitted model
  7. Suggest remedial actions (e.g., transform the data, redesign the experiment, collect more data, etc.)
  8. Enhance numerical analyses in general

  1. Q-Q plot or P-P plot or normal probability plot: checking normality
  2. Scatter plots of the standardized residual vs X_i : checking linearity and homogeneity
  3. Scatter plot of standardized residual vs fitted values: checking linearity and homogeneity
  4. Index plot of the standardized residuals: checking independent errors

4.4 Leverages, Outliers, and Influence

4.4.1 Outliers in response

4.4.2 Outliers in predictors

4.4.3 Masking and Swamping Problems

4.4.4 Influential points

4.5 Added-variable (AV) plot and residual plus component (RPC) plot

4.5.1 Effects of adding a variable

4.5.2 Robust regression

Chapter 5. Regression analysis with qualitative explanatory variables

5.1 Introduction

5.2 Interactions

5.3 Equal slopes and unequal intercepts

5.4 Unequal slopes and unequal intercepts

5.5 Seasonality

Chapter 6. Transformations

Function | Transformation | Linear Form
$Y = \alpha X^{\beta}$ | $Y' = \log Y,\ X' = \log X$ | $Y' = \alpha' + \beta X'$, where $\alpha' = \log \alpha$
$Y = \alpha e^{\beta X}$ | $Y' = \log Y$ | $Y' = \log \alpha + \beta X$
$Y = \alpha + \beta \log X$ | $X' = \log X$ | $Y = \alpha + \beta X'$
$Y = \frac{X}{\alpha X - \beta}$ | $Y' = \frac{1}{Y},\ X' = \frac{1}{X}$ | $Y' = \alpha - \beta X'$
$Y = \frac{e^{\alpha + \beta X}}{1 + e^{\alpha + \beta X}}$ | $Y' = \log \frac{Y}{1-Y}$ | $Y' = \alpha + \beta X$
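
To illustrate one row of the table, the sketch below linearizes the exponential model $Y = \alpha e^{\beta X}$ by taking logs, fits the resulting straight line by least squares, and transforms the intercept back. The data are noise-free and made up for illustration, and `fit_simple` is a hypothetical helper, so the recovered values match the assumed parameters up to rounding.

```python
import math

def fit_simple(x, y):
    """Least squares estimates (intercept, slope) for y = b0 + b1 * x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

# Noise-free made-up data from Y = alpha * exp(beta * X).
alpha_true, beta_true = 2.0, 0.5
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [alpha_true * math.exp(beta_true * a) for a in x]

# Transformation: Y' = log Y gives the linear form Y' = log(alpha) + beta * X.
y_prime = [math.log(b) for b in y]
intercept, slope = fit_simple(x, y_prime)

# Back-transform the intercept to recover alpha; the slope is beta itself.
alpha_hat = math.exp(intercept)
beta_hat = slope
print("alpha_hat =", alpha_hat)
print("beta_hat  =", beta_hat)
```

With noisy data the same steps apply, but least squares is then carried out on the log scale, so the error assumptions refer to $\log Y$ rather than to $Y$.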

6.2 Detection of heterogeneity

6.3 Variance stabilizing transformations

$\sigma$ | Transformation
$\sigma = \mu^k$ | $Y^{1 - k}$
$\sigma = \mu$ | $\log Y$
$\sigma = \sqrt{\mu}$ | $\sqrt{Y}$
$\sigma = \sqrt{\mu (1 - \mu) / n}$ | $\arcsin(\sqrt{Y})$
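
The entries of this table can be motivated by a first-order (delta method) argument. The sketch below assumes a smooth transformation $g$, writes $\sigma(\mu)$ for the standard deviation of $Y$ as a function of its mean, and drops multiplicative constants; it is a heuristic derivation, not a formal proof.

```latex
\begin{align*}
% First-order Taylor expansion of g about mu:
\mathrm{Var}[g(Y)] &\approx [g'(\mu)]^2 \, \sigma^2(\mu) \\
% The approximate variance is constant when g' is proportional to 1/sigma:
g'(\mu) &\propto \frac{1}{\sigma(\mu)} \\
\sigma = \mu^k,\ k \ne 1: \quad
  g(\mu) &\propto \int \mu^{-k} \, d\mu \propto \mu^{1-k} \\
\sigma = \mu: \quad
  g(\mu) &\propto \int \mu^{-1} \, d\mu = \log \mu \\
\sigma = \sqrt{\mu}: \quad
  g(\mu) &\propto \int \mu^{-1/2} \, d\mu \propto \sqrt{\mu} \\
\sigma = \sqrt{\mu (1 - \mu) / n}: \quad
  g(\mu) &\propto \int \frac{d\mu}{\sqrt{\mu (1 - \mu)}} = 2 \arcsin \sqrt{\mu}
\end{align*}
```

The cases $\sigma = \mu$ and $\sigma = \sqrt{\mu}$ correspond to $k = 1$ and $k = 1/2$; the first is treated separately because $\mu^{1-k}$ is constant at $k = 1$.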

6.4 Weighted Least Squares (WLS)

6.5 Box-Cox power transformations

Chapter 7. Weighted Least Squares

Chapter 8. Correlated errors

8.1 Runs Test

8.2 Durbin-Watson test

8.3 Transformation to remove autocorrelation (Cochrane and Orcutt, 1949)

8.4 Autocorrelation and missing predictors

8.5 Seasonality and dummy variables

Chapter 9 Multicollinearity


9.1.1 Multicollinearity may affect inferences in a regression model

9.1.2 Multicollinearity may affect forecasting

9.2 Detection of multicollinearity

Summary of Chapter 9

Chapter 10. Methods for data with multicollinearity

10.1 Principal components

10.2 Recovering the regression coefficients of the original variables

10.2.1 Recovering LSEs from the fit using the centered or/and scaled data

10.2.2 Recovering the LSEs from the fit using the principal components

Summary of recovering regression coefficients

10.3 Principal component regression (Dimension reduction)

  1. Center and/or scale the data
  2. Calculate the principal components of the sample variance-covariance matrix or the sample correlation matrix
  3. Select the number of principal components
  4. Fit the data using the selected principal components
  5. Recover the estimates of the regression coefficients
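
The five steps above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions: the data are simulated with two nearly collinear predictors, the components are taken from the sample correlation matrix, the number of components `m` is simply passed in by the user rather than chosen by a criterion, and the function name `pcr` is hypothetical.

```python
import numpy as np

def pcr(X, y, m):
    """Principal component regression with m components.

    Returns (beta0_hat, beta_hat) on the scale of the original variables.
    """
    n = X.shape[0]
    # Step 1: center and scale the predictors; center the response.
    x_mean = X.mean(axis=0)
    x_sd = X.std(axis=0, ddof=1)
    Z = (X - x_mean) / x_sd
    y_c = y - y.mean()
    # Step 2: principal components of the sample correlation matrix.
    R = Z.T @ Z / (n - 1)
    eigvals, V = np.linalg.eigh(R)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # reorder to descending
    # Step 3: keep the m leading components (m is chosen by the user here).
    V_m = V[:, order[:m]]
    C = Z @ V_m
    # Step 4: fit the centered response on the selected components.
    alpha_hat, *_ = np.linalg.lstsq(C, y_c, rcond=None)
    # Step 5: recover coefficients of the standardized, then original, variables.
    theta_hat = V_m @ alpha_hat
    beta_hat = theta_hat / x_sd
    beta0_hat = y.mean() - x_mean @ beta_hat
    return beta0_hat, beta_hat

# Simulated data with x2 nearly collinear with x1 (illustrative only).
rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 1.0 + 2.0 * x1 + 2.0 * x2 - 1.0 * x3 + rng.normal(scale=0.5, size=n)

beta0_hat, beta_hat = pcr(X, y, m=2)
print("intercept:", beta0_hat)
print("slopes   :", beta_hat)
```

When all components are kept (here `m=3`), the recovered coefficients coincide with the ordinary least squares estimates; dropping the component with the smallest eigenvalue is what removes the near-collinear direction.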

10.4 Ridge regression

10.5 Least Absolute Shrinkage and Selection Operator (LASSO)

Chapter 11 Variable selections

11.1 Why do we need variable selections?

11.2 Effects of variable selections

Appendix: Effects of Incorrect Model Specifications

11.3 Practical issues in variable selections

11.4 Forward, backward, stepwise selection

FS (Forward Selection)

  1. Preselect a level of significance $\alpha_{in}$
  2. Start with the smallest model $y_i = \beta_0 + \epsilon_i$, denoted by $M_0$
  3. Among the k one-variable models $y_i = \beta_0 + \beta_j x_{ij} + \epsilon_i$, find the one with the smallest p-value. If that p-value is less than or equal to $\alpha_{in}$, include the variable and denote the resulting model by $M_1$. Otherwise, stop the procedure.
  4. Continue this procedure until no remaining variable has a p-value less than or equal to $\alpha_{in}$, or until the model includes all variables.
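
A minimal sketch of the forward selection steps above, using NumPy and SciPy on simulated data. The function names, the default $\alpha_{in} = 0.05$, and the data are assumptions for illustration; the p-value used at each step is that of the two-sided t-test for the candidate variable's coefficient in the model that also contains the variables already selected.

```python
import numpy as np
from scipy import stats

def last_coef_p_value(A, y):
    """Two-sided t-test p-value for the coefficient of the last column of A."""
    n, q = A.shape
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    df = n - q
    sigma2 = resid @ resid / df                  # estimate of the error variance
    cov = sigma2 * np.linalg.inv(A.T @ A)        # covariance of the estimates
    t = beta[-1] / np.sqrt(cov[-1, -1])
    return 2 * stats.t.sf(abs(t), df)

def forward_selection(X, y, alpha_in=0.05):
    """Return the column indices of X in the order they enter the model."""
    n, k = X.shape
    selected, remaining = [], list(range(k))
    while remaining:
        base = [np.ones(n)] + [X[:, j] for j in selected]
        p_values = {
            j: last_coef_p_value(np.column_stack(base + [X[:, j]]), y)
            for j in remaining
        }
        best = min(p_values, key=p_values.get)   # candidate with smallest p-value
        if p_values[best] > alpha_in:            # no candidate qualifies: stop
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# Simulated data: only columns 0 and 3 affect the response.
rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 5))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=n)

selected = forward_selection(X, y)
print("entry order:", selected)
```

Noise variables can still enter by chance, since each test is carried out at level $\alpha_{in}$.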

BE (Backward Elimination)

  1. Preselect a level of significance $\alpha_{out}$
  2. Start with the biggest model: $y_i = \beta_0 + \sum_{j=1}^k \beta_j x_{ij} + \epsilon_i$
  3. If the largest p-value among the $X_j$ is greater than or equal to $\alpha_{out}$, then eliminate that variable.
  4. Refit the data with the remaining $k-1$ variables
  5. Continue the procedure until no predictor has a p-value greater than or equal to $\alpha_{out}$, or until the model reduces to $y_i = \beta_0 + \epsilon_i$
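
Backward elimination can be sketched in the same spirit. As before, this is an illustration under assumptions: the function names, the default $\alpha_{out} = 0.10$, and the simulated data are not from the notes, and the p-values are those of the two-sided t-tests for the individual coefficients in the current model.

```python
import numpy as np
from scipy import stats

def coef_p_values(A, y):
    """Two-sided t-test p-values for every coefficient in the fit of y on A."""
    n, q = A.shape
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    df = n - q
    sigma2 = resid @ resid / df
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
    return 2 * stats.t.sf(np.abs(beta / se), df)

def backward_elimination(X, y, alpha_out=0.10):
    """Return the column indices of X that remain in the model."""
    n, k = X.shape
    kept = list(range(k))
    while kept:
        A = np.column_stack([np.ones(n), X[:, kept]])
        p = coef_p_values(A, y)[1:]        # skip the intercept
        worst = int(np.argmax(p))          # predictor with the largest p-value
        if p[worst] < alpha_out:           # every predictor is significant: stop
            break
        kept.pop(worst)                    # eliminate it and refit
    return kept

# Simulated data: only columns 0 and 3 affect the response.
rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 5))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=n)

kept = backward_elimination(X, y)
print("remaining predictors:", kept)
```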

Stepwise selection

  1. Preselect $\alpha_{in}$ and $\alpha_{out}$
  2. At each step of the FS method, check whether any variable already in the model has a p-value greater than or equal to $\alpha_{out}$; if so, eliminate that variable.
  3. Continue this procedure until no variable outside the model has a p-value less than or equal to $\alpha_{in}$ and no variable in the model has a p-value greater than or equal to $\alpha_{out}$, or until it reaches the full model.

11.5 Best subset selection (regression)

11.6 Criteria

11.8 Variable selections with multicollinear data