2 Modeling and Errors

In Summary

Modeling is a critical component of regression and of statistics in general. Parameter estimation allows a researcher to interpolate, or fill in gaps in, a dataset and to determine key relationships between predictors and outcomes.

A lot of work goes on behind the scenes to determine how well a model fits its underlying data. A core understanding of the different kinds of errors, including their properties and impacts, is fundamental to statistical methodology.

The single linear regression model is:

\[Y_i= \beta_0 +\beta_1X_i + \epsilon_i\] The model is called a single, or simple, regression model because it uses one predictor.

2.1 Features of Models

\(Y_i\) is the outcome variable. \(Y_i\) is random for each experiment because it is the sum of a constant term, \(\beta_0 +\beta_1X_i\), and a random variable, \(\epsilon_i\).

\(X_i\) is considered a fixed value predictor for the experiment, and \(\beta_0\) and \(\beta_1\) are parameters, usually unknown. The goal is to estimate \(\beta_0\) and \(\beta_1\).

We can estimate \(E(Y_i|X_i)\) with \(\hat y\), using the fact that the errors have expectation zero.

Definition 2.1 (Predicted Value Formula) \[\begin{equation} \begin{split} E(Y_i|X_i)&= \beta_0 +\beta_1X_i\\ \hat y&= b_0 +b_1x_i \end{split} \end{equation} \]

We expect our errors to deviate randomly around the regression line.

Definition 2.2 (Deviation Formula) \[\epsilon_i = Y_i -(\beta_0 +\beta_1X_i)\]

Example 2.1 We want to estimate the systolic blood pressure for a 20 year old. We know the following variables:

  • \(x\): age
  • \(y\): systolic blood pressure
  • \(\beta_0\): the intercept, \(90\)
  • \(\beta_1\): the slope, \(0.9\)

Note that the \(\beta\)’s are usually unknown. Here we can say we expect systolic blood pressure to be \(90\) at age \(0\) (at birth), increasing by \(0.9\) units every year.

\[\begin{align} E(Y|X = 20) &= 90 + 0.9(20)\\ &=108 \end{align}\]

Factoring in an error margin, given \(X = 20\) we would expect \(Y = 108 +\epsilon\). A 20-year-old should have a systolic blood pressure around 108.
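As a quick arithmetic check, here is a minimal Python sketch of this prediction; the coefficient values are the ones assumed in Example 2.1, and the function name is just illustrative.

```python
# Predicted systolic blood pressure from the line E(Y|X) = beta0 + beta1 * x,
# using the illustrative coefficients from Example 2.1.
beta0 = 90.0   # expected systolic blood pressure at age 0
beta1 = 0.9    # expected increase in systolic blood pressure per year of age

def expected_sbp(age):
    """Return E(Y | X = age) under the assumed simple linear model."""
    return beta0 + beta1 * age

print(expected_sbp(20))  # 108.0
```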

2.2 An Introduction to Residuals

The expected value of \(\epsilon_i\) is zero, so the expected value of \(Y_i\) falls on the regression line.

We expect some variation between the observed values and the true regression function. These random errors of measurement, \(\epsilon_i\), occur in each experiment; we quantify them with the observed residuals, \(e_i\).

Residual for observation \(i\):

\(e_i=\) observed - predicted

Definition 2.3 (Observed Residual Error) \[\begin{equation} \begin{split} e_i&= y_i - \hat y_i\\ &= y_i - (b_0 + b_1x_i) \end{split} \end{equation}\]

We need to establish some properties for \(\epsilon_i\):

Property 1:

\(E(\epsilon_i) = 0\), \(\forall i\)

On average, the error is \(0\); that is, the observations average out to the regression line.

\[\begin{equation} \begin{split} \epsilon_i &= Y_i -(\beta_0 +\beta_1X_i) \\ E[\epsilon_i] &= E[Y_i -(\beta_0 +\beta_1X_i)] \\ 0 &= E[Y_i] - E[\beta_0 +\beta_1X_i] \\ 0 &= E[Y_i] - (\beta_0 + \beta_1X_i) \\ \Rightarrow E[Y_i] &= \beta_0 + \beta_1X_i \end{split} \end{equation}\]

showing that on average, \(Y_i\) falls along the regression line. We can say the regression line is the average of \(Y_i\), conditional on \(X_i\).

\[ E(Y_i|X_i)= \beta_0 +\beta_1X_i \]

Property 2:

\(\epsilon_1, \ldots, \epsilon_n\) have a constant variance.

Define \(\sigma^2(\epsilon_i) \equiv \sigma^2\); then \(\sigma^2(Y_i) = var(Y_i) = \sigma^2\), since \(\beta_0 + \beta_1X_i\) is constant.

If \(var(\epsilon_i)\) were to increase as \(X\) increases, this constant-variance assumption would be violated.

Property 3:

\(\epsilon_1, \ldots, \epsilon_n\) are uncorrelated.

\[\begin{equation} \begin{split} corr(\epsilon_i,\epsilon_j) &= 0, \quad i\neq j \\ cov(\epsilon_i,\epsilon_j) &= 0, \quad i\neq j \\ cov(\epsilon_i,\epsilon_i) &= var(\epsilon_i) \neq 0 \end{split} \end{equation}\]

Similarly, but with different notation, the covariance can be written \(cov(\epsilon_i,\epsilon_j) = \sigma(\epsilon_i,\epsilon_j) = 0\) for \(i \neq j\).

Property 4:

\(\epsilon_1, \ldots, \epsilon_n\) are normally distributed.
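A small simulation sketch can make these four properties concrete: the errors below are drawn independently from a normal distribution with mean \(0\) and a common variance, so each property holds by construction. The parameter values, sample size, and predictor range are arbitrary choices for illustration, not from the text.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Illustrative (assumed) true parameters and error standard deviation.
n = 50
beta0, beta1 = 90.0, 0.9
sigma = 5.0

x = np.linspace(20, 70, n)                        # fixed predictor values
eps = rng.normal(loc=0.0, scale=sigma, size=n)    # iid N(0, sigma^2) errors
y = beta0 + beta1 * x + eps                       # Y_i = beta0 + beta1*X_i + eps_i

print(eps.mean())          # near 0 (Property 1)
print(eps.var(ddof=1))     # near sigma^2 = 25 (Property 2)
```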

2.3 Estimating \(\beta\)’s

We can find good estimates of \(\beta_0\) and \(\beta_1\) using the least squares method, where \(b_0\) and \(b_1\) are unbiased least squares estimates.

When you have \(n\) pairs of data, \((x_1, y_1), \ldots, (x_n, y_n)\):

Compose the squared deviation for each observation from the deviation formula (Definition 2.2):

\[e_i^2 = [y_i - (\beta_0 +\beta_1x_i)]^2\]

This will allow us to define \(Q\), the sum of squared deviations:

Definition 2.4 (Sum of Squared Deviations) \[\begin{equation} \begin{split} Q &\equiv \sum_{i = 1}^n e_i^2\\&=\sum_{i = 1}^n [y_i - (\beta_0 +\beta_1x_i)]^2 \end{split} \end{equation}\]

The least squares estimates are values of \(\beta_0\) and \(\beta_1\) that minimize \(Q\).

We use \(b_1\) to estimate the slope \(\beta_1\):

Definition 2.5 (Estimation of Regression Slope) \[\begin{equation} \begin{split} b_1 &=\frac{\hat{cov}(X,Y)}{\hat{var}(X)} \\ &= \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sum(x_i-\bar{x})^2} \end{split} \end{equation}\]

The sampling distribution of \(b_1\) is normal: \[ b_1\sim N\!\left(\beta_1,\ \sigma^2(b_1)\right)\]

Once we have \(b_1\), the estimate for the intercept \(\beta_0\) is straightforward:

Definition 2.6 (Estimation of Regression Intercept) \[\begin{equation} \begin{split} b_0 = \bar{y}- b_1\bar{x}\\ \end{split} \end{equation}\]

where the point \((\bar{x}, \bar{y})\) falls along the regression line.
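Definitions 2.5 and 2.6 translate directly into code. The sketch below uses a small made-up dataset (the x and y values are hypothetical, not from the text) to compute \(b_1\) and \(b_0\).

```python
import numpy as np

# Hypothetical data: n pairs (x_i, y_i); in practice these are your observations.
x = np.array([25, 30, 35, 40, 45, 50, 55, 60], dtype=float)
y = np.array([112, 118, 121, 127, 129, 136, 139, 144], dtype=float)

x_bar, y_bar = x.mean(), y.mean()

# Definition 2.5: b1 = sum (x_i - x_bar)(y_i - y_bar) / sum (x_i - x_bar)^2
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Definition 2.6: b0 = y_bar - b1 * x_bar
b0 = y_bar - b1 * x_bar

print(b0, b1)
```

The fitted line passes through \((\bar{x}, \bar{y})\) by construction, consistent with the note above.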

2.3.1 Linear Combinations

The linear combination of a set of values \(a_1, \ldots, a_n\) has the formula:

\[ \sum_{i=1}^n c_ia_i \]

where \(c_1, \ldots, c_n\) are constants.

Using this notation, \(b_1\) can be written as a linear combination of the \(y_i\):

\[ b_1 = \frac{\sum(x_i-\bar{x}) y_i}{\sum(x_i-\bar{x})^2} \]

where the factor multiplying each \(y_i\), \(c_i = \frac{x_i-\bar{x}}{\sum(x_j-\bar{x})^2}\), is a constant because the \(x_i\) are fixed. This agrees with Definition 2.5 because \(\sum(x_i-\bar{x})\bar{y} = 0\).
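Reusing the same hypothetical data as before, the sketch below checks that this linear-combination form gives the same slope estimate as Definition 2.5.

```python
import numpy as np

# Same hypothetical data as in the previous sketch.
x = np.array([25, 30, 35, 40, 45, 50, 55, 60], dtype=float)
y = np.array([112, 118, 121, 127, 129, 136, 139, 144], dtype=float)
x_bar, y_bar = x.mean(), y.mean()

# Constants c_i = (x_i - x_bar) / sum_j (x_j - x_bar)^2; then b1 = sum_i c_i * y_i.
c = (x - x_bar) / np.sum((x - x_bar) ** 2)
b1_linear_combo = np.sum(c * y)

# Definition 2.5 form for comparison.
b1_definition = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

print(np.isclose(b1_linear_combo, b1_definition))  # True
```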

2.4 Errors

SSE: Sum of Squared Errors

The residual sum of squares.

Definition 2.7 (Sum of Squared Errors) \[\begin{equation} \begin{split} SSE &= \sum_{i = 1}^n e_i^2\\ &= \sum_{i = 1}^n (y_i - \hat y_i)^2 \end{split} \end{equation}\]

MSE: Mean Squared Errors

Definition 2.8 (Mean Squared Errors) \[\begin{equation} \begin{split} MSE &= \frac{\sum_{i = 1}^n (y_i - \hat y_i)^2}{n-2} \\ &= \frac{SSE}{n-2} \end{split} \end{equation}\]

where we have \(n-2\) degrees of freedom because two parameters, \(\beta_0\) and \(\beta_1\), are estimated from the data.

The MSE is an unbiased estimator for \(\sigma^2\), as the expected value is the true value.

\[ E(MSE) = \sigma^2\]

Our estimates of the \(\beta\)’s are random variables: we expect \(b_1\) to change with each new sample. As we repeat sampling, we expect the MSE to average out to \(\sigma^2\). The sampling variance of \(b_1\) is

\[ \sigma^2(b_1) = \frac{\sigma^2}{\sum(x_i -\bar x)^2} \]

and so,

Definition 2.9 (Estimation of Variance) \[ s^2(b_1)=\frac{MSE}{\sum(x_i -\bar x)^2} \]

which we’ll use in our p-value and confidence interval calculations.
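Continuing with the same hypothetical data and least squares estimates from the earlier sketches, the sketch below computes SSE, MSE, \(s^2(b_1)\), and \(s(b_1)\) directly from Definitions 2.7-2.9.

```python
import numpy as np

# Hypothetical data reused from the earlier sketches.
x = np.array([25, 30, 35, 40, 45, 50, 55, 60], dtype=float)
y = np.array([112, 118, 121, 127, 129, 136, 139, 144], dtype=float)

x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

y_hat = b0 + b1 * x        # fitted values
e = y - y_hat              # observed residuals e_i

n = len(y)
sse = np.sum(e ** 2)                       # Definition 2.7: sum of squared errors
mse = sse / (n - 2)                        # Definition 2.8: unbiased estimate of sigma^2
s2_b1 = mse / np.sum((x - x_bar) ** 2)     # Definition 2.9: estimated variance of b1
s_b1 = np.sqrt(s2_b1)                      # estimated standard error of b1

print(sse, mse, s2_b1, s_b1)
```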

Examples

Example 2.2 The predicted values from the model in Example [] were close to, but not exactly on, the true regression line. Determine the errors:

\[\begin{equation} \begin{split} SSE &= \sum_{i= 1}^{n}e_i^2 \approx 4457.83\\ MSE &= \frac{SSE}{n-2}\approx 247\\ s^2(b_1) &= \frac{MSE}{\sum(x_i-\bar{x})^2} \approx 0.04067\\ s(b_1) &= \sqrt{0.04067} \approx 0.202 \end{split} \end{equation}\]

We’ll use these for hypothesis testing in the next chapter.