1 Single Linear Regression
One predictor, one outcome
In Summary
This section serves as an introduction to expectation, variance, bias, and modeling. These topics serve as building blocks of linear regression. As we sample and re-sample our data repeated, we’ll start to see our statistics, our estimates, start to mirror the true parameters they emulate.
1.1 Variables
Single variable regression has only one continuous response variable (\(y\)) and one explanatory variable (\(x\)). The response variable, \(y\) is also known as the dependent or outcome variable. Explanatory variables, \(x\) aka independent, predictor, or covariate variables, can include continuous or categorical values. We also have two parameters in regression: \(\beta_0\) is the y-intercept and \(\beta_1\) is the slope of the function. All four components make up the linear regression equation, a variant of slope-intercept form.
\[y=\beta_0 + \beta_1x\]
Fixed and Random Variables
Fixed variables are those that don’t change from one experiment to the next. The values are not chosen by the researcher. \(X\) is treated as a fixed variable for regression. That said, regression is often used when the \(X\) is not wholly chosen; it may simply the only data you have to work with. For example, sampling a population will return a distribution of ages that may not match the true distribution of the population. So while we treat \(X\) as a fixed value, it is also technically random.
Random variables are expected to return different values after repeating an experiment. The variability between the true regression line and the points \((X,Y)\) is captured with an error value, \(\epsilon\).
\[ Y=\beta_0 + \beta_1X + \epsilon \] \(\beta\)’s are considered fixed values along with \(X\), whereas \(\epsilon\) is a random variable.
\(Y\) is also a random variable given it’s dependency on \(X\) and the effects of \(\epsilon\).
Parameters and Statistics
Statistics are used to estimate the true parameters
\(\beta_0\) and \(\beta_1\) are parameters; usually unknown values related to the population and not the sample.
- Parameter: values for the population
- Statistic: values for the sample
After collecting data from the population, we can create a sample on which to run our statistics for which we can estimate our parameters.
parameter | statistic | |
---|---|---|
mean | \(\mu\) | \(\bar{y}\) |
variance | \(\sigma^2\) | \(s^2\) |
slope | \(\beta_1\) | \(b_1\) |
intercept | \(\beta_0\) | \(b_0\) |
1.2 Expected Values
Expected values are mean or average values. They look a little different depending on what the data looks like.
Sample means are averages. Given the sample data \(X_1... X_n\), would return the statistic:
Definition 1.1 (Sample Mean) \[\begin{equation} \bar{x} = \frac{1}{n} \sum_{i = 1}^nX_i \end{equation}\]
After several repeated experiments, we can come up with the population mean, the expected value. If \(Z_1..Z_N\) for all values in the population, then we would use the parameter:
\[ E(Z) = \frac{1}{N}\sum_{i=1}^N Z_i\]
In general, a discrete random variable \(Y\) with possible values \(y_1... y_k\) we can say \(E(Y)\) is a weighted average of the possible values \(y_1... y_k\) and their respective expected probabilities \(P(Y = y_1)...P(Y = y_k)\):
Definition 1.2 (Expected Value) \[\begin{equation} E(Y) = \sum_{i=1}^k y_i P(Y = y_i)\end{equation} \]
Expectation with discrete values is a modification of the weighted average formula. Given the limitations of probabilities, we simply have to add the caveats that the sum of all probabilities is equal to 100%; \(\sum w_i = 1\) and \(0 \leq w_1 \leq 1\).
\[\begin{equation}\sum_{i=1}^k w_ia_i\end{equation}\]
Expectation for continuous random variable \(Y\) would be calculated with a density function:
\[E(Y) = \int_{-\infty}^{\infty} yf(y)dy\] It’s the same thing, yet terrible because integrals.
Example 1.1 The normal distribution Y ~ N(0,1)
with mean 0
and variance 1
, would be represented using the density function:
\[E(Y) = \int_{-\infty}^{\infty}y\frac{1}{\sqrt{2\pi}} \exp\left( -\frac{y^2}{2}\right) dy = 0 \]
Rules for Expectation
Given random variables \(X\) and \(Y\) and constants \(a,b,\) & \(c\):
- \(E(c) = c\)
- \(E(cX) = cE(X)\)
- \(E(X+Y) = E(X) + E(Y)\)
Example 1.2 Use the rules of expectation to simplify the equation.
\[\begin{equation} E(a+bX+cY) \\ = a + bE(X) + cE(Y) \end{equation}\]
1.3 Variance
Sample data \(x_i...x_n\) would give us the unbiased sample variance:
Definition 1.3 (Sample Variance) \[ s^2 = \frac{\sum(x_i - \bar{x})^2}{n-1} \]
Focusing on the denominator, sample sizes \(n\) will shift the statistic significantly compared to a larger sample.
Population variance can be defined as:
Definition 1.4 (Variance) \[ var(Y) = E[Y-E(Y)]^2 \]
The squaring of \([Y-E(Y)]\) “removes the sign”, and reverts all measures to positive distances from the mean.
We can use the population variance formula to show that \(E(Y)\) is equivalent to the parameter \(\mu\). Given \(var(Y)\):
\[\begin{equation} \begin{split} var(Y) & = E[Y-E(Y)]^2\\& =E[Y-\mu]^2\\\therefore \mu &= E(Y) \end{split} \end{equation}\]
Given our known substitutions, we can say:
\[\begin{equation} \begin{split} var(Y) & = E[Y-E(Y)]^2\\ & =\sum_{i=1}^k (y_i-\mu)^2P(Y-y_i) \end{split} \end{equation}\]
If \(Z_1... Z_n\) account for all values in a population, we would have parameter:
\[var(Z) = \frac{1}{N}\sum_{i=1}^{N}(Z_i-\bar{Z})^2\]
where \(\bar{Z} = E(Z)\).
Rules for Variance
- \(var(a) = 0\)
- \(var(aX) = a^2 var(X)\)
- \(var(X+Y) = var(X) + var(Y) + 2cov(X,Y)\)
Example 1.3 Use the rules for variance to simplify the equation. \[\begin{equation} \begin{split} var(a+bX-cY) \\ &= var(a) + var(bX-cY) + 2cov(X,Y)\\ &= 0 + var(bX)+ var(-cY) \\ &=b^2var(X) + c^2var(Y)+2[(-bc)cov(X,Y)] \end{split} \end{equation}\]
1.4 Covariance
Sample covariance is defined as:
Definition 1.5 (Sample Covariance) \[ \hat{cov}(X,Y) = \frac{\sum(X_i - \bar{X})(Y_i-\bar{Y})^2}{n-1} \]
Given the random variables \((X,Y)\) we can expand the definition of covariance to:
Definition 1.6 (Covariance) \[cov(X,Y) = E[(X -\mu_x)(Y-\mu_y)]\]
Rules for Covariance
- \(cov(X,Y) = cov(Y,X)\)
- if \(X ⫫ Y \Rightarrow cov(X,Y) = 0\), if the random variables are independent of one another, then we cannot predict their variability
- if \(cov(X,Y)= 0 \Rightarrow\) may or may not \(X ⫫ Y\)
- \(cov(a,b) = 0\)
- \(cov(a, X) = 0\)
- \(cov(aX, Y) = a cov(X,Y)\)
- \(cov(aX, bY) = ab cov(X,Y)\)
- \(cov(X, Y+Z) = cov(X,Y)+cov(X,Z)\)
- \(cov(X,X) = var(X)\)
1.5 Correlation
Correlations is not equal to causation in non-random studies
Correlation is defined on the scale \(-1 \leq corr(X,Y) \leq 1\) where:
Definition 1.7 (Correlation) \[ corr(X,Y) = \frac{cov(X,Y)}{\sqrt{var(X)var(Y)}} \]
1.6 Bias
The estimators \(b_0\) and \(b_1\) are unbiased estimators. They only require that for any \(\epsilon_i\), that \(E(\epsilon_i) = 0\).
So \(E(b_0) =\beta_0\) and \(E(b_1) = \beta_1\).
Bias is a term that refers to how much an estimate (or estimator) is off from its true value. Contextually, we want to know how far off the slope and intercept are for a sample of values in an experiment deviates from the true regression line.
Because of random variability, \(s^2\) will vary with each experiment. After repeating the experiment we can collect an unbiased sample variance, the average value of those experiments.
\[ E(s^2) = E\frac{\sum(Y_i-\bar{Y})^2}{n-1}= var(Y)\]
Examples
Example 1.4 We want to estimate the systolic blood pressure, \(y\), for 20 subjects based on their age, \(x\). \(\beta\)’s are usually unknown, but in this case we know we want to be reasonably close to \(\beta_0=90\) and \(\beta_1=0.9\)
The mean age, \(\bar{x}\), is \(38.15\) and the mean systolic blood pressure measured, \(\bar{y}\) was \(118.11\) mmHg.
First, solve for \(b_1\):
\[\begin{equation} \begin{split} b_1 &=\frac{\hat{cov}(X,Y)}{\hat{var}(X,Y)}\\ \hat{cov}(X,Y)&= \sum_{i =1}^{20} (x_i- \bar x)(y_i - \bar y)\approx 4863.282\\ \hat{var}(X,Y) &= \sum_{i =1}^{20} (x_i- \bar x) \approx 6072.55\\ b_1 &= \frac{4863.282}{6072.55} \approx 0.8 \end{split} \end{equation}\]
Then, for \(b_0\):
\[\begin{equation} \begin{split} b_0 &= \bar{y}- b_1\bar{x}\\ b_0 &= 118.11- 0.8(38.15)\\ b_0 &\approx 91.6 \end{split} \end{equation}\]
Our estimated regression line:
\[ y = 0.8x-91.6 \] is roughly comparable to the true regression line:
\[ y = 0.9x-90 \]
Example 1.5 Given the values above, what is the predicted systolic blood pressure for a 50-year-old?
\[E(Y|X = 50)\]
Use the estimated regression line to determine the value for \(x_1 = 50\): \[\hat y = 0.8(50) - 91.6\approx 131.6\]