Skip to main content

Simple Regression with R

·4 mins

Fitting a Line to Data

Simple linear regression a method of estimating an outcome based on a single related variable. We’ll want to estimate the systolic blood pressure (our outcome) for 20 subjects based on their age (an independent variable) using R.

Here are our 20 subjects:

Table 1: Blood Pressure Sample
Person 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Age 58 46 56 37 23 30 20 25 31 50 34 27 22 52 20 42 20 58 24 35
BP 167 136 116 116 101 121 103 101 105 158 117 118 113 148 105 142 89 153 122 135

First the Basics

It’s good to get a feel for the characteristics of our sample. Our subject information is stored in a dataframe called data.

We’ll want to collect some fundamental information about our sample.

Call summary(data):

##     Person               Age              BP       
##  Length:20          Min.   :20.00   Min.   : 89.0  
##  Class :character   1st Qu.:23.75   1st Qu.:105.0  
##  Mode  :character   Median :32.50   Median :117.5  
##                     Mean   :35.50   Mean   :123.3  
##                     3rd Qu.:47.00   3rd Qu.:137.5  
##                     Max.   :58.00   Max.   :167.0

The summary tells us we have 20 subjects with a mean age of \(35.5\) years. Looking at the quartiles, then our minimum and maximum values, the ages appear to be evenly distributed from 20 to 58 years. A quick check of our blood pressure data shows values that are evenly distributed as well. We’ll look at formal diagnostics later, but right now linear regression is looking like a good tool for estimation.

Visualizing Data

The function ggplotly() from the plotly library is a fantastic way to convert ggplot() items into interactive figures. Note that you can simply use the ggplot() code in the brackets for a static plot.

    {
      ggplot(data, aes(name = Person, x = Age, y = BP )) + 
      geom_point() + 
      labs(title = "Blood Pressure by Age, n = 20", 
           y = "Blood Pressure (mmHg)", 
           x = "Age (Years)")
    } %>% 
    ggplotly()

Figure 1: Blood Pressure by Age, n= 20

We certainly have a linear trend. There are slightly more younger subjects than older subjects. Generally speaking, there doesn’t appear to be too much spread (variability) in our data. That said, one of our subjects, Person 3, appears to have a systolic blood pressure substantially lower than expected at for their age; 116 mmHg at the age of 56.

Modeling

We appear to be in good shape for modeling with linear regression. Simple linear regression is presented in point-slope form, with parameters \(\beta_0\) and \(\beta_1\) making up our slope and intercept, respectively.

\[y = \beta_0 + \beta_1x\]

In R, we’ll call lm() for linear modeling. In this case, we want to model blood pressure BP by age.

lm(formula = BP~Age, data) -> m
## 
## Call:
## lm(formula = BP ~ Age, data = data)
## 
## Coefficients:
## (Intercept)          Age  
##      76.633        1.315

The function returns two coefficients, y-intercept (Intercept) and a ‘slope’ for variable, Age. Starting at 76.633 at Age = 0, we can expect a 1.315 mmHg increase in systolic blood pressure with each added year.

Following the regression formula, we’ll set (Intercept) to \(b_0\) and Age to \(b_1\).

# Set coefficients to betas
b0 <- m$coefficients[1] # 76.633
b1 <- m$coefficients[2] #  1.315

\[BP_{est} = b_0 + b_1*Age\\ BP_{est} = 76.663 + 1.315*Age\]

We can now add the regression line to the plot.

  {
    ggplot(data, aes(x = Age, y = BP)) +
    geom_point() + 
    geom_line(aes(y = b0 + b1*Age)) + 
    labs(title = "Estimated Regression Line: Blood Pressure by Age")
  } %>% 
  ggplotly()

Figure 2: Regression Line: Blood Pressure by Age

Hovering along the line, we can find the estimated blood pressure for any age in the range.

Fit and Validity

Note: In this case, (20,103) and (58,153) appear to begin and end the regression line. While it occurs for these data, it’s not common for a regression line to start and end with “true” subject values.

Notice how the regression line appears to sit in the middle of the points for younger subjects, but after x=40 the line tends to fall under the points. While the regression line appears to be a good fit for the younger subjects, our model frequently underestimates the blood pressure of older subjects.

So how good of an estimate is our line? The next post will talk more about residual errors, goodness-of-fit, and diagnostics.