Fitting a Line to Data

Simple linear regression a method of estimating an outcome based on a single related variable. We’ll want to estimate the systolic blood pressure (our outcome) for 20 subjects based on their age (an independent variable) using R.

Here are our 20 subjects:

Table 1: Blood Pressure Sample
Person	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20
Age	58	46	56	37	23	30	20	25	31	50	34	27	22	52	20	42	20	58	24	35
BP	167	136	116	116	101	121	103	101	105	158	117	118	113	148	105	142	89	153	122	135

First the Basics

It’s good to get a feel for the characteristics of our sample. Our subject information is stored in a dataframe called data.

We’ll want to collect some fundamental information about our sample.

Call summary(data):

##     Person               Age              BP       
##  Length:20          Min.   :20.00   Min.   : 89.0  
##  Class :character   1st Qu.:23.75   1st Qu.:105.0  
##  Mode  :character   Median :32.50   Median :117.5  
##                     Mean   :35.50   Mean   :123.3  
##                     3rd Qu.:47.00   3rd Qu.:137.5  
##                     Max.   :58.00   Max.   :167.0

The summary tells us we have 20 subjects with a mean age of \(35.5\) years. Looking at the quartiles, then our minimum and maximum values, the ages appear to be evenly distributed from 20 to 58 years. A quick check of our blood pressure data shows values that are evenly distributed as well. We’ll look at formal diagnostics later, but right now linear regression is looking like a good tool for estimation.

Visualizing Data

The function ggplotly() from the plotly library is a fantastic way to convert ggplot() items into interactive figures. Note that you can simply use the ggplot() code in the brackets for a static plot.

    {
      ggplot(data, aes(name = Person, x = Age, y = BP )) + 
      geom_point() + 
      labs(title = "Blood Pressure by Age, n = 20", 
           y = "Blood Pressure (mmHg)", 
           x = "Age (Years)")
    } %>% 
    ggplotly()

Figure 1: Blood Pressure by Age, n= 20

We certainly have a linear trend. There are slightly more younger subjects than older subjects. Generally speaking, there doesn’t appear to be too much spread (variability) in our data. That said, one of our subjects, Person 3, appears to have a systolic blood pressure substantially lower than expected at for their age; 116 mmHg at the age of 56.

Modeling

We appear to be in good shape for modeling with linear regression. Simple linear regression is presented in point-slope form, with parameters \(\beta_0\) and \(\beta_1\) making up our slope and intercept, respectively.

\[y = \beta_0 + \beta_1x\]

In R, we’ll call lm() for linear modeling. In this case, we want to model blood pressure BP by age.

lm(formula = BP~Age, data) -> m

## 
## Call:
## lm(formula = BP ~ Age, data = data)
## 
## Coefficients:
## (Intercept)          Age  
##      76.633        1.315

The function returns two coefficients, y-intercept (Intercept) and a ‘slope’ for variable, Age. Starting at 76.633 at Age = 0, we can expect a 1.315 mmHg increase in systolic blood pressure with each added year.

Following the regression formula, we’ll set (Intercept) to \(b_0\) and Age to \(b_1\).

# Set coefficients to betas
b0 <- m$coefficients[1] # 76.633
b1 <- m$coefficients[2] #  1.315

\[BP_{est} = b_0 + b_1*Age\\ BP_{est} = 76.663 + 1.315*Age\]

We can now add the regression line to the plot.

  {
    ggplot(data, aes(x = Age, y = BP)) +
    geom_point() + 
    geom_line(aes(y = b0 + b1*Age)) + 
    labs(title = "Estimated Regression Line: Blood Pressure by Age")
  } %>% 
  ggplotly()

Figure 2: Regression Line: Blood Pressure by Age

Hovering along the line, we can find the estimated blood pressure for any age in the range.

Fit and Validity

Note: In this case, (20,103) and (58,153) appear to begin and end the regression line. While it occurs for these data, it’s not common for a regression line to start and end with “true” subject values.

Notice how the regression line appears to sit in the middle of the points for younger subjects, but after x=40 the line tends to fall under the points. While the regression line appears to be a good fit for the younger subjects, our model frequently underestimates the blood pressure of older subjects.

So how good of an estimate is our line? The next post will talk more about residual errors, goodness-of-fit, and diagnostics.