Lecture Goals

  • Learn how to
    • Model nonlinear relationships
    • Perform nonparametric regression with basis functions
    • Use nonlinear models for prediction
  • Readings

Nonlinear Relationships

  • Model response as \(Y = f(X) + \epsilon\), where \(f(\cdot)\) is a nonlinear function and \(\epsilon\) is random error

Two ways to estimate \(f(\cdot)\)

  • Parametric: assume form of \(f(\cdot)\) is known (e.g. exponential, quadratic), and tune fit with parameters
    • E.g. polynomial/logarithmic regression (transformations)
  • Nonparametric: minimal assumptions, more flexible forms for \(f(\cdot)\)

Parametric Model

  • Gapminder data, logarithmic regression \[Y \sim \alpha + \beta \log(X) \]
gap07 %>% mutate( log_GDP = log(gdpPercap) ) %>% 
  lm( lifeExp ~ log_GDP, data = .) %>% tidy()
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)     4.95     3.86       1.28 2.02e- 1
## 2 log_GDP         7.20     0.442     16.3  4.12e-34
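
As a minimal sketch of the same idea on simulated data (the variables `gdp` and `life` below are hypothetical stand-ins, not the `gap07` data), the transformation can also be written directly inside the model formula. The comparison with a straight-line fit is only illustrative.

```r
# Simulated stand-in for the Gapminder variables (hypothetical data, not gap07)
set.seed(1)
n    <- 200
gdp  <- exp(runif(n, log(500), log(50000)))   # skewed, GDP-like predictor
life <- 5 + 7 * log(gdp) + rnorm(n, sd = 3)   # truth is linear in log(gdp)

fit_log <- lm(life ~ log(gdp))   # transformation applied inside the formula
fit_lin <- lm(life ~ gdp)        # straight-line fit, for comparison

coef(fit_log)
c(log = summary(fit_log)$r.squared, linear = summary(fit_lin)$r.squared)
```

On interpreting the slope: in \(Y \sim \alpha + \beta \log(X)\), doubling \(X\) changes the predicted \(Y\) by \(\beta \log(2)\). With the estimate of 7.20 reported above, that is roughly \(7.20 \times 0.693 \approx 5.0\) years of life expectancy per doubling of GDP per capita.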

Nonparametric Regression

  • Fit flexible model that can take different forms
gap07 %>% ggplot( aes(y=lifeExp, x=gdpPercap) ) + 
  geom_point() + geom_smooth(method = "gam", formula = y ~ s(x))

Basis Functions

  • Many functions can be expressed as linear combinations of simpler basis functions
    • E.g. polynomial is combination of powers of \(x\) \[f(x) = \beta_0 + \beta_1 x + \ldots + \beta_q x^q\]
  • More generally, for set of basis functions \(\{h_i(\cdot)\}\) \[f(x) = \beta_0 + \beta_1 h_1(x) + \ldots + \beta_m h_m(x)\]

  • Fit as a multiple regression, treating each \(h_i(X)\) as a predictor
    • Number/choice of basis functions can depend on data
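
The basis-function fit described above can be sketched as follows. The data are simulated, and the basis (a linear term plus hinge functions \(\max(x - k, 0)\) at hand-picked knots) is a hypothetical choice made only for illustration. The point is that once the basis columns are built, the fit is ordinary least squares.

```r
set.seed(2)
x <- sort(runif(150, 0, 10))
y <- sin(x) + rnorm(150, sd = 0.3)

# Hypothetical basis: h1(x) = x, plus hinge functions max(x - k, 0) at chosen knots
knots <- c(2.5, 5, 7.5)
H <- cbind(x, sapply(knots, function(k) pmax(x - k, 0)))
colnames(H) <- c("h1", paste0("hinge_", knots))

# Multiple regression on the basis columns
fit_basis <- lm(y ~ H)

# Same coefficients from solving the normal equations directly
X       <- cbind(1, H)
beta_ls <- solve(crossprod(X), crossprod(X, y))

cbind(lm = coef(fit_basis), normal_eq = as.vector(beta_ls))
```

Changing `knots` changes the set of basis functions and hence the shapes the fit can take, which is one sense in which the number and choice of basis functions can depend on the data.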

Example

  • Cubic regression
gap07 %>% 
  mutate( X1 = gdpPercap, X2 = X1^2, X3 = X1^3) %>% 
  lm( lifeExp ~ X1 + X2 + X3, data = .) %>% tidy()
## # A tibble: 4 x 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  5.30e+ 1  1.31e+ 0     40.4  2.33e-78
## 2 X1           2.73e- 3  3.55e- 4      7.69 2.47e-12
## 3 X2          -9.27e- 8  1.99e- 8     -4.65 7.76e- 6
## 4 X3           1.01e-12  3.00e-13      3.37 9.83e- 4
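
The very small X3 estimate above reflects the scale of the cubed predictor rather than a negligible effect. As a hedged aside on simulated data (the variables `x` and `y` are hypothetical, not `gap07`), the same cubic can be written inside the formula with `I()`, or with `poly()`, which uses an orthogonal polynomial basis. Both span the same cubic polynomials, so the fitted values should agree up to rounding.

```r
set.seed(3)
x <- runif(120, 500, 50000)
y <- 50 + 25 * (1 - exp(-x / 8000)) + rnorm(120, sd = 2)

fit_raw  <- lm(y ~ x + I(x^2) + I(x^3))   # raw powers, like the X1, X2, X3 columns
fit_orth <- lm(y ~ poly(x, 3))            # orthogonal cubic basis

# Different coefficients, same fitted curve
max(abs(fitted(fit_raw) - fitted(fit_orth)))
```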

Smooth Nonparametric Regression

  • Smooth nonlinear functions can be represented by piecewise-polynomial basis functions called splines
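
A minimal sketch of a spline fit, assuming simulated data and an arbitrary basis size (`df = 6`): `bs()` from the `splines` package (shipped with R) builds a cubic B-spline basis, and `lm()` then fits it like any other set of basis functions.

```r
library(splines)   # ships with base R

set.seed(4)
x <- sort(runif(200, 0, 10))
y <- sin(x) + 0.1 * x + rnorm(200, sd = 0.3)

# Cubic B-spline basis; df = 6 is an arbitrary choice for this sketch
fit_spline <- lm(y ~ bs(x, df = 6))
fit_line   <- lm(y ~ x)

# Residual standard deviation of each fit
c(spline = sigma(fit_spline), line = sigma(fit_line))
```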

Generalised Additive Model

  • Generalised additive model (GAM) describes \(Y\) as sum of (smooth) functions of \(X\)'s
    \[Y \sim s_1(X_1) + s_2(X_2) + \ldots\]

  • gam() function in mgcv package
    • formula similar to lm()
    • use s() for smooth function (spline-based)
    • can also include linear or categorical terms
library(mgcv)
gam_out = gam( lifeExp ~ s(gdpPercap) + continent, data = gap07)
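
To see what such a fit returns, here is a small sketch on simulated data (the data frame `dat` and its columns are hypothetical; the true group offsets are set to 0, 1 and 2). The smooth term is summarised by its effective degrees of freedom, while the categorical term gets ordinary regression coefficients.

```r
library(mgcv)

set.seed(5)
n     <- 300
group <- factor(sample(c("A", "B", "C"), n, replace = TRUE))
x     <- runif(n, 0, 10)
shift <- c(0, 1, 2)[as.integer(group)]          # true group offsets
y     <- 2 * log1p(x) + shift + rnorm(n, sd = 0.4)
dat   <- data.frame(y, x, group)

fit_gam <- gam(y ~ s(x) + group, data = dat)

summary(fit_gam)$s.table                # edf of the smooth term s(x)
coef(fit_gam)[c("groupB", "groupC")]    # offsets relative to group A
```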

Example

gap07 %>% mutate( pred = predict( gam_out, gap07 ) ) %>% 
  ggplot( aes(y=lifeExp, x=gdpPercap, col = continent) ) + 
  geom_point() + geom_line( aes(y = pred) )

Regression Trees

  • Regression tree models \(Y\) as (multivariate) step function of \(X\)'s
    • Non-smooth, nonparametric model

Example

  • Trees can seamlessly accommodate categorical \(X\)'s
library(rpart)
tree = rpart( lifeExp ~ gdpPercap + continent, data = gap07)
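
The step-function nature of a tree can be checked directly. In this sketch on simulated data (the variables `x`, `g` and `y` are hypothetical), every observation falling in the same leaf receives the same prediction, so the number of distinct fitted values should match the number of leaves.

```r
library(rpart)

set.seed(6)
n <- 300
x <- runif(n, 0, 10)
g <- factor(sample(c("A", "B"), n, replace = TRUE))
y <- ifelse(x < 4, 50, 70) + ifelse(g == "B", 5, 0) + rnorm(n, sd = 2)
dat <- data.frame(y, x, g)

tree_fit <- rpart(y ~ x + g, data = dat)
pred     <- predict(tree_fit, dat)

n_leaves <- sum(tree_fit$frame$var == "<leaf>")
c(distinct_predictions = length(unique(pred)), leaves = n_leaves)
```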

Example

library(rpart.plot); rpart.plot(tree)

Overfitting

  • Nonparametric models are prone to overfitting
    • Too much flexibility can harm out-of-sample performance
tree = rpart( lifeExp ~ gdpPercap, data = gap07, 
       control = rpart.control(cp = 0, minbucket=3))
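
The effect of these control settings can be sketched on simulated data, where half of the observations are held out (the data, the split, and the helper `rmse()` are all hypothetical choices for illustration). The deep tree uses the same `cp = 0, minbucket = 3` settings as above; the comparison tree uses the `rpart` defaults.

```r
library(rpart)

set.seed(7)
n <- 2000
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.5)
dat <- data.frame(x, y)

train <- dat[1:1000, ]
test  <- dat[1001:2000, ]

# Root mean squared prediction error on a given data set
rmse <- function(fit, newdata) sqrt(mean((newdata$y - predict(fit, newdata))^2))

deep    <- rpart(y ~ x, data = train, control = rpart.control(cp = 0, minbucket = 3))
default <- rpart(y ~ x, data = train)

rbind(deep    = c(train = rmse(deep, train),    test = rmse(deep, test)),
      default = c(train = rmse(default, train), test = rmse(default, test)))
```

The gap between the train and test columns for the deep tree is the symptom to look for: the extra flexibility is spent on fitting noise in the training data.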

Predictions

  • Model predictions generated with predict()
    • Generic function with methods for lm(), gam() and rpart() fits
  • In-sample performance tends to be too optimistic
    • Underestimates standard deviation of prediction errors
  • Use test set to estimate out-of-sample performance
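
A minimal sketch of this workflow, assuming simulated data and a random one-third hold-out (the data frame `dat`, the split size, and the helper `rmse()` are hypothetical): the same `predict()` call is used for a linear model, a GAM and a tree, and errors are computed on both the training and test sets.

```r
library(mgcv)
library(rpart)

set.seed(8)
n <- 600
x <- runif(n, 0, 10)
y <- 3 * log1p(x) + rnorm(n, sd = 0.5)
dat <- data.frame(x, y)

test_idx <- sample(n, 200)      # hold out one third as a test set
train <- dat[-test_idx, ]
test  <- dat[test_idx, ]

fits <- list(
  lm   = lm(y ~ x, data = train),
  gam  = gam(y ~ s(x), data = train),
  tree = rpart(y ~ x, data = train)
)

# predict() dispatches to predict.lm, predict.gam or predict.rpart
rmse <- function(fit, newdata) sqrt(mean((newdata$y - predict(fit, newdata))^2))
res  <- sapply(fits, function(f) c(train = rmse(f, train), test = rmse(f, test)))
res
```

The `test` row is the estimate of out-of-sample performance; the `train` row is the in-sample figure that tends to look better than it should.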