Simulation Studies

Linear and Generalized Linear Models

R Packages

Code
library(tidyverse)
library(Hmisc)
library(rms)
library(gam)
theme_set(theme_bw())

Simulation Studies

  • Simulation Studies

  • Linear Models

  • Generalized Linear Models

  • Nonparameteric Models

Linear Models

  • Simulation Studies

  • Linear Models

  • Generalized Linear Models

  • Nonparameteric Models

Generalized Linear Models

  • Simulation Studies

  • Linear Models

  • Generalized Linear Models

  • Nonparameteric Models

Exponential Family of Distributions

An exponential family of distributions are random variables that allow their probability density function to have the following form:

\[ f(y; \theta,\phi) = a(y,\phi)\exp\left\{\frac{y\theta-\kappa(\theta)}{\phi}\right\} \]

  • \(\theta\): is the canonical parameter (also a function of other parameters)

  • \(\kappa(\theta)\): is a known cumulant function

  • \(\phi>0\): dispersion parameter function

  • \(a(y,\phi)\): normalizing constant

Canonical Parameter

The canonical parameter represents the relationship between the random variable and the \(E(Y)=\mu\)

Normal Distribution

\[ f(y;\mu,\sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y-\mu)^2}{2\sigma^2}} \]

Binomial Distribution

\[ f(y;n,p)=\big(^n_y\big) p^y(1-p)^{n-y} \]

Poisson Distribution

\[ f(y;\lambda) = \frac{e^{-\lambda}\lambda^y}{y!} \]

Negative Binomial Distribution

\[ f(y;\mu, \theta) = \left(\begin{array}{c} y+\theta+1\\ y \end{array}\right) \left(\frac{\mu}{\mu+\theta}\right)^y \left(\frac{\theta}{\mu+\theta}\right)^\theta \]

  • \(\mu\): average count
  • \(\theta\): dispersion of data
  • \(E(Y)=\mu\)
  • \(Var(Y) = \mu + \mu^2/\theta\)

Gamma Distribution

Common Distributions and Canonical Parameters

Random Variable Canonical Parameter
Normal \(\mu\)
Binomial \(\log\left(\frac{\mu}{1-\mu}\right)\)
Negative Binomial \(\log\left(\frac{\mu}{\mu+k}\right)\)
Poisson \(\log(\mu)\)
Gamma \(-\frac{1}{\mu}\)
Inverse Gaussian \(-\frac{1}{2\mu^2}\)

Generalized Linear Models

A generalized linear model (GLM) is used to model the association between an outcome variable (of any data type) and a set of predictor values. We estimate a set of regression coefficients \(\boldsymbol \beta\) to explain how each predictor is related to the expected value of the outcome.

Generalized Linear Models

A GLM is composed of a systematic and random component.

Random Component

The random component is the random variable that defines the randomness and variation of the outcome variable.

Systematic Component

The systematic component is the linear model that models the association between a set of predictors and the expected value of Y:

\[ g(\mu)=\eta=\boldsymbol X_i^\mathrm T \boldsymbol \beta \]

  • \(\boldsymbol\beta\): regression coefficients

  • \(\boldsymbol X_i=(1, X_{i1}, \ldots, X_{ip})^\mathrm T\): design vector

  • \(\eta\): linear model

  • \(\mu=E(Y)\)

  • \(g(\cdot)\): link function

Simulating Bernoulli

Code
x <- rnorm(1000)
eta <- boot::inv.logit(-0.85 + 1.3 * x)
y <- rbinom(1000, size = 1, prob = eta)

Simulating Poisson RV

Code
x <- rnorm(1000)
eta <- exp(-0.85 + 1.3 * x)
y <- rpois(1000, lambda = eta)

Simulating Negative Binomial RV

Code
x <- rnorm(1000)
eta <- exp(-0.85 + 1.3 * x)
y <- rnbinom(1000, mu = eta, size = 0.5)

Simulating a Gamma RV

Code
x <- rnorm(1000)
y_true <- exp(0.75 + 1.3 * x)
y <- rgamma(1000, rate = 10 / y_true, shape = 10)

Bayesian Regression Model in R

Code
library(brms)
1brm(formula,
2    data,
3    family)
1
Supply a formula for R
2
Supply the data frame
3
Which family and link function is used to model data

Logistic (Binomial) Regression

Logistic Regression is used when your outcome is binary:

Code
brm(y~x, 
    data, 
    family = bernoulli())

Poisson Regression

Poisson Regression is used when the outcome is count data:

Code
brm(y~x, 
    data, 
    family = poisson())

Gamma Regression

Gamma Regression is used when modeling the association between predictors and positive continuous values:

Code
brm(y~x, 
    data, 
    family = Gamma())

Negative Binomial Regression

Negative Binomial Regression is used four with overdispersed count data, where the variance is larger than expected.

Code
brm(y~x, 
    data,
    family = negbinomial())

Nonparameteric Models

  • Simulation Studies

  • Linear Models

  • Generalized Linear Models

  • Nonparameteric Models

Nonparameteric Models

\[ Y = f(X) + \varepsilon \]

Nonlinear Models

\[ \hat Y = \hat f(X) \]

Nonlinear Models

  • Parameteric Model
  • Regression Splines
  • Penalized Splines

Nonlinear Models

Code
x <- rnorm(1000, 2)
y <- 3 + sinpi(x/2) + rnorm(1000, sd = 0.3)
df <- tibble(x, y)
xlm <- df |> lm(y ~ rcs(x, parms = 10), data = _)
df |> ggplot(aes(x, y)) +
  geom_point() +
  geom_line(aes(x, predict(xlm)), col = "red") 

Nonlinear Models

Code
x <- rnorm(1000, -4)
y <- 3 +  cospi(x/2) + rnorm(1000, sd = 0.3)
df <- tibble(x, y)
xlm <- df |> lm(y ~ rcs(x, parms = 10), data = _)
df |> ggplot(aes(x, y)) +
  geom_point() +
  geom_line(aes(x, predict(xlm)), col = "red") 

Generalized Additive Models

Code
x1 <- rnorm(1000, 2)
x2 <- rnorm(1000, -4)
y <- sinpi(x1/2) + cospi(x2/2) + rnorm(1000, sd = 0.5)
df <- tibble(x1, x2, y)

xgam <- gam(y ~ rcs(x1,10) + rcs(x2, 10))
xgam$coefficients

df1 <- tibble(x = x1,
              y = sinpi(x1/2),
              pred = rcs(x1,10) %*% xgam$coefficients[2:10]) 
df2 <- tibble(x = x2,
              y = cospi(x2/2),
              pred = rcs(x2,10) %*% xgam$coefficients[11:19]) 
df1 |> ggplot(aes(x, y)) +
  geom_point() +
  geom_line(aes(x, pred), col = "red") 

df2 |> ggplot(aes(x, y)) +
  geom_point() +
  geom_line(aes(x, pred), col = "red") 

plot(xgam)