Monte Carlo Methods

Statistical Inference

R Packages

library(tidyverse)
theme_set(theme_bw())

Hypothesis Testing

  • Hypothesis Testing

  • t-tests

  • Sampling Distribution

  • Monte Carlo Hypothesis Testing

  • Power Analysis

Hypothesis Tests

Hypothesis tests are used to assess whether a claim is supported by data. This is done by collecting data and setting up the null and alternative hypotheses.

Null Hypothesis \(H_0\)

The null hypothesis is the claim that is initially assumed to be true. It typically states that the parameter is equal to the hypothesized value.

Alternative Hypothesis \(H_a\)

The alternative hypothesis contradicts the null hypothesis.

Example of Null and Alternative Hypothesis

We want to see if \(\mu\) is different from \(\mu_0\)

Null Hypothesis Alternative Hypothesis
\(H_0: \mu=\mu_0\) \(H_a: \mu\ne\mu_0\)
\(H_0: \mu\le\mu_0\) \(H_a: \mu>\mu_0\)
\(H_0: \mu\ge\mu_0\) \(H_a: \mu<\mu_0\)

One-Sided vs Two-Sided Hypothesis Tests

Notice that there are three pairs of null and alternative hypotheses. The first (\(H_a:\mu\ne\mu_0\)) is a two-sided hypothesis because the rejection region is split across both tails of the distribution. The remaining two are one-sided because the rejection region lies on only one side of the distribution.

Null Hypothesis Alternative Hypothesis Side
\(H_0: \mu=\mu_0\) \(H_a: \mu\ne\mu_0\) 2-sided
\(H_0: \mu\le\mu_0\) \(H_a: \mu>\mu_0\) 1-sided
\(H_0: \mu\ge\mu_0\) \(H_a: \mu<\mu_0\) 1-sided

t-tests

  • Hypothesis Testing

  • t-tests

  • Sampling Distribution

  • Monte Carlo Hypothesis Testing

  • Power Analysis

t-tests

t-tests are commonly used to determine whether the mean of a small sample differs from a hypothesized value.

\[ H_0:\ \mu = \mu_0 \]

Assumptions

  • Data comes from a normal distribution (\(N(\mu, \sigma^2)\))
  • The population variance \(\sigma^2\) is unknown
  • Sample is small (\(n<30\))
  • When \(n\geq30\), use a z-test instead, thanks to the CLT

T-statistic

\[ T = \frac{\bar x -\mu_0}{s/\sqrt{n}} \sim t(n-1) \]

  • \(\bar x\): sample mean
  • \(s\): sample standard deviation
  • \(n\): sample size
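
A quick sketch (my own toy example, not one of the sample problems): compute \(T\) by hand for a hypothetical sample and check it against R’s built-in t.test().

Code
set.seed(1)
x <- rnorm(10, mean = 3.5, sd = 2)                  # hypothetical small sample
tstat <- (mean(x) - 3) / (sd(x) / sqrt(length(x)))  # T for H0: mu = 3
tstat
t.test(x, mu = 3)$statistic                         # matches the hand computation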

Why t distribution?

\[ t = \frac{N(0,1)}{\sqrt{\chi^2(n-1)/(n-1)}} \sim t(n-1) \] Knowing the distribution will allow us to compute the p-value!
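
A minimal sketch of this construction: build the ratio from a standard normal and an independent \(\chi^2(23)\) and compare its quantiles to qt().

Code
set.seed(1)
z <- rnorm(1e5)                     # N(0, 1)
w <- rchisq(1e5, df = 23)           # chi-square with n - 1 = 23 df
t_built <- z / sqrt(w / 23)         # the ratio above
quantile(t_built, c(0.025, 0.975))  # close to the t(23) quantiles below
qt(c(0.025, 0.975), 23)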

Monte Carlo Simulations

Use Monte Carlo simulations to show that transforming a data set into the T-statistic yields a t-distribution. Use \(H_0: \mu = 3\) and simulate 24 RVs from \(N(3, 9)\). Then examine what the distribution of the statistic looks like, assuming the null hypothesis is true.

Code
x <- rnorm(24, 3, 3)

Answer:

Code
t_stat <- function(i){
  x <- rnorm(24, 3, 3)                    # one sample under H0: mu = 3
  tt <- (mean(x) - 3)/(sd(x)/sqrt(24))    # its t statistic
  return(tt)
}
t_dist <- sapply(1:100000, t_stat)        # empirical sampling distribution
xe <- seq(-5, 5, length.out = 100)
t_dist |> tibble(x = _) |> 
  ggplot(aes(x, y = after_stat(density))) +
  geom_histogram() +
  geom_line(data = tibble(x = xe, y = dt(xe, 23)),  # theoretical t(23) density
            mapping = aes(x, y),
            col = "red", lwd = 2)

Sampling Distribution

  • Hypothesis Testing

  • t-tests

  • Sampling Distribution

  • Monte Carlo Hypothesis Testing

  • Power Analysis

Sample

A sample is a collection of random variables (data) that is believed to be independent and identically distributed (come from the same distribution).

\[ \mathbf{X} = (X_1, X_2, \cdots, X_n)^{\mathrm{T}} \]

Statistic

A statistic is a transformation of data by a function.

\[ T(\mathbf{X}) \]

Common Statistics

Statistic Function
Min \(\mathrm{min}(\mathbf{X})\)
Max \(\mathrm{max}(\mathbf{X})\)
Mean \(\frac{1}{n}\sum^n_{i=1}X_i\)
Median \(m\) such that \(P(X_i \le m) = 0.5\)
SD \(\sqrt{\frac{1}{n-1}\sum^n_{i=1}(X_i-\bar X)^2}\)

Sampling Distributions

A sampling distribution is the distribution of a statistic.

\[ X_i \stackrel{iid}{\sim} N(\mu, \sigma^2) \]

Statistic Distribution
\(\bar X\) \(N(\mu, \sigma^2/n)\)
\((n-1)s^2/\sigma^2\) \(\chi^2(n-1)\)

Central Limit Theorem

Let \(X_1, X_2, \ldots, X_n\) be independent and identically distributed random variables with \(E(X_i)=\mu\) and \(Var(X_i) = \sigma^2\). We define

\[ Y_n = \sqrt n \left(\frac{\bar X-\mu}{\sigma}\right) \mathrm{ where }\ \bar X = \frac{1}{n}\sum^n_{i=1}X_i. \]

Then, the distribution of the function \(Y_n\) converges to a standard normal distribution function as \(n\rightarrow \infty\).
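
A small sketch (an assumed example using Exponential(1) data, where \(\mu = \sigma = 1\)): the standardized mean \(Y_n\) behaves like a standard normal.

Code
set.seed(1)
n <- 50
y_n <- replicate(10000, sqrt(n) * (mean(rexp(n, rate = 1)) - 1) / 1)
mean(y_n)  # approximately 0
sd(y_n)    # approximately 1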

Obtaining Sampling Distributions

Several statistics have distributions that we can conduct inference on such as \(t\), \(\chi^2\), and \(F\).

How do we obtain sampling distributions for other types of statistics?

  • Distribution Functions
  • Density Functions
  • Moment-Generating Functions

Using Monte Carlo Methods

  • Simulating random variables produces a sample that follows the target distribution function.
  • The sample itself defines the empirical cumulative distribution function (ECDF), which approximates the true CDF.
  • Transforming the simulated random variables by the statistic constructs the ECDF of the statistic’s sampling distribution.

Sampling Distribution: \(\bar X\)

\[ X_i \stackrel{iid}{\sim} N(5, 2) \]

\[ \bar X \sim N(5, 2/n) \]

MC Sampling Distribution: \(\bar X\)

Code
sim1 <- function(i){
  x <- rnorm(1000, mean = 5, sd = sqrt(2))
  return(mean(x))
}
results <- replicate(10000, sim1(1))

xe <- seq(4.8, 5.2, length.out = 100)
tibble(x = results) |> 
  ggplot(aes(x, y = after_stat(density))) +
  geom_histogram() +
  geom_line(data = tibble(x = xe, y = dnorm(xe, 5, sqrt(2/1000))),
            mapping = aes(x,y),
            col =  "red", lwd = 2)

MC Sampling Distribution: \(\hat s^2\)

Code
sim2 <- function(i){
  x <- rnorm(1000, mean = 5, sd = sqrt(2))
  return(var(x))
}
results <- replicate(10000, sim2(1))

xe <- seq(850, 1150, length.out = 100)
tibble(x = 999 * results / 2) |> 
  ggplot(aes(x, y = after_stat(density))) +
  geom_histogram() +
  geom_line(data = tibble(x = xe, y = dchisq(xe, 999)),
            mapping = aes(x,y),
            col =  "red", lwd = 2)

Maximum

\[ f_{max} (x) = n f(x) F(x)^{n-1} \]

Code
sim3 <- function(i){
  x <- rnorm(1000, mean = 5, sd = sqrt(2))
  return(max(x))
}
results <- replicate(10000, sim3(1))

xe <- seq(8, 12, length.out = 100)
tibble(x = results) |> 
  ggplot(aes(x, y = after_stat(density))) +
  geom_histogram() +
  geom_line(data = tibble(x = xe, 
                          y = 1000 * dnorm(xe, 5, sqrt(2)) * pnorm(xe, 5, sqrt(2))^999),
            mapping = aes(x,y),
            col =  "red", lwd = 2)

Minimum

\[ f_{min}(x) = n\{1-F(x)\}^{n-1}f(x) \]

Code
sim4 <- function(i){
  x <- rnorm(1000, mean = 5, sd = sqrt(2))
  return(min(x))
}
results <- replicate(10000, sim4(1))

xe <- seq(-2.5, 2, length.out = 100)
tibble(x = results) |> 
  ggplot(aes(x, y = after_stat(density))) +
  geom_histogram() +
  geom_line(data = tibble(x = xe, 
                          y = 1000 * (1 - pnorm(xe, 5, sqrt(2)))^999 * dnorm(xe, 5, sqrt(2))),
            mapping = aes(x,y),
            col =  "red", lwd = 2)

Monte Carlo Hypothesis Testing

  • Hypothesis Testing

  • t-tests

  • Sampling Distribution

  • Monte Carlo Hypothesis Testing

  • Power Analysis

Monte Carlo Hypothesis Testing

Monte Carlo hypothesis testing is the process of constructing a sampling distribution of the test statistic given a null distribution.

Using the test statistic computed from the sample, identify its location within the empirical CDF.

Conduct a hypothesis test based on the location with a Monte Carlo p-value.

Monte Carlo Algorithm

  1. Simulate data from the null distribution (\(H_0\))

  2. Construct the test statistic from the simulated data.

  3. Repeat steps 1 and 2 to construct an empirical distribution of the null hypothesis.

  4. Construct the test statistic from the sample data.

  5. Count the number of simulated test statistics that are more extreme than the sample test statistic.

  6. Compute:

\[ p = \frac{m +1}{n + 1} \]

  • \(m\): number of extreme values
  • \(n\): number of simulated test statistics
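
The steps above can be wrapped in a small helper. This is a minimal sketch; the names mc_pvalue and sim_stat are placeholders, and the counting shown is for a two-sided test.

Code
mc_pvalue <- function(t_data, sim_stat, n_sim = 10000){
  t_sim <- replicate(n_sim, sim_stat())  # steps 1-3: empirical null distribution
  m <- sum(abs(t_sim) >= abs(t_data))    # step 5: count extreme values (two-sided)
  (m + 1) / (n_sim + 1)                  # step 6: Monte Carlo p-value
}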

Extreme values

Extreme values are values that satisfy these conditions:

Hypothesis Condition
\(H_a: \mu\ne\mu_0\) \(|t_{data}| \leq |t_{sim}|\)
\(H_a: \mu>\mu_0\) \(t_{data} \leq t_{sim}\)
\(H_a: \mu<\mu_0\) \(t_{data} \geq t_{sim}\)

Specifying the Null Distribution

The null distribution follows the hypothesized data distribution, with parameters determined by the null hypothesis and the sample size!

\[ H_0: \mu =\mu_0 \] \[ X\sim N(\mu_0, \sigma^2) \]

t-test Example

Use a Monte Carlo hypothesis test for \(H_0: \mu = 3\) and \(H_a: \mu \neq 3\). Simulate fake data: 24 RVs from \(N(3, 9)\). Then test where the sample statistic falls in the null distribution, which is constructed assuming the null hypothesis is true.

Code
x <- rnorm(24, 3, 3)

Answer:

Code
tstat <- (mean(x) - 3)/(sd(x)/sqrt(24))   # observed test statistic


t_stat_sim <- function(i){
  x <- rnorm(24, 3, 3)                    # simulate under H0: mu = 3
  tt <- (mean(x) - 3)/(sd(x)/sqrt(24))
  return(tt)
}
t_dist <- sapply(1:100000, t_stat_sim)    # empirical null distribution

t_dist |> tibble(x = _) |> 
  ggplot(aes(x, y = after_stat(density))) +
  geom_histogram() +
  geom_vline(aes(xintercept = tstat), col = "red")
Code
sum(abs(tstat) < abs(t_dist))
#> [1] 69032
Code
(sum(abs(tstat) < abs(t_dist)) + 1) / (length(t_dist) + 1)
#> [1] 0.6903231

t-test Example

Use a Monte Carlo hypothesis test for \(H_0: \mu = 6\) and \(H_a: \mu \neq 6\). Simulate fake data: 24 RVs from \(N(3, 9)\). Then test where the sample statistic falls in the null distribution, which is constructed assuming the null hypothesis is true.

Code
x <- rnorm(24, 3, 3)

Answer:

Code
tstat <- (mean(x) - 6)/(sd(x)/sqrt(24))


t_stat_sim <- function(i){
  x <- rnorm(24, 6, 3)                    # simulate under H0: mu = 6, not the data-generating mean
  tt <- (mean(x) - 6)/(sd(x)/sqrt(24))
  return(tt)
}
t_dist <- sapply(1:100000, t_stat_sim)

t_dist |> tibble(x = _) |> 
  ggplot(aes(x, y = after_stat(density))) +
  geom_histogram() +
  geom_vline(aes(xintercept = tstat), col = "red")

sum(abs(tstat) < abs(t_dist))
(sum(abs(tstat) < abs(t_dist)) + 1) / (length(t_dist) + 1)

Uniform Example

Using the sample data, test whether \(H_0: \theta = 10\) vs \(H_a: \theta < 10\) for a sample of 35. You believe that the data came from a \(U(0, \theta)\) distribution.

Code
y <- runif(35, 0, 8)

Answer:

Code
tstat <- max(y)                          # observed statistic: the sample maximum

t_stat_sim <- function(i){
  x <- runif(35, 0, 10)                  # simulate under H0: theta = 10
  tt <- max(x)
  return(tt)
}
t_dist <- sapply(1:100000, t_stat_sim)

t_dist |> tibble(x = _) |> 
  ggplot(aes(x, y = after_stat(density))) +
  geom_histogram() +
  geom_vline(aes(xintercept = tstat), col = "red")

sum(tstat > t_dist)
(sum(tstat > t_dist) + 1) / (length(t_dist) + 1)

Example

The fdeaths data set in R contains the monthly death count from specific lung diseases in the UK from 1974 to 1979 (72 months). Test whether the monthly average (\(\theta\)) death count in fdeaths is greater than 550, assuming the counts follow a Poisson distribution.

Answer:

Code
tstat <- mean(fdeaths)                   # observed statistic: the sample mean

t_stat_sim <- function(i){
  x <- rpois(72, 550)                    # simulate 72 months under H0: theta = 550
  tt <- mean(x)
  return(tt)
}
t_dist <- sapply(1:100000, t_stat_sim)

t_dist |> tibble(x = _) |> 
  ggplot(aes(x, y = after_stat(density))) +
  geom_histogram() +
  geom_vline(aes(xintercept = tstat), col = "red")

sum(tstat <= t_dist)                     # H_a: theta > 550, so extremes satisfy t_data <= t_sim
(sum(tstat <= t_dist) + 1) / (length(t_dist) + 1)

Power Analysis

  • Hypothesis Testing

  • t-tests

  • Sampling Distribution

  • Monte Carlo Hypothesis Testing

  • Power Analysis

Type 1 and 2 Error

Code
a1 <- data.frame(
   x = c(2.5,3.5),
   y = c(41, 41),
   label = c("Yes", "No")
)

a2 <- data.frame(
   x = c(1.9),
   y = c(25, 35),
   label = c("False", "True")
)
a3 <- tibble(x = c(1.7,3), 
             y = c(30, 43), 
             label = c("H0", "Reject H0"))

yay <- tibble(x = c(2.5, 3.5), 
              y = c(25, 35), 
              label = c("Yay!", "Yay!"))

type1 <- tibble(x = c(2.5, 3.5),
                y = c(36.5, 26.5),
                label = c("Type I Error", "Type II Error"))

type2 <- tibble(x = c(3.5, 2.5),
                y = c(24.5, 34.5),
                label = c("beta", "alpha"))


# basic graph
p <- ggplot() + theme_void()

# Add rectangles
p + annotate("rect", 
             xmin=c(2,3), xmax=c(3,4), 
             ymin=c(20,20), ymax=c(30,30), 
             alpha=0.2, color="green", fill="green") + 
    annotate("rect", 
             xmin=c(2,3), xmax=c(3,4), 
             ymin=c(30,30), ymax=c(40,40), 
             alpha=0.2, color="red", fill="red") +
    annotate("rect",
             xmin=c(3,2), xmax=c(4,3),
             ymin=c(30, 20), ymax=c(40, 30),
             alpha=0.8, color="royalblue", fill="royalblue") +
    geom_text(data=a1, aes(x=x, y=y, label=label),
             size=10 , fontface="bold" ) +
    geom_text(data=a3, aes(x=x, y=y, label=label),
             size=10 , fontface="bold.italic" ) +
    geom_text(data=a2, aes(x=x, y=y, label=label),
             size=10, angle = 90, fontface="bold" ) +
    geom_text(data=yay, aes(x=x, y=y, label=label),
             size=10, fontface="bold.italic" ) +
    geom_text(data=type1, aes(x=x, y=y, label=label),
             size=10, fontface="bold") +
    geom_text(data=type2, aes(x=x, y=y, label=label),
             size=10, fontface="bold", parse = T ) 

Type 1 and 2 Error

Code
x <- seq(-4, 4, length.out = 1000)
xx <- seq(-1, 7, length.out = 1000)
dt_one<-function(x){
            y <- dnorm(x)
            y[x < qnorm(.95)] <- NA      # shade only beyond the cutoff at qnorm(.95) = 1.645
            return(y)
}
dt_two<-function(x){
            y <- dnorm(x, mean = 3)
            y[x > 1.645] <-NA
            return(y)
}
df1 <- tibble(x = x, y = dnorm(x))
df2 <- tibble(x = xx, y = dnorm(xx, mean = 3))

a1 <- tibble(x = c(0,3), y = c(0.43, 0.43), label = c("H0", "H1"))
a2 <- tibble(x = c(2.05,1.25), y = c(0.02, 0.02), label = c(paste("alpha"), "beta"))

df1 |> 
  ggplot(aes(x, y)) +
    stat_function(fun = dt_one, geom = "area", fill = "red") +
    geom_line() +
    geom_line(aes(x, y), df2) +
    stat_function(fun = dt_two, geom = "area", fill = "green") +
    geom_vline(aes(xintercept = 1.645), lwd = 1.5) +
    geom_text(data=a1, aes(x=x,y=y, label=label),
              size=14 , fontface="bold.italic" ) +
    geom_text(data=a2, aes(x=x,y=y, label=label),
              size=10 , fontface="bold.italic", parse = T) +
    theme_bw()

Power

Power is the probability of rejecting \(H_0\) given that \(H_1\) is true.

\[ Power = 1-\beta \]

Power

We want to ensure we have high power and a low Type I error rate (\(\alpha\)).

As practitioners, we control \(\alpha\): we set it before conducting the study, usually at 0.05.

We cannot set power directly because it depends on the effect size, \(\alpha\), and the sample size.

Power Relationships

  • \(\alpha \uparrow\) \(\rightarrow\) \(Power\ \uparrow\)
  • \(n \uparrow\) \(\rightarrow\) \(Power\ \uparrow\)
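
Power can also be estimated with Monte Carlo methods. A sketch (my own illustration, assuming \(H_0: \mu = 3\), a true mean of 4, \(\sigma = 3\), and \(n = 24\)):

Code
set.seed(1)
reject <- replicate(10000, {
  x <- rnorm(24, mean = 4, sd = 3)   # data generated under H1
  t.test(x, mu = 3)$p.value < 0.05   # reject H0 at alpha = 0.05?
})
mean(reject)                         # Monte Carlo estimate of power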

Power Analysis in R

The pwr package in R will conduct power and sample size calculations for several common statistical tests. For more information, you can check its vignette and its GitHub repository.
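
For example (a sketch, assuming pwr is installed), the sample size needed for a one-sample t-test with a medium effect size:

Code
library(pwr)
# solve for n: effect size d = 0.5, alpha = 0.05, power = 0.80
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.80, type = "one.sample")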

Conceptual

Code
x <- seq(-4, 4, length.out = 1000)
xx <- seq(-1, 7, length.out = 1000)
dt_one<-function(x){
            y <- dnorm(x)
            y[x < qnorm(.95)] <- NA      # shade only beyond the cutoff at qnorm(.95) = 1.645
            return(y)
}
dt_two<-function(x){
            y <- dnorm(x, mean = 3)
            y[x > 1.645] <-NA
            return(y)
}
df1 <- tibble(x = x, y = dnorm(x))
df2 <- tibble(x = xx, y = dnorm(xx, mean = 3))

a1 <- tibble(x = c(0,3), y = c(0.43, 0.43), label = c("H0", "H1"))
a2 <- tibble(x = c(2.05,1.25), y = c(0.02, 0.02), label = c(paste("alpha"), "beta"))

df1 |> 
  ggplot(aes(x, y)) +
    stat_function(fun = dt_one, geom = "area", fill = "red") +
    geom_line() +
    geom_line(aes(x, y), df2) +
    stat_function(fun = dt_two, geom = "area", fill = "green") +
    geom_vline(aes(xintercept = 1.645), lwd = 1.5) +
    geom_text(data=a1, aes(x=x,y=y, label=label),
              size=14 , fontface="bold.italic" ) +
    geom_text(data=a2, aes(x=x,y=y, label=label),
              size=10 , fontface="bold.italic", parse = T) +
    theme_bw()