Monte Carlo Methods

Statistical Inference

R Packages

library(tidyverse)
theme_set(theme_bw())

Hypothesis Testing

Hypothesis Testing
t-tests
Sampling Distribution
Monte Carlo Hypothesis Testing
Power Analysis

Hypothesis Tests

Hypothesis tests are used to test whether claims are valid or not. This is conducted by collecting data, setting the Null and Alternative Hypothesis.

Null Hypothesis \(H_0\)

The null hypothesis is the claim that is initially believed to be true. For the most part, it is always equal to the hypothesized value.

Alternative Hypothesis \(H_a\)

The alternative hypothesis contradicts the null hypothesis.

Example of Null and Alternative Hypothesis

We want to see if \(\mu\) is different from \(\mu_0\)

Null Hypothesis	Alternative Hypothesis
\(H_0: \mu=\mu_0\)	\(H_a: \mu\ne\mu_0\)
\(H_0: \mu\le\mu_0\)	\(H_a: \mu>\mu_0\)
\(H_0: \mu\ge\mu_0\)	\(H_0: \mu<\mu_0\)

One-Side vs Two-Side Hypothesis Tests

Notice how there are 3 types of null and alternative hypothesis, The first type of hypothesis (\(H_a:\mu\ne\mu_0\)) is considered a 2-sided hypothesis because the rejection region is located in 2 regions. The remaining two hypotheses are considered 1-sided because the rejection region is located on one side of the distribution.

Null Hypothesis	Alternative Hypothesis	Side
\(H_0: \mu=\mu_0\)	\(H_a: \mu\ne\mu_0\)	2-sided
\(H_0: \mu\le\mu_0\)	\(H_a: \mu>\mu_0\)	1-sided
\(H_0: \mu\ge\mu_0\)	\(H_0: \mu<\mu_0\)	1-sided

t-tests

Hypothesis Testing
t-tests
Sampling Distribution
Monte Carlo Hypothesis Testing
Power Analysis

t-tests

t-tests are commonly used to determine whether a small sample is different from a hypothesized value.

\[ H_0:\ \mu = \mu_0 \]

Assumptions

Data Comes from a normal distribution (\(N(\mu, \sigma^2)\))
The population variance \(\sigma^2\) is unknown
Sample is small (\(n<30\))
When \(n\geq30\), use a z-test thanks to CLT

T-statistic

\[ T = \frac{\bar x -\mu_0}{s/\sqrt{n}} \sim t(n-1) \]

\(\bar x\): sample mean
\(s\): sample standard deviation
\(n\): sample size

Why t distribution?

\[ t = \frac{N(0,1)}{\sqrt{\chi^2(n-1)/(n-1)}} \sim t(n-1) \] Knowing the distribution will allow us to compute the p-value!

Monte Carlo Simulations

Use Monte Carlo Simulations to show that transforming a data set to the T-Statistic will yield a t-distribution. Use \(H_0: \mu = 3\) and simulate 24 RV’s from \(N(3, 9)\). Testing how the distribution will look like assuming that the null hypothesis is true.

Code

x <- rnorm(24, 3, 3)

Answer:

Code

t_stat <- function(i){
  x <- rnorm(24, 3, 3)
  tt <- (mean(x) - 3)/(sd(x)/sqrt(24))
  return(tt)
}
t_dist <- sapply(1:100000, t_stat)
xe <- seq(-5, 5, length.out = 100)
t_dist |> tibble(x = _) |> 
  ggplot(aes(x, y = ..density..)) +
  geom_histogram() +
  geom_line(data = tibble(x = xe, y = dt(xe, 23)),
            mapping = aes(x,y),
            col =  "red", lwd = 2)

Sampling Distribution

Hypothesis Testing
t-tests
Sampling Distribution
Monte Carlo Hypothesis Testing
Power Analysis

Sample

A sample is a collection of random variables (data) that is believed to be independent and identically distributed (come from the same distribution).

\[ \bX = (X_1, X_2, \cdots, X_n)^{\mrT} \]

Statistic

A statistic is a transformation of data by a function.

\[ T(\bX) \]

Common Statistics

Statistic	Function
Min	\(\mathrm{min}(\bX)\)
Max	\(\mathrm{max}(\bX)\)
Mean	\(\frac{1}{n}X_i\)
Median	\(P(X_0 < \bX) = 0.5\)
SD	\(\frac{1}{n-1}\sum^n_{i=1}(X_i-\bar X)^2\)

Sampling Distributions

A sampling distribution is the distribution of a statistic.

\[ X_i \stackrel{iid}{\sim} N(\mu, \sigma^2) \]

Statistic	Distribution
\(\bar X\)	\(N(\mu, \sigma^2)\)
\((n-1)s^2/\sigma^2\)	\(\chi^2(n-1)\)

Central Limit Theorem

Let \(X_1, X_2, \ldots, X_n\) be identical and independent distributed random variables with \(E(X_i)=\mu\) and \(Var(X_i) = \sigma²\). We define

\[ Y_n = \sqrt n \left(\frac{\bar X-\mu}{\sigma}\right) \mathrm{ where }\ \bar X = \frac{1}{n}\sum^n_{i=1}X_i. \]

Then, the distribution of the function \(Y_n\) converges to a standard normal distribution function as \(n\rightarrow \infty\).

Obtaining Sampling Distributions

Several statistics have distributions that we can conduct inference on such as \(t\), \(\chi^2\), and \(F\).

How do we obtain sampling distributions for other types of statistics?

Distribution Functions
Density Functions
Moment-Generating Functions

Using Monte Carlo Methods

The sample generated by simulating random variables is said to follow a distribution function.
The sample itself are the empirical cumulative density function (ECDF) of the true CDF.
Transforming the simulated random variables of the sample are said to construct the ECDF of the statistic’s sampling distribution.

Sampling Distribution: \(\bar X\)

\[ X_i \stackrel{iid}{\sim} N(5, 2) \]

\[ \bar X \sim N(5, 2/n) \]

MC Sampling Distribution: \(\bar X\)

Code

sim1 <- function(i){
  x <- rnorm(1000, mean = 5, sd = sqrt(2))
  return(mean(x))
}
results <- replicate(10000, sim1(1))

xe <- seq(4.8, 5.2, length.out = 100)
tibble(x = results) |> 
  ggplot(aes(x, y = ..density..)) +
  geom_histogram() +
  geom_line(data = tibble(x = xe, y = dnorm(xe, 5, sqrt(2/1000))),
            mapping = aes(x,y),
            col =  "red", lwd = 2)

MC Sampling Distribution: \(\hat s^2\)

Code

sim2 <- function(i){
  x <- rnorm(1000, mean = 5, sd = sqrt(2))
  return(var(x))
}
results <- replicate(10000, sim2(1))

xe <- seq(850, 1150, length.out = 100)
tibble(x = 999 * results / 2) |> 
  ggplot(aes(x, y = ..density..)) +
  geom_histogram() +
  geom_line(data = tibble(x = xe, y = dchisq(xe, 999)),
            mapping = aes(x,y),
            col =  "red", lwd = 2)

Maximum

\[ f_{max} (x) = n f(x) F(x)^{n-1} \]

Code

sim3 <- function(i){
  x <- rnorm(1000, mean = 5, sd = sqrt(2))
  return(max(x))
}
results <- replicate(10000, sim3(1))

xe <- seq(8, 12, length.out = 100)
tibble(x = results) |> 
  ggplot(aes(x, y = ..density..)) +
  geom_histogram() +
  geom_line(data = tibble(x = xe, 
                          y = 1000 * dnorm(xe, 5, sqrt(2)) * pnorm(xe, 5, sqrt(2))^999),
            mapping = aes(x,y),
            col =  "red", lwd = 2)

Minimum

\[ f_{min}(x) = n\{1-F(x)\}^{n-1}f(x) \]

Code

sim4 <- function(i){
  x <- rnorm(1000, mean = 5, sd = sqrt(2))
  return(min(x))
}
results <- replicate(10000, sim4(1))

xe <- seq(-2.5, 2, length.out = 100)
tibble(x = results) |> 
  ggplot(aes(x, y = ..density..)) +
  geom_histogram() +
  geom_line(data = tibble(x = xe, 
                          y = 1000 * (1 - pnorm(xe, 5, sqrt(2)))^999 * dnorm(xe, 5, sqrt(2))),
            mapping = aes(x,y),
            col =  "red", lwd = 2)

Monte Carlo Hypothesis Testing

Hypothesis Testing
t-tests
Sampling Distribution
Monte Carlo Hypothesis Testing
Power Analysis

Monte Carlo Hypothesis Testing

Monte Carlo hypothesis testing is the process of constructing a sampling distribution of the test statistic given a null distribution.

Using the test statistic constructed from the sample, identify its the location compared to the emperical cdf.

Conduct a hypothesis test based on the location with a Monte Carlo p-value.

Monte Carlo Algorithm

Simulate data from the null distribution (\(H_0\))
Construct the test statistic from the simulated data.
Repeat steps 1 and 2 to construct an empirical distribution of the null hypothesis.
Construct the test statistic from the sample data.
Count the number of simulated test statistics that are more extreme than the sample test statistic.
Compute:

\[ p = \frac{m +1}{n + 1} \]

\(m\): number of extreme values
\(n\): number of simulated test statistics

Extreme values

Extreme values are values that satisfy these condition:

Hypothesis	Condition
\(H_a: \mu\ne\mu_0\)	\(\|t_{data}\| \leq \|t_{sim}\|\)
\(H_a: \mu>\mu_0\)	\(t_{data} \leq t_{sim}\)
\(H_0: \mu<\mu_0\)	\(t_{data} \geq t_{sim}\)

Specifying the Null Distribution

The null distribution follows the theorized data distribution with parameters related to the null hypothesis and sample size!

\[ H_0: \mu =\mu_0 \] \[ X\sim N(\mu, \sigma^2) \]

t-test Example

Use a Monte Carlo Hypothesis Test for \(H_0: \mu = 3\) and \(H_a: \mu \neq 3\). Simulate fake data from 24 RV’s from \(N(3, 9)\). Testing how the distribution will look like assuming that the null hypothesis is true.

Code

x <- rnorm(24, 3, 3)

Answer:

Code

tstat <- (mean(x) - 3)/(sd(x)/sqrt(24))


t_stat_sim <- function(i){
  x <- rnorm(24, 3, 3)
  tt <- (mean(x) - 3)/(sd(x)/sqrt(24))
  return(tt)
}
t_dist <- sapply(1:100000, t_stat_sim)

t_dist |> tibble(x = _) |> 
  ggplot(aes(x, y = ..density..)) +
  geom_histogram() +
  geom_vline(aes(xintercept = tstat), col = "red")

Code

sum(abs(tstat) < abs(t_dist))

#> [1] 69032

Code

(sum(abs(tstat) < abs(t_dist)) + 1) / (length(t_dist) + 1)

#> [1] 0.6903231

t-test Example

Use a Monte Carlo Hypothesis Test for \(H_0: \mu = 6\) and \(H_a: \mu \neq 6\). Simulate fake data from 24 RV’s from \(N(3, 9)\). Testing how the distribution will look like assuming that the null hypothesis is true.

Code

x <- rnorm(24, 3, 3)

Answer:

Code

tstat <- (mean(x) - 6)/(sd(x)/sqrt(24))


t_stat_sim <- function(i){
  x <- rnorm(24, 6, 3)
  tt <- (mean(x) - 6)/(sd(x)/sqrt(24))
  return(tt)
}
t_dist <- sapply(1:100000, t_stat_sim)

t_dist |> tibble(x = _) |> 
  ggplot(aes(x, y = ..density..)) +
  geom_histogram() +
  geom_vline(aes(xintercept = tstat), col = "red")

sum(abs(tstat) < abs(t_dist))
(sum(abs(tstat) < abs(t_dist)) + 1) / (length(t_dist) + 1)

Uniform Example

Using the sample data, test whether \(H_0: \theta = 10\) vs \(H_a: \theta < 10\) for a sample of 35. You believe that the data came from the a \(U(0, \thetat)\).

y <- runif(35, 0, 8)

Answer:

Code

tstat <- max(y)

t_stat_sim <- function(i){
  x <- runif(35, 0, 10)
  tt <- max(x)
  return(tt)
}
t_dist <- sapply(1:100000, t_stat_sim)

t_dist |> tibble(x = _) |> 
  ggplot(aes(x, y = ..density..)) +
  geom_histogram() +
  geom_vline(aes(xintercept = tstat), col = "red")

sum(tstat > t_dist)
(sum(tstat > t_dist) + 1) / (length(t_dist) + 1)

Example

The fdeaths data set in R contains the monthly death count from specific lung diseases in the UK from 1974 to 1979. Test whether the monthly average (\(theta\)) death count is fdeaths is greater than 550?

Answer:

Code

tstat <- mean(fdeaths)

t_stat_sim <- function(i){
  x <- rpois(72, 550)
  tt <- mean(x)
  return(tt)
}
t_dist <- sapply(1:100000, t_stat_sim)

t_dist |> tibble(x = _) |> 
  ggplot(aes(x, y = ..density..)) +
  geom_histogram() +
  geom_vline(aes(xintercept = tstat), col = "red")

sum(tstat > t_dist)
(sum(tstat > t_dist) + 1) / (length(t_dist) + 1)

Power Analysis

Hypothesis Testing
t-tests
Sampling Distribution
Monte Carlo Hypothesis Testing
Power Analysis

Type 1 and 2 Error

Code

a1 <- data.frame(
   x = c(2.5,3.5),
   y = c(41, 41),
   label = c("Yes", "No")
)

a2 <- data.frame(
   x = c(1.9),
   y = c(25, 35),
   label = c("False", "True")
)
a3 <- tibble(x = c(1.7,3), 
             y = c(30, 43), 
             label = c("H0", "Reject H0"))

yay <- tibble(x = c(2.5, 3.5), 
              y = c(25, 35), 
              label = c("Yay!", "Yay!"))

type1 <- tibble(x = c(2.5, 3.5),
                y = c(36.5, 26.5),
                label = c("Type I Error", "Type II Error"))

type2 <- tibble(x = c(3.5, 2.5),
                y = c(24.5, 34.5),
                label = c("beta", "alpha"))


# basic graph
p <- ggplot() + theme_void()

# Add rectangles
p + annotate("rect", 
             xmin=c(2,3), xmax=c(3,4), 
             ymin=c(20,20), ymax=c(30,30), 
             alpha=0.2, color="green", fill="green") + 
    annotate("rect", 
             xmin=c(2,3), xmax=c(3,4), 
             ymin=c(30,30), ymax=c(40,40), 
             alpha=0.2, color="red", fill="red") +
    annotate("rect",
             xmin=c(3,2), xmax=c(4,3),
             ymin=c(30, 20), ymax=c(40, 30),
             alpha=0.8, color="royalblue", fill="royalblue") +
    geom_text(data=a1, aes(x=x, y=y, label=label),
             size=10 , fontface="bold" ) +
    geom_text(data=a3, aes(x=x, y=y, label=label),
             size=10 , fontface="bold.italic" ) +
    geom_text(data=a2, aes(x=x, y=y, label=label),
             size=10, angle = 90, fontface="bold" ) +
    geom_text(data=yay, aes(x=x, y=y, label=label),
             size=10, fontface="bold.italic" ) +
    geom_text(data=type1, aes(x=x, y=y, label=label),
             size=10, fontface="bold") +
    geom_text(data=type2, aes(x=x, y=y, label=label),
             size=10, fontface="bold", parse = T )

Type 1 and 2 Error

Code

x <- seq(-4, 4, length.out = 1000)
xx <- seq(-1, 7, length.out = 1000)
dt_one<-function(x){
            y <- dnorm(x)
            y[x < qnorm(.94)] <-NA
            return(y)
}
dt_two<-function(x){
            y <- dnorm(x, mean = 3)
            y[x > 1.645] <-NA
            return(y)
}
df1 <- tibble(x = x, y = dnorm(x))
df2 <- tibble(x = xx, y = dnorm(xx, mean = 3))

a1 <- tibble(x = c(0,3), y = c(0.43, 0.43), label = c("H0", "H1"))
a2 <- tibble(x = c(2.05,1.25), y = c(0.02, 0.02), label = c(paste("alpha"), "beta"))

df1 |> 
  ggplot(aes(x, y)) +
    stat_function(fun = dt_one, geom = "area", fill = "red") +
    geom_line() +
    geom_line(aes(x, y), df2) +
    stat_function(fun = dt_two, geom = "area", fill = "green") +
    geom_vline(aes(xintercept = 1.645), lwd = 1.5) +
    geom_text(data=a1, aes(x=x,y=y, label=label),
              size=14 , fontface="bold.italic" ) +
    geom_text(data=a2, aes(x=x,y=y, label=label),
              size=10 , fontface="bold.italic", parse = T) +
    theme_bw()

Power

Power is the probability of rejecting \(H_0\) given that \(H_1\).

\[ Power = 1-\beta \]

Power

We want to ensure we have a high power and low type I error rate (\(\alpha\)).

As practitioners, we control \(\alpha\). Set it before we conduct a study. Usually set at 0.05.

We cannot control power because it is dependent by effect size, \(\alpha\), and sample size.

Power Relationships

\(\alpha \uparrow\) \(\rightarrow\) \(Power\ \uparrow\)
\(n \uparrow\) \(\rightarrow\) \(Power\ \uparrow\)

Power Analysis in R

The pwr package in R will conduct power and sample size for several common statistical tests. For more information, you can check this vignette and their github.

Conceptial

Code

x <- seq(-4, 4, length.out = 1000)
xx <- seq(-1, 7, length.out = 1000)
dt_one<-function(x){
            y <- dnorm(x)
            y[x < qnorm(.94)] <-NA
            return(y)
}
dt_two<-function(x){
            y <- dnorm(x, mean = 3)
            y[x > 1.645] <-NA
            return(y)
}
df1 <- tibble(x = x, y = dnorm(x))
df2 <- tibble(x = xx, y = dnorm(xx, mean = 3))

a1 <- tibble(x = c(0,3), y = c(0.43, 0.43), label = c("H0", "H1"))
a2 <- tibble(x = c(2.05,1.25), y = c(0.02, 0.02), label = c(paste("alpha"), "beta"))

df1 |> 
  ggplot(aes(x, y)) +
    stat_function(fun = dt_one, geom = "area", fill = "red") +
    geom_line() +
    geom_line(aes(x, y), df2) +
    stat_function(fun = dt_two, geom = "area", fill = "green") +
    geom_vline(aes(xintercept = 1.645), lwd = 1.5) +
    geom_text(data=a1, aes(x=x,y=y, label=label),
              size=14 , fontface="bold.italic" ) +
    geom_text(data=a2, aes(x=x,y=y, label=label),
              size=10 , fontface="bold.italic", parse = T) +
    theme_bw()