Statistical Inference
Hypothesis Testing
t-tests
Sampling Distribution
Monte Carlo Hypothesis Testing
Power Analysis
Hypothesis tests are used to test whether claims are valid or not. This is conducted by collecting data, setting the Null and Alternative Hypothesis.
The null hypothesis is the claim that is initially believed to be true. For the most part, it is always equal to the hypothesized value.
The alternative hypothesis contradicts the null hypothesis.
We want to see if \(\mu\) is different from \(\mu_0\)
Null Hypothesis | Alternative Hypothesis |
---|---|
\(H_0: \mu=\mu_0\) | \(H_a: \mu\ne\mu_0\) |
\(H_0: \mu\le\mu_0\) | \(H_a: \mu>\mu_0\) |
\(H_0: \mu\ge\mu_0\) | \(H_0: \mu<\mu_0\) |
Notice how there are 3 types of null and alternative hypothesis, The first type of hypothesis (\(H_a:\mu\ne\mu_0\)) is considered a 2-sided hypothesis because the rejection region is located in 2 regions. The remaining two hypotheses are considered 1-sided because the rejection region is located on one side of the distribution.
Null Hypothesis | Alternative Hypothesis | Side |
---|---|---|
\(H_0: \mu=\mu_0\) | \(H_a: \mu\ne\mu_0\) | 2-sided |
\(H_0: \mu\le\mu_0\) | \(H_a: \mu>\mu_0\) | 1-sided |
\(H_0: \mu\ge\mu_0\) | \(H_0: \mu<\mu_0\) | 1-sided |
Hypothesis Testing
t-tests
Sampling Distribution
Monte Carlo Hypothesis Testing
Power Analysis
t-tests are commonly used to determine whether a small sample is different from a hypothesized value.
\[ H_0:\ \mu = \mu_0 \]
\[ T = \frac{\bar x -\mu_0}{s/\sqrt{n}} \sim t(n-1) \]
\[ t = \frac{N(0,1)}{\sqrt{\chi^2(n-1)/(n-1)}} \sim t(n-1) \] Knowing the distribution will allow us to compute the p-value!
Use Monte Carlo Simulations to show that transforming a data set to the T-Statistic will yield a t-distribution. Use \(H_0: \mu = 3\) and simulate 24 RV’s from \(N(3, 9)\). Testing how the distribution will look like assuming that the null hypothesis is true.
Answer:
t_stat <- function(i){
x <- rnorm(24, 3, 3)
tt <- (mean(x) - 3)/(sd(x)/sqrt(24))
return(tt)
}
t_dist <- sapply(1:100000, t_stat)
xe <- seq(-5, 5, length.out = 100)
t_dist |> tibble(x = _) |>
ggplot(aes(x, y = ..density..)) +
geom_histogram() +
geom_line(data = tibble(x = xe, y = dt(xe, 23)),
mapping = aes(x,y),
col = "red", lwd = 2)
Hypothesis Testing
t-tests
Sampling Distribution
Monte Carlo Hypothesis Testing
Power Analysis
A sample is a collection of random variables (data) that is believed to be independent and identically distributed (come from the same distribution).
\[ \bX = (X_1, X_2, \cdots, X_n)^{\mrT} \]
A statistic is a transformation of data by a function.
\[ T(\bX) \]
Statistic | Function |
---|---|
Min | \(\mathrm{min}(\bX)\) |
Max | \(\mathrm{max}(\bX)\) |
Mean | \(\frac{1}{n}X_i\) |
Median | \(P(X_0 < \bX) = 0.5\) |
SD | \(\frac{1}{n-1}\sum^n_{i=1}(X_i-\bar X)^2\) |
A sampling distribution is the distribution of a statistic.
\[ X_i \stackrel{iid}{\sim} N(\mu, \sigma^2) \]
Statistic | Distribution |
---|---|
\(\bar X\) | \(N(\mu, \sigma^2)\) |
\((n-1)s^2/\sigma^2\) | \(\chi^2(n-1)\) |
Let \(X_1, X_2, \ldots, X_n\) be identical and independent distributed random variables with \(E(X_i)=\mu\) and \(Var(X_i) = \sigma²\). We define
\[ Y_n = \sqrt n \left(\frac{\bar X-\mu}{\sigma}\right) \mathrm{ where }\ \bar X = \frac{1}{n}\sum^n_{i=1}X_i. \]
Then, the distribution of the function \(Y_n\) converges to a standard normal distribution function as \(n\rightarrow \infty\).
Several statistics have distributions that we can conduct inference on such as \(t\), \(\chi^2\), and \(F\).
How do we obtain sampling distributions for other types of statistics?
\[ X_i \stackrel{iid}{\sim} N(5, 2) \]
\[ \bar X \sim N(5, 2/n) \]
sim1 <- function(i){
x <- rnorm(1000, mean = 5, sd = sqrt(2))
return(mean(x))
}
results <- replicate(10000, sim1(1))
xe <- seq(4.8, 5.2, length.out = 100)
tibble(x = results) |>
ggplot(aes(x, y = ..density..)) +
geom_histogram() +
geom_line(data = tibble(x = xe, y = dnorm(xe, 5, sqrt(2/1000))),
mapping = aes(x,y),
col = "red", lwd = 2)
sim2 <- function(i){
x <- rnorm(1000, mean = 5, sd = sqrt(2))
return(var(x))
}
results <- replicate(10000, sim2(1))
xe <- seq(850, 1150, length.out = 100)
tibble(x = 999 * results / 2) |>
ggplot(aes(x, y = ..density..)) +
geom_histogram() +
geom_line(data = tibble(x = xe, y = dchisq(xe, 999)),
mapping = aes(x,y),
col = "red", lwd = 2)
\[ f_{max} (x) = n f(x) F(x)^{n-1} \]
sim3 <- function(i){
x <- rnorm(1000, mean = 5, sd = sqrt(2))
return(max(x))
}
results <- replicate(10000, sim3(1))
xe <- seq(8, 12, length.out = 100)
tibble(x = results) |>
ggplot(aes(x, y = ..density..)) +
geom_histogram() +
geom_line(data = tibble(x = xe,
y = 1000 * dnorm(xe, 5, sqrt(2)) * pnorm(xe, 5, sqrt(2))^999),
mapping = aes(x,y),
col = "red", lwd = 2)
\[ f_{min}(x) = n\{1-F(x)\}^{n-1}f(x) \]
sim4 <- function(i){
x <- rnorm(1000, mean = 5, sd = sqrt(2))
return(min(x))
}
results <- replicate(10000, sim4(1))
xe <- seq(-2.5, 2, length.out = 100)
tibble(x = results) |>
ggplot(aes(x, y = ..density..)) +
geom_histogram() +
geom_line(data = tibble(x = xe,
y = 1000 * (1 - pnorm(xe, 5, sqrt(2)))^999 * dnorm(xe, 5, sqrt(2))),
mapping = aes(x,y),
col = "red", lwd = 2)
Hypothesis Testing
t-tests
Sampling Distribution
Monte Carlo Hypothesis Testing
Power Analysis
Monte Carlo hypothesis testing is the process of constructing a sampling distribution of the test statistic given a null distribution.
Using the test statistic constructed from the sample, identify its the location compared to the emperical cdf.
Conduct a hypothesis test based on the location with a Monte Carlo p-value.
Simulate data from the null distribution (\(H_0\))
Construct the test statistic from the simulated data.
Repeat steps 1 and 2 to construct an empirical distribution of the null hypothesis.
Construct the test statistic from the sample data.
Count the number of simulated test statistics that are more extreme than the sample test statistic.
Compute:
\[ p = \frac{m +1}{n + 1} \]
Extreme values are values that satisfy these condition:
Hypothesis | Condition |
---|---|
\(H_a: \mu\ne\mu_0\) | \(|t_{data}| \leq |t_{sim}|\) |
\(H_a: \mu>\mu_0\) | \(t_{data} \leq t_{sim}\) |
\(H_0: \mu<\mu_0\) | \(t_{data} \geq t_{sim}\) |
The null distribution follows the theorized data distribution with parameters related to the null hypothesis and sample size!
\[ H_0: \mu =\mu_0 \] \[ X\sim N(\mu, \sigma^2) \]
Use a Monte Carlo Hypothesis Test for \(H_0: \mu = 3\) and \(H_a: \mu \neq 3\). Simulate fake data from 24 RV’s from \(N(3, 9)\). Testing how the distribution will look like assuming that the null hypothesis is true.
Answer:
tstat <- (mean(x) - 3)/(sd(x)/sqrt(24))
t_stat_sim <- function(i){
x <- rnorm(24, 3, 3)
tt <- (mean(x) - 3)/(sd(x)/sqrt(24))
return(tt)
}
t_dist <- sapply(1:100000, t_stat_sim)
t_dist |> tibble(x = _) |>
ggplot(aes(x, y = ..density..)) +
geom_histogram() +
geom_vline(aes(xintercept = tstat), col = "red")
#> [1] 69032
#> [1] 0.6903231
Use a Monte Carlo Hypothesis Test for \(H_0: \mu = 6\) and \(H_a: \mu \neq 6\). Simulate fake data from 24 RV’s from \(N(3, 9)\). Testing how the distribution will look like assuming that the null hypothesis is true.
Answer:
tstat <- (mean(x) - 6)/(sd(x)/sqrt(24))
t_stat_sim <- function(i){
x <- rnorm(24, 6, 3)
tt <- (mean(x) - 6)/(sd(x)/sqrt(24))
return(tt)
}
t_dist <- sapply(1:100000, t_stat_sim)
t_dist |> tibble(x = _) |>
ggplot(aes(x, y = ..density..)) +
geom_histogram() +
geom_vline(aes(xintercept = tstat), col = "red")
sum(abs(tstat) < abs(t_dist))
(sum(abs(tstat) < abs(t_dist)) + 1) / (length(t_dist) + 1)
Using the sample data, test whether \(H_0: \theta = 10\) vs \(H_a: \theta < 10\) for a sample of 35. You believe that the data came from the a \(U(0, \thetat)\).
Answer:
tstat <- max(y)
t_stat_sim <- function(i){
x <- runif(35, 0, 10)
tt <- max(x)
return(tt)
}
t_dist <- sapply(1:100000, t_stat_sim)
t_dist |> tibble(x = _) |>
ggplot(aes(x, y = ..density..)) +
geom_histogram() +
geom_vline(aes(xintercept = tstat), col = "red")
sum(tstat > t_dist)
(sum(tstat > t_dist) + 1) / (length(t_dist) + 1)
The fdeaths
data set in R contains the monthly death count from specific lung diseases in the UK from 1974 to 1979. Test whether the monthly average (\(theta\)) death count is fdeaths is greater than 550?
Answer:
tstat <- mean(fdeaths)
t_stat_sim <- function(i){
x <- rpois(72, 550)
tt <- mean(x)
return(tt)
}
t_dist <- sapply(1:100000, t_stat_sim)
t_dist |> tibble(x = _) |>
ggplot(aes(x, y = ..density..)) +
geom_histogram() +
geom_vline(aes(xintercept = tstat), col = "red")
sum(tstat > t_dist)
(sum(tstat > t_dist) + 1) / (length(t_dist) + 1)
Hypothesis Testing
t-tests
Sampling Distribution
Monte Carlo Hypothesis Testing
Power Analysis
a1 <- data.frame(
x = c(2.5,3.5),
y = c(41, 41),
label = c("Yes", "No")
)
a2 <- data.frame(
x = c(1.9),
y = c(25, 35),
label = c("False", "True")
)
a3 <- tibble(x = c(1.7,3),
y = c(30, 43),
label = c("H0", "Reject H0"))
yay <- tibble(x = c(2.5, 3.5),
y = c(25, 35),
label = c("Yay!", "Yay!"))
type1 <- tibble(x = c(2.5, 3.5),
y = c(36.5, 26.5),
label = c("Type I Error", "Type II Error"))
type2 <- tibble(x = c(3.5, 2.5),
y = c(24.5, 34.5),
label = c("beta", "alpha"))
# basic graph
p <- ggplot() + theme_void()
# Add rectangles
p + annotate("rect",
xmin=c(2,3), xmax=c(3,4),
ymin=c(20,20), ymax=c(30,30),
alpha=0.2, color="green", fill="green") +
annotate("rect",
xmin=c(2,3), xmax=c(3,4),
ymin=c(30,30), ymax=c(40,40),
alpha=0.2, color="red", fill="red") +
annotate("rect",
xmin=c(3,2), xmax=c(4,3),
ymin=c(30, 20), ymax=c(40, 30),
alpha=0.8, color="royalblue", fill="royalblue") +
geom_text(data=a1, aes(x=x, y=y, label=label),
size=10 , fontface="bold" ) +
geom_text(data=a3, aes(x=x, y=y, label=label),
size=10 , fontface="bold.italic" ) +
geom_text(data=a2, aes(x=x, y=y, label=label),
size=10, angle = 90, fontface="bold" ) +
geom_text(data=yay, aes(x=x, y=y, label=label),
size=10, fontface="bold.italic" ) +
geom_text(data=type1, aes(x=x, y=y, label=label),
size=10, fontface="bold") +
geom_text(data=type2, aes(x=x, y=y, label=label),
size=10, fontface="bold", parse = T )
x <- seq(-4, 4, length.out = 1000)
xx <- seq(-1, 7, length.out = 1000)
dt_one<-function(x){
y <- dnorm(x)
y[x < qnorm(.94)] <-NA
return(y)
}
dt_two<-function(x){
y <- dnorm(x, mean = 3)
y[x > 1.645] <-NA
return(y)
}
df1 <- tibble(x = x, y = dnorm(x))
df2 <- tibble(x = xx, y = dnorm(xx, mean = 3))
a1 <- tibble(x = c(0,3), y = c(0.43, 0.43), label = c("H0", "H1"))
a2 <- tibble(x = c(2.05,1.25), y = c(0.02, 0.02), label = c(paste("alpha"), "beta"))
df1 |>
ggplot(aes(x, y)) +
stat_function(fun = dt_one, geom = "area", fill = "red") +
geom_line() +
geom_line(aes(x, y), df2) +
stat_function(fun = dt_two, geom = "area", fill = "green") +
geom_vline(aes(xintercept = 1.645), lwd = 1.5) +
geom_text(data=a1, aes(x=x,y=y, label=label),
size=14 , fontface="bold.italic" ) +
geom_text(data=a2, aes(x=x,y=y, label=label),
size=10 , fontface="bold.italic", parse = T) +
theme_bw()
Power is the probability of rejecting \(H_0\) given that \(H_1\).
\[ Power = 1-\beta \]
We want to ensure we have a high power and low type I error rate (\(\alpha\)).
As practitioners, we control \(\alpha\). Set it before we conduct a study. Usually set at 0.05.
We cannot control power because it is dependent by effect size, \(\alpha\), and sample size.
The pwr
package in R will conduct power and sample size for several common statistical tests. For more information, you can check this vignette and their github.
x <- seq(-4, 4, length.out = 1000)
xx <- seq(-1, 7, length.out = 1000)
dt_one<-function(x){
y <- dnorm(x)
y[x < qnorm(.94)] <-NA
return(y)
}
dt_two<-function(x){
y <- dnorm(x, mean = 3)
y[x > 1.645] <-NA
return(y)
}
df1 <- tibble(x = x, y = dnorm(x))
df2 <- tibble(x = xx, y = dnorm(xx, mean = 3))
a1 <- tibble(x = c(0,3), y = c(0.43, 0.43), label = c("H0", "H1"))
a2 <- tibble(x = c(2.05,1.25), y = c(0.02, 0.02), label = c(paste("alpha"), "beta"))
df1 |>
ggplot(aes(x, y)) +
stat_function(fun = dt_one, geom = "area", fill = "red") +
geom_line() +
geom_line(aes(x, y), df2) +
stat_function(fun = dt_two, geom = "area", fill = "green") +
geom_vline(aes(xintercept = 1.645), lwd = 1.5) +
geom_text(data=a1, aes(x=x,y=y, label=label),
size=14 , fontface="bold.italic" ) +
geom_text(data=a2, aes(x=x,y=y, label=label),
size=10 , fontface="bold.italic", parse = T) +
theme_bw()