Zero Models
Zero-Inflated Models
Hurdle Models
Cox Proportional Hazard Models
Zero-inflated models is a mixture model where random variables are generated from 2 or more distributions.
On distribution that states the value came from 2 possible distributions. All other values could have come from a common distribution.
\[ f(x) = \pi g(x) + (1-\pi)h(x) \]
A zero-inflated distribution is a mixture model, where the observed value of 0 can com from 2 distributions.
This forces the data to produce more 0’s than from a standard distribution.
\[ P(x=0) = \left\{\begin{array}{cc} \pi + (1-\pi)g(x) & X=0 \\ (1-\pi)g(x) & otherwise \end{array} \right. \]
When modeling a ZI-model, we need to determine where the 0 came from.
This is modeled using a Binary (logistic) GLM.
If you believe that the data comes from a Poisson distribution, but there is an abnormal amount of 0’s, It may indicate that 0 may come from 2 sources of distributions.
Hence the Zero-inflated Poison model may be appropriate.
\[ g(\pi) = -1.85 + 1.1 X_1 \]
\[ h(\lambda) = 4-2X_2 \]
If you believe that the data comes from a negative binomial distribution, but there is an abnormal amount of 0’s, It may indicate that 0 may come from 2 sources of distributions.
Hence the Zero-inflated Poison model may be appropriate.
\[ g(\pi) = -1.85 + 1.1 X_1 \]
\[ h(\mu) = -0.85-1.3X_2 \]
Zero-Inflated Models
Hurdle Models
Cox Proportional Hazard Models
Hurdle models are used to model data where the value 0 is believed to come from one distribution, and all the other values come from a separate distribution.
It can be thought of as data points had to overcome a hurdle to escape 0 to obtain a different value.
\[ P(x=0) = \left\{\begin{array}{cc} \pi & X=0 \\ (1-\pi)g(x) & otherwise \end{array} \right. \]
\[ \begin{array}{c} \pi \\ (1-\pi)g(x) \end{array} \]
\[ \begin{array}{cc} \pi + (1-\pi)g(x) & X=0 \\ (1-\pi)g(x) & otherwise \end{array} \]
When modeling a ZI-model, we need to determine where the 0 came from.
This is modeled using a Binary (logistic) GLM.
Both Negative Binomial and Poisson Models can generate a 0 value.
Therefore, these distributions are truncated at 1 to be utilized in a hurdle model.
\[ g(\pi) = 1.1 + 3.1 X \]
\[ h(\lambda) = 0.75+1.3X \]
\[ g(\pi) = 1.1 + 3.1 X \]
\[ h(\lambda) = 0.75+1.3X \]
Zero-Inflated Models
Hurdle Models
Cox Proportional Hazard Models
Cox proportional hazard model was used to model the association between a set of predictors and a time-to-event, in the presence of an independent censoring mechanism.
Data is typically recorded as time-to-event data
For biomedical studies, researchers are interested in time from diagnosis to death, known as time-to-death
Censoring is a mechanism where we do not observe the true time-to-event
Not all the time is observed
Three common types of censoring mechanisms: Right, Left, and Interval
dat <- data.frame(ID = 1:10,
t1 = c(7, 9, 10, 5, 5, 10, 5, 6, 6, 7) ,
censored = c(1, 1, 1, 0, 0, 1, 0, 1, 1, 1))
ggplot(dat, aes(x = ID, y = t1, shape = ifelse(censored, "Death", "Censored"))) + geom_point(size = 4) +
geom_linerange(aes(ymin = 0, ymax = t1)) +
geom_hline(yintercept = 5, lty = 2) +
coord_flip() +
scale_shape_manual(name = "Event", values = c(19, 15)) +
# ggtitle("Left Censoring") +
xlab("Patient ID") + ylab("Months") +
theme_bw()
\(T_i^*\): True time-to-event
\(C_i\): Censoring Time
\(T_i=\min(T_i^*,C_i)\): Observed time-to-event
\(\delta_i = I(T_i^*<C_i)\): Event indicator
\[ h(t) = \lim_{\Delta t \rightarrow 0} \frac{P(t \le T^* < t + \Delta t \mid T^* \ge t)}{\Delta t} \]
\[ h(t \mid \boldsymbol X) = h_0(t) \exp(\boldsymbol\beta^\mathrm T\boldsymbol X) \] - \(h_0(t)\): baseline hazard function
\(\boldsymbol\beta\): regression coefficients
\(\boldsymbol X\): predictor variables
\[ h(t \mid \boldsymbol X) = 1.5t^{0.5} \exp(-1.5X_1 + 2.3X_2) \] \[ C_i\sim Exp(3.5) \]
n <- 10000
tdf <- tibble()
for (i in 1:10000){
phi <- 1.5
x.vec <- c(mvtnorm::rmvnorm(1,c(0,1.5),diag(c(1,.5))))
invS = function(t, u){
h = function(s)
{
phi * s^(phi - 1) * exp(-x.vec[1] * 1.5 + x.vec[2] * 2.3)
}
integrate(h, lower = 0, upper = t, subdivisions = 2000)$value + log(u)
}
u = runif(1)
Up <- 1
Root <- try(uniroot(invS, interval = c(0, Up), u = u)$root, TRUE)
while(inherits(Root, "try-error") | Root == 0) {
Up <- Up + 2
Root <- try(uniroot(invS, interval = c(0, Up), u = u)$root, TRUE)
if (Root == 0) {u <- runif(1)}
}
trueTimes <- if (!inherits(Root, "try-error")) Root else NA
rateC = 3.5
Ctimes = rexp(1, rate = rateC)
event.times = pmin(trueTimes, Ctimes,1)
delta_i = as.numeric(trueTimes <= min(Ctimes,1))
wdf <- tibble(x1 = x.vec[1], x2 = x.vec[2], time = event.times, censor = delta_i)
tdf <- bind_rows(tdf, wdf)
}