
Negative Binomial Model

Authors
  • Kevin Navarrete-Parra

I am writing quick and easy R guides for my didactic purposes and to provide useful starting places for my peers in grad school. If you see that I have made a mistake or would like to suggest some way to make the post better or more accurate, please feel free to email me. I am always happy to learn from others' experiences!

Table of contents

  1. Negative Binomial Model
    1. Model Equation
    2. Negative Binomial Distribution
  2. Incidence Rate Ratios
  3. Running it in R
  4. Diagnostic Statistics

Negative Binomial Model

Model Equation

This model is similar to the Poisson model in that it is suited for count dependent variables. However, the Poisson model assumes that the response variable's mean and variance are equal, a condition that real data commonly violate. If your response variable suffers from overdispersion (i.e., the dependent variable's variance is greater than its mean), then the negative binomial model is a good alternative. Whereas the Poisson model assumes the variable follows a Poisson distribution, the negative binomial model works off of the negative binomial distribution, which relaxes the dispersion assumption. To do so, $Variance(Y)$ is made into a function of the mean $\mu$ and a dispersion parameter $\alpha$. Therefore, $Variance(Y) = \mu(1 + \alpha \mu)$. When $\alpha = 0$, the variance is equal to the mean, making the negative binomial identical to the Poisson.
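To make the variance function concrete, here is a minimal sketch in base R. The helper `nb2_variance()` and the values of `mu` and `alpha` are my own illustrative choices. It relies on `rnbinom()`'s mean parameterization, where `size` corresponds to $1 / \alpha$.

```r
# Variance function described above: Var(Y) = mu * (1 + alpha * mu)
# (nb2_variance is a hypothetical helper name, not a function from any package)
nb2_variance <- function(mu, alpha) mu * (1 + alpha * mu)

set.seed(42)   # arbitrary seed, only for reproducibility
mu    <- 4     # hypothetical mean
alpha <- 0.5   # hypothetical dispersion parameter

# Simulate negative binomial counts; size = 1 / alpha in this parameterization
y <- rnbinom(1e5, mu = mu, size = 1 / alpha)

# The sample variance should sit close to the theoretical value
c(sample_mean = mean(y),
  sample_var = var(y),
  theoretical_var = nb2_variance(mu, alpha))

# With alpha = 0 the variance collapses to the mean, as in the Poisson model
nb2_variance(mu, alpha = 0)
```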

The negative binomial model is similar to the Poisson model in that it can be modeled as

$$\ln(\mu) = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p + \epsilon$$

where $\mu$ is the mean, $\beta_0$ is the intercept, the remaining $\beta$ values are coefficients, and $\epsilon$ is an error term. The equation's left side contains the log link function. We include an error term in this model to reflect the overdispersion. Notably, there are two ways of expressing the response variable's variance: as a linear function of the mean, $Variance(Y) = \mu(1 + \alpha) = \mu + \alpha \mu$, or as a quadratic function of the mean, $Variance(Y) = \mu(1 + \alpha \mu) = \mu + \alpha \mu^2$. The former is often called the NB1 model and the latter the NB2 model; NB2 is the one that most people use.

When trying to predict the response variable's mean, we exponentiate both sides of the negative binomial model.

$$\mu = \exp(\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p)$$

Notice that the equation above does not include the $\epsilon$ value. That is because $\exp(\epsilon)$ is assumed to have an expected value of 1 in this model, so it drops out when we predict the mean.
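As a small worked example of exponentiating the linear predictor, the sketch below uses made-up coefficients and predictor values (none of them come from a fitted model) to compute predicted mean counts in base R.

```r
# Hypothetical coefficients on the log scale (not estimated from real data)
beta <- c(intercept = 0.5, x1 = 0.3, x2 = -0.2)

# Hypothetical predictor values for two observations;
# the leading 1 in each row multiplies the intercept
X <- rbind(c(1, 2.0, 1.0),
           c(1, 0.0, 3.0))

eta <- drop(X %*% beta)  # linear predictor, i.e. ln(mu)
mu  <- exp(eta)          # predicted mean count for each observation

eta
mu
```

The first observation has a linear predictor of 0.5 + 0.3(2) - 0.2(1) = 0.9, so its predicted mean is exp(0.9).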

Additionally, as with the Poisson model, we can incorporate a temporal element if we want to represent a count of occurrences in a given timeframe $t$. If we do define an incidence rate, the model can be re-expressed as

$$\ln(\mu / t) = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p + \epsilon$$

where $t$ is the time period and $\ln(\mu / t)$ represents the log of the incidence rate. We can also rewrite the equation above as

$$\ln(\mu) = \ln(t) + \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p + \epsilon$$

where $\ln(t)$ is the offset in the model.
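One way to include the offset when fitting the model is `offset()` inside the formula passed to `glm.nb`. The sketch below runs on simulated data, so the variable names (`x`, `exposure`, `y`) and the true parameter values are assumptions made purely for illustration.

```r
library(MASS)  # provides glm.nb()

set.seed(1)    # arbitrary seed, only for reproducibility
n <- 500
d <- data.frame(
  x        = rnorm(n),
  exposure = runif(n, min = 1, max = 10)  # hypothetical time period t
)

# Simulated truth: ln(mu) = ln(t) + 0.5 + 0.3 * x, with size = 2 (alpha = 0.5)
d$y <- rnbinom(n, mu = d$exposure * exp(0.5 + 0.3 * d$x), size = 2)

# offset(log(exposure)) enters the linear predictor with its coefficient fixed at 1
fit_rate <- glm.nb(y ~ x + offset(log(exposure)), data = d)
coef(fit_rate)
```

The estimated coefficients should land near the simulated values of 0.5 and 0.3, which describe the log incidence rate rather than the log count.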

Negative Binomial Distribution

As we saw above, the negative binomial model handles overdispersion in count data better than the Poisson model. That is because the negative binomial probability distribution accommodates greater variance than the Poisson distribution. In this probability distribution, we count the failures in a sequence of independent Bernoulli trials before a given number of successes is reached. We can express the negative binomial probability distribution as

$$P(Y = y) = \binom{y + k - 1}{y} p^k (1-p)^y$$

where $\binom{y + k - 1}{y}$ is the binomial coefficient, $y$ is the number of failures before the $k$th success, $k$ is the number of successes in $y + k$ trials, and $p$ is the success probability of each trial.
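As a quick check of the formula, the sketch below codes it by hand and compares it with base R's `dnbinom()`, which uses the same parameterization (`size` is the number of successes and `prob` is the success probability). The helper name `nb_pmf` and the values of `k`, `p`, and `y` are hypothetical.

```r
# Probability of y failures before the k-th success, with success probability p
# (nb_pmf is a hypothetical helper name)
nb_pmf <- function(y, k, p) choose(y + k - 1, y) * p^k * (1 - p)^y

k <- 3      # hypothetical number of successes
p <- 0.4    # hypothetical success probability
y <- 0:5    # numbers of failures to evaluate

manual  <- nb_pmf(y, k, p)
builtin <- dnbinom(y, size = k, prob = p)

# TRUE when the hand-coded formula matches the built-in density
all.equal(manual, builtin)
```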

Incidence Rate Ratios

The negative binomial model employs incidence rate ratios like the Poisson model, since it estimates the response variable's log incidence rate. To get these values, then, we exponentiate both sides of the model, so that each exponentiated coefficient is an incidence rate ratio.
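A minimal sketch of how the incidence rate ratios might be pulled out of a fitted model, using simulated data (the variable names and true coefficients are assumptions). The intervals here are simple Wald intervals built from the standard errors reported by `summary()`, which is one of several possible choices.

```r
library(MASS)  # provides glm.nb()

set.seed(2)    # arbitrary seed, only for reproducibility
n <- 500
d <- data.frame(x = rnorm(n))
# Simulated truth: the IRR for x is exp(0.3)
d$y <- rnbinom(n, mu = exp(1 + 0.3 * d$x), size = 2)

fit <- glm.nb(y ~ x, data = d)

# Coefficient table on the log scale
est <- summary(fit)$coefficients
z   <- qnorm(0.975)  # critical value for a 95% Wald interval

# Exponentiate the estimates and interval bounds to get IRRs
irr <- exp(cbind(
  IRR   = est[, "Estimate"],
  lower = est[, "Estimate"] - z * est[, "Std. Error"],
  upper = est[, "Estimate"] + z * est[, "Std. Error"]
))
irr
```

An IRR above 1 means the expected count rises as the predictor increases by one unit, while an IRR below 1 means it falls.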

Running it in R

When you want to fit a negative binomial model in R, you can use the glm.nb function from the MASS package. The code would look like


library(MASS)  # provides glm.nb()
nb.mod <- glm.nb(dv ~ iv1 + iv2 + ivp, data = data)
summary(nb.mod)

You can also run the same model using the vglm function from VGAM using the following code:


library(VGAM)  # provides vglm() and the negbinomial family
nb.vglm <- vglm(dv ~ iv1 + iv2 + ivp, family = negbinomial, data = data)
summary(nb.vglm)

Using glm.nb might be somewhat easier unless you are already specifying other models with the VGAM package.
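For a version of the glm.nb call that runs end to end, here is a sketch on simulated data. The data frame `sim_data`, its variables, and the true parameter values are all made up for illustration. One thing worth noticing is that glm.nb reports a parameter called theta, with variance $\mu + \mu^2 / \theta$, so the $\alpha$ from the model equation corresponds to $1 / \theta$.

```r
library(MASS)  # provides glm.nb()

set.seed(3)    # arbitrary seed, only for reproducibility
n <- 1000
sim_data <- data.frame(iv1 = rnorm(n), iv2 = rbinom(n, 1, 0.5))

# Simulated truth: size (theta) is 2, so alpha is 1 / 2 = 0.5
sim_data$dv <- rnbinom(n,
                       mu = exp(1.5 + 0.4 * sim_data$iv1 - 0.3 * sim_data$iv2),
                       size = 2)

nb.mod <- glm.nb(dv ~ iv1 + iv2, data = sim_data)
summary(nb.mod)

theta_hat <- nb.mod$theta   # glm.nb's estimate of theta
alpha_hat <- 1 / theta_hat  # implied dispersion parameter alpha
c(theta = theta_hat, alpha = alpha_hat)
```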

Diagnostic Statistics

The diagnostic statistics for this model are mostly the same as those used for the Poisson model and others I've covered here.
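As one possible starting point, the sketch below compares a Poisson fit and a negative binomial fit on simulated overdispersed data, using the Pearson dispersion statistic and AIC. The data and variable names are assumptions, and these two checks are only examples of the kinds of diagnostics that carry over from the Poisson model.

```r
library(MASS)  # provides glm.nb()

set.seed(4)    # arbitrary seed, only for reproducibility
n <- 500
d <- data.frame(x = rnorm(n))
# Simulated overdispersed counts (size = 1, so alpha = 1)
d$y <- rnbinom(n, mu = exp(1 + 0.3 * d$x), size = 1)

pois_fit <- glm(y ~ x, family = poisson, data = d)
nb_fit   <- glm.nb(y ~ x, data = d)

# Pearson dispersion statistic for the Poisson fit:
# values well above 1 suggest overdispersion
pearson_disp <- sum(residuals(pois_fit, type = "pearson")^2) /
  df.residual(pois_fit)
pearson_disp

# Lower AIC indicates the better trade-off between fit and complexity
AIC(pois_fit, nb_fit)
```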