Log transforming data with zeros

I have data on bee viruses that I am comparing between groups of bees from two site types. There are nine sites, 4 of one type and 5 of the other.
The data are closer to normal when log transformed, so a log transformation seems to be a good fit. However, there are lots of zeros in the data, and when I log transform, the zeros become "-Inf". This becomes a problem when I try to fit a GLM to the viral data, with virus ~ site type, which was one idea about how to analyze it.

The other idea was to run an ANOVA with a linear contrast. I'm looking for input about using a GLM vs. an ANOVA with a linear contrast. How do we decide which is appropriate? And how do you deal with zeros in log transformed data?

Hi @zoep,

One simple approach is adding a constant to the data. For example, you could add 1 to every point, then log transform.

x <- c(0, 1, 2, 3)


log(x + 1)
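
To make the problem and the fix concrete, here is a minimal sketch using the same toy vector. It only uses base R; log1p() and expm1() are the built-in equivalents of log(x + 1) and its inverse.

```r
x <- c(0, 1, 2, 3)

log(x)      # the zero maps to -Inf, which lm() and glm() will not accept as a response
log1p(x)    # same as log(x + 1), computed more accurately for values near zero

# expm1() is the inverse of log1p(), useful for back-transforming estimates
expm1(log1p(x))
```

Keep in mind that the choice of constant is arbitrary and can influence the results, so it is worth checking whether the conclusions change with a different constant.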

As a follow-up, there is a package called bestNormalize that will try out multiple "normalizing" transformations for you; you can try that if you want. If you inspect the log_x transformation, it does what I described: it adds a small constant to every data point (max(0, -min(x) + eps)).

library(bestNormalize)  # install.packages("bestNormalize") if needed

x <- rexp(100)
xtrans <- bestNormalize(x)

If the data you have are counts, you might be interested in this paper:
Do not log-transform count data

An ANOVA is just another way to compare two linear models. You can do an ANOVA on a GLM! So first you have to decide what kind of model to fit. From what you've written, your response variable is virus, so work through these questions to see if they help you decide what kind of model to fit. This is a cut-down version of a handout I give my ecological statistics students.

  1. Is the response variable continuous (a real number) or discrete (an integer)?
  • Continuous, go to 2.
  • Discrete, go to 5.
  2. Is the continuous response bounded on the bottom (e.g. at zero)?
  • No, use a model with a normal error distribution (e.g. lm()).
  • Yes, go to 3.
  3. Does the response variable sometimes take a value of exactly zero?
  • No, consider log transforming the response prior to using a model with a normal error distribution (e.g. lm()), or use a gamma error distribution (e.g. glm(…, family=Gamma)).
  • Yes, go to 4.
  4. Choose from the following two options:
  • Add a small (≤1) constant to all values of the response, log transform, and use a model with a normal error distribution.
  • Split the analysis into a binomial presence/absence model and a normal error model of the log transformed observations > 0.
  5. Is the discrete response bounded on the bottom (i.e. are negative values impossible)?
  • No, consider a model with a normal error distribution, but check carefully for heteroscedasticity.
  • Yes, go to 6.
  6. Does the discrete response have a maximum value (upper bound)? The value may differ for each observation.
  • No, consider the response to be Poisson and go to 9.
  • Yes, go to 7.
  7. How many different discrete outcomes are possible for each response?
  • Two (yes/no, present/absent, true/false). Consider the response to be binomial (n successes in m trials; use glm(…, family=binomial)). Check for overdispersion if m > 1.
  • More than two, go to 8.
  8. Are all the covariates categorical?
  • Yes, use contingency tables and related techniques (get expert help).
  • No, consider multinomial regression (get expert help) or convert to numeric scores and treat as normal (get expert help here too; psychometrics does this a lot).
  9. Does each observational unit represent the same amount of time or space?
  • No, use an offset in the formula to account for variation in sampling effort between observations, and proceed with using glm(. ~ . + offset(log(effort.variable)), …, family=poisson). Be sure to check for overdispersion.
  • Yes, proceed with using glm(…, family=poisson). Be sure to check for overdispersion.
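
For a continuous response with exact zeros, the "split the analysis" option above could look roughly like the sketch below. The data are simulated and every name (bees, site_type, virus, detected) is made up for illustration; the sketch also ignores the grouping of bees within sites, which a real analysis would probably need to address.

```r
set.seed(1)

# Simulated stand-in for the viral data (hypothetical): two site types,
# with many exact zeros and right-skewed positive values
n <- 90
site_type <- factor(rep(c("A", "B"), times = c(40, 50)))
present <- rbinom(n, size = 1, prob = ifelse(site_type == "A", 0.4, 0.7))
virus <- present * rlnorm(n, meanlog = ifelse(site_type == "A", 2, 3), sdlog = 1)
bees <- data.frame(site_type, virus)
bees$detected <- as.integer(bees$virus > 0)

# Part 1: does site type affect whether the virus is detected at all?
fit_presence <- glm(detected ~ site_type, family = binomial, data = bees)

# Part 2: among positive samples only, does site type affect the log viral load?
fit_load <- lm(log(virus) ~ site_type, data = subset(bees, virus > 0))

summary(fit_presence)$coefficients
summary(fit_load)$coefficients
```

The two parts answer different questions (prevalence vs. load given infection), so they are interpreted separately, and no constant has to be added before taking logs.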

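For count data, the Poisson route with an offset, the overdispersion check, and "an ANOVA on a GLM" can be sketched together as follows. Again the data are simulated, and the names (d, counts, effort, site_type) are assumptions for the example rather than anything from the original question.

```r
set.seed(2)

# Hypothetical count data with unequal sampling effort per observation
n <- 60
site_type <- factor(rep(c("A", "B"), each = 30))
effort <- runif(n, min = 1, max = 5)           # e.g. minutes of observation
rate <- ifelse(site_type == "A", 2, 4)         # true counts per unit effort
counts <- rpois(n, lambda = rate * effort)
d <- data.frame(site_type, effort, counts)

# Poisson GLMs with an offset, so the models describe counts per unit effort
fit0 <- glm(counts ~ 1 + offset(log(effort)), family = poisson, data = d)
fit1 <- glm(counts ~ site_type + offset(log(effort)), family = poisson, data = d)

# Rough overdispersion check: Pearson chi-square divided by residual df.
# Values well above 1 suggest overdispersion.
dispersion <- sum(residuals(fit1, type = "pearson")^2) / df.residual(fit1)
dispersion

# An ANOVA on a GLM: analysis of deviance comparing the two nested models
anova(fit0, fit1, test = "Chisq")
```

If the dispersion ratio is well above 1, a quasi-Poisson or negative binomial model is a common next step.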