 # Log transforming data with zeros

I have data on bee viruses that I am comparing between groups of bees from two site types. There are nine sites, 4 of one type and 5 of the other.
The data look closer to normal when log transformed, so a log transformation seems to be a good fit. However, there are lots of zeros in the data, and when I log transform them they become `-Inf`. This becomes a problem when I try to fit a GLM to the viral data, with virus ~ site type, which was one idea about how to analyze it.

The other idea was to run an ANOVA with a linear contrast. I'm looking for input on using a GLM versus an ANOVA with a linear contrast. How do we decide which is appropriate? And how do you deal with zeros in log-transformed data?

Hi @zoep,

One simple approach is adding a constant to the data. For example, you could add 1 to every point, then log transform.

```r
x <- c(0, 1, 2, 3)

log(x)      # the zero becomes -Inf

log(x + 1)  # the zero becomes 0
```
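
If the constant is exactly 1, base R's `log1p()` computes `log(1 + x)` directly and `expm1()` undoes it, which can be handy when back-transforming fitted values to the original scale. A small sketch with the same toy vector (the names `y` and `x_back` are just for illustration):

```r
x <- c(0, 1, 2, 3)

y <- log1p(x)              # same as log(x + 1); the zero maps to 0, not -Inf
all.equal(y, log(x + 1))   # TRUE

x_back <- expm1(y)         # back-transform: exp(y) - 1
all.equal(x_back, x)       # TRUE
```

Keep in mind that the choice of constant is arbitrary and can influence the results, so it is worth checking whether conclusions change with a different constant.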

As a follow-up, there is a package called `bestNormalize` that will try several "normalizing" transformations for you, so you can try that if you want. If you inspect its `log_x` transformation, it does essentially what I described: it adds a small constant to every data point (`max(0, -min(x) + eps)`) before taking the log.

```r
library(bestNormalize)

x <- rexp(100)               # skewed example data
hist(x)

xtrans <- bestNormalize(x)   # picks a transformation for x
hist(xtrans$x.t)             # histogram of the transformed values
```

If the data you have are counts, you might be interested in this paper: "Do not log-transform count data".
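
For counts, the usual alternative to transforming is to model them directly with a count distribution. Here is a minimal sketch, assuming the viral measurements really are counts. The data are simulated, and the names `bees`, `site_type` and `virus` are hypothetical stand-ins for the real data set:

```r
set.seed(1)

# Simulated stand-in data: 9 sites (4 of type "A", 5 of type "B"), 10 bees each
bees <- data.frame(
  site      = factor(rep(1:9, each = 10)),
  site_type = factor(rep(c("A", "B"), times = c(40, 50)))
)
bees$virus <- rpois(nrow(bees), lambda = ifelse(bees$site_type == "A", 0.5, 2))

# Poisson GLM with the default log link: zeros in the response are not a problem
m_pois <- glm(virus ~ site_type, family = poisson, data = bees)
summary(m_pois)

# Rough overdispersion check: Pearson chi-square divided by residual df.
# Values well above 1 suggest the Poisson variance assumption is too tight.
disp <- sum(residuals(m_pois, type = "pearson")^2) / df.residual(m_pois)
disp
```

This sketch treats every bee as independent. Bees from the same site probably are not, so take it as a starting point rather than a finished analysis.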

An ANOVA is just another way to compare two linear models. You can do an ANOVA on a GLM! So first you have to decide what kind of model to fit. From what you've written, your response variable is `virus`, so work through these questions to see if they help you decide what kind of model to fit. This is a cut-down version of a handout I give my ecological statistics students.

1. Is the response variable continuous (a real number) or discrete (an integer)?
• Continuous, go to 2.
• Discrete, go to 5.
2. Is the continuous response bounded on the bottom (e.g. at zero)?
• No, use a model with a normal error distribution (e.g. `lm()`).
• Yes, go to 3.
3. Does the response variable sometimes take a value of exactly zero?
• No, consider log transforming the response prior to using a model with a normal error distribution (e.g. `lm()`), or use a gamma error distribution (e.g. `glm(…, family = Gamma)`).
• Yes, go to 4.
4. Choose from the following two options:
• Add a small (≤1) constant to all values of the response, log transform and use a model with a normal error distribution.
• Split the analysis into a binomial presence/absence model, and a normal error model of the log-transformed observations > 0.
5. Is the discrete response bounded on the bottom (i.e. are negative values impossible)?
• No, consider a model with a normal error distribution, but check carefully for heteroscedasticity.
• Yes, go to 6.
6. Does the discrete response have a maximum value (upper bound)? The value may differ for each observation.
• No, consider the response to be Poisson and go to 9.
• Yes, go to 7.
7. How many different discrete outcomes are possible for each response?
• Two (yes/no, present/absent, true/false). Consider the response to be binomial (n successes in m trials; use `glm(…, family = binomial)`). Check for overdispersion if m > 1.
• More than two. Go to 8.
8. Are all the covariates categorical?
• Yes, use contingency tables and related techniques (get expert help).
• No, consider multinomial regression (get expert help) or convert to numeric scores and treat as normal (get expert help here too; psychometrics does this a lot).
9. Does each observational unit represent the same amount of time or space?
• No, use an offset in the formula to account for variation in sampling effort between observations, and proceed with using `glm(. ~ . + offset(log(effort.variable)), …, family = poisson)`. Be sure to check for overdispersion.
• Yes, proceed with using `glm(…, family = poisson)`. Be sure to check for overdispersion.
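
To make the two options for a continuous response with exact zeros more concrete, here is a rough sketch on simulated data. It also shows an ANOVA being run on a GLM. Everything here is an assumption for illustration: the data frame `dat`, the columns `site_type`, `virus` and `detected`, and the simulated effect sizes are all made up.

```r
set.seed(42)

# Simulated stand-in data: continuous viral load with many exact zeros
n   <- 90
dat <- data.frame(site_type = factor(rep(c("A", "B"), times = c(40, 50))))
present   <- rbinom(n, size = 1, prob = ifelse(dat$site_type == "A", 0.4, 0.7))
dat$virus <- present * rlnorm(n, meanlog = ifelse(dat$site_type == "A", 1, 2))

# Option 1: add a small constant, log transform, normal errors
m_const <- lm(log(virus + 1) ~ site_type, data = dat)
summary(m_const)

# Option 2, part a: binomial model of presence/absence
dat$detected <- as.integer(dat$virus > 0)
m_pa0 <- glm(detected ~ 1,         family = binomial, data = dat)
m_pa1 <- glm(detected ~ site_type, family = binomial, data = dat)
anova(m_pa0, m_pa1, test = "Chisq")  # an ANOVA (analysis of deviance) on a GLM

# Option 2, part b: normal-error model of the log response, positives only
m_pos <- lm(log(virus) ~ site_type, data = subset(dat, virus > 0))
anova(m_pos)
```

The two-part version answers two separate questions (is the virus more often present, and is the load higher where it is present), which the add-a-constant version blends together.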
