Simple R function that I can't get to work

dpong · July 13, 2018, 10:18pm

[Screenshot of code: 2018-07-13_16-01-18]

What I want is a function that calculates, for each column (metric) in my data frame, the percentage of rows that are 0. I believe I can then simply plug it into lapply to do the trick.

But I struggled to get this function to work.

Any guidance is appreciated.
-DP

jcblum · July 13, 2018, 11:17pm

Here are two ways to do it. The difference between them is how NAs are handled. In the first function, NAs are ignored completely, so you get the proportion of non-NA values that are zero. In the second function, the denominator includes NA rows, so you get the proportion of all values that are zero.

df <- data.frame(
  x = c(NA, sample(0:1, 99, replace = TRUE)),
  y = sample(0:4, 100, replace = TRUE),
  z = sample(0:9, 100, replace = TRUE)
)

pct_zero_1 <- function(x) {
  mean(x == 0, na.rm = TRUE)
}

lapply(df, pct_zero_1)
#> $x
#> [1] 0.5454545
#> 
#> $y
#> [1] 0.13
#> 
#> $z
#> [1] 0.08

pct_zero_2 <- function(x) {
  sum(x == 0, na.rm = TRUE) / length(x)
}

lapply(df, pct_zero_2)
#> $x
#> [1] 0.54
#> 
#> $y
#> [1] 0.13
#> 
#> $z
#> [1] 0.08
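
To make the difference between the two denominators concrete, here is a minimal sketch on a small hand-made vector. The vector `v` is made up for illustration and is not part of the data above. With one NA among four values, the two functions give different answers:

```r
# Same two functions as above
pct_zero_1 <- function(x) {
  mean(x == 0, na.rm = TRUE)             # denominator: non-NA values only
}

pct_zero_2 <- function(x) {
  sum(x == 0, na.rm = TRUE) / length(x)  # denominator: all values, NA included
}

# Hypothetical vector: one NA, two zeros, one non-zero value
v <- c(NA, 0, 0, 1)

pct_zero_1(v)  # 2 zeros out of 3 non-NA values
#> [1] 0.6666667
pct_zero_2(v)  # 2 zeros out of 4 values
#> [1] 0.5
```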

By the way, it’s better not to post screenshots of code. They can be hard to read, and are invisible to search. If you want to post an error message, you can copy and paste it from the console. To format your code properly, select your pasted code (or console output) and use the little </> button at the top of the posting box.

dpong · July 13, 2018, 11:26pm

Thanks a lot! This is exactly what I was looking for. I wasn't thinking: since I was already using lapply, there was no reason to use a for loop. I got so hung up on the class matching. I thought I had done something wrong when computing a numeric from two numbers, i.e. the binary operation referred to in the error message.

No worries, I already took care of all the NAs in Vertica and recoded them to 0, so the na.rm portion is not needed. But that's a good point to always think about the scope of the denominator.
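
When a data frame has no NAs, as described here, the two denominators coincide, and the calculation can also be written without lapply. A minimal sketch, assuming every column is numeric; the data frame `d` is invented for illustration:

```r
# Hypothetical data frame with no NAs
# (a stand-in for data where NAs were already recoded to 0)
d <- data.frame(
  a = c(0, 1, 0, 2),
  b = c(3, 0, 4, 5)
)

# d == 0 gives a logical matrix; colMeans() averages each column,
# which is the proportion of zeros per column
colMeans(d == 0)
#>    a    b 
#> 0.50 0.25 

# Same numbers via sapply(), returned as a named vector instead of a list
sapply(d, function(x) mean(x == 0))
```

One thing to keep in mind: recoding NA to 0 counts the former NAs as zeros, so the result answers a slightly different question than either function above.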

Leon · July 14, 2018, 9:26am

As I see it, you ask for the percentage of zero observations per variable (column), and then you mention lapply, so perhaps this is for a list of data frames?

If so, then this should get you going, similar to @jcblum's suggestion, but on a list of matrices:

# Reproducible example
set.seed(842569)

# Create function for making dummy matrices
mk_dummy_mat = function(n_row = 50, n_col = 20, n_zeros = 50){

  # Create dummy matrix
  d = matrix(rnorm(n_row * n_col), nrow = n_row, ncol = n_col)
  
  # Fill in some zeros
  d[sample(1:(n_row * n_col), n_zeros)] = 0

  # Done, return
  return(d)

}

# Create function for calculating percent zeros in data columns
percentage_0_check = function(x){
  
  # Look at each column, count n_zero and divide by column length
  perc_zeros = apply(x, 2, function(col){ return(mean(col == 0)) })
  
  # Done, return
  return(perc_zeros)

}

# Create a list of dummy matrices
my_mats = list(m1 = mk_dummy_mat(), m2 = mk_dummy_mat(), m3 = mk_dummy_mat())

# Calculate columnwise percentage zero content for all matrices
lapply(my_mats, percentage_0_check)
$m1
 [1] 0.08 0.08 0.06 0.02 0.02 0.02 0.12 0.06 0.04 0.04 0.02 0.08 0.04 0.04 0.02 0.08 0.02 0.06 0.04
[20] 0.06

$m2
 [1] 0.08 0.06 0.02 0.04 0.08 0.06 0.02 0.06 0.04 0.06 0.06 0.02 0.08 0.00 0.04 0.08 0.00 0.06 0.08
[20] 0.06

$m3
 [1] 0.06 0.04 0.06 0.06 0.02 0.10 0.06 0.08 0.00 0.04 0.08 0.06 0.04 0.04 0.02 0.04 0.06 0.00 0.02
[20] 0.12
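
If the input really is a list of data frames rather than matrices, the same idea can be sketched by nesting the two apply-style calls. The names `my_dfs` and `pct_zero_df` are made up for this example, and the data is invented:

```r
# Hypothetical list of two small data frames
my_dfs <- list(
  d1 = data.frame(x = c(0, 1, 0, 0), y = c(1, 2, 3, 0)),
  d2 = data.frame(x = c(5, 0, 6, 7), y = c(0, 0, 8, 9))
)

# For one data frame: proportion of zeros in each column
pct_zero_df <- function(d) {
  vapply(d, function(col) mean(col == 0), numeric(1))
}

# Apply to every data frame in the list
lapply(my_dfs, pct_zero_df)
#> $d1
#>    x    y 
#> 0.75 0.25 
#> 
#> $d2
#>    x    y 
#> 0.25 0.50 
```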

Hope this helps