column means when dealing with numeric and non-numeric data frames

Hi everyone, so I am trying to find sum, means, and standard deviation of the below data set, and I get the error ,'x' must be numeric after doing colSums(x) and colMeans(x).

I then used this for loop (for(i in 1:ncol(sql_db)) {if (class(sql_db[,i])=='numeric') {colSums(sql_db)} else {0}
) so that it ignores classes of columns that arent 'numeric', still it didnt work. please what can I do?

sql_db
age sex credit_turnover acct_bal
1 23 f NaN 100
2 21 m 28 23
3 NaN f 64 46
4 28 f NaN 87
5 41 m 35 NaN

library(tidyverse)

(headiris <- head(iris) %>% mutate(Species=as.character(Species)))

colSums(headiris)

select_if(headiris,is.numeric) %>% colSums()

#cause an element to be NaN
headiris[1,1] <- NaN
headiris

select_if(headiris,is.numeric) %>% colSums()

select_if(headiris,is.numeric) %>% colSums(na.rm = TRUE)

Using the summarize_if function in the dplyr package, you could do something like this. Note that I'm using tribble to create a data frame from the data you provided, and then using pivot_longer from the tidyr package to make the output a bit easier to comprehend.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
sql_db <- tribble(~age, ~sex, ~credit_turnover, ~acct_bal,
        23, "f", NaN,  100,
        21, "m", 28 , 23,
        NaN, "f",  64,  46,
        28, "f", NaN,  87,
        41, "m", 35 , NaN)

sql_db %>% 
  summarize_if(is.numeric, 
               .funs = list("sum" = ~sum(., na.rm = TRUE),
                            "mean" = ~mean(., na.rm = TRUE), 
                            "sd" = ~sd(., na.rm = TRUE))) %>% 
  pivot_longer(everything())
#> # A tibble: 9 x 2
#>   name                  value
#>   <chr>                 <dbl>
#> 1 age_sum              113   
#> 2 credit_turnover_sum  127   
#> 3 acct_bal_sum         256   
#> 4 age_mean              28.2 
#> 5 credit_turnover_mean  42.3 
#> 6 acct_bal_mean         64   
#> 7 age_sd                 9.00
#> 8 credit_turnover_sd    19.1 
#> 9 acct_bal_sd           35.7

Created on 2020-06-30 by the reprex package (v0.3.0)

Thank you very much, this worked

Thank you very much, this worked perfectly. so I am trying to get count of 'Na' in each column of data set 'sql_db', my idea was to ask R if they are Na values in 'sql_db' using is.na and then it returns true and false for each cell, and then converted trues to 0 and false to '1', so I sum for each column to get the total Na's. However this sum do not work for a character variable. How do I go about this? or do you have a better approach to fetch "Na's" for each column of sql_db?

You can think of the map functions as a way of applying a function to every element in a list. For dataframes that means applying it to every column. But writing that out just now I realize you could of course do the same thing with summarize_all from dplyr, and then you wouldn't get a numeric vector as the output but rather a dataframe with one row:

sql_db$sex[1:2] <- "NA"

sql_db %>% 
  mutate_if(is.character,
            ~na_if(., "NA")) %>% 
  summarize_all(~sum(is.na(.)))
#> # A tibble: 1 x 4
#>     age   sex credit_turnover acct_bal
#>   <int> <int>           <int>    <int>
#> 1     1     2               2        1

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

I'm not sure why it doesn't work for you for the character columns. I'd do the same thing, using sum(is.na()).

To showcase this, lets' recreate the sql_db table. This time I'm also loading the purrr package, because I'm in the habit of using the map family of functions.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(purrr)
sql_db <- tribble(~age, ~sex, ~credit_turnover, ~acct_bal,
                  23, "f", NaN,  100,
                  21, "m", 28 , 23,
                  NaN, "f",  64,  46,
                  28, "f", NaN,  87,
                  41, "m", 35 , NaN)

So here we use a map type function for summing the 1s / TRUEs together for each column in the dataframe. Specifically we're using the map_dbl function because we want the function to return a double / numeric vector.

sql_db %>% 
  map_dbl(~sum(is.na(.)))
#>             age             sex credit_turnover        acct_bal 
#>               1               0               2               1

If we were to add an NA to first two values in the the sex column. You can see that this also works, even though sex is a character variable.

sql_db$sex[1:2] <- NA
  
sql_db %>% 
  map_dbl(~sum(is.na(.)))
#>             age             sex credit_turnover        acct_bal 
#>               1               2               2               1

What I'm guessing could be the problem is that perhaps the "NA"s in your character variables aren't actually NA values, but rather character strings containing the characters "NA".

sql_db$sex[1:2] <- "NA"

sql_db
#> # A tibble: 5 x 4
#>     age sex   credit_turnover acct_bal
#>   <dbl> <chr>           <dbl>    <dbl>
#> 1    23 NA                NaN      100
#> 2    21 NA                 28       23
#> 3   NaN f                  64       46
#> 4    28 f                 NaN       87
#> 5    41 m                  35      NaN

sql_db %>% 
  map_dbl(~sum(is.na(.)))
#>             age             sex credit_turnover        acct_bal 
#>               1               0               2               1

If this is the case then the is.na() function won't count them, as they're not actual NAs as far as R can tell. So we need to tell R that strings consisting of the characters "NA" should be NA.

So here we use a mutate_if function to say that if a column is a character column we will change that character column in a way that makes values NA if they simply consist of the characters "NA".

sql_db$sex[1:2] <- "NA" # Again, this isn't an actual NA

sql_db %>% 
  mutate_if(is.character,
            ~na_if(., "NA")) %>% # Here we make them actual NA values.
  map_dbl(~sum(is.na(.)))
#>             age             sex credit_turnover        acct_bal 
#>               1               2               2               1

Here's hoping that this actually helps you. Best of luck and let me know if this is not what's causing the problem.

Thank you so much. I enjoyed your approach of solving the problem and this worked perfectly well. Just that I dont know if you can explain what the map_dbl does?