column means when dealing with numeric and non-numeric data frames

Hi everyone, so I am trying to find sum, means, and standard deviation of the below data set, and I get the error ,'x' must be numeric after doing colSums(x) and colMeans(x).

I then used this for loop (for(i in 1:ncol(sql_db)) {if (class(sql_db[,i])=='numeric') {colSums(sql_db)} else {0}
) so that it ignores classes of columns that arent 'numeric', still it didnt work. please what can I do?

sql_db
age sex credit_turnover acct_bal
1 23 f NaN 100
2 21 m 28 23
3 NaN f 64 46
4 28 f NaN 87
5 41 m 35 NaN

library(tidyverse)

(headiris <- head(iris) %>% mutate(Species=as.character(Species)))

colSums(headiris)

select_if(headiris,is.numeric) %>% colSums()

#cause an element to be NaN
headiris[1,1] <- NaN
headiris

select_if(headiris,is.numeric) %>% colSums()

select_if(headiris,is.numeric) %>% colSums(na.rm = TRUE)

Using the summarize_if function in the dplyr package, you could do something like this. Note that I'm using tribble to create a data frame from the data you provided, and then using pivot_longer from the tidyr package to make the output a bit easier to comprehend.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
sql_db <- tribble(~age, ~sex, ~credit_turnover, ~acct_bal,
        23, "f", NaN,  100,
        21, "m", 28 , 23,
        NaN, "f",  64,  46,
        28, "f", NaN,  87,
        41, "m", 35 , NaN)

sql_db %>% 
  summarize_if(is.numeric, 
               .funs = list("sum" = ~sum(., na.rm = TRUE),
                            "mean" = ~mean(., na.rm = TRUE), 
                            "sd" = ~sd(., na.rm = TRUE))) %>% 
  pivot_longer(everything())
#> # A tibble: 9 x 2
#>   name                  value
#>   <chr>                 <dbl>
#> 1 age_sum              113   
#> 2 credit_turnover_sum  127   
#> 3 acct_bal_sum         256   
#> 4 age_mean              28.2 
#> 5 credit_turnover_mean  42.3 
#> 6 acct_bal_mean         64   
#> 7 age_sd                 9.00
#> 8 credit_turnover_sd    19.1 
#> 9 acct_bal_sd           35.7

Created on 2020-06-30 by the reprex package (v0.3.0)

Thank you very much, this worked

Thank you very much, this worked perfectly. so I am trying to get count of 'Na' in each column of data set 'sql_db', my idea was to ask R if they are Na values in 'sql_db' using is.na and then it returns true and false for each cell, and then converted trues to 0 and false to '1', so I sum for each column to get the total Na's. However this sum do not work for a character variable. How do I go about this? or do you have a better approach to fetch "Na's" for each column of sql_db?

I'm not sure why it doesn't work for you for the character columns. I'd do the same thing, using sum(is.na()).

To showcase this, lets' recreate the sql_db table. This time I'm also loading the purrr package, because I'm in the habit of using the map family of functions.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(purrr)
sql_db <- tribble(~age, ~sex, ~credit_turnover, ~acct_bal,
                  23, "f", NaN,  100,
                  21, "m", 28 , 23,
                  NaN, "f",  64,  46,
                  28, "f", NaN,  87,
                  41, "m", 35 , NaN)

So here we use a map type function for summing the 1s / TRUEs together for each column in the dataframe. Specifically we're using the map_dbl function because we want the function to return a double / numeric vector.

sql_db %>% 
  map_dbl(~sum(is.na(.)))
#>             age             sex credit_turnover        acct_bal 
#>               1               0               2               1

If we were to add an NA to first two values in the the sex column. You can see that this also works, even though sex is a character variable.

sql_db$sex[1:2] <- NA
  
sql_db %>% 
  map_dbl(~sum(is.na(.)))
#>             age             sex credit_turnover        acct_bal 
#>               1               2               2               1

What I'm guessing could be the problem is that perhaps the "NA"s in your character variables aren't actually NA values, but rather character strings containing the characters "NA".

sql_db$sex[1:2] <- "NA"

sql_db
#> # A tibble: 5 x 4
#>     age sex   credit_turnover acct_bal
#>   <dbl> <chr>           <dbl>    <dbl>
#> 1    23 NA                NaN      100
#> 2    21 NA                 28       23
#> 3   NaN f                  64       46
#> 4    28 f                 NaN       87
#> 5    41 m                  35      NaN

sql_db %>% 
  map_dbl(~sum(is.na(.)))
#>             age             sex credit_turnover        acct_bal 
#>               1               0               2               1

If this is the case then the is.na() function won't count them, as they're not actual NAs as far as R can tell. So we need to tell R that strings consisting of the characters "NA" should be NA.

So here we use a mutate_if function to say that if a column is a character column we will change that character column in a way that makes values NA if they simply consist of the characters "NA".

sql_db$sex[1:2] <- "NA" # Again, this isn't an actual NA

sql_db %>% 
  mutate_if(is.character,
            ~na_if(., "NA")) %>% # Here we make them actual NA values.
  map_dbl(~sum(is.na(.)))
#>             age             sex credit_turnover        acct_bal 
#>               1               2               2               1

Here's hoping that this actually helps you. Best of luck and let me know if this is not what's causing the problem.

Thank you so much. I enjoyed your approach of solving the problem and this worked perfectly well. Just that I dont know if you can explain what the map_dbl does?

You can think of the map functions as a way of applying a function to every element in a list. For dataframes that means applying it to every column. But writing that out just now I realize you could of course do the same thing with summarize_all from dplyr, and then you wouldn't get a numeric vector as the output but rather a dataframe with one row:

sql_db$sex[1:2] <- "NA"

sql_db %>% 
  mutate_if(is.character,
            ~na_if(., "NA")) %>% 
  summarize_all(~sum(is.na(.)))
#> # A tibble: 1 x 4
#>     age   sex credit_turnover acct_bal
#>   <int> <int>           <int>    <int>
#> 1     1     2               2        1

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.