Generating a variable with the column sum of 'NA's in multiple given other variables

rasmustolstrup · September 25, 2019, 7:25am

Hi

I have some trouble generating a new variable with values from frequencies of NAs in other variables. I think it should be pretty straight forward, but after spending a couple of hours trying, I need help.

My work dataset is called 'WD' and the variables I would like to use in the freq count is called 'steps_day1', 'steps_day2', 'steps_day3', 'steps_day4, and 'steps_day5'.

I am trying to run this:

library(questionr)
WD$missingcount <- freq.na(WD$steps_day1,WD$steps_day2,WD$steps_day3,WD$steps_day4,WD$steps_day5,WD$steps_day6,WD$steps_day7)

But it does not work like I want it to. Can anybody assist me?

pieterjanvc · September 25, 2019, 11:05am

Hi,

In order for us to help you with your question, please provide us a minimal reprocudible example where you provide a minimal (dummy) dataset and code that can recreate the issue. One we have that, we can go from there. For help on creating a Reprex, see this guide:

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items: A minimal dataset, necessary to reproduce the issue The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages. Let's quickly go over each one of these with examples: Minimal Dataset (Sample Data) You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame head(iris) #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.…

Good luck!
PJ

rasmustolstrup · September 25, 2019, 11:27am

Hi, sorry about that.

Here is a minimal reproducible example

WD <- data.frame(
   steps_day1 = c(2146, 2093, 3243, 1030, NA, NA, 8944, 5835, 1907, 2278),
   steps_day2 = c(3224, 3632, 4801, 970, 508, 1771, 15560, 2590, 3004, 10773),
   steps_day4 = c(1760, 3165, 2533, 1034, 1637, 1702, 15726, 4504, 5969, 11081),
   steps_day5 = c(3108, 3178, 5426, 1401, 3076, 1330, 4120, 4030, 2288, 15400),
   steps_day6 = c(4148, 1468, 4303, 2472, 1540, 1548, NA, 6439, 2590, 12786),
   steps_day7 = c(1112, 6871, 2895, 843, 1889, 1254, 5590, 4314, 2629, 13624),
   steps_day8 = c(2177, 5290, 8666, 4454, 3687, 5988, 15366, 5867, 4662, 14649),
   steps_day9 = c(294, 1813, 843, 1294, 1745, NA, 2804, 5092, 3440, 14555)
)

My code:

library(questionr)
WD$missingcount <- freq.na(WD$steps_day1,WD$steps_day2,WD$steps_day3,WD$steps_day4,WD$steps_day5,WD$steps_day6,WD$steps_day7)

Does this work?

raytong · September 25, 2019, 2:49pm

Hi @rasmustolstrup. The freq.na is functioning. But you code is not working because the dimension of the result of freq.na is c(8, 2) which is not match with the dimension of the data.frame WD. So, you cannot add the result of freq.na to a column of WD.

WD <- data.frame(
  steps_day1 = c(2146, 2093, 3243, 1030, NA, NA, 8944, 5835, 1907, 2278),
  steps_day2 = c(3224, 3632, 4801, 970, 508, 1771, 15560, 2590, 3004, 10773),
  steps_day4 = c(1760, 3165, 2533, 1034, 1637, 1702, 15726, 4504, 5969, 11081),
  steps_day5 = c(3108, 3178, 5426, 1401, 3076, 1330, 4120, 4030, 2288, 15400),
  steps_day6 = c(4148, 1468, 4303, 2472, 1540, 1548, NA, 6439, 2590, 12786),
  steps_day7 = c(1112, 6871, 2895, 843, 1889, 1254, 5590, 4314, 2629, 13624),
  steps_day8 = c(2177, 5290, 8666, 4454, 3687, 5988, 15366, 5867, 4662, 14649),
  steps_day9 = c(294, 1813, 843, 1294, 1745, NA, 2804, 5092, 3440, 14555)
)

WD
#>    steps_day1 steps_day2 steps_day4 steps_day5 steps_day6 steps_day7
#> 1        2146       3224       1760       3108       4148       1112
#> 2        2093       3632       3165       3178       1468       6871
#> 3        3243       4801       2533       5426       4303       2895
#> 4        1030        970       1034       1401       2472        843
#> 5          NA        508       1637       3076       1540       1889
#> 6          NA       1771       1702       1330       1548       1254
#> 7        8944      15560      15726       4120         NA       5590
#> 8        5835       2590       4504       4030       6439       4314
#> 9        1907       3004       5969       2288       2590       2629
#> 10       2278      10773      11081      15400      12786      13624
#>    steps_day8 steps_day9
#> 1        2177        294
#> 2        5290       1813
#> 3        8666        843
#> 4        4454       1294
#> 5        3687       1745
#> 6        5988         NA
#> 7       15366       2804
#> 8        5867       5092
#> 9        4662       3440
#> 10      14649      14555

questionr::freq.na(WD)
#>            missing  %
#> steps_day1       2 20
#> steps_day6       1 10
#> steps_day9       1 10
#> steps_day2       0  0
#> steps_day4       0  0
#> steps_day5       0  0
#> steps_day7       0  0
#> steps_day8       0  0

^{Created on 2019-09-25 by the reprex package (v0.3.0)}

rasmustolstrup · September 26, 2019, 7:04am

Okay. Thank you for elaborating.

Do you know any package or function that can do that?

raytong · September 26, 2019, 8:53am

Hi @rasmustolstrup. Can you explain more what you expect the resulting table would be.

rasmustolstrup · September 26, 2019, 10:15am

The easiest explanation would be this:

 WD <- data.frame(
         steps_day1 = c(2146, 2093, 3243, 1030, NA, NA, 8944, 5835, 1907, 2278),
         steps_day2 = c(3224, NA, NA, NA, NA, 1771, 15560, 2590, 3004, 10773),
         steps_day4 = c(1760, 3165, 2533, 1034, 1637, 1702, 15726, 4504, 5969, 11081),
         steps_day5 = c(3108, 3178, 5426, 1401, 3076, 1330, 4120, 4030, 2288, 15400)
       )

Then a new code/function

And I would expect WD to look like this

steps_day1 steps_day2 steps_day4 steps_day5 missing
1 2146 3224 1760 3108 0
2 2093 NA 3165 3178 1
3 3243 NA 2533 5426 1
4 1030 NA 1034 1401 1
5 NA NA 1637 3076 2
6 NA 1771 1702 1330 1
7 8944 15560 15726 4120 0
8 5835 2590 4504 4030 0
9 1907 3004 5969 2288 0
10 2278 10773 11081 15400 0

raytong · September 26, 2019, 12:08pm

Hi @rasmustolstrup. You may do it with pmap_dbl.

library(tidyverse)

WD <- data.frame(
  steps_day1 = c(2146, 2093, 3243, 1030, NA, NA, 8944, 5835, 1907, 2278),
  steps_day2 = c(3224, 3632, 4801, 970, 508, 1771, 15560, 2590, 3004, 10773),
  steps_day4 = c(1760, 3165, 2533, 1034, 1637, 1702, 15726, 4504, 5969, 11081),
  steps_day5 = c(3108, 3178, 5426, 1401, 3076, 1330, 4120, 4030, 2288, 15400),
  steps_day6 = c(4148, 1468, 4303, 2472, 1540, 1548, NA, 6439, 2590, 12786),
  steps_day7 = c(1112, 6871, 2895, 843, 1889, 1254, 5590, 4314, 2629, 13624),
  steps_day8 = c(2177, 5290, 8666, 4454, 3687, 5988, 15366, 5867, 4662, 14649),
  steps_day9 = c(294, 1813, 843, 1294, 1745, NA, 2804, 5092, 3440, 14555)
)

WD %>%
  mutate(missing = pmap_dbl(., function(...)sum(is.na(c(...)))))
#>    steps_day1 steps_day2 steps_day4 steps_day5 steps_day6 steps_day7
#> 1        2146       3224       1760       3108       4148       1112
#> 2        2093       3632       3165       3178       1468       6871
#> 3        3243       4801       2533       5426       4303       2895
#> 4        1030        970       1034       1401       2472        843
#> 5          NA        508       1637       3076       1540       1889
#> 6          NA       1771       1702       1330       1548       1254
#> 7        8944      15560      15726       4120         NA       5590
#> 8        5835       2590       4504       4030       6439       4314
#> 9        1907       3004       5969       2288       2590       2629
#> 10       2278      10773      11081      15400      12786      13624
#>    steps_day8 steps_day9 missing
#> 1        2177        294       0
#> 2        5290       1813       0
#> 3        8666        843       0
#> 4        4454       1294       0
#> 5        3687       1745       1
#> 6        5988         NA       2
#> 7       15366       2804       1
#> 8        5867       5092       0
#> 9        4662       3440       0
#> 10      14649      14555       0

^{Created on 2019-09-26 by the reprex package (v0.3.0)}

rasmustolstrup · September 26, 2019, 1:26pm

Hi @raytong

Thank you for your help. So I am getting closer. I am running this now:

WD %>%
  mutate(missing = pmap_dbl(., function(...)sum(is.na(c(WD$steps_day1,WD$steps_day2,WD$steps_day3,WD$steps_day4,
                                                        WD$steps_day5,WD$steps_day6,WD$steps_day7,WD$steps_day8,WD$steps_day9,WD$steps_day10,WD$steps_day11,WD$steps_day12,WD$steps_day13,WD$steps_day14,
                                                        WD$steps_day15,WD$steps_day16,WD$steps_day17,WD$steps_day18,WD$steps_day19,WD$steps_day20,WD$steps_day21,WD$steps_day22,WD$steps_day23,WD$steps_day24,
                                                        WD$steps_day25,WD$steps_day26,WD$steps_day27,WD$steps_day28,WD$steps_day29,WD$steps_day30,WD$steps_day31,WD$steps_day32,WD$steps_day33,WD$steps_day34,
                                                        WD$steps_day35,WD$steps_day36,WD$steps_day37,WD$steps_day38,WD$steps_day39,WD$steps_day40,WD$steps_day41,WD$steps_day42,WD$steps_day43,WD$steps_day44,
                                                        WD$steps_day45,WD$steps_day46,WD$steps_day47,WD$steps_day48,WD$steps_day49,WD$steps_day50,WD$steps_day51,WD$steps_day52,WD$steps_day53,WD$steps_day54,
                                                        WD$steps_day55,WD$steps_day56,WD$steps_day57,WD$steps_day58,WD$steps_day59,WD$steps_day60,WD$steps_day61,WD$steps_day62,WD$steps_day63,WD$steps_day64,
                                                        WD$steps_day65,WD$steps_day66,WD$steps_day67,WD$steps_day68,WD$steps_day69,WD$steps_day70,WD$steps_day71,WD$steps_day72,WD$steps_day73,WD$steps_day74,
                                                        WD$steps_day75,WD$steps_day76,WD$steps_day77,WD$steps_day78,WD$steps_day79,WD$steps_day80,WD$steps_day81,WD$steps_day82,WD$steps_day83,WD$steps_day84,
                                                        WD$steps_day85,WD$steps_day86,WD$steps_day87,WD$steps_day88,WD$steps_day89,WD$steps_day90,WD$steps_day91)))))

But It does not produce any extra variable. Can you guide me?
Ps. this is the full code, as I would like to have the NA count for the 91 days of an intervention period.

Cheers

raytong · September 26, 2019, 1:42pm

@rasmustolstrup. If all your data already in the data frame WD, no need to modify ... in the code. Just run the code. The ... ellipsis means the function will take any number of arguments to the function and process all values in c(...).

WD %>%
  mutate(missing = pmap_dbl(., function(...)sum(is.na(c(...)))))

rasmustolstrup · September 26, 2019, 2:02pm

I am running it with copy and paste. I have two problems with it: 1) no variable is added to the dataframe WD and 2) if 1) is solved, then I still have the problem that it sums ALL the NAs in WD, and not only the ones from the 91 days of intervention.

Again, thank you for your help.

Best

raytong · September 26, 2019, 2:16pm

@rasmustolstrup. 1) Are there any error message when you run the code. If yes, post here and see what the problem are. 2) The code is not to sum all the NAs in the whole data frame. It sums the number of NAs of each row.

Yarnabrina · September 26, 2019, 2:42pm

@raytong, I think OP believes no variable is added to WD because the new column is not assigned to WD.

@rasmustolstrup, the above answer by @raytong indeed adds an column, but it is not assigned to the already existing WD automatically. It doesn't modify in place. You'll have to use something like this:

WD <- WD %>%
  mutate(missing = pmap_dbl(., function(...)sum(is.na(c(...)))))

Or, you can load the magrittr package explicitly and use assignment pipe.

Or, in case you don't have to use tidyverse functions, just use this:

WD <- data.frame(steps_day1 = c(2146, 2093, 3243, 1030, NA, NA, 8944, 5835, 1907, 2278),
                 steps_day2 = c(3224, 3632, 4801, 970, 508, 1771, 15560, 2590, 3004, 10773),
                 steps_day4 = c(1760, 3165, 2533, 1034, 1637, 1702, 15726, 4504, 5969, 11081),
                 steps_day5 = c(3108, 3178, 5426, 1401, 3076, 1330, 4120, 4030, 2288, 15400),
                 steps_day6 = c(4148, 1468, 4303, 2472, 1540, 1548, NA, 6439, 2590, 12786),
                 steps_day7 = c(1112, 6871, 2895, 843, 1889, 1254, 5590, 4314, 2629, 13624),
                 steps_day8 = c(2177, 5290, 8666, 4454, 3687, 5988, 15366, 5867, 4662, 14649),
                 steps_day9 = c(294, 1813, 843, 1294, 1745, NA, 2804, 5092, 3440, 14555))

WD$no_NA_values <- rowSums(x = is.na(x = WD))

WD

In case you wanted to count NA from a few columns only (as Andres guessed below), you can do this:

WD$no_NA_values <- rowSums(x = is.na(x = WD[grepl(pattern = "<your_required_pattern>", x = names(x = WD))]))

andresrcs · September 26, 2019, 2:51pm

If I understand you correctly, you want to select specific columns, then you can do something like this, obviously you would have to fine tune the regular expression to your actual needs

library(dplyr)

WD <- data.frame(
    steps_day1 = c(2146, 2093, 3243, 1030, NA, NA, 8944, 5835, 1907, 2278),
    steps_day2 = c(3224, 3632, 4801, 970, 508, 1771, 15560, 2590, 3004, 10773),
    steps_day4 = c(1760, 3165, 2533, 1034, 1637, 1702, 15726, 4504, 5969, 11081),
    steps_day5 = c(3108, 3178, 5426, 1401, 3076, 1330, 4120, 4030, 2288, 15400),
    steps_day6 = c(4148, 1468, 4303, 2472, 1540, 1548, NA, 6439, 2590, 12786),
    steps_day7 = c(1112, 6871, 2895, 843, 1889, 1254, 5590, 4314, 2629, 13624),
    steps_day8 = c(2177, 5290, 8666, 4454, 3687, 5988, 15366, 5867, 4662, 14649),
    steps_day9 = c(294, 1813, 843, 1294, 1745, NA, 2804, 5092, 3440, 14555)
)

WD <- WD %>%
    mutate(missing = rowSums(is.na(select(., matches("[12456789]$")))))
WD
#>    steps_day1 steps_day2 steps_day4 steps_day5 steps_day6 steps_day7
#> 1        2146       3224       1760       3108       4148       1112
#> 2        2093       3632       3165       3178       1468       6871
#> 3        3243       4801       2533       5426       4303       2895
#> 4        1030        970       1034       1401       2472        843
#> 5          NA        508       1637       3076       1540       1889
#> 6          NA       1771       1702       1330       1548       1254
#> 7        8944      15560      15726       4120         NA       5590
#> 8        5835       2590       4504       4030       6439       4314
#> 9        1907       3004       5969       2288       2590       2629
#> 10       2278      10773      11081      15400      12786      13624
#>    steps_day8 steps_day9 missing
#> 1        2177        294       0
#> 2        5290       1813       0
#> 3        8666        843       0
#> 4        4454       1294       0
#> 5        3687       1745       1
#> 6        5988         NA       2
#> 7       15366       2804       1
#> 8        5867       5092       0
#> 9        4662       3440       0
#> 10      14649      14555       0

rasmustolstrup · September 27, 2019, 8:31am

I think the solution you mentioned last, should work the best. It does indeed ad a new column/variable as I wished for, but it does not count the NAs in my "pattern". I have tried to specify it in different ways, but it does not work.

    WD$no_NA_values <- rowSums(x = is.na(x = WD[grepl(pattern = "<WD$steps_day1,WD$steps_day2,WD$steps_day3,WD$steps_day4,
                                                   WD$steps_day5,WD$steps_day6,WD$steps_day7,WD$steps_day8,WD$steps_day9,WD$steps_day10,WD$steps_day11,WD$steps_day12,WD$steps_day13,WD$steps_day14,
                                                  WD$steps_day15,WD$steps_day16,WD$steps_day17,WD$steps_day18,WD$steps_day19,WD$steps_day20,WD$steps_day21,WD$steps_day22,WD$steps_day23,WD$steps_day24,
                                                  WD$steps_day25,WD$steps_day26,WD$steps_day27,WD$steps_day28,WD$steps_day29,WD$steps_day30,WD$steps_day31,WD$steps_day32,WD$steps_day33,WD$steps_day34,
                                                  WD$steps_day35,WD$steps_day36,WD$steps_day37,WD$steps_day38,WD$steps_day39,WD$steps_day40,WD$steps_day41,WD$steps_day42,WD$steps_day43,WD$steps_day44,
                                                  WD$steps_day45,WD$steps_day46,WD$steps_day47,WD$steps_day48,WD$steps_day49,WD$steps_day50,WD$steps_day51,WD$steps_day52,WD$steps_day53,WD$steps_day54,
                                                  WD$steps_day55,WD$steps_day56,WD$steps_day57,WD$steps_day58,WD$steps_day59,WD$steps_day60,WD$steps_day61,WD$steps_day62,WD$steps_day63,WD$steps_day64,
                                                  WD$steps_day65,WD$steps_day66,WD$steps_day67,WD$steps_day68,WD$steps_day69,WD$steps_day70,WD$steps_day71,WD$steps_day72,WD$steps_day73,WD$steps_day74,
                                                  WD$steps_day75,WD$steps_day76,WD$steps_day77,WD$steps_day78,WD$steps_day79,WD$steps_day80,WD$steps_day81,WD$steps_day82,WD$steps_day83,WD$steps_day84,
                                                  WD$steps_day85,WD$steps_day86,WD$steps_day87,WD$steps_day88,WD$steps_day89,WD$steps_day90,WD$steps_day91>", x = names(x = WD))]))

or
¨

WD$no_NA_values <- rowSums(x = is.na(x = WD[grepl(pattern = "steps_day1,steps_day2,steps_day3,steps_day4,
                                                  steps_day5,steps_day6,steps_day7,steps_day8,steps_day9,steps_day10,steps_day11,steps_day12,steps_day13,steps_day14,
                                                  steps_day15,steps_day16,steps_day17,steps_day18,steps_day19,steps_day20,steps_day21,steps_day22,steps_day23,steps_day24,
                                                  steps_day25,steps_day26,steps_day27,steps_day28,steps_day29,steps_day30,steps_day31,steps_day32,steps_day33,steps_day34,
                                                  steps_day35,steps_day36,steps_day37,steps_day38,steps_day39,steps_day40,steps_day41,steps_day42,steps_day43,steps_day44,
                                                  steps_day45,steps_day46,steps_day47,steps_day48,steps_day49,steps_day50,steps_day51,steps_day52,steps_day53,steps_day54,
                                                  steps_day55,steps_day56,steps_day57,steps_day58,steps_day59,steps_day60,steps_day61,steps_day62,steps_day63,steps_day64,
                                                  steps_day65,steps_day66,steps_day67,steps_day68,steps_day69,steps_day70,steps_day71,steps_day72,steps_day73,steps_day74,
                                                  steps_day75,steps_day76,steps_day77,steps_day78,steps_day79,steps_day80,steps_day81,steps_day82,steps_day83,steps_day84,
                                                  steps_day85,steps_day86,steps_day87,steps_day88,steps_day89,steps_day90,steps_day91", x = names(x = WD))])

I believe I am doing something wrong with the pattern, but how should I specify the variables that I would like to use?

Thank you for your help.

rasmustolstrup · September 27, 2019, 8:38am

This actually seem to work well, but for rows with 91xNA in the 91 variables, the "missing" column counts only 86 NAs. Does this have something to to with the matches("[12456789]$")? Furthermore, my data frame also holds several other variables (questionnaires etc) who also contains NAs, but these should not be included in the "missing" variable. Can I do something about that?

Best

andresrcs · September 27, 2019, 12:36pm

Yes, this function select columns which names match that particular "regular expression", and since I don't know the name of the other columns I have define it to match just the columns in your sample data.
If I'm guessing right only the variables you want to check for NA have names that start with "steps_day", so you could use this regular expression instead.

matches("^steps_day\\d{1,2}")

rasmustolstrup · September 29, 2019, 8:04am

This seem to work

 WD <- WD %>%
  mutate(missing = rowSums(is.na(select(., matches("^steps_day\\d{1,2}")))))

THANK YOU!!!

"insert meme about happiness and joy in life"

rasmustolstrup · September 29, 2019, 8:16am

Last question, just to understand everything. What does {1,2} do in the code?

Best

jcblum · September 29, 2019, 10:59am

Those are regular expression quantifiers. Here’s a gloss on what @andresrcs’s regular expression means:

^steps_day\\d{1,2}

^ : beginning of the text
steps_day : these literal characters, in exactly this order
\\d : any single digit (normally \d, but because of how R strings work the \ has to be escaped with another \)
{1,2} : the immediately preceding element should appear at least once and no more than twice

Putting that all together, the pattern can be read as: “find the strings that start with ‘steps_day’ followed by at least one and no more than two digits”. Note that if you happened to have a column named steps_day25_otherlabel, this pattern would also match it, since it does not include an end-of-text anchor.

Regular expressions are extremely useful and well worth spending some time learning. A fantastic starting place for learning how they work is https://regexr.com/

And if you’re working in RStudio and learning about regular expressions, you should definitely check out the RegExplain RStudio Addin (heavily inspired by RegExr, but with all the R-specific quirks taken into account!)