NA for string variables not always working?

Slavek · October 8, 2019, 9:15am

Hi,
I have this simple data frame:

data.frame(stringsAsFactors=FALSE,
                       URN = c("aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg", "hhh", "iii"),
                        E1 = c(1, 2, 3, NA, NA, NA, NA, NA, NA),
                     string = c("ddd", "s", NA, "fr", "rf", "f", "aa", "v", "bg"),
                        D1 = c(-1, 10, 6, -1, 8, 9, 7, -1, 99)
              )

and I would like to create new variable indicating NAs in the "string" variable.

data.frame<- data.frame%>% 
  mutate(XXX = if_else(is.na(x = data.frame$string),1,0))

I used this code end everything works fine (XXX for URN=ccc is 1)
Now, I have another data frame:

data.frame(
                              VitalCode = c(223L, 221L, 221L, 221L, 2415L),
                           InvoiceTotal = c(11.57, 35.33, 33.06, 25.04, 276.15),
                            LabourTotal = c(65, 130, 130, 137, 215.58),
                             PartsTotal = c(31.31, 191.11, 197.55, 73.49, 60.57),
                           AuraURNMatch = c(0, 0, 0, 0, 0),
                             DealerName = as.factor(c("AAA", "BBB", "CCC", "DDD", "EEE")),
                             Accountnum = as.factor(c("A", "A", "A", "A", "B")),
                               Division = as.factor(c("4W", "4W", "4W", "4W", "4W")),
                        ReferenceNumber = as.factor(c("ad", "das", "ee", "ad", "ad")),
                              VITAL.URN = as.factor(c(NA, NA, NA, NA, "V20619738065")),
                               URN = as.factor(c(NA, NA, NA, NA, "VE21511906256751")),
                                 RoDate = as.factor(c("17/12/2012 00:00", "05/03/2013 00:00",
                                                      "15/03/2013 00:00",
                                                      "10/06/2014 00:00",
                                                      "26/06/2019 00:00")),
                          OperationCode = as.factor(c("MEC", "MEC", "MEC", "MST", "Service|Other")),
                        WorkDescription = as.factor(c("aaa", "sadd", "sfe", "gdtt", "gret")),
                                JobType = as.factor(c("Retail", "Retail", "Retail", "Retail",
                                                      "Retail")),
                            ServiceFlag = as.factor(c("NO", "NO", "NO", "YES", "YES")),
                                MOTFlag = as.factor(c("NO", "NO", "NO", "YES", "NO"))
                     )

but the same code does not generate what I need:

data.frame <- data.frame %>% 
  mutate(XXX= if_else(is.na(x = data.frame$aura.URN),1,0))

because I have 0s for all URNs (the last one should generate 1).

Any ideas?

Slavek · October 8, 2019, 9:21am

I found a way around it:

library(stringr)
data.frame <- data.frame %>% 
  mutate(XXX = ifelse(test = (str_length(string = data.frame$URN) < 1),yes = 0,no = 1))

but just wondering that I did wrong...

Yarnabrina · October 8, 2019, 10:23am

I can think of multiple problems:

There is no column aura.URN in your dataset.
You haven't provided us what name you gave to the dataset, but most probably it is not data.frame.
Even if you actually named it as data.frame, that is a very bad name.
Also, you don't need to use the name of the data.frame inside mutate.
In your 1st dataset (where you succeeded), you specified stringsAsFactors=FALSE, and here you're forcing all to be factors using as.factor.
There is no reason to use if_else, as is.na gives a logical vector, and converting that to integer directly is much better.
The way you've written, missing values will become 1, and non-missing will become 0. Hence, last value will not be 1 as you expect.

One way to get 1 for non-missing and 0 for missing:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

dataset <- data.frame(VitalCode = c(223L, 221L, 221L, 221L, 2415L),
                      InvoiceTotal = c(11.57, 35.33, 33.06, 25.04, 276.15),
                      LabourTotal = c(65, 130, 130, 137, 215.58),
                      PartsTotal = c(31.31, 191.11, 197.55, 73.49, 60.57),
                      AuraURNMatch = c(0, 0, 0, 0, 0),
                      DealerName = c("AAA", "BBB", "CCC", "DDD", "EEE"),
                      Accountnum = c("A", "A", "A", "A", "B"),
                      Division = c("4W", "4W", "4W", "4W", "4W"),
                      ReferenceNumber = c("ad", "das", "ee", "ad", "ad"),
                      VITAL.URN = c(NA, NA, NA, NA, "V20619738065"),
                      URN = c(NA, NA, NA, NA, "VE21511906256751"),
                      RoDate = c("17/12/2012 00:00", "05/03/2013 00:00", "15/03/2013 00:00", "10/06/2014 00:00", "26/06/2019 00:00"),
                      OperationCode = c("MEC", "MEC", "MEC", "MST", "Service|Other"),
                      WorkDescription = c("aaa", "sadd", "sfe", "gdtt", "gret"),
                      JobType = c("Retail", "Retail", "Retail", "Retail", "Retail"),
                      ServiceFlag = c("NO", "NO", "NO", "YES", "YES"),
                      MOTFlag = c("NO", "NO", "NO", "YES", "NO"),
                      stringsAsFactors = FALSE)

dataset %>%
    mutate(XXX = +!is.na(x = URN))
#>   VitalCode InvoiceTotal LabourTotal PartsTotal AuraURNMatch DealerName
#> 1       223        11.57       65.00      31.31            0        AAA
#> 2       221        35.33      130.00     191.11            0        BBB
#> 3       221        33.06      130.00     197.55            0        CCC
#> 4       221        25.04      137.00      73.49            0        DDD
#> 5      2415       276.15      215.58      60.57            0        EEE
#>   Accountnum Division ReferenceNumber    VITAL.URN              URN
#> 1          A       4W              ad         <NA>             <NA>
#> 2          A       4W             das         <NA>             <NA>
#> 3          A       4W              ee         <NA>             <NA>
#> 4          A       4W              ad         <NA>             <NA>
#> 5          B       4W              ad V20619738065 VE21511906256751
#>             RoDate OperationCode WorkDescription JobType ServiceFlag
#> 1 17/12/2012 00:00           MEC             aaa  Retail          NO
#> 2 05/03/2013 00:00           MEC            sadd  Retail          NO
#> 3 15/03/2013 00:00           MEC             sfe  Retail          NO
#> 4 10/06/2014 00:00           MST            gdtt  Retail         YES
#> 5 26/06/2019 00:00 Service|Other            gret  Retail         YES
#>   MOTFlag XXX
#> 1      NO   0
#> 2      NO   0
#> 3      NO   0
#> 4     YES   0
#> 5      NO   1

^{Created on 2019-10-08 by the reprex package (v0.3.0)}

Hope this helps.

Slavek · October 8, 2019, 10:45am

Thank you, sorry for aura.URN error.

You code is not working for me. Is it because my URN variable is "as.factor"?
My (shorter) data file is like that:

data.frame(
                              VitalCode = c(2231L, 2231L, 2231L, 2231L, 2415L),
                               URN = as.factor(c(NA, NA, NA, NA, "VE21511906256751")),
                                 RoDate = as.factor(c("17/12/2012 00:00", "05/03/2013 00:00",
                                                      "15/03/2013 00:00",
                                                      "10/06/2014 00:00",
                                                      "26/06/2019 00:00")),
                                MOTFlag = as.factor(c("NO", "NO", "NO", "YES", "NO"))
                     )

and my result is as below (only 1s):

  VitalCode              URN           RoDate MOTFlag XXX
1      2231                  17/12/2012 00:00      NO           1
2      2231                  05/03/2013 00:00      NO           1
3      2231                  15/03/2013 00:00      NO           1
4      2231                  10/06/2014 00:00     YES           1
5      2415 VE21511906256751 26/06/2019 00:00      NO           1

Do you know what I am doing wrong?

Yarnabrina · October 8, 2019, 11:07am

I don't think being factor or not will affect is.na.

Can you please provide a proper reprex with all library calls using the reprex package so that we can actually what exact commands you're using, what is your exact input, and what is your exact output? The output you're showing is not for the input you're showing.

Slavek · October 8, 2019, 11:22am

I'm really sorry about the confusion. The data is fine and the result table (after applying your code) is now updated...

mara · October 8, 2019, 12:33pm

This is a result of not using reprex. If you could please take the time to familiarize yourself with it, and give it a go, it would really help everyone here to help you with minimal confusion.

For pointers specific to the community site, check out the reprex FAQ.

Slavek · October 8, 2019, 1:48pm

I wish I could but reprex is not working on my machine: https://forum.posit.co/t/re-reprex/22423
That is the only reason why I use the only working option: datapaste

mara · October 8, 2019, 2:14pm

Do you mean datapasta here? If so, datapasta is one way to get your data ready for a reprex, it doesn't run the code and give output like reprex does (which is what makes reprex so useful for troubleshooting).

If you're running into problems with access to your clipboard, you can specify an outfile for the reprex, and then copy and paste the contents into the forum.

reprex::reprex(input = "fruits_stringdist.R", outfile = "fruits_stringdist.md")

You also always have the option of using rstudio cloud. Here's a project that I made (it's accessible, so you'll be able to open it) where I copied @Yarnabrina's code and ran it as a reprex with a specified input and output file:

Below, I have copied and pasted the contents of the output file.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

dataset <- data.frame(VitalCode = c(223L, 221L, 221L, 221L, 2415L),
                      InvoiceTotal = c(11.57, 35.33, 33.06, 25.04, 276.15),
                      LabourTotal = c(65, 130, 130, 137, 215.58),
                      PartsTotal = c(31.31, 191.11, 197.55, 73.49, 60.57),
                      AuraURNMatch = c(0, 0, 0, 0, 0),
                      DealerName = c("AAA", "BBB", "CCC", "DDD", "EEE"),
                      Accountnum = c("A", "A", "A", "A", "B"),
                      Division = c("4W", "4W", "4W", "4W", "4W"),
                      ReferenceNumber = c("ad", "das", "ee", "ad", "ad"),
                      VITAL.URN = c(NA, NA, NA, NA, "V20619738065"),
                      URN = c(NA, NA, NA, NA, "VE21511906256751"),
                      RoDate = c("17/12/2012 00:00", "05/03/2013 00:00", "15/03/2013 00:00", "10/06/2014 00:00", "26/06/2019 00:00"),
                      OperationCode = c("MEC", "MEC", "MEC", "MST", "Service|Other"),
                      WorkDescription = c("aaa", "sadd", "sfe", "gdtt", "gret"),
                      JobType = c("Retail", "Retail", "Retail", "Retail", "Retail"),
                      ServiceFlag = c("NO", "NO", "NO", "YES", "YES"),
                      MOTFlag = c("NO", "NO", "NO", "YES", "NO"),
                      stringsAsFactors = FALSE)

dataset %>%
  mutate(XXX = +!is.na(x = URN))
#>   VitalCode InvoiceTotal LabourTotal PartsTotal AuraURNMatch DealerName
#> 1       223        11.57       65.00      31.31            0        AAA
#> 2       221        35.33      130.00     191.11            0        BBB
#> 3       221        33.06      130.00     197.55            0        CCC
#> 4       221        25.04      137.00      73.49            0        DDD
#> 5      2415       276.15      215.58      60.57            0        EEE
#>   Accountnum Division ReferenceNumber    VITAL.URN              URN
#> 1          A       4W              ad         <NA>             <NA>
#> 2          A       4W             das         <NA>             <NA>
#> 3          A       4W              ee         <NA>             <NA>
#> 4          A       4W              ad         <NA>             <NA>
#> 5          B       4W              ad V20619738065 VE21511906256751
#>             RoDate OperationCode WorkDescription JobType ServiceFlag
#> 1 17/12/2012 00:00           MEC             aaa  Retail          NO
#> 2 05/03/2013 00:00           MEC            sadd  Retail          NO
#> 3 15/03/2013 00:00           MEC             sfe  Retail          NO
#> 4 10/06/2014 00:00           MST            gdtt  Retail         YES
#> 5 26/06/2019 00:00 Service|Other            gret  Retail         YES
#>   MOTFlag XXX
#> 1      NO   0
#> 2      NO   0
#> 3      NO   0
#> 4     YES   0
#> 5      NO   1

^{Created on 2019-10-08 by the reprex package (v0.3.0)}

Is there an open issue in the reprex repo that is stopping you from making a reprex?

Slavek · October 8, 2019, 2:25pm

Slavek:

Thank you, sorry for aura.URN error.

You code is not working for me. Is it because my URN variable is "as.factor"?
My (shorter) data file is like that:

data.frame(
                              VitalCode = c(2231L, 2231L, 2231L, 2231L, 2415L),
                               URN = as.factor(c(NA, NA, NA, NA, "VE21511906256751")),
                                 RoDate = as.factor(c("17/12/2012 00:00", "05/03/2013 00:00",
                                                      "15/03/2013 00:00",
                                                      "10/06/2014 00:00",
                                                      "26/06/2019 00:00")),
                                MOTFlag = as.factor(c("NO", "NO", "NO", "YES", "NO"))
                     )

and my result is as below (only 1s):

  VitalCode              URN           RoDate MOTFlag XXX
1      2231                  17/12/2012 00:00      NO           1
2      2231                  05/03/2013 00:00      NO           1
3      2231                  15/03/2013 00:00      NO           1
4      2231                  10/06/2014 00:00     YES           1
5      2415 VE21511906256751 26/06/2019 00:00      NO           1

Do you know what I am doing wrong?

Thank you but I still have wrong results in my "short" data file...

mara · October 8, 2019, 2:31pm

I can't reproduce your error. I will need a reprex in order to do so.

Slavek · October 8, 2019, 4:09pm

There is no error as such. I simply get 1s whereas I should get:
xxx <- c(0, 0, 0, 0, 1) as only the 5th record of URN is not blank...
Basically, any data frame with URN string column with some blank values would work to do this exercise...

andresrcs · October 8, 2019, 4:59pm

We are not asking you for sample data, we are asking you for a reproducible example, which does not necessarily mean using the reprex package, as we have told you so many times before.

The example Yarnabrina gave your works fine, you would have to gave an example when it doesn't, so we can help you with that specific case.

As an advice, the only thing you are going to get by refusing to adopt good practices here, is for us to lose the will to help you.

system · October 29, 2019, 4:59pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.