Mutate variable with condition (scientific notation or padding)

# A tibble: 5 x 3
  x                        wanted        length_of_x
  <chr>                    <chr>               <int>
1 0                        0000000000000           1
2 001570098                0000001570098           9
3 9770025215000            9770025215000          13
4 97725320530072           9.77253205E13          14
5 303401110309080105010401 03.0340111E23          24

Given x, how can I mutate wanted with the following conditions:

  • When the length of x is greater than 13, wanted gets expressed in scientific notation. See rows 4 and 5.
  • When the length of x is less than 13, wanted gets padded with 0s. See rows 1 to 3.
  • Then length of wanted is 13.
library(tidyverse)
# Toy data
df <- tibble(x = c("0", "001570098", "9770025215000", "97725320530072", 
                   "303401110309080105010401"), 
             wanted = c("0000000000000", "0000001570098", 
                        "9770025215000", "9.77253205E13", "03.0340111E23"), 
             length_of_x = c(1L,9L, 13L, 14L, 24L))

Why does the last entry in the wanted column start with 0?

Thanks for the reply. Notice that when the length of x is less than 13, 0s are being added to achieve the length of 13. See rows 1 and 2. wanted must have a length of 13.

But the last entry is "303401110309080105010401" and longer than 13. Shouldn't the wanted value be 3.03401110E23?

Assuming the last wanted entry does not need a leading zero:

library(tibble)
#> Warning: package 'tibble' was built under R version 4.1.2
df <- tibble(x = c("0", "001570098", "9770025215000", "97725320530072", 
                   "303401110309080105010401"), 
             wanted = c("0000000000000", "0000001570098", 
                        "9770025215000", "9.77253205E13", "03.0340111E23"), 
             length_of_x = c(1L,9L, 13L, 14L, 24L))


df$New <- ifelse(df$length_of_x<=13,
                 formatC(as.numeric(df$x),format="f",width=13,flag="0",digits=0),
                 paste0(substr(df$x,1,1),".",substr(df$x,2,9),"E",nchar(df$x)-1))
df
#> # A tibble: 5 x 4
#>   x                        wanted        length_of_x New          
#>   <chr>                    <chr>               <int> <chr>        
#> 1 0                        0000000000000           1 0000000000000
#> 2 001570098                0000001570098           9 0000001570098
#> 3 9770025215000            9770025215000          13 9770025215000
#> 4 97725320530072           9.77253205E13          14 9.77253205E13
#> 5 303401110309080105010401 03.0340111E23          24 3.03401110E23

Created on 2022-02-28 by the reprex package (v2.0.1)

1 Like

Many thanks for your code and also for pointing out the issue!

This problematic wanted variable is SAS-generated output. I have no clue what the SAS is doing. As you pointed out, the leading zero is a bit odd! Here are more examples; see rows 1 and 4, where leading zeroes appear. Please let me know if you can crack it!

   x                        wanted        length_of_x
   <chr>                    <chr>               <int>
 1 303401110309080105010401 03.0340111E23          24
 2 3034012131080201030102   3.03401213E21          22
 3 30340111410503020302     3.03401114E19          20
 4 30340111011005010102     03.0340111E19          20
 5 30340111410703020302     3.03401114E19          20

I see six examples of x values for which the wanted value is in scientific notation, line 4 in your original data set and the five more recent samples (line five of the earlier set is duplicated in line one of the more recent set.). The two cases that end up with a leading zero start out with a zero as the ninth digit of x. The four samples without leading zeros have non zero values at the ninth digit of x. If that pattern holds with a bigger sample size, we could treat the ninth digit as a flag for putting a leading zero on wanted.

1 Like

Thank you again for your reply. I think the 9th digit indeed is the culprit. Please have a look at the new data.

# New Toy Data
library(tidyverse)
df_new <- tibble(
  x = c("303401110309080105010401", "30340111011005010102", 
        "47506440000000", "55000000000637", "00160070001398", "47506450000000", 
        "47506180000000", "97715929050042", "27902690022652", "970503470", 
        "0000024100298", "0", "001570098", "9770025215000", "97725320530072"),
  wanted = c("03.0340111E23", "03.0340111E19", "004.750644E13", 
             "00000005.5E13", "0160070001398", "004.750645E13", "004.750618E13", 
             "9.77159291E13", "002.790269E13", "0000970503470", "0000024100298", 
             "0000000000000", "0000001570098", "9770025215000", "9.77253205E13"),
  length_of_x = c(24L, 20L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 
                  9L, 13L, 1L, 9L, 13L, 14L),
  pos9 = c("0", "0", "0", "0", "0", 
           "0", "0", "0", "0", "0", "0", "", "8", "1", "5")
  
)
df_new
#> # A tibble: 15 x 4
#>    x                        wanted        length_of_x pos9 
#>    <chr>                    <chr>               <int> <chr>
#>  1 303401110309080105010401 03.0340111E23          24 "0"  
#>  2 30340111011005010102     03.0340111E19          20 "0"  
#>  3 47506440000000           004.750644E13          14 "0"  
#>  4 55000000000637           00000005.5E13          14 "0"  
#>  5 00160070001398           0160070001398          14 "0"  
#>  6 47506450000000           004.750645E13          14 "0"  
#>  7 47506180000000           004.750618E13          14 "0"  
#>  8 97715929050042           9.77159291E13          14 "0"  
#>  9 27902690022652           002.790269E13          14 "0"  
#> 10 970503470                0000970503470           9 "0"  
#> 11 0000024100298            0000024100298          13 "0"  
#> 12 0                        0000000000000           1 ""   
#> 13 001570098                0000001570098           9 "8"  
#> 14 9770025215000            9770025215000          13 "1"  
#> 15 97725320530072           9.77253205E13          14 "5"

That's progress in understanding the wanted column but I'm not yet sure how to consistently derive the values. I am surprised that in line 5 the x value is truncated on the left and in line 8 the last displayed digit is rounded up. I will think about this more tomorrow but the rules seem pretty complicated.

1 Like

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.