If/Else logic for fixing ZipCodes


#1

Can't seem to get this 100% correct.

Let's say I have a column of zipcodes. They vary from length but the maximum they would be is 9.

ZipCodes
19425
143489
190387462
345
19429

etc etc

If the zip is length 5, it's good. If it's 9, it's good. However if it's 8, I need to pad a 0 to the left. If it's length 7, I need to pad 2, etc for each result to add to 9. I ignore 9's and 5's simply because those are already complete zips, the length of 5 population just does not have the trailing 4. I've tried a few variations and the zips can be treated as either numeric or character, it does not matter.

for(row in CPD$ZipP) {
   #print(row) }
   if (nchar(row) == 8) {
  row <- paste0("0", row)
  } else if (nchar(row) == 7) {
    row <- paste0("00", row)
  }  else if (nchar(row) == 6) {  
    row <- paste0("000", row)  
  }  else if (nchar(row) == 4) {
    row  <- paste0("00000", row)
  }  else 
    print("zero")
}

When I run this, all it does is print zero, so clearly there is an issue with my logic. Basically I'm trying to loop the dataset, and update each row where the length needs to be changed.


#2

Hi! Welcome to RStudio Community!

It looks like your code was not formatted correctly to make it easy to read for people trying to help you. Formatting code allows for people to more easily identify where issues may be occuring, and makes it easier to read, in general. I have edited you post to format the code properly.

In the future please put code that is inline (such as a function name, like mutate or filter) inside of backticks (`mutate`) and chunks of code (including error messages and code copied from the console) can be put between sets of three backticks:

```
example <- foo %>%
  filter(a == 1)
```

This process can be done automatically by highlighting your code, either inline or in a chunk, and clicking the </> button on the toolbar of the reply window!

This will help keep our community tidy and help you get the help you are looking for!

For more information, please take a look at the community's FAQ on formating code


In addition, I do not think the issue is with your if/else logic, but rather the for loop itself.

Try this:

for(i in seq_along(CPD$ZipP)) {
  row <- CPD$ZipP[i]
   #print(row) }
   if (nchar(row) == 8) {
  row <- paste0("0", row)
  } else if (nchar(row) == 7) {
    row <- paste0("00", row)
  }  else if (nchar(row) == 6) {  
    row <- paste0("000", row)  
  }  else if (nchar(row) == 4) {
    row  <- paste0("00000", row)
  }  else 
    print("zero")

  CPD$ZipP[i] <- row
}

#3

Here is a solution that should scale well to a large numbers of rows (as we are doing a for-loop over the type of error and not over the rows).



CDP <- data.frame(ZipCodes = c(
  "19425",
  "143489",
  "190387462",
  "345",
  "19429"),
  stringsAsFactors = FALSE)

print(CDP)

#    ZipCodes
# 1     19425
# 2    143489
# 3 190387462
# 4       345
# 5     19429


fix_zips <- function(col) {
  col <- trimws(as.character(col), which = "both")

  for(len in 0:4) {
    needs_fix <- nchar(col)==len
    pad <- paste(rep("0", 5-len), collapse = "")
    col[needs_fix] <- paste0(pad, col[needs_fix])
  }
  
  for(len in 6:8) {
    needs_fix <- nchar(col)==len
    pad <- paste(rep("0", 9-len), collapse = "")
    col[needs_fix] <- paste0(pad, col[needs_fix])
  }
  
  substr(col, 1, 9)
}

CDP$ZipCodes <- fix_zips(CDP$ZipCodes)

print(CDP)

#    ZipCodes
# 1     19425
# 2 000143489
# 3 190387462
# 4     00345
# 5     19429

#4

Though I am wondering if for US Zip+4 notation maybe you should pad on the left until you get to 5 digits (as I think zips can start with zero) and then the rest on the right (the +4 bit) which happens to be very easy to express.

EDIT: I was wrong. If the only thing you are worried about is that integer conversion may have stripped off leading zeros then you should only pad on the left (as you said). Sorry!


#5

Hmm John you make a good point.

The trick is that zips with the full 9 are ok, since I can just substring characters 1-5. Zips that ONLY have 5 are fine since I don't need to do anything.

However, if the zip is 3 or 4 digits, it would then be something like 000001234. When I substring off the first 5 chars, I get 00000 as opposed to 01234. Right now using your first code snippet, it updated instantly(650k rows mind you) however I have 182 entries that now pushed the real zip to the back of the string.

This is why I was looking at the if else logic since I can say if len == 4, pad 1 zero, etc. However the for loop was quite slow.

Using the second piece of code seems to correct that problem however, if my zip records are missing the preceding 0 and over the length of 4 it stays off. Using your second snippet, could I do something like:

fix_zips <- function(col) {
  col <- trimws(as.character(col), which = "both")
  for(len in 0:4) {
    needs_fix <- nchar(col)==len
    pad <- paste(rep("0", 5-len), collapse = "")
    col[needs_fix] <- paste0(pad, col[needs_fix])
  }
  for(len in 0:6) {
   needs_fix <- nchar(col)==len
   pad <-paste(rep("000", 9-len), collapse = "") 
  col <- paste0(col, paste(rep("0", 9), collapse = ''))
  substr(col, 1, 9)
}

I think I would need to do this for 7,8 as well. The goal would be to "end up" with either 5 OR 9 in length.


#7

You were right and I was wrong: you always want to pad on the left (as that is what integer conversion may have stripped off). The difference is some codes you want to be 5 digit and others 9. I have updated the earlier code to solve exactly that. Sorry about the confusion.


#8

John no need to apologize. This code is perfect.

If I'm understanding this, your looping and looking at the length, when it falls between 0:4, you're padding 0's till it hits length of 5? Second piece I'm assuming pads till it hits the length of 9? No point in copying and pasting something if I can't understand it as well!

Which part actually dictates whether or not it pads to the left or right? That's the one piece alluding me..


#9

In my new code I am only padding on the left (I decided right-padding was not the right idea). The side control is if the pad is left or right of the col[needs_fix] argument. Definitely work through my remaining solution- I think it is now close to what you want.

The key is to never start the padding on length-5 items (hence 5 is absent from both for-loops; notice they are 0:5 and 6:8- not overlapping with each other and neither touching 5).


#10

Another alternative could be:

CDP <- data.frame(ZipCodes = c(
  "19425",
  "143489",
  "190387462",
  "345",
  "19429"),
  stringsAsFactors = FALSE)



for (i in 1:nrow(CDP)){
    ln <- nchar(CDP$ZipCodes[i])
    if(ln %in% c(5,9)) next
    if(ln > 5){
        CDP$ZipCodes[i] <- paste(paste(rep(0,9-ln), sep='', collapse = ''),
                                 CDP$ZipCodes[i], sep ='')
    } else if (ln < 5){
        CDP$ZipCodes[i] <- paste(paste(rep(0,5-ln), sep='', collapse = ''),
                                 CDP$ZipCodes[i], sep ='')
    }
}


CDP

#11

sprintf is a powerful tool for formatting strings. It's a wrapper for the C function of the same name, which takes a formatting string to define how to format the input. Those formatting strings can get somewhat complicated, but start with ?sprintf; if you're still confused, the Wikipedia page is pretty decent:

If you combine sprintf with ifelse to set the format string according to the number of characters, you can do all the formatting in one line:

df <- data.frame(ZipCodes = c(19425L, 143489L, 190387462L, 345L, 19429L))

sprintf(ifelse(nchar(df$ZipCodes) > 5, '%09d', '%05d'), df$ZipCodes)
#> [1] "19425"     "000143489" "190387462" "00345"     "19429"

#12

The stringi package has a stri_pad_left function you can use with an ifelse, similar to how @alistaire does:

library(stringi)

stringi::stri_pad_left(
  df[["ZipCodes"]],
  width = ifelse(nchar(df[["ZipCodes"]]) > 5, 9, 5),
  pad = "0"
)