tidyverse: subsetting a highly irregular data using stringr and dplyr and creating logical columns based on character inputs

stabdelraheem · October 23, 2020, 9:40am

Hi all,

I have a two-part question/problem:

I have a fairly large data set that is 100K long and I'm trying to filter out entries based on a general set of character values that I want to keep. Essentially, if the letters J,K,L, or SR/S appear in the sightings$BestPod column, I'd like to keep those entries, and entries that do not contain those letters, I'd like dropped. The problem is that I have over 200 different combinations of J|j|K|k|L|l|SR|SRS|Sr|srs|sRs|SRs... some even contain numbers and punctuation in between i.e. J12|J?K?|Lp?k4|SR?|K-19... and so on, and so it's impossible to come up with a pattern for efficient use with stringr. The str_detect function is somewhat helpful but I can't figure out a robust way to subset my data without accidentally losing entries. Here's an image of my df:

image1389×911 213 KB

and this is a sample of my data:

  SightDateTime       BestLat BestLong BestPod
  <dttm>                <dbl>    <dbl> <chr>  
1 2011-07-15 15:11:00    48.4    -123. JKLp   
2 2006-06-16 19:45:00    48.6    -123. J      
3 2012-07-10 13:00:00    48.7    -123. JL     
4 2007-08-01 19:00:00    48.6    -123. Orcas  
5 2002-07-14 18:30:00    48.8    -123. J      
6 2011-09-25 12:45:00    48.6    -123. JKL

and this is the dput:

structure(list(SightDateTime = structure(c(206832600, 204956700, 
205804800, 205432200, 205349640, 207708600, 203877300, 199213200, 
207710100, 197846580, 206334780, 204772200, 206905800, 206751900, 
207174300, 199213500, 2.07e+08, 202773600, 203826420, 197917200
), tzone = "US/Pacific", class = c("POSIXct", "POSIXt")), BestLat = c(48.155, 
48.6893, 47.6803, 48.5386, 48.4454, 47.5533, 48.6562, 48.7395, 
47.5483, 48.155, 48.6089, 48.3526, 48.5602, 48.711, 48.5602, 
48.7395, 48.8053, 48.7453, 47.9811, 47.7785), BestLong = c(-123.5999, 
-123.0823, -122.5896, -122.9544, -123.0354, -122.4188, -123.243, 
-123.3776, -122.4917, -123.5999, -123.2403, -123.8284, -123.2168, 
-123.2655, -123.2168, -123.3776, -122.8315, -123.0889, -122.5625, 
-122.8105), BestPod = c("Orcas", "K", "Orcas", "Orcas", "KpLp", 
"JK", "Orcas", "Orcas", "JK", "KL", "JK", "Orcas", "KL-19", "J", 
"Orcas", "Orcas", "Orcas", "Orcas", "J", "SRs")), row.names = c(NA, 
-20L), class = c("tbl_df", "tbl", "data.frame"))

2 ) After dropping those non-J,K,L,or SR(+/-s w/case variation) I'd like to create three new columns J K L where there can be 3 possible values: 0 = FALSE (no J|K|L is not present in subsetted BestPod column) , 1 = TRUE (J|K|L is present in the subsetted column), or 2 = SR(+/-s w/case variations) is present in the subsetted BestPod column Again because those entries vary wildly I really only care about the value of "j", "k", "l", and "sr/s".

Sorry for the long and monumentous ask, but I'm not really sure how to go about tidying this data. Apologies if my examples/sample code are not working, please let me know if they need to be amended!
Any and all help would be so greatly appreciated. Thanks for your time!

nirgrahamuk · October 23, 2020, 10:29am

what do you mean by 'contain those letters'
do you mean that they are only exactly those letters, or any combination of letters where one of the j k l is present is a match ?

nirgrahamuk · October 23, 2020, 10:33am

library(tidyverse)

DF <- structure(list(SightDateTime = structure(c(
  206832600, 204956700,
  205804800, 205432200, 205349640, 207708600, 203877300, 199213200,
  207710100, 197846580, 206334780, 204772200, 206905800, 206751900,
  207174300, 199213500, 2.07e+08, 202773600, 203826420, 197917200
), tzone = "US/Pacific", class = c("POSIXct", "POSIXt")), BestLat = c(
  48.155,
  48.6893, 47.6803, 48.5386, 48.4454, 47.5533, 48.6562, 48.7395,
  47.5483, 48.155, 48.6089, 48.3526, 48.5602, 48.711, 48.5602,
  48.7395, 48.8053, 48.7453, 47.9811, 47.7785
), BestLong = c(
  -123.5999,
  -123.0823, -122.5896, -122.9544, -123.0354, -122.4188, -123.243,
  -123.3776, -122.4917, -123.5999, -123.2403, -123.8284, -123.2168,
  -123.2655, -123.2168, -123.3776, -122.8315, -123.0889, -122.5625,
  -122.8105
), BestPod = c(
  "Orcas", "K", "Orcas", "Orcas", "KpLp",
  "JK", "Orcas", "Orcas", "JK", "KL", "JK", "Orcas", "KL-19", "J",
  "Orcas", "Orcas", "Orcas", "Orcas", "J", "SRs"
)), row.names = c(
  NA,
  -20L
), class = c("tbl_df", "tbl", "data.frame"))

(filtered_df <- filter(
  DF,
  str_detect(
    tolower(BestPod),
    "j|k|l|sr?s"
  )
))

(categorised_df <-
  mutate(filtered_df,
    category = factor(case_when(
      str_detect( tolower(BestPod),"sr?s") ~ 2,
      str_detect(tolower(BestPod),"j|k|l") ~ 1,
      TRUE ~ 0
    ))
  ))

mara · October 23, 2020, 12:28pm

To expand a bit here, you might want to take a look at the stringr vignette on Regular Expressions, since, if you want those characters present, but not an exact match, you can use things like character classes to avoid having to create every combination yourself:

Excerpting from the above (though I definitely recommend looking at the whole thing):

You can also create your own character classes using [] :

[abc] : matches a, b, or c.

[a-z] : matches every character between a and z (in Unicode code point order).

[^abc] : matches anything except a, b, or c.

[\^\-] : matches ^ or - .

The stringr and regular-expression cheatsheets might also come in handy (if you click on the images it'll take you to the PDFs):

jmcvw · October 23, 2020, 2:04pm

If you have difficulty finding regex matching patterns, is it easier to match values of non-interest, and filter them out?
- Eg. grepl('[^Oca], BestPod), would remove all your "Orcas" rows)
Since you explicitly remove all column that don't contian J/K/L/SR, there should be no 0 hits left in your filtered data.
- Because of this the function I provide does not return any 0s.
- If you do need the 0s, uncomment the extra line and change TRUE ~ 1, to TRUE ~ 0
- In this case, you may not need to filter you data at all, which will give 0s for the non-matches.

Also note, I used grepl. That's the base equivalent of str_detect, and I just think more quickly using those - you can use str_detect if you prefer

If I understand your question correctly, you want 3 new columns?

   f <- function(column, match) {
    
    case_when(
      grepl('sr', column, ignore.case = T) ~ 2,
      # grepl(match, column, ignore.case = T) ~ 1,
      TRUE ~ 1
    )
  }
  
  d %>% 
    filter(grepl('j|k|l|srs?', BestPod, ignore.case = TRUE)) %>%
    mutate(J = f(BestPod, 'j'),
           K = f(BestPod, 'k'),
           L = f(BestPod, 'l'))
#> # A tibble: 10 x 7
#>    SightDateTime       BestLat BestLong BestPod     J     K     L
#>    <dttm>                <dbl>    <dbl> <chr>   <dbl> <dbl> <dbl>
#>  1 1976-06-29 21:25:00    48.7    -123. K           1     1     1
#>  2 1976-07-04 10:34:00    48.4    -123. KpLp        1     1     1
#>  3 1976-07-31 17:50:00    47.6    -122. JK          1     1     1
#>  4 1976-07-31 18:15:00    47.5    -122. JK          1     1     1
#>  5 1976-04-08 13:23:00    48.2    -124. KL          1     1     1
#>  6 1976-07-15 20:13:00    48.6    -123. JK          1     1     1
#>  7 1976-07-22 10:50:00    48.6    -123. KL-19       1     1     1
#>  8 1976-07-20 16:05:00    48.7    -123. J           1     1     1
#>  9 1976-06-16 19:27:00    48.0    -123. J           1     1     1
#> 10 1976-04-09 09:00:00    47.8    -123. SRs         2     2     2

stabdelraheem · October 23, 2020, 2:55pm

Yes, sorry for not clarifying, I do mean any combination of those letters. Thank you for sharing those resources! I'll check them out!

stabdelraheem · October 23, 2020, 3:49pm

Thank you for the tip! I tried first to exclude based on values of non-interest, however there are a lot more different values of non-interest than interest in this dataset. The subsetted data I shared only produced a few examples, and so I thought it would be easier to subset based on what I was interested in keeping.

Also! Thank you! I did not think to use the case_when function! I really appreciate that! It worked mostly well however, I still wanted to return values of 0 for entries in which 1 or more of 'J/K/L' do not appear. For example if $BestPod == JK I'd like a return of $J == 1 $K == 1 $L == 0 for that row (rows 1:8 in the output tibble you provided are an example of what I mean - it's a bit difficult for me to clearly explain myself in written text).

I wanted to ask about the ? that you used in the filter(grepl('j|k|l|srs?', BestPod, ignore.case = TRUE)) chunk of code. Did you mean to include that as part of the match argument in the grepl() function or is it an inherent part of the syntax?

Sorry for all of the questions, I'm still learning my way around R in general but especially the tidyverse.

Thank you again for your help!!

jmcvw · October 23, 2020, 4:59pm

Hmmm, How strange!

I think I posted the wrong out reprex, because I'm pretty sure I did have the output you described! I'll take quick look now to see if I still have it.

The ? will check for sr followed by a maybe s - it's not really needed here I think, because I think it doesn't matter is the s is there or not. Try the examples in the cheatsheet linked by Mara for some experiments.

Edit

Ah, yes, I think I was geeting my reprexes mixed up after all.
If you follow the instruction about uncommenting the line # grepl(match, column, ignore.case = T) ~ 1, and changing the next line to TRUE ~ 0, the result becomes as below. Is that what you were after?

# A tibble: 10 x 7
   SightDateTime       BestLat BestLong BestPod     J     K     L
   <dttm>                <dbl>    <dbl> <chr>   <dbl> <dbl> <dbl>
 1 1976-06-29 21:25:00    48.7    -123. K           0     1     0
 2 1976-07-04 10:34:00    48.4    -123. KpLp        0     1     1
 3 1976-07-31 17:50:00    47.6    -122. JK          1     1     0
 4 1976-07-31 18:15:00    47.5    -122. JK          1     1     0
 5 1976-04-08 13:23:00    48.2    -124. KL          0     1     1
 6 1976-07-15 20:13:00    48.6    -123. JK          1     1     0
 7 1976-07-22 10:50:00    48.6    -123. KL-19       0     1     1
 8 1976-07-20 16:05:00    48.7    -123. J           1     0     0
 9 1976-06-16 19:27:00    48.0    -123. J           1     0     0
10 1976-04-09 09:00:00    47.8    -123. SRs         2     2     2

stabdelraheem · October 23, 2020, 6:07pm

Yes! Hmm, I guess I'm not clear on the regex match value constraints (I'm working through the examples in the cheat sheets that @Mara shared). But, I've run the code for the function with the line grepl(match, column, ignore.case = T) ~ 1, uncommented and TRUE ~ 0 to retain the "catchall" entries that don't fit the parameters. I've tried the function 3 different ways all of which yield the same output in which I don't retain non-J/K/L/SR entries (my df goes from 101K entries to 83K) and I still don't get 0 for entries in which 1 or more of 'J/K/L' do not appear in a row.

The first way I used a character class and used '[jkl]' for my argument

f_pod <- function(column, match) {
  
  case_when(
    grepl('sr', column, ignore.case = T) ~ 2,
    grepl('[jkl]', column, ignore.case = T) ~ 1,
    TRUE ~ 0
  )
}

I also ran it in which I used an argument similar to the argument used in the filter function: 'j|k|l'

f_pod <- function(column, match) {
  
  case_when(
    grepl('sr', column, ignore.case = T) ~ 2,
    grepl('j|k|l', column, ignore.case = T) ~ 1,
    TRUE ~ 0
  )
}

And for completeness' sake I combined the two and used '[j|k|l]' in my argument

f_pod <- function(column, match) {
  
  case_when(
    grepl('sr', column, ignore.case = T) ~ 2,
    grepl('[j|k|l]', column, ignore.case = T) ~ 1,
    TRUE ~ 0
  )
}

All yielded a result similar to the first tibble that you shared. I know that I'm really close - I can feel it! - but I'm not clear on why it's not working. My suspicion is with the | operator because it's inclusive of all provided values but does not exclude. I tried to bypass that by making three different grepl() arguments in the case_when() function, but it just produced a lot of errors and so I don't think that actually gets me any closer.

I guess, in addition to the cheat sheets, do you or @Mara recommend any other learning resources (or tutorials) for working with weird strings like this for someone who doesn't have a programming background? A lot of the examples and tutorials that I've worked through have really consistent/short datasets so learning the different regex inputs seems really straightforward until I work with my own data. I understand the differences between the arguments, but not enough to articulate how they work differently and therefore why I am not able to produce the desired outcome.

Thank you SO SO MUCH!!!

jmcvw · October 23, 2020, 6:15pm

Two things.

No 1
It looks like you have changed the function I wrote from
grepl('[j|k|l]', column, ignore.case = T) to
grepl(match, column, ignore.case = T)
It is this difference that is causing your problem - was that in my original reprex mistake?
Then also check the way it is used in mutate: J = f(BestPod, 'j')

See the updated reprex

No 2
Maybe if you could post an example of your current output, and your desired output.
Also check th output I posted in my updated reprex. Is it wrong also?

stabdelraheem · October 23, 2020, 6:20pm

Oh no! I really misunderstood and interpreted that I needed to enter the values I wanted matched to match my apologies. Your updated reprex is correct - I just ran it with the desired output. I'm so sorry for not understanding that!

jmcvw · October 23, 2020, 6:25pm

Thank goodness for that. I thought my brain was imploding! Glad it's working, though.

As for learning regex. there are a gazillion site, and blogs. The biggest thing is just practicing. I used to practice with very simple strings, like "the quick brown fox", and that kind of thing.

system · October 30, 2020, 6:25pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.