using dplyr and str_detect to check partial match

jfca283 · September 17, 2020, 6:10am

Hello,
I think I have a problem.
I have two columns with phone numbers, and I need to check whether they are the same.
The phone numbers have between 8 and 10 digits.

The data frame I am using has the variables phone1 and phone2, and I need to check for a partial match of phone2 within phone1.

I tried something like this

str_detect("45625819/5637514682 office","625819")

But when I try using variables, I get

Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)) : 
  Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)

I know how to extract the last 7 digits from variable phone2. But I really don't know how to apply the str_detect(phone1, phone2) using mutate.

phone1 and phone2 include some special characters such as "/" and " - ", and sometimes even text.
I hope you can help me.
Thanks

technocrat · September 17, 2020, 6:42am

library(stringr)
str_detect("45625819/5637514682 office","625819")
[1] TRUE

phone1 and phone2 both have to be character variables in the data frame.

jfca283 · September 23, 2020, 1:03am

Both variables show as chr when I inspect the data frame with glimpse().
This is what I ran:

dataset1 %>% str_detect(phone1,phone2)
Error in type(pattern) : object 'phone2' not found
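
For reference, the pipe passes dataset1 itself as the first argument of str_detect(), so phone1 and phone2 are looked up as ordinary objects rather than as columns. A minimal sketch of the usual pattern, wrapping the call in mutate(), is shown here; the two-row dataset1 and the column name match are made up for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for dataset1 (values invented for illustration)
dataset1 <- tibble(
  phone1 = c("45625819/5637514682 office", "14785236"),
  phone2 = c("625819", "999")
)

# str_detect() is vectorised over both arguments, so inside mutate()
# each phone2 value is used as the pattern for the phone1 value in the same row.
result <- dataset1 %>%
  mutate(match = str_detect(phone1, phone2))

print(result)
```

Note that phone2 is still interpreted as a regular expression here, so special characters in it can change the match or raise an error.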

jfca283 · September 23, 2020, 1:12am

I couldn't run it using dplyr, so I tried this:

 str_detect(dataset1$phone1,dataset1$phone2)

It worked partially.
When phone2 holds the same number as phone1 but after a "/", or holds two phones separated by some text, I get this message:

Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)

I think I will need to add more rules...
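
One possible cause, which is an assumption because the failing rows are not shown, is that some phone2 values contain characters with a special meaning in regular expressions, such as a leading "+" or parentheses. A sketch of literal matching with stringr's fixed() modifier follows; the sample values are invented.

```r
library(stringr)

# Invented sample values; the "+" would be a regex quantifier if used as a plain pattern
phone1 <- c("+56 9 1234 5678 office", "14785236")
phone2 <- c("+56 9", "999")

# fixed() makes str_detect() compare the pattern as literal text,
# so regex metacharacters in phone2 no longer need escaping.
hits <- str_detect(phone1, fixed(phone2))

print(hits)
```

With fixed(), the pattern is still vectorised, so each phone2 value is compared with the phone1 value at the same position.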

andresrcs · September 23, 2020, 1:57am

If you need more specific help, please provide a proper REPRoducible EXample (reprex) illustrating your issue.

technocrat · September 23, 2020, 3:20am

The second argument should be quoted, "3334444"

jfca283 · September 23, 2020, 6:42am

I tried to detect the string, but I failed.
The task is to update a phone only if the updated number is actually new.
The original number is phone1, and the updated number is phone2.
So, if the updated phone is new, that number should be repeated in phone3.
If the two are the same, or phone2 is contained in phone1, phone3 should be NA.

library(dplyr)
library(stringr)
phone1 <- c("123784569","14785236","181232356","56915348789","754132538","186578452155","5421312654","456523221285")
phone2 <- c("97856526","8114785236","22/181232356","62145412","754132538","6578452155","5421312654","7824521285/456523221285")
dataz <- tibble(phone1, phone2)
dataz
# First attempt: when the phones are equal both branches give the same value, so phone3 is just phone2
dataz <- dataz %>% mutate(phone3 = ifelse(phone1 == phone2, phone1, phone2))
# Second attempt: each phone2 match is replaced by itself, so phone3 is left unchanged
dataz <- dataz %>% mutate(phone3 = str_replace_all(phone3, phone2, phone2)) %>% print()

# Expected result, entered by hand
dataz <- dataz %>% mutate(phone3desired = c("97856526",NA,NA,"62145412",NA,NA,NA,"7824521285")) %>% print()

The task doesn't seem easy to me right now. phone3desired is the result I intend to get.
Also, sometimes phone2 is longer than phone1, or vice versa, because of the area code. So either phone1 contains phone2 or phone2 contains phone1. Sometimes I receive two numbers in phone2 where only one of them was already declared, and that one should be removed. The variable phone3desired shows what I mean.

Thanks for your time and interest.

jfca283 · September 23, 2020, 6:58am

I tried a different approach...

str_view(dataz$phone3,regex(dataz$phone1))
str_view(dataz$phone3,regex(dataz$phone2))
gg=str_replace(dataz$phone3,regex(dataz$phone1),"")

dataz=dataz%>%mutate(phone3_alt=gg)
dataz %>% mutate(phone3_alt=ifelse(nchar(phone3_alt)<5,"",phone3_alt))

So, I'm left with the area codes "81" and "22". I don't know how to replace a value with "" when it has fewer than 6 digits. I think that's a fast way to remove those stray numbers.
I have no idea how to replace "" with NA.
Also, I have no clue how to remove a "/" when it is the final character.
The good thing is that I am near my goal.
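
A minimal sketch of those three clean-up steps with stringr, assuming leftovers like the ones described above; the vector x is illustrative and the 6-digit cutoff is the one mentioned in this post.

```r
library(stringr)

# Illustrative leftovers after removing phone1 from phone3
x <- c("81", "22/", "97856526", "", "7824521285/")

# Remove a "/" only when it is the final character
x <- str_remove(x, "/$")

# Short leftovers (area codes) and empty strings both become NA
x[str_length(x) < 6] <- NA

print(x)
```

Because the empty string has length 0, the same length test covers both the area codes and the "" values, so no separate step is needed to turn "" into NA.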

technocrat · September 23, 2020, 7:22am

f(x) = y, where x is phone1 and y is phone2.

f is a function that determines whether $y = x$ or $y \in x$. Since $y = x \rightarrow y \in x$, f only needs to test for $y \in x$, and phone1 == phone2 is unnecessary by the terms of the problem statement. If $y \in x$, phone3 = NA, else phone3 = phone2.

Now, however, the most recent post suggests the problem statement is deficient, because an unbound variable, the area code, is introduced: the 81 and 22 digits in rows 2 and 3 of dataz as manipulated. That equates to a test of $y = x$, which amends the statement to $y = x \wedge y \in x$.

In terms of dataz

dataz=dataz %>% mutate(phone3=ifelse(phone1==phone2,phone1,phone2))

should have ifelse modified to add to the equality test a string match test. That will avoid having to deal with base phone number length and empty characters.
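
A rough sketch of one way to combine those tests, offered under stated assumptions rather than as a definitive solution: phone2 is split into runs of digits, runs shorter than 6 digits are treated as area-code fragments, and a run counts as already known if it contains phone1 or is contained in it. The helper name new_numbers and the min_digits cutoff are hypothetical.

```r
library(stringr)

phone1 <- c("123784569","14785236","181232356","56915348789",
            "754132538","186578452155","5421312654","456523221285")
phone2 <- c("97856526","8114785236","22/181232356","62145412",
            "754132538","6578452155","5421312654","7824521285/456523221285")

# Hypothetical helper: return the parts of `new` that are not already in `old`
new_numbers <- function(old, new, min_digits = 6) {
  parts <- str_extract_all(new, "[0-9]+")[[1]]   # split on anything that is not a digit
  parts <- parts[nchar(parts) >= min_digits]      # assumption: short runs are area codes
  if (length(parts) == 0) return(NA_character_)
  # A part is known if old contains it or it contains old (literal matching)
  known <- str_detect(old, fixed(parts)) | str_detect(parts, fixed(old))
  parts <- parts[!known]
  if (length(parts) == 0) NA_character_ else paste(parts, collapse = "/")
}

# Apply row by row; unname() drops the names that mapply() takes from phone1
phone3 <- unname(mapply(new_numbers, phone1, phone2))
print(phone3)
```

Inside a pipeline the same call could go in mutate(phone3 = unname(mapply(new_numbers, phone1, phone2))). The sketch assumes phone1 holds a single number; rows where phone1 itself contains text or several numbers would need the same splitting applied to it.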

jfca283 · September 23, 2020, 7:29am

I know what you mean. It isn't easy to deal with a variable that can contain two phone numbers, where in some cases only part of the string should be deleted.
I was trying to do

gg=str_replace(dataz$phone3,regex(dataz$phone1),"")

But backwards, I mean, checking whether phone1 contains phone3_alt. In that case, replace it with NA.
It's a long way to the goal. On the good side, checking both ways removes the area code issue.
I would only need to delete the value if 2 or 3 digits remain in the variable phone3_alt.
