validate name with email

shoaibali · September 13, 2020, 8:39pm

I have a data frame like below

df6 <- data.frame(name=c("try,xab","xab,Lan","mhy,mun","vgtu,mmc","dgsy,aaf","kull,nnhu","hula,njam","mund,jiha","htfy,ntha","bhr,gydbt","sgyu,hytb","vdti,kula","mftyu,huta","ibdy,vcge","cday,bhsue","ajtu,nudj"),
                  email=c("xab.try@ybcd.com","Lan.xab@ybcd.com","tth.vgu@ybcd.com","mmc.vgtu@ybcd.com","aaf.dgsy@ybcd.com","nnhu.kull@ybcd.com","njam.hula@ybcd.com","jiha.mund@ybcd.com","ntha.htfy@ybcd.com","gydbt.bhr@ybcd.com","hytb.sgyu@ybcd.com","kula.vdti@ybcd.com","huta.mftyu@ybcd.com","ggat.khul@ybcd.com","bhsue.cday@ybcd.com","nudj.ajtu@ybcd.com"))

I want to check whether the name in the email address is the same as the name in the name column, and mutate a new column flagging the rows where it is not.

output should be like below

Name	email	Match_name
try,xab	xab.try@ybcd.com	0
xab,Lan	Lan.xab@ybcd.com	1
mhy,mun	mun.mhy@ybcd.com	0
vgtu,mmc	mmc.vgtu@ybcd.com	0
dgsy,aaf	aaf.dgsy@ybcd.com	0
kull,nnhu	nnhu.kull@ybcd.com	0
hula,njam	njam.hula@ybcd.com	0
mund,jiha	jiha.mund@ybcd.com	0
htfy,ntha	ntha.htfy@ybcd.com	0
bhr,gydbt	gydbt.bhr@ybcd.com	0
sgyu,hytb	hytb.sgyu@ybcd.com	0
vdti,kula	kula.vdti@ybcd.com	0
mftyu,huta	huta.mftyu@ybcd.com	0
ibdy,vcge	vcge.ibdy@ybcd.com	1
cday,bhsue	bhsue.cday@ybcd.com	0
ajtu,nudj	nudj.ajtu@ybcd.com	0

technocrat · September 14, 2020, 3:24am

This is ugly; the tmp intermediates are a kludge

suppressPackageStartupMessages({library(dplyr)
                                library(stringr)})
df6 <- data.frame(name=c("try,xab","xab,Lan","mhy,mun","vgtu,mmc","dgsy,aaf","kull,nnhu","hula,njam","mund,jiha","htfy,ntha","bhr,gydbt","sgyu,hytb","vdti,kula","mftyu,huta","ibdy,vcge","cday,bhsue","ajtu,nudj"), email=c("xab.try@ybcd.com","Lan.xab@ybcd.com","tth.vgu@ybcd.com","mmc.vgtu@ybcd.com","aaf.dgsy@ybcd.com","nnhu.kull@ybcd.com","njam.hula@ybcd.com","jiha.mund@ybcd.com","ntha.htfy@ybcd.com","gydbt.bhr@ybcd.com","hytb.sgyu@ybcd.com","kula.vdti@ybcd.com","huta.mftyu@ybcd.com","ggat.khul@ybcd.com","bhsue.cday@ybcd.com","nudj.ajtu@ybcd.com"))

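pattern1 <- "(^.*,)(.*$)"  # group 1: text up to and including the comma; group 2: the rest
pattern2 <- "\\2.\\1"      # replacement: group 2, a dot, then group 1
pattern3 <- ".$"           # the last character (the trailing comma)
pattern4 <- "@.*$"         # everything from @ to the end

# tmp1: name rearranged as first.last; tmp2: email without the domain
df6 %>% mutate(tmp1 = str_replace(name,pattern1,pattern2) %>% 
        str_remove(.,pattern3)) %>%
        mutate(tmp2 = str_remove(email,pattern4)) %>%
        mutate(Match_name = ifelse(tmp1 == tmp2,1,0)) %>%
        select(-tmp1,-tmp2)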
#>          name               email Match_name
#> 1     try,xab    xab.try@ybcd.com          1
#> 2     xab,Lan    Lan.xab@ybcd.com          1
#> 3     mhy,mun    tth.vgu@ybcd.com          0
#> 4    vgtu,mmc   mmc.vgtu@ybcd.com          1
#> 5    dgsy,aaf   aaf.dgsy@ybcd.com          1
#> 6   kull,nnhu  nnhu.kull@ybcd.com          1
#> 7   hula,njam  njam.hula@ybcd.com          1
#> 8   mund,jiha  jiha.mund@ybcd.com          1
#> 9   htfy,ntha  ntha.htfy@ybcd.com          1
#> 10  bhr,gydbt  gydbt.bhr@ybcd.com          1
#> 11  sgyu,hytb  hytb.sgyu@ybcd.com          1
#> 12  vdti,kula  kula.vdti@ybcd.com          1
#> 13 mftyu,huta huta.mftyu@ybcd.com          1
#> 14  ibdy,vcge  ggat.khul@ybcd.com          0
#> 15 cday,bhsue bhsue.cday@ybcd.com          1
#> 16  ajtu,nudj  nudj.ajtu@ybcd.com          1

Created on 2020-09-13 by the reprex package (v0.3.0)

shoaibali · September 14, 2020, 5:35am

Please explain these patterns, for my own understanding.

technocrat · September 14, 2020, 6:41am

These are regular expressions, which the stringr library uses to match and modify character strings.

pattern1: match two groups. The first is any sequence of characters up to and including a comma; the second is any following sequence of characters to the end.
pattern2: the replacement, made of the second matched group, followed by ., followed by the first matched group.
pattern3: the last character (to get rid of the comma).
pattern4: everything from @ to the end of the line.

stringr has many shortcuts for pattern matching, as well.
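
To see what each pattern does, here is a minimal sketch that applies them one at a time to a single row. It assumes base R's sub() as a stand-in for str_replace() and str_remove(), so it runs without loading any packages; the variable names step1, step2, local_part and match_flag are made up for illustration.

```r
# The four patterns from the answer above
pattern1 <- "(^.*,)(.*$)"
pattern2 <- "\\2.\\1"
pattern3 <- ".$"
pattern4 <- "@.*$"

name  <- "try,xab"
email <- "xab.try@ybcd.com"

# Swap the two groups around a dot; the comma travels with group 1
step1 <- sub(pattern1, pattern2, name)    # "xab.try,"

# Drop the last character, which is that trailing comma
step2 <- sub(pattern3, "", step1)         # "xab.try"

# Strip everything from @ onward in the email
local_part <- sub(pattern4, "", email)    # "xab.try"

# 1 if the rearranged name equals the part of the email before @
match_flag <- as.integer(step2 == local_part)
match_flag
```

The comparison is case-sensitive, so "Lan" and "lan" would count as different.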

elmstedt · September 15, 2020, 12:00am

Here is a {base} R solution which does precisely what you want:

df6 <- data.frame(name=c("try,xab","xab,Lan","mhy,mun","vgtu,mmc","dgsy,aaf","kull,nnhu","hula,njam","mund,jiha","htfy,ntha","bhr,gydbt","sgyu,hytb","vdti,kula","mftyu,huta","ibdy,vcge","cday,bhsue","ajtu,nudj"), email=c("xab.try@ybcd.com","Lan.xab@ybcd.com","tth.vgu@ybcd.com","mmc.vgtu@ybcd.com","aaf.dgsy@ybcd.com","nnhu.kull@ybcd.com","njam.hula@ybcd.com","jiha.mund@ybcd.com","ntha.htfy@ybcd.com","gydbt.bhr@ybcd.com","hytb.sgyu@ybcd.com","kula.vdti@ybcd.com","huta.mftyu@ybcd.com","ggat.khul@ybcd.com","bhsue.cday@ybcd.com","nudj.ajtu@ybcd.com"))
df6
#>          name               email
#> 1     try,xab    xab.try@ybcd.com
#> 2     xab,Lan    Lan.xab@ybcd.com
#> 3     mhy,mun    tth.vgu@ybcd.com
#> 4    vgtu,mmc   mmc.vgtu@ybcd.com
#> 5    dgsy,aaf   aaf.dgsy@ybcd.com
#> 6   kull,nnhu  nnhu.kull@ybcd.com
#> 7   hula,njam  njam.hula@ybcd.com
#> 8   mund,jiha  jiha.mund@ybcd.com
#> 9   htfy,ntha  ntha.htfy@ybcd.com
#> 10  bhr,gydbt  gydbt.bhr@ybcd.com
#> 11  sgyu,hytb  hytb.sgyu@ybcd.com
#> 12  vdti,kula  kula.vdti@ybcd.com
#> 13 mftyu,huta huta.mftyu@ybcd.com
#> 14  ibdy,vcge  ggat.khul@ybcd.com
#> 15 cday,bhsue bhsue.cday@ybcd.com
#> 16  ajtu,nudj  nudj.ajtu@ybcd.com

df6[["match_name"]] <- vapply(mapply(Vectorize(grepl, "pattern"),
                                     lapply(strsplit(df6[["name"]], ","),
                                            function(name) {
                                              paste(rev(name), collapse = "\\.")
                                            }),
                                     df6[["email"]],
                                     SIMPLIFY = FALSE),
                              all,
                              integer(1))
df6
#>          name               email match_name
#> 1     try,xab    xab.try@ybcd.com          1
#> 2     xab,Lan    Lan.xab@ybcd.com          1
#> 3     mhy,mun    tth.vgu@ybcd.com          0
#> 4    vgtu,mmc   mmc.vgtu@ybcd.com          1
#> 5    dgsy,aaf   aaf.dgsy@ybcd.com          1
#> 6   kull,nnhu  nnhu.kull@ybcd.com          1
#> 7   hula,njam  njam.hula@ybcd.com          1
#> 8   mund,jiha  jiha.mund@ybcd.com          1
#> 9   htfy,ntha  ntha.htfy@ybcd.com          1
#> 10  bhr,gydbt  gydbt.bhr@ybcd.com          1
#> 11  sgyu,hytb  hytb.sgyu@ybcd.com          1
#> 12  vdti,kula  kula.vdti@ybcd.com          1
#> 13 mftyu,huta huta.mftyu@ybcd.com          1
#> 14  ibdy,vcge  ggat.khul@ybcd.com          0
#> 15 cday,bhsue bhsue.cday@ybcd.com          1
#> 16  ajtu,nudj  nudj.ajtu@ybcd.com          1

Slightly different, here is a {base} R solution which has the benefit of matching the names regardless of order (e.g. a,b would match both a.b@c.com and b.a@c.com). Also, I think it is better to encode this as a logical variable rather than an integer.

df6[["match_name"]] <- vapply(mapply(Vectorize(grepl, "pattern"),
                                     strsplit(df6[["name"]], ","),
                                     df6[["email"]],
                                     SIMPLIFY = FALSE),
                              all,
                              logical(1))
df6
#>          name               email match_name
#> 1     try,xab    xab.try@ybcd.com       TRUE
#> 2     xab,Lan    Lan.xab@ybcd.com       TRUE
#> 3     mhy,mun    tth.vgu@ybcd.com      FALSE
#> 4    vgtu,mmc   mmc.vgtu@ybcd.com       TRUE
#> 5    dgsy,aaf   aaf.dgsy@ybcd.com       TRUE
#> 6   kull,nnhu  nnhu.kull@ybcd.com       TRUE
#> 7   hula,njam  njam.hula@ybcd.com       TRUE
#> 8   mund,jiha  jiha.mund@ybcd.com       TRUE
#> 9   htfy,ntha  ntha.htfy@ybcd.com       TRUE
#> 10  bhr,gydbt  gydbt.bhr@ybcd.com       TRUE
#> 11  sgyu,hytb  hytb.sgyu@ybcd.com       TRUE
#> 12  vdti,kula  kula.vdti@ybcd.com       TRUE
#> 13 mftyu,huta huta.mftyu@ybcd.com       TRUE
#> 14  ibdy,vcge  ggat.khul@ybcd.com      FALSE
#> 15 cday,bhsue bhsue.cday@ybcd.com       TRUE
#> 16  ajtu,nudj  nudj.ajtu@ybcd.com       TRUE
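
For a single row, the nested call above can be unrolled roughly as follows. This is only a sketch: it uses fixed = TRUE so that each name part is treated as a literal string rather than a regular expression, which is an assumption and not what the code above does, and the names parts, hits and row_match are illustrative.

```r
name  <- "xab,Lan"
email <- "Lan.xab@ybcd.com"

# Split the name on the comma into its two parts
parts <- strsplit(name, ",")[[1]]          # c("xab", "Lan")

# Check each part against the email; order does not matter
hits <- vapply(parts, grepl, logical(1), x = email, fixed = TRUE)

# The row matches only if every part was found
row_match <- all(hits)
row_match
```

Because grepl() looks for each part anywhere in the email, a part that happens to appear inside the domain or inside a longer word would also count as found.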

Created on 2020-09-14 by the reprex package (v0.3.0)

shoaibali · September 15, 2020, 4:00pm

elmstedt:

df6[["match_name"]] <- vapply(mapply(Vectorize(grepl, "pattern"),
                                     strsplit(df6[["name"]], ","),
                                     df6[["email"]],
                                     SIMPLIFY = FALSE),
                              all,
                              logical(1))

For this, what if I want 1 for FALSE and 0 for TRUE?

AlexisW · September 15, 2020, 4:14pm

One more, using the tidyverse (dplyr, tidyr and stringr):

library(dplyr)
library(tidyr)
library(stringr)

df6 %>%
  separate(name,
           into = c("last_name", "first_name"),
           sep = ",",
           remove = FALSE) %>%
  mutate(first_name = tolower(first_name),
         last_name = tolower(last_name)) %>%
  mutate(match = 1L*str_detect(tolower(email),
                               paste0("^", first_name, "\\.", last_name,
                                      "@\\w+\\.com$"))) %>%
  select(-c(first_name, last_name))

Where separate splits the name into first and last name (stored in new columns).

Then mutate(tolower()) converts them to lowercase, so that "xab" and "Xab" (or even "XAB") are equivalent.

Then the meat of the operation is in this str_detect(), using a regular expression. The regular expression "first.last@anything.com" is built using paste0(). I make sure to compare to tolower(email) so that again everything is lowercase. Finally, str_detect() returns TRUE/FALSE, so I multiply by 1L to get a 0/1 output.

The last step is to remove the temporary columns first_name and last_name using select() (unless you want to keep them).
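
To make the regular expression concrete, here is a small sketch of the pattern that gets built for one row. It assumes base R's grepl() in place of str_detect() so it runs without packages; the variable names pattern and is_match are made up for illustration.

```r
# Name parts for the row "xab,Lan", lowercased as in the pipeline above
first_name <- tolower("Lan")
last_name  <- tolower("xab")

# Build "first.last@anything.com" as an anchored regular expression
pattern <- paste0("^", first_name, "\\.", last_name, "@\\w+\\.com$")
pattern

# Compare against the lowercased email
is_match <- grepl(pattern, tolower("Lan.xab@ybcd.com"))
is_match
```

The escaped dot (\\.) matches a literal ".", and the ^ and $ anchors force the whole email to fit the pattern rather than just a piece of it.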

elmstedt · September 15, 2020, 7:53pm

0 is FALSE and 1 is TRUE.

shoaibali · September 16, 2020, 5:08am

elmstedt:

df6[["match_name"]] <- vapply(mapply(Vectorize(grepl, "pattern"),
                                     lapply(strsplit(df6[["name"]], ","),
                                            function(name) {
                                              paste(rev(name), collapse = "\\.")
                                            }),
                                     df6[["email"]],
                                     SIMPLIFY = FALSE),
                              all,
                              integer(1))

No, I am asking: what if I want 0 for matching names and 1 for non-matching names?

I also tried it this way, but it doesn't work:

SIMPLIFY = TRUE),
all,
integer(0))

technocrat · September 16, 2020, 5:32am

In my example, all that is necessary is to reverse ifelse this way

        mutate(Match_name = ifelse(tmp1 == tmp2,0,1))
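
For the base R versions earlier in the thread, the same flip can be applied to the finished column instead of changing the vapply() call. Note that integer(1) in vapply() is a template for the type and length of each result, not the value to return, so swapping it for integer(0) will not invert anything. A minimal sketch, using short made-up vectors in place of df6[["match_name"]]:

```r
# Logical result, as in the order-independent version
match_lgl <- c(TRUE, FALSE, TRUE)

# Negate, then convert: 0 for a match, 1 for a mismatch
mismatch_from_lgl <- as.integer(!match_lgl)
mismatch_from_lgl

# Integer 1/0 result, as in the first base R version
match_int <- c(1L, 0L, 1L)

# Subtract from 1 to swap the two codes
mismatch_from_int <- 1L - match_int
mismatch_from_int
```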

Formal education in programming, as in other subjects, teaches a hierarchy of solutions. They are ranked by efficiency in execution time, by parsimony of expression, by how few calls they make to non-base functions, or by other standards of beauty. The criteria are a matter of taste.

From the perspective of practice, however, the criterion ought to optimize the weakest link in the chain: the user applying the code to get a result. The most expensive CPU and RAM are cheaper than the cheapest wetware on a per-attempt basis.

The target user may be someone who is writing the code, in which case the test is whether it can be understood again six months on. It may be a found user, whose means are more limited. For that case, the standard should be transparency, even at the price of tedium.

Ideally, the details should be encapsulated in a function, with the fewest possible arguments and the simplest possible return value that does the job.
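
As one possible illustration of that idea, the comparison could be wrapped in a small function. This is a sketch only: name_matches_email is a hypothetical name, it uses base R's sub() in place of the stringr calls, and it assumes names are stored as "last,first" and emails as "first.last@domain", as in the sample data.

```r
# Hypothetical helper: returns 1L where the email's local part equals
# "first.last" for a name stored as "last,first", and 0L otherwise.
name_matches_email <- function(name, email) {
  # Rearrange "last,first" into "first.last"
  expected <- sub("^(.*),(.*)$", "\\2.\\1", name)
  # Keep only the part of the email before the @
  local_part <- sub("@.*$", "", email)
  as.integer(expected == local_part)
}

result <- name_matches_email(
  c("try,xab", "mhy,mun"),
  c("xab.try@ybcd.com", "tth.vgu@ybcd.com")
)
result
```

Two arguments go in and one integer vector comes out, so it can be dropped straight into mutate() or assigned to a new column. The comparison is case-sensitive as written.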

How any code does in meeting the standard is, and should be, open to debate; however, that should be the standard.
