validate name with email

I have a data frame like below

df6 <- data.frame(name=c("try,xab","xab,Lan","mhy,mun","vgtu,mmc","dgsy,aaf","kull,nnhu","hula,njam","mund,jiha","htfy,ntha","bhr,gydbt","sgyu,hytb","vdti,kula","mftyu,huta","ibdy,vcge","cday,bhsue","ajtu,nudj"),
                  email=c("xab.try@ybcd.com","Lan.xab@ybcd.com","tth.vgu@ybcd.com","mmc.vgtu@ybcd.com","aaf.dgsy@ybcd.com","nnhu.kull@ybcd.com","njam.hula@ybcd.com","jiha.mund@ybcd.com","ntha.htfy@ybcd.com","gydbt.bhr@ybcd.com","hytb.sgyu@ybcd.com","kula.vdti@ybcd.com","huta.mftyu@ybcd.com","ggat.khul@ybcd.com","bhsue.cday@ybcd.com","nudj.ajtu@ybcd.com"))

I want to check whether the name in the email matches the name in the name column, and mutate a new column that records the result.

output should be like below

Name email Match_name
try,xab xab.try@ybcd.com 0
xab,Lan Lan.xab@ybcd.com 1
mhy,mun mun.mhy@ybcd.com 0
vgtu,mmc mmc.vgtu@ybcd.com 0
dgsy,aaf aaf.dgsy@ybcd.com 0
kull,nnhu nnhu.kull@ybcd.com 0
hula,njam njam.hula@ybcd.com 0
mund,jiha jiha.mund@ybcd.com 0
htfy,ntha ntha.htfy@ybcd.com 0
bhr,gydbt gydbt.bhr@ybcd.com 0
sgyu,hytb hytb.sgyu@ybcd.com 0
vdti,kula kula.vdti@ybcd.com 0
mftyu,huta huta.mftyu@ybcd.com 0
ibdy,vcge vcge.ibdy@ybcd.com 1
cday,bhsue bhsue.cday@ybcd.com 0
ajtu,nudj nudj.ajtu@ybcd.com 0

This is ugly; the tmp intermediates are a kludge

suppressPackageStartupMessages({library(dplyr)
                                library(stringr)})
df6 <- data.frame(name=c("try,xab","xab,Lan","mhy,mun","vgtu,mmc","dgsy,aaf","kull,nnhu","hula,njam","mund,jiha","htfy,ntha","bhr,gydbt","sgyu,hytb","vdti,kula","mftyu,huta","ibdy,vcge","cday,bhsue","ajtu,nudj"), email=c("xab.try@ybcd.com","Lan.xab@ybcd.com","tth.vgu@ybcd.com","mmc.vgtu@ybcd.com","aaf.dgsy@ybcd.com","nnhu.kull@ybcd.com","njam.hula@ybcd.com","jiha.mund@ybcd.com","ntha.htfy@ybcd.com","gydbt.bhr@ybcd.com","hytb.sgyu@ybcd.com","kula.vdti@ybcd.com","huta.mftyu@ybcd.com","ggat.khul@ybcd.com","bhsue.cday@ybcd.com","nudj.ajtu@ybcd.com"))

pattern1 <- "(^.*,)(.*$)"
pattern2 <- "\\2.\\1"
pattern3 <- ".$"
pattern4 <- "@.*$"


df6 %>% mutate(tmp1 = str_replace(name,pattern1,pattern2) %>% 
        str_remove(.,pattern3)) %>%
        mutate(tmp2 = str_remove(email,pattern4)) %>%
        mutate(Match_name = ifelse(tmp1 == tmp2,1,0)) %>%
        select(-tmp1,-tmp2)
#>          name               email Match_name
#> 1     try,xab    xab.try@ybcd.com          1
#> 2     xab,Lan    Lan.xab@ybcd.com          1
#> 3     mhy,mun    tth.vgu@ybcd.com          0
#> 4    vgtu,mmc   mmc.vgtu@ybcd.com          1
#> 5    dgsy,aaf   aaf.dgsy@ybcd.com          1
#> 6   kull,nnhu  nnhu.kull@ybcd.com          1
#> 7   hula,njam  njam.hula@ybcd.com          1
#> 8   mund,jiha  jiha.mund@ybcd.com          1
#> 9   htfy,ntha  ntha.htfy@ybcd.com          1
#> 10  bhr,gydbt  gydbt.bhr@ybcd.com          1
#> 11  sgyu,hytb  hytb.sgyu@ybcd.com          1
#> 12  vdti,kula  kula.vdti@ybcd.com          1
#> 13 mftyu,huta huta.mftyu@ybcd.com          1
#> 14  ibdy,vcge  ggat.khul@ybcd.com          0
#> 15 cday,bhsue bhsue.cday@ybcd.com          1
#> 16  ajtu,nudj  nudj.ajtu@ybcd.com          1

Created on 2020-09-13 by the reprex package (v0.3.0)

Please explain these patterns, for my own knowledge.

These are regular expressions, which are used by the stringr library to parse character strings.

  1. Match two groups: the first any sequence of characters up to and including a comma; the second, any following sequence of characters to the end.
  2. The second matched group followed by . followed by the first matched group
  3. The last character (to get rid of the comma)
  4. Everything from @ to end-of-line
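
To see what each of the four patterns does, here is a minimal sketch that applies them one at a time to the first row of the data. The variable names (swapped, local_from_name, local_from_email) are made up for illustration and are not part of the solution above.

```r
library(stringr)

name  <- "try,xab"
email <- "xab.try@ybcd.com"

# Patterns 1 and 2: capture "try," and "xab", then write group 2,
# a literal dot, and group 1. The comma travels along with group 1.
swapped <- str_replace(name, "(^.*,)(.*$)", "\\2.\\1")
swapped

# Pattern 3: drop the final character, here the trailing comma
local_from_name <- str_remove(swapped, ".$")
local_from_name

# Pattern 4: drop everything from "@" to the end of the string
local_from_email <- str_remove(email, "@.*$")
local_from_email

# The comparison that Match_name is built from
local_from_name == local_from_email
```

The intermediate value swapped still ends in a comma, which is why pattern 3 is needed before the comparison.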

stringr has many shortcuts for pattern matching, as well.


Here is a {base} R solution which does precisely what you want:

df6 <- data.frame(name=c("try,xab","xab,Lan","mhy,mun","vgtu,mmc","dgsy,aaf","kull,nnhu","hula,njam","mund,jiha","htfy,ntha","bhr,gydbt","sgyu,hytb","vdti,kula","mftyu,huta","ibdy,vcge","cday,bhsue","ajtu,nudj"), email=c("xab.try@ybcd.com","Lan.xab@ybcd.com","tth.vgu@ybcd.com","mmc.vgtu@ybcd.com","aaf.dgsy@ybcd.com","nnhu.kull@ybcd.com","njam.hula@ybcd.com","jiha.mund@ybcd.com","ntha.htfy@ybcd.com","gydbt.bhr@ybcd.com","hytb.sgyu@ybcd.com","kula.vdti@ybcd.com","huta.mftyu@ybcd.com","ggat.khul@ybcd.com","bhsue.cday@ybcd.com","nudj.ajtu@ybcd.com"))
df6
#>          name               email
#> 1     try,xab    xab.try@ybcd.com
#> 2     xab,Lan    Lan.xab@ybcd.com
#> 3     mhy,mun    tth.vgu@ybcd.com
#> 4    vgtu,mmc   mmc.vgtu@ybcd.com
#> 5    dgsy,aaf   aaf.dgsy@ybcd.com
#> 6   kull,nnhu  nnhu.kull@ybcd.com
#> 7   hula,njam  njam.hula@ybcd.com
#> 8   mund,jiha  jiha.mund@ybcd.com
#> 9   htfy,ntha  ntha.htfy@ybcd.com
#> 10  bhr,gydbt  gydbt.bhr@ybcd.com
#> 11  sgyu,hytb  hytb.sgyu@ybcd.com
#> 12  vdti,kula  kula.vdti@ybcd.com
#> 13 mftyu,huta huta.mftyu@ybcd.com
#> 14  ibdy,vcge  ggat.khul@ybcd.com
#> 15 cday,bhsue bhsue.cday@ybcd.com
#> 16  ajtu,nudj  nudj.ajtu@ybcd.com

df6[["match_name"]] <- vapply(mapply(Vectorize(grepl, "pattern"),
                                     lapply(strsplit(df6[["name"]], ","),
                                            function(name) {
                                              paste(rev(name), collapse = "\\.")
                                            }),
                                     df6[["email"]],
                                     SIMPLIFY = FALSE),
                              all,
                              integer(1))
df6
#>          name               email match_name
#> 1     try,xab    xab.try@ybcd.com          1
#> 2     xab,Lan    Lan.xab@ybcd.com          1
#> 3     mhy,mun    tth.vgu@ybcd.com          0
#> 4    vgtu,mmc   mmc.vgtu@ybcd.com          1
#> 5    dgsy,aaf   aaf.dgsy@ybcd.com          1
#> 6   kull,nnhu  nnhu.kull@ybcd.com          1
#> 7   hula,njam  njam.hula@ybcd.com          1
#> 8   mund,jiha  jiha.mund@ybcd.com          1
#> 9   htfy,ntha  ntha.htfy@ybcd.com          1
#> 10  bhr,gydbt  gydbt.bhr@ybcd.com          1
#> 11  sgyu,hytb  hytb.sgyu@ybcd.com          1
#> 12  vdti,kula  kula.vdti@ybcd.com          1
#> 13 mftyu,huta huta.mftyu@ybcd.com          1
#> 14  ibdy,vcge  ggat.khul@ybcd.com          0
#> 15 cday,bhsue bhsue.cday@ybcd.com          1
#> 16  ajtu,nudj  nudj.ajtu@ybcd.com          1

Slightly different, here is a {base} R solution which has the benefit of matching the names regardless of order (e.g. a,b would match a.b@c.com and b.a@c.com). Also, I think it is better to encode this as a logical variable rather than an integer.

df6[["match_name"]] <- vapply(mapply(Vectorize(grepl, "pattern"),
                                     strsplit(df6[["name"]], ","),
                                     df6[["email"]],
                                     SIMPLIFY = FALSE),
                              all,
                              logical(1))
df6
#>          name               email match_name
#> 1     try,xab    xab.try@ybcd.com       TRUE
#> 2     xab,Lan    Lan.xab@ybcd.com       TRUE
#> 3     mhy,mun    tth.vgu@ybcd.com      FALSE
#> 4    vgtu,mmc   mmc.vgtu@ybcd.com       TRUE
#> 5    dgsy,aaf   aaf.dgsy@ybcd.com       TRUE
#> 6   kull,nnhu  nnhu.kull@ybcd.com       TRUE
#> 7   hula,njam  njam.hula@ybcd.com       TRUE
#> 8   mund,jiha  jiha.mund@ybcd.com       TRUE
#> 9   htfy,ntha  ntha.htfy@ybcd.com       TRUE
#> 10  bhr,gydbt  gydbt.bhr@ybcd.com       TRUE
#> 11  sgyu,hytb  hytb.sgyu@ybcd.com       TRUE
#> 12  vdti,kula  kula.vdti@ybcd.com       TRUE
#> 13 mftyu,huta huta.mftyu@ybcd.com       TRUE
#> 14  ibdy,vcge  ggat.khul@ybcd.com      FALSE
#> 15 cday,bhsue bhsue.cday@ybcd.com       TRUE
#> 16  ajtu,nudj  nudj.ajtu@ybcd.com       TRUE

Created on 2020-09-14 by the reprex package (v0.3.0)
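
A small sketch of what order-independent matching means in practice, assuming the same idea as above: every comma-separated part of the name must occur somewhere in the email. The helper parts_in_email is hypothetical and treats the parts as fixed strings rather than regular expressions.

```r
# Hypothetical helper: TRUE when every comma-separated part of `name`
# occurs somewhere in `email`, in any order
parts_in_email <- function(name, email) {
  parts <- strsplit(name, ",", fixed = TRUE)[[1]]
  all(vapply(parts, grepl, logical(1), x = email, fixed = TRUE))
}

parts_in_email("try,xab", "xab.try@ybcd.com")   # parts in reversed order
parts_in_email("try,xab", "try.xab@ybcd.com")   # parts in the same order
parts_in_email("mhy,mun", "tth.vgu@ybcd.com")   # neither part present

# Caveat: this is substring matching, so very short parts match easily
parts_in_email("a,b", "bob.adams@ybcd.com")
```

The last call returns TRUE even though the address clearly belongs to someone else, so this looser test is probably best kept for data where the name parts are reasonably long.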

For this, what if I want 1 for FALSE and 0 for TRUE?

One more, using only tidyverse packages (dplyr, tidyr and stringr):

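library(dplyr)
library(tidyr)
library(stringr)

df6 %>%
  # split "last,first" into two temporary columns, keeping the original name
  separate(name,
           into = c("last_name", "first_name"),
           sep = ",",
           remove = FALSE) %>%
  mutate(first_name = tolower(first_name),
         last_name = tolower(last_name)) %>%
  # lowercase the email as well, so that "Lan.xab" can match "lan\\.xab"
  mutate(match = 1L*str_detect(tolower(email),
                               paste0("^", first_name, "\\.", last_name,
                                        "@\\w+\\.com$"))) %>%
  select(-c(first_name, last_name))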

Where separate splits the name into first and last name (stored in new columns).

Then mutate(tolower()) converts them to lowercase, so that "xab" and "Xab" (or even "XAB") are equivalent.

Then the meat of the operation is in this str_detect(), using a regular expression. The regular expression "first.last@anything.com" is built using paste0(). I make sure to compare to tolower(email) so that again everything is lowercase. Finally, str_detect() returns TRUE/FALSE, so I multiply by 1L to get a 0/1 output.

The last step is to remove the temporary columns first_name and last_name using select() (unless you want to keep them).
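
To make the regular-expression step concrete, here is a minimal sketch for a single row (the second one in the data). The standalone variables are for illustration only; in the pipeline the same values come from the temporary columns.

```r
library(stringr)

first_name <- tolower("Lan")
last_name  <- tolower("xab")

# Build the pattern: anchored at both ends, with the dots escaped
pattern <- paste0("^", first_name, "\\.", last_name, "@\\w+\\.com$")
pattern

# Compare against the lowercased address
is_match <- str_detect(tolower("Lan.xab@ybcd.com"), pattern)
is_match

# Multiplying a logical by 1L gives an integer 0/1
1L * is_match
```

Without tolower() on the email, this row would not match, because the lowercased "lan" in the pattern is compared against "Lan" in the address.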

0 is FALSE and 1 is TRUE.

No, I am asking: what if I want 0 for matching names and 1 for non-matching names?

I have also tried it this way, but it doesn't work:

SIMPLIFY = TRUE),
all,
integer(0))

In my example, all that is necessary is to reverse ifelse this way

        mutate(Match_name = ifelse(tmp1 == tmp2,0,1)) 
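
For the base R version, a sketch of one way to get the reversed coding: compute the logical match first, negate it with !, and only then convert to integer. The last argument of vapply() is a template for the type and length of each result, not a value to return, which is why changing integer(1) to integer(0) fails. The two-row vectors and the variable names here are assumptions made for illustration.

```r
name  <- c("try,xab", "mhy,mun")
email <- c("xab.try@ybcd.com", "tth.vgu@ybcd.com")

# One logical vector per row: does each name part occur in the email?
hits <- mapply(function(nm, em) {
  vapply(strsplit(nm, ",", fixed = TRUE)[[1]], grepl, logical(1),
         x = em, fixed = TRUE)
}, name, email, SIMPLIFY = FALSE, USE.NAMES = FALSE)

# logical(1) is the template: each call to all() returns one logical
matched <- vapply(hits, all, logical(1))

# Flip, then convert: 0 for a match, 1 for a mismatch
mismatch <- as.integer(!matched)
mismatch
```

Applied to the full data frame, the same negate-then-convert step would be df6[["match_name"]] <- as.integer(!df6[["match_name"]]) once match_name holds the logical result.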

Formal education in programming, as in other subjects, has a hierarchy of solutions. They range from the most efficient in execution time to the most parsimonious in expression or in calls to non-base functions, or the best by some other standard of beauty. The criteria are a matter of taste.

From the perspective of practice, however, the criterion ought to optimize the weakest link in the chain: the user applying the code to get a result. The most expensive CPU and RAM are cheaper than the cheapest wetware on a per-attempt basis.

The target user may be someone who is writing the code, in which case the test is whether it can be understood again six months on. It may be a found user, whose means are more limited. For that case, the standard should be transparency, even at the price of tedium.

Ideally, the details should be encapsulated in a function, with the fewest possible arguments and the simplest possible return value that does the job.
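
As one possible illustration of that idea, here is a sketch of such a function in base R. The name name_matches_email and its two-argument signature are hypothetical; it assumes names are always "last,first" and that the email's local part is "first.last".

```r
# Hypothetical helper: TRUE where "last,first" corresponds to
# "first.last@..." and FALSE otherwise
name_matches_email <- function(name, email) {
  # "last,first" -> "first.last"
  expected <- sub("^(.*),(.*)$", "\\2.\\1", name)
  # local part of the address: everything before the "@"
  local_part <- sub("@.*$", "", email)
  expected == local_part
}

name_matches_email(c("try,xab", "mhy,mun"),
                   c("xab.try@ybcd.com", "tth.vgu@ybcd.com"))
```

It takes two character vectors and returns one logical vector, so a caller can write df6$Match_name <- name_matches_email(df6$name, df6$email) and recode the result however they like.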

How any code does in meeting the standard is, and should be, open to debate; however, that should be the standard.
