Check whether the data in a database is correct

I have a database of emails like the one below, and I want to filter out the emails that are not correct.
For example:

  1. the email does not contain "."
  2. the email has more than one "@"
  3. the email has more than one "." before and after the "@"
  4. the email has spaces inside or around it
  5. the email has a domain other than "gmail.com" (for example hotmail.com or live.com)

Please help me do this in a way that lets me add more conditions later if I find anything else that needs checking.

df <- data.frame(email=c("abc@gmail.com","def@gmail.com","ghi@gmail.com","jkl@gmail.com","mno@gmail.com","pqr@hotmail.com","st@u@live.com","vwx@gmail.com","yza@gmail.com","a.a.b@gmail.c.om",
                   "aac@gmail.com","abb@gmail.com","abc@gmail.com","cab@gmailcom","dfc@gmail.com"))

For example, the output would look like this:

email              not_having "."   more than 1 "@"   more than 1 "."
abc@gmail.com      0                0                 0
def@gmail.com      0                0                 0
ghi@gmail.com      0                0                 0
jkl@gmailcom       1                0                 0
mno@gmail.com      0                0                 0
pqr@hotmail.com    0                0                 0
st@u@live.com      0                1                 0
vwx@gmail.com      0                0                 0
yza@gmail.com      0                0                 0
a.a.b@gmail.c.om   0                0                 1
aac@gmail.com      0                0                 0
abb@gmail.com      0                0                 0
abc@gmail.com      0                0                 0
cab@gmailcom       0                0                 0
dfc@gmail.com      0                0                 0

Parsing email addresses in a way that accounts for almost all possible cases requires a complex regular expression, such as

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

See RFC 5322; see also this Stack Overflow thread.
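
To give a sense of what a much-simplified check looks like, here is a rough sketch in base R. The pattern `simple_email` and the sample vector are assumptions made up for illustration; the pattern only requires one "@", no whitespace, and a dot somewhere in the domain, so it is far from RFC 5322-complete.

```r
# Simplified pattern (an illustrative assumption, NOT RFC 5322-complete):
# non-space, non-@ characters, a single "@", then a domain with at least one dot.
simple_email <- "^[^@[:space:]]+@[^@[:space:]]+\\.[^@[:space:]]+$"

# Hypothetical sample addresses, three of them malformed on purpose
emails <- c("abc@gmail.com", "st@u@live.com", "cab@gmailcom", " abc@gmail.com")

# TRUE where the address passes the simplified check
ok <- grepl(simple_email, emails)

data.frame(email = emails, looks_valid = ok)
```

A check like this catches obvious typos but will still reject some valid addresses (quoted local parts, for instance) and accept some invalid ones.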

Starting at step 5 in the OP reduces the complexity, however: once only gmail.com addresses are kept, most of the other tests in the OP become unnecessary.

suppressPackageStartupMessages({library(dplyr)
                                library(stringr)
                                })

df <- data.frame(email=c("abc@gmail.com","def@gmail.com","ghi@gmail.com","jkl@gmail.com","mno@gmail.com","pqr@hotmail.com","st@u@live.com","vwx@gmail.com","yza@gmail.com","a.a.b@gmail.c.om","aac@gmail.com","abb@gmail.com","abc@gmail.com","cab@gmailcom","dfc@gmail.com"))

# Anchored pattern: exactly one "@", no whitespace, and the domain must be gmail.com
is_gmail  <- "^[^@\\s]+@gmail\\.com$"

df %>% filter(str_detect(email,is_gmail))
#>            email
#> 1  abc@gmail.com
#> 2  def@gmail.com
#> 3  ghi@gmail.com
#> 4  jkl@gmail.com
#> 5  mno@gmail.com
#> 6  vwx@gmail.com
#> 7  yza@gmail.com
#> 8  aac@gmail.com
#> 9  abb@gmail.com
#> 10 abc@gmail.com
#> 11 dfc@gmail.com

Created on 2020-08-27 by the reprex package (v0.3.0)
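
If the per-rule flags from the original question are wanted rather than a single filter, one possible sketch in base R is below. The helper `count_char`, the flag column names, and the shortened sample vector are all assumptions for illustration, and rule 3 is read here as "more than one dot before the @ or more than one dot after it", which may not be exactly what the OP meant.

```r
# Hypothetical sample, shortened from the OP's data plus one address with a space
email <- c("abc@gmail.com", "pqr@hotmail.com", "st@u@live.com",
           "a.a.b@gmail.c.om", "cab@gmailcom", "abc @gmail.com")

# Helper (illustrative name): count literal occurrences of `ch` in each string
count_char <- function(x, ch) {
  lengths(regmatches(x, gregexpr(ch, x, fixed = TRUE)))
}

local_part <- sub("@.*$", "", email)   # text before the first "@"
domain     <- sub("^.*@", "", email)   # text after the last "@"

# One 0/1 column per rule; add a new column here to add a new rule
flags <- data.frame(
  email     = email,
  no_dot    = as.integer(!grepl(".", email, fixed = TRUE)),
  multi_at  = as.integer(count_char(email, "@") > 1),
  multi_dot = as.integer(count_char(local_part, ".") > 1 |
                         count_char(domain, ".") > 1),
  has_space = as.integer(grepl("[[:space:]]", email)),
  not_gmail = as.integer(!grepl("@gmail\\.com$", email))
)

# Keep only the rows that trip none of the rules
clean <- flags[rowSums(flags[-1]) == 0, ]
```

Because `clean` is computed from the row sums of every flag column, a rule added later as another column in `flags` is picked up without changing the filtering step.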
