Regex extract domain from web page

Hi,

I have been scratching my head all day on this, as I'm not a regex expert.

If i have a web address like:


"https://www.facebook.com/foo/"

"https://www.facebook.is/foo/"

"https://www.facebook.lw/foo/"

"https://www.facebook.de/foo/"

"https://www.instagram.de/foo/"

"https://www.twitter.de/foo/"

"https://www.amazon.de/foo/"

These are all made-up addresses :slight_smile:

I wish to be able to just extract the middle part, so the first four web addresses would produce facebook, the fifth instagram, the sixth twitter, and the seventh amazon.

Below is the closest I have gotten to it:


test <- "https://www.facebook.com/foo/"

stringr::str_extract(test, '(?<=www.)(.*)(?=.)')

# "facebook.com/foo"

Thank you for your time
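The attempt above is close: the dots in the pattern just need escaping (an unescaped `.` matches any character), and a character class makes the match stop at the first dot after "www". A minimal base-R sketch, assuming every address really does start with "www." (the same pattern also works with stringr::str_extract):

```r
# Lookbehind for the literal "www.", then grab everything up to the next dot.
url <- "https://www.facebook.com/foo/"
name <- regmatches(url, regexpr("(?<=www\\.)[^.]+", url, perl = TRUE))
name
# "facebook"

# Equivalent with stringr:
# stringr::str_extract(url, "(?<=www\\.)[^.]+")
```

Note this is only a sketch: it returns nothing for addresses without a "www." prefix, which the split-based approach below handles differently.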

Hi All,

I think the below should do it


extract_website <- function(website){

  # Split on literal dots; simplify = TRUE returns a one-row matrix
  my <- stringr::str_split(website, '\\.', simplify = TRUE)
  parts <- ncol(my)

  if (website == 'missing') {
    result <- 'missing'
  } else if (my[parts - 1] %in% c('co', 'com', 'gov', 'net', 'org', 'edu')) {
    # Suffixes like ".co.uk" or ".com/foo/": the name sits one piece further left
    result <- my[parts - 2]
  } else {
    result <- my[parts - 1]
  }
  result
}
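To sanity-check the split-and-inspect logic on the example addresses, here is a self-contained base-R sketch of the same idea (strsplit stands in for stringr::str_split, so no package is needed), applied over several URLs with sapply:

```r
# Base-R sketch of the same logic: split on literal dots, then take the
# piece just before the suffix.
extract_site <- function(website) {
  parts <- strsplit(website, ".", fixed = TRUE)[[1]]
  n <- length(parts)
  # For suffixes like "co" (as in .co.uk) the name sits one piece earlier
  if (parts[n - 1] %in% c("co", "com", "gov", "net", "org", "edu")) {
    parts[n - 2]
  } else {
    parts[n - 1]
  }
}

urls <- c("https://www.facebook.com/foo/",
          "https://www.instagram.de/foo/",
          "https://www.amazon.de/foo/")
sapply(urls, extract_site, USE.NAMES = FALSE)
# "facebook"  "instagram" "amazon"
```

This handles both the ".com" and the country-code examples from the question, though a production solution would want the full Public Suffix List rather than a short hand-picked vector of suffixes.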

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.