get file name from url?

I'm scraping the following list of URLs:

library(xml2)
library(rvest)
library(stringr)

URL <- "https://thedataweb.rm.census.gov/ftp/cps_ftp.html"
pg <- read_html(URL)
head(html_attr(html_nodes(pg, "a"), "href"))
#> [1] "#cpscert"          "#cpsbasic"         "#cpsbasic_extract"
#> [4] "#cpsmarch"         "#cpssupps"         "#cpsrepwgt"
links <- html_attr(html_nodes(pg, "a"), "href")
zips <- str_subset(links, "zip")
zips[[1]]
#> [1] "http://thedataweb.rm.census.gov/pub/cps/supps/jan15-dec15cert_ext.zip"

# I want to get "jan15-dec15cert_ext"

Created on 2019-03-06 by the reprex package (v0.2.1)

I would like to extract the file names (without the extension) from zips. For example, from zips[[1]] I want to get jan15-dec15cert_ext. Can somebody help me with this regular expression magic?

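I assume there are more elegant ways, but this should do the trick. Note that the pattern assumes every file name starts with three lowercase letters followed by two digits (like jan15):

library(tibble)
library(dplyr)
enframe(zips, name = NULL) %>%
  mutate(link.part = stringr::str_extract(value, "[a-z]{3}[0-9]{2}[a-z0-9_-]+"))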

Here's a somewhat simplistic regex that will do the trick.

text <- "http://thedataweb.rm.census.gov/pub/cps/supps/jan15-dec15cert_ext.zip"
pattern <- ".+/(.+)\\.\\w+$"

sub(pattern, replacement = "\\1", text)
#> [1] "jan15-dec15cert_ext"

Created on 2019-03-06 by the reprex package (v0.2.1)
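
Since sub() is vectorised, the same pattern should work on the whole vector of links rather than one string at a time. Here's a minimal sketch of that; the vector below is only a stand-in for zips, and the two example.com URLs are made up for illustration.

```r
# Stand-in for `zips`; only the first URL comes from the question,
# the other two are hypothetical.
zips <- c(
  "http://thedataweb.rm.census.gov/pub/cps/supps/jan15-dec15cert_ext.zip",
  "http://example.com/pub/data/file_one.zip",
  "http://example.com/pub/other/file-two.zip"
)

# ".+/"       greedily consumes everything up to the last slash
# "(.+)"      captures the file name
# "\\.\\w+$"  matches the final dot and extension, which are dropped
pattern <- ".+/(.+)\\.\\w+$"

file_names <- sub(pattern, replacement = "\\1", zips)
file_names
#> [1] "jan15-dec15cert_ext" "file_one"            "file-two"
```

Because the pattern matches the whole string, sub() replaces all of it with just the captured group.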

There are a number of approaches you can take. I wrote an RStudio addin called regexplain that lets you interactively preview regular expressions and what they match, which you might find helpful.

There are functions available in base R for this, and they come with the benefit of explicit code.

library(tools)

text <- "http://thedataweb.rm.census.gov/pub/cps/supps/jan15-dec15cert_ext.zip"
file_path_sans_ext(basename(text))
# [1] "jan15-dec15cert_ext"

Remember: in R, there's probably a function for any task. Somewhere.
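
To see what each function contributes, here is the same idea split into steps. This is just a sketch; the intermediate variable names are mine, and file_ext() (also from tools) is only there to show the part that gets dropped.

```r
library(tools)

url <- "http://thedataweb.rm.census.gov/pub/cps/supps/jan15-dec15cert_ext.zip"

# basename() keeps only what follows the last "/"
file_with_ext <- basename(url)
file_with_ext
#> [1] "jan15-dec15cert_ext.zip"

# file_ext() returns the extension without the dot
ext <- file_ext(file_with_ext)
ext
#> [1] "zip"

# file_path_sans_ext() strips the dot and the extension
stem <- file_path_sans_ext(file_with_ext)
stem
#> [1] "jan15-dec15cert_ext"
```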

The fs package has a ton of functions for working with file paths and pulling out this info too, if you are into using pipes or if you are looking for help with discovering related functions:

https://fs.r-lib.org/reference/index.html

library(fs)
library(tidyverse)

text <- "http://thedataweb.rm.census.gov/pub/cps/supps/jan15-dec15cert_ext.zip"

text %>% 
  path_file() %>% 
  path_ext_remove()
#> jan15-dec15cert_ext

Created on 2019-03-07 by the reprex package (v0.2.1)
