Using readr's read_tsv to read a zip-compressed TSV file from a URL

In the following, I try to read a zipped TSV file whose location is given by a URL.

library(readr)
Df <- read_tsv("http://crr.ugent.be/blp/txt/blp-stimuli.txt.zip")

It raises a somewhat cryptic error. According to https://github.com/tidyverse/readr/issues/720, although read_tsv and related functions will uncompress files and will read from URLs, compressed files read from URLs are only uncompressed automatically if the file is in .gz format.

My question is whether there is any way to explicitly tell read_tsv and similar functions that the file is a zip, so that it downloads it, unzips it, and then reads it. E.g., is there something like the following?

Df <- read_tsv("http://crr.ugent.be/blp/txt/blp-stimuli.txt.zip", compression='zip')

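Currently, readr can't do this in a single call. You have to do it in two steps instead of one:

  1. Download the zip file to a local (temporary) file
  2. Read from the local zip file, which read_tsv uncompresses automatically
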
library(readr)
url <- "http://crr.ugent.be/blp/txt/blp-stimuli.txt.zip"
zip_file <- tempfile(fileext = ".zip")
download.file(url, zip_file, mode = "wb")
df <- read_tsv(zip_file)
#> Parsed with column specification:
#> cols(
#>   .default = col_double(),
#>   spelling = col_character(),
#>   morphology = col_character(),
#>   flection = col_character(),
#>   synclass = col_character()
#> )
#> See spec(...) for full column specifications.
df
#> # A tibble: 55,865 x 22
#>    spelling coltheart.N OLD20 nletters  nsyl morphology flection synclass
#>    <chr>          <dbl> <dbl>    <dbl> <dbl> <chr>      <chr>    <chr>   
#>  1 a/c                1  1.95        3     2 irrelevant headword Undefin~
#>  2 aas                6  1.55        3     2 monomorph~ plural   Noun    
#>  3 aback              2  1.85        5     2 complex    positive Adverb  
#>  4 abaft              0  2           5     2 complex.c~ headwor~ Adverb.~
#>  5 aband              0  1.95        5    NA <NA>       <NA>     <NA>    
#>  6 abase              3  1.7         5     2 may_inclu~ infinit~ Verb    
#>  7 abased             3  1.75        6     2 may_inclu~ past pa~ Verb    
#>  8 abashed            1  1.85        7     2 may_inclu~ past pa~ Verb    
#>  9 abate              2  1.75        5     2 may_inclu~ infinit~ Verb    
#> 10 abates             3  1.75        6     2 may_inclu~ singula~ Verb    
#> # ... with 55,855 more rows, and 14 more variables: celex.frequency <dbl>,
#> #   celex.frequency.lemma <dbl>, celex.inflectional.entropy <dbl>,
#> #   lemma.size <dbl>, nlemmas <dbl>, bnc.frequency <dbl>,
#> #   bnc.frequency.million <dbl>, subtlex.frequency <dbl>,
#> #   subtlex.frequency.million <dbl>, subtlex.cd <dbl>,
#> #   subtlex.cd.pct <dbl>, summed.monogram <dbl>, summed.bigram <dbl>,
#> #   summed.trigram <dbl>
unlink(zip_file)

Created on 2019-03-18 by the reprex package (v0.2.1)
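
If you need this often, the two steps above could be wrapped in a small helper. This is only a minimal sketch: read_tsv_zip_url and its member argument are hypothetical names of my own, not part of readr. It assumes the archive is a plain zip, and it reads the first file listed in the archive unless you name another one.

```r
library(readr)

# Hypothetical helper, not part of readr: download a zip archive from `url`
# to a temporary file and read one tab-separated file out of it.
# `member` is the name of the file inside the archive; when NULL, the first
# entry listed in the archive is used.
read_tsv_zip_url <- function(url, member = NULL, ...) {
  zip_file <- tempfile(fileext = ".zip")
  # Remove the temporary download when the function exits, even on error.
  on.exit(unlink(zip_file), add = TRUE)

  # mode = "wb" keeps the binary archive intact on Windows.
  download.file(url, zip_file, mode = "wb", quiet = TRUE)

  if (is.null(member)) {
    # List the archive contents without extracting and take the first entry.
    member <- unzip(zip_file, list = TRUE)$Name[1]
  }

  # unz() opens a connection to a single file inside the archive;
  # extra arguments in `...` are passed on to read_tsv().
  read_tsv(unz(zip_file, member), ...)
}

# Usage (needs network access):
# df <- read_tsv_zip_url("http://crr.ugent.be/blp/txt/blp-stimuli.txt.zip")
```

Going through unz() rather than passing the zip path straight to read_tsv is a design choice, so that an archive holding several files can still be read by naming the one you want.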
