How can I create a script that downloads all data on all municipalities for all available years from the website www.obcepro.cz/obce? The data should be downloaded into R and then saved by year in files named "OBCE_ .csv", with the year of the downloaded data inserted into the file name. If in some year a municipality lacks data that are available for other municipalities, the missing observations for that year should be recorded as NA.
From the looks of it, each municipality has a unique identifier, e.g. Abertamy = 2727 (you can see this in the link behind the municipality's name: https://www.obcepro.cz/obec/2727). The link for the spreadsheet download is "https://www.obcepro.cz/obec/download/excel/data/" followed by this identifier.
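As a small illustration of that link pattern, here is a sketch that builds the download link from an identifier. The helper name build_download_url is made up for this example; only the base URL and the identifier 2727 come from the post above.

```r
# Base of the spreadsheet download link quoted above
download_base <- "https://www.obcepro.cz/obec/download/excel/data/"

# Hypothetical helper (name chosen for illustration):
# append one municipality identifier to the base link
build_download_url <- function(id) {
  paste0(download_base, id)
}

build_download_url(2727)
# [1] "https://www.obcepro.cz/obec/download/excel/data/2727"
```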
The best approach would be to use rvest to work out what these unique identifiers are: loop through the page numbers to extract all of the identifiers and associate them with the village names. You should be able to put these in a data.frame/tibble as you scrape with rvest. Then write a function that takes an identifier and downloads the data. read_excel() from the readxl package reads local files, so download each spreadsheet to a temporary file first (e.g. with download.file()) and then read it. You can even do this within the same tibble if you use map(), and the data will be downloaded and stored in a list column in your harvested dataset.
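The steps above could be sketched roughly as follows. This is only a sketch under assumptions, not tested against the live site: it assumes the listing page contains ordinary links whose href ends in /obec/<number> and that these links are present in the static HTML. The helper names extract_ids, scrape_listing and read_obec are made up for this example, and the URL pattern of the further listing pages is not known here, so it is left for you to fill in.

```r
library(rvest)
library(purrr)
library(tibble)
library(readxl)

# Hypothetical helper: keep hrefs shaped like ".../obec/<digits>"
# and return the trailing identifier as a character vector
extract_ids <- function(hrefs) {
  hits <- grepl("/obec/[0-9]+$", hrefs)   # NA hrefs give FALSE here
  sub(".*/obec/([0-9]+)$", "\\1", hrefs[hits])
}

# Hypothetical helper: scrape one listing page into a tibble of
# municipality names and identifiers (assumes plain <a> links)
scrape_listing <- function(page_url) {
  links <- html_elements(read_html(page_url), "a")
  hrefs <- html_attr(links, "href")
  keep  <- !is.na(hrefs) & grepl("/obec/[0-9]+$", hrefs)
  tibble(
    name = html_text2(links[keep]),
    id   = extract_ids(hrefs[keep])
  )
}

# Hypothetical helper: download one spreadsheet to a temporary file,
# then read it; read_excel() works on local paths
read_obec <- function(id) {
  url <- paste0("https://www.obcepro.cz/obec/download/excel/data/", id)
  tmp <- tempfile()   # no extension: readxl then checks the file itself
  download.file(url, tmp, mode = "wb", quiet = TRUE)
  read_excel(tmp)
}

# Not run here because it needs network access:
# page_urls <- c("https://www.obcepro.cz/obce")  # add the other listing pages
# obce <- do.call(rbind, map(page_urls, scrape_listing))
# obce$data <- map(obce$id, possibly(read_obec, otherwise = NULL))
```

Wrapping read_obec in possibly() means a single failed download leaves NULL in the list column instead of stopping the whole loop, so you can see afterwards which identifiers need another try.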
I have tried rvest, but it is not extracting anything. Could you please share some code to extract the data for the first identifier? I am doing something like this:
scraping <- read_html("https://www.obcepro.cz/obce")
text <- html_text(html_nodes(scraping, ".WordSection1"))
Could you please share some code?