So there are a million digits of pi here that I’d like to read into a vector or column, but I’m not sure how to read (say) a CSV file with no delimiter. Can anyone help?
I don't know what will happen if you try to read in one million digits but I did test read.table() with a file with just a single number and it reads it in as a data frame with one row and one column. If there is no termination to the line, it raises a warning but the process works.
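To illustrate, here is a minimal sketch of that single-number test using a temporary file (the file name and contents are made up for the example). One caveat worth adding: `read.table()` parses the value as numeric by default, so a million-digit "number" would lose everything beyond about 15 significant digits; reading it as character avoids that.

```r
# Sketch: read.table() on a file holding a single number.
# colClasses = "character" keeps the value as a string, so no
# precision is lost the way it would be with numeric parsing.
tmp <- tempfile(fileext = ".txt")
writeLines("3.14159", tmp)

df <- read.table(tmp, colClasses = "character")
str(df)  # a data frame with one row and one column
```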
I gave this one a try; I didn't get the full million, but about 50,000 it seems:
```r
# Loading the rvest package
library('rvest')

# Specifying the url for the desired website to be scraped
url <- 'https://www.piday.org/million/'

# Reading the HTML code from the website
webpage <- read_html(url)
pi_xml <- html_nodes(webpage, '#million_pi')
pi_data <- html_text(pi_xml)
substr(pi_data, 1, 100)
```
I think the difficulty with taking it from the webpage is that more lines are loaded as you scroll down, so `read_html` probably doesn't capture them.
If you scroll down to the end and copy everything to the clipboard, on Windows you could (conveniently) try:

```r
PI <- readClipboard()
```
If you stored everything in a file (txt or csv), `read_file()` from the readr package might do the job.
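A quick sketch of that file-based route, using base `readLines()` so it runs without extra packages (`readr::read_file()` behaves similarly but returns the whole file, newlines included, as one string). The file name and contents here are made up for the example:

```r
# Sketch: read the digits back from a plain-text file as one string.
tmp <- tempfile(fileext = ".txt")
writeLines("3.14159265", tmp)

pi_string <- readLines(tmp)  # a single character string
nchar(pi_string)             # character count, decimal point included
```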
I can't seem to tell how many digits I have. It's all one number, rather than a single column of digits, which is what I'm after.
I tried changing it to a string to get its length, but R truncated it. I'm not sure that `length()` works on numbers.
I want to get each digit as a single row so that I can test for randomness of the digits 0–9.
Yeah, I need one column of one digit numbers, the digits being the ones on the web page.
Just split it into characters and then make a dataframe:
```r
# totally based on @valeri's solution, as I don't know web scraping at all
library(rvest)
#> Loading required package: xml2

url_to_be_scrapped <- 'https://www.piday.org/million/'
webpage_html <- read_html(x = url_to_be_scrapped)
pi_xml <- html_nodes(x = webpage_html, css = '#million_pi')
pi_text <- html_text(x = pi_xml)

# strsplit() returns a list; [[1]] extracts the character vector of digits
pi_vector <- strsplit(x = pi_text, split = "")[[1]]

# drop the leading "3" and "." before the decimal digits
pi_digits_after_decimal_dataframe <- data.frame(
  digits = as.integer(x = pi_vector[-(1:2)])
)
str(object = pi_digits_after_decimal_dataframe)
#> 'data.frame':    51197 obs. of  1 variable:
#>  $ digits: int  1 4 1 5 9 2 6 5 3 5 ...
```
Created on 2019-09-16 by the reprex package (v0.3.0)
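With the digits in a single column, the randomness check mentioned earlier could start with a chi-squared goodness-of-fit test for uniformity of the digit counts. This is only a sketch: `digits` below is a made-up random vector standing in for the scraped pi digits.

```r
# Sketch: chi-squared test for uniformity of digits 0-9.
# 'digits' is a placeholder; substitute the scraped digit column.
set.seed(1)
digits <- sample(0:9, 1000, replace = TRUE)

# factor(levels = 0:9) guarantees a count for every digit, even absent ones
counts <- table(factor(digits, levels = 0:9))
test <- chisq.test(counts, p = rep(1 / 10, 10))
test$p.value  # small p-value would suggest non-uniform digit frequencies
```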
And just to add a bit to this: since the `html_text` function returns a single string, `length` will return 1. Generally, if you would like to count the number of characters in a string, you can use `nchar`. And as @Yarnabrina mentions, if you want to make a vector of digits instead, then you need to split the string into its digits.
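The `length()`-versus-`nchar()` distinction in one small example:

```r
# length() counts elements of a vector; nchar() counts characters in a string.
s <- "3.14159"
length(s)  # 1 -- s is a one-element character vector
nchar(s)   # 7 -- characters in the string, decimal point included
```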
And linking to this post about programmatically scrolling down the page - apparently it can be done using RSelenium: how to scrape, do not load whole page until we scroll down?
Is it surprising or weird that it only grabs 51K characters? I don't know enough about R internals to know the difference, but I was hoping that data.table, supposedly designed for huge data sets, would be able to handle it.
It looks like html_text is the bottleneck that's only grabbing 51K digits. I have no idea why that would be the case.
Your skillful use of splitting etc. is very helpful. Having no delimiter meant certain doom, I thought!
The roughly 51,000-character "limit" is not in any way related to R or data frames as such. The page we are scraping is set up so that only about 50,000 characters are rendered on first load; to get the rest you need to scroll down. This is a common tactic so that web pages load faster, with further content rendered only when needed (here, when the user scrolls down). That is why I linked to the article which discusses programmatic scrolling using RSelenium (see above).
This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.