How to correctly load in a data.frame the data collected in this txt file from the url:
url
url = 'ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt'
With read.table, it does not load all the information:
aa<-read.table("ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt")
Warning message:
In read.table("ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt") :
incomplete final line found by readTableHeader on 'ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt'
> aa
V1 V2 V3 V4
1 <!DOCTYPE html PUBLIC -//W3C//DTD XHTML 1.0 Strict//EN
V5 V6
1 http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd ><html
V7 V8
1 xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type"
V9 V10 V11 V12
1 content="text/html; charset=iso-8859-1" /><title>Forcepoint Block
V13 V14 V15 V16 V17 V18
1 Notification</title><style type="text/css">* {margin: 0px; padding: 0px;}body
V19 V20 V21 V22 V23 V24 V25 V26
1 {margin: 10px; font-family: Arial, Helvetica, sans-serif; font-size: 1em;}
I have tried with:
url = 'ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt'; # URL for monthly mean data
str=readr::read_csv(url)
Parsed with column specification:
cols(
`# --------------------------------------------------------------------` = col_character()
)
Warning: 14 parsing failures.
row col expected actual file
9 -- 1 columns 2 columns 'ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt'
11 -- 1 columns 2 columns 'ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt'
12 -- 1 columns 2 columns 'ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt'
29 -- 1 columns 2 columns 'ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt'
30 -- 1 columns 2 columns 'ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt'
... ... ......... ......... .............................................................
See problems(...) for more details.
it load all the information in str (str 814 obs. of 1 variable).
But I can't delete the rows of texts and accommodate the data within a data.frame
Everything indicates that they are restrictions of my server !!!
Some way to manipulate the output of:
str = readr :: read_csv (url)
to get a result you get in sample_data?
from row 70 to the end of
typeof (str)
[1] "list"
I'm interested in becoming data.frame.
head(str)
# A tibble: 6 x 1
`# --------------------------------------------------------------------`
<chr>
1 # USE OF NOAA ESRL DATA
2 #
3 # These data are made freely available to the public and the
4 # scientific community in the belief that their wide dissemination
5 # will lead to greater understanding and new scientific insights.
6 # The availability of these data does not constitute publication
> tail(str)
# A tibble: 6 x 1
`# --------------------------------------------------------------------`
<chr>
1 2019 8 2019.625 409.95 409.95 411.86 29
2 2019 9 2019.708 408.54 408.54 412.15 29
3 2019 10 2019.792 408.53 408.53 411.95 30
4 2019 11 2019.875 410.27 410.27 412.27 25
5 2019 12 2019.958 411.76 411.76 412.43 31
6 2020 1 2020.042 413.40 413.40 413.08 29
>
head(str[67:75,1],10)
# A tibble: 9 x 1
`# --------------------------------------------------------------------`
<chr>
1 #
2 # (-99.99 missing data; -1 no data for #daily means in month)
3 #
4 # decimal average interpolated trend #days
5 # date (season corr)
6 1958 3 1958.208 315.71 315.71 314.62 -1
7 1958 4 1958.292 317.45 317.45 315.29 -1
8 1958 5 1958.375 317.50 317.50 314.71 -1
9 1958 6 1958.458 -99.99 317.10 314.85 -1
>
Could you help me?
My unsuccessful attempts:
d<-data.frame(dplyr::pull(str[72:814,1]))
names(d) <- c('year', 'month', 'decimalYear', 'average', 'interpolated', 'trend', 'ndays')
Error in names(d) <- c("year", "month", "decimalYear", "average", "interpolated", :
'names' attribute [7] must be the same length as the vector [1]
> typeof(d)
[1] "list"
Help Me!!!!!!
The file you are accessing comes in a rather messy format - there is a lenghty introductory text on lines 1 to 69, then two lines of column headers and only then the actual data; columns seem to be separated by white spaces.
I suggest the following steps:
- download the file via cURL (it is always easier to troubleshoot import of a local file)
- read in the file via readr, skipping the header, as a single column data frame
- separate the actual columns via
tidyr::separate()
library(tidyverse)
local_file <- tempfile(fileext = ".txt")
curl::curl_download(url = "ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt",
destfile = local_file)
local_dataframe <- readr::read_tsv(local_file, skip = 72, col_names = F) %>%
tidyr::separate(col = X1,
into = c("year", "month", "decimal_date", "average", "interpolated", "trend", "no_days"),
sep = "[^[\\d+\\.?\\d*$]]+")
How to properly convert each column of the local_dataframe to numeric?
str(dataframe)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 743 obs. of 7 variables:
$ year : chr "1958" "1958" "1958" "1958" ...
$ month : chr "3" "4" "5" "6" ...
$ decimal_date: chr "1958.208" "1958.292" "1958.375" "1958.458" ...
$ average : chr "315.71" "317.45" "317.50" "99.99" ...
$ interpolated: chr "315.71" "317.45" "317.50" "317.10" ...
$ trend : chr "314.62" "315.29" "314.71" "314.85" ...
$ no_days : chr "1" "1" "1" "1" ...
> sapply(dataframe, mode)
year month decimal_date average interpolated trend
"character" "character" "character" "character" "character" "character"
no_days
"character"
>
you can pipe the data frame to dplyr::mutate_all()
and apply base::as.numeric()
function - say like this:
library(tidyverse)
local_file <- tempfile(fileext = ".txt")
curl::curl_download(url = "ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt",
destfile = local_file)
local_dataframe <- readr::read_tsv(local_file, skip = 72, col_names = F) %>%
tidyr::separate(col = X1,
into = c("year", "month", "decimal_date", "average", "interpolated", "trend", "no_days"),
sep = "[^[-?\\d+\\.?\\d*$]]+") %>%
dplyr::mutate_all(as.numeric)
EDIT: on second thought I noticed that the regex I provided in sep =
argument to tidyr::separate()
handled negative numbers incorrectly; this one is better.
This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.