correctly load in a data.frame the data collected from the url

How to correctly load in a data.frame the data collected in this txt file from the url:
url
url = 'ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt'

With read.table, it does not load all the information:

aa<-read.table("ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt")
Warning message:
In read.table("ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt") :
  incomplete final line found by readTableHeader on 'ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt'
> aa
         V1   V2     V3                               V4
1 <!DOCTYPE html PUBLIC -//W3C//DTD XHTML 1.0 Strict//EN
                                                       V5     V6
1 http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd ><html
                                                V7                        V8
1 xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type"
                   V9                 V10                 V11   V12
1 content="text/html; charset=iso-8859-1" /><title>Forcepoint Block
                         V13               V14      V15  V16      V17       V18
1 Notification</title><style type="text/css">* {margin: 0px; padding: 0px;}body
       V19   V20          V21    V22        V23         V24        V25   V26
1 {margin: 10px; font-family: Arial, Helvetica, sans-serif; font-size: 1em;}

I have tried with:

url = 'ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt'; # URL for monthly mean data
str=readr::read_csv(url)   
Parsed with column specification:
cols(
  `# --------------------------------------------------------------------` = col_character()
)
Warning: 14 parsing failures.
row col  expected    actual                                                          file
  9  -- 1 columns 2 columns 'ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt'
 11  -- 1 columns 2 columns 'ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt'
 12  -- 1 columns 2 columns 'ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt'
 29  -- 1 columns 2 columns 'ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt'
 30  -- 1 columns 2 columns 'ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt'
... ... ......... ......... .............................................................
See problems(...) for more details.
 

it load all the information in str (str 814 obs. of 1 variable).
But I can't delete the rows of texts and accommodate the data within a data.frame

Everything indicates that they are restrictions of my server !!!
Some way to manipulate the output of:

str = readr :: read_csv (url)

to get a result you get in sample_data?

from row 70 to the end of

typeof (str)
[1] "list"

I'm interested in becoming data.frame.

head(str)
# A tibble: 6 x 1
  `# --------------------------------------------------------------------`
  <chr>                                                                   
1 # USE OF NOAA ESRL DATA                                                 
2 #                                                                       
3 # These data are made freely available to the public and the            
4 # scientific community in the belief that their wide dissemination      
5 # will lead to greater understanding and new scientific insights.       
6 # The availability of these data does not constitute publication        
> tail(str)
# A tibble: 6 x 1
  `# --------------------------------------------------------------------`
  <chr>                                                                   
1 2019   8    2019.625      409.95      409.95      411.86     29         
2 2019   9    2019.708      408.54      408.54      412.15     29         
3 2019  10    2019.792      408.53      408.53      411.95     30         
4 2019  11    2019.875      410.27      410.27      412.27     25         
5 2019  12    2019.958      411.76      411.76      412.43     31         
6 2020   1    2020.042      413.40      413.40      413.08     29         
>
 head(str[67:75,1],10)
# A tibble: 9 x 1
  `# --------------------------------------------------------------------`
  <chr>                                                                   
1 #                                                                       
2 #  (-99.99 missing data;  -1 no data for #daily means in month)         
3 #                                                                       
4 #            decimal     average   interpolated    trend    #days       
5 #             date                             (season corr)            
6 1958   3    1958.208      315.71      315.71      314.62     -1         
7 1958   4    1958.292      317.45      317.45      315.29     -1         
8 1958   5    1958.375      317.50      317.50      314.71     -1         
9 1958   6    1958.458      -99.99      317.10      314.85     -1         
> 

Could you help me?

My unsuccessful attempts:

d<-data.frame(dplyr::pull(str[72:814,1]))
names(d) <- c('year', 'month', 'decimalYear', 'average', 'interpolated', 'trend', 'ndays')
Error in names(d) <- c("year", "month", "decimalYear", "average", "interpolated",  : 
  'names' attribute [7] must be the same length as the vector [1]
> typeof(d)
[1] "list"

Help Me!!!!!!

The file you are accessing comes in a rather messy format - there is a lenghty introductory text on lines 1 to 69, then two lines of column headers and only then the actual data; columns seem to be separated by white spaces.

I suggest the following steps:

  • download the file via cURL (it is always easier to troubleshoot import of a local file)
  • read in the file via readr, skipping the header, as a single column data frame
  • separate the actual columns via tidyr::separate()
library(tidyverse)

local_file <- tempfile(fileext = ".txt")

curl::curl_download(url = "ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt", 
                    destfile = local_file)

local_dataframe <- readr::read_tsv(local_file, skip = 72, col_names = F) %>% 
  tidyr::separate(col = X1,
                  into = c("year", "month", "decimal_date", "average", "interpolated", "trend", "no_days"),
                  sep = "[^[\\d+\\.?\\d*$]]+")
1 Like

How to properly convert each column of the local_dataframe to numeric?

str(dataframe)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	743 obs. of  7 variables:
 $ year        : chr  "1958" "1958" "1958" "1958" ...
 $ month       : chr  "3" "4" "5" "6" ...
 $ decimal_date: chr  "1958.208" "1958.292" "1958.375" "1958.458" ...
 $ average     : chr  "315.71" "317.45" "317.50" "99.99" ...
 $ interpolated: chr  "315.71" "317.45" "317.50" "317.10" ...
 $ trend       : chr  "314.62" "315.29" "314.71" "314.85" ...
 $ no_days     : chr  "1" "1" "1" "1" ...
> sapply(dataframe, mode)
        year        month decimal_date      average interpolated        trend 
 "character"  "character"  "character"  "character"  "character"  "character" 
     no_days 
 "character" 
> 

you can pipe the data frame to dplyr::mutate_all() and apply base::as.numeric() function - say like this:

library(tidyverse)

local_file <- tempfile(fileext = ".txt")

curl::curl_download(url = "ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt", 
                    destfile = local_file)

local_dataframe <- readr::read_tsv(local_file, skip = 72, col_names = F) %>% 
  tidyr::separate(col = X1,
                  into = c("year", "month", "decimal_date", "average", "interpolated", "trend", "no_days"),
                  sep = "[^[-?\\d+\\.?\\d*$]]+") %>% 
  dplyr::mutate_all(as.numeric)

EDIT: on second thought I noticed that the regex I provided in sep = argument to tidyr::separate() handled negative numbers incorrectly; this one is better.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.