Having trouble reading a CSV file: all variables (factors, numerics, etc.) are automatically converted into character strings. Can someone help me? Thank you!

I am currently doing a course project for Reproducible Research in R, using an R Markdown file.
For some reason, when I try to read the CSV file specified by the project, all variables are converted into "chr"s.

Here is the code for reading the file.

if (!file.exists("StormData.csv.bz2")) {
     fileUrl<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
     download.file(fileUrl, destfile="StormData.csv.bz2", method="curl")
   
     if (!file.exists("StormData.csv.bz2")) {
          stop("Can't locate file 'StormData.csv.bz2'!")
     }
}

stormdf <- read.csv("StormData.csv.bz2",quote = "",stringsAsFactors = FALSE)



```{r,echo=TRUE}
str(stormDataRaw)
```
'data.frame':	1773320 obs. of  37 variables:
 $ X.STATE__.   : chr  "1.00" "1.00" "1.00" "1.00" ...
 $ X.BGN_DATE.  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
 $ X.BGN_TIME.  : chr  "\"0130\"" "\"0145\"" "\"1600\"" "\"0900\"" ...
 $ X.TIME_ZONE. : chr  "\"CST\"" "\"CST\"" "\"CST\"" "\"CST\"" ...
 $ X.COUNTY.    : chr  "97.00" "3.00" "57.00" "89.00" ...
 $ X.COUNTYNAME.: chr  "\"MOBILE\"" "\"BALDWIN\"" "\"FAYETTE\"" "\"MADISON\"" ...
 $ X.STATE.     : chr  "\"AL\"" "\"AL\"" "\"AL\"" "\"AL\"" ...
 $ X.EVTYPE.    : chr  "\"TORNADO\"" "\"TORNADO\"" "\"TORNADO\"" "\"TORNADO\"" ...
 $ X.BGN_RANGE. : chr  "0.00" "0.00" "0.00" "0.00" ...
 $ X.BGN_AZI.   : chr  "" "" "" "" ...
 $ X.BGN_LOCATI.: chr  "" "" "" "" ...
 $ X.END_DATE.  : chr  "" "" "" "" ...
 $ X.END_TIME.  : chr  "" "" "" "" ...
 $ X.COUNTY_END.: chr  "0.00" "0.00" "0.00" "0.00" ...
 $ X.COUNTYENDN.: chr  "" "" "" "" ...
 $ X.END_RANGE. : chr  "0.00" "0.00" "0.00" "0.00" ...
 $ X.END_AZI.   : chr  "" "" "" "" ...
 $ X.END_LOCATI.: chr  "" "" "" "" ...
 $ X.LENGTH.    : chr  "14.00" "2.00" "0.10" "0.00" ...
 $ X.WIDTH.     : chr  "100.00" "150.00" "123.00" "100.00" ...
 $ X.F.         : chr  "\"3\"" "\"2\"" "\"2\"" "\"2\"" ...
 $ X.MAG.       : chr  "0.00" "0.00" "0.00" "0.00" ...
 $ X.FATALITIES.: chr  "0.00" "0.00" "0.00" "0.00" ...
 $ X.INJURIES.  : chr  "15.00" "0.00" "2.00" "2.00" ...
 $ X.PROPDMG.   : chr  "25.00" "2.50" "25.00" "2.50" ...
 $ X.PROPDMGEXP.: chr  "\"K\"" "\"K\"" "\"K\"" "\"K\"" ...
 $ X.CROPDMG.   : chr  "0.00" "0.00" "0.00" "0.00" ...
 $ X.CROPDMGEXP.: chr  "" "" "" "" ...
 $ X.WFO.       : chr  "" "" "" "" ...
 $ X.STATEOFFIC.: chr  "" "" "" "" ...
 $ X.ZONENAMES. : chr  "" "" "" "" ...
 $ X.LATITUDE.  : chr  "3040.00" "3042.00" "3340.00" "3458.00" ...
 $ X.LONGITUDE. : chr  "8812.00" "8755.00" "8742.00" "8626.00" ...
 $ X.LATITUDE_E.: chr  "3051.00" "0.00" "0.00" "0.00" ...
 $ X.LONGITUDE_.: chr  "8806.00" "0.00" "0.00" "0.00" ...
 $ X.REMARKS.   : chr  "" "" "" "" ...
 $ X.REFNUM.    : chr  "1.00" "2.00" "3.00" "4.00" ...

But it should be like this:

## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

According to https://stackoverflow.com/questions/25948777/extract-bz2-file-in-r this should work, but also see the bunzip2 approach mentioned there.

Try changing the quote string in your call to read.csv(). This command seemed to work for me.

stormdf <- read.csv("StormData.csv.bz2",quote = '"',stringsAsFactors = FALSE)

The quote character is a double quote ("), and in the command it is enclosed in single quotes.
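
For illustration, here is a minimal sketch of what the quote argument changes, using a tiny inline CSV instead of the real file (the two-column sample data is made up for this example). With quote = "", double quotes are treated as ordinary characters, so they stay inside the values and the column names are rewritten into the X.NAME. form seen in the first str() output.

```r
# Made-up sample in the same style as the storm data (not the real file).
csv_lines <- c(
  '"STATE__","EVTYPE"',
  '1.00,"TORNADO"',
  '2.00,"HAIL"'
)

# quote = "" disables quote handling: the " characters become part of the data.
raw <- read.csv(text = csv_lines, quote = "", stringsAsFactors = FALSE)
print(names(raw))      # "X.STATE__." "X.EVTYPE."
print(raw$X.EVTYPE.)   # values still carry their literal double quotes

# quote = '"' (the read.csv default) strips the surrounding quotes.
ok <- read.csv(text = csv_lines, quote = '"', stringsAsFactors = FALSE)
print(names(ok))       # "STATE__" "EVTYPE"
print(ok$EVTYPE)       # "TORNADO" "HAIL"
```

On the real file, free-text columns such as REMARKS probably also contain commas and line breaks inside quotes. Without quote handling those would split records apart, which might explain the inflated row count, though I have not verified that on the actual data.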

Hi thank you for your reply,

it gives me back:

EOF within quoted string

and still, all the variables are characters :confused:
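
One way to check how the quote setting affects parsing is count.fields() from base R, which reports how many fields each line splits into. Below is a minimal sketch on made-up inline data; the helper name fields_per_line and the sample rows are hypothetical and only there for illustration.

```r
# Made-up sample: the last record has a comma inside a quoted field.
csv_lines <- c(
  '"REFNUM","REMARKS"',
  '1.00,"Minor damage"',
  '2.00,"Trees down, roads closed"'
)

# Hypothetical helper: count fields per line for a given quote setting.
fields_per_line <- function(lines, quote) {
  con <- textConnection(lines)
  on.exit(close(con))
  count.fields(con, sep = ",", quote = quote)
}

n_noquote <- fields_per_line(csv_lines, quote = "")
n_quote   <- fields_per_line(csv_lines, quote = '"')

print(n_noquote)  # 2 2 3: the embedded comma splits the last record
print(n_quote)    # 2 2 2: quotes protect the embedded comma
```

On the real file, something like table(count.fields("StormData.csv.bz2", sep = ",", quote = '"')) should show whether every record comes out with 37 fields. According to the count.fields() documentation, lines that start a multi-line quoted field are reported as NA.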

If you want factors instead of characters, why are you using stringsAsFactors = FALSE? Can you please try with this?

download.file(url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
              destfile="StormData.csv.bz2",
              method="curl")
stormdf <- read.csv(file = "StormData.csv.bz2",
                    stringsAsFactors = TRUE)
str(object = stormdf)
#> 'data.frame':    902297 obs. of  37 variables:
#>  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
#>  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
#>  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
#>  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
#>  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
#>  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
#>  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
#>  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
#>  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
#>  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
#>  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
#>  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
#>  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
#>  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
#>  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
#>  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ WFO       : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
#>  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
#>  $ LATITUDE_E: num  3051 0 0 0 0 ...
#>  $ LONGITUDE_: num  8806 0 0 0 0 ...
#>  $ REMARKS   : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

If this doesn't work, can you please provide a reproducible example?

PS:

FJCC's suggestion doesn't give me any error, so please check. I'm not using quotes, but it should not give that error.
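
To see the effect of stringsAsFactors in isolation, a small sketch on made-up inline data may help (the three sample rows are invented for this example). Note that since R 4.0.0 the default is stringsAsFactors = FALSE, so factors have to be requested explicitly.

```r
# Made-up sample rows, not taken from the real file.
csv_lines <- c(
  "STATE,EVTYPE,INJURIES",
  "AL,TORNADO,15",
  "AL,TORNADO,0",
  "TX,HAIL,2"
)

as_chr <- read.csv(text = csv_lines, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_lines, stringsAsFactors = TRUE)

print(class(as_chr$EVTYPE))    # "character"
print(class(as_fct$EVTYPE))    # "factor"
print(levels(as_fct$EVTYPE))   # "HAIL" "TORNADO"

# Numeric columns are unaffected by stringsAsFactors.
print(class(as_fct$INJURIES))  # "integer"
```

Either way, text columns only become chr or Factor; the numeric columns should stay numeric as long as the file is parsed correctly.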

If I run this code

fileUrl<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl, destfile="StormData.csv.bz2", method="curl")

stormdf <- read.csv("StormData.csv.bz2",quote = '"',stringsAsFactors = FALSE)

The str() function returns the following (truncated) from that code.

> str(stormdf)
'data.frame':	902297 obs. of  37 variables:
 $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
 $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
 $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
 $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
 $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...

Hi thank you,

I found another problem: there should be around 900000 observations and 37 variables.
When I tried the same code as yours, it only gave me back 692288 observations with 37 variables, and all variables are factors.

Still 60000 obs. and all chr. :confused:

I am trying to load a dataset from the URL below.

It should contain around 900000 observations

But after entering the following download + read.csv code

fileUrl<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl, destfile="StormData.csv.bz2", method="curl")

StormData <- read.csv("StormData.csv.bz2",quote = '"',stringsAsFactors = FALSE)

The StormData I just defined only contains around 690000 obs.
The repdata_data_StormData at the top was imported manually. It works, but I need to output it as an HTML file so...

What is the size of StormData.csv.bz2 on your local disk? On my system it is 48025 KB.

I was surprised that you wrote 'so' and trailed off...

Windows: 46.8 MB (49,177,144 bytes)
The unzipped Excel file is 535 MB
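
For what it's worth, the two sizes quoted above appear to describe the same file: 49,177,144 bytes divided by 1024 is roughly 48,025 KB. A quick sketch of that arithmetic, plus a way to check the size from within R (file.size() is base R; the file name is the one used earlier in the thread and is assumed to be in the working directory):

```r
# Size reported by Windows, in bytes.
size_bytes <- 49177144

# Convert to KB (1 KB = 1024 bytes) and round up, as file managers usually do.
size_kb <- size_bytes / 1024
print(ceiling(size_kb))  # 48025

# Check the local copy directly, if it exists in the working directory.
local_file <- "StormData.csv.bz2"
if (file.exists(local_file)) {
  print(file.size(local_file))  # size in bytes
}
```

If the sizes match, the download itself is probably not what causes the differing row counts.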

Thank you for your reply, but I do not get it

Thank you all, I tried fread in the end and it worked.

library(data.table)  # fread() reads .bz2 files directly when the R.utils package is installed
StormData <- fread("repdata_data_StormData.csv.bz2", stringsAsFactors = TRUE)
