Import data with currency format

stat_geek · September 5, 2020, 8:49pm

I'm trying to read in some raw data from a CSV file, but the values are formatted as currency. I know I can read them as character and convert them after the fact using gsub(), but is there a way to read them in correctly in a single step?

The values are embedded in quotes in the CSV file as well, so that shouldn't be an issue. I'm assuming I need to change the col_double() but don't know what the correct specifications would be... or if that's not right, what another approach may be.

My data looks like this in the CSV file:

ID giftAmount
1, "$25.00"
2, "$50.00"
3, "$0.00"
4, "$235.45"

Current Code:

my_col_names <- c("ID", "giftAmount")
my_col_types <- cols(ID = col_character(), giftAmount = col_double())
raw_data <- read_csv("./rawData/myData.csv", col_names = my_col_names, col_types = my_col_types, skip = 1)

technocrat · September 5, 2020, 9:11pm

To do in one pass, make a function of

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(readr)

read_csv("/Users/rc/Desktop/grist.csv") %>% mutate(giftAmount = parse_number(giftAmount))
#> Parsed with column specification:
#> cols(
#>   ID = col_double(),
#>   giftAmount = col_character()
#> )
#> # A tibble: 4 x 2
#>      ID giftAmount
#>   <dbl>      <dbl>
#> 1     1        25 
#> 2     2        50 
#> 3     3         0 
#> 4     4       235.

Created on 2020-09-05 by the reprex package (v0.3.0)
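
A minimal sketch of what "make a function of" could look like. The helper name read_gifts and the temporary file are hypothetical, and the sample file is assumed to have a comma-separated header and no spaces after the commas:

```r
library(readr)

# Hypothetical helper: read both columns as character, then convert the
# currency column with parse_number().
read_gifts <- function(path) {
  gifts <- read_csv(
    path,
    col_types = cols(ID = col_character(), giftAmount = col_character())
  )
  gifts$giftAmount <- parse_number(gifts$giftAmount)
  gifts
}

# Recreate the sample data from the question in a temporary file.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("ID,giftAmount", '1,"$25.00"', '2,"$50.00"', '3,"$0.00"', '4,"$235.45"'),
  csv_path
)

gifts <- read_gifts(csv_path)
gifts$giftAmount  # a double vector holding the four amounts
```

From the caller's side this is one call, although reading and converting still happen as two steps inside the function.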

stat_geek · September 8, 2020, 4:36am

Thanks! Does that actually process the data in a single step or make it appear to process in a single step? I was hoping to avoid the extra processing but this seems to have it anyways.

technocrat · September 8, 2020, 6:16am

A function, f, is as good as a single step. What goes on under the hood is of no concern if it works.

nirgrahamuk · September 8, 2020, 9:10am

You can do it in one line with readr::read_csv if you know the expected column types.

read_csv("/Users/rc/Desktop/grist.csv",col_names = TRUE,
                   col_types = "nn")

# A tibble: 4 x 2
     ID giftAmount
  <dbl>      <dbl>
1     1        25 
2     2        50 
3     3         0 
4     4       235.
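
For comparison, here is a sketch of the same idea written in the cols() style of the original question, swapping col_double() for col_number(). The temporary file stands in for ./rawData/myData.csv and is an assumption, as is the comma-separated header:

```r
library(readr)

# Stand-in for the real file: the sample rows from the question.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("ID,giftAmount", '1,"$25.00"', '2,"$50.00"', '3,"$0.00"', '4,"$235.45"'),
  csv_path
)

my_col_names <- c("ID", "giftAmount")
# col_number() tolerates the currency sign; col_double() does not.
my_col_types <- cols(ID = col_character(), giftAmount = col_number())

raw_data <- read_csv(
  csv_path,
  col_names = my_col_names,
  col_types = my_col_types,
  skip = 1  # skip the header row because names are supplied explicitly
)

raw_data$giftAmount  # numeric, with the "$" removed during the read
```

This keeps ID as character, as in the original code, and needs no follow-up conversion.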

stat_geek · September 8, 2020, 10:54pm

Thanks for your reply but I'm explicitly looking for a solution that is efficient and multiple steps are inefficient. And given other programming languages support this functionality quite easily I'm surprised R doesn't have some functionality for it. Of course I can read everything in as character and convert it, but that's not my question.

Your response does not answer my question, as I'm explicitly asking about what goes on under the hood.

stat_geek · September 8, 2020, 11:19pm

Thank you!

Can you clarify what exactly col_types = "nn" does? Would you also be able to point me to where I would find that type of information in the documentation? I'm also trying to learn how to read and understand the R documentation.

I found this PDF but don't see any reference to this type of notation in the cols() specifications.

technocrat · September 9, 2020, 1:14am

The key to reading R documentation is school algebra, f(x) = y.

In either the R console or the RStudio console type

?read_csv

to get the function signature

read_csv(file, col_names = TRUE, col_types = NULL,
  locale = default_locale(), na = c("", "NA"), quoted_na = TRUE,
  quote = "\"", comment = "", trim_ws = TRUE, skip = 0,
  n_max = Inf, guess_max = min(1000, n_max),
  progress = show_progress(), skip_empty_rows = TRUE)

Each of file, col_names, col_types and so on is an argument to f, with or without a default value (col_types, for example, defaults to NULL). The possible values are described under "Arguments".

The first argument is crucial. Passing the wrong type of object to a function (e.g., a character type when a numeric is expected) is a common source of errors. Then look at the description of col_types:

col_types One of NULL, a cols() specification, or a string. See vignette("readr") for more details.

If NULL, all column types will be imputed from the first 1000 rows on the input. This is convenient (and fast), but not robust. If the imputation fails, you'll need to supply the correct types yourself.

If a column specification created by cols(), it must contain one column specification for each column. If you only want to read a subset of the columns, use cols_only().

Alternatively, you can use a compact string representation where each character represents one column: c = character, i = integer, n = number, d = double, l = logical, f = factor, D = date, T = date time, t = time, ? = guess, or _/- to skip the column.
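
As an illustration of the compact notation, the sketch below reads the same sample data once with the string "cn" and once with the equivalent cols() specification. The temporary file and the variable names are assumptions made for the example:

```r
library(readr)

# Sample data from the question, written to a temporary file.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("ID,giftAmount", '1,"$25.00"', '2,"$50.00"', '3,"$0.00"', '4,"$235.45"'),
  csv_path
)

# Compact form: one letter per column, in file order.
# "c" = character for ID, "n" = number for giftAmount.
compact <- read_csv(csv_path, col_types = "cn")

# Explicit form: the same specification spelled out with cols().
explicit <- read_csv(
  csv_path,
  col_types = cols(ID = col_character(), giftAmount = col_number())
)

sapply(compact, class)  # column classes chosen by the compact string
identical(compact$giftAmount, explicit$giftAmount)
```

The compact string depends on column order, while cols() matches columns by name, so the explicit form is the safer choice if the file layout may change.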

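The answer to the question is in the last paragraph: n is a number, so nn is two number columns. A number column (col_number()) ignores non-numeric characters such as the currency sign, which is why "$25.00" is read as 25.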

Notice a subtle gotcha: the value shown as 235. is a double (235.45, rounded for display), not an integer. That makes post-processing with the follow-on function preferable when a variable number of digits is to be expected.
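
A short sketch to check what is actually stored, as opposed to what the tibble prints. It assumes the default pillar setting of three significant digits, which is what produces the 235. in the output above:

```r
library(readr)

amount <- parse_number("$235.45")

is.double(amount)           # the parsed value is a double
format(amount, nsmall = 2)  # the cents are still there

# The tibble printout rounds for display only.
gift <- tibble::tibble(giftAmount = amount)
print(gift)

# Asking pillar for more significant digits should show the full value.
old <- options(pillar.sigfig = 5)
print(gift)
options(old)  # restore the previous setting
```

Nothing is lost in the read itself; only the printed representation is shortened.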

Do not neglect, either, the Value section, which describes what f returns; the Examples, which should be run as a test of understanding f; and any references for underlying details of the algorithms involved.

technocrat · September 9, 2020, 5:42am

Not.

Let f(x) = y and g(f(x)) = y. One function is wrapped in another. The added step has no possible consequence in the execution time for any interactive use. (And if it needs to scale to the point where it does, it won't be interactive and will be rewritten in a compiled imperative language — unless Haskell or another functional language is used.)
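
Anyone who wants to measure the difference rather than argue about it could try a rough sketch like the one below. The file is synthetic, the row count is arbitrary, and one_step and two_step are hypothetical names; timings will vary by machine and readr version:

```r
library(readr)

# Build a synthetic file of currency-formatted amounts.
set.seed(1)
n <- 1e5
amounts <- round(runif(n, 0, 500), 2)
csv_path <- tempfile(fileext = ".csv")
write_csv(
  data.frame(ID = seq_len(n), giftAmount = sprintf("$%.2f", amounts)),
  csv_path
)

# f: parse the currency column while reading.
one_step <- function(path) read_csv(path, col_types = "cn")

# g(f(x)): read as character, then convert afterwards.
two_step <- function(path) {
  out <- read_csv(path, col_types = "cc")
  out$giftAmount <- parse_number(out$giftAmount)
  out
}

system.time(a <- one_step(csv_path))
system.time(b <- two_step(csv_path))

all.equal(a$giftAmount, b$giftAmount)  # both routes give the same numbers
```

system.time() is a blunt instrument; repeating each call several times gives a fairer picture than a single run.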

read_csv can take an argument specifying the type of each column to read. Specifying that argument is no more efficient than piping to another function in terms of keystrokes and is at least as cognitively burdensome. And it doesn't address the generality of the problem because there is no guarantee that the file to be read will have any particular number or order of columns.

Whether a functional program accords with experience with other languages is beside the point. R presents as a rich system of library functions (or subroutines if that is how they must be understood) that can be composed as first class objects. The {base} package was developed by, and for, statisticians, and contributed packages survive or fade on their reputation for reliable implementations of algorithms with impeccable published pedigrees. A rich subsidiary set of packages has grown around it that facilitates the use of R for data scrubbing and ancillary tasks. They may, or may not, emulate how it is done elsewhere. For the most part they don't try.

Using a screwdriver handle as a hammer, a jackplane as a chisel or a circular saw to rip a long board are equally possible and equally inadvisable as laying R into a procedural procrustean bed.
