Hello, I am new to R programming and struggling to solve what appears to be a simple problem in cleaning a dataset. If anyone browsing has the knowledge and could share it with me that would be fantastic for my learning.
Here is my problem:
I am trying to manipulate the column "unit_price" which is a Class :character and the values within the column structured as "0,90 €".
I have created a new column named "item_price_euros" with the Class :double to be able to do calculations with the numbers. I am stuck with converting the values from "0,90 €" into "0.90".
Is this a simple case of somehow transforming the delimiter from a comma to a period and dropping the tailing "{space}€"?
you can use a package like the stringr package which comes with very handy string manipulation functions. Wrap those functions into a user written function and you can reshape your currency columns into a numeric format:
x <- '0,90 €'
library('stringr')
x |>
str_remove('\\s€$') |>
str_replace(',', '\\.') |>
as.numeric()
#> [1] 0.9
### as function for convenience
cur_to_num <- function(x){
x |>
str_remove('\\s€$') |>
str_replace(',', '\\.') |>
as.numeric()
}
Data <- data.frame(
cur_col = paste(format(sample(seq.default(0,100,0.01), 1000), decimal.mark = ',', big.mark = '.'), "€")
)
Data$num_col <- cur_to_num(Data$cur_col)
str(Data)
#> 'data.frame': 1000 obs. of 2 variables:
#> $ cur_col: chr "83,44 €" "23,96 €" "14,00 €" "84,71 €" ...
#> $ num_col: num 83.4 24 14 84.7 28.9 ...
The function takes a formatted value like you have stated and just removes the whitespace and the euro currency with a regex, then swaps the comma for a decimal point and converts it to numeric.
Just as a side note for others - a less comfortable way is using base R for string manipulations. It will be often times faster then stringr but as I said less comfortable:
### base R
x <- "24,95 €"
cur_to_num_base <- function(x){
### position of the white space minus 1
pos_ws <- regexpr('\\s',x) - 1
### get the substring
substr <- substring(x, 1, pos_ws)
### replace comma with decimal point
x <- sub(',','\\.',substr)
as.numeric(x)
}
all.equal(cur_to_num(x), cur_to_num_base(x))
#> [1] TRUE
microbenchmark::microbenchmark(
stringr = cur_to_num(x),
base = cur_to_num_base(x),
times = 10000
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> stringr 181.9 233.45 375.99381 278.2 391.95 8716.3 10000
#> base 39.8 55.10 86.38915 69.1 92.25 33528.9 10000