How to manipulate column data

Hello, I am new to R programming and struggling to solve what appears to be a simple problem in cleaning a dataset. If anyone browsing has the knowledge and could share it with me that would be fantastic for my learning.

The dataset I am using can be found here: French bakery daily sales | Kaggle

Here is my problem:
I am trying to manipulate the column "unit_price" which is a Class :character and the values within the column structured as "0,90 €".

I have created a new column named "item_price_euros" with the Class :double to be able to do calculations with the numbers. I am stuck with converting the values from "0,90 €" into "0.90".

Is this a simple case of somehow transforming the delimiter from a comma to a period and dropping the tailing "{space}€"?

All advice and criticism is very welcome.

Hello,

you can use a package like the stringr package which comes with very handy string manipulation functions. Wrap those functions into a user written function and you can reshape your currency columns into a numeric format:

x <- '0,90 €'

library('stringr')
x |>
  str_remove('\\s€$') |>
  str_replace(',', '\\.') |>
  as.numeric()
#> [1] 0.9

### as function for convenience
cur_to_num <- function(x){
  x |>
    str_remove('\\s€$') |>
    str_replace(',', '\\.') |>
    as.numeric()
}

Data <- data.frame(
  cur_col = paste(format(sample(seq.default(0,100,0.01), 1000), decimal.mark = ',', big.mark = '.'), "€")
)

Data$num_col <- cur_to_num(Data$cur_col)
str(Data)
#> 'data.frame':    1000 obs. of  2 variables:
#>  $ cur_col: chr  "83,44 €" "23,96 €" "14,00 €" "84,71 €" ...
#>  $ num_col: num  83.4 24 14 84.7 28.9 ...

Created on 2022-11-19 with reprex v2.0.2

The function takes a formatted value like you have stated and just removes the whitespace and the euro currency with a regex, then swaps the comma for a decimal point and converts it to numeric.

Kind regards

1 Like

Thank you so much! I still am not very fluent with the libraries and their functions but I appreciate you taking the time.

Well, you can ask if anything is not clear. I will answer you to the best of my knowledge. :slight_smile:

1 Like

Everything worked exactly as I had imagined it could, just needed a little bit of that FactOREO golden touch ;^)

1 Like

Just as a side note for others - a less comfortable way is using base R for string manipulations. It will be often times faster then stringr but as I said less comfortable:

### base R
x <- "24,95 €"
cur_to_num_base <- function(x){
  ### position of the white space minus 1
  pos_ws <- regexpr('\\s',x) - 1
  ### get the substring
  substr <- substring(x, 1, pos_ws)
  ### replace comma with decimal point
  x      <- sub(',','\\.',substr)
  as.numeric(x)
}

all.equal(cur_to_num(x), cur_to_num_base(x))
#> [1] TRUE

microbenchmark::microbenchmark(
  stringr = cur_to_num(x),
  base    = cur_to_num_base(x),
  times   = 10000
)
#> Unit: microseconds
#>     expr   min     lq      mean median     uq     max neval
#>  stringr 181.9 233.45 375.99381  278.2 391.95  8716.3 10000
#>     base  39.8  55.10  86.38915   69.1  92.25 33528.9 10000

Created on 2022-11-19 with reprex v2.0.2

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.