Numeric Data cleanup


#1

Hi. I have financial transactions that I read from fixed width files, as that is the format that the source system provide them in. There are 131 columns in total, of which 27 columns are numerical fields. The problem is that the values have a lot of leading zeros’s, but the biggest issue is that the negative sign is at the end of the string: “000000016987.8900-”. I found this on the web and it works fine, but it is very slow if I run it on millions of transactions and on 27 columns.

idxtrades$exchlocaltobase <- sapply(strsplit(idxtrades$exchlocaltobase, "(?=-)", perl = TRUE), function(x) as.numeric(paste(sort(x), collapse="")))
idxtrades$exchclienttobase <- sapply(strsplit(idxtrades$exchclienttobase, "(?=-)", perl = TRUE), function(x) as.numeric(paste(sort(x), collapse="")))
idxtrades$exchbasetosett <- sapply(strsplit(idxtrades$exchbasetosett, "(?=-)", perl = TRUE), function(x) as.numeric(paste(sort(x), collapse="")))
idxtrades$iocexec <- sapply(strsplit(idxtrades$iocexec, "(?=-)", perl = TRUE), function(x) as.numeric(paste(sort(x), collapse="")))
.............

I sure there is a better way, I’m just not that familiar with R to figure it out.


#2

Hi,

For helping us help you, please provide a reproductible example. Take a look at the reprex :package: as an R tool to do that.

If you are new to R, I suggest you to learn a bit more on the different tools that exits. The tidyverse ecosystem could help you work with data and gets results quickly. This online book will give you a great overview of the possibilities

About your issue, without a reproductible example, we could just see that you try to apply same function over each column of a dataframe. There are indeed some tools to help you with that. Not sure how faster it could be as you table seems quite big, but there will be other tools for that too. (R is full of such great tools :slight_smile:)
Here some hints to look into:

  • You need to transform character : stringr will help you (with str_split, str_c, …)
  • You need to work with data.frame and iterate through columns : dplyr will help (with functions like mutate, mutate_all, …)

These two packages are from the tidyverse. See the website for starting place to learn


#3

This is an example that both shows how “for loops are slow” in R, while demonstrating that avoiding for loops through the apply functions doesn’t actually solve the issue of a slow inner loop (which is actually why they are generally slow). The solution is to actually vectorize the work. In this case, the vast majority of the time is devoted to the loop implied by sapply, which ends up running the inline defined function once per entry. This causes a lot of overhead, slowing the process considerably.

My option_vectorized() in the code below avoids any non-vectorized code. There’s other changes that may also speed it up, but a little bit of testing implies that the sapply is the slow part of the original code. I’m sure that could be rewritten in a way that doesn’t use sapply, but I wasn’t willing to work that hard this morning. :wink:

suppressPackageStartupMessages(library(tidyverse))

# Create some sample data with as many rows as I'm willing to wait for
numbers <- rnorm(1e5, mean = 0, sd = 1e5)

number_strings <- formatC(abs(numbers),
                          width = "15",
                          format = "f",
                          flag = "0",
                          digits = 4) %>%
  paste0(if_else(numbers < 0, "-", ""))

# The original function in question
option_strsplit <- function(input_strings) {
  sapply(strsplit(input_strings, "(?=-)", perl = TRUE),
         function(x) as.numeric(paste(sort(x), collapse="")))
}

# A different algorithm that avoids per-item functions
option_vectorized <- function(input_strings) {
  # Find which strings have a negative sign
  neg_strings <- str_detect(input_strings, "-$")
  
  # Trim the negative sign if needed and convert to numeric
  # Provides the absolute value
  abs_numeric <- as.numeric(if_else(neg_strings,
                                    str_sub(input_strings, end = -2L),
                                    input_strings))
  
  # Multiply the absolute value by -1 if needed
  abs_numeric * (neg_strings * (-2) + 1)
}

# Benchmark 10 runs
microbenchmark::microbenchmark(
  option_strsplit(number_strings),
  option_vectorized(number_strings),
  times = 10L
)
#> Unit: milliseconds
#>                               expr        min       lq     mean    median
#>    option_strsplit(number_strings) 2423.61685 2507.386 2693.591 2607.4278
#>  option_vectorized(number_strings)   86.20587  116.448  150.889  126.4484
#>         uq       max neval
#>  2960.6643 3105.8842    10
#>   178.7068  288.6861    10

As @cderv mentioned, mutate_all or similar in the dplyr package is a good way to apply this to each column.


#4

In terms of speeding things up, as nick said, vectorising the work do do stuff on a much as possible in one go is the best plan.

In the service of that goal, I would ask if there are any hyphens you do want to keep (for instance in the non-numeric columns)- do you actually need to check for which columns you are changing.

As another thought, with leading zero and trailing hyphens, do all of the entries have the same number of characters, in which case it may be possible to treat it as a splitting fixed width text problem, which would save all checking the detail of the characters. This is based on the general principle of how you think about your data determines what you can do to it.


#5

Hi @martindut,

Considering the amount of data, I’d use data.table:

library(data.table)
# Convert to data.table, unless file has been imported with fread.
setDT(idxtrades)
# move dash from last to first position 
fix_dash <- function(x)
    as.numeric( 
        ifelse( 
            substring(x, nchar(x)) == '-', 
            paste0('-', substr(x, 1, nchar(x) - 1)), 
            x
        ) 
    )
# list all numeric columns to recode
cols2update <- c('exchlocaltobase', 'exchclienttobase', 'exchbasetosett', 'iocexec')
# update all columns at once
idxtrades[, (cols2update) := lapply(.SD, fix_dash), .SDcols = cols2update ]

Hope this helps,
Luca


#6

Thanks for all the replies and suggestions. I will definitely try them all and give feedback.


#7

Hi @nick. Thanks for this. The option_vectorized works very well, but I’m still very new to R, so I’m trying to figure out how to apply this to only certain columns.


#8

Hi @thoughtfulnz, I would prefer to run it only on the known numeric columns. I suspect that that all the entries have the same numebr of characters, but I will have to verify.


#9

Hi @datamaps. Thanks for this solution. I works very well seening that I can specify the columns, however it is a fair bit slower against @nick solution if I only run it on one column.


#10

Hi @martindut.

You can actually change the function in my previous comment using @nick approach:

fix_dash <- function(x)
    if_else( 
        str_detect(x, "-$"), 
        -1 * as.numeric(str_sub(x, end = -2L)),
        as.numeric(x)
    )

Now, running time on a vector is more or less the same
But you can run it on multiple columns:

k <- 10
m <- data.table(matrix(rep(number_strings, k), ncol = k))
cnames <- paste0('X', 1:k)
setnames(m, cnames)
system.time( m[, (cnames) := lapply(.SD, fix_dash), .SDcols = cnames] )

#11

data.table is one option. If you want to use dplyr, the mutate_at function should work.

result <- mutate_at(idxtrades, vars(exchlocaltobase, other_col_names), fixing_function)

#12

Just thinking a little bit more, treating it as a text split problem with tidyr, they don’t even need to be fixed width columns, as you can “separate” the text in the column bases on the hyphen and ignore the empty set after the split

df <- data.frame(example1=c("001-","002-","003-","004-"),
                 example2=c("001-","002-","003-","004-"))

library(tidyr)
library(dplyr)
df %>%
  separate(example1, into=c("example1"), sep="-", extra="drop") %>%
  separate(example2, into=c("example2"), sep="-", extra="drop")

#13

The “hyphen” is providing information on the sign of the value, though, so it can’t be dropped entirely.


#14

Whoops I missed that,
if the sign might or might not be present for a given transaction, it goes back to being a fixed number of numerals problem if separating, as you want to preserve the -

df <- data.frame(example1=c("001-","002-","003-","004-"),
                 example2=c("001-","002-","003","004"))

library(tidyr)
library(dplyr)

#treating as numeric, losing leading 0s
df %>%
  separate(example1, into=c("example1", "sign1"), sep=3, fill="right") %>%
  separate(example2, into=c("example2", "sign2"), sep=3, fill="right") %>%
  mutate(example1 = if_else(sign1 == "-", -1 * as.double(example1), as.double(example1)),
         example2 = if_else(sign2 == "-", -1 * as.double(example2), as.double(example2)))
#treating as text, keeping leading 0s
df %>%
  separate(example1, into=c("example1", "sign1"), sep=3, fill="right") %>%
  separate(example2, into=c("example2", "sign2"), sep=3, fill="right") %>%
  mutate(example1 = paste0(sign1,example1),
         example2 = paste0(sign2,example2))


#15

Hi All. Thanks for all the suggestions. At the end I went with mutate_at, as suggested by @nick, with fix_dash function.