How can I convert a for loop into apply in R?

I have a data frame with over 30 thousand rows that looks like this:

 ID   Name   Attr
  1   John     A1
  2   Peter    A1
  3   Mark     A2
  4   Joe      A2
  5   Tim      A3
  .    .        .
  .    .        .
  .    .        .

and I am trying to encrypt certain columns of the dataframe using the sodium package. I currently have a function that does something like this:

key <- "somespecialkey"

encryptData <- function(df) {
   
   # Preallocate character vectors (the hex strings stored below are text)
   name <- character(nrow(df))
   nonce <- character(nrow(df))
  
   # Loop through each row to encrypt the Name Column
   for (i in 1:nrow(df)) {

     # 1. Generate a singular nonce for each row of data.
     rnonce <- random(24)
 
     # 2. Encrypt name column of df
     serializedName <- serialize(df$Name[i], NULL)
     cipher <- data_encrypt(serializedName, key, rnonce)

     # 3. Store the encrypted name, class and nonce
     name[i] <- bin2hex(cipher)
     nonce[i] <- bin2hex(rnonce)
 
   }
  
   # Bind the new vectors back to the dataframe
   df$Name <- name
   df$nonce <- nonce
  
   return(df)
}

and when I want to encrypt the data, I just call this function as such:

df <- encryptData(df)

However, as my dataset is huge with many rows, I realise that the for loop is taking too long to perform this request. I'm trying to optimize this function using an apply function instead of a for loop, but am struggling to do so. Can anyone shed any light on this?

Thanks!

Notes:

  • It helps to share usable data with questions like this (e.g. via dput). I've included some in my answer.
  • I've included code that simulates your use case, using the randomNames package to generate a data frame with 30000 IDs and random names.
  • Though you haven't said so explicitly, I deduce from your code that you'd prefer a base R solution. That would not be my preference, but I've hacked together a base R solution which, while perhaps not particularly elegant, should contain all the pieces you need to tailor your own.
  • I also provide a tidyverse solution, which would be my preference (and in case others find it useful).
  • NB: Neither of these approaches is particularly performant as the data grows (I haven't benchmarked either against your loop, but I'd be surprised if they were markedly faster). Having pretty much zero familiarity with the sodium package, I think this is mainly because you generate a unique nonce for every row of your data, effectively forcing sodium's vectorised data_encrypt function to operate like a non-vectorised function. It is not immediately clear to me why one would want to do this.

Data:

# create data
df <- structure(
  list(
    ID = c(1, 2, 3, 4, 5),
    Name = c("John", "Peter",  "Mark", "Joe", "Tim"),
    Attr = c("A1", "A1", "A2", "A2", "A3")
  ),
  row.names = c(NA, -5L),
  class = c("tbl_df", "tbl", "data.frame")
)

# create "large", "simulated" data frame to replicate user's use-case
df_large <- data.frame(
  ID = 1:30000,
  Name = randomNames::randomNames(30000, which.names = 'first')
)

# create encryption key (that works with sodium functions)
library(sodium)
key <- sha256(charToRaw("somespecialkey"))
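A small check of why the key is hashed first: as far as I know, data_encrypt expects a raw 32-byte key rather than a character string, and sha256 returns exactly 32 raw bytes. The passphrase variable below is just the example string from the question; treat this as a sketch, not a recommendation on key management.

```r
library(sodium)

# Example passphrase from the question (a character string, not a usable key)
passphrase <- "somespecialkey"

# Hash the passphrase bytes to obtain a raw vector of fixed length
key <- sha256(charToRaw(passphrase))

is.raw(key)   # TRUE
length(key)   # 32

# A nonce for data_encrypt is 24 random bytes, as in the code above
rnonce <- random(24)
length(rnonce)  # 24
```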

Base R approach:

# function to encrypt a vector of inputs with a given key and return a data frame
# with a column for the resultant encrypted input and nonce values
encrypt_vector <- function(input, key) {
  list_output <- lapply(input, function(x) {
    # 1. Generate a singular nonce for each row of data.
    rnonce <- random(24)
    
    # 2. Encrypt name column of df
    serializedName <- serialize(x, NULL)
    cipher <- data_encrypt(serializedName, key, rnonce)
    
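    # 3. Return the hex-encoded ciphertext and nonce as a named character vector
    c(name = bin2hex(cipher), nonce = bin2hex(rnonce))
  })
  # Row-bind the results so that name and nonce become plain character columns
  data.frame(do.call(rbind, list_output), stringsAsFactors = FALSE)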
}

# apply to small data
x <- encrypt_vector(df$Name, key)
df$Name <- x$name
df$nonce <- x$nonce

# apply to large data (slow)
x <- encrypt_vector(df_large$Name, key)
df_large$Name <- x$name
df_large$nonce <- x$nonce
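Since neither approach was benchmarked against the original loop, here is a rough way to time the two yourself. This is only a sketch: encrypt_one is a hypothetical helper, the 1000 made-up names stand in for real data, and the timings will vary by machine.

```r
library(sodium)

key <- sha256(charToRaw("somespecialkey"))

# Hypothetical stand-in data: 1000 made-up names
nms <- paste0("name", seq_len(1000))

# Hypothetical helper: encrypt one value, return hex ciphertext and nonce
encrypt_one <- function(x, key) {
  rnonce <- random(24)
  cipher <- data_encrypt(serialize(x, NULL), key, rnonce)
  c(name = bin2hex(cipher), nonce = bin2hex(rnonce))
}

# for loop with preallocated character vectors
loop_time <- system.time({
  name <- character(length(nms))
  nonce <- character(length(nms))
  for (i in seq_along(nms)) {
    res <- encrypt_one(nms[i], key)
    name[i] <- res[["name"]]
    nonce[i] <- res[["nonce"]]
  }
})

# vapply returning a 2 x n character matrix (row 1: name, row 2: nonce)
apply_time <- system.time({
  m <- vapply(nms, encrypt_one, character(2), key = key, USE.NAMES = FALSE)
})

# Elapsed seconds for each variant
print(c(loop = loop_time[["elapsed"]], vapply = apply_time[["elapsed"]]))
```

If the two timings come out close, the cost is in the per-row encryption itself rather than in the looping construct.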

Tidyverse approach:

library(tidyverse)  # dplyr, purrr, tidyr and tibble are used below

# function to encrypt a vector of inputs with a given key and return a list of
# one-row tibbles, each holding the encrypted input and its nonce
encrypt_vector <- function(input, key) {
  map(input, function(x) {
    # 1. Generate a singular nonce for each row of data.
    rnonce <- random(24)
    
    # 2. Encrypt name column of df
    serializedName <- serialize(x, NULL)
    cipher <- data_encrypt(serializedName, key, rnonce)
    
    # 3. Return the hex-encoded ciphertext and nonce
    tibble(name = bin2hex(cipher), nonce = bin2hex(rnonce))
  })
}

# apply to small data (start from the original, unencrypted df)
df <- df %>%
  mutate(encryption = encrypt_vector(Name, key)) %>%
  select(-Name) %>%
  unnest(encryption) %>%
  rename(Name = name)

# apply to large data (start from the original, unencrypted df_large)
df_large <- df_large %>%
  mutate(encryption = encrypt_vector(Name, key)) %>%
  select(-Name) %>%
  unnest(encryption) %>%
  rename(Name = name)

Hi Hendrik,

Thanks for your detailed response! Below are my replies to your notes:

Replies:

  • Sorry about that, I will share usable data in future!
  • I'm open to either a base R or a tidyverse solution; I just didn't think of a tidyverse way of doing this.
  • I agree that neither of these approaches is performant as the data grows. Do you have any suggestions on how I could speed up this encryption and decryption process? I've tried your base R solution and it works, but it still takes a long time to run. The tidyverse solution took too long and I ended up getting the following error:

"Error: vector memory exhausted (limit reached?)"

  • I'm generating a unique nonce for each row of the dataframe because I need a way to reference back to this unique nonce, as the data needs to be decrypted each time the shiny app runs. Also, I need this data to be encrypted and stored in a MySQL database for security reasons. Not sure if this is the best approach, if you know of any better way please enlighten me!
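For the decryption side mentioned above, the stored hex strings can be turned back into raw bytes and passed to data_decrypt together with the matching nonce. A minimal sketch, assuming the Name and nonce columns were hex-encoded as in the answer and that each encrypted value was a single string; decrypt_vector is a hypothetical helper name, not part of sodium.

```r
library(sodium)

key <- sha256(charToRaw("somespecialkey"))

# Encrypt two example names the same way as in the answer
enc <- lapply(c("John", "Peter"), function(x) {
  rnonce <- random(24)
  cipher <- data_encrypt(serialize(x, NULL), key, rnonce)
  c(name = bin2hex(cipher), nonce = bin2hex(rnonce))
})
enc <- data.frame(do.call(rbind, enc), stringsAsFactors = FALSE)

# Hypothetical helper: decrypt hex-encoded ciphertexts with their nonces
decrypt_vector <- function(cipher_hex, nonce_hex, key) {
  vapply(seq_along(cipher_hex), function(i) {
    plain <- data_decrypt(hex2bin(cipher_hex[i]), key, hex2bin(nonce_hex[i]))
    unserialize(plain)  # reverses the serialize() call made before encryption
  }, character(1))
}

decrypt_vector(enc$name, enc$nonce, key)
```

The same key must be available wherever the data is decrypted, for example in the shiny app, so it should be stored separately from the database.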
