Help is needed to use apply (or similar) to update values from one dataframe to many.

jovenden · May 9, 2022, 5:25am

library(tidyverse)
library(dplyr)

# Input test data

Component2 <- data.frame(
        stringsAsFactors = FALSE,
  Component_Product_Code = c("Component6",
                             "Component7","Component8","Component9","Component22"),
                Quantity = c(1L, 2L, 4L, 28L, 90L),
        Wastage_Quantity = c(0L, 0L, 0L, 0L, 0L),
               Unit_Cost = c(127.2, 112.02, 23.33, 0.0342, 0.0453),
              Total_Cost = c(127.2, 224.04, 93.32, 0.9576, 4.077),
       Can_Auto_Assemble = c("Yes", "Yes", "Yes", "Yes", "Yes"),
    Can_Auto_Disassemble = c("No", "No", "No", "No", "No"),
             IsObsoleted = c("No", "No", "No", "No", "No"),
         Expense_Account = c(NA, NA, NA, 51120L, 51120L),
                Comments = c(NA, NA, NA, NA, NA)
)

Component9 <- data.frame(
        stringsAsFactors = FALSE,
  Component_Product_Code = c("Component11",
                             "Component12","Component13","Component14",
                             "Component15"),
                Quantity = c(5L, 5L, 2L, 1L, 1L),
        Wastage_Quantity = c(0L, 0L, 0L, 0L, 0L),
               Unit_Cost = c(107, 53, 49.3629, 55.3854, 134.186),
              Total_Cost = c(535, 265, 98.7258, 55.3854, 134.186),
       Can_Auto_Assemble = c("Yes", "Yes", "Yes", "Yes", "Yes"),
    Can_Auto_Disassemble = c("No", "No", "No", "No", "No"),
             IsObsoleted = c("No", "No", "No", "No", "No"),
         Expense_Account = c(51250L, 51244L, NA, NA, NA),
                Comments = c(NA, NA, NA, NA, NA)
)

Updated_price_list <- data.frame(
        stringsAsFactors = FALSE,
  Assembled_Product_Code = c("Component14", "Component23"),
              Total_Cost = c(12.5396, 9.9738)
)

# I have many thousands of data.frames. Here, I have given two examples called Component2 and Component9.
# Each data frame has a column called Total_Cost that specifies costs of components that it is made of.
# For example, dataframe Component2 consists of Components6, 7, 8, 9 and 22 where the cost of each component is
# found in the Total_Cost column.
# I need to update the cost of components across the thousands of dataframes according to information in a
# dataframe called Updated_price_list.
# I have given an example of Updated_price_list that contains the costs of only two components - 14 and 23.
# Dataframe Component9 contains component14. The cost of component14 needs to be updated from the price-list.
# For example, it needs to change from 55.3854 to 12.5396.
# Dataframe Component9 does not contain component 23, nor does Component2 contain either component 14 or 23,
# so no change is needed.
# I am working on a solution that uses lapply and a function.

# Placing the data frames into a list.
List <- list(Component2, Component9)

# Making a function to update the Total_Cost column. 
Search_replace <- function(x){
  (x$Component_Product_Code == Updated_price_list$Assembled_Product_Code)
  x$Component_Product_Code <- Updated_price_list$Assembled_Product_Code
}

# Using lapply to update each data frame.
Updated_dataframes <- (lapply(List, Search_replace))

# However, this leads to errors.
# Your help in advance would be very much appreciated.

nirgrahamuk · May 9, 2022, 10:58am

i think its


# Making a function to update the Total_Cost column. 
Search_replace <- function(x){
  dplyr::rows_update(x=x,
                     y=Updated_price_list %>% rename("Component_Product_Code"="Assembled_Product_Code"),
                     by="Component_Product_Code",
                     unmatched = "ignore")
}

Updated_dataframes <- (lapply(List, Search_replace))

jovenden · May 10, 2022, 3:31am

Thanks, nigrahamuk for this solution.

However, it returns an error when I run it on the reprex.

Error in dplyr::rows_update():
! ... must be empty.
x Problematic argument:
• unmatched = "ignore"
Run rlang::last_error() to see where the error occurred.

Also, I think the lapply line needs to be "lapply(List, Search_replace)" rather than "Updated_dataframes <- (lapply(List, Search_replace))" as I'm expecting the dataframes to be updated in place (ie. in the Global Environment). Is this correct?

jovenden · May 10, 2022, 5:47am

Hello again nirgrahamuk - I've been investigating the operation of the rows_update function from library(dplyr). I have found that if all the values in y are not in x, then an error is generated that is not corrected by the inclusion of the unmatched = ignore command. I've pasted some output below to illustrate this. Best, Jenny

Component_test
Component_Product_Code Dummy_Cost Can_Auto_Disassemble Total_Cost
1 Component34 5.79 No 5.79
2 Component24 6.97 No 6.97
3 Component14 7.97 No 7.97
Updated_price_list
Component_Product_Code Total_Cost
1 Component14 12.5396
2 Component23 9.9738
rows_update(Component_test, Updated_price_list)
Matching, by = "Component_Product_Code"
Error in rows_update():
! Attempting to update missing rows.
Run rlang::last_error() to see where the error occurred.

rows_update(Component_test, Updated_price_list, unmatched = "ignore")
Error in rows_update():
! ... must be empty.
x Problematic argument:
• unmatched = "ignore"
Run rlang::last_error() to see where the error occurred.

jovenden · May 10, 2022, 6:31am

I think I've solved the issues, by using rows_upsert instead. This avoids the intersection error of rows_insert. Also, I have found that the argument "unmatched = ignore" is not needed, neither is the argument "by="Component_Product_Code".

The function is now
Search_replace <- function(x){
dplyr::rows_upsert(x=x,
y= Updated_price_list %>% rename("Component_Product_Code"="Assembled_Product_Code"),
)
}

I will need to remove spurious rows from some updated tables. If rows in y are not present in x, then rows_upsert appends the rows from y onto the bottom of the updated table.

Thanks for getting me started on the search for the solution to my original problem. Best, Jenny

nirgrahamuk · May 10, 2022, 8:28am

Your issue here is strange, it may reflect a problem with version numbers of the packages underlying your tidyverse installation, or perhaps a curiosity in the orderthat functions are loaded that causes weirdness through conflicts.
The prototypical example of the functionality I demonstrated might be

(myx <- data.frame(key=1,value=0))
(myy <- data.frame(key=c(1,2),value=c(2,2)))

#problem
dplyr::rows_update(myx,
                   myy)
# Matching, by = "key"
# Error in `dplyr::rows_update()`:
# ! `y` must contain keys that already exist in `x`.
# i The following rows in `y` have keys that don't exist in `x`: `c(2)`.
# i Use `unmatched = "ignore"` if you want to ignore these `y` rows.

# fixed
dplyr::row`s_update(myx,
                   myy,
                   unmatched = "ignore")
# Matching, by = "key"
#  key value
# 1   1     2

jovenden · May 11, 2022, 4:50am

Hi again Nirgrahamuk - Thanks for your reply. I was running dplyr v 1.0.8. When I upgraded to v 1.0.9, I was able to duplicate your example. Seems my problem is solved. Thanks for your help. Best, Jenny

system · May 18, 2022, 4:50am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.