How to Compare two data frames and get Diffrence file

salmansyed07 · October 23, 2019, 3:19pm

I have 9 sheets in single excel workbook ( have 2 work books, first is originally crated other is reviewed) I have to compare these sheets with same reviewed workbook in which my reviewer made changes.

I am using below codes for reading the entire workbook and transferring into the individual data frame as sheet in the workbook

# for reading sheets from the reviewed file

    workbook_new <- lapply(ws_name_new, function(x) read_excel(path = path_new, sheet = x, skip = 2, na = ""))
    workbook_new <- lapply(workbook_new, function(x) x[-2,])
    sheet_list_new <- workbook_new[sapply(workbook_new, function(x) dim(x)[1]) > 0]

# for reading sheets from the originally created file

    workbook_old <- lapply(ws_name_old, function(x) read_excel(path = path_old, sheet = x, skip = 2, na = ""))
    workbook_old <- lapply(workbook_old, function(x) x[-2,])
    sheet_list_old <- workbook_old[sapply(workbook_old, function(x) dim(x)[1]) > 0]

# Below code I am using for the removing the not required sheets

    # for new Workbook
    ws_name_new <- excel_sheets(path = path_new)
    ws_name_new <- ws_name_new[-grep("*_Cal", ws_name_new)]
    ws_name_new <- ws_name_new[-grep("*_cal", ws_name_new)]
    ws_name_new <- ws_name_new[-gre`enter code here`p("*Summary", ws_name_new)]
    ws_name_new <- ws_name_new[-c(1, 9, 16, 17, 18, 19, 20, 21, 22,23,24,25, 26, 27)]
    
    # For Old file
    ws_name_old <- excel_sheets(path = path_old)
    ws_name_old <- ws_name_old[-grep("*_Cal", ws_name_old)]
    ws_name_old <- ws_name_old[-grep("*_cal", ws_name_old)]
    ws_name_old <- ws_name_old[-grep("*Summary", ws_name_old)]
    ws_name_old <- ws_name_old[-c(1, 9, 16, 17, 18, 19, 20, 21, 22,23,24,25, 26, 27)]

the approach I am following is first I am importing the every single sheet from the excel workbook and creating two data frames. df_1_new from the reviewed file and df_1_old from the originally created. Now I want to compare df_1_new with df_1_old and get the final difference file.

I have tried compare_df() function but its giving me two rows in difference file one the original row other is the row which is modified. I need only those rows which are changed by the reviewer.

I have 9 data frame for each of the work book so total is 18 which I need to compare and get the difference file. consider the given wxaple

Example:
Suppose I have below two data framed. df_1 is which I have created and DF_2 is my reviewer corrected. Now because the file is very huge and I have more then 100 column I can not check them one by one. I looking of the solution if both the data frame can be compare and the final out put I get is the columns and rows in which changes has been made


    key <- c(A1, B1, C1, D1)
    a <- c(10,20,30,40)
    b <- c('book', 'pen', 'textbook', 'pencil_case')
    c <- c(TRUE,FALSE,TRUE,FALSE)
    d <- c(2.5, 8, 10, 7)
    df_1_New <- data.frame(key,a,b,c,d)
    
    key <- c(A1, B1, C1, D1)
    a <- c(10,5,40,40)
    b <- c('book', 'pen', 'textbook', 'pen_case')
    c <- c(TRUE,FALSE,TRUE,TRUE)
    d <- c(2.5, 3, 10, 7)
    df_1_Old <- data.frame(key, a,b,c,d)
    
    ctable <- compare_df(df_1_New, df_1_Old, c("Key"), keep_unchanged_cols = FALSE)
    DF_1_Diff <- ctable$comparison_df
    
    DF_Diff<- DF_1_Diff %>% filter(chng_type == '+') %>%  select(-chng_type)

I want to filter out the changes only so I am applying above code and dropping the 'cnge_type' col as well. I also wanted drop last lets say 10 row from the df I can do that with use of dplyr in the same so that I can make me code more efficient.

NOTE : I have to perform same task other 8 also and the name of the data frame is following the as df_2_new, df_2_old, df_2_new, df_3_old and so on and so forth.

Can any one suggest me or guide me the efficient way.

grosscol · October 23, 2019, 3:30pm

It would make addressing your question easier if the example code could be run as is. To that end, the key values need to be wrapped in quotes, "Key" probably needs to be "key" in the arg to compare_df, and the package that compare_df comes from should be required before the function is called.

Along with the example inputs, could you give an example of the desired output? I'm having trouble sorting out what the desired result should look like.

salmansyed07 · October 23, 2019, 4:09pm

library(compareDF)

key <- c('A1', 'B1', 'C1', 'D1', "", "", "","", "25P", "75P", "Average", "Median", "25P", "75P", "Average", "Median")
a<- c(10,20,30,40, 10,20,30,40, 10,20,30,40,10,20,30,40)
b <- c('book', 'pen', 'textbook', 'pencil_case', 'book', 'pen', 'textbook', 'pencil_case', 'book', 'pen', 'textbook', 'pencil_case','book', 'pen', 'textbook', 'pencil_case')
c <- c(TRUE,FALSE,TRUE,FALSE, TRUE,FALSE,TRUE,FALSE, TRUE,FALSE,TRUE,FALSE, TRUE,FALSE,TRUE,FALSE)
d <- c(2,5, 8, 10, 7, 2,5, 8, 10, 7, 2,5, 8, 10, 7,1)

df_1_New <- data.frame(key,a,b,c,d)

key <- c('A1', 'B1', 'C1', 'D1', "", "", "","", "25P", "75P", "Average", "Median", "25P", "75P", "Average", "Median")
a <- c(10,5,40,40, 10,5,40,40,10,5,40,40, 10,5,40,40)
b <- c('book', 'pen', 'textbook', 'pen_case', 'book', 'pen', 'textbook', 'pen_case', 'book', 'pen', 'textbook', 'pen_case','book', 'pen', 'textbook', 'pen_case')
c <- c(TRUE,FALSE,TRUE,TRUE, TRUE,FALSE,TRUE,TRUE, TRUE,FALSE,TRUE,TRUE, TRUE,FALSE,TRUE,TRUE)
d <- c(2,5, 3, 10,2,5, 3, 10, 2,5, 3, 10, 2,5, 3, 10)
df_1_Old <- data.frame(key, a, b, c, d)

ctable <- compare_df(df_1_New, df_1_Old, c("key"), keep_unchanged_cols = FALSE)
DF_1_Diff <- ctable$comparison_df

if you try the above code you will find that in key column "average , percentile coming twice and there also some blank cell. I want to remove all and keep only the A1, B1, C1 and D1. you will also notice the chng column which showing '+' and '-' row. want to keep only the '+' value and drop the chng column.

Answer I am expecting . which column and row in which change has been made:-

key a b c d
B1 20 pen FALSE 5
C1 30 textbook TRUE 8
D1 40 pencil_case FALSE 10

but I am getting

!key chng_type a b c d
+ 10 book TRUE 7
+ 20 pen FALSE 2
+ 30 textbook TRUE 5
+ 40 pencil_case FALSE 8
- 10 book TRUE 2
- 5 pen FALSE 5
- 40 textbook TRUE 3
- 40 pen_case TRUE 10
25P + 10 book TRUE 10
25P + 10 book TRUE 8
25P - 10 book TRUE 2
75P + 20 pen FALSE 7
75P + 20 pen FALSE 10
75P - 5 pen FALSE 5
Average + 30 textbook TRUE 2
Average + 30 textbook TRUE 7
Average - 40 textbook TRUE 3
B1 + 20 pen FALSE 5
B1 - 5 pen FALSE 5
C1 + 30 textbook TRUE 8
C1 - 40 textbook TRUE 3
D1 + 40 pencil_case FALSE 10
D1 - 40 pen_case TRUE 10
Median + 40 pencil_case FALSE 5
Median + 40 pencil_case FALSE 1
Median - 40 pen_case TRUE 10

grosscol · October 23, 2019, 5:56pm

To make a table of where the changes are, you can pass through a pair of data frames and check equality. It's rough, but will probably suffice:

# Helper function that given two columns, 
#  emits TRUE where values are the same and FALSE where different.
changes_by_col <- function(lhs_col, rhs_col){
  if(is.factor(lhs_col)){
    as.character(lhs_col) != as.character(rhs_col)
  }else{
    lhs_col != rhs_col
  }
}

# Pass through two data frames in a column-wise fashion.
changed <- mapply(changes_by_col, df_new, df_old)

The output will be a matrix of TRUE and FALSE where the TRUEs indicate the value changed from one data frame to the other. E.g. which values are not the same between df_new and df_old?

changed <- mapply(changes_by_col, df_new, df_old)
changed
#       key     a     b     c     d
#[1,] FALSE FALSE FALSE FALSE FALSE
#[2,] FALSE  TRUE FALSE FALSE  TRUE
#[3,] FALSE  TRUE FALSE FALSE FALSE
#[4,] FALSE FALSE  TRUE  TRUE FALSE

You can use that to find the rows with changes:

rows_changed <- apply(changed, MARGIN=1, FUN=any)
rows_changed
#[1] FALSE  TRUE  TRUE  TRUE

row_numbers_changed <- c(1:nrow(df_new))[rows_changed]
row_numbers_changed
#[1] 2 3 4

You can use the rows changed as a filter or as an index for subsequent manipulation of the old or new data frames.

salmansyed07 · October 23, 2019, 6:42pm

thanks grosscol but that was not somwthing I am looking for.. I have tried below codes ...solved my purpose but not efficient and taking so much time

library(compareDF)

key <- c('A1', 'B1', 'C1', 'D1', "", "", "","", "25P", "75P", "Average", "Median", "25P", "75P", "Average", "Median")
a<- c(10,20,30,40, 10,20,30,40, 10,20,30,40,10,20,30,40)
b <- c('book', 'pen', 'textbook', 'pencil_case', 'book', 'pen', 'textbook', 'pencil_case', 'book', 'pen', 'textbook', 'pencil_case','book', 'pen', 'textbook', 'pencil_case')
c <- c(TRUE,FALSE,TRUE,FALSE, TRUE,FALSE,TRUE,FALSE, TRUE,FALSE,TRUE,FALSE, TRUE,FALSE,TRUE,FALSE)
d <- c(2,5, 8, 10, 7, 2,5, 8, 10, 7, 2,5, 8, 10, 7,1)

df_1_New <- data.frame(key,a,b,c,d)

key <- c('A1', 'B1', 'C1', 'D1', "", "", "","", "25P", "75P", "Average", "Median", "25P", "75P", "Average", "Median")
a <- c(10,5,40,40, 10,5,40,40,10,5,40,40, 10,5,40,40)
b <- c('book', 'pen', 'textbook', 'pen_case', 'book', 'pen', 'textbook', 'pen_case', 'book', 'pen', 'textbook', 'pen_case','book', 'pen', 'textbook', 'pen_case')
c <- c(TRUE,FALSE,TRUE,TRUE, TRUE,FALSE,TRUE,TRUE, TRUE,FALSE,TRUE,TRUE, TRUE,FALSE,TRUE,TRUE)
d <- c(2,5, 3, 10,2,5, 3, 10, 2,5, 3, 10, 2,5, 3, 10)
df_1_Old <- data.frame(key, a, b, c, d)

ctable <- compare_df(df_1_New, df_1_Old, c("key"), keep_unchanged_cols = FALSE)
DF_1_Diff <- ctable$comparison_df 
DF_1_Diff <- DF_1_Diff %>% 
    filter(chng_type == '+')%>% 
    select(-c(chng_type)) %>% 
    subset(key !='25P') %>% subset(key !='75P') %>% subset(key !="") %>% subset(key !='Median') %>% subset(key !='Average')

system · November 13, 2019, 6:42pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.