Comparing Multiple Columns Amongst Different Rows

Hi everyone,

So I'm fairly new to the RStudio world and am having some trouble comparing data between rows based upon multiple column parameters. Say I have some excel table (shown in R) that looks like this:

A tibble: 6 x 4

'Fruit Number' 'Fruit' 'Length' 'Color complexion'

1 1 Apple 2 0.34
2 2 Banana 4 0.23
3 3 Orange 2 0.68
4 4 Peach 3 0.11
5 5 Guava 4 0.47
6 6 Banana 4 0.25

In my analysis, I want to essentially loop through each row so that row 1 is compared to all other rows, row 2 is compared to all other rows, etc. (each row is compared to every other row). Then I want to compare the 'Fruit' , 'Length', and 'Color complexion'. If the row has the same 'Fruit' name and 'Length' as another row with a 'Color complexion' within 25% similarity then I want to take both rows out of the data set. I know this can be achieved relatively easily with a loop and some if statements, but am unfamiliar with how to do this in RStudio.

This is also my first post, so I do apologize if it is a little unconventional. Thanks!

Hi,

Welcome to the RStudio community!

First of all, next time try to generate a reprex if you post a question with data and code. This will greatly help us get started quickly and help you out. A reprex consists of the minimal code and data needed to recreate the issue/question you're having. You can find instructions how to build and share one here:

That said, what you are trying to do is not 'unconventional' at all :slight_smile: It can be solved with some basic dplyr magic.

library(dplyr)

#Get the data
myData = data.frame(
  fruitNumber = 1:7,
  fruit = c("Apple", "Banana", "Orange", "Peach", "Guava", "Banana", "Orange"),
  length = c(2,4,2,3,4,4,2),
  colorComplexion= c(0.34,0.23,0.68,0.11,0.47,0.25, 0.42)
)
myData
#>   fruitNumber  fruit length colorComplexion
#> 1           1  Apple      2            0.34
#> 2           2 Banana      4            0.23
#> 3           3 Orange      2            0.68
#> 4           4  Peach      3            0.11
#> 5           5  Guava      4            0.47
#> 6           6 Banana      4            0.25
#> 7           7 Orange      2            0.42

#Calculate the similarities per group
myData = myData %>% group_by(fruit, length) %>% 
  mutate(
    similarity = min(colorComplexion) / max(colorComplexion),
    n = n()) %>% ungroup()
myData
#> # A tibble: 7 x 6
#>   fruitNumber fruit  length colorComplexion similarity     n
#>         <int> <chr>   <dbl>           <dbl>      <dbl> <int>
#> 1           1 Apple       2            0.34      1         1
#> 2           2 Banana      4            0.23      0.92      2
#> 3           3 Orange      2            0.68      0.618     2
#> 4           4 Peach       3            0.11      1         1
#> 5           5 Guava       4            0.47      1         1
#> 6           6 Banana      4            0.25      0.92      2
#> 7           7 Orange      2            0.42      0.618     2

#Remove those with high similarity
#i.e. keep the ones with low similarity or only one sample
myData = myData %>% filter(similarity < 0.75 | n == 1)
myData
#> # A tibble: 5 x 6
#>   fruitNumber fruit  length colorComplexion similarity     n
#>         <int> <chr>   <dbl>           <dbl>      <dbl> <int>
#> 1           1 Apple       2            0.34      1         1
#> 2           3 Orange      2            0.68      0.618     2
#> 3           4 Peach       3            0.11      1         1
#> 4           5 Guava       4            0.47      1         1
#> 5           7 Orange      2            0.42      0.618     2

Created on 2021-07-14 by the reprex package (v2.0.0)

EXPLANATION

  • I have added an extra Orange to show the case where the similarity between two would be low (an thus kept)
  • I grouped data by fruit and length as requested
  • I defined similarity in a group as the max percentage difference between the largest and smallest colorComplexion (so this would also work if there are more than 2 in a group), though you can easily change this function if needed
  • I also calculated the number of items in a group
  • Finally, if there are more than one item per group and the group's similarity < 0.75 they get removed.

If you don't know the dplyr package (part of the Tidyverse), just check out what it all can do here.

Hope this helps,
PJ

Another way of attack is to combine every row with the others and doing a filtering on the result like

myData2 <- tidyr::expand_grid(myData,myData,.name_repair="universal") %>%
  filter(
    (fruitNumber...1 < fruitNumber...5) &
    (length...3 == length...7)  &  
    (abs(colorComplexion...4-colorComplexion...8) < 0.25) 
    # & filter(fruit...2 == fruit...6) 
  )

Edited:
or better readable

myData3 <- merge(myData,myData,by=NULL) %>% 
   filter(
    (fruitNumber.x < fruitNumber.y) &
    (length.x == length.y)  &  
    (abs(colorComplexion.x-colorComplexion.y) < 0.25) 
    # & filter(fruit.x == fruit.y) 
  )

Thank you so much PJ and HanOostdijk! I will certainly give that a try and get back to you with any questions. Thanks again!

Hi again,

I can't seem to generate a reprex, so I'll work on that in the future! But quick additional question. Say I had another column that consisted of "Side" which can either be "Top" or "Bottom" and I wanted to filter based on this additional layer. So instead of just filtering based upon 25% similarity for color complexion and same fruit and length. I also want to only filter out the data if the fruit and length are the same, the color complexion is within 25% similarity, but the side is opposite.

So for the banana example, I would only want to filter it out if one Banana had a "Side" of "Top" and the other Banana had a "Side" of "Bottom".

Let me know if you want a concrete example with a reprex, and I can work on generating that! Thanks again for all the help!

For a reprex I use the RStudio addin that is installed automatically when you install the reprex package.
I select the code I want to show as a reprex and choose in the addins (on my screen below the help menu-item) the 'reprex selection' option. Nothing more to do.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

#Get the data
myData <- data.frame(
  fruitNumber = 1:7,
  fruit = c("Apple", "Banana", "Orange", "Peach", "Guava", "Banana", "Orange"),
  length = c(2,4,2,3,4,4,2),
  colorComplexion= c(0.34,0.23,0.68,0.11,0.47,0.25, 0.42),
  Side = c("Top","Top","Top","Top","Top","Bottom","Bottom")
)

myData3 <- merge(myData,myData,by=NULL) %>% 
   filter(
    (fruitNumber.x < fruitNumber.y) &
    (length.x == length.y)  &  
    (abs(colorComplexion.x-colorComplexion.y) < 0.25) &
    (fruit.x == fruit.y) & 
    (Side.x != Side.y)
  )
Created on 2021-07-14 by the reprex package (v2.0.0)

Hi again,

Just for completeness, I'll show you the extension of my approach for your new request as well

library(dplyr)

#Get the data
myData = data.frame(
  fruitNumber = 1:8,
  fruit = c("Apple", "Banana", "Orange", "Peach", "Guava", 
            "Banana", "Orange", "Peach"),
  length = c(2,4,2,3,4,4,2,3),
  colorComplexion= c(0.34,0.23,0.68,0.11,0.47,0.25, 0.42,0.13),
  side = c("Top","Top","Top","Top","Top","Bottom","Bottom", "Top")
)

#Compare and filter
myData = myData %>% group_by(fruit, length) %>% 
  mutate(
    similarity = min(colorComplexion) / max(colorComplexion),
    nSides = n_distinct(side),
    n = n()) %>% 
  ungroup() %>% 
  filter(similarity < 0.75 | n == 1 | nSides == 1)

myData
#> # A tibble: 6 x 8
#>   fruitNumber fruit  length colorComplexion side   similarity nSides     n
#>         <int> <chr>   <dbl>           <dbl> <chr>       <dbl>  <int> <int>
#> 1           1 Apple       2            0.34 Top         1          1     1
#> 2           3 Orange      2            0.68 Top         0.618      2     2
#> 3           4 Peach       3            0.11 Top         0.846      1     2
#> 4           5 Guava       4            0.47 Top         1          1     1
#> 5           7 Orange      2            0.42 Bottom      0.618      2     2
#> 6           8 Peach       3            0.13 Top         0.846      1     2

Created on 2021-07-15 by the reprex package (v2.0.0)

I added the side category and also added a new Peach to with the same side as the previous one. So in the results, the Banana is filtered out because it satisfies all removal criteria, the Orange is kept because its similarity < 0.75% and the Peach is kept because both have the same side.

You can keep adding criteria once you get the hang of the way dplyr grouping and filtering works :slight_smile:

Hope this helps,
PJ

Thank you guys again for all the help, this is really great!

Alright, one last question I swear! The whole process worked for me, but is there any way to write the filtered out rows to a new data frame?

Both the @pieterjanvc solution and mine deliver the result in the form of a new data.frame .
If you mean that the data.frame should have the same format (same columns) as the input,
you could use the rbind function with a selection of the columns :


newData <- rbind (
  myData3 %>% 
      select(fruitNumber.x:Side.x) %>%
      rename(fruitNumber=fruitNumber.x,
             fruit=fruit.x,
             length=length.x,
             colorComplexion=colorComplexion.x,
             Side=Side.x
             )
  ,
  myData3 %>% 
      select(fruitNumber.y:Side.y) %>%
      rename(fruitNumber=fruitNumber.y,
             fruit=fruit.y,
             length=length.y,
             colorComplexion=colorComplexion.y,
             Side=Side.y
             )
 )
print(newData)
#>   fruitNumber  fruit length colorComplexion   Side
#> 1           2 Banana      4            0.23    Top
#> 2           6 Banana      4            0.25 Bottom

If you were using @pieterjanvc solution, you could save the #Compare and filtersection output to myDataKeep, rather than overwriting myData and then

myDataDrop <- anti_join(myData, myDataKeep)

Great, I got the whole thing to work! Thank you to everyone

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.