Select Unique rows woth values keeping all the columns

bbi · February 14, 2020, 11:21pm

Hi,

I have a dataset with 220 obs of 41 variables and would like to extract only the rows with unique values (keeping all the col) in col 3 (ID number so unpredictable values) how can I do that ? The other variable are not completely between the different rows with identical ID numbers but not important at that point which rows is kept.
I try several options
df2 <- df%>% select(ID, Name) %>% unique() # But do not know then how to keep the other columns

which(duplicated(cf$ID))
df2<-df[-doublons,3, ]# actually working but >50 duplicated (some 4x, 4x) so if do not want to enter them manually!

uniqueID <- unique(df$ID)
dfset <- subset(df, MRN==unique(df$ID) #error message "longer object length is not a multiple of shorter object length"

Thanks a lot for your help, hope my question is understandable.

siddharthprabhu · February 15, 2020, 5:27am

I think this is what you want. Let me know if your objective is something different.

library(tidyverse)

df <- tribble(~col1, ~col2, ~col3,
              "X", "A", 1,
              "Y", "B", 2,
              "X", "B", 2,
              "Y", "C", 3,
              "Z", "B", 3,
              "Z", "C", 4)

head(df)
#> # A tibble: 6 x 3
#>   col1  col2   col3
#>   <chr> <chr> <dbl>
#> 1 X     A         1
#> 2 Y     B         2
#> 3 X     B         2
#> 4 Y     C         3
#> 5 Z     B         3
#> 6 Z     C         4

df %>% 
  group_by(col3) %>% 
  filter(n() == 1)
#> # A tibble: 2 x 3
#> # Groups:   col3 [2]
#>   col1  col2   col3
#>   <chr> <chr> <dbl>
#> 1 X     A         1
#> 2 Z     C         4

^{Created on 2020-02-15 by the reprex package (v0.3.0)}

Dawie · February 15, 2020, 1:19pm

You could try the dplyr::distinct() function

df %>% distinct(col3, .keep_all = TRUE)

This will return:

A tibble: 4 x 3

col1 col2 col3

1 X A 1
2 Y B 2
3 Y C 3
4 Z C 4

valeri · February 15, 2020, 6:35pm

Just want to add that distinct() will keep the first occurrence of a given value (that might or might not be what you would want).

siddharthprabhu · February 16, 2020, 5:00am

@Dawie The way I've interpreted the OP's question is that only rows containing a unique value in the specified column (col3 in my example) need to be returned. distinct returns the first observation encountered for each possible value in col3 (which is a different result).

Perhaps @bbi could clarify which of these is the expected result.

Dawie · February 16, 2020, 8:47am

@valeri thank you for mentioning that. To add to this, you need to arrange df in a way that it will return the correct row you want.

bbi · February 18, 2020, 1:41pm

Thank you very much for your help!
Yes @siddharthprabhu understand correctly, I want to keep only the rows with unique value in col3.
One remaining question (probably quiet basic, I am still new to R). I am not able to use groupe_by :
Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "function"
My datasheet consists of a mixt of characters, numeric and factors and the variables of interest (col3 is a character, ID number).
Tribble seems nice but as have 220 rows, 41 col and the values in col3 are ID numbers with 9-10 numbers I was hoping for an easier way.
Thanks again for your help.

nirgrahamuk · February 18, 2020, 1:50pm

bbi,
what did you try with group_by ?

bbi · February 18, 2020, 1:55pm

yes
df %>%
group_by(col3) %>%
filter(n() == 1)

Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "function"

nirgrahamuk · February 18, 2020, 1:58pm

if you use class(df) what does it show ?
my guess is that your df is not a dataframe nor a tibble.
Perhaps go back to your code that makes your df and rerun that first.

bbi · February 18, 2020, 2:11pm

ok solved thank you so much ! One remaining stupid question what argument is expected for the filter(n() == 1) if not specification of the variable (ID is col3 title) ?

df
%>% group_by(col3) %>%
filter(n(df$ID) == 1)

Error in n(df$ID) : unused argument (df$ID)

nirgrahamuk · February 18, 2020, 2:36pm

filter() wants to subset your data based on a condition. i.e (== 1 )
n() is a function that counts the rows within defined group_by, which in your case is col3.

I don't know what your intensions are so its hard to guide you. but when using dplyr/tidyverse, referencing varaibles by df$myvariable would not be required as the dplyr functions know the data/df that was piped to them ( %>% is a pipe) and will recognise that ID is a column in there (if it is).

bbi · February 18, 2020, 2:58pm

Thank you very much. Want I am trying to do is to extract (filter) all the rows in my datasheet that have unique ID number.
my df look like this
day epo ID LastName FirstName DOB Gender ..--> up to 41col
1 1 44422432 James ab 2.12.97 H
1 2 44422432 James ab 2.12.97 H
2 1 44422432 James ab 2.12.97 H
1 1 893448451 Henri er 3.5.78 H
2 1 893448451 Henri er 3.5.78 H
--> up to 220 rows

siddharthprabhu · February 18, 2020, 3:01pm

For your sample data, you just need to put the ID variable in the call to group_by. So something like this should work:

# assuming your data frame is named df
df %>%
group_by(ID) %>%
filter(n() == 1)

By the way, tribble() is generally used to prepare small data sets. The idea is to show you how to solve a given problem on some dummy data. You need to modify the solution provided by the community and apply it to your own dataset which would have been imported using some other method. You're not really expected to create it from scratch using tribble().

bbi · February 18, 2020, 3:43pm

Thank you so much to spend time helping me with my baby step as an R beginner.
I tried but however it is still not working (no error message but my datasheet remain unchanged)

> rm(list=ls())
> library(readxl)
> library(writexl)
> library(xlsx)
> library(tidyverse)
> 
> df<- read_excel("df", 
+                               sheet = "a")
New names:
* `` -> ...40
* `` -> ...41
> 
> df %>% 
+   group_by(ID) %>% 
+   filter(n() == 1)
# A tibble: 35 x 41
# Groups:   ID [35]

> View(df)

nirgrahamuk · February 18, 2020, 3:47pm

I'm afraid its not so simply to say you want all rows in a dataframe that have a unique id number, because you are not proposing an explicit method of choosing which accompanying data to show along with the unique id, should it be the first entry, the last, or should it be an aggregation, like the max or the min or the average of the other values.

Here are some examples of picking the row based on an order approach.
I'll start by taking the example df, and make a df2 which a 'rn' variable to hold the position of the row from the perspective of the col3 group.

Then I show how to take top or bottom

df <- tribble(~col1, ~col2, ~col3,
              "X", "A", 1,
              "Y", "B", 2,
              "X", "B", 2,
              "Y", "C", 3,
              "Z", "B", 3,
              "Z", "C", 4)

df2 <- df %>%
  group_by(col3) %>%
  mutate(rn = row_number()) 

df2 
# A tibble: 6 x 4
# Groups:   col3 [4]
  col1  col2   col3    rn
  <chr> <chr> <dbl> <int>
1 X     A         1     1
2 Y     B         2     1
3 X     B         2     2
4 Y     C         3     1
5 Z     B         3     2
6 Z     C         4     1

df2%>%
  top_n(n = -1)
#Selecting by rn
# A tibble: 4 x 4
# Groups:   col3 [4]
  col1  col2   col3    rn
  <chr> <chr> <dbl> <int>
1 X     A         1     1
2 Y     B         2     1
3 Y     C         3     1
4 Z     C         4     1

df2%>%
  top_n(n = 1)
#Selecting by rn
# A tibble: 4 x 4
# Groups:   col3 [4]
  col1  col2   col3    rn
  <chr> <chr> <dbl> <int>
1 X     A         1     1
2 X     B         2     2
3 Z     B         3     2
4 Z     C         4     1

system · March 10, 2020, 3:47pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.