Filtering data by the top x values of a data frame variable

I am trying to filter a data frame by a variable and struggling. I want to keep only the rows with the highest x values of one variable. I have been trying filter() and top_n(), but I want to be able to apply top_n() to a specific variable. So far I can only get it to work on the entire data frame, and I can't tell which variable top_n() is selecting on.

Any suggestions on how to make this work?

Use the wt parameter for top_n()

The function's documentation says wt defaults to the last variable in the data frame.
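
As a minimal sketch of the difference (the data frame and its column names here are made up for illustration, not taken from your data), compare top_n() with and without wt when the column of interest is not the last one:

```r
library(dplyr)

# Hypothetical example data; `score` is deliberately NOT the last column
scores <- data.frame(
  team    = c("A", "B", "C", "D"),
  score   = c(4, 10, 7, 9),
  minutes = c(40, 10, 30, 20)
)

# Without wt, top_n() ranks by the last column (`minutes` here)
by_default <- top_n(scores, 2)

# With wt, top_n() ranks by the column you name
by_score <- top_n(scores, 2, wt = score)
```

When wt is left out, top_n() also prints a "Selecting by ..." message naming the column it picked, which is a quick way to check what it is ranking on.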

If you need more specific help, please provide a proper REPRoducible EXample (reprex) illustrating your issue.

I have included a sample data set to make it clearer what I am trying to do. I want to filter the data to return the two highest values of the Score variable among the UCLA data points, and likewise the two highest values of Score among the FLORIDA data points. I am not including any code because right now I am only filtering on the highest score overall, which I don't think is very helpful.

Home     Score
UCLA     4
UCLA     7
UCLA     9
UCLA     10
FLORIDA  3
FLORIDA  5
FLORIDA  6
FLORIDA  8

Hi tmulflur,

A reprex can be as simple as the example dataset and, optionally, the code you're using to try to achieve your goal. Providing one makes it easier for us to help you quickly, since we don't have to recreate the dataset manually.

Anyway, does the following approach, specifically slice_head(), give you the results you're looking for?

library(tidyverse)

df <- tribble(
  ~home, ~score,
  "UCLA", 4,
  "UCLA", 7,
  "UCLA", 9,
  "UCLA", 10,
  "FLORIDA", 3,
  "FLORIDA", 5,
  "FLORIDA", 6,
  "FLORIDA", 8,
  )
df %>% glimpse()
#> Rows: 8
#> Columns: 2
#> $ home  <chr> "UCLA", "UCLA", "UCLA", "UCLA", "FLORIDA", "FLORIDA", "FLORID...
#> $ score <dbl> 4, 7, 9, 10, 3, 5, 6, 8

df %>% 
  group_by(home) %>% 
  arrange(desc(score)) %>% 
  slice_head(n = 2) %>% 
  ungroup()
#> # A tibble: 4 x 2
#>   home    score
#>   <chr>   <dbl>
#> 1 FLORIDA     8
#> 2 FLORIDA     6
#> 3 UCLA       10
#> 4 UCLA        9

Created on 2021-02-23 by the reprex package (v1.0.0)

You can group by home first, then take the top 1 (or 2 or 3) rows within each group, specifying the variable whose values are compared (here score):

df %>% 
  group_by(home) %>%
  top_n(1, score)
#> # A tibble: 2 x 2
#> # Groups:   home [2]
#>   home    score
#>   <chr>   <dbl>
#> 1 UCLA       10
#> 2 FLORIDA     8

Indeed, there are different ways to approach this, and this one is even more concise.
However, according to the help documentation, top_n() has been superseded, so I would recommend using slice_max() instead.
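
A minimal sketch of the slice_max() version, rebuilding the same example data with data.frame() so the snippet stands on its own (with_ties = FALSE is my own addition here, to guarantee exactly two rows per group even if scores were tied):

```r
library(dplyr)

# Same example data as earlier in the thread
df <- data.frame(
  home  = rep(c("UCLA", "FLORIDA"), each = 4),
  score = c(4, 7, 9, 10, 3, 5, 6, 8)
)

# Keep the two highest scores within each home group
top2 <- df %>%
  group_by(home) %>%
  slice_max(score, n = 2, with_ties = FALSE) %>%
  ungroup()
```

Because slice_max() does the ordering itself, no separate arrange() step is needed.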

Thank you guys! This community is awesome, that helped big time.

Here is the code I am running now:

fhdecposs=filter(decposs, Half_Status <= 1) %>%
group_by(Home) %>%
arrange(fhdecposs, (Poss_Num)) %>%
slice_tail(n=6)

This worked once, but when I try to rerun the code I get an error:
Error: arrange() failed at implicit mutate() step.

  • Problem with mutate() input ..1.
    x Input ..1 can't be recycled to size 774.
    ℹ Input ..1 is fhdecposs.
    ℹ Input ..1 must be size 774 or 1, not 36.

I am trying to troubleshoot myself, but will take any suggestions! Being a beginner at this would be so frustrating without the R community.

What purpose is fhdecposs meant to serve inside arrange() here?

I am trying to arrange it by a variable (Poss_Num) first, so that when I slice the tail I get the 6 highest values for each group. fhdecposs is the data.

If the data is piped in with %>%, then also passing a data frame name as an argument is an error. Try removing it from arrange() and starting over.
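
A sketch of what the pipeline might look like with that change. The decposs data below is a made-up stand-in (only the column names Home, Half_Status and Poss_Num come from your post), so treat it as an illustration and not as your actual data:

```r
library(dplyr)

# Hypothetical stand-in for decposs; values are invented for illustration
decposs <- data.frame(
  Home        = rep(c("UCLA", "FLORIDA"), each = 10),
  Half_Status = rep(c(0, 1, 2, 1, 0), times = 4),
  Poss_Num    = rep(1:10, times = 2)
)

fhdecposs <- decposs %>%
  filter(Half_Status <= 1) %>%
  group_by(Home) %>%
  arrange(Poss_Num) %>%   # only the column; the data arrives through the pipe
  slice_tail(n = 6) %>%   # last 6 rows per Home, i.e. the 6 highest Poss_Num
  ungroup()
```

The arrange() plus slice_tail() pair could also be swapped for a single slice_max(Poss_Num, n = 6), which should give the same rows per group when there are no ties.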

That worked!! Thank you so much. So simple. I am new to this and just get lost quickly when trying to put it all together.
