Using sqldf to Delete Values from A dataset

Hey Everyone,

I've been stuck on the following query. I'm trying to delete half of the rows in my dataframe that contain a certain string. Here is my code:

sqldf("Delete * From CyclicTest Where (
      SELECT month from CyclicTest Where (
      SELECT TOP 50 PERCENT where month = 'May'"))

Is sqldf the best way? or is there a different way I could go about this?

You could do something like

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
set.seed(2021)
df <- data.frame(
  month = sample(c("apr","may","jun"),40,replace=T),
  value = rnorm(40)
)
df1  <- df %>%
  group_by(month) %>%
  mutate (pc =percent_rank(value)) %>%
  filter( month != "may" | pc >= 0.5)
Created on 2021-09-14 by the reprex package (v2.0.0)

What argument do I put for x in the percent_rank function? I did 0.5 and got a return value of a new column titled pc with just NaN.

You don't have a numeric variable in your dataset ?
Then you can use the row number as such:

df1  <- df %>%
  group_by(month) %>%
  mutate (pc = percent_rank(row_number())) %>%
  filter( month != "may" | pc >= 0.5)

So this is closer but it ends up removing all of the rows that contain "May", I just need it to remove half.

With the code in this thread, it works for me and it should also work for you.
If you want us to help you, show us your code. E.g. by using a reprex with a limited number of input rows.

1 Like

This is what i'm deploying.

CyclicTest  <- CyclicTest %>%
  group_by(month) %>%
  mutate (pc = percent_rank(row_number(9))) %>%
  filter(month != "May" | pc >= 0.5)

it removes every instance of "May"

You are right.
See the contents of CyclicTest after running

CyclicTest  <- CyclicTest %>%
  group_by(month) %>%
  mutate (pc = percent_rank(row_number(9)))

and then compare your code with the proposed code.

Enjoy!