Erasing rows from a data frame whose string in a given column does not appear often enough in other rows

Daniel_R · August 10, 2021, 4:40am

Hi there!

Is it possible to erase rows from a data frame whose string in a particular column does not appear often enough?

I am using a data frame to train a neural network. It uses 3/4 of the dataframe as training data,
the remaining 1/4 as test data. If I happen to have a string that only appears once in the dataframe
and it ends up in the test data, the neural network has no idea what to do and returns errors. Even
if I get actual test data and the entire data frame is the training data, it does not seem smart to
have a string in the training data that only appears once.

Unfortunately, my data frame is huge, with over 50000 entries, so there is no way I can check for
every possible string. Is there a way to tell R:

For this column (preferably for both, since I use 2 columns with strings), count how often each string occurs;
if a row's entry occurs fewer than, let's say, n = 10 times in the entire data frame, erase that entire row from the data frame.

Thank you in advance!

pathos · August 10, 2021, 7:57am

You might be interested in lubridate https://lubridate.tidyverse.org

nirgrahamuk · August 10, 2021, 8:30am

I think this is a use case for forcats/tidyverse.
There are several lump functions to help solve issues of rare categories:
Lump together factor levels into "other": fct_lump (forcats, tidyverse.org)
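
To illustrate the two routes discussed here (dropping rare rows, as asked, versus lumping rare values into "Other", as forcats does), here is a minimal sketch. It is not code from any poster: the toy data, the column names `city` and `product`, and the threshold `n_min` are made up for the example, and it assumes the string columns contain no missing values.

```r
# Hypothetical toy data; `city` and `product` stand in for the two string columns.
df <- data.frame(
  city    = c("Oslo", "Oslo", "Oslo", "Bergen", "Oslo", "Bergen", "Tromso"),
  product = c("A", "A", "B", "A", "A", "A", "A"),
  y       = 1:7,
  stringsAsFactors = FALSE
)
n_min <- 2  # illustrative threshold; the question suggests something like 10

# Route 1 (base R): drop rows whose value is rare in either column.
# ave() with FUN = length gives, for every row, the size of that row's group.
count_per_row <- function(x) ave(seq_along(x), x, FUN = length)
keep <- count_per_row(df$city) >= n_min & count_per_row(df$product) >= n_min
df_filtered <- df[keep, ]

# Route 2 (forcats, if installed): keep every row, but replace values that
# occur fewer than n_min times with a single "Other" level.
if (requireNamespace("forcats", quietly = TRUE)) {
  df_lumped <- df
  df_lumped$city    <- forcats::fct_lump_min(df$city, min = n_min)
  df_lumped$product <- forcats::fct_lump_min(df$product, min = n_min)
}
```

Note that with Route 1, removing rows because of one column can push a value in the other column below the threshold, so a single pass does not guarantee that every remaining value still occurs at least `n_min` times. Re-checking the counts afterwards (or repeating the filter) may be needed. Route 2 avoids losing rows, at the cost of treating all rare values as one category.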

Daniel_R · August 11, 2021, 3:21am

That should work very well, thank you!
