Erasing rows from a data frame whose string in a column does not appear often enough in other rows

Hi there!

Is it possible to erase rows from a data frame whose string in a particular column does not appear often enough?

I am using a data frame to train a neural network. It uses 3/4 of the dataframe as training data,
the remaining 1/4 as test data. If I happen to have a string that only appears once in the dataframe
and it ends up in the test data, the neural network has no idea what to do and returns errors. Even
if I get actual test data and the entire data frame is the training data, it does not seem smart to
have a string in the training data that only appears once.

Unfortunately, my data frame is huge, with over 50000 entries, so there is no way I can check for
every possible string. Is there a way to tell R:

Count how often each string occurs in this column (preferably these columns, since I use 2 columns with strings), and
if a row's entry occurs fewer than, let's say, n = 10 times throughout the entire data frame, erase the entire row from the data frame.

Thank you in advance!

You might be interested in lubridate https://lubridate.tidyverse.org

I think this is a use case for forcats/tidyverse.
There are several lump functions to help solve issues of rare categories:
fct_lump: Lump together factor levels into "other" (forcats, tidyverse.org)
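
For anyone landing here later, the two options discussed above could look roughly like the sketch below. It is only an illustration on toy data: the column names `city` and `product`, the helper `count_per_row`, and the threshold `n` are made up, and it assumes the string columns contain no missing values. The first part drops rare rows with base R; the second part (only run if forcats is installed) keeps all rows and lumps rare strings into "Other" with `fct_lump_min()`.

```r
# Toy data; the column names `city` and `product` are hypothetical.
df <- data.frame(
  city    = c(rep("Berlin", 12), rep("Paris", 11), "Oslo"),
  product = c(rep("A", 20), rep("B", 3), "A"),
  y       = seq_len(24),
  stringsAsFactors = FALSE
)

n <- 10  # minimum number of occurrences a string needs in order to be kept

# Hypothetical helper: for every row, look up how often its string
# occurs in the whole column (assumes the column has no NA values).
count_per_row <- function(x) as.integer(table(x)[as.character(x)])

# Option 1: erase rows whose string is rare in either column.
keep <- count_per_row(df$city) >= n & count_per_row(df$product) >= n
df_filtered <- df[keep, ]

# Option 2: keep every row, but recode rare strings as "Other".
if (requireNamespace("forcats", quietly = TRUE)) {
  df_lumped <- df
  df_lumped$city    <- forcats::fct_lump_min(df$city, min = n)
  df_lumped$product <- forcats::fct_lump_min(df$product, min = n)
}
```

Note that the counts are taken on the original data frame, as asked in the question. After filtering on two columns at once, a string that passed the threshold can end up with fewer than `n` rows, so it may be worth re-checking the counts on the filtered result.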

That should work very well, thank you! :)
