Note: I do have a working solution for this, but am curious if there's a better/more intuitive/tidyverse-friendly way to do it. So, not an urgent problem!
I'm working on correcting a recurrent typo in a data frame. It can occur in any column, so I want to correct it all at once for the whole data frame. But I also have a column in my data frame called "updateID" that I use to track changes to each row, so in order to update that column, I will need to get the indices for all rows in which the typo was found (and corrected).
Here's a minimal example. Say we have a data frame df, containing some nonsense data:
# Create an example data frame
df <- data.frame(color = c("blue", "green", "red<"),
                 animal = c("horse", "duck", "<rat"),
                 place = c("beach<", "bedroom", "street"),
                 updateID = "previousUpdate")
See how some of the cells have < sprinkled around? That's the example character that I want to remove. I can do that, using the tidyverse, like this:
library(dplyr)
library(stringr)
# replace "<" with "" (i.e. remove it)
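df <- df %>%
  # everything() is spelled out because calling across() without .cols is deprecated as of dplyr 1.1.0
  mutate(across(everything(), ~ str_replace_all(.x, "<", "")))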
# see the resulting df:
df
So, that's all well and good. But before I do that, I have to figure out which rows are going to be changed. So far, this is how I'm approaching it:
# create an example data frame
df <- data.frame(color = c("blue", "green", "red<"),
                 animal = c("horse", "duck", "<rat"),
                 place = c("beach<", "bedroom", "street"),
                 updateID = "previousUpdate")
# get indices of the rows containing "<"
inds <- which(apply(df, 1, function(x) any(grepl("<", x))))
inds
# [1] 1 3
# replace "<" with "" (i.e. remove it)
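df <- df %>%
  # everything() is spelled out because calling across() without .cols is deprecated as of dplyr 1.1.0
  mutate(across(everything(), ~ str_replace_all(.x, "<", "")))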
# Mark the rows that got changed
df[inds, "updateID"] <- "newUpdate"
(Just to be very clear, this is the desired output:)
  color animal   place       updateID
1  blue  horse   beach      newUpdate
2 green   duck bedroom previousUpdate
3   red    rat  street      newUpdate
Buuuuuut... I don't love this approach! It seems like a whole lot of lines and a lot of mixing of tidyverse and base to accomplish what feels like it should be a relatively simple, doable task.
I have the intuition that this should be doable with dplyr::rowwise and/or dplyr::across, but I'm pretty new to using both of those functions. I've tried several permutations of them and can't quite get it to work.
Does anyone have a nice, concise approach to this? Thank you very much!
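For what it's worth, here is one possible all-dplyr sketch of the task described above, not necessarily the best way to do it. It assumes dplyr >= 1.0.4 (where if_any() was introduced) and that updateID itself never needs to be searched for the typo, which differs slightly from the apply() version that scans every column. The name result is just a placeholder chosen for illustration.

```r
library(dplyr)
library(stringr)

# Same example data frame as in the question
df <- data.frame(color = c("blue", "green", "red<"),
                 animal = c("horse", "duck", "<rat"),
                 place = c("beach<", "bedroom", "street"),
                 updateID = "previousUpdate")

result <- df %>%
  # Step 1: flag rows where any column other than updateID contains "<".
  # This has to run before the typo is removed, otherwise nothing is left to detect.
  mutate(updateID = if_else(if_any(-updateID, ~ str_detect(.x, fixed("<"))),
                            "newUpdate",
                            updateID)) %>%
  # Step 2: strip "<" from every column other than updateID
  mutate(across(-updateID, ~ str_remove_all(.x, fixed("<"))))

result
```

With the example data, rows 1 and 3 end up marked "newUpdate" and the "<" characters are gone, matching the desired output above. Using fixed("<") treats the pattern as a literal string rather than a regular expression, which matters if the real typo contains regex metacharacters.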