Get indices for rows containing a value in any column

kaijabean · February 3, 2021, 9:26pm

Note: I do have a working solution for this, but am curious if there's a better/more intuitive/tidyverse-friendly way to do it. So, not an urgent problem!

I'm working on correcting a recurrent typo in a data frame. It can occur in any column, so I want to correct it all at once for the whole data frame. But I also have a column in my data frame called "updateID" that I use to track changes to each row, so in order to update that column, I will need to get the indices for all rows in which the typo was found (and corrected).

Here's a minimal example. Say we have a data frame df, containing some nonsense data:

# Create an example data frame
df <- data.frame(color = c("blue", "green", "red<"), 
                 animal = c("horse", "duck", "<rat"), 
                 place = c("beach<", "bedroom", "street"),
                 updateID = "previousUpdate")

See how some of the cells have < sprinkled around? That's the example character that I want to remove. I can do that, using the tidyverse, like this:

library(dplyr)
library(stringr)

# replace "<" with "" (i.e. remove it)
df <- df %>%
    mutate(across(everything(), ~ str_replace_all(.x, "<", "")))

# see the resulting df:
df

So, that's all well and good. But before I do that, I have to figure out which rows will be changed. So far, this is how I'm approaching it:

# create an example data frame
df <- data.frame(color = c("blue", "green", "red<"), 
                 animal = c("horse", "duck", "<rat"), 
                 place = c("beach<", "bedroom", "street"),
                 updateID = "previousUpdate")

# get indices of the rows containing "<"
inds <- which(apply(df, 1, function(x) any(grepl("<", x))))
inds
[1] 1 3

# replace "<" with "" (i.e. remove it)
df <- df %>%
    mutate(across(everything(), ~ str_replace_all(.x, "<", "")))

# Mark the rows that got changed
df[inds, "updateID"] <- "newUpdate"

(Just to be very clear, this is the desired output:)

  color animal   place       updateID
1  blue  horse   beach      newUpdate
2 green   duck bedroom previousUpdate
3   red    rat  street      newUpdate

Buuuuuut... I don't love this approach! It seems like a whole lot of lines and a lot of mixing of tidyverse and base to accomplish what feels like it should be a relatively simple, doable task.

I have the intuition that this should be doable with dplyr::rowwise and/or dplyr::across, but I'm pretty new to using both of those functions. I've tried several permutations of them and can't quite get it to work.

Does anyone have a nice, concise approach to this? Thank you very much!
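For reference, the rowwise() idea mentioned above could look roughly like the sketch below. It assumes the three data columns are named explicitly and that none of them contain missing values; the name `result` is only illustrative. Detection has to happen before the character is removed, otherwise there is nothing left to detect.

```r
library(dplyr)
library(stringr)

df <- data.frame(color = c("blue", "green", "red<"),
                 animal = c("horse", "duck", "<rat"),
                 place = c("beach<", "bedroom", "street"),
                 updateID = "previousUpdate")

result <- df %>%
  # evaluate the next mutate() one row at a time
  rowwise() %>%
  # c_across() collects the row's values from the listed columns into one vector
  mutate(updateID = if (any(str_detect(c_across(c(color, animal, place)), "<"))) {
    "newUpdate"
  } else {
    updateID
  }) %>%
  # drop the row-wise grouping before the column-wise cleanup
  ungroup() %>%
  mutate(across(c(color, animal, place), ~ str_remove_all(.x, "<")))

result
```

Because rowwise() evaluates the expression once per row, this is likely to be slower than a vectorized approach on large data frames.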

technocrat · February 4, 2021, 1:13am

Consider changing the logic a bit: first detect the rows that need changing and update the updateID field for those rows, then mutate the affected values.

suppressPackageStartupMessages({
  library(dplyr)
  library(stringr)
})
DF <- tibble(
  color = c("blue", "green", "red<"),
  animal = c("horse", "duck", "<rat"),
  place = c("beach<", "bedroom", "street"),
  updateID = "previousUpdate"
)

pattern <- "<"

detect <- function(x, y, z) str_detect(paste(x, y, z), pattern)

DF %>%
  mutate(updateID = ifelse(
    detect(place, color, animal),
    "newUpdate",
    updateID
  ))
#> # A tibble: 3 x 4
#>   color animal place   updateID      
#>   <chr> <chr>  <chr>   <chr>         
#> 1 blue  horse  beach<  newUpdate     
#> 2 green duck   bedroom previousUpdate
#> 3 red<  <rat   street  newUpdate
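The output above still contains the "<" characters, since only the updateID step is shown. A minimal sketch of the full two-step sequence, reusing the same `detect()` helper and assuming the three data columns are named explicitly (the name `result` is only illustrative), might look like this:

```r
library(dplyr)
library(stringr)

DF <- tibble(
  color = c("blue", "green", "red<"),
  animal = c("horse", "duck", "<rat"),
  place = c("beach<", "bedroom", "street"),
  updateID = "previousUpdate"
)

pattern <- "<"

# TRUE for each row in which any of the three pasted values contains the pattern
detect <- function(x, y, z) str_detect(paste(x, y, z), pattern)

result <- DF %>%
  # Step 1: flag the affected rows while the typo is still present
  mutate(updateID = ifelse(detect(place, color, animal), "newUpdate", updateID)) %>%
  # Step 2: remove the typo from the data columns
  mutate(across(c(color, animal, place), ~ str_remove_all(.x, pattern)))

result
```

The order of the two steps matters: if the cleanup ran first, `detect()` would find nothing and no row would be flagged.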

andresrcs · February 4, 2021, 1:18am

Another option, using the latest dplyr features (if_any() was added in dplyr 1.0.4):

library(dplyr)
library(stringr)

df <- data.frame(color = c("blue", "green", "red<"), 
                 animal = c("horse", "duck", "<rat"), 
                 place = c("beach<", "bedroom", "street"),
                 updateID = "previousUpdate")
df %>%
    mutate(
        updateID = if_else(if_any(where(is.character), ~ str_detect(.x, "<")),
                           "newUpdate",
                           updateID),
        across(where(is.character), .fns = ~ str_remove_all(.x, "<"))
        )
#>   color animal   place       updateID
#> 1  blue  horse   beach      newUpdate
#> 2 green   duck bedroom previousUpdate
#> 3   red    rat  street      newUpdate

Created on 2021-02-04 by the reprex package (v1.0.0)
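If this needs to be reused for other patterns, the same if_any()/across() idea could be wrapped in a small function. The sketch below is only one possible shape: the function name `flag_and_clean()` and its arguments are hypothetical, not part of dplyr. It excludes the tracking column from both the scan and the cleanup, and treats missing values as "no match" so that an NA does not turn updateID into NA.

```r
library(dplyr)
library(stringr)

# Hypothetical helper; the name and arguments are illustrative only
flag_and_clean <- function(data, pattern, id_col = "updateID", new_id = "newUpdate") {
  data %>%
    mutate(
      # flag rows where any character column other than id_col matches the pattern
      "{id_col}" := if_else(
        if_any(where(is.character) & !all_of(id_col),
               ~ coalesce(str_detect(.x, pattern), FALSE)),
        new_id,
        .data[[id_col]]
      ),
      # then remove the pattern from those same columns
      across(where(is.character) & !all_of(id_col),
             ~ str_remove_all(.x, pattern))
    )
}

df <- data.frame(color = c("blue", "green", "red<"),
                 animal = c("horse", "duck", "<rat"),
                 place = c("beach<", "bedroom", "street"),
                 updateID = "previousUpdate")

out <- flag_and_clean(df, "<")
out
```

Note that the pattern is interpreted as a regular expression by stringr, so characters with special meaning in regex would need escaping or wrapping in `fixed()`.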

joels · February 19, 2021, 5:44pm

How about pivoting to long format to operate on the data frame and then restoring the wide shape at the end? Working on a long data frame lets you operate on a single column of values instead of across multiple columns. We just need to add a row-number column (essentially each observation's ID number) so that we can put each long row back into the correct wide row at the end.

library(tidyverse)

# Create an example data frame
df <- data.frame(color = c("blue", "green", "red<"), 
                 animal = c("horse", "duck", "<rat"), 
                 place = c("beach<", "bedroom", "street"),
                 updateID = "previousUpdate")

df %>% 
  rownames_to_column(var="rownum") %>% 
  pivot_longer(-c(rownum, updateID)) %>% 
  group_by(rownum) %>% 
  mutate(updateID = case_when(any(grepl("<", value)) ~ "newUpdate",
                              TRUE ~ updateID),
         value = gsub("<", "", value)) %>% 
  pivot_wider()
#> # A tibble: 3 x 5
#> # Groups:   rownum [3]
#>   rownum updateID       color animal place  
#>   <chr>  <chr>          <chr> <chr>  <chr>  
#> 1 1      newUpdate      blue  horse  beach  
#> 2 2      previousUpdate green duck   bedroom
#> 3 3      newUpdate      red   rat    street

Created on 2021-02-19 by the reprex package (v1.0.0)
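The result above is still grouped by rownum, keeps the helper column, and has updateID moved to the front. If the original layout is wanted, a possible cleanup (a sketch, with `result` as an illustrative name) is to ungroup, pivot back, and then select the columns in the original order:

```r
library(dplyr)
library(tidyr)
library(tibble)

df <- data.frame(color = c("blue", "green", "red<"),
                 animal = c("horse", "duck", "<rat"),
                 place = c("beach<", "bedroom", "street"),
                 updateID = "previousUpdate")

result <- df %>%
  rownames_to_column(var = "rownum") %>%
  pivot_longer(-c(rownum, updateID)) %>%
  group_by(rownum) %>%
  mutate(updateID = case_when(any(grepl("<", value)) ~ "newUpdate",
                              TRUE ~ updateID),
         value = gsub("<", "", value)) %>%
  # remove the grouping so it does not carry over into the result
  ungroup() %>%
  pivot_wider() %>%
  # drop rownum and restore the original column order
  select(all_of(names(df)))

result
```

The result is a tibble rather than a base data frame; `as.data.frame()` could be appended if that matters.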
