search multiple columns for values above '60', if true return the value from another column

MN_AGRICO · April 5, 2023, 6:47pm

I have dataframes like the example below. Unlike this example, the real dataframes I need to analyze are always variable in both dimensions (hundreds of plate_numbers and tens of marker columns).

  plate_number marker_1 marker_2 marker_3
1      plate_1       20       80        5
2      plate_2       10       30       75
3      plate_3       40       90       15
4      plate_4       80       90       30
5      plate_5       30       25       95

I would like to check if the marker values are higher then 60. If so, I would like to have the corresponding plate_number returned. If not, it should return NA.

The desired output would be like below. Keep in mind that the dimensions of the dataframe are variable.

  plate_number marker_1 marker_2 marker_3
1      plate_1       NA  plate_1       NA
2      plate_2       NA       NA  plate_2
3      plate_3       NA  plate_3       NA
4      plate_4  plate_4  plate_4       NA
5      plate_5       NA       NA  plate_5

I know it is probably a simple task in R, but I am not that experienced yet. I tried some things but it only worked for the first column and after that I got stuck.

Any help would be really appreciated!

jrkrideau · April 6, 2023, 12:21am

This seems to do it. Not terribly elegant but .... The data.frame /tibble dimensions do not matter , well unless you are talking millions of rows of data.

library(tidyverse)
dat1  <- tibble::tribble(
  ~plate_number, ~marker_1, ~marker_2, ~marker_3,
  "plate_1",       20L,       80L,        5L,
  "plate_2",       10L,       30L,       75L,
  "plate_3",       40L,       90L,       15L,
  "plate_4",       80L,       90L,       30L,
  "plate_5",       30L,       25L,       95L
)

  
dat2  <- dat1  %>%  pivot_longer(!plate_number,
                         names_to = "marker",
                         values_to = "scores")

dat2$scores <- ifelse(dat2$scores >= 60, dat2$scores, "NA")

dat3 <- dat2  %>%   pivot_wider(names_from = marker, values_from = scores)

technocrat · April 6, 2023, 2:12am

This is simple enough not to absolutely require cut-and-paste reprex (see the FAQ, but it's a good idea to cut friction as much as possible.

This is a concise way to do this, with little syntax to master. I'll unpack it below

# fake data created by random sampling 
# without a seed, so they are likely to
# be all different and different each
# time data frame is created with this
# snippet
m <- matrix(
  c(plate = 1:50,
  sample(20:100,50, replace = TRUE),
  sample(20:100,50, replace = TRUE),
  sample(20:100,50, replace = TRUE)),
  nrow = 50,
  ncol = 4
)

colnames(m) <- c("plate","m1","m2","m3")
head(m)
#>      plate m1 m2  m3
#> [1,]     1 32 72  51
#> [2,]     2 47 60  89
#> [3,]     3 53 84  23
#> [4,]     4 94 98  78
#> [5,]     5 50 72 100
#> [6,]     6 98 61  47

mark_na <- function(x) ifelse(x < 60,NA,x)

m[,2:4] <- apply(m[,2:4],2,mark_na)

head(m)
#>      plate m1 m2  m3
#> [1,]     1 NA 72  NA
#> [2,]     2 NA 60  89
#> [3,]     3 NA 84  NA
#> [4,]     4 94 98  78
#> [5,]     5 NA 72 100
#> [6,]     6 98 61  NA

^{Created on 2023-04-05 with reprex v2.0.2}

My paradigm of using R is school algebra—f(x)=y. x is an object that needs some transformation, y is the object containing the transformation and f is the function object that does the transformation. Each of these may be, and usually is, composite.
The object chosen for x has a big influence on f. I've used a matrix because all the contents to be subject to f is numeric. A matrix must be either all character or all numeric. Internally, both columns and rows are vectors. A data frame, which is where incoming data usually lands, can mix character and numeric types. *However, both columns and rows are lists. This is an important difference because a matrix can be treated as a single object and transformed more simply.
The function isolates the logical condition to be tested—whether a value is less than 61, because those are the values to be replaced with NA.
The matrix object, m, has objects and rows. Here m[,2:4] means all rows of m (because the row position in the brackets is empty and columns 2:4 (if we wanted only the second and fourth column, it would be m[,c(2,4)]). Think row/column, row/column. If only dealing with columns, it can be shorthanded m[2:3] which we usually do. When we want to change only some rows, it would be `m[1:7,2:3]. I find it helpful to always have the comma—one less thing to keep track of.
At this point, we know that we are changing every thing in m except the first, plate column with a value of less than 61 to NA and we know how. Now, we do that in a single pass by applying our function to the target columns by columns (we could also do it row-wise). That's what apply does.
As far as variable dimensions, dim() works like the subset operator, row/column. m is

> dim(m)
[1] 50  4

The script will work for any numbers of rows. Some wand waving is required for a variable number of columns. Come back with a reprex if you need help with that case.

MN_AGRICO · April 6, 2023, 1:35pm

Thank you for your input, it is much appriciated and it helps me to learn bit by bit!

MN_AGRICO · April 6, 2023, 1:40pm

Dear technocrat,

Thanks a lot, your clear explanation was very helpful! I used ncol() to check the column number of my df and the output is saved as a value (NC) . This value was then used in your example, [,2:NC]. It does the trick and will no work for every dataframe with variable column numbers!

Thanks a lot!

system · April 13, 2023, 1:40pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.