Specific function to get a value matching row criteria (poor title)

cwright1 · April 13, 2021, 1:55pm

Hi everyone,

I'm needing to perform a specific task on my very large data frame, but I cannot put the terminology into correct words - sorry about that. I've spent a long time on this and I feel like it should be easy but I just can't get it working.

Minimal reprex:

Let's say we have 2 colors A and B and we are mixing them in various quantities and looking at the resulting intensity (I know that's kind of dumb but this is just for example):

mydf <- data.frame("colorA_oz"=c(2,4,6,8,0,0,0,0,2,4,6,8),
                   "colorB_oz"=c(1,5,7,11,1,5,7,11,0,0,0,0),
                   "observed_intensity"=c(3,5,8,8,9,12,13,11,1,10,3,15))

There are 4 different combinations of volume (oz) for each color - plus each color on its own (indicated by the other color having a quantity of 0 oz). I made up an 'observed intensity' column.

My question: How can I get the 'expected intensity' value?
For the first row: 2oz of colorA + 1oz of colorB = intensity of 3. To get the 'expected' value, we lookup colorA intensity alone, colorB intensity alone, and add them together. For the first row the answer should be 10.

Because in row 1.....colorA=2. So find the row where colorA=2 and colorB=0 and take the observed intensity (it's 1). Then also in row1, colorB=1. So find the row where colorA=0 and colorB=1, and this time the intensity is 9. Add them together for a value of 10.

How can I do this for all rows?

mara · April 14, 2021, 1:22pm

Hopefully this helps you get started. A different way of thinking about this data is that you have color "groups" that we'll identify by id (I'm using row numbers before I transform the data), and for each id you have a data frame with the id, the "type" of color (for lack of a better word), e.g., colorA_oz or colorB_oz, and then you have the observed intensity (see the first data frame in the reprex I've created below).

For the second data frame I've created below, you don't need id, because I'm isolating cases where the other color was zero (if you had multiple matches for a given value of one color where the other was equal to zero, you'd run into trouble, but, then again, the method you've described above wouldn't work anyhow in such a case).

Parts I've left to you:
You can use the zero_cases as a sort of "lookup table" in combination with pivoted_df to find the expected value of a given color by matching (performing a join) on color_type and color_val to get the expected intensity for a given color type and value. You would then calculate the expected value for the group by summarizing by rowid .

library(tidyverse)

mydf <- data.frame("colorA_oz"=c(2,4,6,8,0,0,0,0,2,4,6,8),
                   "colorB_oz"=c(1,5,7,11,1,5,7,11,0,0,0,0),
                   "observed_intensity"=c(3,5,8,8,9,12,13,11,1,10,3,15))

mydf %>%
  rowid_to_column() %>%
  pivot_longer(c("colorA_oz", "colorB_oz"), names_to = "color_type", values_to = "color_val")
#> # A tibble: 24 x 4
#>    rowid observed_intensity color_type color_val
#>    <int>              <dbl> <chr>          <dbl>
#>  1     1                  3 colorA_oz          2
#>  2     1                  3 colorB_oz          1
#>  3     2                  5 colorA_oz          4
#>  4     2                  5 colorB_oz          5
#>  5     3                  8 colorA_oz          6
#>  6     3                  8 colorB_oz          7
#>  7     4                  8 colorA_oz          8
#>  8     4                  8 colorB_oz         11
#>  9     5                  9 colorA_oz          0
#> 10     5                  9 colorB_oz          1
#> # … with 14 more rows

zero_cases <- mydf %>%
  filter(colorA_oz == 0 | colorB_oz == 0) %>%
  pivot_longer(!observed_intensity, names_to = "color_type", values_to = "color_val") %>%
  relocate("observed_intensity", .after = everything()) %>%
  filter(color_val > 0)

zero_cases
#> # A tibble: 8 x 3
#>   color_type color_val observed_intensity
#>   <chr>          <dbl>              <dbl>
#> 1 colorB_oz          1                  9
#> 2 colorB_oz          5                 12
#> 3 colorB_oz          7                 13
#> 4 colorB_oz         11                 11
#> 5 colorA_oz          2                  1
#> 6 colorA_oz          4                 10
#> 7 colorA_oz          6                  3
#> 8 colorA_oz          8                 15

^{Created on 2021-04-14 by the reprex package (v2.0.0)}

Edit/bonus hint

You'll get NAs in your join for cases where one of the colors is zero (since I filtered them out), so you can replace the NAs with zero, e.g., using the code below (I'm sure there are other ways to do it as well):

pivoted_df %>%
  left_join(zero_cases) %>%
  mutate(expected_intensity = replace_na(expected_intensity, 0))
#> Joining, by = c("color_type", "color_val")
#> # A tibble: 24 x 5
#>    rowid observed_intensity color_type color_val expected_intensity
#>    <int>              <dbl> <chr>          <dbl>              <dbl>
#>  1     1                  3 colorA_oz          2                  1
#>  2     1                  3 colorB_oz          1                  9
#>  3     2                  5 colorA_oz          4                 10
#>  4     2                  5 colorB_oz          5                 12
#>  5     3                  8 colorA_oz          6                  3
#>  6     3                  8 colorB_oz          7                 13
#>  7     4                  8 colorA_oz          8                 15
#>  8     4                  8 colorB_oz         11                 11
#>  9     5                  9 colorA_oz          0                  0
#> 10     5                  9 colorB_oz          1                  9
#> # … with 14 more rows

^{Created on 2021-04-14 by the reprex package (v2.0.0)}

Edit 2

Realized I forgot to mention that you'd need those row ids in the original data frame if you wanted to return to form in the end:

mydf2 <- mydf %>%
  rowid_to_column() # adds rowid to the data frame you started out with

mydf2 %>%
  left_join(summary_df)
#> Joining, by = "rowid"
#>    rowid colorA_oz colorB_oz observed_intensity expected_intensity
#> 1      1         2         1                  3                 10
#> 2      2         4         5                  5                 22
#> 3      3         6         7                  8                 16
#> 4      4         8        11                  8                 26
#> 5      5         0         1                  9                  9
#> 6      6         0         5                 12                 12
#> 7      7         0         7                 13                 13
#> 8      8         0        11                 11                 11
#> 9      9         2         0                  1                  1
#> 10    10         4         0                 10                 10
#> 11    11         6         0                  3                  3
#> 12    12         8         0                 15                 15

cwright1 · April 14, 2021, 6:52pm

Hi Mara,

Sometimes implementing tasks analogous to this one but with much more complex data can get quickly frustrating. It is people like you, who go out of their way to help others, that both motivate me to keep going and also to help others myself. Thanks for taking the time here. Your example certainly did get me started, and I was able to get the rest completed.

Here's the approach:


mydf <- data.frame("colorA_oz"=c(2,4,6,8,0,0,0,0,2,4,6,8),
                   "colorB_oz"=c(1,5,7,11,1,5,7,11,0,0,0,0),
                   "observed_intensity"=c(3,5,8,8,9,12,13,11,1,10,3,15))


#Get the color intensity of (colorA when colorB=0) or (colorB when colorA=0)
#Thanks Mara!
zero_cases <- mydf %>%
  filter(colorA_oz == 0 | colorB_oz == 0) %>%
  pivot_longer(!observed_intensity, names_to = "color_type", values_to = "color_val") %>%
  relocate("observed_intensity", .after = everything()) %>%
  filter(color_val > 0)

#Isolate by colorA or colorB
#(Note that 'unique' is used here because my own data have duplicated entries)
colorA_isolated <- unique(zero_cases[zero_cases$color_type=="colorA_oz",])
colorB_isolated <- unique(zero_cases[zero_cases$color_type=="colorB_oz",])

#Merge the isolate color data in a way that gives us separate columns:
colors_merged <- merge(colorA_isolated, colorB_isolated, by="color_val", all=TRUE)

#Make unique identifiers by combining color type and color amount (aka color_val)
colors_merged$pasted1 <- paste0(colors_merged$color_type.x, "_", colors_merged$color_val)
colors_merged$pasted2 <- paste0(colors_merged$color_type.y, "_", colors_merged$color_val)

#Clean / just keep the important parts
colors_merged <- colors_merged[,c(3,5,6,7)]

#Let's rename columns to 'exp1' and 'exp2' to represent the expected values when the other drug is zero 
#These will eventually be added together to give the true 'expected' value
names(colors_merged)[names(colors_merged) == "observed_intensity.x"] <- "exp1"
names(colors_merged)[names(colors_merged) == "observed_intensity.y"] <- "exp2"

#Create the same unique identifiers now based on the original data (for the purpose of matching the colors_merged data frame)
mydf$pasted1 <- paste0("colorA_oz", "_", mydf$colorA_oz)
mydf$pasted2 <- paste0("colorB_oz", "_", mydf$colorB_oz)

#Add the exp1 and exp2 columns to the original data:
mydf <- merge(mydf, colors_merged[,c("exp1", "pasted1")], by="pasted1", all.x=TRUE)
mydf <- merge(mydf, colors_merged[,c("exp2", "pasted2")], by ="pasted2", all.x=TRUE)

#Set the NAs in these columns to zero
mydf$exp1[is.na(mydf$exp1)] <- 0
mydf$exp2[is.na(mydf$exp2)] <- 0

#Add the 'exp1' and 'exp2' columns together to get the true expected value:
mydf$expected <- mydf$exp1 + mydf$exp2
mydf$expected <- mydf$exp1 + mydf$exp2

#Just for cleanup, let's remove 'exp1' and 'exp2'
mydf$exp1 <- NULL
mydf$exp2 <- NULL

Now the result:

mydf

        pasted2     pasted1 colorA_oz colorB_oz observed_intensity expected
1   colorB_oz_0 colorA_oz_2         2         0                  1        1
2   colorB_oz_0 colorA_oz_4         4         0                 10       10
3   colorB_oz_0 colorA_oz_6         6         0                  3        3
4   colorB_oz_0 colorA_oz_8         8         0                 15       15
5   colorB_oz_1 colorA_oz_0         0         1                  9        9
6   colorB_oz_1 colorA_oz_2         2         1                  3       10
7  colorB_oz_11 colorA_oz_8         8        11                  8       26
8  colorB_oz_11 colorA_oz_0         0        11                 11       11
9   colorB_oz_5 colorA_oz_4         4         5                  5       22
10  colorB_oz_5 colorA_oz_0         0         5                 12       12
11  colorB_oz_7 colorA_oz_0         0         7                 13       13
12  colorB_oz_7 colorA_oz_6         6         7                  8       16

It's correct. Thanks again, you really saved the day

mara · April 14, 2021, 7:07pm

woohoo! There's something deeply satisfying about seeing the observed_intensity and expected be equal in cases where one of the observed colors is zero!

Glad to have helped. I'm going to mark your response as the solution for any future travelers who come across a similar issue.

system · April 21, 2021, 7:08pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.