Easy in Excel, difficult in R - advanced ifelse based on a grid?

Slavek · January 17, 2023, 7:59pm

I have this df where Annual.Visits is just a number

data.source <- data.frame(
  stringsAsFactors = FALSE,
  RegNo = c("1MAM","1MAM","1MAM","1MAM",
            "1MAM","1MAM","1MAM","1MAM","1MAM","1MAM","1MAM",
            "1MAM","1MAM","1MAM","181DL2","181DL2","181DL2",
            "181DL2","181DL2","181DL2","3GRL","3GRL","3GRL","3GRL",
            "3GRL","3GRL","3GRL","3GRL","3GRL","3GRL","3GRL",
            "3GRL","3GRL","3GRL","3GRL","3GRL","3GRL","3GRL",
            "3GRL","589UOO","589UOO","589UOO","589UOO","589UOO",
            "589UOO","589UOO","589UOO","589UOO","589UOO",
            "589UOO"),
  Year = c(2022,2022,2022,2022,2022,
           2022,2021,2021,2021,2020,2020,2019,2019,2019,2019,
           2019,2019,2019,2019,2019,2022,2022,2021,2021,
           2021,2021,2021,2021,2020,2020,2020,2020,2020,2019,
           2019,2019,2019,2018,2018,2022,2022,2022,2020,
           2020,2020,2020,2020,2020,2019,2019),
  Annual.Visits = c(6,6,6,6,6,6,3,3,3,2,2,
                    3,3,3,6,6,6,6,6,6,2,2,6,6,6,6,6,6,5,5,
                    5,5,5,4,4,4,4,2,2,3,3,3,6,6,6,6,6,6,2,
                    2)
)

data.source

first stage is this

stage1 <- data.frame(
  stringsAsFactors = FALSE,
  check.names = FALSE,
  ...1 = c("181DL2", "1MAM", "3GRL", "589UOO"),
  `2018` = c(NA, NA, 2, NA),
  `2019` = c(6, 3, 4, 2),
  `2020` = c(NA, 2, 5, 6),
  `2021` = c(NA, 3, 6, NA),
  `2022` = c(NA, 6, 2, 3)
)

stage1

as a result I think I need to get this:

result <- data.frame(
  stringsAsFactors = FALSE,
  check.names = FALSE,
  ...1 = c("181DL2", "1MAM", "3GRL", "589UOO"),
  `2018` = c(NA, 0, 1, 0),
  `2019` = c(NA, 1, 1, 1),
  `2020` = c(NA, 1, 1, 1),
  `2021` = c(NA, 1, 1, 0)
)

result

this is very simple in excel but complicated in R.
As a result I know that 1 of 2018 RegNo were repeated in 2022, 3 of 2018 RegNo were repeated in 2022, 3 of 2020 RegNo were repeated in 2022 and 2 of RegNo were repeated in 2022. Therefore 25% of 2018 RegNo were repeated in 2022, 75% of 2019 RegNo were repeated in 2022, 75% of 2020 RegNo were repeated in 2022 and 50% of 2021 RegNo were repeated in 2022..

Excel formula between stage1 and result is very simple in Excel:
=IF($G5>0,IF(AND(B5>0,$G5>0),1,0),"")

Xanadu · January 17, 2023, 8:25pm

Better off using SQL at that point.. Better performance over large content.

andresrcs · January 17, 2023, 9:02pm

What is wrong with using Excel then? It is perfectly valid to use the tool you are most fluent on as long as it gets the job done.

If this is more of a didactic exercise to transition into R, I think you should start by showing what have you tried so far and what is your specific coding issue.

Slavek · January 18, 2023, 6:24pm

I know but Excel is not good in dealing with millions of records whereas R is!

This is what I've done:

library(dplyr)
stage2 <- data.source %>% 
  group_by(RegNo) %>%
  mutate(
  Back2018 = case_when(
    (Year==2018 & Annual.Visits > 0) ~ 1),
  Back2019 = case_when(
    (Year==2019 & Annual.Visits > 0) ~ 1),
  Back2020 = case_when(
    (Year==2020 & Annual.Visits > 0) ~ 1),
  Back2021 = case_when(
    (Year==2021 & Annual.Visits > 0) ~ 1))

stage2

stage3 <- stage2 %>%
  group_by(RegNo) %>%
  bind_rows(mutate(.data = ., RegNo = "Total")) %>%
  summarise_at(.vars = vars(starts_with(match = "Back")),.funs = list(Sc = ~mean(.,na.rm=TRUE)))

stage3

but two things are missing,

I should not have any records for 181DL2 as there is no appearance of this RegNo in 2022 - I should have NAs for this RegNo
I should get 0s instead of NAs for all 4 mutate statements but all my attempts with ELSE in case_when failed

andresrcs · January 18, 2023, 7:33pm

Sorry but I fail to understand the logic behind your code and I see no way to achieve what you are describing with the sample data you are providing. You have multiple repeated observations without an identifier so by simply looking at this and without any context, it seems impossible to know if any of the RegNo are "coming back" in 2022. Even if you omit that fact, your code is not achieving what you are saying, it is simply showing if at least one record exists on any given year but 2022, to make it more evident to you, a simple pivoting produces the same output.

library(tidyverse)

data.source <- data.frame(
    stringsAsFactors = FALSE,
    RegNo = c("1MAM","1MAM","1MAM","1MAM",
              "1MAM","1MAM","1MAM","1MAM","1MAM","1MAM","1MAM",
              "1MAM","1MAM","1MAM","181DL2","181DL2","181DL2",
              "181DL2","181DL2","181DL2","3GRL","3GRL","3GRL","3GRL",
              "3GRL","3GRL","3GRL","3GRL","3GRL","3GRL","3GRL",
              "3GRL","3GRL","3GRL","3GRL","3GRL","3GRL","3GRL",
              "3GRL","589UOO","589UOO","589UOO","589UOO","589UOO",
              "589UOO","589UOO","589UOO","589UOO","589UOO",
              "589UOO"),
    Year = c(2022,2022,2022,2022,2022,
             2022,2021,2021,2021,2020,2020,2019,2019,2019,2019,
             2019,2019,2019,2019,2019,2022,2022,2021,2021,
             2021,2021,2021,2021,2020,2020,2020,2020,2020,2019,
             2019,2019,2019,2018,2018,2022,2022,2022,2020,
             2020,2020,2020,2020,2020,2019,2019),
    Annual.Visits = c(6,6,6,6,6,6,3,3,3,2,2,
                      3,3,3,6,6,6,6,6,6,2,2,6,6,6,6,6,6,5,5,
                      5,5,5,4,4,4,4,2,2,3,3,3,6,6,6,6,6,6,2,
                      2)
)

data.source %>%
    arrange(Year) %>% 
    filter(Year != 2022) %>% 
    pivot_wider(id_cols = RegNo, names_from = Year, names_prefix = "Back_", values_from = Annual.Visits, values_fn = ~ as.numeric(any(. > 0))) %>% 
    bind_rows(summarise(., RegNo = "Total", across(.cols = -RegNo, .fns = ~ as.numeric(any(.)))))

#> # A tibble: 5 × 5
#>   RegNo  Back_2018 Back_2019 Back_2020 Back_2021
#>   <chr>      <dbl>     <dbl>     <dbl>     <dbl>
#> 1 3GRL           1         1         1         1
#> 2 1MAM          NA         1         1         1
#> 3 181DL2        NA         1        NA        NA
#> 4 589UOO        NA         1         1        NA
#> 5 Total          1         1         1         1

^{Created on 2023-01-18 with reprex v2.0.2}

If you manage to better explain your logic or put together a better reprex I'll take another look at this.

Slavek · January 18, 2023, 7:53pm

Thank you. I new R could do that!
The final result should be generated only for RegNo which are present in 2022 in stage1 so the 181DL2 should be excluded from the final table or all its values should be NAs.
Also, I should get either 0 or 1 for other RegNos which would allow me to calculate average values for all "Back"s.
In the end I am expecting the following averages:
Back2018: 33% (0, 1 and 0)
Back2019: 100% (1, 1 and 1)
Back2020: 100% (1, 1 and 1)
Back2021: 66% (1, 1 and 0)

andresrcs · January 18, 2023, 8:01pm

If I ignore the repeated records I think this is what you are trying to do

library(tidyverse)

data.source <- data.frame(
    stringsAsFactors = FALSE,
    RegNo = c("1MAM","1MAM","1MAM","1MAM",
              "1MAM","1MAM","1MAM","1MAM","1MAM","1MAM","1MAM",
              "1MAM","1MAM","1MAM","181DL2","181DL2","181DL2",
              "181DL2","181DL2","181DL2","3GRL","3GRL","3GRL","3GRL",
              "3GRL","3GRL","3GRL","3GRL","3GRL","3GRL","3GRL",
              "3GRL","3GRL","3GRL","3GRL","3GRL","3GRL","3GRL",
              "3GRL","589UOO","589UOO","589UOO","589UOO","589UOO",
              "589UOO","589UOO","589UOO","589UOO","589UOO",
              "589UOO"),
    Year = c(2022,2022,2022,2022,2022,
             2022,2021,2021,2021,2020,2020,2019,2019,2019,2019,
             2019,2019,2019,2019,2019,2022,2022,2021,2021,
             2021,2021,2021,2021,2020,2020,2020,2020,2020,2019,
             2019,2019,2019,2018,2018,2022,2022,2022,2020,
             2020,2020,2020,2020,2020,2019,2019),
    Annual.Visits = c(6,6,6,6,6,6,3,3,3,2,2,
                      3,3,3,6,6,6,6,6,6,2,2,6,6,6,6,6,6,5,5,
                      5,5,5,4,4,4,4,2,2,3,3,3,6,6,6,6,6,6,2,
                      2)
)

data.source %>%
    group_by(RegNo, Year) %>% 
    summarise(Annual.Visits = unique(Annual.Visits)) %>% 
    arrange(Year) %>% 
    pivot_wider(id_cols = RegNo,
                names_from = Year,
                names_prefix = "Back_",
                values_from = Annual.Visits,
                values_fn = ~ . > 0,
                values_fill = FALSE) %>% 
    rowwise() %>% 
    transmute(across(.cols = -Back_2022, .fns = ~as.numeric(. & Back_2022))) %>%
    ungroup() %>% 
    bind_rows(summarise(., RegNo = "Total", across(.cols = -RegNo, .fns = ~ sum(.)), .groups = "drop"))
#> `summarise()` has grouped output by 'RegNo'. You can override using the
#> `.groups` argument.
#> # A tibble: 5 × 5
#>   RegNo  Back_2018 Back_2019 Back_2020 Back_2021
#>   <chr>      <dbl>     <dbl>     <dbl>     <dbl>
#> 1 3GRL           1         1         1         1
#> 2 181DL2         0         0         0         0
#> 3 1MAM           0         1         1         1
#> 4 589UOO         0         1         1         0
#> 5 Total          1         3         3         2

^{Created on 2023-01-18 with reprex v2.0.2}

Slavek · January 19, 2023, 12:29pm

You are a genius!!!
You are also proving again we can do almost everything with R!

Slavek · January 19, 2023, 1:13pm

I know the task is resolved but I'm dealing with really large data sets so is it possible to have totals on the top of the table instead of standard bottom?

andresrcs · January 19, 2023, 2:28pm

For starters, it is not standard to have a totals row in a data frame, it is a bad practice actually. You have to differentiate a "data frame" meant to contain data for processing and a "table" which is meant for visually presenting data and it usually includes summaries like totals.

Having said that, if you want to put that row at the top just reverse the order of the row bind.

system · January 26, 2023, 2:28pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.