Sorting data based on adjacent cell value

ernijs12 · May 8, 2022, 1:12pm

Dear reader,

I am in great need of help. I have data on purchased items, for example, hammer, nails, saw, ax. Each row is a purchase order of one of these items.
Each order has a supplier denoted with a supplier number, and a date when it was ordered.

I have to sort all of each items observations into one of four groups:

Single supplier to multiple
Multiple suppliers to single
Only single
Only multiple

Single supplier to multiple should hold items that during first years had only one supplier and then in later years had 2 or more.

Multiple suppliers to single should hold items that during first years where sourced from 2 or more suppliers and then in later years only had one supplier.

Only single should hold items that throughout all years have been sourced from only one supplier.

Only multiple should hold items that have been sourced from multiple supplier every year.

I know it is hard to understand this problem from the description, so I have attached an example. In the picture the table on left hand side is the unedited table and on right hand side are the 4 filtered tables, the result I am trying to get. This example has only 2 yeas (2020, 2021) but the real data set will have more.

If you have any ideas, it would be greatly appreciated. huge thanks in advance!

ernijs12 · May 9, 2022, 10:08am

I was thinking maybe an if loop

AlexisW · May 10, 2022, 4:23am

It would help if you shared your data e.g. with dput(), so that we can directly work with it. See here for instructions about making a reprex.

So I just re-typed some minimal example, with fake data:

library(tidyverse)
df <- tibble(item = rep(c("ax","nail","hammer","saw"), each = 3),
                         supplier_number = c(1,2,2,
                                                                3,3,3,
                                                                3,3,4,
                                                                5,5,6),
                         date = c("01/01/2020","01/02/2020","01/01/2021",
                                         "01/01/2020","01/02/2020","01/01/2021",
                                         "01/01/2020","01/02/2020","01/01/2021",
                                         "01/01/2020","01/01/2021","01/02/2021")) |>
    mutate(date = as.Date(date, "%m/%d/%Y"))

df
#> # A tibble: 12 x 3
#>    item   supplier_number date      
#>    <chr>            <dbl> <date>    
#>  1 ax                   1 2020-01-01
#>  2 ax                   2 2020-01-02
#>  3 ax                   2 2021-01-01
#>  4 nail                 3 2020-01-01
#>  5 nail                 3 2020-01-02
#>  6 nail                 3 2021-01-01
#>  7 hammer               3 2020-01-01
#>  8 hammer               3 2020-01-02
#>  9 hammer               4 2021-01-01
#> 10 saw                  5 2020-01-01
#> 11 saw                  5 2021-01-01
#> 12 saw                  6 2021-01-02

So, first thing, I'll take this data, extract the year from the date, and, for each item, find out what is the first year for which we have data. Then I'll filter the data frame to only keep the first year, and count the number of rows: this tells me how many suppliers we had on the first year.

first_year <- df |>
    mutate(year = lubridate::year(date)) |>
    group_by(item) |>
    filter(year == min(year)) |>
    ungroup() |>
    select(item, supplier_number) |>
    distinct() |>
    group_by(item) |>
    summarize(nb_suppliers_first_year = n())
first_year
#> # A tibble: 4 x 2
#>   item   nb_suppliers_first_year
#>   <chr>                    <int>
#> 1 ax                           2
#> 2 hammer                       1
#> 3 nail                         1
#> 4 saw                          1

Similarly, I can find out how many suppliers we had on the other years: this time I'll filter the data frame to keep only, for each item, the rows which do NOT correspond to the first year:

later_years <- df |>
    mutate(year = lubridate::year(date)) |>
    group_by(item) |>
    filter(year != min(year)) |>
    ungroup() |>
    select(item, supplier_number) |>
    distinct() |>
    group_by(item) |>
    summarize(nb_suppliers_later_years = n())
later_years
#> # A tibble: 4 x 2
#>   item   nb_suppliers_later_years
#>   <chr>                     <int>
#> 1 ax                            1
#> 2 hammer                        1
#> 3 nail                          1
#> 4 saw                           2

Now that I have this data for each item, I can re-inject that into the initial dataframe:

df2 <- df |>
    left_join(first_year,
                        by = "item") |>
    left_join(later_years,
                        by = "item")
df2
#> # A tibble: 12 x 5
#>    item   supplier_number date       nb_suppliers_first_year nb_suppliers_later~
#>    <chr>            <dbl> <date>                       <int>               <int>
#>  1 ax                   1 2020-01-01                       2                   1
#>  2 ax                   2 2020-01-02                       2                   1
#>  3 ax                   2 2021-01-01                       2                   1
#>  4 nail                 3 2020-01-01                       1                   1
#>  5 nail                 3 2020-01-02                       1                   1
#>  6 nail                 3 2021-01-01                       1                   1
#>  7 hammer               3 2020-01-01                       1                   1
#>  8 hammer               3 2020-01-02                       1                   1
#>  9 hammer               4 2021-01-01                       1                   1
#> 10 saw                  5 2020-01-01                       1                   2
#> 11 saw                  5 2021-01-01                       1                   2
#> 12 saw                  6 2021-01-02                       1                   2

At this point, we're done, we can simply filter, for example to keep the rows where the number of suppliers first year is 1 and there are more than 1 suppliers in the other years (and we can do something similar for all your combinations of interest):

df2 |>
    filter(nb_suppliers_first_year == 1,
                 nb_suppliers_later_years > 1)
#> # A tibble: 3 x 5
#>   item  supplier_number date       nb_suppliers_first_year nb_suppliers_later_y~
#>   <chr>           <dbl> <date>                       <int>                 <int>
#> 1 saw                 5 2020-01-01                       1                     2
#> 2 saw                 5 2021-01-01                       1                     2
#> 3 saw                 6 2021-01-02                       1                     2

^{Created on 2022-05-10 by the reprex package (v2.0.1)}

ernijs12 · May 12, 2022, 3:10pm

Thank you very much Alexis. I really appreciate you taking your time and helping me with this. You actually really did help and I will take your advice about dput() for future.

system · May 19, 2022, 3:10pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.