Insert index number between rows based on certain text

Rscotty · December 15, 2018, 10:45am

I'm back seeking your help again. I have messy data and I have cleaned and put in a structured format as much as I could. I need help from this point onward.

Solution based on tidyverse and without writing up new function would be great but all solution is still awesomely appreciated.


library(tidyverse)

#sample dataset used for all examples
sample <- tibble::tribble(
                ~Config,            ~Input,
                    "name",             "BMS",
                   "id",         "1234567",
  "postSetupProcessing",     "SplitbyLine",
                "lines",               "2",
            "DrugGroup", "123,235,531,987",
               "report",          "Market",
               "report",       "Waterfall",
               "report",              "KM",
                 "name",           "Intel",
                   "id",          "434976",
            "DrugGroup", "123,498,412,999",
               "report",          "Market",
               "report",       "Waterfall",
               "report",              "KM",
               "report",  "sankey diagram",
                 "name",         "J and J",
                   "id",           "18745",
                  "new",          "Warfin",
                  "old",         "Warfin2",
               "report",          "Market",
               "report",       "Waterfall"
  )

I want to insert the same index number between the certain text of another column. Column Config has text "name'" reoccurring at various interval of row. I want to insert index starting from 1 to all rows that begin from the row which has "name" text and end just before another row which has "name" text.

expected_output_1 <- tibble::tribble(
                       ~Index,               ~Config,            ~Input,
                           1L,                "name",             "BMS",
                           1L,                  "id",         "1234567",
                           1L, "postSetupProcessing",     "SplitbyLine",
                           1L,               "lines",               "2",
                           1L,           "DrugGroup", "123,235,531,987",
                           1L,              "report",          "Market",
                           1L,              "report",       "Waterfall",
                           1L,              "report",              "KM",
                           2L,                "name",           "Intel",
                           2L,                  "id",          "434976",
                           2L,           "DrugGroup", "123,498,412,999",
                           2L,              "report",          "Market",
                           2L,              "report",       "Waterfall",
                           2L,              "report",              "KM",
                           2L,              "report",  "sankey diagram",
                           3L,                "name",         "J and J",
                           3L,                  "id",           "18745",
                           3L,                 "new",          "Warfin",
                           3L,                 "old",         "Warfin2",
                           3L,              "report",          "Market",
                           3L,              "report",       "Waterfall"
                       )

Using the above expected_output_1 result, I'd like to collapse any repeated text in Config column for each unique index and put them in one cell. For e.g. for index 1, report is repeated 3 times, therefore, it is collapsed into just 1 row and its input filled are all collapsed into one cell.

expected_output_2 <- tibble::tribble(
                       ~Index,               ~Config,                               ~Input,
                    
                           1L,                "name",                                "BMS",
                           1L,                  "id",                            "1234567",
                           1L, "postSetupProcessing",                        "SplitbyLine",
                           1L,               "lines",                                  "2",
                           1L,           "DrugGroup",                    "123,235,531,987",
                           1L,              "report",                "Market,Waterfall,KM",
                           2L,                "name",                              "Intel",
                           2L,                  "id",                             "434976",
                           2L,           "DrugGroup",                    "123,498,412,999",
                           2L,              "report", "Market,Waterfall,KM,sankey diagram",
                           3L,                "name",                            "J and J",
                           3L,                  "id",                              "18745",
                           3L,                 "new",                             "Warfin",
                           3L,                 "old",                            "Warfin2",
                           3L,              "report",                   "Market,Waterfall"
                       )

Any help will be Greatly appreciated

cderv · December 15, 2018, 4:29pm

Here is a solution using tidyverse to manipulate your tables. This is for sure one solution among others but it does the job

library(tidyverse)

#sample dataset used for all examples
sample <- tibble::tribble(
  ~Config,            ~Input,
  "name",             "BMS",
  "id",         "1234567",
  "postSetupProcessing",     "SplitbyLine",
  "lines",               "2",
  "DrugGroup", "123,235,531,987",
  "report",          "Market",
  "report",       "Waterfall",
  "report",              "KM",
  "name",           "Intel",
  "id",          "434976",
  "DrugGroup", "123,498,412,999",
  "report",          "Market",
  "report",       "Waterfall",
  "report",              "KM",
  "report",  "sankey diagram",
  "name",         "J and J",
  "id",           "18745",
  "new",          "Warfin",
  "old",         "Warfin2",
  "report",          "Market",
  "report",       "Waterfall"
)

# initiate a counting index
index <- 0L
# iterate on rows and increment index when name is encountered
expected_output_1  <- sample %>%
  mutate(Index = pmap_int(., function(Config, ...) {
    if(Config == "name") index <<- index + 1L 
    index}))


expected_output_1 %>%
  group_by(Index, Config) %>%
  mutate(Input = paste0(Input, collapse = ",")) %>%
  select(Index, Config, Input)
#> # A tibble: 21 x 3
#> # Groups:   Index, Config [15]
#>    Index Config              Input              
#>    <int> <chr>               <chr>              
#>  1     1 name                BMS                
#>  2     1 id                  1234567            
#>  3     1 postSetupProcessing SplitbyLine        
#>  4     1 lines               2                  
#>  5     1 DrugGroup           123,235,531,987    
#>  6     1 report              Market,Waterfall,KM
#>  7     1 report              Market,Waterfall,KM
#>  8     1 report              Market,Waterfall,KM
#>  9     2 name                Intel              
#> 10     2 id                  434976             
#> # ... with 11 more rows

^{Created on 2018-12-15 by the reprex package (v0.2.1)}

above

<<- is used to increment the index value in GlobalEnv
pmap is used to iterate over rows of the dataframe

Rscotty · December 16, 2018, 12:27am

@cderv Thank you for your solution. The solution exactly meets my expected result.

One side note, would you be able to explain how ... argument works? Also, I have never wrote functions, would you recommend any resources?

mutate(Index = pmap_int(., function(Config, ...)

cderv · December 16, 2018, 8:30am

This is the advanced part of my solution. There may be something simpler...

Look at purrr documentation to see how function are passed to map & friends. here, pmap(data.frame, function) expect a function that will work on rows of your data.frame. So i provided a function with argument matching name of the DF columns. As I only need Config, I provided ..., meaning "every others arguments" and I don't use them. Without, I don't think this work because the rows as three columns, so the function must have three argument.

To know more about functions, there is a chapter in R for data.science

and also others on using purrr and so many great stuff. You should read this book.

system · December 23, 2018, 8:34am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.