Create new columns based on rows

vd_18 · February 23, 2021, 2:01am

I have a dataset that looks like so:

ex <- structure(list(reg_desc = c("1-Northeast Region", "1-Northeast Region", 
"1-Northeast Region", "1-Northeast Region", "1-Northeast Region"
), state = c("04-Connecticut", "05-Maine", "04-Connecticut", 
"05-Maine", NA), trigger_city = c("14860-Bridgeport-Stamford-Norwalk", 
"12620-Bangor", NA, NA, NA), Category = c("M", "M", "S", "S", 
"R"), Cred_Fac = c(0, 0, 0.317804971641414, 0, 1), Mean = c(50323.3311111111, 
48944.4266666667, 44220.8220792079, 43724.1495, 50492.0654351396
)), row.names = c(1L, 7L, 118L, 119L, 136L), class = "data.frame")

Rows containing M in the Category column are metropolitan, S for state, and R for region. I'm trying to find a way where I could create new columns, State_Cred, State_Mean, Reg_Cred, Reg_Mean and fill in the columns based on matching geography. So, my end result is:

hi1 <- data.frame(reg_desc = c("1-Northeast Region", "1-Northeast Region", 
                                  "1-Northeast Region", "1-Northeast Region", "1-Northeast Region"
), state = c("04-Connecticut", "05-Maine", "04-Connecticut", 
             "05-Maine", NA), trigger_city = c("14860-Bridgeport-Stamford-Norwalk", 
                                               "12620-Bangor", NA, NA, NA), Category = c("M", "M", "S", "S", 
                                                                                         "R"), Cred_Fac = c(0, 0, 0.317804971641414, 0, 1), Mean = c(50323.3311111111, 
                                                                                                                                                     48944.4266666667, 44220.8220792079, 43724.1495, 50492.0654351396),
State_Cred_Fac = c(0.317805,0.000000,NA,NA,NA),Mean_State = c(44220.82,43724.15,NA,NA,NA),
Reg_Cred_Fac = c(1.000000,1.000000,1.000000,1.000000,NA),
Mean_Region = c(50492.07,50492.07,50492.07,50492.07,NA))

So far I've manually been inserting the values, but I'd like to know if there is a way I could do this with code instead?
Thanks in advance

technocrat · February 23, 2021, 7:12am

Thanks for the complete minimal working example.

Every R problem can be thought of with advantage as the interaction of three objects— an existing object, x , a desired object,y , and a function, f, that will return a value of y given x as an argument. In other words, school algebra— f(x) = y. Any of the objects can be composites.

The reprex below illustrates ex in the role of x and target in the role of y.

suppressPackageStartupMessages({
  library(dplyr)
})

ex <- structure(list(reg_desc = c(
  "1-Northeast Region", "1-Northeast Region",
  "1-Northeast Region", "1-Northeast Region", "1-Northeast Region"
), state = c(
  "04-Connecticut", "05-Maine", "04-Connecticut",
  "05-Maine", NA
), trigger_city = c(
  "14860-Bridgeport-Stamford-Norwalk",
  "12620-Bangor", NA, NA, NA
), Category = c(
  "M", "M", "S", "S","R"
), Cred_Fac = c(0, 0, 0.317804971641414, 0, 1), Mean = c(
  50323.3311111111,
  48944.4266666667, 44220.8220792079, 43724.1495, 50492.0654351396
)), row.names = c(1L, 7L, 118L, 119L, 136L), class = "data.frame")

hi1 <- data.frame(
  reg_desc = c("1-Northeast Region", "1-Northeast Region", "1-Northeast Region", "1-Northeast Region", "1-Northeast Region"),
  state = c("04-Connecticut", "05-Maine", "04-Connecticut", "05-Maine", NA),
  trigger_city = c("14860-Bridgeport-Stamford-Norwalk", "12620-Bangor", NA, NA, NA),
  Category = c("M", "M", "S", "S", "R"),
  Cred_Fac = c(0, 0, 0.317804971641414, 0, 1),
  Mean = c(50323.3311111111, 48944.4266666667, 44220.8220792079, 43724.1495, 50492.0654351396),
  State_Cred_Fac = c(0.317805, 0.000000, NA, NA, NA), Mean_State = c(44220.82, 43724.15, NA, NA, NA),
  Reg_Cred_Fac = c(1.000000, 1.000000, 1.000000, 1.000000, NA),
  Mean_Region = c(50492.07, 50492.07, 50492.07, 50492.07, NA)
)

# simplify target object 

target <- hi1[,7:10]

# inspect both

ex
#>               reg_desc          state                      trigger_city
#> 1   1-Northeast Region 04-Connecticut 14860-Bridgeport-Stamford-Norwalk
#> 7   1-Northeast Region       05-Maine                      12620-Bangor
#> 118 1-Northeast Region 04-Connecticut                              <NA>
#> 119 1-Northeast Region       05-Maine                              <NA>
#> 136 1-Northeast Region           <NA>                              <NA>
#>     Category Cred_Fac     Mean
#> 1          M 0.000000 50323.33
#> 7          M 0.000000 48944.43
#> 118        S 0.317805 44220.82
#> 119        S 0.000000 43724.15
#> 136        R 1.000000 50492.07
target
#>   State_Cred_Fac Mean_State Reg_Cred_Fac Mean_Region
#> 1       0.317805   44220.82            1    50492.07
#> 2       0.000000   43724.15            1    50492.07
#> 3             NA         NA            1    50492.07
#> 4             NA         NA            1    50492.07
#> 5             NA         NA           NA          NA

The unique rows of the ex object represent an observation of a single subject. In the first row of ex, the subject is subdivided into four variables

Northeast Region 04-Connecticut 14860-Bridgeport-Stamford-Norwalk        M

with the two observations

Cred_Fac     Mean

The hi1 object will add

State_Cred_Fac Mean_State Reg_Cred_Fac Mean_Region

which are the means, respectively of the means of the Cred_Fac variables for all the records records for states and regions to which the record in the row belongs.

We'll assemble f piecewise.

The return values sought are typeof numeric, consisting of a function on the some subset of the vectors of ex$Cred_Fac and ex$Mean. This is the innermost operation from which the rest will start. The function is mean, and we'll use the optional argument to deal with missing values, NAs.

mean(some_vector, na.rm = TRUE)

(By the way, it's good practice to encode missing values as NA, rather than zero, since mean with the na.rm = TRUE argument will not mess up the denominator the same way as using 0.00.)

The next step is finding another function that plucks the right subset of the vector of means that we will be sending to mean. For that we can use '{dplyr}group_by`

> ex %>% group_by(state) %>% summarize(State_Cred_Fac = mean(Cred_Fac))
# A tibble: 3 x 2
  state          State_Cred_Fac
* <chr>                   <dbl>
1 04-Connecticut          0.159
2 05-Maine                0    
3 NA                      1

The results illustrate the point about using 0 to encode missing

> mean(ex[c(1,3),5])
[1] 0.1589025
# compare
> mean(c(ex[3,5],NA), na.rm = TRUE)
[1] 0.317805

A helper function, c, the combine function, has been introduced to gather up the values into a vector. If you provide multiple numeric or logical arguments without enclosing them with c

# wrong, second and third arguments are ignored
> mean(1,2,4)
[1] 1
# right
> mean(c(1,2,4))
[1] 2.333333

To convert the errant zeros to NA

ex %>% mutate(Cred_Fac = ifelse(Cred_Fac == 0,NA,Cred_Fac))

Similarly for ex$Mean, if applicable,

  state          State_Mean
* <chr>               <dbl>
1 04-Connecticut     47272.
2 05-Maine           46334.
3 NA                 50492.

There remains taking the return values of the group_by operations and using them to construct the state-level variables in target. (The region-level works similarly; the intermediate variables are solely for clarity in exposition.)

ex %>% group_by(state) %>% summarize(State_Mean = mean(Mean)) -> part1
ex %>% group_by(state) %>% summarize(State_Cred_Fac = mean(Cred_Fac)) -> part2
inner_join(ex,part1,key="state") %>% inner_join(.,part2)
Joining, by = "state"
Joining, by = "state"
            reg_desc          state                      trigger_city Category
1 1-Northeast Region 04-Connecticut 14860-Bridgeport-Stamford-Norwalk        M
2 1-Northeast Region       05-Maine                      12620-Bangor        M
3 1-Northeast Region 04-Connecticut                              <NA>        S
4 1-Northeast Region       05-Maine                              <NA>        S
5 1-Northeast Region           <NA>                              <NA>        R
  Cred_Fac     Mean State_Mean State_Cred_Fac
1 0.000000 50323.33   47272.08      0.1589025
2 0.000000 48944.43   46334.29      0.0000000
3 0.317805 44220.82   47272.08      0.1589025
4 0.000000 43724.15   46334.29      0.0000000
5 1.000000 50492.07   50492.07      1.0000000

system · March 2, 2021, 7:12am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.