Thanks for the complete minimal working example.
Every R
problem can be thought of with advantage as the interaction of three objects— an existing object, x , a desired object,y , and a function, f, that will return a value of y given x as an argument. In other words, school algebra— f(x) = y. Any of the objects can be composites.
The reprex
below illustrates ex
in the role of x and target
in the role of y.
suppressPackageStartupMessages({
library(dplyr)
})
ex <- structure(list(reg_desc = c(
"1-Northeast Region", "1-Northeast Region",
"1-Northeast Region", "1-Northeast Region", "1-Northeast Region"
), state = c(
"04-Connecticut", "05-Maine", "04-Connecticut",
"05-Maine", NA
), trigger_city = c(
"14860-Bridgeport-Stamford-Norwalk",
"12620-Bangor", NA, NA, NA
), Category = c(
"M", "M", "S", "S","R"
), Cred_Fac = c(0, 0, 0.317804971641414, 0, 1), Mean = c(
50323.3311111111,
48944.4266666667, 44220.8220792079, 43724.1495, 50492.0654351396
)), row.names = c(1L, 7L, 118L, 119L, 136L), class = "data.frame")
hi1 <- data.frame(
reg_desc = c("1-Northeast Region", "1-Northeast Region", "1-Northeast Region", "1-Northeast Region", "1-Northeast Region"),
state = c("04-Connecticut", "05-Maine", "04-Connecticut", "05-Maine", NA),
trigger_city = c("14860-Bridgeport-Stamford-Norwalk", "12620-Bangor", NA, NA, NA),
Category = c("M", "M", "S", "S", "R"),
Cred_Fac = c(0, 0, 0.317804971641414, 0, 1),
Mean = c(50323.3311111111, 48944.4266666667, 44220.8220792079, 43724.1495, 50492.0654351396),
State_Cred_Fac = c(0.317805, 0.000000, NA, NA, NA), Mean_State = c(44220.82, 43724.15, NA, NA, NA),
Reg_Cred_Fac = c(1.000000, 1.000000, 1.000000, 1.000000, NA),
Mean_Region = c(50492.07, 50492.07, 50492.07, 50492.07, NA)
)
# simplify target object
target <- hi1[,7:10]
# inspect both
ex
#> reg_desc state trigger_city
#> 1 1-Northeast Region 04-Connecticut 14860-Bridgeport-Stamford-Norwalk
#> 7 1-Northeast Region 05-Maine 12620-Bangor
#> 118 1-Northeast Region 04-Connecticut <NA>
#> 119 1-Northeast Region 05-Maine <NA>
#> 136 1-Northeast Region <NA> <NA>
#> Category Cred_Fac Mean
#> 1 M 0.000000 50323.33
#> 7 M 0.000000 48944.43
#> 118 S 0.317805 44220.82
#> 119 S 0.000000 43724.15
#> 136 R 1.000000 50492.07
target
#> State_Cred_Fac Mean_State Reg_Cred_Fac Mean_Region
#> 1 0.317805 44220.82 1 50492.07
#> 2 0.000000 43724.15 1 50492.07
#> 3 NA NA 1 50492.07
#> 4 NA NA 1 50492.07
#> 5 NA NA NA NA
The unique rows of the ex
object represent an observation of a single subject. In the first row of ex
, the subject is subdivided into four variables
Northeast Region 04-Connecticut 14860-Bridgeport-Stamford-Norwalk M
with the two observations
Cred_Fac Mean
The hi1
object will add
State_Cred_Fac Mean_State Reg_Cred_Fac Mean_Region
which are the means, respectively of the means of the Cred_Fac variables for all the records records for states and regions to which the record in the row belongs.
We'll assemble f piecewise.
- The return values sought are typeof numeric, consisting of a function on the some subset of the vectors of
ex$Cred_Fac
and ex$Mean
. This is the innermost operation from which the rest will start. The function is mean
, and we'll use the optional argument to deal with missing values, NA
s.
mean(some_vector, na.rm = TRUE)
(By the way, it's good practice to encode missing values as NA
, rather than zero, since mean
with the na.rm = TRUE
argument will not mess up the denominator the same way as using 0.00.)
- The next step is finding another function that plucks the right subset of the vector of means that we will be sending to
mean
. For that we can use '{dplyr}group_by`
> ex %>% group_by(state) %>% summarize(State_Cred_Fac = mean(Cred_Fac))
# A tibble: 3 x 2
state State_Cred_Fac
* <chr> <dbl>
1 04-Connecticut 0.159
2 05-Maine 0
3 NA 1
The results illustrate the point about using 0 to encode missing
> mean(ex[c(1,3),5])
[1] 0.1589025
# compare
> mean(c(ex[3,5],NA), na.rm = TRUE)
[1] 0.317805
A helper function, c
, the combine function, has been introduced to gather up the values into a vector. If you provide multiple numeric or logical arguments without enclosing them with c
# wrong, second and third arguments are ignored
> mean(1,2,4)
[1] 1
# right
> mean(c(1,2,4))
[1] 2.333333
To convert the errant zeros to NA
ex %>% mutate(Cred_Fac = ifelse(Cred_Fac == 0,NA,Cred_Fac))
Similarly for ex$Mean
, if applicable,
state State_Mean
* <chr> <dbl>
1 04-Connecticut 47272.
2 05-Maine 46334.
3 NA 50492.
- There remains taking the return values of the
group_by
operations and using them to construct the state-level variables in target
. (The region-level works similarly; the intermediate variables are solely for clarity in exposition.)
ex %>% group_by(state) %>% summarize(State_Mean = mean(Mean)) -> part1
ex %>% group_by(state) %>% summarize(State_Cred_Fac = mean(Cred_Fac)) -> part2
inner_join(ex,part1,key="state") %>% inner_join(.,part2)
Joining, by = "state"
Joining, by = "state"
reg_desc state trigger_city Category
1 1-Northeast Region 04-Connecticut 14860-Bridgeport-Stamford-Norwalk M
2 1-Northeast Region 05-Maine 12620-Bangor M
3 1-Northeast Region 04-Connecticut <NA> S
4 1-Northeast Region 05-Maine <NA> S
5 1-Northeast Region <NA> <NA> R
Cred_Fac Mean State_Mean State_Cred_Fac
1 0.000000 50323.33 47272.08 0.1589025
2 0.000000 48944.43 46334.29 0.0000000
3 0.317805 44220.82 47272.08 0.1589025
4 0.000000 43724.15 46334.29 0.0000000
5 1.000000 50492.07 50492.07 1.0000000