I have a data frame with a column containing 47 US states, each duplicated many times. I want to exclude all states except Hawaii (and keep all the Hawaii duplicates). A second column holds the associated numbers (bee colonies lost). I want to calculate each Hawaii entry's percentage of the Hawaii total (e.g. 12%, 38%, 40%, 10%, summing to 100%).
What you are trying to achieve is a standard subset followed by the creation of a new column. There are numerous ways to do this (the outputs below come from one particular draw of sample(); call set.seed() first if you want them reproducible):
Data <- data.frame(
state = rep(c('Hawaii','Nowaii'), each = 10),
value = sample(1:1000,20)
)
# base R
tapply(Data$value, Data$state, FUN = \(x) x/sum(x))
#> $Hawaii
#> [1] 0.23751584 0.04005070 0.07376426 0.22179975 0.02762991 0.06387833
#> [7] 0.10291508 0.02534854 0.08111534 0.12598226
#>
#> $Nowaii
#> [1] 0.082061779 0.174171939 0.165984369 0.039077038 0.148492743 0.019166357
#> [7] 0.157610718 0.007629326 0.184406401 0.021399330
# append [[1]] (or [['Hawaii']]) to the tapply() call to keep only Hawaii
# dplyr (most common)
library('dplyr')
Data |>
filter(state == 'Hawaii') |>
mutate(share = value / sum(value))
#> state value share
#> 1 Hawaii 937 0.23751584
#> 2 Hawaii 158 0.04005070
#> 3 Hawaii 291 0.07376426
#> 4 Hawaii 875 0.22179975
#> 5 Hawaii 109 0.02762991
#> 6 Hawaii 252 0.06387833
#> 7 Hawaii 406 0.10291508
#> 8 Hawaii 100 0.02534854
#> 9 Hawaii 320 0.08111534
#> 10 Hawaii 497 0.12598226
# data.table (faster, but a bit more challenging at the beginning)
library('data.table')
Data <- as.data.table(Data)
Data[state == 'Hawaii',.(share = value/sum(value))]
#> share
#> 1: 0.23751584
#> 2: 0.04005070
#> 3: 0.07376426
#> 4: 0.22179975
#> 5: 0.02762991
#> 6: 0.06387833
#> 7: 0.10291508
#> 8: 0.02534854
#> 9: 0.08111534
#> 10: 0.12598226
# updating by reference with := while subsetting is not really recommended:
# rows outside the subset are left as NA
Data[state == 'Hawaii',share := value/sum(value)]
Data
#> state value share
#> 1: Hawaii 937 0.23751584
#> 2: Hawaii 158 0.04005070
#> 3: Hawaii 291 0.07376426
#> 4: Hawaii 875 0.22179975
#> 5: Hawaii 109 0.02762991
#> 6: Hawaii 252 0.06387833
#> 7: Hawaii 406 0.10291508
#> 8: Hawaii 100 0.02534854
#> 9: Hawaii 320 0.08111534
#> 10: Hawaii 497 0.12598226
#> 11: Nowaii 441 NA
#> 12: Nowaii 936 NA
#> 13: Nowaii 892 NA
#> 14: Nowaii 210 NA
#> 15: Nowaii 798 NA
#> 16: Nowaii 103 NA
#> 17: Nowaii 847 NA
#> 18: Nowaii 41 NA
#> 19: Nowaii 991 NA
#> 20: Nowaii 115 NA
# collapse (as fast as data.table but with a closer syntax to dplyr)
library('collapse')
Data |>
fsubset(state == 'Hawaii') |>
# notice fsum() instead of sum() - collapse comes with fast statistical functions
fmutate(share = value / fsum(value))
#> state value share
#> 1: Hawaii 937 0.23751584
#> 2: Hawaii 158 0.04005070
#> 3: Hawaii 291 0.07376426
#> 4: Hawaii 875 0.22179975
#> 5: Hawaii 109 0.02762991
#> 6: Hawaii 252 0.06387833
#> 7: Hawaii 406 0.10291508
#> 8: Hawaii 100 0.02534854
#> 9: Hawaii 320 0.08111534
#> 10: Hawaii 497 0.12598226
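One last note: all of the snippets above return fractions, while you asked for percentages. Just multiply by 100. A minimal base-R sketch using the example percentages from your question (the 'Texas' rows are hypothetical filler to show the filtering step):

```r
# toy data: Hawaii values chosen so the shares match the question (12, 38, 40, 10)
Data <- data.frame(
  state = c(rep('Hawaii', 4), rep('Texas', 2)),
  value = c(12, 38, 40, 10, 5, 7)
)

# keep only Hawaii, then express each value as a percent of the Hawaii total
hawaii <- subset(Data, state == 'Hawaii')
hawaii$pct <- 100 * hawaii$value / sum(hawaii$value)
hawaii$pct
#> [1] 12 38 40 10
```

The same `100 *` tweak works inside `mutate()`, `:=`, or `fmutate()` in the dplyr, data.table, and collapse versions above.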