Exclude character entries from a dataframe column

Hi Community,

I have a dataframe containing 47 US states, each duplicated many times in one column. I want to exclude all states except Hawaii (keeping all the Hawaii duplicates). The second column contains the associated numbers (bee colonies lost). I want to calculate the percent proportion of each Hawaii entry (i.e. 12%, 38%, 40%, 10%) so that they sum to 100%.

I'm very thankful for any help!
Kind regards

Hello,

What you are trying to achieve is a standard subset followed by the creation of a new column. There are numerous ways to do this:

Data <- data.frame(
  state = rep(c('Hawaii','Nowaii'), each = 10),
  value = sample(1:1000,20)
)

# base R
tapply(Data$value, Data$state, FUN = \(x) x/sum(x))
#> $Hawaii
#>  [1] 0.23751584 0.04005070 0.07376426 0.22179975 0.02762991 0.06387833
#>  [7] 0.10291508 0.02534854 0.08111534 0.12598226
#> 
#> $Nowaii
#>  [1] 0.082061779 0.174171939 0.165984369 0.039077038 0.148492743 0.019166357
#>  [7] 0.157610718 0.007629326 0.184406401 0.021399330
# just append [[1]] to the tapply() call and you get only the Hawaii element
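
For example, a minimal sketch (tapply() here returns a list-like array, and Hawaii happens to be its first element, so [[1]] pulls out just that vector):

tapply(Data$value, Data$state, FUN = \(x) x/sum(x))[[1]]
#>  [1] 0.23751584 0.04005070 0.07376426 0.22179975 0.02762991 0.06387833
#>  [7] 0.10291508 0.02534854 0.08111534 0.12598226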

# dplyr (most common)
library('dplyr')

Data |>
  filter(state == 'Hawaii') |>
  mutate(share = value / sum(value))
#>     state value      share
#> 1  Hawaii   937 0.23751584
#> 2  Hawaii   158 0.04005070
#> 3  Hawaii   291 0.07376426
#> 4  Hawaii   875 0.22179975
#> 5  Hawaii   109 0.02762991
#> 6  Hawaii   252 0.06387833
#> 7  Hawaii   406 0.10291508
#> 8  Hawaii   100 0.02534854
#> 9  Hawaii   320 0.08111534
#> 10 Hawaii   497 0.12598226
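
If you want actual percentages (12%, 38%, ...) rather than fractions, just multiply by 100 in the same step; a minimal variant of the dplyr call above (the column name pct is only illustrative):

Data |>
  filter(state == 'Hawaii') |>
  # same share as before, scaled to sum to 100
  mutate(pct = 100 * value / sum(value))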

# data.table (faster, but a bit more challenging at the beginning)
library('data.table')

Data <- as.data.table(Data)
Data[state == 'Hawaii',.(share = value/sum(value))]
#>          share
#>  1: 0.23751584
#>  2: 0.04005070
#>  3: 0.07376426
#>  4: 0.22179975
#>  5: 0.02762991
#>  6: 0.06387833
#>  7: 0.10291508
#>  8: 0.02534854
#>  9: 0.08111534
#> 10: 0.12598226
# updating by reference (:=) combined with subsetting is not really recommended,
# because rows outside the subset end up with NA in the new column
Data[state == 'Hawaii',share := value/sum(value)]
Data
#>      state value      share
#>  1: Hawaii   937 0.23751584
#>  2: Hawaii   158 0.04005070
#>  3: Hawaii   291 0.07376426
#>  4: Hawaii   875 0.22179975
#>  5: Hawaii   109 0.02762991
#>  6: Hawaii   252 0.06387833
#>  7: Hawaii   406 0.10291508
#>  8: Hawaii   100 0.02534854
#>  9: Hawaii   320 0.08111534
#> 10: Hawaii   497 0.12598226
#> 11: Nowaii   441         NA
#> 12: Nowaii   936         NA
#> 13: Nowaii   892         NA
#> 14: Nowaii   210         NA
#> 15: Nowaii   798         NA
#> 16: Nowaii   103         NA
#> 17: Nowaii   847         NA
#> 18: Nowaii    41         NA
#> 19: Nowaii   991         NA
#> 20: Nowaii   115         NA
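
As an alternative sketch, if you want the share column filled for every state (no NAs outside the subset), you can compute it per group instead of subsetting, using data.table's by:

# shares within each state, written by reference for all rows
Data[, share := value / sum(value), by = state]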

# collapse (as fast as data.table but with a closer syntax to dplyr)
library('collapse')

Data |>
  fsubset(state == 'Hawaii') |>
  # notice fsum() instead of sum() - collapse comes with fast statistical functions
  fmutate(share = value / fsum(value))
#>      state value      share
#>  1: Hawaii   937 0.23751584
#>  2: Hawaii   158 0.04005070
#>  3: Hawaii   291 0.07376426
#>  4: Hawaii   875 0.22179975
#>  5: Hawaii   109 0.02762991
#>  6: Hawaii   252 0.06387833
#>  7: Hawaii   406 0.10291508
#>  8: Hawaii   100 0.02534854
#>  9: Hawaii   320 0.08111534
#> 10: Hawaii   497 0.12598226

Created on 2022-10-31 with reprex v2.0.2

Kind regards

