I'm trying to change the names in a column so I can protect the identities of my sources while making a dataset public. Does anyone have suggestions for how to do that with the example dataset below?
I would suggest something like this. First build a crosswalk table that pairs each distinct name with a unique number, then join it back onto the dataset and drop the names column.
names<-c('john','mary','joseph','john','john')
ages<-c(30,40,33,30,30)
library(tidyverse)
(datinit <- tibble(names=names, ages=ages))
#> # A tibble: 5 x 2
#> names ages
#> <chr> <dbl>
#> 1 john 30
#> 2 mary 40
#> 3 joseph 33
#> 4 john 30
#> 5 john 30
names_cw <- datinit %>%
  select(names) %>%
  distinct() %>%
  mutate(Number = row_number())
names_cw
#> # A tibble: 3 x 2
#> names Number
#> <chr> <int>
#> 1 john 1
#> 2 mary 2
#> 3 joseph 3
datinit %>%
  left_join(names_cw, by = "names") %>%
  select(-names)
#> # A tibble: 5 x 2
#> ages Number
#> <dbl> <int>
#> 1 30 1
#> 2 40 2
#> 3 33 3
#> 4 30 1
#> 5 30 1
There's a lot to think about when masking data. It could be as easy as giving every source a unique number, but you might also consider the methods here, particularly if more than one variable needs masking.
To complement this approach, I offer this 'cute' use of the fact that factors (often used to represent strings) have an internal integer representation. Therefore the following is possible:
new <- mutate(datinit,
              names = as.numeric(as.factor(names)))
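To make the factor trick a bit more concrete, here is a minimal base R sketch of what the conversion produces. The variable names source_names, f and masked are only illustrative. One thing worth noticing is that as.factor() sorts the distinct values to form its levels, so the codes follow alphabetical order rather than order of first appearance.

```r
# Illustrative vector; source_names is a made-up name for this sketch
source_names <- c('john', 'mary', 'joseph', 'john', 'john')

# as.factor() sorts the distinct values to build the levels:
# "john", "joseph", "mary"
f <- as.factor(source_names)

# as.integer() returns the position of each value in the levels
masked <- as.integer(f)
masked
#> [1] 1 3 2 1 1
```

Because the codes reflect the alphabetical rank of each name, they may reveal a little about the originals. Whether that matters depends on how sensitive the data is.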
One of the data structures that is easy to neglect in R is the hash, a lookup table. In the example, with only a handful of unique values in names (a variable name best avoided, because names is also an R primitive function), it's not difficult to line up the names with some arbitrary numbers.
But when dealing with more than a handful, doing this semi-manually becomes infeasible.
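As a rough sketch of the lookup-table idea, a named vector can stand in for a hash in base R (an environment would be another option). The names source_names, uniq, lookup and masked are hypothetical, and shuffling the IDs with sample() is my own assumption about what "arbitrary" should mean here, not something the thread prescribes.

```r
# Illustrative vector; source_names is a made-up name for this sketch
source_names <- c('john', 'mary', 'joseph', 'john', 'john')

# One entry per distinct name
uniq <- unique(source_names)

# Named vector used as a lookup table: names are the keys,
# values are shuffled integer IDs (one per distinct name)
lookup <- setNames(sample(seq_along(uniq)), uniq)

# Indexing by name masks every row in one step,
# however many distinct names there are
masked <- unname(lookup[source_names])
```

The lookup vector is the key back to the real identities, so it would need to stay out of the public release.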