Substituting letters for names in a dataset?

I'm trying to change the names in a column so I can protect the identities of my sources while making a dataset public. Does anyone have suggestions for how to do that with the example dataset below?

#create dataset

names<-c('john','mary','joseph','john','john')
ages<-c(30,40,33,30,30)
nameset<-cbind(names,ages)

#change names to letters

I would suggest something like this: first build a crosswalk table that pairs each distinct name with a unique number, then join it back onto the dataset and drop the names column.

names<-c('john','mary','joseph','john','john')
ages<-c(30,40,33,30,30)

library(tidyverse)

(datinit <- tibble(names=names, ages=ages))
#> # A tibble: 5 x 2
#>   names   ages
#>   <chr>  <dbl>
#> 1 john      30
#> 2 mary      40
#> 3 joseph    33
#> 4 john      30
#> 5 john      30

names_cw <- datinit %>%
  select(names) %>%
  distinct() %>%
  mutate(Number=row_number())

names_cw
#> # A tibble: 3 x 2
#>   names  Number
#>   <chr>   <int>
#> 1 john        1
#> 2 mary        2
#> 3 joseph      3

datinit %>%
  left_join(names_cw, by="names") %>%
  select(-names)
#> # A tibble: 5 x 2
#>    ages Number
#>   <dbl>  <int>
#> 1    30      1
#> 2    40      2
#> 3    33      3
#> 4    30      1
#> 5    30      1

Created on 2020-02-24 by the reprex package (v0.3.0)
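
Since the question asked for letters rather than numbers, the same crosswalk idea can be sketched in base R. This is only a sketch: it assumes there are no more than 26 distinct names (the length of the built-in LETTERS vector), and `src_names`, `idx`, and `labels` are names made up for illustration.

```r
src_names <- c('john', 'mary', 'joseph', 'john', 'john')
ages <- c(30, 40, 33, 30, 30)

# position of each name among the distinct names, in order of first appearance
idx <- match(src_names, unique(src_names))

# assumption: no more than 26 distinct names, so LETTERS never runs out
stopifnot(max(idx) <= length(LETTERS))
labels <- LETTERS[idx]

# publish the letter codes instead of the real names
masked <- data.frame(source = labels, ages = ages)
masked
```

Here john becomes A, mary becomes B, and joseph becomes C, so the repeated john rows all share the same letter.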


There's a lot to think about when masking data. It could be as easy as giving every source a unique number, but you might also consider the methods here, particularly if more than one variable needs masking.
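
As a rough sketch of the simplest case only, the numbering trick can be applied to several columns at once. The `city` column and the `mask_ids` helper are hypothetical, invented here for illustration, and this does not stand in for more careful masking methods.

```r
# hypothetical helper: replace each distinct value with an integer code,
# numbered in order of first appearance
mask_ids <- function(x) match(x, unique(x))

dat <- data.frame(
  names = c('john', 'mary', 'joseph', 'john', 'john'),
  city  = c('lima', 'oslo', 'lima', 'lima', 'oslo'),  # hypothetical column
  ages  = c(30, 40, 33, 30, 30),
  stringsAsFactors = FALSE
)

# mask each sensitive column independently of the others
to_mask <- c('names', 'city')
dat[to_mask] <- lapply(dat[to_mask], mask_ids)
dat
```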

To complement this approach, I offer this 'cute' use of the fact that factors (often used to represent strings) have an internal integer representation. Therefore the following is possible:

# factor levels are sorted alphabetically, so john = 1, joseph = 2, mary = 3
new <- mutate(datinit,
              names = as.numeric(as.factor(names)))
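
One thing to keep in mind with this shortcut, shown in a small base R sketch: factor levels are sorted alphabetically by default, so the codes follow the alphabetical order of the real names. If that leaks more than you want, shuffling the levels first is one option. The `src` vector and the seed value are assumptions chosen for illustration.

```r
src <- c('john', 'mary', 'joseph', 'john', 'john')

# default: levels are sorted, so john = 1, joseph = 2, mary = 3
default_codes <- as.integer(factor(src))
default_codes

# one option: assign the levels in random order before taking the codes
set.seed(42)  # arbitrary seed, only so the example is repeatable
shuffled <- factor(src, levels = sample(unique(src)))
codes <- as.integer(shuffled)
codes
```

Either way, identical names always receive identical codes; only the assignment of codes to names changes.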


One of the data structures that is easy to neglect in R is the hash, a lookup table. In the example, with only a handful of unique names (names is a good object name to avoid, because it's also the name of an R primitive), it's not difficult to line up the names with some arbitrary number.

But with more than a handful, doing this semi-manually becomes infeasible.

To illustrate a more scalable approach:

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(hash))
subj_id <- c(9095, 3906, 1175, 1567, 2692, 6287)
name <- structure(list(name = c("able", "baker", "charlie", "dany", "elaine", 
"fay")), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))
# one key per distinct name
name %>% distinct(name) -> name.keys
# build the lookup table: each distinct name maps to one subject id
h <- hash(keys = name.keys$name, values = subj_id)
# look up every row's name in the hash, then drop the real names
name %>% mutate(subject_id = unname(values(h, keys = name))) %>% select(-name)
#> # A tibble: 6 x 1
#>   subject_id
#>        <dbl>
#> 1       9095
#> 2       3906
#> 3       1175
#> 4       1567
#> 5       2692
#> 6       6287

Created on 2020-02-24 by the reprex package (v0.3.0)
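
For comparison, here is a minimal base R sketch of the same lookup idea without the hash package, using a named vector as the table. The `src` vector, the id range 1000:9999, and the seed are arbitrary assumptions made for illustration.

```r
src <- c('able', 'baker', 'charlie', 'able', 'fay', 'baker')

keys <- unique(src)

set.seed(1)  # arbitrary seed, only so the example is repeatable
# sample() without replacement guarantees the ids are distinct
ids <- sample(1000:9999, length(keys))

# named vector as a lookup table: names are the keys, elements are the ids
lookup <- setNames(ids, keys)

# index by name to translate every row, then discard the names
subject_id <- unname(lookup[src])
subject_id
```

The `lookup` table itself should stay private, since it is the only thing linking the ids back to the names.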
