Scramble data with tidyverse

Dear tidyverse,
I am trying to scramble real-world data to protect the data owners. For every numeric field, I want each scrambled value to fall within plus or minus one standard deviation of the original value.
E.g. if the seps column is currently 81, then after scrambling it should be within one SD of 81.
Any better way of scrambling is also welcome.
I hope you can help me achieve this.
Thank you very much.
Sample txt file attached.

Will this do the trick?

# Load libraries ----------------------------------------------------------
library("tidyverse")


# Define example data -----------------------------------------------------
set.seed(48448)
my_data <- tibble(
  id = sample(LETTERS, 20),
  v1 = rnorm(20),
  v2 = rnorm(20),
  v3 = rnorm(20)
)


# Wrangle data ------------------------------------------------------------

# Set width of scrambling
d <- 1

# Create long version of scrambled data
my_data_scrambled_long <- my_data %>% 
  pivot_longer(cols = -id,
               names_to = "var",
               values_to = "value") %>% 
  group_by(var) %>% 
  mutate(value_scrambled = value + runif(n = n(),
                                         min = -d*sd(value),
                                         max = d*sd(value))) %>% 
  ungroup()

# Create wide version of scrambled data
my_data_scrambled_wide <- my_data_scrambled_long %>% 
  select(-value) %>% 
  pivot_wider(id_cols = id, names_from = var, values_from = value_scrambled)


# View scrambling ---------------------------------------------------------
my_data
my_data_scrambled_wide

Yielding:

> my_data
# A tibble: 20 × 4
   id         v1      v2      v3
   <chr>   <dbl>   <dbl>   <dbl>
 1 G      0.762   0.990  -1.28  
 2 M     -1.17   -0.386   0.908 
 3 S     -0.863   0.0378  0.648 
 4 B     -0.413  -0.424  -0.573 
 5 V      1.09    0.155  -0.371 
 6 Y     -1.27   -0.143  -0.0217
 7 W      0.621   0.244  -0.408 
 8 N      0.427  -2.65   -0.532 
 9 C      0.413  -0.771   0.0209
10 T     -0.730   0.642   0.288 
11 Q      0.170   0.908  -0.625 
12 J      1.38    0.948   0.288 
13 Z     -0.688  -0.332   0.141 
14 A     -0.556  -1.18    0.851 
15 P      0.155  -1.28    0.897 
16 F     -0.245   0.438   0.403 
17 R     -0.632  -0.649   0.966 
18 O     -1.19   -0.682  -1.66  
19 E     -0.0599  0.780   0.935 
20 L     -1.51    0.834   0.871 
> my_data_scrambled_wide
# A tibble: 20 × 4
   id         v1     v2      v3
   <chr>   <dbl>  <dbl>   <dbl>
 1 G      1.27    1.52  -1.34  
 2 M     -0.751   0.377  0.262 
 3 S     -1.46   -0.225  0.131 
 4 B     -1.20   -0.459 -0.584 
 5 V      0.710  -0.407  0.0121
 6 Y     -1.11   -0.744  0.542 
 7 W      1.14   -0.677  0.229 
 8 N      1.12   -2.23  -0.876 
 9 C      0.420  -0.209  0.0419
10 T      0.0289  0.132  0.0977
11 Q      0.470   1.77  -0.828 
12 J      1.55    0.644  0.287 
13 Z     -1.26   -0.463  0.621 
14 A     -0.107  -0.326  0.860 
15 P      0.365  -1.57   0.611 
16 F      0.148   0.811 -0.153 
17 R     -1.40   -1.39   0.912 
18 O     -1.95   -1.28  -1.04  
19 E      0.148   1.27   1.32  
20 L     -2.29    1.72   1.29  
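
If the round trip through long format feels heavy, the same idea can probably be written with across(), which applies a function to each numeric column in place. This is only a sketch: the helper name scramble() is made up for illustration and is not part of any package, and I am assuming dplyr 1.0.0 or later so that across() and where() are available. The last lines check that no value moved by more than d standard deviations of its column.

```r
library(dplyr)
library(tibble)

# Same example data as above
set.seed(48448)
my_data <- tibble(
  id = sample(LETTERS, 20),
  v1 = rnorm(20),
  v2 = rnorm(20),
  v3 = rnorm(20)
)

# Set width of scrambling (in units of each column's standard deviation)
d <- 1

# Hypothetical helper (name chosen for this sketch): add uniform noise
# drawn from [-d * sd(x), d * sd(x)] to every element of x
scramble <- function(x, d = 1) {
  s <- sd(x, na.rm = TRUE)
  x + runif(n = length(x), min = -d * s, max = d * s)
}

# Scramble every numeric column; non-numeric columns such as id are untouched
my_data_scrambled <- my_data %>%
  mutate(across(where(is.numeric), ~ scramble(.x, d = d)))

# Largest shift per column, expressed in standard deviations of the original
num_cols <- c("v1", "v2", "v3")
max_shift_in_sd <- sapply(num_cols, function(v) {
  max(abs(my_data_scrambled[[v]] - my_data[[v]])) / sd(my_data[[v]])
})
max_shift_in_sd
all(max_shift_in_sd <= d)
```

Because the noise is bounded by d times the column SD, the final check should always come back TRUE. On real data with missing values, sd() is computed with na.rm = TRUE here and the missing entries simply stay missing.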

Excellent. Thank you so much. That solves my problem.
