deidentify and duplicate data

Hi - I am trying to use the deindentify() command and I get this error


The student ID numbers have a "@" character in front of them, which I think is one of the issues, and there are duplicate ID numbers listed as well. Is there a way to de-identify in R with the @ in front of the ID? I also need the duplicates to be renamed consistently, so that every occurrence of the same ID gets the same new name. I hope this makes sense. I appreciate any help and advice. Thanks!

Welcome to RStudio community, @cwiggz! We can give you a bit of general guidance here, but I think we'll probably need you to make a reprex, or reproducible example, in order to properly help you.

The reprex will have stuff like:

  • The code you're using (not just the line you're stuck on or the error you're getting); and
  • A sample of the data you're using—or, if you can't provide that, some simulated data that has a similar shape (e.g. the same columns).

If you can prep something like this for us, it'll give us a whole lot more context that can help us get to the root of the problem :slight_smile:

That said, it seems like there are a few things going on here that we can help with. I'm not familiar with a deidentify() function in R. Is this supplied by a package you're using? (This is one of the benefits of supplying a reprex: it can help us establish where things come from!)

If there are @ symbols in your student numbers, you can remove them using the str_replace() function in the stringr package.

I'm not quite sure I understand your explanation of how you want duplicates to be handled. If you could give us an example of a correctly handled duplicate along with your reprex, we can probably help you work that out :slight_smile:

Thanks!

Here's a quick illustration of how to get rid of @ characters if you don't need them. Run:

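library(dplyr)    # %>%, mutate(), as_tibble()
library(stringr)  # str_remove_all()

illustration <- c("@3","@4") %>% as_tibble()
illustration
# Drop every "@" from the value column
illustration %>% mutate(value=str_remove_all(value,"@")) -> illustration
illustration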

And here is the str_replace() variant mentioned above.

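library(dplyr)    # %>%, mutate(), as_tibble()
library(stringr)  # str_replace()

illustration <- c("@3","@4") %>% as_tibble()
illustration
# Replace the first "@" in each value with an empty string
illustration %>% mutate(value=str_replace(value,"@","")) -> illustration
illustration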

@rensa Thanks for your reply! I will work on a reprex. I am REALLY new to R, so this may take me a little bit to figure out. The deindentify() function is from the deidentifyr pkg. I found this on GitHub when searching for a way to de-identify my students. As for the duplication, the data set is all students who have taken math courses at my college. I am tracking their grades and subsequent success. Thus, their student ID is repeated every time they took a math course.
What the set looks like now:
Student ID   Course   Grade
@11111111    MAT137   A
@11111111    MAT167   C+
@11111111    MAT186   B
@2222222     MAT137   C
@3333333     MAT137   A-

When de-identified:
Student Rename   Course   Grade
fghj2345sd       MAT137   A
fghj2345sd       MAT167   C+
fghj2345sd       MAT186   B
abcdf6789f       MAT137   C
wrytu2746r       MAT137   A-

Rows with the same student ID need to be given the same new name so that they are still trackable as the same student. I hope this helps explain my problem a little better. I will work on the reprex! And thank you for the advice on how to get rid of the "@" symbol.

Apart from the link that rensa posted, this is also extremely helpful to understand what a reprex is:

Now, I'm not sure, but are you looking for something like this?

dataset <- data.frame(Student.ID = c("@11111111", "@11111111", "@11111111", "@2222222", "@3333333"),
                      Course = c("MAT137", "MAT167", "MAT186", "MAT137", "MAT137"),
                      Grade = c("A", "C+", "B", "C", "A-"))

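# Convert each distinct ID to an integer code. Going through factor() keeps this
# working in R >= 4.0, where data.frame() no longer turns strings into factors.
dataset <- within(data = dataset,
                  expr = {
                    Student.ID <- as.integer(x = factor(x = Student.ID))
                  })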

dataset
#>   Student.ID Course Grade
#> 1          1 MAT137     A
#> 2          1 MAT167    C+
#> 3          1 MAT186     B
#> 4          2 MAT137     C
#> 5          3 MAT137    A-

Created on 2019-03-02 by the reprex package (v0.2.1)
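
If sequential integers feel too easy to guess, the same idea can be sketched in base R with random labels instead. This is only a rough sketch, not a vetted anonymization scheme: random_label() and the key data frame are names made up for this illustration, and the 10-character label length is just chosen to resemble the example in the question.

```r
set.seed(1)  # only so the sketch is repeatable; drop it for real data

dataset <- data.frame(
  Student.ID = c("@11111111", "@11111111", "@11111111", "@2222222", "@3333333"),
  Course = c("MAT137", "MAT167", "MAT186", "MAT137", "MAT137"),
  Grade = c("A", "C+", "B", "C", "A-"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one random label of 10 lowercase letters and digits
random_label <- function() {
  paste(sample(c(letters, 0:9), size = 10, replace = TRUE), collapse = "")
}

# Lookup table with exactly one label per distinct student ID
ids <- unique(dataset$Student.ID)
key <- data.frame(Student.ID = ids,
                  Rename = replicate(length(ids), random_label()),
                  stringsAsFactors = FALSE)

# Guard against the (very unlikely) case of two students drawing the same label
stopifnot(!anyDuplicated(key$Rename))

# Replace each ID with its label; repeated IDs pick up the same label
dataset$Student.ID <- key$Rename[match(dataset$Student.ID, key$Student.ID)]
dataset
```

The key table is what links labels back to real IDs, so it would need to be stored securely, or thrown away if nobody should ever be able to re-identify the students.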

PS: I found the deidentifyr :package:, but not the deindentify function. Perhaps, you typed the n by mistake?

Try the package anonymizer. Here's an expanded version of my initial illustration.

library(dplyr)
library(stringr)  # str_replace()
illustration <- c("@3","@4","@4") %>% as_tibble()
illustration
# Strip the leading "@" before anonymizing
illustration %>% mutate(value=str_replace(value,"@","")) -> illustration
illustration

library(anonymizer)
illustration %>% mutate(value=anonymize(value, .algo = "crc32", .seed = 1)) -> illustration
illustration

Actually, your problem with the deidentifyr package is not the "@" character. The problem is that it does not accept duplicate IDs. If you add the Course column so that each row is unique, it works.

dataset <- data.frame(Student.ID = c("@11111111", "@11111111", "@11111111", "@2222222", "@3333333"),
                      Course = c("MAT137", "MAT167", "MAT186", "MAT137", "MAT137"),
                      Grade = c("A", "C+", "B", "C", "A-"))

library(deidentifyr)
deidentify(dataset, Student.ID, Course)
#>           id Grade
#> 1 f9a2c0fa32     A
#> 2 88796051c5    C+
#> 3 9f215474a0     B
#> 4 d17788bf5f     C
#> 5 300f0621e9    A-

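But the idea here is to keep each student identifiable across rows, and across multiple tables as well, and the output above gives the same student a different id on every row. So I would go with @Chuck's advice and use the anonymizer package, because it works with duplicate IDs.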

dataset <- data.frame(Student.ID = c("@11111111", "@11111111", "@11111111", "@2222222", "@3333333"),
                      Course = c("MAT137", "MAT167", "MAT186", "MAT137", "MAT137"),
                      Grade = c("A", "C+", "B", "C", "A-"))
library(dplyr)
library(anonymizer)
dataset %>%
    mutate(Student.ID = anonymize(Student.ID, .algo = "crc32", .seed = 1))
#>   Student.ID Course Grade
#> 1   3ac8169d MAT137     A
#> 2   3ac8169d MAT167    C+
#> 3   3ac8169d MAT186     B
#> 4   1c846636 MAT137     C
#> 5   1f97526b MAT137    A-
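
One thing that may be worth checking afterwards is that the mapping is one-to-one. crc32 is only a 32-bit checksum, so in principle two different students could end up with the same hash in a large data set. A small base R check, using the original IDs and the hashed values copied from the output above, could look like this (pairs and one_to_one are just names chosen for this sketch):

```r
# Original IDs and the de-identified IDs copied from the output above
original   <- c("@11111111", "@11111111", "@11111111", "@2222222", "@3333333")
anonymized <- c("3ac8169d", "3ac8169d", "3ac8169d", "1c846636", "1f97526b")

# Every distinct (original, anonymized) combination
pairs <- unique(data.frame(original, anonymized, stringsAsFactors = FALSE))

# One-to-one means no original ID and no anonymized ID shows up in two pairs
one_to_one <- !anyDuplicated(pairs$original) && !anyDuplicated(pairs$anonymized)
one_to_one
```

If this ever comes back FALSE on real data, a longer hash may be a safer choice than crc32.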

Thank you! The deidentify() function was not needed, your code worked PERFECTLY! You just saved me a ton of time, I am very grateful!

Glad I could help. :slight_smile:
