Recode values in large dataset(400 000 rows) with a code key (358 rows)

I have a large dataset with 400 000 rows with a variable I want to recode into another version. For this I have a 358 rows long code key, parts of it can be seen in the picture below. I want to recode x2 to the x1 version. Im not shure where to begin, how do I solve this in r.

image

Hello and welcome to the forum.

For your problem I propose a solution based on base::merge(); an alternative using dplyr::left_join() is also possible.

What you require is a codebook table that would contain key value pairs for x1 <-> x2 matching. It will be good idea to have a common column name for the code column, such as "animal" in this toy example:

input <- data.frame(animal = c("cat", "dog", "mouse", "catepillar"),
                     value = c(1,2,3,4))

print(input)
      animal value
1        cat     1
2        dog     2
3      mouse     3
4 catepillar     4

codebook <- data.frame(animal = c("cat", "dog", "mouse", "catepillar"),
                       class = c("vertebrate", "vertebrate", "vertebrate", "insect"))

print(codebook)
      animal      class
1        cat vertebrate
2        dog vertebrate
3      mouse vertebrate
4 catepillar     insect

output <- merge(input, codebook, by = "animal")

print(output)
      animal value      class
1        cat     1 vertebrate
2 catepillar     4     insect
3        dog     2 vertebrate
4      mouse     3 vertebrate
2 Likes

Thank you for your input.

Im not shure this works as the codebook has only two variables and these are used to decipher the input which consists of one variable. Sorry if I missunderstand.

Lets say:

df1/input has 1 variable, a (there is also NA in this variable)

df2/codebook has 2 variables, a and b

I want to change the variable a in df1 into variable b with the help of df2 where a ->b is available.

Looking once more at your example you might wish to consider tidyr::separate_rows() function to handle the comma separated values in your x2 column; something along the lines of:

input <- data.frame(animal = c("cat", "dog", "mouse", "catepillar"),
                    value = c(1,2,3,4))

print(input)
      animal value
1        cat     1
2        dog     2
3      mouse     3
4 catepillar     4

codebook <- data.frame(animal = c("cat", "dog", "mouse", "catepillar"),
                       class = c("felis, carnivora, vertebrate", "canis, carnivora, vertebrate", "mus, rodentia, vertebrate", "larva, lepidoptera, insect"))

print(codebook)
      animal                        class
1        cat felis, carnivora, vertebrate
2        dog canis, carnivora, vertebrate
3      mouse    mus, rodentia, vertebrate
4 catepillar   larva, lepidoptera, insect

codebook <- tidyr::separate_rows(codebook, class, sep = ',')

print(codebook)
       animal        class
1         cat        felis
2         cat    carnivora
3         cat   vertebrate
4         dog        canis
5         dog    carnivora
6         dog   vertebrate
7       mouse          mus
8       mouse     rodentia
9       mouse   vertebrate
10 catepillar        larva
11 catepillar  lepidoptera
12 catepillar       insect

output <- merge(input, codebook, by = "animal")

print(output)
      animal value        class
1         cat     1        felis
2         cat     1    carnivora
3         cat     1   vertebrate
4  catepillar     4        larva
5  catepillar     4  lepidoptera
6  catepillar     4       insect
7         dog     2        canis
8         dog     2    carnivora
9         dog     2   vertebrate
10      mouse     3          mus
11      mouse     3     rodentia
12      mouse     3   vertebrate
1 Like

Its almost working now.

With your code my 400 000 observations become 70 000 observations. I want to keep all my 400 000.

I also tried this code: output <- left_join(input, codebook, by = c("animal"), copy = False, suffix = c(".x", ".y"))

Here I got more observations than I started with, about 450 000.

I dont know if it helps but the input has also ID numbers for every observation.

Thank you so far.

It seems that there are some observations that have no match in the codebook, causing you to lose observations, and duplicates in codebook keys, causing cartesian multiplication.

I am afraid this is as far as I can get without actually seeing your data.

1 Like

You are correct. The problem lies in the codebook and how it is coded. Thank you for your help.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.