I have a large dataset with 400 000 rows with a variable I want to recode into another version. For this I have a 358 rows long code key, parts of it can be seen in the picture below. I want to recode x2 to the x1 version. Im not shure where to begin, how do I solve this in r.
Hello and welcome to the forum.
For your problem I propose a solution based on base::merge()
; an alternative using dplyr::left_join()
is also possible.
What you require is a codebook table that would contain key value pairs for x1 <-> x2 matching. It will be good idea to have a common column name for the code column, such as "animal" in this toy example:
input <- data.frame(animal = c("cat", "dog", "mouse", "catepillar"),
value = c(1,2,3,4))
print(input)
animal value
1 cat 1
2 dog 2
3 mouse 3
4 catepillar 4
codebook <- data.frame(animal = c("cat", "dog", "mouse", "catepillar"),
class = c("vertebrate", "vertebrate", "vertebrate", "insect"))
print(codebook)
animal class
1 cat vertebrate
2 dog vertebrate
3 mouse vertebrate
4 catepillar insect
output <- merge(input, codebook, by = "animal")
print(output)
animal value class
1 cat 1 vertebrate
2 catepillar 4 insect
3 dog 2 vertebrate
4 mouse 3 vertebrate
Thank you for your input.
Im not shure this works as the codebook has only two variables and these are used to decipher the input which consists of one variable. Sorry if I missunderstand.
Lets say:
df1/input has 1 variable, a (there is also NA in this variable)
df2/codebook has 2 variables, a and b
I want to change the variable a in df1 into variable b with the help of df2 where a ->b is available.
Looking once more at your example you might wish to consider tidyr::separate_rows()
function to handle the comma separated values in your x2 column; something along the lines of:
input <- data.frame(animal = c("cat", "dog", "mouse", "catepillar"),
value = c(1,2,3,4))
print(input)
animal value
1 cat 1
2 dog 2
3 mouse 3
4 catepillar 4
codebook <- data.frame(animal = c("cat", "dog", "mouse", "catepillar"),
class = c("felis, carnivora, vertebrate", "canis, carnivora, vertebrate", "mus, rodentia, vertebrate", "larva, lepidoptera, insect"))
print(codebook)
animal class
1 cat felis, carnivora, vertebrate
2 dog canis, carnivora, vertebrate
3 mouse mus, rodentia, vertebrate
4 catepillar larva, lepidoptera, insect
codebook <- tidyr::separate_rows(codebook, class, sep = ',')
print(codebook)
animal class
1 cat felis
2 cat carnivora
3 cat vertebrate
4 dog canis
5 dog carnivora
6 dog vertebrate
7 mouse mus
8 mouse rodentia
9 mouse vertebrate
10 catepillar larva
11 catepillar lepidoptera
12 catepillar insect
output <- merge(input, codebook, by = "animal")
print(output)
animal value class
1 cat 1 felis
2 cat 1 carnivora
3 cat 1 vertebrate
4 catepillar 4 larva
5 catepillar 4 lepidoptera
6 catepillar 4 insect
7 dog 2 canis
8 dog 2 carnivora
9 dog 2 vertebrate
10 mouse 3 mus
11 mouse 3 rodentia
12 mouse 3 vertebrate
Its almost working now.
With your code my 400 000 observations become 70 000 observations. I want to keep all my 400 000.
I also tried this code: output <- left_join(input, codebook, by = c("animal"), copy = False, suffix = c(".x", ".y"))
Here I got more observations than I started with, about 450 000.
I dont know if it helps but the input has also ID numbers for every observation.
Thank you so far.
It seems that there are some observations that have no match in the codebook, causing you to lose observations, and duplicates in codebook keys, causing cartesian multiplication.
I am afraid this is as far as I can get without actually seeing your data.
You are correct. The problem lies in the codebook and how it is coded. Thank you for your help.
This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.