# Identify changes in variable names/labels in multiple (over 20 files) versions of .csv data dictionaries in R

Hi everyone. I am having a task of comparing multiple (over 20 versions) .csv data dictionaries and identifying changes in variable naming/labeling over time. Anyone who's done some work around this, insights or suggestions on how to go about this in R?

Have you looked at the diffobj package at all?

It has a nice vignette here, but it only briefly mentions diffCsv(), which you'll probably want to look at in more detail:

3 Likes

@mara has a solution that would work, but may be overkill for your purposes without trimming the payload of the csv. What I would do is to read in each csv and, assuming they are tidy use

csv1_header <- colnames(csv1) %>% readr::write_csv("headers.csv", append = TRUE)
...



and then run dffCsv

1 Like

Do you mean changes like this:

gender,age
m,21
f,25


to...

sex,age
m,21
f,25


Or are you tracking changes in a CSV that describes a different file? Like this:

field,type,description
gender,factor,"gender of student"
age,integer,"age in years"


to...

field,type,description
sex,factor,"sex of student"
age,integer,"age in years"

2 Likes

@nwerth, I mean the second options, just variable names/labels without any data

Hi @technocrat : I have gone through the diffobj package and encountered the following error as per your suggestions;

library(diffobj)
library(tidyverse)

csv_01 <- tibble::tibble(name = "", age = "", sex = "")
csv_02 <- tibble::tibble(names = "", Age = "", SeX = "")

#> Error in .local(target, current, ...): Unable to read target file.

1 Like

Hi @mara: I have tried implementing the diffobj::diffCsv function and encountered an error as below:

If you're not familiar with writing file paths and handling working directories in code, check out Hands-On Programming with R's appendix Loading and Saving Data in R.

For your problem, if you would share a few lines from two of the data dictionaries showing a change, that'd help with finding a way to track it. A simple solution could be:

1. Assign a unique ID to each variable described. For example, if the variables are always in the same order, then the index would work. If not, then you might need to "manually" create a name \rightarrow ID lookup table.
2. Read in each dictionary, and add ID and tracking columns. The ID is from step #1, and the tracking column be something like the dictionary's file path or the effective date of the change.
3. rbind() these datasets into a single one, and sort by the ID and tracking columns.

Continuing my example from above, the result might look like this:

library(dplyr)

# dicts was created following the steps above
arrange(dicts, id, effective_date)
# # A tibble: 4 x 5
#      id effective_date field  type    description
#   <dbl> <date>         <chr>  <chr>   <chr>
# 1     1 2018-06-01     gender factor  gender of student
# 2     1 2019-01-01     sex    factor  sex of student
# 3     2 2018-06-01     age    integer age in years
# 4     2 2019-01-01     age    integer age in years


1 Like

diffCsv() function is for csv files, not for dataframes, this would work

library(diffobj)
library(tidyverse)

csv_01 <- tibble::tibble(name = "", age = "", sex = "")
csv_02 <- tibble::tibble(names = "", Age = "", SeX = "")


1 Like

Hi @bokola, for my solution to work, the data needs to be transformed to make the variable names into a header row. See the tidyr package for the reasoning and a function to do this.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.