meta file for lab experiments

Hello,

I am trying to set up a system where I have a plain text file to write experiment notes as metadata of that experiment. I recently attended to a seminar where the speaker talked that such file should be human and machine readable; but I am not getting there.
My goal is to make a file that works as a template for metadata of experiments, with a fixed header, and a variable body according to the experiment.

The problems I am facing:

  • not sure how much ground I have to play with a file for it to be human and machine readable
  • how can I deal with the fact that more variables are being added overtime, or
  • more details for the same variable
  • how can I make such a file that later on, if I want to do any analysis (like statistics) I could stack all the file of the same experiment (e.g. stack the 3 files of the 3 repeats of the experiment)

An example would help

Imagine that I have to observe a protein (protein-x) in cells using a confocal microscope after an immunocytochemistry experiment -- therefore, I will have to use a primary antibody and a secondary antibody, each with their respective dilution; skipping the details here for sake of simplicity.

This is how I picture the file to be like '.csv' (with random examples to fill in)

date, "20200220"
experiment, "test experiment"
cell_line, "U2OS"
target, "protein-x"
primary_antibody, "anti-protein-x"
primary_antibody_dilution, "1:500"
secondary_antibody, "anti-anti-a594"
secondary_antibody_dilution, "1:5000"

So far my problem is: if I use two 'primary_antibody', what is the best way to do it? Create another file, or add like:
primary_antibody, {"anti-protein-x", "anti-protein-b"}

Moving on, I would like log the file names of the raw data that correspond to a given microscopy slide; I have the same problem, and one more: I don't know in advance how many photos I will acquire for a given slide, and for different experiments I'll have different number of slides. As an example:

date, "20200220"
experiment, "test experiment"
cell_line, "U2OS"
target, "protein-x"
primary_antibody, "anti-protein-x"
primary_antibody_dilution, "1:500"
secondary_antibody, "anti-anti-a594"
secondary_antibody_dilution, "1:5000"
slide_a, {"-001", "-002", "-005"}
slide_b, {"-010", "-015", "-016"}

Hope that this is not too confusing.
What would be your advice? Keeping in mind that the less time making annotation and logs would be beneficial. In other words, what could here be automated to end up with a nice metadata file while keep the log in the lab sustainable; for example, in the lab its far easier and faster to write "slide_a", ""-001", "-002", "-005" than having three files with the same (duplicated) info in the heading.

Thank you in advance.

Cheers,
Leonardo

You could consider the json format that is very flexible and similar to your setup.
Below you find an example of

  • creating a character vector with a json string
  • converting the json string with the function fromjson of the package jsonlite to a list
  • change the list
  • convert the changed list to a json string again

I think it is both machine and humanly readable.

library(jsonlite)

experiment1 = '{
"date" : ["20200220"],
"experiment" : ["test experiment"],
"cell_line" : ["U2OS"],
"target" : ["protein-x"],
"primary_antibody" :  ["anti-protein-x", "anti-protein-b"],
"primary_antibody_dilution" : ["1:500", "1:250"],
"secondary_antibody" : ["anti-anti-a594"],
"secondary_antibody_dilution" : ["1:5000"],
"slide_a" : ["-001", "-002", "-005"],
"slide_b" : ["-010", "-015", "-016"]
}'

# convert to list format
(jexperiment1 = jsonlite::fromJSON(experiment1))
#> $date
#> [1] "20200220"
#> 
#> $experiment
#> [1] "test experiment"
#> 
#> $cell_line
#> [1] "U2OS"
#> 
#> $target
#> [1] "protein-x"
#> 
#> $primary_antibody
#> [1] "anti-protein-x" "anti-protein-b"
#> 
#> $primary_antibody_dilution
#> [1] "1:500" "1:250"
#> 
#> $secondary_antibody
#> [1] "anti-anti-a594"
#> 
#> $secondary_antibody_dilution
#> [1] "1:5000"
#> 
#> $slide_a
#> [1] "-001" "-002" "-005"
#> 
#> $slide_b
#> [1] "-010" "-015" "-016"
# id of third photo of slide_a
jexperiment1$slide_a[3]
#> [1] "-005"
# replace by "-005r"
jexperiment1$slide_a[3] = "-005r"
# convert back to (pretty) json 
jsonlite::prettify(
  jsonlite::toJSON(jexperiment1)
)
#> {
#>     "date": [
#>         "20200220"
#>     ],
#>     "experiment": [
#>         "test experiment"
#>     ],
#>     "cell_line": [
#>         "U2OS"
#>     ],
#>     "target": [
#>         "protein-x"
#>     ],
#>     "primary_antibody": [
#>         "anti-protein-x",
#>         "anti-protein-b"
#>     ],
#>     "primary_antibody_dilution": [
#>         "1:500",
#>         "1:250"
#>     ],
#>     "secondary_antibody": [
#>         "anti-anti-a594"
#>     ],
#>     "secondary_antibody_dilution": [
#>         "1:5000"
#>     ],
#>     "slide_a": [
#>         "-001",
#>         "-002",
#>         "-005r"
#>     ],
#>     "slide_b": [
#>         "-010",
#>         "-015",
#>         "-016"
#>     ]
#> }
#> 

Created on 2020-07-08 by the reprex package (v0.3.0)

hi @HanOostdijk

thank you for you answer -- extremely useful. I like it!
I've been playing around with that structure, and I came across the possibility of having longitudinal experiments where I would need to have more than one entry in one file to keep the data together.
I tried with the following structure:

{
    "Experiment1_01":[
        {
            "var1" : ["ok"],
            "var2" : ["2"],
            "var3" : ["A"]
        }],
    "Experiment1_02":[
        {
            "var1" : ["bad"],
            "var2" : ["3"],
            "var3" : ["A"]
        }
    ]
}

But I am having problems -- and stuck -- to make a table/data.frame out of it in order to have all the data in one nice final .csv.

So far I did this

jfiles.dir <- "...D:\\...." # set dir
list.record.jfiles <- list.files(jfiles.dir, "example0") # list files in folder

jimported.df <- data.frame()

for (n in length(list.record.jfiles)) {
  
  jimported <- jsonlite::read_json(paste0(jfiles.dir,list.record.jfiles[n]), simplifyVector = TRUE)
  
    if (length(jimported) > 1) {
    for (k in length(jimported)) {
      jimported.df <- rbind.data.frame(jimported.df,jimported[[k]])
    
    }
  } else {
    jimported.df <-rbind.data.frame(jimported.df,jimported)
     
  }
}

My problems:

  1. gives error because the name of the columns does not match; but when I try to give names it gives me and error because de data frame is empty
  2. I tried to use purrr::flatten() and jsonlite::flatten() but says that jimported[[k]] is not a data.frame
  3. it might happen the sometimes the .json. just has one entry, and other times more -- that was why I tried to use the for loop and if inside it
{
    "Experiment2_01":[
        {
            "var1" : ["ok"],
            "var2" : ["1"],
            "var3" : ["A"]
        }
    ]
}

Thank you !

Best,
Leonardo

I have saved your two json structures in the files example1 and example2 in my current folder and adapted your code (see the lines marked by #changed) .
It can handle now the two structures.

Both structures now contain all vars: var1 , var2 and var3.
If it is possible that these are not all present (or others are present) then the code should be able to handle that by providing a column for all possible var-s .

#jfiles.dir <- "...D:\\...." # set dir
jfiles.dir <- "." # set dir #changed
list.record.jfiles <- list.files(jfiles.dir, "example") # list files in folder #changed

jimported.df <- data.frame()

for (n in 1:length(list.record.jfiles)) { #changed
  
  jimported <- jsonlite::read_json(paste(jfiles.dir,list.record.jfiles[n],sep='/'), 
                                   simplifyVector = TRUE) #changed
  
    if (length(jimported) > 1) { 
    for (k in 1:length(jimported)) { #changed
      jimported.df <- rbind.data.frame(jimported.df,jimported[[k]])
    
    }
  } else {
    jimported.df <-rbind.data.frame(jimported.df,jimported[[1]]) #changed
     
  }
}

Maan, I can't believe that I missed that the for loop goes through a vector.

yes, it's working with other variables, without problems so far -- now feel like I have the basis to try and test in the real world (laboratory).

Thank you again for your help o7

It seems to me if you can expect to view/work with your data in tabular I.e. table form , doesnt adding json amount to just another level of indirection.
Work out what column of info you want to be tracking and then do your experiment directly in maybe tab delimited form (likely more readable than comma separated)

1 Like

the problem is that editing a tabular file in a text editor in the lab it's not as fluid as working with a .json -- tabular view requires more horizontal screen space, and if I want to see two variables at the same time it's harder that way; editing a tabular (from my experience) is more prone to error, because of miss placed values -- it's easier to know where exactly to place values --> ["here"].

Also, the need for the tabular data is not to visualize, it's to share with others in a more convenient fashion.

I thought about that, but If I have files with that column I want to track, I run into the possibility of each file missing their context; that is what I am trying to avoid using the .json, the info might be repeated, but the context of the experiment is kept together with the result.

Those are fine reasons, so its good to know its a considered decision :slight_smile:

1 Like

from the post above (by @HanOostdijk) with the import I ran into another problem by importing data that has two or more entries (if that is the correct terminology); for example:

{
    "Experiment1_01":[
        {
            "var1" : ["ok"],
            "var2" : ["2","1"],
            "var3" : ["A"]
        }],
    "Experiment1_02":[
        {
            "var1" : ["bad"],
            "var2" : ["3"],
            "var3" : ["A"]
        }
    ]
}

The result I get is this:
jimported.df$var2[[1]]
[1] "c(\"2\", \"1\")"

it appears like this c("2", "1") when I ask head(jimported.df)

How can I convert this to a regular vector?

thank you

In this case the output is as expected (by me at least) . When I use the str function it look like

str(jimported.df)
'data.frame':	2 obs. of  3 variables:
 $ var1:List of 2
  ..$ : chr "ok"
  ..$ : chr "bad"
 $ var2:List of 2
  ..$ : chr  "2" "1"
  ..$ : chr "3"
 $ var3:List of 2
  ..$ : chr "A"
  ..$ : chr "A"

The first element of var2 is indeed a (character) vector.
The other elements are all character vectors but of length 1.

I have a feeling you are unhappy with that but are unsure how you would like to see this handled.

I was just expecting other str. let me explain:

Here should it be chr [1:2] "1" "2" ? or is it the same thing?

Now I see what you mean (I think :grinning: ) .
Apparently there is a difference when displaying the whole data.frame or an element of that.
I have looked at the defaults for the str.data.frame function in the documentation but find no reason for that. So no solution/explanation but compare

> str(jimported.df) 
'data.frame':	2 obs. of  3 variables:
 $ var1:List of 2
  ..$ : chr "ok"
  ..$ : chr "bad"
 $ var2:List of 2
  ..$ : chr  "2" "1"
  ..$ : chr "3"
 $ var3:List of 2
  ..$ : chr "A"
  ..$ : chr "A"

> str(jimported.df$var2[[1]])
 chr [1:2] "2" "1"

> str(jimported.df$var2[1])
List of 1
 $ : chr [1:2] "2" "1"

You thought right!

Good news, I accidentally found a way to make a tidy data.frame with a .json with the extact structure that I was looking for; while trying to remove [[foo]] from values -- because I was thinking to use the structure of the .json in my .md files in Obsidian to quickly edit and keep visual picture with graph view of what is going on, and how my data is connected.

Yet, I don't know how to simplify it for the cases where there is no need to remove the double [[ ]] -- if you know how to I appreciate the help.


# get wd to list files
jfiles.dir <- paste0(getwd(),"\\")
# import list files
file.pattern <- "example0"
jfiles.list <- list.files(jfiles.dir,file.pattern)
# creat an empty data frame to store data
jimported.df <- data.frame()
# go through all the files 
for (n in 1:length(jfiles.list)) {
  
  jimported <- jsonlite::read_json(paste(jfiles.dir,"/",jfiles.list[n], sep='/'), simplifyVector = TRUE) # read json
  
    if (length(jimported)>1) { # in case the json has >1 object/entry
      
    for (k in 1:length(jimported)) {
      
      jimported.df <- rbind.data.frame(jimported.df,jimported[[k]])
      
    }
      
  } else {
    
    jimported.df <- rbind.data.frame(jimported.df,jimported[[1]])
    
     }
}

# convert to tidy from .md with backlinks [[foo]]
jimported.tidy <- data.frame()
for (k in 1:nrow(jimported.df)) {
  t.jimported.tidy <- as.data.frame(lapply(jimported.df, function(x) gsub("\\[|\\]","",x[[k]])))
  jimported.tidy <- rbind.data.frame(jimported.tidy,t.jimported.tidy)
}

the str(jimported.tidy)

'data.frame':	5 obs. of  3 variables:
 $ var1: chr  "A01" "A01" "A02" "B" ...
 $ var2: chr  "2" "1" "3" "1" ...
 $ var3: chr  "0.2" "0.2" "0.1" "0.3" ...

The example files:
...\\example01.json

{
   "Experiment1_01":[
     {
       "var1" : ["[[A01]]"],
       "var2" : ["2","1"],
       "var3" : ["0.2"]
     }],
   "Experiment1_02":[
     {
       "var1" : ["[[A02]]"],
       "var2" : ["3"],
       "var3" : ["0.1"]
     }]
 }

...\\example02.json

{
  "Experiment2_01":[
    {
      "var1" : ["[[A03]]"],
      "var2" : ["1"],
      "var3" : ["0.3","0.5"]
    }
    ]
}

I am afraid you lost me.
Can you indicate what is the input and what the required output and where your code goes wrong?

Sorry @HanOostdijk.

Let me try to explain.

From a file were there is one entry with one variable with two entries, like this:

{
   "Experiment1_01":[
     {
       "var1" : ["[[A01]]"],
       "var2" : ["2","1"],
       "var3" : ["0.2"]
     }]
}

I would expect a data.frame with two rows, because here var2 has observations.

Or, if I have:

{
   "Experiment1_01":[
     {
       "var1" : ["[[A01]]"],
       "var2" : ["2","1"],
       "var3" : ["0.2", "0.5]
     }]
}

the data.frame would have 4 rows for the four observations.

The idea is that it goes to each row in the data frame, and if it has one var with more than one entry it would add it to the new tidy dataframe.

I am currently here -- but its not working because x > 3, and since I have just 3 var it leads to error


jimported.tidy2 <- data.frame()

for (x in 1:nrow(jimported.df)) {
  for (y in 1:length(jimported.df[[x]])){
    
  t.jimported.tidy <- (jimported.df[x][[y]])
  jimported.tidy2 <- rbind.data.frame(jimported.tidy,t.jimported.tidy)
  }
}

does it make sense?

I would use the old code and the tidyr::unnest function.
To the old code I added the extraction of the experiment name (even when you could have used one of the var entries as a description/name) and a call to the unnest function .

My new version of the code (with output and input after that) :

#jfiles.dir <- "...D:\\...." # set dir
jfiles.dir <- "." # set dir #changed
list.record.jfiles <- list.files(jfiles.dir, "example") # list files in folder

jimported.df <- data.frame()

for (n in 1:length(list.record.jfiles)) { 
  
  jimported <- jsonlite::read_json(paste(jfiles.dir,list.record.jfiles[n],sep='/'), 
                                   simplifyVector = TRUE)  
  
    if (length(jimported) > 1) { 
    for (k in 1:length(jimported)) {
      jimported.df <- rbind.data.frame(jimported.df,
                                       cbind(data.frame(exp=names(jimported[k])),jimported[[k]]))
    
    }
  } else {
    jimported.df <-rbind.data.frame(jimported.df,
                                    cbind(data.frame(exp=names(jimported[1])),jimported[[1]])) 
     
  }
}

library(magrittr)
library(tidyr)

jimported.df2 = jimported.df %>%
  unnest(var1) %>%
  unnest(var2) %>%
  unnest(var3) 

str(jimported.df2) 

with output

 str(jimported.df2) 
tibble [6 x 4] (S3: tbl_df/tbl/data.frame)
 $ exp : chr [1:6] "Experiment1_01" "Experiment1_01" "Experiment1_01" "Experiment1_01" ...
 $ var1: chr [1:6] "[[A01]]" "[[A01]]" "[[A01]]" "[[A01]]" ...
 $ var2: chr [1:6] "2" "2" "1" "1" ...
 $ var3: chr [1:6] "0.2" "0.5" "0.2" "0.5" ...
> 

when applied to one example file with contents

{  
   "Experiment1_01":[
     {
       "var1" : ["[[A01]]"],
       "var2" : ["2","1"],
       "var3" : ["0.2", "0.5"]
     }]
     ,
  "Experiment2_01":[
    {
      "var1" : ["[[A03]]"],
      "var2" : ["1"],
      "var3" : ["0.3","0.5"]
    }
    ]
}

Thats it! Thanks

This is a neat thing you add -- I did thought about it until I saw it there.

This part now makes it just work for a system with these variables,

I want to make it future proof, and changed it to this: what do you think?

jimported.df2 <- unnest(jimported.df, 
                         cols = colnames(jimported.df)[2:length(jimported.df)]) # start at 2 because 1 is for the experiment name

No this not future proof: it does not even work at the moment.

  1. Try the input
{  
   "Experiment1_01":[
     {
       "var1" : ["[[A01]]"],
       "var2" : ["2","1"],
       "var3" : ["0.2", "0.5"]
     }]
     ,
  "Experiment2_01":[
    {
      "var1" : ["[[A03]]"],
      "var2" : ["1"],
      "var3" : ["0.3","0.5"]
    }
    ]
}

in the result the first resp. second var2 is matched only with the first resp. second var3, so not all combinations are made

print(jimported.df2) 
# A tibble: 4 x 4
  exp            var1    var2  var3 
  <chr>          <chr>   <chr> <chr>
1 Experiment1_01 [[A01]] 2     0.2  
2 Experiment1_01 [[A01]] 1     0.5  
3 Experiment2_01 [[A03]] 1     0.3  
4 Experiment2_01 [[A03]] 1     0.5  
  1. Try the input
{  
   "Experiment1_01":[
     {
       "var1" : ["[[A01]]"],
       "var2" : ["2","1","3"],
       "var3" : ["0.2", "0.5"]
     }]
     ,
  "Experiment2_01":[
    {
      "var1" : ["[[A03]]"],
      "var2" : ["1"],
      "var3" : ["0.3","0.5"]
    }
    ]
}

where var2 and var3 in Experiment1_01 have lengths that are greater than 1 but are not equal.

jimported.df2 <- unnest(jimported.df, 
+       cols = colnames(jimported.df)[2:length(jimported.df)]) # start at 2 because 1 is for the experiment name
Error: Incompatible lengths: 3, 2.
Run `rlang::last_error()` to see where the error occurred.
> 
> 
> print(jimported.df2) 
Error in print(jimported.df2) : object 'jimported.df2' not found
> 

I had the same idea (great minds etc. :grinning:) but saw in the help info the notice

cols	<tidy-select> Columns to unnest.

If you unnest() multiple columns, parallel entries must be of compatible sizes, i.e. they're either equal or length 1 (following the standard tidyverse recycling rules).

I think have found a way with this code, look:

jimported.tidy <-  data.frame()
for (x in 1:nrow(jimported.df)){
  
temp.jimported.tidy <- jimported.df[x,]
  for(y in  2:length(temp.jimported.tidy)) {
     # start the unnest in col 2
    temp.jimported.tidy <- temp.jimported.tidy %>%
      unnest_longer(col = colnames(temp.jimported.tidy[y]), indices_include = FALSE)
  }
jimported.tidy = rbind.data.frame(jimported.tidy,temp.jimported.tidy)
}

the str():

tibble [14 x 4] (S3: tbl_df/tbl/data.frame)
 $ exp : chr [1:14] "Experiment1_01" "Experiment1_01" "Experiment1_01" "Experiment1_01" ...
 $ var1: chr [1:14] "[[A01]]" "[[A01]]" "[[A01]]" "[[A01]]" ...
 $ var2: chr [1:14] "10" "10" "10" "20" ...
 $ var3: chr [1:14] "10" "20" "30" "10" ...

it worked for a file with the structure:

{
    "Experiment1_01":[
      {
        "var1" : ["[[A01]]", "[[B01]]"],
        "var2" : ["10","20"],
        "var3" : ["10", "20", "30"]
      }],
    "Experiment1_02":[
      {
        "var1" : ["[[A02]]"],
        "var2" : ["3"],
        "var3" : ["0.1", "10"]
      }]
  }

what do you think?

I wonder if there is a better way to write those instructions though