Tidy up bacterial data

Hello there ! :grinning:

I have to do a NMDS (Non-metric multidimensional scaling) on my bacterial datas which represents somr factors and their score for each strains, when there is a "-", that means the strain doesn't have the factor.
The problem is that my datas are not very organised but I don't know how to proceed without loosing a lot of my datas:
Here is the first lines of the file :

If you have any suggestions ? Thanks :slight_smile:

your screenshot implies that you have the data in WPS/SAS , so probably it will be most convenient to use haven package in R to import the sas7bdat file, and skip any excel steps in your workflow.

Thank you for your answer,

here is a view on rstudio

I imported it with
"library("xlsx")
df <- read.xlsx("/home/raw.xlsx", 1)"

thanks,but no thanks. eh ? :laughing:

1 Like

Hi @LarraFarron,

Welcome! Take a look at this:

# Load libraries ----------------------------------------------------------
library("tidyverse")

# Set example data --------------------------------------------------------
set.seed(9481)
d <- tibble(ids = str_c("strain", 1:5),
            `-` = c("0", rep("-", 4)),
            F1 = sample(c(rnorm(2), "-"), size = 5, replace = TRUE),
            F2 = sample(c(rnorm(2), "-"), size = 5, replace = TRUE),
            V1 = sample(c(rnorm(2), "-"), size = 5, replace = TRUE),
            V2 = sample(c(rnorm(2), "-"), size = 5, replace = TRUE),
            X1 = sample(c(rnorm(2), "-"), size = 5, replace = TRUE),
            X2 = sample(c(rnorm(2), "-"), size = 5, replace = TRUE))

# Wrangle data ------------------------------------------------------------
d_clean <- d %>%
  pivot_longer(cols = -ids, names_to = "strain", values_to = "score") %>%
  mutate(score = case_when(score == "-" ~ NA_character_,
                           score != "-" ~ score) %>% as.numeric)

Yielding:

> d
# A tibble: 5 x 8
  ids     `-`   F1                F2                 V1                 V2                X1               X2              
  <chr>   <chr> <chr>             <chr>              <chr>              <chr>             <chr>            <chr>           
1 strain1 0     -                 -0.334358504586595 0.903817354962354  -                 1.12019753417009 1.34203949772225
2 strain2 -     -                 -0.290915459898793 -                  0.701652663546783 1.12019753417009 -               
3 strain3 -     0.301567369458915 -                  -0.199246032293474 0.701652663546783 1.12019753417009 -               
4 strain4 -     0.301567369458915 -0.290915459898793 -                  0.701652663546783 0.32104758856752 1.34203949772225
5 strain5 -     0.301567369458915 -0.334358504586595 0.903817354962354  0.701652663546783 1.12019753417009 1.34203949772225
> d_clean
# A tibble: 35 x 3
   ids     strain  score
   <chr>   <chr>   <dbl>
 1 strain1 -       0    
 2 strain1 F1     NA    
 3 strain1 F2     -0.334
 4 strain1 V1      0.904
 5 strain1 V2     NA    
 6 strain1 X1      1.12 
 7 strain1 X2      1.34 
 8 strain2 -      NA    
 9 strain2 F1     NA    
10 strain2 F2     -0.291
# … with 25 more rows

...and then take a look at this excellent ressource R for Data Science, which I think will benefit you.

Hope it helps :slightly_smiling_face:

Thank you very much Leon, it's way better :slight_smile: I'll go take a look at this ressource :wink:

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.