Splitting into two groups

How do I split a dataset into two groups, where group 1 has the first 4 disease samples and the first 4 normal samples, and group 2 has the remaining 3 disease and 3 normal samples? Group 1 has sample IDs '454', '3', '554', '202' as normal samples and '531', '18', '681', '423' as disease samples; group 2 has the rest of the samples. Ignore the two samples with missing disease status. How do I go about it?

"Index" "SampleID" "Disease"
1 454 "N"
2 3 "N"
3 531 "Y"
4 18 "Y"
5 554 "N"
6 202 "N"
7 559 "N"
8 203 "N"
9 681 "Y"
10 423 "Y"
11 710 "Y"
12 768 "Y"
13 "A81" "?"
14 "A82" "?"
15 "A11" "Y"
16 "A101" "N"

Let's say I have data that looks like this:

# A tibble: 27,578 x 202
   Index SYMBOL `454.AVG_Beta` `454.Avg_NBEADS… `454.Avg_NBEADS… `454.BEAD_STDER…
   <int> <chr>           <dbl>            <int>            <int>            <int>
 1     1 ATP2A1         0.755                16               13               36
 2     2 SLMAP          0.722                12               18               30
 3     3 MEOX2          0.0975               20               25              111
 4     4 HOXD3          0.146                20               19              122
 5     5 ZNF398         0.102                16               20              181
 6     6 PANX1          0.0626               12               13              543
 7     7 COX8C          0.964                19               15               17
 8     8 IMPA2          0.0240               15               25              494
 9     9 TTC8           0.0109               23               20              562
10    10 FLJ35…         0.657                21               21              470
# ... with 27,568 more rows, and 196 more variables: `454.BEAD_STDERR_B` <int>,
#   `454.Signal_A` <int>, `454.Signal_B` <int>, `454.Detection Pval` <dbl>,
#   `454.Intensity` <int>, `3.AVG_Beta` <dbl>, `3.Avg_NBEADS_A` <int>,
#   `3.Avg_NBEADS_B` <int>, `3.BEAD_STDERR_A` <int>, `3.BEAD_STDERR_B` <int>,
#   `3.Signal_A` <int>, `3.Signal_B` <int>, `3.Detection Pval` <dbl>,
#   `3.Intensity` <int>, `531.AVG_Beta` <dbl>, `531.Avg_NBEADS_A` <int>,
#   `531.Avg_NBEADS_B` <int>, `531.BEAD_STDERR_A` <int>,
#   `531.BEAD_STDERR_B` <int>, `531.Signal_A` <int>, `531.Signal_B` <int>,
#   `531.Detection Pval` <dbl>, `531.Intensity` <int>, `18.AVG_Beta` <dbl>,
#   `18.Avg_NBEADS_A` <int>, `18.Avg_NBEADS_B` <int>, `18.BEAD_STDERR_A` <int>,
#   `18.BEAD_STDERR_B` <int>, `18.Signal_A` <int>, `18.Signal_B` <int>,
#   `18.Detection Pval` <dbl>, `18.Intensity` <int>, `554.AVG_Beta` <dbl>,
#   `554.Avg_NBEADS_A` <int>, `554.Avg_NBEADS_B` <int>,
#   `554.BEAD_STDERR_A` <int>, `554.BEAD_STDERR_B` <int>, `554.Signal_A` <int>,
#   `554.Signal_B` <int>, `554.Detection Pval` <dbl>, `554.Intensity` <int>,
#   `202.AVG_Beta` <dbl>, `202.Avg_NBEADS_A` <int>, `202.Avg_NBEADS_B` <int>,
#   `202.BEAD_STDERR_A` <int>, `202.BEAD_STDERR_B` <int>, `202.Signal_A` <int>,
#   `202.Signal_B` <int>, `202.Detection Pval` <dbl>, `202.Intensity` <int>,
#   `559.AVG_Beta` <dbl>, `559.Avg_NBEADS_A` <int>, `559.Avg_NBEADS_B` <int>,
#   `559.BEAD_STDERR_A` <int>, `559.BEAD_STDERR_B` <int>, `559.Signal_A` <int>,
#   `559.Signal_B` <int>, `559.Detection Pval` <dbl>, `559.Intensity` <int>,
#   `203.AVG_Beta` <dbl>, `203.Avg_NBEADS_A` <int>, `203.Avg_NBEADS_B` <int>,
#   `203.BEAD_STDERR_A` <int>, `203.BEAD_STDERR_B` <int>, `203.Signal_A` <int>,
#   `203.Signal_B` <int>, `203.Detection Pval` <dbl>, `203.Intensity` <int>,
#   `681.AVG_Beta` <dbl>, `681.Avg_NBEADS_A` <int>, `681.Avg_NBEADS_B` <int>,
#   `681.BEAD_STDERR_A` <int>, `681.BEAD_STDERR_B` <int>, `681.Signal_A` <int>,
#   `681.Signal_B` <int>, `681.Detection Pval` <dbl>, `681.Intensity` <int>,
#   `423.AVG_Beta` <dbl>, `423.Avg_NBEADS_A` <int>, `423.Avg_NBEADS_B` <int>,
#   `423.BEAD_STDERR_A` <int>, `423.BEAD_STDERR_B` <int>, `423.Signal_A` <int>,
#   `423.Signal_B` <int>, `423.Detection Pval` <dbl>, `423.Intensity` <int>,
#   `710.AVG_Beta` <dbl>, `710.Avg_NBEADS_A` <int>, `710.Avg_NBEADS_B` <int>,
#   `710.BEAD_STDERR_A` <int>, `710.BEAD_STDERR_B` <int>, `710.Signal_A` <int>,
#   `710.Signal_B` <int>, `710.Detection Pval` <dbl>, `710.Intensity` <int>,
#   `768.AVG_Beta` <dbl>, `768.Avg_NBEADS_A` <int>, `768.Avg_NBEADS_B` <int>,
#   `768.BEAD_STDERR_A` <int>, `768.BEAD_STDERR_B` <int>, …

I think you’re going to need to explain:

  • How “first” is meant to be determined (by the value of the index field?)
  • What output you are looking for: two new data frames? Something else?
  • What’s supposed to happen to the data where Disease == "?"

I’m also not sure I understand how the structure of your actual giant data frame relates to your task? Do you want all the other variables in the data frame to accompany the Disease and SampleID variables, or do you want to drop them? What about all the other rows?


@jcblum,

I need clarification on:

  • How “first” is meant to be determined (by the value of the index field?)
    Yes. Group 1 has sample IDs '454', '3', '554', '202' as normal samples and '531', '18', '681', '423' as disease samples; group 2 has the rest of the samples. Ignore the two samples with missing disease status.

  • What output we are looking for: two new data frames? Something else?
    We use the samples in group 1 as discovery samples and the samples in group 2 as replication samples. We perform t-tests to identify mean changes between disease samples and normal samples for each row of the data file.

  • What's supposed to happen to the data where Disease == "?"
    Ignore them, as we do not know their disease status. This happens in real data analysis.

I’m also not sure I understand how the structure of the actual giant data frame relates to the task? Do we want all the other variables in the data frame to accompany the Disease and SampleID variables, or do you want to drop them? What about all the other rows?

Each row represents the data for one gene. We compare the responses between disease samples and normal samples for each gene. The columns correspond to the samples.
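As a hedged sketch of that per-gene comparison (only the sample IDs come from this thread; the matrix of beta values is made up, and the real table would of course have ~27,000 rows):

```r
set.seed(1)
normal_ids  <- c("454", "3", "554", "202")
disease_ids <- c("531", "18", "681", "423")

# Pretend beta-value matrix: 5 genes x 8 samples, one column per sample
betas <- matrix(runif(5 * 8), nrow = 5,
                dimnames = list(paste0("gene", 1:5),
                                c(normal_ids, disease_ids)))

# One t-test per row (gene), disease vs. normal
pvals <- apply(betas, 1, function(row) {
  t.test(row[disease_ids], row[normal_ids])$p.value
})
pvals
```

With real data you'd run this once per group (discovery and replication), then compare the results.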

A couple of sticking points I’m seeing:

Some of your sample IDs are not numeric. If you want to determine order based on sample ID, how are the non-numeric IDs supposed to fit in?

Based on your original question, it sounded like you wanted one group to contain 8 rows and the other group to contain 9 rows — so I’m afraid I’m still not understanding what is supposed to happen to the other 27,000+ rows in the actual data frame :thinking:.

@jcblum, I answered your questions above.

It would be better if you put your answers in a new reply instead of editing the old one. Editing that reply makes it very hard to follow the conversation.

Also, the way it is formatted makes it difficult to tell which parts are your answers and which parts are my questions. Suggestions:

  • You can use > at the beginning of a line to format it as a block quote
  • You can wrap a line in [quote="jcblum"] and [/quote] tags to get quoted formatting with attribution

If you know the Sample IDs of Group 1, and Group 2 is “all the rest”, then why not add the group designator in as a variable, then split() on the variable (or don’t split and write other code that filters the dataframe before running — your choice). Would that work with your plans?

I tried the group designator and it omitted all the values. What am I missing?

> by_Group1 <- SampleData_trim %>% group_by(454,3,554,202,531,18,681,423)
> select(by_Group1)
Adding missing grouping variables: `454`, `3`, `554`, `202`, `531`, `18`, `681`, `423`
# A tibble: 27,578 x 8
# Groups:   454, 3, 554, 202, 531, 18, 681, 423 [1]
   `454`   `3` `554` `202` `531`  `18` `681` `423`
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1   454     3   554   202   531    18   681   423
 2   454     3   554   202   531    18   681   423
 3   454     3   554   202   531    18   681   423
 4   454     3   554   202   531    18   681   423
 5   454     3   554   202   531    18   681   423
 6   454     3   554   202   531    18   681   423
 7   454     3   554   202   531    18   681   423
 8   454     3   554   202   531    18   681   423
 9   454     3   554   202   531    18   681   423
10   454     3   554   202   531    18   681   423
# ... with 27,568 more rows

group_by() interpreted those bare values as references to columns, and then created columns to match. Try something like:

by_Group1 <- SampleData_trim %>% 
  mutate(group_id = if_else(SampleID %in% c(454,3,554,202,531,18,681,423), 1, 2)) %>%
  group_by(group_id)

(The above code only works if your sample ID column is actually called SampleID and its values are numeric. If it's a column of character values, you'll need if_else(SampleID %in% c("454","3","554","202","531","18","681","423"), 1, 2).)

To be clear, when I mentioned adding a group designator as a variable, I was talking about the mutate() step. After that, you can use filter() to do something with only the rows that have a certain grouping ID, or (as you tried) you can go ahead and group_by() that variable, so that summary functions will be applied per group.
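To make the filter() step concrete, here is a toy sketch (the tiny tibble and its value column are made up; only the SampleID column name and the group 1 IDs come from the thread):

```r
library(dplyr)

# Toy stand-in for the real data frame, which has ~200 columns
toy <- tibble(SampleID = c(454, 3, 559, 710),
              value    = c(0.1, 0.2, 0.3, 0.4))

group1_ids <- c(454, 3, 554, 202, 531, 18, 681, 423)

# Tag each row with its group, then keep only the group 1 rows
group1_rows <- toy %>%
  mutate(group_id = if_else(SampleID %in% group1_ids, 1, 2)) %>%
  filter(group_id == 1)

group1_rows  # keeps only the rows for samples 454 and 3
```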

One method involving splitting into separate data frames looks like this:

library(tidyverse)

# Construct an example data frame reproducibly
set.seed(42)
dfr <- data.frame(
  sample_id = as.character(sample(1:200, size = 10)),
  numeric_var = rnorm(10, mean = 10, sd = 2),
  categorical_var = factor(sample(LETTERS[1:3], size = 10, replace = TRUE)),
  stringsAsFactors = FALSE
)

dfr
#>    sample_id numeric_var categorical_var
#> 1        183    9.787751               C
#> 2        187   13.023044               C
#> 3         57    9.810682               B
#> 4        164   14.036847               C
#> 5        126    9.874572               A
#> 6        102   12.609739               C
#> 7        143   14.573291               A
#> 8         26    7.222279               A
#> 9        127    9.442422               C
#> 10       135    9.733357               B

# If we already know which samples belong in Group 1
group_1 <- c("57", "126", "143", "135")

# Adding the group as a variable, tidyverse-style
dfr <- mutate(dfr, group_id = if_else(sample_id %in% group_1, 1, 2))

dfr
#>    sample_id numeric_var categorical_var group_id
#> 1        183    9.787751               C        2
#> 2        187   13.023044               C        2
#> 3         57    9.810682               B        1
#> 4        164   14.036847               C        2
#> 5        126    9.874572               A        1
#> 6        102   12.609739               C        2
#> 7        143   14.573291               A        1
#> 8         26    7.222279               A        2
#> 9        127    9.442422               C        2
#> 10       135    9.733357               B        1

# Here's how you'd do it in base R
# dfr$group_id <- ifelse(dfr$sample_id %in% group_1, 1, 2)

Splitting into a list of data frames

# Splitting on `group_id`
split(dfr, dfr$group_id) %>% set_names(c("group1", "group2"))
#> $group1
#>    sample_id numeric_var categorical_var group_id
#> 3         57    9.810682               B        1
#> 5        126    9.874572               A        1
#> 7        143   14.573291               A        1
#> 10       135    9.733357               B        1
#> 
#> $group2
#>   sample_id numeric_var categorical_var group_id
#> 1       183    9.787751               C        2
#> 2       187   13.023044               C        2
#> 4       164   14.036847               C        2
#> 6       102   12.609739               C        2
#> 8        26    7.222279               A        2
#> 9       127    9.442422               C        2

# The output of `split()` is a list, with each element containing a data frame.
# You can leave it that way...
dfr_list <- split(dfr, dfr$group_id) %>% set_names(c("group1", "group2"))

# ...which makes it easy to apply subsequent code to both data frames in parallel
map_dbl(dfr_list, ~mean(.$numeric_var))
#>   group1   group2 
#> 10.99798 11.02035

# base R version
# unlist(lapply(dfr_list, function(x) { mean(x$numeric_var)}))

# You can access an individual data frame in the list like:
dfr_list$group1
#>    sample_id numeric_var categorical_var group_id
#> 3         57    9.810682               B        1
#> 5        126    9.874572               A        1
#> 7        143   14.573291               A        1
#> 10       135    9.733357               B        1

# Or dfr_list[["group1"]] or dfr_list[[1]]

Created on 2018-11-22 by the reprex package (v0.2.1)

Which pattern makes sense for you depends on what you're doing downstream. There are other possibilities, too! A whole lot of related discussion and resources can be found in these topics:



If all your rows are in the data.frame mydata:

group1 = rbind(subset(mydata, Disease=="Y")[1:4,], subset(mydata, Disease=="N")[1:4,])
group2 = rbind(subset(mydata, Disease=="Y")[5:7,], subset(mydata, Disease=="N")[5:7,])

This, of course, only works in the data set you showed in your post.
More general:

diseased = subset(mydata, Disease=="Y")
healthy = subset(mydata, Disease=="N")
group1 = rbind(diseased[1:4,], healthy[1:4,])
group2 = rbind(diseased[5:nrow(diseased),], healthy[5:nrow(healthy),])

Of course, this is not "tidyverse"-ish :slight_smile:

Cheers

@stkrog, this is helpful. I tried it and got this output:

group1 = rbind(subset(SampleData_trim, Disease=="Y")[1:4,], subset(SampleData_trim, Disease=="N")[1:4,])
Error in eval(e, x, parent.frame()) : object 'Disease' not found
> group2 = rbind(subset(SampleData_trim, Disease=="Y")[5:7,], subset(SampleData_trim, Disease=="N")[5:7,])
Error in eval(e, x, parent.frame()) : object 'Disease' not found
> 

The problem is that Disease isn't a column in the data. How do I split the data by taking the corresponding columns out?

I think it may be dawning on me what you are asking for. Correct me if I'm wrong.
In your original post, the first table is a list of the relations between samples and Disease. The next table you show (which I basically just ignored) actually contains nine columns for each sample. You want one data frame (or tibble or whatever) with all the columns from the second table that correspond to samples in your group 1, and another data frame with the columns that correspond to samples in your group 2.
If this is true then the problem is much more elaborate than I originally thought. Before I go into suggestions for solutions please confirm that I have understood you correctly.
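If that reading is correct, one possible sketch is to select columns whose names start with a group's sample IDs (toy data below; the tibble has only two of the nine per-sample columns, and the approach assumes every per-sample column name is the sample ID followed by a dot):

```r
library(dplyr)

# Toy version of the wide table: two genes, two samples
wide <- tibble(
  Index = 1:2, SYMBOL = c("ATP2A1", "SLMAP"),
  `454.AVG_Beta` = c(0.755, 0.722), `454.Intensity` = c(100L, 120L),
  `710.AVG_Beta` = c(0.30, 0.40),   `710.Intensity` = c(90L, 95L)
)

group1_ids <- c("454", "3", "554", "202", "531", "18", "681", "423")

# Keep the annotation columns plus every column whose name starts with
# a group 1 sample ID followed by a dot
group1_cols <- wide %>%
  select(Index, SYMBOL,
         matches(paste0("^(", paste(group1_ids, collapse = "|"), ")\\.")))

names(group1_cols)
```

Sample 710 is in group 2, so its columns are dropped and only the 454 columns survive along with Index and SYMBOL.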


@stkrog, you got it right. It's more elaborate. What do you suggest I do?

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.