Assigning numeric data to groups defined by a range, a single number, or several numbers

Sheneice.c · October 15, 2019, 8:43am

Hi, I am aware of the cut() function for binning data by value. However, some of my categories are defined by a range of numbers, while others are defined by a single number. For example:

groups <- c("A", "B", "C", "D", "E")
value <- c("9.9-13.9", "32.1-32.9", "0.9-6.9", 73, "14.8 AND 41.1")
data <- cbind(groups, value)

I intend to show the values as ranges, but I don't know how. And yes, the values in my data are numeric.

The problem is that some groups are defined by a range of values, some by a single number, and the last group by two different values.
I intend to replace each value in my original data with its group, as defined in data, and then plot the frequency of the groups.

How should I do this? Thank you in advance for any comments and suggestions!

pieterjanvc · October 15, 2019, 10:37am

Hi,

I hope I understood your question. This would be my suggestion:

library(stringr)
library(dplyr)

options(stringsAsFactors = F)

groups <- c("A", "B", "C", "D", "E")
value <- c("9.9-13.9", "32.1-32.9", "0.9-6.9", 73, "14.8 AND 41.1")
data <- cbind(groups, value)

data = data.frame(data)

data = purrr::map_df(1:nrow(data), function(x){ # x = 1
  value = data$value[x]
  
  if(str_detect(value, "-")){
    myRange = as.numeric(unlist(str_split(data$value[x], "-")))
    data.frame(groups = data$groups[x],
               start = myRange[1], end = myRange[2], info = "range")
  } else if(str_detect(value, "AND")){
    myVals = as.numeric(unlist(str_split(data$value[x], "AND")))
    data.frame(groups = data$groups[x],
               start = myVals, end = myVals, info = "multiple")
  } else if(!is.na(as.numeric(value))){
    myVals = as.numeric(value)
    data.frame(groups = data$groups[x],
               start = myVals, end = myVals, info = "single")
  } else {
    data.frame(groups = data$groups[x],
               start = NA, end = NA, info = "error")
  }
})

The output would look like this:


  groups start  end     info
1      A   9.9 13.9    range
2      B  32.1 32.9    range
3      C   0.9  6.9    range
4      D  73.0 73.0   single
5      E  14.8 14.8 multiple
6      E  41.1 41.1 multiple

Ranges now have a start and end value
Single numbers have the same start and end value
Numbers split by AND are in separate rows, each treated as a single number

I think this format makes further analysis of the data easier.

Hope this helps,
PJ

Sheneice.c · October 15, 2019, 3:54pm

Thank you so much for the help, @pieterjanvc!
I would like to go the extra mile with the analysis, since my final goal is to replace my original (numeric) data with the groups defined in data. I only have experience substituting single and multiple values, and have had no luck substituting ranges. Could someone suggest a good reference, and is there a way to substitute all groups, whatever their info type, at once?

Appreciate any suggestion!

pieterjanvc · October 15, 2019, 4:20pm

Hi,

I'm afraid I do not understand what your question is here. Are the code and output I provided already a step in the right direction?

Please write out a detailed example of a before and after dataset (like I did at the end of my code) so I can see what you are trying to accomplish and explain to me what filtering / substitutions are needed.

Are you trying to filter data depending on whether their range (or values) contain a certain numeric value? For example: get all groups where the number 10.0 is in the range (would be A).
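
A minimal base R sketch of that kind of lookup, assuming a table with the groups, start and end columns produced by the code above. The name lookup and its hard-coded rows are only for illustration:

```r
# Hypothetical lookup table mirroring the parsed output shown earlier
lookup <- data.frame(
  groups = c("A", "B", "C", "D", "E", "E"),
  start  = c(9.9, 32.1, 0.9, 73, 14.8, 41.1),
  end    = c(13.9, 32.9, 6.9, 73, 14.8, 41.1),
  stringsAsFactors = FALSE
)

# Groups whose interval [start, end] contains the query value
query <- 10.0
matches <- lookup$groups[lookup$start <= query & lookup$end >= query]
matches
```

Single numbers are stored with start equal to end, so the same comparison covers ranges and single values alike.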

Kind regards,
PJ

Sheneice.c · October 16, 2019, 3:33am

Hi,
The code and output you provided are a step in the right direction. An example of the dataset I would like to end up with is below:

# Before
value2 <- c(73.0, 32.9, 10.0, 6.1, 14.8, 41.1)
sample <- 1:6   
data2 <- cbind(sample, value2)

# After
Grouping <- c("D", "B", "A", "C", "E", "E") 
finalData <- cbind(sample, Grouping)

pieterjanvc · October 16, 2019, 12:20pm

Hi,

It was as I thought then. Here is my implementation:

library(stringr)
library(dplyr)

options(stringsAsFactors = F)

groups <- c("A", "B", "C", "D", "E")
value <- c("9.9-13.9", "32.1-32.9", "0.9-6.9", 73, "14.8 AND 41.1")
data <- cbind(groups, value)

data = data.frame(data)

data = purrr::map_df(1:nrow(data), function(x){ # x = 1
  value = data$value[x]
  
  if(str_detect(value, "-")){
    myRange = as.numeric(unlist(str_split(data$value[x], "-")))
    data.frame(groups = data$groups[x],
               start = myRange[1], end = myRange[2], info = "range")
  } else if(str_detect(value, "AND")){
    myVals = as.numeric(unlist(str_split(data$value[x], "AND")))
    data.frame(groups = data$groups[x],
               start = myVals, end = myVals, info = "multiple")
  } else if(!is.na(as.numeric(value))){
    myVals = as.numeric(value)
    data.frame(groups = data$groups[x],
               start = myVals, end = myVals, info = "single")
  } else {
    data.frame(groups = data$groups[x],
               start = NA, end = NA, info = "error")
  }
})


# New input
value2 <- c(73.0, 32.9,  10.0,  6.1, 14.8, 41.1)
sample <- 1:6   
data2 <- cbind(sample, value2)

#Find groups for input
data2 = data.frame(data2)
data2 = data2 %>% mutate(groups = sapply(value2, function(x) {
  data %>% filter(start <= x, end >= x) %>% pull(groups)
  }))

data2
  sample value2 groups
1      1   73.0      D
2      2   32.9      B
3      3   10.0      A
4      4    6.1      C
5      5   14.8      E
6      6   41.1      E

Note that this simple version assumes that only one group can match each value, i.e. the ranges in the first dataset do not overlap. If they did, there would be multiple possible groups per input and the code would need to be expanded (not that difficult).
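
One possible way to expand it, sketched in base R under the assumption that overlapping ranges should report every matching group and that unmatched values should become NA. The helper name find_groups, the overlapping group F, and the unmatched value 50 are hypothetical, added only to show the behaviour:

```r
# Lookup table as parsed above, plus a hypothetical group F that overlaps A and C
lookup <- data.frame(
  groups = c("A", "B", "C", "D", "E", "E", "F"),
  start  = c(9.9, 32.1, 0.9, 73, 14.8, 41.1, 5.0),
  end    = c(13.9, 32.9, 6.9, 73, 14.8, 41.1, 11.0),
  stringsAsFactors = FALSE
)

# Return all matching groups as one comma-separated string, or NA if none match
find_groups <- function(x, lookup) {
  hits <- unique(lookup$groups[lookup$start <= x & lookup$end >= x])
  if (length(hits) == 0) NA_character_ else paste(hits, collapse = ",")
}

value2 <- c(73.0, 32.9, 10.0, 6.1, 14.8, 41.1, 50.0)
result <- data.frame(
  sample = seq_along(value2),
  value2 = value2,
  groups = vapply(value2, find_groups, character(1), lookup = lookup),
  stringsAsFactors = FALSE
)
result
```

Using vapply with character(1) guarantees exactly one string per input value, so the groups column stays a plain character vector even when a value matches several groups or none.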

Hope this helps,
PJ

Sheneice.c · October 16, 2019, 12:37pm

Hi,
Thank you so much for the suggestion. However, it ended up with an error:

library(stringr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

options(stringsAsFactors = F)

groups <- c("A", "B", "C", "D", "E")
value <- c("9.9-13.9", "32.1-32.9", "0.9-6.9", 73, "14.8 AND 41.1")
data1 <- cbind(groups, value)

data1 = data.frame(data1)

data1 = purrr::map_df(1:nrow(data1), function(x){ # x = 1
  value = data1$value[x]

  if(str_detect(value, "-")){
    myRange = as.numeric(unlist(str_split(data1$value[x], "-")))
    data.frame(groups = data1$groups[x],
               start = myRange[1], end = myRange[2], info = "range")
  } else if(str_detect(value, "AND")){
    myVals = as.numeric(unlist(str_split(data1$value[x], "AND")))
    data.frame(groups = data1$groups[x],
               start = myVals, end = myVals, info = "multiple")
  } else if(!is.na(as.numeric(value))){
    myVals = as.numeric(value)
    data.frame(groups = data1$groups[x],
               start = myVals, end = myVals, info = "single")
  } else {
    data.frame(groups = data1$groups[x],
               start = NA, end = NA, info = "error")
  }
})

# New input
value2 <- c(73.0, 32.9,  10.0,  6.1, 14.8, 41.1)
sample <- 1:6   
data2 <- cbind(sample, value2)

#Find groups for input
data2 = data.frame(data2)
data2 = data2 %>% mutate(groups = sapply(value2, function(x) {
  data %>% filter(start <= x, end >= x) %>% pull(groups)
}))
#> Error in UseMethod("filter_"): no applicable method for 'filter_' applied to an object of class "function"

pieterjanvc · October 16, 2019, 12:44pm

Hi,

This sounds as if there is an error with the function loaded from the packages. You did install the dplyr package right?

Try using the explicit call of the filter function:

data %>% dplyr::filter(start <= x, end >= x) %>% pull(groups)

Alternatively, use a non-tidyverse implementation:

data2$groups = sapply(data2$value2, function(x) {
  data[data$start <= x & data$end >= x, "groups"]
})

PJ

Sheneice.c · October 16, 2019, 12:53pm

Hi,
Yes, I did install dplyr. I tried reinstalling it and rerunning the code after restarting, but the problem is not resolved...

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
data %>% dplyr::filter(start <= x, end >= x) %>% pull(groups)
#> Error in UseMethod("filter_"): no applicable method for 'filter_' applied to an object of class "function"

pieterjanvc · October 16, 2019, 1:56pm

That's weird ... It's working on my end.

You did copy-paste all the code right?
Ensure that data and data2 are data frames

data = data.frame(data)
data2 = data.frame(data2)

If this is not the issue, did you get the error with the non-tidyverse implementation?
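
One quick way to check this, sketched under the assumption that the lookup table may have been saved under another name (such as data1 in the reprex above). In a session where no object called data has been created, the name data resolves to the function utils::data, which would be consistent with the 'applied to an object of class "function"' part of the error message:

```r
# Check what the name `data` refers to in the current session.
# If no data frame called `data` exists, this finds the function utils::data.
data_is_function <- is.function(data)
data_is_function

# A lookup table stored under a different name (data1 is taken from the reprex;
# the single row here is only a placeholder)
data1 <- data.frame(groups = "A", start = 9.9, end = 13.9)
is.data.frame(data1)
```

If data_is_function is TRUE, pointing the filter step at the object that actually holds the lookup table (data1 in this sketch) would be the first thing to try.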

PJ

Sheneice.c · October 16, 2019, 2:34pm

Hi,
Thank you for everything! I put the problem aside and continued with other parts of my analysis, and somehow it now works without any error. (The only possible reason I can think of is that I updated some other packages in RStudio, but dplyr was not one of them.) Thank you again for helping me out!
