How to make data into groups and then into subgroups

hellovivvvv · July 8, 2020, 2:19am

Step 1: I want to group the dataset into [0,0.5Y), [0.5Y,2Y), [2Y,4Y].
Step 2: Split the data in each group into smaller subgroups.
For example:
[0,0.5Y): [0,0.1Y), [0.1Y,0.2Y), [0.2Y,0.3Y), [0.3Y,0.4Y), [0.4Y,0.5Y)
[0.5Y,2Y): [0.5Y,1Y), [1Y,1.5Y), [1.5Y,2Y)
[2Y,4Y]: [2Y,3Y), [3Y,4Y]
Step 3: Look at the distribution of each group and each subgroup.
Step 4: Subtract the first value in each smaller subgroup.

For steps 1 and 2, since the data has to be grouped twice and the subgroup ranges are not all the same width, I'm quite confused about how to write the code in the most efficient way.
For step 3, I think I can create one dataset consisting of the major groups and another consisting of all the subgroups, and then plot two scatter plots separately.

df <- data.frame(
  maturity = c("0.24Y", "0.6Y", "0.7Y", "0.9Y", "0.98Y", "3Y", "3.5Y", "2.9Y", "0.32Y"),
  price = c(2, 4, 3, 6, 23, 4, 7, 2, 7)
)

nirgrahamuk · July 8, 2020, 7:22am

Representing durations by character strings with Y in them is bound to make data transformations more difficult than they would otherwise be.
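One way around this, as a minimal sketch in base R: strip the trailing "Y" and convert to numeric before doing any grouping. The column name `maturity_years` is only an illustrative choice, and the sketch assumes every value is a number followed by a single "Y".

```r
# Sample data from the question, with maturities stored as strings.
df <- data.frame(
  maturity = c("0.24Y", "0.6Y", "0.7Y", "0.9Y", "0.98Y", "3Y", "3.5Y", "2.9Y", "0.32Y"),
  price = c(2, 4, 3, 6, 23, 4, 7, 2, 7)
)

# Remove the trailing "Y" and convert what is left to a number of years.
# maturity_years is a hypothetical column name, not from the original post.
df$maturity_years <- as.numeric(sub("Y$", "", df$maturity))

str(df$maturity_years)
```

Any value that does not match the assumed pattern would become `NA` with a warning, which is a useful signal that the raw data needs a closer look.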

nirgrahamuk · July 8, 2020, 7:59am


library(tidyverse)

# [0,0.5Y): [0,0.1Y),[0.1,0.2Y),[0.2,0.3Y),[0.3,0.4Y),[0.4,0.5Y)
# [0.5Y,2Y):[0.5,1Y),[1,1.5Y),[1.5,2Y)
# [2Y,4Y):[2,3Y),[3,4Y)

set.seed(42)
example_raw <- data.frame(
  maturity = runif(n = 100,min=0,max=4),
  price= sample.int(n=100,size=100,replace=TRUE)
)

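# Assign each row a major group (g), then a subgroup (sgf) whose breaks depend on g.
# Note: cut() builds right-closed intervals such as (0,0.5] by default;
# pass right = FALSE to get left-closed intervals such as [0,0.5).
(ex_grouped <- example_raw %>%
  mutate(
    g = cut(maturity, breaks = c(0, .5, 2, 4)),
    sg = case_when(
      g == "(0,0.5]" ~ cut(maturity, breaks = 0:5 / 10) %>% as.character(),
      g == "(0.5,2]" ~ cut(maturity, breaks = 1:4 / 2) %>% as.character(),
      g == "(2,4]"   ~ cut(maturity, breaks = c(2, 3, 4)) %>% as.character(),
      TRUE           ~ "99" # fallback label for rows that match no group
    ),
    sgf = factor(sg)
  ) %>%
  select(-sg)
)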

## bar charts of counts, i.e. occurrences of each grouping
ggplot(data=ex_grouped,
       mapping = aes(x=g)) + geom_bar()

ggplot(data=ex_grouped,
       mapping = aes(x=sgf)) + geom_bar()


# Flag the first row of each subgroup, then drop it.
ex_grouped_ident_first <- arrange(ex_grouped, sgf) %>%
  group_by(sgf) %>%
  mutate(is_first = row_number() == 1)

ex_grouped_del_first <- filter(ex_grouped_ident_first, !is_first)
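The code above uses right-closed intervals and reads step 4 as "remove the first row of each subgroup". If the question instead meant left-closed intervals like [0,0.5) and "subtract the first value from every value in the subgroup", a base R sketch might look like the following. This is one possible reading rather than the confirmed intent: it assumes the maturities are already numeric, that "first" means first in row order, and the names `sub_breaks`, `major_breaks` and `price_minus_first` are made up for illustration.

```r
# Sample data from the question, with maturities already converted to numbers.
df <- data.frame(
  maturity = c(0.24, 0.6, 0.7, 0.9, 0.98, 3, 3.5, 2.9, 0.32),
  price    = c(2, 4, 3, 6, 23, 4, 7, 2, 7)
)

# Every major boundary is also a subgroup boundary, so each level needs
# only one call to cut().
major_breaks <- c(0, 0.5, 2, 4)
sub_breaks   <- c(0:5 / 10, 1, 1.5, 2, 3, 4)

# right = FALSE gives left-closed intervals [a, b);
# include.lowest = TRUE closes the last interval so that 4 itself is kept.
df$g  <- cut(df$maturity, breaks = major_breaks, right = FALSE, include.lowest = TRUE)
df$sg <- cut(df$maturity, breaks = sub_breaks, right = FALSE, include.lowest = TRUE)

# Step 3: counts per group and per (non-empty) subgroup.
table(df$g)
table(droplevels(df$sg))

# Step 4, under the "subtract" reading: take each subgroup's first price
# (in row order) away from every price in that subgroup.
first_price <- ave(df$price, droplevels(df$sg), FUN = function(p) p[1])
df$price_minus_first <- df$price - first_price

df
```

Because the subgroup breaks are a superset of the major breaks, there is no need for a `case_when()` per major group in this variant; the trade-off is that all boundaries have to be listed by hand in one vector.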

hellovivvvv · July 9, 2020, 9:16am

Hi, thanks for your reply. May I ask what "breaks = 0:5 / 10" and "breaks = 1:4 / 2" mean? I know they define the ranges of the subgroups, but why are they written like that? And why do we need "sgf = factor(sg)"? Thank you so much!

nirgrahamuk · July 9, 2020, 9:20am

You can run

0:5 / 10

in the console to see the output. It is the same as

c(0.0, 0.1, 0.2, 0.3, 0.4, 0.5)

but with fewer keystrokes, and I like to save my fingers.
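A quick way to convince yourself, as a small sketch to paste into the console: in R the `:` operator binds more tightly than `/`, so the sequence is built first and then every element is divided. The variable names here are only for illustration.

```r
# 0:5 is built first, then each element is divided by 10.
short_form <- 0:5 / 10
long_form  <- c(0.0, 0.1, 0.2, 0.3, 0.4, 0.5)
all.equal(short_form, long_form)

# The same rule gives the breaks for the (0.5, 2] group.
half_steps <- 1:4 / 2
half_steps

# Parentheses change the meaning: this is the sequence from 1 to 4/2, i.e. 1:2.
different <- 1:(4 / 2)
different
```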

As for "sgf = factor(sg)": it is necessary for the result to be a factor, which is what I wanted. The advantages of using factors are partly performance-based (probably not an issue in your scenario) but also convenience, such as the ability to control the order of categories in ggplot.
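To illustrate the ordering point, here is a minimal base R sketch with made-up labels (`sizes` is hypothetical data, not from the thread). By default `factor()` sorts its levels alphabetically; passing `levels` explicitly fixes the order, and tables and ggplot axes then follow that order.

```r
# Hypothetical category labels, purely for illustration.
sizes <- c("mid", "low", "high", "low")

# Default: levels are sorted alphabetically ("high", "low", "mid").
f_default <- factor(sizes)
levels(f_default)

# Explicit levels: the order is whatever you specify.
f_ordered <- factor(sizes, levels = c("low", "mid", "high"))
levels(f_ordered)

# Summaries follow the level order, not the order of appearance.
table(f_ordered)
```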
