Grouping data based on the Summary information of dataframe

I would like to group this data based on the summary of its statistical information. By this I mean group 1 (values between Min - 1st Qu), group 2(values between 1st Qu - Median) group 3 (values between Median - Mean) group 4 (values between - Mean - 3rd Qu) group 5 (values between 3rd Qu -Max). Is this possible to do?

dd = data.frame(stringsAsFactors=FALSE,
           timestamp = c("2018-11-08 07:42:11", "2018-11-08 07:42:55",
                         "2018-11-08 07:43:41", "2018-11-08 07:44:07",
                         "2018-11-08 07:44:57", "2018-11-08 07:45:44", "2018-11-08 07:46:28",
                         "2018-11-08 07:47:20", "2018-11-08 07:47:56", "2018-11-08 07:48:48"),
           hostname = c("host1",
                        "host2",
                        "host2",
                        "host3", 
                        "host2",
                        "host5",
                        "host5", 
                        "host3",
                        "host3",
                        "host1"),
           event_name = c("save", "upload", "render", "upload",
                          "save", "save", "render", "upload",
                          "upload", "render"),
           event_type = c("STOP", "STOP", "STOP", "STOP", "STOP", "STOP", "STOP",
                          "STOP", "STOP", "STOP"),
           time_task = c("25.8089997768402", "40.319000005722", "42.9910001754761",
                         "24.6840000152588", "46.1050000190735", "44.2489998340607",
                         "41.2440001964569", "49.4800000190735", "33.7000000476837",
                         "49.0550000667572"),
           task_id = c("00390eee-c26c-41da-a02d-556bb7fcac67",
                       "dbc599f6-694b-46c4-a864-e09ab881af37",
                       "0ad8d29d-d30c-48c9-bd0a-fbea985464b2", "52881801-4d75-4ada-a118-682aa1d5ddf9",
                       "5c14d761-26af-4602-a51d-6378a4ad7c24",
                       "fa8d5709-ffb6-4a8b-bd73-0076c1654d49", "0ebfe158-0c86-4cde-8742-20d13cc4076b",
                       "403c1ca4-f5d3-4831-8a66-0f8be10f5aeb",
                       "ffd69831-0ba4-457b-b8a8-e37c49779d94", "70a9ab55-b17f-4df6-82ef-146425d7bbfa"))
           
summary(as.numeric(dd$time_task))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   24.68   35.35   42.12   39.76   45.64   49.48

There are a number of different ways to do this. One way that comes to my mind is by using ecdf to build an empirical cumulative distribution function of the column you are interested in.

Here's an example that simplifies your data a bit. I also focused on the quantiles and ignored the group from mean to 3rd quartile. But you should be able to add that easily enough.

dd = data.frame(
  stringsAsFactors = FALSE,
  hostname = c(
    "host1",
    "host2",
    "host2",
    "host3",
    "host2",
    "host5",
    "host5",
    "host3",
    "host3",
    "host1"
  ),
  time_task = as.numeric(
    c(
      "25.8089997768402",
      "40.319000005722",
      "42.9910001754761",
      "24.6840000152588",
      "46.1050000190735",
      "44.2489998340607",
      "41.2440001964569",
      "49.4800000190735",
      "33.7000000476837",
      "49.0550000667572"
    )
  )
)

library(tidyverse)

time_ecdf <- ecdf(dd$time_task)

dd %>%
  mutate(t_ecdf = time_ecdf(time_task)) %>%
  mutate(
    my_group = case_when(
      t_ecdf <= .25 ~ "group_1",
      t_ecdf > .25 &
        t_ecdf <= .5 ~ "group_2",
      t_ecdf > .5 &
        t_ecdf <= .75  ~ "group_3",
      t_ecdf > .75  ~ "group_4"
    )
  )
#>    hostname time_task t_ecdf my_group
#> 1     host1    25.809    0.2  group_1
#> 2     host2    40.319    0.4  group_2
#> 3     host2    42.991    0.6  group_3
#> 4     host3    24.684    0.1  group_1
#> 5     host2    46.105    0.8  group_4
#> 6     host5    44.249    0.7  group_3
#> 7     host5    41.244    0.5  group_2
#> 8     host3    49.480    1.0  group_4
#> 9     host3    33.700    0.3  group_2
#> 10    host1    49.055    0.9  group_4

Created on 2019-01-16 by the reprex package (v0.2.1)

The way ecdf works is that you pass it the whole vector first and create an ecdf object, which I've called time_ecdf. Then we can call time_ecdf as if it's a function. We can pass it any value and it will tell us where that value is on the empirical cumulative distribution. So I do that in the mutate step and put the resulting values in a column called t_ecdf. Then I can use case_when to assign them to buckets.

2 Likes

Here is a way to do, I'm sure there is a more elegant way

library(tidyverse)

dd = data.frame(stringsAsFactors=FALSE,
                timestamp = c("2018-11-08 07:42:11", "2018-11-08 07:42:55",
                              "2018-11-08 07:43:41", "2018-11-08 07:44:07",
                              "2018-11-08 07:44:57", "2018-11-08 07:45:44", "2018-11-08 07:46:28",
                              "2018-11-08 07:47:20", "2018-11-08 07:47:56", "2018-11-08 07:48:48"),
                hostname = c("host1",
                             "host2",
                             "host2",
                             "host3", 
                             "host2",
                             "host5",
                             "host5", 
                             "host3",
                             "host3",
                             "host1"),
                event_name = c("save", "upload", "render", "upload",
                               "save", "save", "render", "upload",
                               "upload", "render"),
                event_type = c("STOP", "STOP", "STOP", "STOP", "STOP", "STOP", "STOP",
                               "STOP", "STOP", "STOP"),
                time_task = c("25.8089997768402", "40.319000005722", "42.9910001754761",
                              "24.6840000152588", "46.1050000190735", "44.2489998340607",
                              "41.2440001964569", "49.4800000190735", "33.7000000476837",
                              "49.0550000667572"),
                task_id = c("00390eee-c26c-41da-a02d-556bb7fcac67",
                            "dbc599f6-694b-46c4-a864-e09ab881af37",
                            "0ad8d29d-d30c-48c9-bd0a-fbea985464b2", "52881801-4d75-4ada-a118-682aa1d5ddf9",
                            "5c14d761-26af-4602-a51d-6378a4ad7c24",
                            "fa8d5709-ffb6-4a8b-bd73-0076c1654d49", "0ebfe158-0c86-4cde-8742-20d13cc4076b",
                            "403c1ca4-f5d3-4831-8a66-0f8be10f5aeb",
                            "ffd69831-0ba4-457b-b8a8-e37c49779d94", "70a9ab55-b17f-4df6-82ef-146425d7bbfa"))

summary(as.numeric(dd$time_task))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   24.68   35.35   42.12   39.76   45.64   49.48

dd %>% 
  mutate(time_task = parse_number(time_task),
         groups = case_when(
    between(time_task, min(time_task), 
            quantile(time_task, probs = .25)) ~ "Group 1",
    between(time_task, quantile(time_task, probs = .25), 
            median(time_task)) ~ "Group 2",
    between(time_task, median(time_task), quantile(time_task, probs = .75)) ~ "Group 3",
    between(time_task, quantile(time_task, probs = .75), max(time_task)) ~ "Group 4",
    TRUE ~ "ERROR"
  )) 
#>              timestamp hostname event_name event_type time_task
#> 1  2018-11-08 07:42:11    host1       save       STOP    25.809
#> 2  2018-11-08 07:42:55    host2     upload       STOP    40.319
#> 3  2018-11-08 07:43:41    host2     render       STOP    42.991
#> 4  2018-11-08 07:44:07    host3     upload       STOP    24.684
#> 5  2018-11-08 07:44:57    host2       save       STOP    46.105
#> 6  2018-11-08 07:45:44    host5       save       STOP    44.249
#> 7  2018-11-08 07:46:28    host5     render       STOP    41.244
#> 8  2018-11-08 07:47:20    host3     upload       STOP    49.480
#> 9  2018-11-08 07:47:56    host3     upload       STOP    33.700
#> 10 2018-11-08 07:48:48    host1     render       STOP    49.055
#>                                 task_id  groups
#> 1  00390eee-c26c-41da-a02d-556bb7fcac67 Group 1
#> 2  dbc599f6-694b-46c4-a864-e09ab881af37 Group 2
#> 3  0ad8d29d-d30c-48c9-bd0a-fbea985464b2 Group 3
#> 4  52881801-4d75-4ada-a118-682aa1d5ddf9 Group 1
#> 5  5c14d761-26af-4602-a51d-6378a4ad7c24 Group 4
#> 6  fa8d5709-ffb6-4a8b-bd73-0076c1654d49 Group 3
#> 7  0ebfe158-0c86-4cde-8742-20d13cc4076b Group 2
#> 8  403c1ca4-f5d3-4831-8a66-0f8be10f5aeb Group 4
#> 9  ffd69831-0ba4-457b-b8a8-e37c49779d94 Group 1
#> 10 70a9ab55-b17f-4df6-82ef-146425d7bbfa Group 4

Created on 2019-01-16 by the reprex package (v0.2.1)

2 Likes

Many Thanks for you both. What is the best graph type to use to plot this data please?

Thanks for the explanations on how ecdf work.

depends on what you really want to show. here are a few simple examples. But I had to cook up some dummy data to make them interesting:


library(tidyverse)
n <- 1000
dd = data.frame(
  stringsAsFactors = FALSE,
  timestamp =  sample(seq(
    as.POSIXct('2018/01/01'), as.POSIXct('2019/01/01'), by = "10 mins"
  ), n, replace = TRUE),
  hostname = sample(c(
    "host1",
    "host2",
    "host3",
    "host4",
    "host5"
  ), n, replace = TRUE),
  event_name = sample(c("save",
                        "upload",
                        "render"), n, replace = TRUE),
  time_task = rlnorm(n, 2, 1)
  
)

ggplot(dd) +
  aes(timestamp, time_task) +
  geom_line()


ggplot(dd) +
  aes(timestamp, time_task, color = hostname) +
  geom_line()


ggplot(dd) +
  aes(timestamp, time_task, color = event_name) +
  geom_line()


ggplot(dd) +
  aes(timestamp, time_task, color = event_name) +
  geom_line() +
  facet_grid(rows = vars(hostname))


ggplot(dd) +
  aes(time_task, event_name) +
  geom_boxplot() +
  facet_grid(vars(hostname))

Created on 2019-01-16 by the reprex package (v0.2.1)

3 Likes

If i want to find the total number of each group. I tried the below but it's working. It say cannot count an integer. i tried changing it as numeric but didn't work. Can anyone help please?

group_by(groups) %>%
mutate(total=count(groups))

All you need is

count(dd, group)

1 Like

Thanks. That works. But if I want to add as a new column to the existing df it doesn't work.

Use n()

dd %>%
    group_by(groups) %>%
    mutate(total = n())

Thanks! This adds a new column called total but doesn't group them by each group.

they are already grouped, did you mean arrange them by group?

dd %>%
    group_by(groups) %>%
    mutate(total = n()) %>%
    arrange(groups)
1 Like

This code seem to work:

dd %>%
group_by(groups) %>%
mutate(total = n()) %>%
count(groups)

That does exactly the same as dylanjm already told you

dd %>%
    group_by(groups) %>%
    mutate(total = n()) %>%
    count(groups)
#> # A tibble: 4 x 2
#> # Groups:   groups [4]
#>   groups      n
#>   <chr>   <int>
#> 1 Group 1     3
#> 2 Group 2     2
#> 3 Group 3     2
#> 4 Group 4     3
count(dd, groups)
#> # A tibble: 4 x 2
#>   groups      n
#>   <chr>   <int>
#> 1 Group 1     3
#> 2 Group 2     2
#> 3 Group 3     2
#> 4 Group 4     3
1 Like

This is not the answer to your original question, please mark the answer to the question in your topic title, and if you have more questions is better to open a new topic, that way we keep the forum tidy.

Thanks @andresrcs. I have marked the orginal awnser.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.