Grouping data based on the Summary information of dataframe

user124578 · January 16, 2019, 4:06pm

I would like to group this data based on the summary of its statistical information. By this I mean group 1 (values between Min - 1st Qu), group 2(values between 1st Qu - Median) group 3 (values between Median - Mean) group 4 (values between - Mean - 3rd Qu) group 5 (values between 3rd Qu -Max). Is this possible to do?

dd = data.frame(stringsAsFactors=FALSE,
           timestamp = c("2018-11-08 07:42:11", "2018-11-08 07:42:55",
                         "2018-11-08 07:43:41", "2018-11-08 07:44:07",
                         "2018-11-08 07:44:57", "2018-11-08 07:45:44", "2018-11-08 07:46:28",
                         "2018-11-08 07:47:20", "2018-11-08 07:47:56", "2018-11-08 07:48:48"),
           hostname = c("host1",
                        "host2",
                        "host2",
                        "host3", 
                        "host2",
                        "host5",
                        "host5", 
                        "host3",
                        "host3",
                        "host1"),
           event_name = c("save", "upload", "render", "upload",
                          "save", "save", "render", "upload",
                          "upload", "render"),
           event_type = c("STOP", "STOP", "STOP", "STOP", "STOP", "STOP", "STOP",
                          "STOP", "STOP", "STOP"),
           time_task = c("25.8089997768402", "40.319000005722", "42.9910001754761",
                         "24.6840000152588", "46.1050000190735", "44.2489998340607",
                         "41.2440001964569", "49.4800000190735", "33.7000000476837",
                         "49.0550000667572"),
           task_id = c("00390eee-c26c-41da-a02d-556bb7fcac67",
                       "dbc599f6-694b-46c4-a864-e09ab881af37",
                       "0ad8d29d-d30c-48c9-bd0a-fbea985464b2", "52881801-4d75-4ada-a118-682aa1d5ddf9",
                       "5c14d761-26af-4602-a51d-6378a4ad7c24",
                       "fa8d5709-ffb6-4a8b-bd73-0076c1654d49", "0ebfe158-0c86-4cde-8742-20d13cc4076b",
                       "403c1ca4-f5d3-4831-8a66-0f8be10f5aeb",
                       "ffd69831-0ba4-457b-b8a8-e37c49779d94", "70a9ab55-b17f-4df6-82ef-146425d7bbfa"))
           
summary(as.numeric(dd$time_task))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   24.68   35.35   42.12   39.76   45.64   49.48

jdlong · January 16, 2019, 4:46pm

There are a number of different ways to do this. One way that comes to my mind is by using ecdf to build an empirical cumulative distribution function of the column you are interested in.

Here's an example that simplifies your data a bit. I also focused on the quantiles and ignored the group from mean to 3rd quartile. But you should be able to add that easily enough.

dd = data.frame(
  stringsAsFactors = FALSE,
  hostname = c(
    "host1",
    "host2",
    "host2",
    "host3",
    "host2",
    "host5",
    "host5",
    "host3",
    "host3",
    "host1"
  ),
  time_task = as.numeric(
    c(
      "25.8089997768402",
      "40.319000005722",
      "42.9910001754761",
      "24.6840000152588",
      "46.1050000190735",
      "44.2489998340607",
      "41.2440001964569",
      "49.4800000190735",
      "33.7000000476837",
      "49.0550000667572"
    )
  )
)

library(tidyverse)

time_ecdf <- ecdf(dd$time_task)

dd %>%
  mutate(t_ecdf = time_ecdf(time_task)) %>%
  mutate(
    my_group = case_when(
      t_ecdf <= .25 ~ "group_1",
      t_ecdf > .25 &
        t_ecdf <= .5 ~ "group_2",
      t_ecdf > .5 &
        t_ecdf <= .75  ~ "group_3",
      t_ecdf > .75  ~ "group_4"
    )
  )
#>    hostname time_task t_ecdf my_group
#> 1     host1    25.809    0.2  group_1
#> 2     host2    40.319    0.4  group_2
#> 3     host2    42.991    0.6  group_3
#> 4     host3    24.684    0.1  group_1
#> 5     host2    46.105    0.8  group_4
#> 6     host5    44.249    0.7  group_3
#> 7     host5    41.244    0.5  group_2
#> 8     host3    49.480    1.0  group_4
#> 9     host3    33.700    0.3  group_2
#> 10    host1    49.055    0.9  group_4

^{Created on 2019-01-16 by the reprex package (v0.2.1)}

The way ecdf works is that you pass it the whole vector first and create an ecdf object, which I've called time_ecdf. Then we can call time_ecdf as if it's a function. We can pass it any value and it will tell us where that value is on the empirical cumulative distribution. So I do that in the mutate step and put the resulting values in a column called t_ecdf. Then I can use case_when to assign them to buckets.

dylanjm · January 16, 2019, 4:47pm

Here is a way to do, I'm sure there is a more elegant way

library(tidyverse)

dd = data.frame(stringsAsFactors=FALSE,
                timestamp = c("2018-11-08 07:42:11", "2018-11-08 07:42:55",
                              "2018-11-08 07:43:41", "2018-11-08 07:44:07",
                              "2018-11-08 07:44:57", "2018-11-08 07:45:44", "2018-11-08 07:46:28",
                              "2018-11-08 07:47:20", "2018-11-08 07:47:56", "2018-11-08 07:48:48"),
                hostname = c("host1",
                             "host2",
                             "host2",
                             "host3", 
                             "host2",
                             "host5",
                             "host5", 
                             "host3",
                             "host3",
                             "host1"),
                event_name = c("save", "upload", "render", "upload",
                               "save", "save", "render", "upload",
                               "upload", "render"),
                event_type = c("STOP", "STOP", "STOP", "STOP", "STOP", "STOP", "STOP",
                               "STOP", "STOP", "STOP"),
                time_task = c("25.8089997768402", "40.319000005722", "42.9910001754761",
                              "24.6840000152588", "46.1050000190735", "44.2489998340607",
                              "41.2440001964569", "49.4800000190735", "33.7000000476837",
                              "49.0550000667572"),
                task_id = c("00390eee-c26c-41da-a02d-556bb7fcac67",
                            "dbc599f6-694b-46c4-a864-e09ab881af37",
                            "0ad8d29d-d30c-48c9-bd0a-fbea985464b2", "52881801-4d75-4ada-a118-682aa1d5ddf9",
                            "5c14d761-26af-4602-a51d-6378a4ad7c24",
                            "fa8d5709-ffb6-4a8b-bd73-0076c1654d49", "0ebfe158-0c86-4cde-8742-20d13cc4076b",
                            "403c1ca4-f5d3-4831-8a66-0f8be10f5aeb",
                            "ffd69831-0ba4-457b-b8a8-e37c49779d94", "70a9ab55-b17f-4df6-82ef-146425d7bbfa"))

summary(as.numeric(dd$time_task))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   24.68   35.35   42.12   39.76   45.64   49.48

dd %>% 
  mutate(time_task = parse_number(time_task),
         groups = case_when(
    between(time_task, min(time_task), 
            quantile(time_task, probs = .25)) ~ "Group 1",
    between(time_task, quantile(time_task, probs = .25), 
            median(time_task)) ~ "Group 2",
    between(time_task, median(time_task), quantile(time_task, probs = .75)) ~ "Group 3",
    between(time_task, quantile(time_task, probs = .75), max(time_task)) ~ "Group 4",
    TRUE ~ "ERROR"
  )) 
#>              timestamp hostname event_name event_type time_task
#> 1  2018-11-08 07:42:11    host1       save       STOP    25.809
#> 2  2018-11-08 07:42:55    host2     upload       STOP    40.319
#> 3  2018-11-08 07:43:41    host2     render       STOP    42.991
#> 4  2018-11-08 07:44:07    host3     upload       STOP    24.684
#> 5  2018-11-08 07:44:57    host2       save       STOP    46.105
#> 6  2018-11-08 07:45:44    host5       save       STOP    44.249
#> 7  2018-11-08 07:46:28    host5     render       STOP    41.244
#> 8  2018-11-08 07:47:20    host3     upload       STOP    49.480
#> 9  2018-11-08 07:47:56    host3     upload       STOP    33.700
#> 10 2018-11-08 07:48:48    host1     render       STOP    49.055
#>                                 task_id  groups
#> 1  00390eee-c26c-41da-a02d-556bb7fcac67 Group 1
#> 2  dbc599f6-694b-46c4-a864-e09ab881af37 Group 2
#> 3  0ad8d29d-d30c-48c9-bd0a-fbea985464b2 Group 3
#> 4  52881801-4d75-4ada-a118-682aa1d5ddf9 Group 1
#> 5  5c14d761-26af-4602-a51d-6378a4ad7c24 Group 4
#> 6  fa8d5709-ffb6-4a8b-bd73-0076c1654d49 Group 3
#> 7  0ebfe158-0c86-4cde-8742-20d13cc4076b Group 2
#> 8  403c1ca4-f5d3-4831-8a66-0f8be10f5aeb Group 4
#> 9  ffd69831-0ba4-457b-b8a8-e37c49779d94 Group 1
#> 10 70a9ab55-b17f-4df6-82ef-146425d7bbfa Group 4

^{Created on 2019-01-16 by the reprex package (v0.2.1)}

user124578 · January 16, 2019, 4:53pm

Many Thanks for you both. What is the best graph type to use to plot this data please?

user124578 · January 16, 2019, 5:11pm

Thanks for the explanations on how ecdf work.

jdlong · January 16, 2019, 5:34pm

depends on what you really want to show. here are a few simple examples. But I had to cook up some dummy data to make them interesting:


library(tidyverse)
n <- 1000
dd = data.frame(
  stringsAsFactors = FALSE,
  timestamp =  sample(seq(
    as.POSIXct('2018/01/01'), as.POSIXct('2019/01/01'), by = "10 mins"
  ), n, replace = TRUE),
  hostname = sample(c(
    "host1",
    "host2",
    "host3",
    "host4",
    "host5"
  ), n, replace = TRUE),
  event_name = sample(c("save",
                        "upload",
                        "render"), n, replace = TRUE),
  time_task = rlnorm(n, 2, 1)
  
)

ggplot(dd) +
  aes(timestamp, time_task) +
  geom_line()


ggplot(dd) +
  aes(timestamp, time_task, color = hostname) +
  geom_line()


ggplot(dd) +
  aes(timestamp, time_task, color = event_name) +
  geom_line()


ggplot(dd) +
  aes(timestamp, time_task, color = event_name) +
  geom_line() +
  facet_grid(rows = vars(hostname))


ggplot(dd) +
  aes(time_task, event_name) +
  geom_boxplot() +
  facet_grid(vars(hostname))

^{Created on 2019-01-16 by the reprex package (v0.2.1)}

user124578 · January 17, 2019, 1:50pm

If i want to find the total number of each group. I tried the below but it's working. It say cannot count an integer. i tried changing it as numeric but didn't work. Can anyone help please?

group_by(groups) %>%
mutate(total=count(groups))

dylanjm · January 17, 2019, 3:13pm

All you need is

count(dd, group)

user124578 · January 17, 2019, 3:25pm

Thanks. That works. But if I want to add as a new column to the existing df it doesn't work.

andresrcs · January 17, 2019, 3:29pm

Use n()

dd %>%
    group_by(groups) %>%
    mutate(total = n())

user124578 · January 17, 2019, 3:40pm

Thanks! This adds a new column called total but doesn't group them by each group.

andresrcs · January 17, 2019, 3:43pm

they are already grouped, did you mean arrange them by group?

dd %>%
    group_by(groups) %>%
    mutate(total = n()) %>%
    arrange(groups)

user124578 · January 17, 2019, 3:45pm

This code seem to work:

dd %>%
group_by(groups) %>%
mutate(total = n()) %>%
count(groups)

andresrcs · January 17, 2019, 3:55pm

That does exactly the same as dylanjm already told you

dd %>%
    group_by(groups) %>%
    mutate(total = n()) %>%
    count(groups)
#> # A tibble: 4 x 2
#> # Groups:   groups [4]
#>   groups      n
#>   <chr>   <int>
#> 1 Group 1     3
#> 2 Group 2     2
#> 3 Group 3     2
#> 4 Group 4     3
count(dd, groups)
#> # A tibble: 4 x 2
#>   groups      n
#>   <chr>   <int>
#> 1 Group 1     3
#> 2 Group 2     2
#> 3 Group 3     2
#> 4 Group 4     3

andresrcs · January 17, 2019, 4:01pm

This is not the answer to your original question, please mark the answer to the question in your topic title, and if you have more questions is better to open a new topic, that way we keep the forum tidy.

user124578 · January 17, 2019, 4:07pm

Thanks @andresrcs. I have marked the orginal awnser.

system · January 24, 2019, 4:07pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.