Best way to plot a data set of 66,040 observations with relatively close values

ggplot2

#1

I have a dataset of 66,040 observations. Below is a sample of my data. I would like to compare performance by each host, and also by event name. What is the best way to do this, please?

data.frame(stringsAsFactors=FALSE,
    timestamp = c("2018-11-08 07:42:11", "2018-11-08 07:42:55",
                  "2018-11-08 07:43:41", "2018-11-08 07:44:07",
                  "2018-11-08 07:44:57", "2018-11-08 07:45:44", "2018-11-08 07:46:28",
                  "2018-11-08 07:47:20", "2018-11-08 07:47:56", "2018-11-08 07:48:48"),
     hostname = c("host1",
                  "host2",
                  "host2",
                  "host3", 
                  "host2",
                  "host5",
                  "host5", 
                  "host3",
                  "host3",
                  "host1"),
   event_name = c("save", "upload", "render", "upload",
                  "save", "save", "render", "upload",
                  "upload", "render"),
   event_type = c("STOP", "STOP", "STOP", "STOP", "STOP", "STOP", "STOP",
                  "STOP", "STOP", "STOP"),
    time_task = c("25.8089997768402", "40.319000005722", "42.9910001754761",
                  "24.6840000152588", "46.1050000190735", "44.2489998340607",
                  "41.2440001964569", "49.4800000190735", "33.7000000476837",
                  "49.0550000667572"),
      task_id = c("00390eee-c26c-41da-a02d-556bb7fcac67",
                  "dbc599f6-694b-46c4-a864-e09ab881af37",
                  "0ad8d29d-d30c-48c9-bd0a-fbea985464b2", "52881801-4d75-4ada-a118-682aa1d5ddf9",
                  "5c14d761-26af-4602-a51d-6378a4ad7c24",
                  "fa8d5709-ffb6-4a8b-bd73-0076c1654d49", "0ebfe158-0c86-4cde-8742-20d13cc4076b",
                  "403c1ca4-f5d3-4831-8a66-0f8be10f5aeb",
                  "ffd69831-0ba4-457b-b8a8-e37c49779d94", "70a9ab55-b17f-4df6-82ef-146425d7bbfa"),

#2

A histogram or a density plot would be a good choice to show the distribution of your data and avoid overplotting.
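As a rough sketch of that idea (using synthetic stand-in data, since the real data frame isn't available; the column names `host_name` and `time_task` are assumed from the sample posted later in the thread):

```r
library(ggplot2)

# Synthetic stand-in for the real 66,040-row data set
set.seed(1)
df <- data.frame(
    host_name = sample(paste0("host", 1:5), 500, replace = TRUE),
    time_task = rnorm(500, mean = 40, sd = 8)
)

# Histogram: bins summarise the distribution without drawing
# every individual observation
p1 <- ggplot(df, aes(x = time_task)) +
    geom_histogram(bins = 50)
p1

# Density plot, split by host, to compare hosts side by side
p2 <- ggplot(df, aes(x = time_task, fill = host_name)) +
    geom_density(alpha = 0.4)
p2
```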


#3

Can you say more about the structure of your data (it would be great if you could provide a small data sample)? For example, is the goal to plot time_consumption (numeric) vs. host_name (categorical)? Given the number of observations, it looks like there are multiple observations per host_name. Do you want to plot total time_consumption per host_name? Or maybe the average, the individual values, or the distribution of values by host_name? Or do you just want the distribution of time_consumption overall, regardless of host_name? It will be easier to provide more specific guidance if you can provide more information about what you're trying to do.


#4

Yes, @andresrcs is correct: a histogram is great. The only downfall is that you need to get your bins right or you can skew the picture, though this is generally not a problem.

This is my favourite, and I can only suggest it without knowing the layout of your data (whether it is normalised, etc.), but the plotly package's 2D/3D scatter plots work well with continuous data.
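A minimal sketch of that suggestion with `plot_ly()`, again on synthetic stand-in data (the column names are assumptions based on the sample posted later):

```r
library(plotly)

# Synthetic stand-in data: one task time per minute, tagged by host
set.seed(1)
df <- data.frame(
    time_stamp = seq(as.POSIXct("2018-11-08 07:42:11"),
                     by = "min", length.out = 200),
    time_task  = rnorm(200, mean = 40, sd = 8),
    host_name  = sample(paste0("host", 1:5), 200, replace = TRUE)
)

# 2D scatter of task time over time, coloured by host
fig <- plot_ly(df, x = ~time_stamp, y = ~time_task,
               color = ~host_name,
               type = "scatter", mode = "markers")
fig
```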


#5

So if you find that the whole set of 66,040 data points is approximately normally distributed, then you only need to take, say, 10,000 data points as a random subset of the population and use these as your sample. The reason is that such a sample will adequately represent the population; if you don't believe me, ask any statistician — this has been known for a good hundred years.

If the population is normally distributed, so too is a random sample from it. As a result, I probably wouldn't plot the whole population (66,040). I'd give you a function to test for Gaussian/normality, but my notes really need updating and I can't find it at the moment. Leave a reply if you can't find one.
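In base R, the subsetting step and a quick normality check might look like this (synthetic stand-in data; note that `shapiro.test()` only accepts between 3 and 5,000 observations, hence the further subsample):

```r
set.seed(42)

# Synthetic population standing in for the 66,040 observations
population <- rnorm(66040, mean = 40, sd = 8)

# Take a random subset of 10,000 points as the sample
idx <- sample(length(population), 10000)
subset_sample <- population[idx]

# Quick normality check on a further subsample of at most 5,000,
# the upper limit accepted by shapiro.test()
shapiro.test(sample(subset_sample, 5000))
```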


#6

tcratius is right: take a random subset from all the data. It will be representative.


#7

As you can see, not knowing the actual structure of your data brings a lot of speculation; this is hardly helpful and can even be misleading.

I encourage you to share a small but representative sample of your data and more accurately state your goals with it.


#8

I have updated a sample of the data that I am working on. Thanks


#9

How do you define "performance" for your specific application? Do you have specification limits for the time your tasks should take? If you don't have limits specified by design, maybe you should start by defining process limits (applying statistical process control techniques), then concentrate your efforts on analysing the out-of-control cases, identifying root causes, and developing key performance indexes (e.g. mean time between failures).

Another approach is anomaly detection using time series decomposition; for this, you can take a look at the anomalize package.
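A rough sketch of the anomalize workflow, assuming the anomalize package is installed and the data is a regularly spaced tibble with a date-time column (synthetic stand-in data with two injected spikes):

```r
library(dplyr)
library(anomalize)

# Synthetic, regularly spaced stand-in for the task-time series
set.seed(1)
df <- tibble(
    time_stamp = seq(as.POSIXct("2018-11-08 07:00:00"),
                     by = "min", length.out = 1000),
    time_task  = rnorm(1000, mean = 40, sd = 5)
)
df$time_task[c(100, 500)] <- 120  # injected anomalies

# Decompose the series, flag anomalous remainders, and plot
p <- df %>%
    time_decompose(time_task, method = "stl") %>%
    anomalize(remainder, method = "iqr") %>%
    time_recompose() %>%
    plot_anomalies(time_recomposed = TRUE)
p
```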


#10

I would like to compare the performance by each host? Also by event name?

Tukey style box and whisker plots would be a good start. There are too few data points here to be meaningful but you get the idea.

If you do this, it would be great to see the plots for your big data set, so if you can, please post the results back here.

df <- data.frame(
    stringsAsFactors = FALSE,
    time_stamp = c(
        "2018-11-08 07:42:11",
        "2018-11-08 07:42:55",
        "2018-11-08 07:43:41",
        "2018-11-08 07:44:07",
        "2018-11-08 07:44:57",
        "2018-11-08 07:45:44",
        "2018-11-08 07:46:28",
        "2018-11-08 07:47:20",
        "2018-11-08 07:47:56",
        "2018-11-08 07:48:48"
    ),
    host_name = c(
        "host1",
        "host2",
        "host2",
        "host3",
        "host2",
        "host5",
        "host5",
        "host3",
        "host3",
        "host1"
    ),
    event_name = c(
        "save",
        "upload",
        "render",
        "upload",
        "save",
        "save",
        "render",
        "upload",
        "upload",
        "render"
    ),
    event_type = c(
        "STOP",
        "STOP",
        "STOP",
        "STOP",
        "STOP",
        "STOP",
        "STOP",
        "STOP",
        "STOP",
        "STOP"
    ),
    time_task = c(
        "25.8089997768402",
        "40.319000005722",
        "42.9910001754761",
        "24.6840000152588",
        "46.1050000190735",
        "44.2489998340607",
        "41.2440001964569",
        "49.4800000190735",
        "33.7000000476837",
        "49.0550000667572"
    ),
    task_id = c(
        "00390eee-c26c-41da-a02d-556bb7fcac67",
        "dbc599f6-694b-46c4-a864-e09ab881af37",
        "0ad8d29d-d30c-48c9-bd0a-fbea985464b2",
        "52881801-4d75-4ada-a118-682aa1d5ddf9",
        "5c14d761-26af-4602-a51d-6378a4ad7c24",
        "fa8d5709-ffb6-4a8b-bd73-0076c1654d49",
        "0ebfe158-0c86-4cde-8742-20d13cc4076b",
        "403c1ca4-f5d3-4831-8a66-0f8be10f5aeb",
        "ffd69831-0ba4-457b-b8a8-e37c49779d94",
        "70a9ab55-b17f-4df6-82ef-146425d7bbfa"
    )
)

df$time_stamp <-
    as.numeric(as.POSIXct(df$time_stamp, format = "%Y-%m-%d %H:%M:%OS"))

df$time_task <- as.numeric(df$time_task)


library(ggplot2)
library(magrittr)

df %>% ggplot(aes(host_name, time_task)) +
    geom_boxplot() 


df %>% ggplot(aes(event_name, time_task)) +
    geom_boxplot()


# You can also control the display of dots and outliers
df %>% ggplot(aes(x = event_name, y = time_task)) +
    geom_boxplot(
        outlier.colour = "red",
        outlier.shape = 16,
        outlier.size = 2
    ) +
    geom_dotplot(binaxis = "y",
                 stackdir = "center",
                 dotsize = 0.5)
#> `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2019-01-15 by the reprex package (v0.2.1)


#11

Thanks for your reply. I am not able to represent the full data using this approach. The scale doesn't fit all the observations.


#12

Do you mean you have too many boxes to fit on the x-axis? In that case, try a biplot of a one-hot encoded ordination (e.g. SVD).
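One way that idea could look in base R, on synthetic stand-in data: one-hot encode the categorical columns with `model.matrix()`, ordinate with PCA (an SVD-based method), and draw a biplot.

```r
# Synthetic stand-in for the host/event/time data
set.seed(1)
df <- data.frame(
    host_name  = sample(paste0("host", 1:5), 300, replace = TRUE),
    event_name = sample(c("save", "upload", "render"), 300, replace = TRUE),
    time_task  = rnorm(300, mean = 40, sd = 8)
)

# model.matrix() one-hot encodes the categorical columns
# (the -1 drops the intercept so every level gets its own column)
X <- model.matrix(~ host_name + event_name + time_task - 1, data = df)

# PCA via prcomp(), which uses SVD internally
pca <- prcomp(X, scale. = TRUE)
biplot(pca)
```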


#13

[screenshot: Rplot03]

This is what I am getting.


#14

It's the time it takes to perform a task, so the shorter the time, the higher the performance.


#15

Please post the output of something like str(df), and also your ggplot code so that we can try to help.


#16

I believe that, unless you are trying to do inference with your data, you need to do some sort of summarising or filtering in order to reduce the number of data points and avoid overplotting.
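For example, a summarise step with dplyr could collapse the 66,040 rows down to one value per host/event combination before plotting (synthetic stand-in data; the grouping columns are assumptions based on the sample posted earlier):

```r
library(dplyr)
library(ggplot2)

# Synthetic stand-in data
set.seed(1)
df <- data.frame(
    host_name  = sample(paste0("host", 1:5), 5000, replace = TRUE),
    event_name = sample(c("save", "upload", "render"), 5000, replace = TRUE),
    time_task  = rnorm(5000, mean = 40, sd = 8)
)

# Collapse to one mean per host/event combination
summary_df <- df %>%
    group_by(host_name, event_name) %>%
    summarise(mean_time = mean(time_task), .groups = "drop")

# Far fewer points to draw: one bar per host/event pair
p <- ggplot(summary_df, aes(host_name, mean_time, fill = event_name)) +
    geom_col(position = "dodge")
p
```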