Splitting data into groups

I have two models that I am trying to compare side by side against the ground truth. Each model counts the number of objects in an image. I would like to visualize how each model performs as a function of the number of objects in an image. I would like to cut the data into groups (0-100, 100-200, etc.), where each group represents the range of object counts the model is trying to predict, and then understand how each model predicts within each group. Any idea how I could achieve this, please? Many thanks.

data.frame(
                             Ground.Truth = c(75L,78L,83L,87L,
                                              89L,90L,93L,93L,94L,95L),
                                      ICY = c(75L,61L,66L,106L,
                                              82L,85L,59L,88L,78L,70L),
                                    VGG16 = c(79L,74L,117L,
                                              160L,101L,118L,106L,128L,107L,
                                              109L),
                               Image.Name = as.factor(c("images /2020-02-04_p1_Day0_Position023.png",
                                                        "images /2020-02-04_p3_Day0_Position026.png",
                                                        "images /2020-02-04_p3_Day0_Position015.png",
                                                        "images /2020-02-04_p3_Day0_Position083.png",
                                                        "images /2020-02-04_p1_Day0_Position077.png",
                                                        "images /2020-02-04_p1_Day0_Position026.png",
                                                        "images /2020-02-04_p1_Day0_Position050.png",
                                                        "images /2020-02-04_p1_Day0_Position074.png",
                                                        "images /2020-02-04_p1_Day0_Position082.png",
                                                        "images /2020-02-04_p3_Day0_Position078.png")))
dat <- data.frame(
  Ground.Truth = c(75L,78L,83L,87L,89L,90L,93L,93L,94L,95L),
  ICY = c(75L,61L,66L,106L,82L,85L,59L,88L,78L,70L),
  VGG16 = c(79L,74L,117L,160L,101L,118L,106L,128L,107L,
  109L),
  Image.Name = as.factor(
    c("images /2020-02-04_p1_Day0_Position023.png",
   "images /2020-02-04_p3_Day0_Position026.png",
   "images /2020-02-04_p3_Day0_Position015.png",
   "images /2020-02-04_p3_Day0_Position083.png",
   "images /2020-02-04_p1_Day0_Position077.png",
   "images /2020-02-04_p1_Day0_Position026.png",
   "images /2020-02-04_p1_Day0_Position050.png",
   "images /2020-02-04_p1_Day0_Position074.png",
   "images /2020-02-04_p1_Day0_Position082.png",
   "images /2020-02-04_p3_Day0_Position078.png")
   ))


split(dat, cut(dat$Ground.Truth, c(70, 75, 80, 85,90,95), include.lowest=TRUE))
#> $`[70,75]`
#>   Ground.Truth ICY VGG16                                 Image.Name
#> 1           75  75    79 images /2020-02-04_p1_Day0_Position023.png
#> 
#> $`(75,80]`
#>   Ground.Truth ICY VGG16                                 Image.Name
#> 2           78  61    74 images /2020-02-04_p3_Day0_Position026.png
#> 
#> $`(80,85]`
#>   Ground.Truth ICY VGG16                                 Image.Name
#> 3           83  66   117 images /2020-02-04_p3_Day0_Position015.png
#> 
#> $`(85,90]`
#>   Ground.Truth ICY VGG16                                 Image.Name
#> 4           87 106   160 images /2020-02-04_p3_Day0_Position083.png
#> 5           89  82   101 images /2020-02-04_p1_Day0_Position077.png
#> 6           90  85   118 images /2020-02-04_p1_Day0_Position026.png
#> 
#> $`(90,95]`
#>    Ground.Truth ICY VGG16                                 Image.Name
#> 7            93  59   106 images /2020-02-04_p1_Day0_Position050.png
#> 8            93  88   128 images /2020-02-04_p1_Day0_Position074.png
#> 9            94  78   107 images /2020-02-04_p1_Day0_Position082.png
#> 10           95  70   109 images /2020-02-04_p3_Day0_Position078.png

Created on 2020-08-30 by the reprex package (v0.3.0)
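The same idea extends to the wider groups asked about in the question (0-100, 100-200, and so on). The following is a minimal sketch, not part of the original reply: the vector truth holds made-up counts standing in for a larger data set, and the names truth, breaks and bins are chosen here for illustration only.

```r
# Hypothetical ground-truth counts (assumed values, not from the thread)
truth <- c(12L, 99L, 100L, 101L, 250L, 399L, 400L)

# Breaks every 100 units, extended far enough to cover the largest value
breaks <- seq(0, ceiling(max(truth) / 100) * 100, by = 100)

# Intervals are closed on the right; include.lowest = TRUE also
# places a value of exactly 0 in the first interval
bins <- cut(truth, breaks = breaks, include.lowest = TRUE)

# Number of observations falling into each interval
counts <- table(bins)
print(counts)
```

Passing bins as the second argument to split() would then give one data frame per interval, as in the reply above.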


What is the best graph to use to represent this, please? My data consists of 480 observations. Should I take a sample of the data and plot that? Thanks for your help.

Plots are two-dimensional representations of an object that seek to illuminate relationships among its constituent parts beyond what can be conveyed by simply enumerating results. So, the best way starts with the question: what relationships is this plot meant to illustrate?

The object created by

split(dat, cut(dat$Ground.Truth, c(70, 75, 80, 85,90,95), include.lowest=TRUE))

divides dat along intervals of Ground.Truth. Each element contains the specific Ground.Truth values, the corresponding values for ICY and VGG16 (each model's object count), and the Image.Name filenames of the related png files.

What is the relationship to be shown to better advantage with a plot here?

I am hoping to see the difference between ICY and VGG16 in the graph.

Questions are much harder than answers.

ICY and VGG16 are vectors of integers. As such they can be compared with the - operator.

> dat$ICY - dat$VGG16
 [1]  -4 -13 -51 -54 -19 -33 -47 -40 -29 -39

Without knowing the scales of the two variables, however, that is not a useful result.

Difference in what? That is needed to proceed.

Thanks! VGG16 and ICY are two prediction methods, so I need to compare their values against the ground truth. A prediction might be above or below the ground truth value.

Start with the unstratified data

dat <- data.frame(
  Ground.Truth = c(75L,78L,83L,87L,89L,90L,93L,93L,94L,95L),
  ICY = c(75L,61L,66L,106L,82L,85L,59L,88L,78L,70L),
  VGG16 = c(79L,74L,117L,160L,101L,118L,106L,128L,107L,
  109L),
  Image.Name = as.factor(
    c("images /2020-02-04_p1_Day0_Position023.png",
   "images /2020-02-04_p3_Day0_Position026.png",
   "images /2020-02-04_p3_Day0_Position015.png",
   "images /2020-02-04_p3_Day0_Position083.png",
   "images /2020-02-04_p1_Day0_Position077.png",
   "images /2020-02-04_p1_Day0_Position026.png",
   "images /2020-02-04_p1_Day0_Position050.png",
   "images /2020-02-04_p1_Day0_Position074.png",
   "images /2020-02-04_p1_Day0_Position082.png",
   "images /2020-02-04_p3_Day0_Position078.png")
   ))

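# Regress the ground truth on each model's counts, separately and jointly
fit_icy   <- lm(Ground.Truth ~ ICY, data = dat)
fit_vgg16 <- lm(Ground.Truth ~ VGG16, data = dat)
fit_both  <- lm(Ground.Truth ~ ICY + VGG16, data = dat)

# which = 2 draws the normal Q-Q plot of the residuals
plot(fit_icy, which = 2)

plot(fit_vgg16, which = 2)

plot(fit_both, which = 2)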



summary(fit_icy)
#> 
#> Call:
#> lm(formula = Ground.Truth ~ ICY, data = dat)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -12.528  -3.614   1.242   5.750   7.901 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 81.09147   13.33948   6.079 0.000296 ***
#> ICY          0.08583    0.17066   0.503 0.628585    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 7.253 on 8 degrees of freedom
#> Multiple R-squared:  0.03064,    Adjusted R-squared:  -0.09052 
#> F-statistic: 0.2529 on 1 and 8 DF,  p-value: 0.6286
summary(fit_vgg16)
#> 
#> Call:
#> lm(formula = Ground.Truth ~ VGG16, data = dat)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -8.446 -5.448  1.855  5.080  7.424 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  72.5716     9.9598   7.286  8.5e-05 ***
#> VGG16         0.1377     0.0887   1.552    0.159    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 6.458 on 8 degrees of freedom
#> Multiple R-squared:  0.2314, Adjusted R-squared:  0.1353 
#> F-statistic: 2.408 on 1 and 8 DF,  p-value: 0.1593
summary(fit_both)
#> 
#> Call:
#> lm(formula = Ground.Truth ~ ICY + VGG16, data = dat)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -8.516 -5.815  2.319  4.094  7.155 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  78.4633    12.1627   6.451  0.00035 ***
#> ICY          -0.2015     0.2312  -0.872  0.41219    
#> VGG16         0.2253     0.1349   1.669  0.13896    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 6.557 on 7 degrees of freedom
#> Multiple R-squared:  0.3067, Adjusted R-squared:  0.1086 
#> F-statistic: 1.548 on 2 and 7 DF,  p-value: 0.2775

Created on 2020-08-31 by the reprex package (v0.3.0)
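The three summaries can be condensed into a single comparison by pulling out one statistic from each fit. This is only a sketch of one way to do it: the list name fits and the vector name r2 are introduced here for illustration, and the data frame is repeated without Image.Name so the snippet runs on its own.

```r
# Same counts as in the reprex, minus the file names
dat <- data.frame(
  Ground.Truth = c(75L, 78L, 83L, 87L, 89L, 90L, 93L, 93L, 94L, 95L),
  ICY          = c(75L, 61L, 66L, 106L, 82L, 85L, 59L, 88L, 78L, 70L),
  VGG16        = c(79L, 74L, 117L, 160L, 101L, 118L, 106L, 128L, 107L, 109L)
)

# Keep the three fits in a named list so they can be handled together
fits <- list(
  icy   = lm(Ground.Truth ~ ICY, data = dat),
  vgg16 = lm(Ground.Truth ~ VGG16, data = dat),
  both  = lm(Ground.Truth ~ ICY + VGG16, data = dat)
)

# Multiple R-squared of each model, as reported by summary()
r2 <- sapply(fits, function(f) summary(f)$r.squared)
print(round(r2, 4))
```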

The plots illustrate that the ICY, VGG16, and ICY + VGG16 linear models have similarities: overlapping outliers and similar distributions of residual errors.

The summary tables confirm this and also indicate that a linear model will not be fruitful. Whether the number of observations in each subset is large enough to apply the same analysis remains to be decided.
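Since the stated goal is to see whether each prediction lands above or below the ground truth, another option is to look at the signed error of each model directly and summarise it per interval. What follows is a hedged sketch rather than the approach taken in this reply: the column names err_icy, err_vgg16 and bin and the choice of mean absolute error are assumptions made for illustration.

```r
# Same counts as in the reprex, minus the file names
dat <- data.frame(
  Ground.Truth = c(75L, 78L, 83L, 87L, 89L, 90L, 93L, 93L, 94L, 95L),
  ICY          = c(75L, 61L, 66L, 106L, 82L, 85L, 59L, 88L, 78L, 70L),
  VGG16        = c(79L, 74L, 117L, 160L, 101L, 118L, 106L, 128L, 107L, 109L)
)

# Signed error: positive means the model over-counted
dat$err_icy   <- dat$ICY   - dat$Ground.Truth
dat$err_vgg16 <- dat$VGG16 - dat$Ground.Truth

# Same intervals as the split() call earlier in the thread
dat$bin <- cut(dat$Ground.Truth, c(70, 75, 80, 85, 90, 95),
               include.lowest = TRUE)

# Mean absolute error of each model within each interval
mae <- aggregate(cbind(err_icy, err_vgg16) ~ bin, data = dat,
                 FUN = function(e) mean(abs(e)))
print(mae)

# Signed errors against the ground truth; the dashed line marks zero error
plot(dat$Ground.Truth, dat$err_icy, pch = 1,
     ylim = range(dat$err_icy, dat$err_vgg16),
     xlab = "Ground truth", ylab = "Predicted - ground truth")
points(dat$Ground.Truth, dat$err_vgg16, pch = 2)
abline(h = 0, lty = 2)
legend("topleft", legend = c("ICY", "VGG16"), pch = c(1, 2))
```

With all 480 observations there should be no need to sample before plotting; points above the dashed line are over-counts and points below it are under-counts.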

