Why does the mpg data in ggplot2 have duplicated rows?

(Sorry for cross-posting. I originally posted this question on the ggplot2 mailing list, but did not get an answer.)

I found that the mpg data has some duplicated rows. Is this intended? Why are these rows needed?
If anyone knows why, please help me understand this data better.

data("mpg", package = "ggplot2")

duplicated_rows <- mpg[duplicated(mpg) | duplicated(mpg, fromLast = TRUE), ]

duplicated_rows
#>     manufacturer               model displ year cyl      trans drv cty hwy
#> 19     chevrolet  c1500 suburban 2wd   5.3 2008   8   auto(l4)   r  14  20
#> 21     chevrolet  c1500 suburban 2wd   5.3 2008   8   auto(l4)   r  14  20
#> 40         dodge         caravan 2wd   3.3 1999   6   auto(l4)   f  16  22
#> 41         dodge         caravan 2wd   3.3 1999   6   auto(l4)   f  16  22
#> 42         dodge         caravan 2wd   3.3 2008   6   auto(l4)   f  17  24
#> 43         dodge         caravan 2wd   3.3 2008   6   auto(l4)   f  17  24
#> 53         dodge   dakota pickup 4wd   4.7 2008   8   auto(l5)   4  14  19
#> 54         dodge   dakota pickup 4wd   4.7 2008   8   auto(l5)   4  14  19
#> 59         dodge         durango 4wd   4.7 2008   8   auto(l5)   4  13  17
#> 61         dodge         durango 4wd   4.7 2008   8   auto(l5)   4  13  17
#> 65         dodge ram 1500 pickup 4wd   4.7 2008   8 manual(m6)   4  12  16
#> 67         dodge ram 1500 pickup 4wd   4.7 2008   8   auto(l5)   4  13  17
#> 68         dodge ram 1500 pickup 4wd   4.7 2008   8   auto(l5)   4  13  17
#> 69         dodge ram 1500 pickup 4wd   4.7 2008   8 manual(m6)   4  12  16
#> 78          ford        explorer 4wd   4.0 1999   6   auto(l5)   4  14  17
#> 80          ford        explorer 4wd   4.0 1999   6   auto(l5)   4  14  17
#> 101        honda               civic   1.6 1999   4   auto(l4)   f  24  32
#> 104        honda               civic   1.6 1999   4   auto(l4)   f  24  32
#>     fl      class
#> 19   r        suv
#> 21   r        suv
#> 40   r    minivan
#> 41   r    minivan
#> 42   r    minivan
#> 43   r    minivan
#> 53   r     pickup
#> 54   r     pickup
#> 59   r        suv
#> 61   r        suv
#> 65   r     pickup
#> 67   r     pickup
#> 68   r     pickup
#> 69   r     pickup
#> 78   r        suv
#> 80   r        suv
#> 101  r subcompact
#> 104  r subcompact

# For example, these two rows are exactly the same:
dplyr::glimpse(mpg[c(19,21),])
#> Observations: 2
#> Variables: 11
#> $ manufacturer <chr> "chevrolet", "chevrolet"
#> $ model        <chr> "c1500 suburban 2wd", "c1500 suburban 2wd"
#> $ displ        <dbl> 5.3, 5.3
#> $ year         <int> 2008, 2008
#> $ cyl          <int> 8, 8
#> $ trans        <chr> "auto(l4)", "auto(l4)"
#> $ drv          <chr> "r", "r"
#> $ cty          <int> 14, 14
#> $ hwy          <int> 20, 20
#> $ fl           <chr> "r", "r"
#> $ class        <chr> "suv", "suv"

This is just a wild guess... One way to look at it is to say that each entry is a particular vehicle and the row name is the "serial number", i.e. the surrogate key, of each vehicle. Some of the individual vehicles just happen to be the same model with the same mpg, etc.

I'm new to R, but I've done a lot of database work, and I think there is a problem with that interpretation.

First of all, row names won't necessarily stick to a row, and in the tidyverse rows no longer have names.

In a database each row is supposed to contain an individual entity, i.e. represent some real world object. It's also supposed to have a key that can be used to select any individual entity. Duplicate rows prevent this.
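
To make the key idea concrete, here is a minimal base R sketch. The `cars` table is made-up toy data (not the real mpg data), and `is_candidate_key()` is a hypothetical helper written for this example, not a function from any package.

```r
# Toy table invented for illustration; the first two rows are identical.
cars <- data.frame(
  model = c("caravan 2wd", "caravan 2wd", "civic"),
  year  = c(1999L, 1999L, 1999L),
  hwy   = c(22L, 22L, 32L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when the given columns identify every row uniquely.
# anyDuplicated() returns 0 when there are no duplicates.
is_candidate_key <- function(df, cols) {
  anyDuplicated(df[, cols, drop = FALSE]) == 0L
}

# Even all columns together cannot serve as a key, because of the duplicate row.
all_columns_are_key <- is_candidate_key(cars, names(cars))

# Adding an explicit surrogate key column makes every row addressable again.
cars$id <- seq_len(nrow(cars))
id_is_key <- is_candidate_key(cars, "id")
```

With this toy data, `all_columns_are_key` is `FALSE` and `id_is_key` is `TRUE`. An explicit `id` column survives sorting and filtering, which row names may not.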

So in the end it seems like this table would produce questionable statistics... for example duplicate rows would produce a mean that doesn't seem to make any sense.
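
As a rough illustration of the concern, the sketch below compares a mean computed over all rows with one computed after dropping exact duplicates. The numbers are toy values, not taken from mpg, and which of the two means is "right" depends on whether the repeated rows are distinct vehicles or accidental copies.

```r
# Toy data invented for illustration; the first two rows are identical.
hwy_example <- data.frame(
  model = c("caravan 2wd", "caravan 2wd", "civic"),
  hwy   = c(22, 22, 32),
  stringsAsFactors = FALSE
)

# Mean over every row: the duplicated row counts twice.
mean_all <- mean(hwy_example$hwy)

# Mean after removing exact duplicate rows with unique().
mean_unique <- mean(unique(hwy_example)$hwy)
```

Here `mean_all` is 76/3 (about 25.3) and `mean_unique` is 27, so the treatment of duplicates changes the summary.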

I never saw the dups in the mpg table until you pointed it out... I'm interested in what the reason is for what appear to be duplicate rows too.


I realize this explanation doesn't make them not duplicate entries (intentional use of the double negative), but there are variables that aren't in the dataset that may make them not actually the same thing…

So, for example, according to my well-thumbed KBB, the 2008 Chevy Suburban 2WD, auto transmission, etc. came in 4 different trim levels, and with 3 different gas tank size options. So, as far as KBB is concerned there'd be multiple, distinct, models that fit the variables in the mpg dataset.
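
A small sketch of how this could happen, using invented data: the `trim` column and its values are hypothetical stand-ins for whatever variables were left out of mpg. Two rows that differ only in an omitted variable look like exact duplicates once that variable is dropped.

```r
# Toy data: two distinct vehicles that differ only in a (hypothetical) trim level.
full <- data.frame(
  model = c("c1500 suburban 2wd", "c1500 suburban 2wd"),
  trim  = c("trim A", "trim B"),
  hwy   = c(20L, 20L),
  stringsAsFactors = FALSE
)

# With every variable present, no row duplicates another.
dups_full <- sum(duplicated(full))

# Dropping the distinguishing column makes the rows indistinguishable.
reduced <- full[, c("model", "hwy")]
dups_reduced <- sum(duplicated(reduced))
```

In this sketch `dups_full` is 0 and `dups_reduced` is 1.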


I'm just curious... in the stats world, would you just assume that duplicate entries in a set of data are really different entities, and that the set just didn't include spurious attributes like, in this case, gas tank size?

It seems like that would make it hard to verify that the stats on a data set like that were reliable.

Maybe this data set was put together just to serve as an example for trying out R? That would be sensible... keep examples as simple as possible for the tasks you are trying to illustrate.

BTW the reason for my comment about duplicates is that E. F. Codd, the mathematician who invented relational databases to intrinsically ensure the integrity of data, required no duplicates in a table. That's why the dups in mpg were a surprise to me.

Here is the particular rule

Rule 2: The guaranteed access rule:
Each and every datum (atomic value) in a relational data base is guaranteed to be logically accessible by resorting to a combination of table name, primary key value and column name.

You can find Codd's 12 rules, which define the behavior of a relational data base system, in a bunch of places on the web; here is one:


Oh, from a stats perspective with this actual data, I'm with you 100%. I thought the question was more out of curiosity. That, and I enjoy any excuse to peruse my little KBB (especially since they recently announced that they won't be printing them anymore :cry:).

Thanks! That idea hadn't occurred to me, and it seems plausible. In fact, the original data seems to have more variables (though I've given up trying to infer which columns and records correspond to the mpg data):