Are dataframes in R implemented as Heap-organized tables in relational databases?

Odyssey2001 · October 19, 2022, 1:43pm

Hello everyone,

I have a question about the default order of observations in a data.frame in R. Of course functions such as sort and arrange allow to retrieve observations based on the desired criteria. Yet, I would like to undersntand how the internal structures of dataframes in R handle by default the order of observations once the dataframe is being queried to read data. To clarify what I mean exactly, I prefer to compare dataframes in R to ordinary tables in classic relational databases such as Oracle, PostgreSQL, etc. to show better what I intend to understand. Here is what Oracle online documentation says about the default order of rows (= observations in R) once they have been inserted into a table ( = for me that is data.frame in R):

Table Organization:

By default, a table is organized as a heap, which means that the database places rows where they fit best rather than in a user-specified order. Thus, a heap-organized table is an unordered collection of rows.

As users add rows, the database places the rows in the first available free space in the data segment. Rows are not guaranteed to be retrieved in the order in which they were inserted.

Therefore, this means for example if you insert the following CSV file into a table :
Data.txt

line-01
line-02
line-03
line-04
line-05

Then there is no guarantee that

select * from mytable

returns the rows exactly based on the order as they appear in the data file Data.txt unless an appropriate ORDER BY clause has been specified. So my question is that whether the same concept exists for dataframes in R. If you import data from a CSV file into a dataframe and you interrogate the data without using any kind of sort, arrange, etc. will the data be returned exactly based on the very same order as they were in the CSV data file and does a dataframe maintain this order unless the user specifies differently via sort, arrange. etc. ?

I checked the manual of data.frame but I didn't find anything in this regard. Therefore, I would appreciate if you could kindly make some clarification.

Thanks in advance

martin.R · October 19, 2022, 2:06pm

The row order is maintained unless an order/sort/arrange/key/index (depending on package and function) operation is performed.

dvetsch75 · October 19, 2022, 2:29pm

Dataframes are ordered collections. To break this down a bit, let's look at this example:

class(iris)
#> [1] "data.frame"
dput(head(iris, 10))
#> structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6, 
#> 5, 4.4, 4.9), Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6, 3.9, 3.4, 
#> 3.4, 2.9, 3.1), Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 
#> 1.4, 1.5, 1.4, 1.5), Petal.Width = c(0.2, 0.2, 0.2, 0.2, 0.2, 
#> 0.4, 0.3, 0.2, 0.2, 0.1), Species = structure(c(1L, 1L, 1L, 1L, 
#> 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("setosa", "versicolor", "virginica"
#> ), class = "factor")), row.names = c(NA, 10L), class = "data.frame")

^{Created on 2022-10-19 by the reprex package (v1.0.0)}

This shows us that a dataframe is pretty much a list of lists - which explains why as.data.frame.list works so well and why the way to construct a dataframe and a list are very similar.

Knowing that, then we can answer the question about whether or not data.frames are ordered. In the documentation for list, we see that:

Almost all lists in R internally are Generic Vectors

and since vectors are ordered collections, and lists are pretty much just vectors, and data.frames are pretty much just lists, you can be confident that a dataframe is an ordered collection.

If you want an empirical test of this, you can just build a dataframe and index into it many times and see if the sort order ever changes:

df <- data.frame(
    'a' = runif(9),
    'b' = runif(9)
) # Create a data.frame with 9 rows

# Add a row with different values
df <- rbind(df, data.frame('a' = 'foo', 'b' = 'bar'))

# If data.frames weren't ordered, you would expect at least one time in a 100K
# that 'foo' would be in one of the nine rows selected. If all rows are equally likely to get
# selected, then 'foo' has a 10% chance of being in the row
for(i in 100000) {
    tmp <- df[1:9, ]
    if('foo' %in% tmp$a) {
        print(tmp)
        break
    }
}

^{Created on 2022-10-19 by the reprex package (v1.0.0)}

Odyssey2001 · October 19, 2022, 6:48pm

Thank you very much for your help. Very interesting that the order of observations is maintained until the user decides otherwise.

system · October 26, 2022, 6:49pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.