When should a function accept a dataframe as an argument?

nguyen · February 6, 2018, 4:42pm

Lately I've found myself writing a lot of functions that accept a dataframe with a specific structure, for example:

library(tidyr)
library(ggplot2)

stocks <- data.frame(
  time = as.Date('2009-01-01') + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)

getStockPlot <- function(stocks_df){
     # reshape the argument (not the global `stocks`) from wide to tall
     stocks_tall <- gather(stocks_df, stock, price, -time)
     plt <- ggplot(data = stocks_tall) +
              geom_line(aes(x = time, y = price, colour = stock))
     plt
}

In a lot of ways, this comes across as bad function design: how is the user supposed to know what format of dataframe to pass in? An OO language like Python or Java would solve this by only accepting an instance of a particular class as the function argument, but I'm not sure reference classes are the right answer here.

So my questions are:

Should you ever write functions that accept a dataframe with a specific structure?
If not, how do you like to handle this?
How much of the data manipulation burden do you put on the user? For example, I could have made my function accept only the tall version of the data instead of the wide version.

danr · February 6, 2018, 5:46pm

S3 generics might give you what you want, and they support inheritance too. Look up the UseMethod and NextMethod functions.

library(tidyr)
library(ggplot2)

stocks <- data.frame(
    time = as.Date('2009-01-01') + 0:9,
    X = rnorm(10, 0, 1),
    Y = rnorm(10, 0, 2),
    Z = rnorm(10, 0, 4)
)
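# prepend a "stock" class, keeping "data.frame" so it still behaves like one
class(stocks) <- c("stock", class(stocks))

# this is a plain data.frame without the "stock" class
# (it could also carry some other class not named "stock")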
not_stocks <- data.frame(
    time = as.Date('2009-01-01') + 0:9,
    X = rnorm(10, 0, 1),
    Y = rnorm(10, 0, 2),
    Z = rnorm(10, 0, 4)
)

# this is like an abstract base method
getStockPlot <- function(stocks_df) {
    UseMethod("getStockPlot")
}

# this is the implementation for "stock" objects,
# you could have more for other "class" objects
getStockPlot.stock <- function(stocks_df){
    print("Plot Stocks")
}

# this captures unsupported objects
getStockPlot.default <- function(stocks_df) {
    stop("class not supported")
}

# this calls getStockPlot.stock
getStockPlot(stocks)
#> [1] "Plot Stocks"
# this calls getStockPlot.default
getStockPlot(not_stocks)
#> Error in getStockPlot.default(not_stocks): class not supported
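
The inheritance side mentioned above could look roughly like the sketch below. The names new_stock and summary.stock are hypothetical, made up for illustration. The idea is that a small constructor checks the structure once and tags the object, and a method can then add its own behaviour before handing over to the data.frame method with NextMethod().

```r
# Hypothetical constructor: checks the structure once, then tags the object.
new_stock <- function(df) {
  stopifnot(
    is.data.frame(df),
    "time" %in% names(df),
    inherits(df$time, "Date")
  )
  # Prepend "stock" so the object still behaves like a data.frame.
  class(df) <- c("stock", class(df))
  df
}

# A "stock" method for the existing summary() generic: prints a header,
# then falls through to summary.data.frame() via NextMethod().
summary.stock <- function(object, ...) {
  cat("Stock prices for", nrow(object), "dates\n")
  NextMethod()
}

stocks <- new_stock(data.frame(
  time = as.Date("2009-01-01") + 0:9,
  X = rnorm(10, 0, 1)
))

class(stocks)    # c("stock", "data.frame")
summary(stocks)  # header line, then the usual data.frame summary
```

Because the class is prepended rather than replaced, anything that works on a data.frame keeps working on a "stock" object, and only the methods you define for "stock" change.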

tbradley · February 6, 2018, 6:15pm

You may also want to take a look at programming with dplyr using tidy eval. It lets the user specify the columns of interest the same way they would in other dplyr functions.
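
As a rough sketch of that idea applied to the plotting function from the question: plot_lines and the column names below are hypothetical, and the `{{ }}` operator assumes rlang 0.4.0 or later together with ggplot2 3.0.0 or later (on older versions, enquo() and !! play the same role). It also assumes the data is already in long format.

```r
library(ggplot2)

# Hypothetical helper: the caller names the columns, so the function no
# longer depends on a fixed set of column names. Assumes long (tall) data.
plot_lines <- function(df, x, y, group) {
  ggplot(df) +
    geom_line(aes(x = {{ x }}, y = {{ y }}, colour = {{ group }}))
}

prices <- data.frame(
  day    = rep(as.Date("2009-01-01") + 0:4, times = 2),
  ticker = rep(c("X", "Y"), each = 5),
  value  = rnorm(10)
)

# Columns are passed unquoted, as in dplyr verbs.
plt <- plot_lines(prices, x = day, y = value, group = ticker)
```

The trade-off is that the required structure moves into the function signature: the user has to say which column is which, but the function no longer cares what they are called.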

Frank · February 6, 2018, 8:12pm

You could let them provide either wide or long format, explain the requirements in the documentation, and implement a check near the top of the function, like:

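library(vetr)      # vet() for structural validation
library(magrittr)  # %<>% assignment pipe
library(tidyr)     # gather()
library(ggplot2)   # ggplot(), geom_line()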

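# check DF against a zero-row template: same columns and types, any number of rows
vet_stockDF = function(DF){
  target = data.frame(
    time = Sys.Date()[0L],
    stock = character(0),
    price = numeric(0),
    stringsAsFactors = FALSE
  )
  # stop = TRUE turns a failed check into an error
  vet(target, DF, stop=TRUE)
}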

getStockPlot <- function(stocks_df, do_transform = TRUE){
  # wide input is reshaped to long first; long input skips this step
  if (do_transform)
    stocks_df %<>% gather(stock, price, -time)
  # validate the long format before plotting
  vet_stockDF(stocks_df)

  plt <- ggplot(data = stocks_df) +
    geom_line(aes(x = time, y = price, colour = stock))
  plt
}

Examples:

# pass original data
getStockPlot(stocks)

# pass long data
longstocks = gather(stocks, stock, price, -time)
getStockPlot(longstocks, do_transform = FALSE)

# pass invalid data
library(dplyr)
getStockPlot(mutate(stocks, time = Sys.time()))
# Error in vet(target, DF, stop = TRUE) : 
#   `class(DF$time)[2]` should be "Date" (is "POSIXt")

The vetr homepage lists alternative packages with similar functionality. I find vetr very convenient for testing DF attributes.
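
If you would rather not add a dependency, the same kind of check can be approximated in base R. This is only a sketch: check_stock_df is a hypothetical name, and it covers just the long-format requirements used above (a Date column time, a stock column, and a numeric price column).

```r
# Hypothetical base-R check for the long (tall) layout used above.
check_stock_df <- function(df) {
  if (!is.data.frame(df)) {
    stop("`df` must be a data frame", call. = FALSE)
  }
  required <- c("time", "stock", "price")
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("missing column(s): ", paste(missing_cols, collapse = ", "),
         call. = FALSE)
  }
  if (!inherits(df$time, "Date")) stop("`time` must be a Date", call. = FALSE)
  if (!is.numeric(df$price)) stop("`price` must be numeric", call. = FALSE)
  invisible(df)  # return the input so the check can sit inside a pipeline
}

long_ok <- data.frame(
  time  = as.Date("2009-01-01") + 0:1,
  stock = c("X", "Y"),
  price = c(1.5, 2.5)
)
check_stock_df(long_ok)  # passes silently

# Dropping a required column produces a readable error message.
res <- tryCatch(check_stock_df(long_ok[c("time", "price")]),
                error = conditionMessage)
res  # "missing column(s): stock"
```

This is more verbose than a template-based check, but the error messages are entirely under your control.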