Advice on how much to manipulate the return of an API

nviau · December 14, 2018, 6:20pm

Hello,

I'm currently writing a package for a city's open data store API (CKAN). Some of the data it returns is a bit much to work with. For example:

Many columns that serve little or no purpose - e.g., there's a date-time variable, but then also separate columns for "Year", "Month", "Day", "Hour", etc.
Less-than-ideal data formatting - e.g., a column labeled "UCR" has character entries written as "UCR Level One", "UCR Level Two", etc., and I would rather code them as a factor with levels 1, 2, 3.
Poorly formatted variable names - e.g., very verbose, all in capitals, etc.

Does anybody have any advice on how opinionated I should get in designing the API returns? Changing column/variable names or removing entire columns seems risky because it will introduce conflict with the data store's own codebooks, but I do think it will increase user satisfaction.

I was thinking of having some sort of option such as pretty = TRUE that lets the user get a cleaner, albeit opinionated, version of the data instead of the raw return.

I do plan on releasing the package on CRAN at some point.

nwerth · December 14, 2018, 8:15pm

Many columns with variables that serve little or no purpose

I'd keep them. Yes, they're potentially redundant, but unused columns aren't likely to hurt the user. They cost you effort, but that's not the user's concern.

If there are so many that users would often have performance issues, then dropping them is reasonable.

Less than ideal data formatting

It's fine to "clean" values, as long as they're similar to the original. Your example makes sense. Remember that users will likely consult the official API documentation while using your package, so don't make it too hard to mentally map values from one to the other.

On a related note, consider the data's meaning when choosing column classes. If UCR is a grading system, then (IMO) an ordered factor is more appropriate than a numeric or integer column.
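
As a rough sketch of the ordered-factor idea, assuming the three level names below are the complete set and that "One" ranks lowest (both are guesses; the data store's codebook is the authority), something like this keeps the ranking while shortening the labels:

```r
# Hypothetical raw values, in the style described in the original post.
ucr_raw <- c("UCR Level Two", "UCR Level One", "UCR Level Three", "UCR Level One")

# ASSUMPTION: these are all the levels, listed lowest to highest.
ucr_levels <- c("UCR Level One", "UCR Level Two", "UCR Level Three")

# An ordered factor keeps the ranking. The short labels "1", "2", "3"
# are still easy to map back to the original text.
ucr <- factor(ucr_raw, levels = ucr_levels, labels = c("1", "2", "3"), ordered = TRUE)

ucr < "3"         # comparisons respect the level order
as.integer(ucr)   # integer codes, for users who want numbers
```

Any value not listed in `ucr_levels` becomes `NA`, so it is worth checking for unexpected entries before converting.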

Poorly formatted variable names

Definitely change those. You're making an interface for R code, so use a naming style familiar to R programmers. Of course, try to keep it easy for users consulting the official docs.
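
One way to do the renaming in base R is sketched below. `clean_names()` is a made-up helper and the column names are invented for illustration; keeping a lookup of old and new names is one option for helping users who consult the official docs.

```r
# Hypothetical helper: verbose, upper-case API names -> snake_case.
clean_names <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)   # runs of spaces/punctuation become "_"
  x <- gsub("^_+|_+$", "", x)       # trim leading/trailing underscores
  make.unique(x, sep = "_")         # avoid duplicate names after cleaning
}

# Invented example of a raw API result.
raw <- data.frame(
  `OCCURRED DATE-TIME` = "2018-12-14 18:20:00",
  `UCR` = "UCR Level One",
  check.names = FALSE
)

original_names <- names(raw)
names(raw) <- clean_names(original_names)

# Lookup from cleaned name to the name used in the official codebook.
name_map <- setNames(original_names, names(raw))
name_map[["occurred_date_time"]]
```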

I was thinking of having some sort of option such as pretty = TRUE

This makes me nervous. If it makes too many changes (renaming columns, dropping columns, adding columns), then switching between TRUE and FALSE in a program would require rewriting all the code that deals with the result. And since working with that result is the whole point of the package, that would probably be most of the code in the program.

Also, if you offer utility functions for the data, each of those functions would need multiple versions to handle the different formats.
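
To make the concern concrete, here is a small sketch. `fetch_incidents()`, `count_by_ucr()`, and the data are all invented; this is not the package's actual interface. It shows downstream code written for one format quietly doing the wrong thing with the other.

```r
# Hypothetical fetcher with a pretty switch; data and names are made up.
fetch_incidents <- function(pretty = FALSE) {
  raw <- data.frame(
    `UCR` = c("UCR Level One", "UCR Level Two"),
    `YEAR` = c(2018L, 2018L),
    check.names = FALSE
  )
  if (!pretty) {
    return(raw)
  }
  # "Pretty" form: renamed column, recoded values, redundant column dropped.
  data.frame(
    ucr = factor(raw$UCR,
                 levels = c("UCR Level One", "UCR Level Two"),
                 labels = c("1", "2"),
                 ordered = TRUE)
  )
}

# Utility function written against the pretty format.
count_by_ucr <- function(df) table(df[["ucr"]])

count_by_ucr(fetch_incidents(pretty = TRUE))    # counts per level
count_by_ucr(fetch_incidents(pretty = FALSE))   # no `ucr` column: empty table
```

The second call does not raise an error; it just returns an empty table, which is the kind of silent mismatch that would force either two versions of every utility function or a rewrite of the calling code.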

Your highest goal is to reduce the mental, munging, and coding burdens on an R programmer. They don't even want to think "this is from an API, and follows these rules, and..." They just want to use the data and stay in "high concept" land.