Seeking beginner friendly advice: How do I look at data? How do I know what I am doing is correct?

I am a front end programmer who is used to looking at code but not data. What I mean is: when I am data wrangling with SQL results or data frames, I tend to look at small and large result sets which look more or less like each other.

A lot of times I am looking at the head of a result frame and it looks alright, but it turns out the middle section of a 25000 row dataset was incorrect all along.

So how can I tell if the result sets I am wrangling with are correct?

Everything looks so similar and it's easy to make a mistake. Staring at grids of numbers can be confusing.

I guess #rstats packages like visdat with functions like vis_compare can help compare similar datasets with each other.

Any other software that can help me track anomalies?

Also, does one develop a certain instinct or level of confidence around judging the correctness of one's data as one works with it? Or do you need peers who cross validate your work for you?

Would appreciate some advice from wizened data wranglers.

the following link may help

Hi there, my advice is pretty general and doesn't point to any magical solutions, but I hope it will lead you in the right direction.

My first tip is to have a solid enough expectation of what the result data set should look like so that you could test against it, formally or informally. This requires knowledge of the domain you are working in and the mechanics of how the data are collected, to some extent. For example, should there be missing values for certain fields? Should there be duplicate entries for a particular field or collection of fields? Or, should a field be numeric when it is showing as class character? Perhaps that means an invalid value snuck its way in and changed the class of that field. These are the types of questions I would ask each time I look at the data. I would also check the “data domain”: for each field, look at all of the entries that actually occur and match them up with your expectation. For qualitative data you can use dplyr::count to do this, and for quantitative data you can check ranges, min/max values, etc. Visually, box/whisker plots and histograms work well here.
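
To make those questions concrete, here is a rough sketch of how they could look in code. The `orders` table and its columns are made up purely for illustration; swap in your own result set and your own expectations.

```r
library(dplyr)

# Hypothetical query result (names and values are invented for this example).
orders <- tibble(
  order_id = c(1, 2, 2, 4, 5),
  status   = c("paid", "paid", "paid", "refunded", "PAID"),
  amount   = c("10.5", "20", "20", "n/a", "7.25")  # arrives as character, not numeric
)

# Should there be missing values? Count NAs per column.
na_counts <- colSums(is.na(orders))

# Should there be duplicates? List keys that occur more than once.
dup_ids <- orders %>% count(order_id) %>% filter(n > 1)

# Is every field the class you expect?
col_classes <- vapply(orders, function(col) class(col)[1], character(1))

# Data domain of a qualitative field: every distinct value and how often it occurs.
status_counts <- orders %>% count(status, sort = TRUE)

# A numeric field showing up as character: find the entries that refuse to parse.
amount_num  <- suppressWarnings(as.numeric(orders$amount))
bad_amounts <- orders$amount[is.na(amount_num) & !is.na(orders$amount)]

# Range of the values that did parse.
amount_range <- range(amount_num, na.rm = TRUE)

print(dup_ids)
print(status_counts)
print(bad_amounts)
```

Here `status_counts` surfaces the stray "PAID" spelling and `bad_amounts` shows which entry turned the column into character, which are the kinds of things that are easy to miss when scrolling through a grid.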

Besides head, I would also recommend dplyr::glimpse, which prints the columns horizontally and gives you a bit more metadata. You are right that the head of a result can look fine while rows in the middle are wrong, which is why either of these functions should only serve as an initial sanity check.
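
As a small illustration of why the first rows are only a starting point, the sketch below builds a made-up 25000 row result with a problem planted in the middle. The use of `slice_sample` and `summary` is my own suggestion for looking beyond the head, not something required by the packages mentioned above.

```r
library(dplyr)

set.seed(42)

# Hypothetical result set with a problem hidden in the middle rows.
result <- tibble(id = 1:25000, value = rnorm(25000))
result$value[12000:12010] <- NA

head(result)      # first rows only; the planted NAs are nowhere near here
glimpse(result)   # one line per column: type plus the first few values

# A random sample of rows looks at more than the top of the table.
slice_sample(result, n = 10)

# Summaries cover every row, so the missing values show up here.
summary(result$value)
n_missing <- sum(is.na(result$value))
print(n_missing)
```

A random sample can still miss a small bad patch, so the whole-column summaries are the part that actually guarantees every row was looked at.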

{skimr} is a nice package that can answer some of the questions with its skimr::skim function though IMO nothing is better than creating custom functions to check for all of your specific needs. Depending on how formal you want to get, you can take the knowledge from above and create more formal testing procedures which can be run every time you obtain a result set from a query. {testthat} makes this process easier.
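
If you want to go the more formal route, a minimal sketch with {testthat} might look like the following. The helper name `check_result`, the column names, and the rules themselves are hypothetical; the idea is only that your expectations become code you can rerun on every new result set.

```r
library(testthat)

# Hypothetical helper: bundles expectations about one particular result set.
check_result <- function(df) {
  test_that("result set matches expectations", {
    expect_false(anyNA(df$order_id))                 # key is never missing
    expect_equal(anyDuplicated(df$order_id), 0L)     # key is unique
    expect_type(df$amount, "double")                 # amount really is numeric
    expect_true(all(df$amount >= 0))                 # no negative amounts
    expect_true(all(df$status %in% c("paid", "refunded")))  # known statuses only
  })
}

# Made-up data that satisfies every expectation above.
good <- data.frame(
  order_id = c(1, 2, 3),
  status   = c("paid", "refunded", "paid"),
  amount   = c(10.5, 20, 7.25)
)

check_result(good)
```

Each `expect_*` line is one assumption about the data written down explicitly, so when a query changes and an assumption breaks, the failing expectation tells you which one.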

For some kinds of anomalies, the best thing to do is draw a histogram. This is pretty good at showing if there are nutso values.
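
For example, here is a quick sketch in base R with invented height data, where a few values were entered in the wrong unit. The variable names and numbers are assumptions for the sake of the example.

```r
set.seed(1)

# Hypothetical heights in cm, plus two values typed in metres and one typo.
heights_cm <- c(rnorm(1000, mean = 170, sd = 8), 1.75, 1.68, 17200)

# With a wild value present, almost everything collapses into a single bar,
# which is itself the warning sign.
hist(heights_cm, breaks = 50, main = "Heights (raw)", xlab = "cm")

# boxplot.stats() lists the points a box/whisker plot would draw as outliers.
outliers <- boxplot.stats(heights_cm)$out
print(sort(outliers))

# Histogram again without the flagged points, to see the shape of the rest.
hist(heights_cm[!heights_cm %in% outliers], breaks = 50,
     main = "Heights (flagged values removed)", xlab = "cm")
```

The flagged list may also contain a few legitimate values from the tails, so treat it as a list of rows to go and look at rather than rows to delete.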

That's the thing. I haven't reached the point where I know what kind of sensible expectations I can develop around data. Hopefully, this will get better over time. I guess I lack an intuition for what data is supposed to do in the real world. I don't know what questions to ask of it. I think this is something I really want to change.

Hey that looks like a useful resource. Thanks for sharing :slight_smile:

I imagine it is the same as how you learned to be a front end programmer. It takes time and exposure to different types of problems that can be solved with data for you to gain that “intuition”.