Structuring data from multiple repetitive rows (summaries of demographic data)

I couldn't find an answer to my question (maybe I don't know how to specifically search for this issue), so I'll try to explain as best as I can.

The thing is that I have data that came from multiple files, structured somewhat like this:

  • institution
    • date (month and year)
      • table-1
      • table-...
      • table-n

The data is a compilation of demographic data, so each row isn't exactly "an observation"; it's more like a summary of data, for example:

[[12]]
# A tibble: 16 x 15
   tipo_atencion      profesional `6-9` `10-14` `15-19` `20-24` `25-34` `35-44` `45-54` `55-64` `65_y_mas` hombres mujeres mes     año
   <chr>              <chr>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>      <dbl>   <dbl>   <dbl> <chr> <int>
 1 pre-concepcional   medico          0       0       0       0       0       0       0       0          0       0       0 enero  2012
 2 pre-concepcional   matrona/on      0       0       0       0       0       0       0       0          0       0       0 enero  2012
 3 prenatal           medico          0       0       0       0       0       0       0       0          0       0       0 enero  2012
 4 prenatal           matrona/on      0       0      11      19      29       9       0       0          0       0      68 enero  2012
.....

After loading and preparing the data, I found myself with a list of 12 tibbles (one per table), and when I bind the rows I go from 16 distinct rows to 192 (and I have data from 2012 to 2018...).
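For reference, the binding step is something like this (a simplified sketch; the list and column names are placeholders, and the .id column is just to keep track of which table each row came from):

library(dplyr)

# tablas is a named list with one tibble per table, e.g. list(tabla_01 = ..., tabla_02 = ...)
# .id adds a column recording which table each row came from
combined <- bind_rows(tablas, .id = "tabla")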

I have been thinking that maybe using R isn't the best way to structure the data and I should be using a relational database to avoid repetition, since I'm starting to feel this isn't going to scale well.

P.S. I couldn't think of a clearer way to word the title, so if somebody could suggest a better one, I'll change it ASAP.

This is a recurring untidy-data problem that I'm currently working on for another medical study, which has an analogous collection of data. In that case, 169 unique patients generated a 28,000-row tibble because the data were organized by observation:

- patient diagnostic code 1
- patient diagnostic code 2
...
- patient diagnostic code 18
- patient demographics

and so on. This reflected a failure of the researcher's source data (for which his institution was not responsible) to choose a single unit of observation, with patients as rows and the types of observations as variables in columns.

Your problem may be simpler. If your unit of observation is the institution, I would insert new columns for institution and year. I'll have to assume you have CSV files in a folder hierarchy descending by institution, then year, then observations. After creating the tibble in the form shown in your example:

# Create an "empty" tib, `combined` with the fields from your example with just the first line
# mutate to create year=2011 (date or integer) and institution = any string
# filter it to remove the data
combined <- combined %>% filter(year < 2011)
2012_tib <- original_tib %>% mutate(year == 2012, institution = institution_1) %>% row_bind(combined, 2012_tib)
#...
2018_tib <- original_tib %>% mutate(year == 2012, institution = institution_5)  %>% row_bind(combined, 2012_tib)

I think that by investing some time in normalizing your data, whether you choose to do it in a database or in R, you are going to save yourself a lot of trouble later.

And yes, a relational database design is going to scale much better.
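Just as a sketch of what normalizing in R could look like (table and column names here are only placeholders), you could split the combined data into a small lookup table and an observation table that references it by key:

library(dplyr)

# Lookup table: one row per institution, with a surrogate key
institutions <- combined %>%
  distinct(institution) %>%
  mutate(institution_id = row_number())

# Fact table: observations keep only the key instead of repeating the institution text
observations <- combined %>%
  left_join(institutions, by = "institution") %>%
  select(-institution)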


Thanks @technocrat!

Actually, I need to analyze those specific tables and compare them between months or by year. Maybe I should create a tibble that is institution + table name + the specific table itself (like a tibble per row in my example). Could that work?

Yes, I'm trying to do that, but I just found that I'm starting to see a lot of repeated rows, and it smelled wrong.

R/dplyr can still connect to and work with a SQL database, right?

You can use dbplyr to apply dplyr commands to a SQL database, and yes, it is possible to nest tibbles inside other tibbles, but if you are going to normalize your data, this should be done in the modeling part of your process, not in the wrangling and cleaning part.
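As a rough sketch of that workflow (using an in-memory SQLite database purely for illustration; the table and column names are placeholders):

library(DBI)
library(dplyr)
library(dbplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "observations", observations)

# dplyr verbs on the lazy table are translated to SQL and run in the database
obs_db <- tbl(con, "observations")
obs_db %>%
  group_by(año, mes) %>%
  summarise(total_mujeres = sum(mujeres, na.rm = TRUE)) %>%
  collect()  # bring the summarised result back into R as a tibble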

There is nothing inherently wrong with that approach, but I do believe the normalization path pays off better in the long term.


I agree that if this is more than a one-off, a DB is the way to go, and yes, R works very well with SQL of various flavors.

colnames(your_tibble) <- c("institution", "mon_year", "tipo_atencion", "profesional", "6-9", ...

provided that you split tipo_atencion and profesional into two variables (assuming you don't have more columns than shown).
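And if you do go down the database road, a minimal sketch of pushing the renamed tibble into a table (SQLite, and the file and table names here, are just examples):

library(DBI)

# Write the renamed tibble into a SQLite file; overwrite replaces any existing table
con <- dbConnect(RSQLite::SQLite(), "demografia.sqlite")
dbWriteTable(con, "atenciones", your_tibble, overwrite = TRUE)
dbDisconnect(con)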
