What can I use instead of tibble (tidyverse) for large data frames in R?

joelt92 · March 15, 2019, 11:53pm

I am new to R and working with a dataframe as below:

      Year       Zip      Total_Population Median_Income               City State
1    2014 ZCTA5 43001             2475         87333             Alexandria    OH
2    2014 ZCTA5 43002             2753         83873                  Amlin    OH
3    2014 ZCTA5 43003             2366         46691                 Ashley    OH
4    2014 ZCTA5 43004            24625         70809              Blacklick    OH
5    2014 ZCTA5 43005              155         43810            Bladensburg    OH
6    2014 ZCTA5 43006              705         45673             Brinkhaven    OH
7    2014 ZCTA5 43008             2430         28422           Buckeye Lake    OH
8    2014 ZCTA5 43009             2036         62188                  Cable    OH
9    2014 ZCTA5 43010              386         34625                Catawba    OH
10   2014 ZCTA5 43011             7733         66548             Centerburg    OH
11   2014 ZCTA5 43013              966         57813                 Croton    OH
12   2014 ZCTA5 43014             3610         46034               Danville    OH
13   2014 ZCTA5 43015            50809         63244               Delaware    OH
14   2014 ZCTA5 43016            34409         89268                 Dublin    OH
15   2014 ZCTA5 43017            39329         96795                 Dublin    OH
16   2014 ZCTA5 43019             9722         64080          Fredericktown    OH
17   2014 ZCTA5 43021            10910        123444                 Galena    OH
18   2014 ZCTA5 43022             4089         66346                Gambier    OH
19   2014 ZCTA5 43023            12624         97875              Granville    OH
20   2014 ZCTA5 43025             5870         54918                 Hebron    OH
21   2014 ZCTA5 43026            58392         77973               Hilliard    OH
22   2014 ZCTA5 43028             7857         56788                 Howard    OH
23   2014 ZCTA5 43029              631         34697                  Irwin    OH
25   2014 ZCTA5 43031            12390         71486              Johnstown    OH
26   2014 ZCTA5 43032              127         23750              Kilbourne    OH
27   2014 ZCTA5 43033              410         43750           Kirkersville    OH
28   2014 ZCTA5 43035            26130        105336           Lewis Center    OH
29   2014 ZCTA5 43036              268         38438       Magnetic Springs    OH
30   2014 ZCTA5 43037              370         44464            Martinsburg    OH

I have used the below code to find which zip codes experienced the greatest decrease in total population from 2014 to 2017:

library(tidyverse)
zips <- tibble::tribble(
  ~Year,          ~Zip, ~Total_Population, ~Median_Income,                 ~City,
  2013, "ZCTA5 43001",              2475,          87333,    "Alexandria    OH",
  2013, "ZCTA5 43002",              2753,          83873,  "Amlin           OH",
  2014, "ZCTA5 43003",              2366,          46691,   "Ashley         OH",
  2014, "ZCTA5 43001",             24625,          70809, "Blacklick        OH",
  2014, "ZCTA5 43005",               155,          43810,   "Bladensburg    OH",
  2015, "ZCTA5 43006",               705,          45673,    "Brinkhaven    OH",
  2015, "ZCTA5 43001",              2430,          28422,  "Buckeye Lake    OH",
  2016, "ZCTA5 43009",              2036,          62188,         "Cable    OH",
  2016, "ZCTA5 43010",               386,          34625,       "Catawba    OH",
  2016, "ZCTA5 43001",              7733,          66548,    "Centerburg    OH"
)

# Keep the two years, spread them into columns, total per zip, then subtract
diff <- zips %>%
  dplyr::filter(Year %in% c(2013, 2016)) %>%
  spread(Year, Total_Population) %>%
  group_by(Zip) %>%
  summarise(Total2013 = sum(`2013`, na.rm = TRUE),
            Total2016 = sum(`2016`, na.rm = TRUE)) %>%
  mutate(Difference = Total2013 - Total2016)

diff

Output:

# A tibble: 4 x 4
  Zip         Total2013 Total2016 Difference
  <chr>           <dbl>     <dbl>      <dbl>
1 ZCTA5 43001      2475      7733      -5258
2 ZCTA5 43002      2753         0       2753
3 ZCTA5 43009         0      2036      -2036
4 ZCTA5 43010         0       386       -386

However, as you can see, tibbles are only practical if I copy the output from my console, which is limited to a certain number of rows. My actual data frame has data for more than 100 zip codes. Is there another function in R, like tibble, that can take in the whole data frame and produce the same output?

andresrcs · March 16, 2019, 12:12am

What do you mean by this? Your data frame is relatively small and dplyr can handle it without a problem.
In your code above, tibble is only being used to provide sample data; you just have to replace the zips data frame with your actual data.
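
To illustrate, here is a minimal sketch of the same kind of comparison run directly on a plain data.frame, with no tribble() involved. The names zips_df and pop_change and the four sample rows are made up for illustration; your real data frame would take the place of zips_df. Note that this sketch uses a conditional sum inside summarise() instead of spread(), which is a different route to the same totals, not the code from the question.

```r
library(dplyr)

# Hypothetical stand-in for the real data: an ordinary data.frame
zips_df <- data.frame(
  Year = c(2013, 2013, 2016, 2016),
  Zip = c("ZCTA5 43001", "ZCTA5 43002", "ZCTA5 43001", "ZCTA5 43009"),
  Total_Population = c(2475, 2753, 7733, 2036),
  stringsAsFactors = FALSE
)

pop_change <- zips_df %>%
  filter(Year %in% c(2013, 2016)) %>%
  group_by(Zip) %>%
  # Total the population per zip for each year (0 if the year is missing)
  summarise(
    Total2013 = sum(Total_Population[Year == 2013]),
    Total2016 = sum(Total_Population[Year == 2016])
  ) %>%
  # Positive Difference means the population decreased
  mutate(Difference = Total2013 - Total2016) %>%
  arrange(desc(Difference))

# The result is a tibble; n = Inf prints every row instead of the first 10
print(pop_change, n = Inf)
```

Sorting with arrange(desc(Difference)) puts the zip codes with the largest decrease at the top.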

andresrcs · March 16, 2019, 12:20am

This is also posted on SO

rensa · March 16, 2019, 1:18am

Hi @joelt92! Are you saying that you'd like to get the benefits of tibbles with an existing data frame, rather than manually entering your data in via the console with tibble::tribble()?

You can convert a data frame (for example, one called df) to a tibble using tibble::as_tibble(df). In addition, the readr package has a lot of data input functions, like readr::read_csv(), that natively return tibbles. Does that help?
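
As a rough sketch of both options, assuming a made-up data frame df with 120 rows (the column values and the temporary CSV file are purely illustrative):

```r
library(tibble)
library(readr)

# Hypothetical data frame with more than 100 zip codes
df <- data.frame(
  Zip = sprintf("ZCTA5 %05d", 43001:43120),
  Total_Population = seq_len(120),
  stringsAsFactors = FALSE
)

# Option 1: convert an existing data frame to a tibble
tb <- as_tibble(df)

# Tibbles print only the first rows by default; n = Inf shows all of them
print(tb, n = Inf)

# Option 2: read a file straight into a tibble.
# A temporary file stands in for a real CSV here.
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)
from_file <- read_csv(path)
```

The row limit only affects how a tibble is printed. All rows are still there, so any dplyr pipeline runs on the full data.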
