Read chunkwise and compute 2 variables which interact with each other

Please note that this is a cross post.

Originally posted here: r - Read chunkwise and compute 2 variables - Stack Overflow

I wish to read from MySQL database chunkwise, create TWO summary variables and write chunkwise to another table. Here is how I think we may do this seperately as follows(I have not tried the code):

library(RMySQL)
library(chunked)
library(dplyr)

connection <- dbConnect(dbDriver("MySQL"),host = "myhost",
              db="mydb",user="myuser",password="mypass")

# Computing the number of members in a household.

tbl(connection,"table1") %>%
    read_chunkwise() %>%
    select(hid,year) %>%
    group_by(hid,year) %>%
    summarise(Total_members= n()) %>%
    write_chunkwise("table2")

# Extracting the age of the head of household, I know that for
# each hid, year there will be only one member with  
# relation_with_hoh == "hoh"

tbl(connection,"table1") %>%
    read_chunkwise() %>%
    select(hid,year,relation_with_hoh,age) %>%
    filter(relation_with_hoh=="hoh") %>%
    mutate(hoh_age = age) %>%
    select(hid,year,hoh_age) %>%
    write_chunkwise("table2")

My question is: How do I these 2 together ?

I don't have experience using 'chunked', though initial reading makes it seem straightforward, unfortunately I can't reconcile what you are asking to be able to do with my knowledge of databases in general and what they do or don't allow.

It seems your ultimate goal is to simultaneously extend the same table (table2) with differently structured data. I would assume this is not possible whether or not chunking was a desired feature. Databases generally establish a structure for insertion that then must be respected in subsequent insertions. Unless perhaps you are allowing for 'dummy columns' that some insertions will pad with missing values, and other insertions wont ?

Hello,

My goal is to write "streaming" code with chunked package. I need to do this since the database is huge and won't fit into memory (RAM).

I need to compute 2 differently structured variables.

Perhaps that is more clear.

Any idea how I can do this?

One way to do this is 2 compute the 2 variables separately and then do a streaming join If there is such a thing

The documentation explains that basic dplyr joins are not supported, let alone a streaming variation.
edwindj/chunked: Chunkwise Text-file Processing for 'dplyr' (github.com)
Since data is processed in chunks, some dplyr verbs are not implemented:

  • arrange
  • right_join
  • full_join

In fact, there are cautions given on group_by / summarise that you don't seem to have taken on board in the way you constructed your example :
summarize and group_by are implemented but generate a warning: they operate on each chunk and not on the whole data set. However this makes is more easy to process a large file, by repeatedly aggregating the resulting data.

tmp <- tempfile()
write.csv(iris, tmp, row.names=FALSE, quote=FALSE)
iris_cw <- read_chunkwise(tmp, chunk_size = 30) # read in chunks of 30 rows for this example

iris_cw %>% 
  group_by(Species) %>%            # group in each chunk
  summarise( m = mean(Sepal.Width) # and summarize in each chunk
           , w = n()
           ) %>% 
  as.data.frame %>%                  # since each Species has 50 records, results will be in multiple chunks
  group_by(Species) %>%              # group the results from the chunk
  summarise(m = weighted.mean(m, w)) # and summarize it again

i.e. note the doubled use of group_by/.summarise with the second variation reformulated to use a weighted mean

Hello and many thanks for your reply.

In my example, the summarise and filter interact with each other.

In other words, Total_members and hoh_age can't together computed for a given hid, year.

That is why the double groupby trick will fail I think.

Perhaps I am mistaken. Am I ?

do they ?
the first query group_by and summarises and has no filters.
the second query filters and doesnt have any grouping or summarising.

It seems to me your queries have nothing in common apart from wanting to both stream in
hid and year, they then diverge immediately.

yes, but you are drawing the wrong conclusion. I'm saying that even sticking with separate queries for both, you will need to rewrite your first query to compute a correct result, by adding a second level of grouping and summarising.

Hello,

In the first segment of the code I compute for each hid,year the Total_members.

In the second for each hid, year I compute the age of the head of household (hoh_age).

I had written in the comment that :

# Extracting the age of the head of household, I know that for
# each hid, year there will be only one member with  
# relation_with_hoh == "hoh"

that matches my understanding and changes nothing for me... have I missed something ?

I will borrow your original idea of a dummy column. That is the answer I think.

tbl(connection,"table1") %>%
    read_chunkwise() %>%
    select(hid,year) %>%
    group_by(hid,year) %>%
    summarise(Total_members= n()) %>%
# A column with a constant value, this could be buggy for the moment
    mutate(hoh_age = age[which(relation_with_hoh=="hoh")]) %>%
   collect() %>%
   group_by(hid,year) %>%
   summarise(Total_members  = sum(Total_members), hoh_age = mean(hoh_age) %>% 
   write_chunkwise("table2")
# Here is the correct way
tbl(connection,"table1") %>%
    read_chunkwise() %>%
    select(hid,year) %>%
    group_by(hid,year) %>%
    summarise(Total_members= n(),hoh_age = ifelse(length(which(relation_with_hoh=="hoh"))==1,age[which(relation_with_hoh=="hoh")],NA)) %>%
   collect() %>%
group_by(hid,year) %>%
# hoh_age will be different to NA exactly once in a group.
# A check to ensure that EXACTLY one row in each group of hid,year has relation_with_hoh="hoh"
   summarise(Total_members  = sum(Total_members), hoh_age = mean(hoh_age,na.rm=TRUE),check  = length( is.na(hoh_age)==1)) %>% 
   write_chunkwise("table2")

# The naive way which was confusing me

tbl(connection,"table1") %>%
    read_chunkwise() %>%
    select(hid,year) %>%
    group_by(hid,year) %>%
    summarise(Total_members= n()) %>%
# here is the point of confusion, the filter in the next line will
# not work for the entire group of hid and year
# since we have already done a summarise.
    filter(relation_with_hoh =="hoh") %>%
    mutate(hoh_age = age) %>%
   collect() %>%
   group_by(hid,year) %>%
   summarise(Total_members  = sum(Total_members), hoh_age = mean(hoh_age,na.rm=TRUE),check  = length( is.na(hoh_age)==1)) %>% 
   write_chunkwise("table2")

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.