Columns not found in tbl_spark

Hello, I am learning sparklyr, so thanks for your patience and help.

I have read in some data using spark_read_csv(), and I used dplyr's mutate() as well as sdf_mutate() to create some new variables, for example:

match_cat2 <- match_cat %>%
  mutate(Var_C = as.character(VarC)) %>%
  mutate(Var_B_Avg = (VarB1 + VarB2) / 2) %>%
  sdf_mutate(
    Var_C = ft_string_indexer(VarC),
    Var_D = ft_string_indexer(VarD)
  )

sdf_register(match_cat2, "match_cat2")

Now I'm trying to create some more variables, for example:

match_cat3 <- match_cat2 %>%
  group_by(VarE, VarF) %>%
  mutate(Var_G = if (any(Var_C == 1)) ((VarG - VarG[Var_C == 1]) / (Var_G + Var_G[Var_C == 1]) / 2) else NA)

However, I am getting an error that the column Var_G cannot be found in match_cat2:

Error in eval_bare(call, env) : object 'Var_G' not found

It's confusing me, since I can see the column Var_G in the Spark table match_cat2 within the "Connections" tab.

Thanks!

But your match_cat2 doesn't have Var_G in it? It only has Var_C, Var_B_Avg and Var_D. Or do you define it in some other place?

Oh yeah, it's in the dataframe; I just did not manipulate it in the first block of dplyr code. But I can see it when I click the dropdown arrow for match_cat2 in the Connections tab.

I'm wondering if the code inside the mutate() call is too complex or something?

I've not used Spark in a while, so my skills are rusty, but just visually your code looks totally fine. One thing to try is assigning Var_G to a different column: in your mutate() call, assign it to Var_temp or something to see if that helps.
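
Something like this (untested sketch, just your own code writing to a new column name):

match_cat3 <- match_cat2 %>%
  group_by(VarE, VarF) %>%
  mutate(Var_temp = if (any(Var_C == 1)) ((VarG - VarG[Var_C == 1]) / (VarG + VarG[Var_C == 1]) / 2) else NA)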

dbplyr doesn't support subsetting with brackets; e.g., this fails:

library(dplyr)

dbplyr::memdb_frame(a = c(1, 0)) %>%
  mutate(b = a - a[a == 1])
# Error in eval_bare(call, env) : object 'a' not found

You'll have to get clever and do what you want via table operations.
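
For example, something like this (a minimal sketch with a made-up grouping column g; adapt to your real columns): pull the reference rows out into their own table, then join them back so the reference value becomes an ordinary column you can use in mutate().

library(dplyr)

df <- dbplyr::memdb_frame(g = c("x", "x", "y", "y"),
                          a = c(1, 0, 1, 0))

# Reference rows (a == 1) become their own two-column table
ref <- df %>%
  filter(a == 1) %>%
  select(g, a_ref = a)

# After the join, a_ref is an ordinary column, so no brackets needed
df %>%
  left_join(ref, by = "g") %>%
  mutate(b = a - a_ref)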


Thank you guys so much! I am thinking of using the spark_apply() function, but I am a bit lost about where the function(e) needs to go within the following dplyr syntax.

match_cat3 <- match_cat2 %>%
  group_by(VarE, VarF) %>%
  mutate(Var_G = if (any(Var_C == 1)) ((VarG - VarG[Var_C == 1]) / (Var_G + Var_G[Var_C == 1]) / 2) else NA)

Here is my attempt at using spark_apply() with the mutate() equation from above. I would love some help with how to use function(e) and where the e goes within the syntax; I don't have any experience using a function within another function like this.

match_cat3 <- spark_apply(
  function(e)
    match_cat2 %>%
      group_by(e$VarE, e$VarF) %>%
      mutate(e$Var_G = if (any(e$Var_C == 1)) ((e$VarG - e$VarG[e$Var_C == 1]) / (e$Var_G + e$Var_G[e$Var_C == 1]) / 2) else NA, e)
)

Btw, this gives me an out of bounds error.

I was basing the syntax on the following block from the spark_apply() documentation:

trees_tbl %>%
  spark_apply(
    function(e) data.frame(2.54 * e$Girth, e),
    names = c("Girth(cm)", colnames(trees)))