recode_factor giving an error

Hello everyone,

I am trying to use recode_factor to convert a string to a factor.

I am unable to see what basic mistake I am making which is throwing this error.

> answer <- tbl(con,"mytable") %>% head %>% select(EDUCATION)
> answer
# Source:   lazy query [?? x 1]
# Database: mysql 10.3.25-MariaDB-0+deb10u1 
  EDUCATION     
  <chr>         
1 5th Std. Pass 
2 No Education  
3 8th Std. Pass 
4 8th Std. Pass 
5 11th Std. Pass
6 9th Std. Pass 
> answer <- tbl(con,"mytable") %>% head %>%
+         mutate(edu = recode_factor(EDUCATION,
+                                "No Education" = "None or Primary",
+                                "1st Std. Pass" = "None or Primary",
+                                "2nd Std. Pass" = "None or Primary",
+                                "3rd Std. Pass" = "None or Primary",
+                                "4th Std. Pass" = "None or Primary",
+                                "5th Std. Pass" = "None or Primary",
+                                "6th Std. Pass" = "Class 10",
+                                "7th Std. Pass" = "Class 10",
+                                "8th Std. Pass" = "Class 10",
+                                "9th Std. Pass" = "Class 10",
+                                "10th Std. Pass" = "Class 10",
+                                "11th Std. Pass" = "Class 12 / Diploma",
+                                "12th Std. Pass" = "Class 12 / Diploma",
+                                .default = "Graduate and Above",
+                                .ordered = TRUE
+                                ))
> answer
Error in .local(conn, statement, ...) : 
  could not run statement: Incorrect parameters in the call to stored function 'recode_factor'
In addition: Warning message:
Named arguments ignored for SQL recode_factor 
> 

Can someone please help me with this query?

Thank you.

Strangely, the same code works when I apply it to a vector.
It does not throw an error.

> EDUCATION
[1] "5th Std. Pass"  "No Education"   "8th Std. Pass"  "8th Std. Pass" 
[5] "11th Std. Pass" "9th Std. Pass" 
> recode_factor(EDUCATION,
+                                "No Education" = "None or Primary",
+                                "1st Std. Pass" = "None or Primary",
+                                "2nd Std. Pass" = "None or Primary",
+                                "3rd Std. Pass" = "None or Primary",
+                                "4th Std. Pass" = "None or Primary",
+                                "5th Std. Pass" = "None or Primary",
+                                "6th Std. Pass" = "Class 10",
+                                "7th Std. Pass" = "Class 10",
+                                "8th Std. Pass" = "Class 10",
+                                "9th Std. Pass" = "Class 10",
+                                "10th Std. Pass" = "Class 10",
+                                "11th Std. Pass" = "Class 12 / Diploma",
+                                "12th Std. Pass" = "Class 12 / Diploma",
+                                .default = "Graduate and Above",
+                                .ordered = TRUE
+                                )
[1] None or Primary    None or Primary    Class 10           Class 10          
[5] Class 12 / Diploma Class 10          
Levels: None or Primary < Class 10 < Class 12 / Diploma

It seems you are using dbplyr, if so, you must have in mind that until you collect() the result all "dplyr" like commands are translated into sql and executed on the RDBMS.

Sql doesn't have a recode_factor() equivalent, the closest to a factor you can get its an enum class and it can be altered but since it would be a permanent change to the database I don't think it would be a good idea to implement it on dbplyr.

I think you should explicitly collect() the result and then apply recode_factor() locally using dplyr.

Hi,

@andresrcs is correct, I was just typing this myself.
You can only use SQL compatible functions within the query (like min, max, mean, ...) for any other functions you need to first collect the data into a data frame

answer <- tbl(con,"mytable") %>%
collect %>%
mutate(edu = recode_factor( ....)

PJ

Many thanks andrercs and pieterjanvc.

I am a little confused now. I was hoping to:

  1. Read chunkwise(since the dataset is huge)
  2. Recode the string to factor
  3. Then do a collect.

Seems like the strategy is a failure.

Should I simply replace the recode_factor with a nested if_else?

Nothing else seems to make sense to me.

Thank you once again.

Hi,

If you want tot change the actual data in the database, you can use the UPDATE command with CASE WHEN statement to update the different values.

You can just Google the way to do it for different version of SQL

This is an example (did not test this)

q = dbSendStatement(con,
 "UPDATE mytable SET EDUCATION = CASE 
   WHEN EDUCATION IN ('No Education', '1st Std. Pass', '3rd Std. Pass') THEN 'None or Primary'
   WHEN EDUCATION IN ('6th Std. Pass', '7th Std. Pass', '8th Std. Pass') THEN 'Class 10' 
   ELSE 'Graduate and Above' END")

Hope this helps,
PJ

I think it would be better if you create a temporary table in the database with the recoding equivalents and use a side join operation in dbplyr like a left_join() for example. This way you don't permanently modify your database like with an UPDATE

Hello once again pieterjanvc and andrercs ,

I wish to:-

  1. Read chunkwise from one table
  2. Convert string to a factor using recode.
  3. Write to a seperate table.

I think the best solution is to replace recode_factor with a nested ifelse() which will create a new variable in the RAM and then write to the new table.

Does this sound right?

I think we need a little bit of context to give you any meaningful advice, you say you want to read chunkwise but your example doesn't reflect that, maybe we have a different concept of what chunkwise means.

As I understand your goal, it seems unnecessary to collect the data, more over, using R on the middle seems like adding unnecessary complexity and restrictions, what you describe could more efficiently be achieved with pure sql on the server side.

EDIT: For context, this was an answer to a deleted post from the OP, or at least it seems deleted for me right now. I'm having strange issues with the forum.

1 Like

Hello,

I have tried to make a Minimum Working Example.

        answer <- tbl(con,"input_table") %>%
        read_chunkwise(chunk_size=10000) %>%
        select(HH_ID,REF_PERIOD_WAVE,AGE_YRS,EDUCATION) %>%
        mutate(Count_Old  = ifelse(AGE_YRS>=60,1,0)) %>%
        mutate(edu = recode_factor(EDUCATION,
                               "No Education" = "None or Primary",
                               "1st Std. Pass" = "None or Primary",
                               "2nd Std. Pass" = "None or Primary",
                               "3rd Std. Pass" = "None or Primary",
                               "4th Std. Pass" = "None or Primary",
                               "5th Std. Pass" = "None or Primary",
                               "6th Std. Pass" = "Class 10",
                               "7th Std. Pass" = "Class 10",
                               "8th Std. Pass" = "Class 10",
                               "9th Std. Pass" = "Class 10",
                               "10th Std. Pass" = "Class 10",
                               "11th Std. Pass" = "Class 12 / Diploma",
                               "12th Std. Pass" = "Class 12 / Diploma",
                               .default = "Graduate and Above",
                               .ordered = TRUE
                               )) %>%
        group_by(HH_ID,REF_PERIOD_WAVE) %>%
        summarise(Tot_N = n(),
                  Old_N = sum(Count_Old),
                  max_hh_edu = max(edu,na.rm=TRUE)) %>%                                                                         
        collect() %>%
        group_by(HH_ID,REF_PERIOD_WAVE) %>% summarise(Tot_N = sum(Tot_N),Old_N = sum(Old_N)) 

        dbWriteTable(con,"output_table",answer)

In words, the input_table is read_chunkwise, we use R to create additional fields and then write the answer to another table.

Many thanks for your help.

As I said, that could be achieved much more efficiently with pure sql on the server side without even needing R at all but to give you an alternative pure sql solution I think I would need you to check your example since with the last group_by/summarize combo you are dropping max_hh_edu rendering the whole recode thing pointless. Can you please check?

Dear andrercs,

Many thanks for your reply.

You are right, while creating the minimal working example, I forgot about max_hh_edu in the end.

The corrected code is as follows:

 answer <- tbl(con,"input_table") %>%
        read_chunkwise(chunk_size=10000) %>%
        select(HH_ID,REF_PERIOD_WAVE,AGE_YRS,EDUCATION) %>%
        mutate(Count_Old  = ifelse(AGE_YRS>=60,1,0)) %>%
        mutate(edu = recode_factor(EDUCATION,
                               "No Education" = "None or Primary",
                               "1st Std. Pass" = "None or Primary",
                               "2nd Std. Pass" = "None or Primary",
                               "3rd Std. Pass" = "None or Primary",
                               "4th Std. Pass" = "None or Primary",
                               "5th Std. Pass" = "None or Primary",
                               "6th Std. Pass" = "Class 10",
                               "7th Std. Pass" = "Class 10",
                               "8th Std. Pass" = "Class 10",
                               "9th Std. Pass" = "Class 10",
                               "10th Std. Pass" = "Class 10",
                               "11th Std. Pass" = "Class 12 / Diploma",
                               "12th Std. Pass" = "Class 12 / Diploma",
                               .default = "Graduate and Above",
                               .ordered = TRUE
                               )) %>%
        group_by(HH_ID,REF_PERIOD_WAVE) %>%
        summarise(Tot_N = n(),
                  Old_N = sum(Count_Old),
                  max_hh_edu = max(edu,na.rm=TRUE)) %>%                                                                         
        collect() %>%
        group_by(HH_ID,REF_PERIOD_WAVE) %>% summarise(Tot_N = sum(Tot_N),Old_N = sum(Old_N), max_hh_edu = max(edu,na.rm=TRUE)) 

        dbWriteTable(con,"output_table",answer)

I can't test this because I don't have access to your input_table but the sql equivalent would be something like this:

 create table if not exists output_table as
 with temp_equivalence as (
	 select * 
	 from (values 
	 	('No Education', 'None or Primary', 1), 
	  	('1st Std. Pass', 'None or Primary', 1),
	  	('2nd Std. Pass', 'None or Primary', 1),
	  	('3rd Std. Pass', 'None or Primary', 1),
	  	('4th Std. Pass', 'None or Primary', 1),
	  	('5th Std. Pass', 'None or Primary', 1),
	  	('6th Std. Pass', 'Class 10', 2),
	  	('7th Std. Pass', 'Class 10', 2),
	  	('8th Std. Pass', 'Class 10', 2),
	  	('9th Std. Pass', 'Class 10', 2),
	  	('10th Std. Pass', 'Class 10', 2),
	  	('11th Std. Pass', 'Class 12 / Diploma', 3),
	  	('12th Std. Pass', 'Class 12 / Diploma', 3)
	  	) as t (education, edu, edu_level)
)
select
	HH_ID,
	REF_PERIOD_WAVE,
	count(*) as Tot_N,
	sum((AGE_YRS >= 60)::integer) as Old_N,
	case 
		when max(edu_level) = 1 then 'None or Primary'
		when max(edu_level) = 2 then 'Class 10'
		when max(edu_level) = 3 then 'Class 12 / Diploma'
	end max_hh_edu
from
	input_table it
	left join temp_equivalence te on it.EDUCATION = te.education
group by
	HH_ID, REF_PERIOD_WAVE;

Obviously, you would need to complete all education level equivalences explicitly because you can't use the .default wild card with this approach and maybe the SQL dialect is a little different for MySQL, since I'm using Postgresql dialect here (sorry I haven't worked with MySQL in years), but I hope you get the general idea.

Dear andresrcs,

Thank you for your reply.

I have a different but related query now.

How do you know that it is more efficient to write a pure SQL query than to mix R and SQL?

In particular, when we have millions of rows, is MySQL superior to R? I have a particular query: How does MySQL manage datasets bigger than the RAM?

Thank you!

Because to work with R you have to retrieve chunks of data from the MySQL server, load them into memory, process them locally and write back into the server. Whereas when working with SQL the whole process is done in the server without moving the data around so your RAM would be no longer a concern and "manual chunking" not needed (the SQL server has an optimized way of allocating memory), also, SQL is specifically designed for this kind of process so it is optimized for it, more over, depending on how the table has being defined, it might have indexes in place which can speed up the process even further.

In general RDBMS are more efficient processing queries than a less specialized programming languages like R, however, not all RDBMS are the same, MySQL is not the fastest out there, but it is still faster than R for this task.

1 Like

Hello,

Thank you for that explanation. I have one last request : Where can I read more about how SQL - perhaps MySQL - does this better than R? Is there a book I can read which talks about how SQL manages RAM / disk space ? and how this is superior to manually chunking and doing it in R? How did you learn this ? What should I be googling for ? Any specific terms or references to this?

Thank you.

To be honest this is something I have been told everywhere while learning data science related skills and it just makes sense intuitively so I never bothered trying to investigate the internals of a RDBMS.

The details of how a RDBMS works internally are way to much technical and complex for me but I think this article gives you an at least reasonably understandable overview.

the intuition is that different media operate at different speeds
roughly speaking

  • -> fast
  • RAM
  • DISK
  • NETWORK
  • ->slow

its common for a corporate RDBMS to hold commonly hit tables in RAM.
typically a well resourced RDBMS will have the cpu resources to access the data in its RAM and compute results 'fast'.
If you pull a large volume of data out of the RDBMS to be processed by your local cpus, you have overhead of the Network transfer time. And for what benefit ? so you can load it to your RAM and process it with your laptops CPU's rather the RDBMS server CPUs ?

Is this guaranteed ? no.

You can imagine an ancient 20year old server, which a modern laptop might beat in a race, (at a given network transfer speed), its just not something to expect as a normal matter of course.

This is rule of thumb stuff, but hopefully the intuition makes sense.

Dear andresrcs and nirgrahamuk,

Thank you for all your inputs.

Best Regards.