How to get hive to recognise new partitions created with noctua and dbWriteTable

dougfir · March 25, 2020, 10:54pm

I am using the noctua pakage to send data to a Hive table that is stored on AWS S3.

con_s3 <- DBI::dbConnect(noctua::athena(), s3_staging_dir = "s3://ourco-emr/tables/revenue_predictions.db/")

      dbWriteTable(conn = con_s3,
                   name = paste0("revenue_predictions.", game_name),
                   value = prediction_df, # my data frame
                   append = F,
                   overwrite = F, # don't have permissions to do this anyway
                   file.type = "parquet",
                   partition = c(year = yr, month = mt, day = d),
                   s3.location = paste0("s3://ourco-emr/tables/revenue_predictions.db/", game_name)
                   )

This creates new parquet files on S3 along with partitions. This code block runs daily as part of a cron job.

The set up I'm working on has 3 distinct entities:

An EMR cluster where hive lives and is controlled by our data engineers
An S3 bucket which stores our Hive tables
Our analytics EC2 server where rstudio is hosted.

Each morning the code block above runs on the EC2 RStudio instance as part of a cron job. This sends the data (predictions created with R) to S3.

So far so good. The predictions appear as desired and correctly partitioned on S3. The trouble is, each morning after the job has run I need to get Hive to recognize the newly created partition for the previous day (partitions are year, month and day).

I do this by logging into EMR in the terminal and opening hive then I run MSCK repair table [tablename]. After I do this the hive table detects the newly added data and all is good.

An alternative to the MSCK command that I learned was to run alter table like so:

ALTER TABLE schemaname.tablename ADD IF NOT EXISTS
  PARTITION (year='2020', month='03', day='24')
  location 's3://ourco-emr/tables/adhoc.db/mtcars/2020/03/24

My question is, is there some way to automate this last step as part of my cron job within R?r ggplo

On the EC2 server I have odbc connections with both S3 and Hive. However, looking at the documentation for DBI, I can only run dbWriteTable and dbSendQuery and these are for select queries only.

Some fall back solutions:

Create a shell script on the emr and run it every e.g. 30 minutes with the hive command MSCK repair table [tablename]. Highly un-elegeant.
Create empty partitions on hive till e.g. the end of the year and run MSCK repair table [tablename] ahead of time to get hive to recognize all partitions till the end of the year. Then come Jan 1st just repeat. Also un-elegant.

Preference would be to get hive to recognize the partitions right after they are created. Is there a way of doing this using my existing Hive or S3 odbc connections on the EC2 server?

larefly · March 26, 2020, 12:09am

Hi @dougfir,

When dbWriteTable is called, noctua runs MSCK REPAIR TABLE once data has been sent to S3. This is to update the DDL so that AWS Athena can pick up the new partition. You shouldn't need to run MSCK REPAIR TABLE after dbWriteTable is called.

I am not fully familiar with your set up, does AWS Athena link into your hive metadata store or does it connect to AWS Glue?

I am sightly confused why you have append = F as this will create a new DDL for your table in AWS Athena. If you are adding more data to an existing table shouldn't append = T?

If you are needing to updated a large table with multiple partitions then ALTER TABLE would be preferred as MSCK REPAIR TABLE would run through every partition and could be slow. I am happy to change this to speed up the call.

What version of noctua are you running? (This is so that I can give the best advice depending on the latest changes that are happening to the package)

If you are able, can you let me know what happens with the dev version of the package:

remotes::install_github("dyfanjones/noctua")

larefly · March 26, 2020, 12:17am

@dougfir I have raised and issue on noctua's github link, please can you double check if I have capture the issue you are highlighting here.

dougfir · March 26, 2020, 1:30am

Hi @larefly thanks for your swift response.

'I am sightly confused why you have append = F'.

I'm glad you bring this up because I've been spinning my wheels a bit here. I have a table of the form id | day_30_prediction | day_90_prediction | day_180_prediction

I have 3 models in r that I use to predict revenue for a given id. I predict revenue at 3 different time horizons: day 30, day 90 and day 180. I run each model in a loop and send to S3 using your package column by column for each id. I.e. I add a new column (as opposed to new row) on each iteration: first iteration is the only iteration to add a new row, for each id, day 30 predictions. The second iteration adds day 90 predictions for the same field id and then the 3rd iteration adds the day 180 predictions for the same field id. This seems to work, at the end of the loop I have a table I can access in hive that shows the 3 predictions for each id.

Sometimes what happens is that it seems that only the most recent predictions, day 180 are preserved. It looks like each time the loop runs it over writes the day 30 and day 90 predictions. So I end up with id | null | null | 123 as opposed to id | 30 | 75 | 123.

This happens sometimes and I do not know why. I had wondered if it was to do with the append settings. Perhaps I should just assemble all the predictions into a single data frame and then send in a oner as opposed to adding each column on each loop iteration? Just, the way I built my code it would be nice if I could add it column by column and it does seem to work fine like that, but sometimes I get this issue and that's why I was experimenting with append = F/T (I only changed it to F this morning and would have got a nasty surprise tomorrow when the cron runs, so I have changed it back to T now!)

If you are able, can you let me know what happens with the dev version of the package remotes::install_github("dyfanjones/noctua")

I have just installed and tried this out just now. I'm not sure which version I was on before but I only installed noctua recently, within the past week. I ran my predictions loop after using this new dev version (Assuming I call it in the usual way, with just noctua::athena()). The newly generated predictions are in S3 but not showing in Hive. If I run MSCK repair table manually I'm confident they will appear however it looks like something isn't working as expected since presumably the expectation is that I don't have to run MSCK manually. Let me know if there's any more info I can provide here?

I am not fully familiar with your set up, does AWS Athena link into your hive metadata store or does it connect to AWS Glue?

I'm not 100% sure but I can describe our set up that the engineers created. I create a table in Hive and add an s3 location when creating the table. I know that there is a link between Hive and Athena meta store I'm just not sure of the nature of it. After manually running MSCK repair table in Hive on our EMR, I notice several .txt and .txt.metastore files added to the S3 location. I'm happy to provide more info here if it helps. Sorry for not being more exact, I'm an analyst more than a data engineer and have only been figuring out how to use AWS S3 recently.

larefly · March 26, 2020, 9:19am

Hi @dougfir,

I think I know what is happening. You will need to chat to your engineers to double check this.

noctua is designed to work with AWS Athena. AWS Athena uses AWS Glue to store it's Data Catalog i.e. all it's database metadata and that.

Hive uses Hive Metadata Storage to do the same thing.

In your case what I believe this is happening:

You send data to AWS S3 using noctua, noctua then registers the table with AWS Athena. This however doesn't register with Hive. I believe this is where the issue lies.

I believe your engineers have set up your AWS Glue catalog to be updated with the Hive Metadata Store, however I don't believe they have set it up so that Hive Metadata Store is update when AWS Glue is update. This means you have to repair you table in Hive before Hive will see the new partition.

From a quick google it looks like your need to have AWS Glue to update Hive with a AWS Glue ETL Migration between the Hive Metastore and the AWS Glue Data Catalog.

For your models do you upload them to S3 individually? Is it possible to merge the results together in R and then send it to AWS S3 in 1 go?

dougfir · March 26, 2020, 1:20pm

Hi @larefly thanks for the info. I am going to check with our engineers today the nature of our metastore / catalog, I"ll update here when I find out.

Is it possible for me to run ALTER TABLE... PARTITION = ... like in the example above using noctua and DBI with dbSendQuery? Do I have any other options here?

For your models do you upload them to S3 individually? Is it possible to merge the results together in R and then send it to AWS S3 in 1 go?

Yes, it is possible. I will change that.

Thanks for your help!

I will update you here when I find out about our metastore catalog.

larefly · March 26, 2020, 1:36pm

I have created a branch in noctua for dbWriteTable to utilise ALTER TABLE instead of MSCK REPAIR TABLE when appending new partitions. Currently passing unit tests. If requested I can create a basic wrapper to help write ALTER TABLE/ MSCK REPAIR.

Yes you can use dbExecute/dbSendQuery to send MSCK REPAIR TABLE and ALTER TABLE. Keep in mind that noctua is making the connection to AWS Athena and not HIVE. So AWS Athena will get the update but I don't think Hive will.

dougfir · March 26, 2020, 1:58pm

Hi Larefly. OK. Thank you! Sounds like I really need to find out the nature of our metastore and catalog for this to work anyway.

Currently passing unit tests. If requested I can create a basic wrapper to help write ALTER TABLE/ MSCK REPAIR

This is great! Let me know when I can try this out and how to tell noctua to use alter table as opposed to msck, which I presume is more efficient than checking the entire table.

Thanks again for your help!

dougfir · March 26, 2020, 2:04pm

Just thinking aloud here. I also have an odbc connection to hive. I wonder if I'm able t run msck repair table or alter table in Hive with dbExcecute/dbSendQuery I can update the hive meta store, assuming that's what we use. I'll try that out...

larefly · March 26, 2020, 2:09pm

@dougfir, yes you should be able to repair the table with your odbc connection. That might be a better work around without the requirement of extra engineering work.

system · April 16, 2020, 2:09pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.