Trying to get Athena to recognise json with noctua file.type

I get this error message when trying to send json to S3:

Error : Unable to append onto table: revenue_predictions.mtcars
org.apache.hadoop.hive.ql.io.orc.OrcSerde: Is currently not supported by noctua

I am using the noctua package to send data to S3. When I create the table in question in Hive, I can use e.g. the parquet file format:

CREATE EXTERNAL TABLE IF NOT EXISTS revenue_predictions.mtcars
(
mpg double,
cyl double
)

PARTITIONED BY (
 year string,
 month string,
 day string
)

stored as parquet

LOCATION
  's3://ourco-emr/tables/example.db/mtcars';

Then, I can send this to S3 using noctua:

dbWriteTable(conn = con_s3,
             name = "examples.mtcars",
             value = mtcars %>% select(mpg, cyl),
             append = TRUE,
             overwrite = FALSE,
             file.type = "parquet",
             partition = c(year = yr, month = mt, day = d),
             s3.location = "s3://ourco-emr/tables/example.db")

So far so good. If I run this I can see data in S3 using Athena or Hive.

I would like to change this to json format instead of parquet format. Noctua's dbWriteTable() does seem to offer this functionality with the file.type parameter, where I'd change the above call to file.type = "json".

To use json format, our engineers told me to change the Hive create table statement above to this:

CREATE EXTERNAL TABLE IF NOT EXISTS revenue_predictions.mtcars
(
mpg double,
cyl double
)

PARTITIONED BY (
 year string,
 month string,
 day string
)

row format serde
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'

stored as inputformat
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'

outputformat
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'

LOCATION
  's3://ourco-emr/tables/example.db/mtcars';

So, the line stored as parquet has been changed to:

row format serde
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'

stored as inputformat
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'

outputformat
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'

When I create this table and then send data to S3, the json does not seem to be detected.

In order to get Hive and Athena to recognise the json sent to S3 by noctua, how should I define the json format when creating the table? Searching turned up several differing methods; it seems it's not as straightforward as parquet, e.g. I cannot just write stored as json but instead have to refer to a serde. Having tried the above and failed, what's the correct way to create a table in Hive in json format that will recognise the json I send to S3 via noctua?

Put another way, I'm confused about how to define the json format when creating the table so that it can read the json sent by noctua.

Hi @dougfir,

When appending to an existing table, noctua does a check to see if the backend file format is compatible with the Athena/Hive DDL:

csv/tsv : org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
parquet : org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
json : org.apache.hive.hcatalog.data.JsonSerDe

This is why you are getting the error you have highlighted.

When uploading json to Athena/Hive, the backend file is uploaded in json lines format: http://jsonlines.org/. To my knowledge "org.apache.hive.hcatalog.data.JsonSerDe" will enable the Hive DDL to read json lines. However, I haven't tested the other possible json DDL formats. I will raise a ticket to see if I can extend the DDL creation to other serdes, if they support json line formatting.
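As a sketch, the table from the question could be declared with the Hive JSON SerDe instead of the ORC one, reusing the same schema, partitions, and S3 location (untested here, but this is the SerDe noctua expects for file.type = "json"):

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS revenue_predictions.mtcars
(
  mpg double,
  cyl double
)
PARTITIONED BY (
  year string,
  month string,
  day string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://ourco-emr/tables/example.db/mtcars';
```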

I also believe this could be down to a lack of documentation of how the DDLs are created. I will also improve the documentation so that users are fully aware of what the functions are doing in the backend.

In the meantime, it is possible to see what DDL is created when uploading data to Athena by using the DBI function sqlCreateTable, for example:

library(DBI)

# connection to athena
con <- dbConnect(noctua::athena())

# return ddl of mtcars in json format
sqlCreateTable(con, "mtcars", mtcars, file.type = "json")

CREATE EXTERNAL TABLE `default`.`mtcars` (
  `mpg` DOUBLE,
  `cyl` DOUBLE,
  `disp` DOUBLE,
  `hp` DOUBLE,
  `drat` DOUBLE,
  `wt` DOUBLE,
  `qsec` DOUBLE,
  `vs` DOUBLE,
  `am` DOUBLE,
  `gear` DOUBLE,
  `carb` DOUBLE
)
ROW FORMAT  serde 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://my/bucket/default/mtcars/'

I hope this helps.

When creating json DDLs, noctua follows: https://docs.aws.amazon.com/athena/latest/ug/json.html#hivejson. Currently the native Hive JSON SerDe is supported; however, I believe the OpenX JSON SerDe could be supported as well.
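Following the AWS documentation linked above, the OpenX variant would swap only the SerDe class; a sketch against the question's schema (untested with noctua's upload format):

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS revenue_predictions.mtcars
(
  mpg double,
  cyl double
)
PARTITIONED BY (
  year string,
  month string,
  day string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://ourco-emr/tables/example.db/mtcars';
```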

The new noctua branch supports both JSON SerDe Libraries.

remotes::install_github("dyfanjones/noctua", ref = "json_ddl")

I hope this helps.

Setting ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe' solved my problem. Thank you!

Thanks for this! I do not fully understand the nuances of the different json types. I will do some research on these.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.