How to parse a JSON object column inside a CSV using Spark SQL

I am facing an issue when reading and parsing a CSV file. One of its columns, CONTENT, contains JSON. The file looks like this:

TEMPLATE_SECTION_ID,BUSINESS_ID,TEMPLATE_ID,SECTION_NAME,CONTENT,VERSION,CREATED_DATE,CREATED_BY_ID,MODIFIED_DATE,MODIFIED_BY_ID
"1234577d1c74680083",12345,"12345e401477d1c7422007b","BONUS_SECTION",""groupBy":"TOTAL","showAllItems":true,"labels":["name":"Released","label":"Released","name":"Held","label":"Held","name":"Holds Released","label":"Holds Released"]",2,"2018-01-1 01:00:00.0",12345678,"2018-01-01 01:00:00.0",12345678

I've tried to read it using:

Dataset<Row> df2 = spark.read()
        .option("multiLine", "true")
        .option("inferSchema", "true")
        .option("header", "true")
        .format("csv")
        .load("/xc_inct_stmt_tmpl_section_header.csv")
        .toDF();
df2.show();

Output is:

+--------------------+-----------+--------------------+-------------+-------------------+-------------------+--------------------+-------------------+--------------+---------------+
| TEMPLATE_SECTION_ID|BUSINESS_ID|         TEMPLATE_ID| SECTION_NAME|            CONTENT|            VERSION|        CREATED_DATE|      CREATED_BY_ID| MODIFIED_DATE| MODIFIED_BY_ID|
+--------------------+-----------+--------------------+-------------+-------------------+-------------------+--------------------+-------------------+--------------+---------------+
|8a7abc10477ba1e40...|       4549|8a7abc10477ba1e40...|BONUS_SECTION|"{"groupBy":"TOTAL"|"showAllItems":true|"labels":["label":"Released"|"name":"Held"|
+--------------------+-----------+--------------------+-------------+-------------------+-------------------+--------------------+-------------------+--------------+---------------+

Expected output is:

+-------------------+-----------+--------------------+-------------+--------------------+-------+-------------------+-------------+-------------------+--------------+
|TEMPLATE_SECTION_ID|BUSINESS_ID|         TEMPLATE_ID| SECTION_NAME|             CONTENT|VERSION|       CREATED_DATE|CREATED_BY_ID|      MODIFIED_DATE|MODIFIED_BY_ID|
+-------------------+-----------+--------------------+-------------+--------------------+-------+-------------------+-------------+-------------------+--------------+
| 1234577d1c74680083|      12345|12345e401477d1c74...|BONUS_SECTION|"{"groupBy":"TOTA...|      2|2018-01-01 02:00:00|     12345678|2018-01-01 02:00:00|      12345678|
+-------------------+-----------+--------------------+-------------+--------------------+-------+-------------------+-------------+-------------------+--------------+

I have tried all sorts of available options, such as quote, quoteMode, delimiter, and parserLib, but none of them produced the expected output.

My environment: Spark 2.2.1 with Java 1.8.
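For illustration, here is a minimal plain-Java sketch of what I suspect is going on. It assumes the inner quotes of the JSON are written to the file unescaped, as the sample row suggests. The class CsvQuoteDemo and its helpers escapeCsvField and splitCsvLine are hypothetical names made up for this example, and the splitter is a simplified RFC 4180-style reader, not Spark's actual CSV parser, so the exact way the field breaks differs from the Spark output above. It only shows that a JSON field survives a quote-aware reader when its inner quotes are doubled, and does not when they are left as-is.

```java
import java.util.ArrayList;
import java.util.List;

class CsvQuoteDemo {

    /** Wraps a value in quotes and doubles every inner quote (RFC 4180 style). */
    public static String escapeCsvField(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    /**
     * Splits a single CSV line on commas. Inside a quoted field, "" is read
     * as one literal quote. Simplified: no multi-line fields, fixed delimiter.
     */
    public static List<String> splitCsvLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"' && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // doubled quote: keep one literal quote
                    i++;
                } else if (c == '"') {
                    inQuotes = false;    // closing quote of the field
                } else {
                    current.append(c);   // commas in here belong to the field
                }
            } else if (c == '"') {
                inQuotes = true;         // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical shortened row: an id, a JSON column, a version number.
        String json = "{\"groupBy\":\"TOTAL\",\"showAllItems\":true}";

        String unescapedRow = "id1,\"" + json + "\",2";           // like my file
        String escapedRow = "id1," + escapeCsvField(json) + ",2"; // quotes doubled

        System.out.println("unescaped: " + splitCsvLine(unescapedRow));
        System.out.println("escaped:   " + splitCsvLine(escapedRow));
    }
}
```

If the file can be re-exported with the inner quotes doubled like this, my understanding is that Spark's CSV reader can be told to accept that form by setting the escape option to a double quote, but I have not confirmed that this resolves the problem with my file.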
