How to parse a JSON object column inside a CSV using Spark SQL
I am facing an issue when reading and parsing a CSV file. One of its columns, named CONTENT, holds a JSON object. The file looks like this:
TEMPLATE_SECTION_ID,BUSINESS_ID,TEMPLATE_ID,SECTION_NAME,CONTENT,VERSION,CREATED_DATE,CREATED_BY_ID,MODIFIED_DATE,MODIFIED_BY_ID
"1234577d1c74680083",12345,"12345e401477d1c7422007b","BONUS_SECTION","{"groupBy":"TOTAL","showAllItems":true,"labels":[{"name":"Released","label":"Released"},{"name":"Held","label":"Held"},{"name":"Holds Released","label":"Holds Released"}]}",2,"2018-01-1 01:00:00.0",12345678,"2018-01-01 01:00:00.0",12345678
I've tried to read it using:
Dataset<Row> df2 = spark.read()
    .option("multiLine", "true")
    .option("inferSchema", "true")
    .option("header", "true")
    .format("csv")
    .load("/xc_inct_stmt_tmpl_section_header.csv")
    .toDF();
df2.show();
Output is:
+--------------------+-----------+--------------------+-------------+-------------------+-------------------+--------------------+-------------------+--------------+---------------+
| TEMPLATE_SECTION_ID|BUSINESS_ID|         TEMPLATE_ID| SECTION_NAME|            CONTENT|            VERSION|        CREATED_DATE|      CREATED_BY_ID| MODIFIED_DATE| MODIFIED_BY_ID|
+--------------------+-----------+--------------------+-------------+-------------------+-------------------+--------------------+-------------------+--------------+---------------+
|8a7abc10477ba1e40...|       4549|8a7abc10477ba1e40...|BONUS_SECTION|"{"groupBy":"TOTAL"|"showAllItems":true|"labels":["label":"Released"|"name":"Held"|
+--------------------+-----------+--------------------+-------------+-------------------+-------------------+--------------------+-------------------+--------------+---------------+
Expected output is:
+-------------------+-----------+--------------------+-------------+--------------------+-------+-------------------+-------------+-------------------+--------------+
|TEMPLATE_SECTION_ID|BUSINESS_ID|         TEMPLATE_ID| SECTION_NAME|             CONTENT|VERSION|       CREATED_DATE|CREATED_BY_ID|      MODIFIED_DATE|MODIFIED_BY_ID|
+-------------------+-----------+--------------------+-------------+--------------------+-------+-------------------+-------------+-------------------+--------------+
| 1234577d1c74680083|      12345|12345e401477d1c74...|BONUS_SECTION|"{"groupBy":"TOTA...|      2|2018-01-01 02:00:00|     12345678|2018-01-01 02:00:00|      12345678|
+-------------------+-----------+--------------------+-------------+--------------------+-------+-------------------+-------------+-------------------+--------------+
I have tried all sorts of available options, such as quote, quoteMode, delimiter, parserLib, etc.
My environment: Spark 2.2.1 with Java 1.8.
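One workaround that might apply here, since the quotes inside CONTENT are not escaped and a standard CSV parser therefore splits the JSON at its commas: split each raw line by position instead. The sketch below is plain Java, not a Spark API. The class name `JsonColumnCsvSplitter` and its methods are hypothetical, and it assumes every record sits on one line, that CONTENT is the only field that can contain commas, and that the number of columns before it (4) and after it (5) is fixed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Spark): splits a line whose middle field
// may contain unescaped quotes and commas, by counting commas from both ends.
class JsonColumnCsvSplitter {

    // Returns before + 1 + after fields. Everything between the first `before`
    // commas and the last `after` commas is kept as a single field.
    static List<String> split(String line, int before, int after) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        // Consume the leading columns from the left.
        for (int i = 0; i < before; i++) {
            int comma = line.indexOf(',', start);
            if (comma < 0) {
                throw new IllegalArgumentException("Too few columns: " + line);
            }
            fields.add(unquote(line.substring(start, comma)));
            start = comma + 1;
        }
        // Consume the trailing columns from the right.
        List<String> tail = new ArrayList<>();
        int end = line.length();
        for (int i = 0; i < after; i++) {
            int comma = line.lastIndexOf(',', end - 1);
            if (comma < start) {
                throw new IllegalArgumentException("Too few columns: " + line);
            }
            tail.add(0, unquote(line.substring(comma + 1, end)));
            end = comma;
        }
        // Whatever remains in the middle is the JSON column.
        fields.add(unquote(line.substring(start, end)));
        fields.addAll(tail);
        return fields;
    }

    // Strips one pair of surrounding double quotes, if present.
    static String unquote(String s) {
        String t = s.trim();
        if (t.length() >= 2 && t.startsWith("\"") && t.endsWith("\"")) {
            return t.substring(1, t.length() - 1);
        }
        return t;
    }
}
```

Calling `JsonColumnCsvSplitter.split(line, 4, 5)` on the data row above should yield ten fields, with the whole JSON object at index 4. The fields could then be turned into rows, and the CONTENT string handed to a JSON parser or to Spark's `from_json` or `get_json_object` functions. If the file can instead be regenerated with the inner quotes doubled (`""`), setting the CSV reader's `escape` option to `"` is likely the simpler route.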