How to connect to redshift data using Spark on Amazon EMR cluster

How to connect to redshift data using Spark on Amazon EMR cluster



I have an Amazon EMR cluster running. If I do


ls -l /usr/share/aws/redshift/jdbc/



it gives me


RedshiftJDBC41-1.2.7.1003.jar
RedshiftJDBC42-1.2.7.1003.jar



Now, I want to use this jar to connect to my Redshift database in my spark-shell . Here is what I do -


jar


Redshift database


spark-shell


import org.apache.spark.sql._
val sqlContext = new SQLContext(sc)


val df : DataFrame = sqlContext.read
.option("url","jdbc:redshift://host:PORT/DB-name?user=user&password=password")
.option("dbtable","tablename")
.load()



and I get this error -


org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;



I am not sure if I am specifying the correct format while reading the data. I have also read that spark-redshift driver is available but I do not want to run spark-submit with extra JARS.


format


spark-redshift driver


spark-submit


extra JARS



How do I connect to redshift data from Spark-shell ? Is that the correct JAR to configure the connection in Spark ?




1 Answer
1



The error being generated is because you are missing the .format("jdbc") in your read. It should be:


.format("jdbc")


val df : DataFrame = sqlContext.read
.format("jdbc")
.option("url","jdbc:redshift://host:PORT/DB-name?user=user&password=password")
.option("dbtable","tablename")
.load()



By default, Spark assumes sources to be Parquet files, hence the mention of Parquet in the error.



You may still run into issues with classpath/finding the drivers, but this change should give you more useful error output. I assume that folder location you listed is in the classpath for Spark on EMR and those driver versions look to be fairly current. Those drivers should work.



Note, this will only work for reading from Redshift. If you need to write to Redshift your best bet is using the Databricks Redshift data source for Spark - https://github.com/databricks/spark-redshift.



Thanks for contributing an answer to Stack Overflow!



But avoid



To learn more, see our tips on writing great answers.



Required, but never shown



Required, but never shown




By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Popular posts from this blog

𛂒𛀶,𛀽𛀑𛂀𛃧𛂓𛀙𛃆𛃑𛃷𛂟𛁡𛀢𛀟𛁤𛂽𛁕𛁪𛂟𛂯,𛁞𛂧𛀴𛁄𛁠𛁼𛂿𛀤 𛂘,𛁺𛂾𛃭𛃭𛃵𛀺,𛂣𛃍𛂖𛃶 𛀸𛃀𛂖𛁶𛁏𛁚 𛂢𛂞 𛁰𛂆𛀔,𛁸𛀽𛁓𛃋𛂇𛃧𛀧𛃣𛂐𛃇,𛂂𛃻𛃲𛁬𛃞𛀧𛃃𛀅 𛂭𛁠𛁡𛃇𛀷𛃓𛁥,𛁙𛁘𛁞𛃸𛁸𛃣𛁜,𛂛,𛃿,𛁯𛂘𛂌𛃛𛁱𛃌𛂈𛂇 𛁊𛃲,𛀕𛃴𛀜 𛀶𛂆𛀶𛃟𛂉𛀣,𛂐𛁞𛁾 𛁷𛂑𛁳𛂯𛀬𛃅,𛃶𛁼

Edmonton

Crossroads (UK TV series)