How to connect to Redshift data using Spark on an Amazon EMR cluster




I have an Amazon EMR cluster running. If I run


ls -l /usr/share/aws/redshift/jdbc/



it gives me


RedshiftJDBC41-1.2.7.1003.jar
RedshiftJDBC42-1.2.7.1003.jar



Now I want to use this JAR to connect to my Redshift database from spark-shell. Here is what I do:




import org.apache.spark.sql._
val sqlContext = new SQLContext(sc)


val df : DataFrame = sqlContext.read
.option("url","jdbc:redshift://host:PORT/DB-name?user=user&password=password")
.option("dbtable","tablename")
.load()



and I get this error:


org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;



I am not sure whether I am specifying the correct format when reading the data. I have also read that a spark-redshift driver is available, but I do not want to run spark-submit with extra JARs.





How do I connect to Redshift data from spark-shell? Is that the correct JAR to configure the connection in Spark?




1 Answer



The error occurs because your read is missing .format("jdbc"). It should be:




val df : DataFrame = sqlContext.read
.format("jdbc")
.option("url","jdbc:redshift://host:PORT/DB-name?user=user&password=password")
.option("dbtable","tablename")
.load()



By default, Spark assumes the source is Parquet when no format is given (the default is controlled by the spark.sql.sources.default setting), hence the mention of Parquet in the error.



You may still run into classpath issues or problems locating the driver, but this change should give you more useful error output. I assume the folder you listed is on the classpath for Spark on EMR, and those driver versions look fairly current, so they should work.
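If the driver is not picked up automatically, one thing to try is naming the driver class explicitly through the JDBC source's "driver" option. The sketch below only builds the option map in plain Scala. The helper name redshiftJdbcOptions is made up for illustration, the host, port 5439, database, table, and credentials are placeholders, and the class name com.amazon.redshift.jdbc42.Driver is an assumption based on the RedshiftJDBC42 jar name, so check it against your driver version.

```scala
// Hypothetical helper: builds the option map for Spark's JDBC data source.
// Every argument value used below is a placeholder, not a real endpoint.
def redshiftJdbcOptions(
    host: String,
    port: Int,
    database: String,
    table: String,
    user: String,
    password: String
): Map[String, String] = Map(
  "url"      -> s"jdbc:redshift://$host:$port/$database",
  "dbtable"  -> table,
  // Credentials passed as separate options instead of inside the URL.
  "user"     -> user,
  "password" -> password,
  // Assumed class name for the RedshiftJDBC42 jar; verify for your version.
  "driver"   -> "com.amazon.redshift.jdbc42.Driver"
)

val opts = redshiftJdbcOptions("host", 5439, "dbname", "tablename", "user", "password")

// In spark-shell, the map can then be handed to the reader:
//   val df = sqlContext.read.format("jdbc").options(opts).load()
```

If Spark then reports that the class cannot be found, starting the shell with spark-shell --jars /usr/share/aws/redshift/jdbc/RedshiftJDBC42-1.2.7.1003.jar is one way to put the jar that is already on the cluster onto the classpath.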



Note that this will only work for reading from Redshift. If you need to write to Redshift, your best bet is the Databricks Redshift data source for Spark: https://github.com/databricks/spark-redshift


