How can I stream data from a Google PubSub topic into PySpark (on Google Cloud)

I have data streaming into a topic in Google PubSub. I can see that data using simple Python code:


from datetime import datetime

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
...

def callback(message):
    # message.data is bytes; decode it before concatenating with strings
    print(datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")
          + ": message = '" + message.data.decode("utf-8") + "'")
    message.ack()

future = subscriber.subscribe(subscription_name, callback)
future.result()



The above Python code receives data from the Google Pub/Sub subscription (subscription_name) and writes it to the terminal, as expected. I would like to stream the same data from the topic into PySpark (as an RDD or DataFrame), so I can do other streaming transformations such as windowing and aggregations in PySpark, as described here: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html.



That link has documentation for reading other streaming sources (e.g. Kafka), but not Google Pub/Sub. Is there a way to stream from Google Pub/Sub into PySpark?




2 Answers



You can use Apache Bahir, which provides extensions for Apache Spark, including a connector for Google Cloud Pub/Sub.



You can find an example from Google Cloud Platform that uses Spark on Kubernetes to compute word counts from a data stream received from a Google Cloud Pub/Sub topic, writing the result to a Google Cloud Storage (GCS) bucket.



There's another example that uses DStreams to deploy an Apache Spark streaming application on Cloud Dataproc and process messages from Cloud Pub/Sub.






Those examples are all in Scala, whereas I am trying to accomplish this task in Python (PySpark), as I will be calling a number of functions that are already written in Python.

– Rahul Shetty
Sep 18 '18 at 5:29







You could write a language binding then, or use an existing one. For example, check the Python binding of this GitHub project, which enables Apache Spark streaming applications to consume messages from Google Pub/Sub from Java and Python.

– Neri
Sep 19 '18 at 0:03






Thanks - I tried that but was getting some cryptic error messages. After poking around a bit, I saw that Kafka provides a better solution for what we're trying to do, so I decided to take that route.

– Rahul Shetty
Sep 23 '18 at 20:26



You can use Apache Beam: https://beam.apache.org/



Apache Beam has Python support for Cloud Pub/Sub: https://beam.apache.org/documentation/io/built-in/



There is a Python SDK: https://beam.apache.org/documentation/sdks/python/



And support for Spark: https://beam.apache.org/documentation/runners/capability-matrix/
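To make this concrete, here is a minimal sketch of a Beam streaming pipeline in Python that reads from Pub/Sub, decodes the payload, and counts messages in fixed 60-second windows. The subscription path is a placeholder you must replace, and the pipeline is only constructed when run() is called (e.g. under the DirectRunner for testing, or the SparkRunner per the capability matrix linked above).

```python
def decode_message(data: bytes) -> str:
    """Pub/Sub payloads arrive as bytes; decode them before processing."""
    return data.decode("utf-8")


def run(subscription: str) -> None:
    # Imports kept inside run() so the helper above is usable without Beam.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.transforms.window import FixedWindows

    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromPubSub(subscription=subscription)
            | "Decode" >> beam.Map(decode_message)
            | "Window" >> beam.WindowInto(FixedWindows(60))  # 60-second windows
            | "Pair" >> beam.Map(lambda msg: (msg, 1))
            | "Count" >> beam.CombinePerKey(sum)
            | "Print" >> beam.Map(print)
        )


# Example invocation (placeholder path — substitute your own project
# and subscription):
# run("projects/YOUR_PROJECT/subscriptions/YOUR_SUBSCRIPTION")
```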



