How can I stream data from a Google PubSub topic into PySpark (on Google Cloud)
I have data streaming into a topic in Google PubSub. I can see that data using simple Python code:
from datetime import datetime
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# ... (subscription_name is defined elsewhere)

def callback(message):
    # message.data is bytes; decode before building the log line
    print(datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f") + ": message = '" + message.data.decode("utf-8") + "'")
    message.ack()

future = subscriber.subscribe(subscription_name, callback)
future.result()
The above Python code receives data from the Google PubSub topic (via the subscription subscription_name) and writes it to the terminal, as expected. I would like to stream the same data from the topic into PySpark (RDD or DataFrame), so I can do other streaming transformations such as windowing and aggregation in PySpark, as described here: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html.
That link documents other streaming sources (e.g. Kafka), but not Google PubSub. Is there a way to stream from Google PubSub into PySpark?
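For concreteness, the kind of windowed aggregation I have in mind looks like the sketch below (df is a hypothetical streaming DataFrame, and the timestamp and word column names are placeholders):
from pyspark.sql import functions as F

# df is a hypothetical streaming DataFrame with 'timestamp' and 'word' columns.
windowed_counts = (df
    .withWatermark("timestamp", "10 minutes")             # tolerate 10 min of late data
    .groupBy(F.window("timestamp", "5 minutes"), "word")  # 5-minute tumbling windows
    .count())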
2 Answers
You can use Apache Bahir, which provides extensions for Apache Spark, including a connector for Google Cloud Pub/Sub.
You can find an example from Google Cloud Platform that uses Spark on Kubernetes to compute word counts from a data stream received from a Google Cloud PubSub topic and writes the results to a Google Cloud Storage (GCS) bucket.
There's another example that deploys an Apache Spark streaming application on Cloud Dataproc and uses DStreams to process messages from Cloud Pub/Sub.
Those examples are all in Scala, whereas I am trying to accomplish this task using Python (PySpark), as I will be calling a number of functions that are already written in Python.
– Rahul Shetty
Sep 18 '18 at 5:29
You could write a language binding then, or use an existing one. For example, check the Python binding of this GitHub project, which enables Apache Spark streaming applications to consume messages from Google Pubsub from both Java and Python.
– Neri
Sep 19 '18 at 0:03
Thanks - I tried that but was getting some cryptic error messages. After poking around a bit, I saw that Kafka provides a better solution for what we're trying to do, so I decided to take that route.
– Rahul Shetty
Sep 23 '18 at 20:26
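For reference, the Kafka route in PySpark Structured Streaming looks roughly like the sketch below; the broker address and topic name are placeholders, and the spark-sql-kafka package must be on the classpath (e.g. via --packages):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KafkaStream").getOrCreate()

# Placeholder broker and topic - substitute your own.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker-host:9092")
      .option("subscribe", "my-topic")
      .load())

# Kafka delivers keys and values as binary; cast the payload to a string.
messages = df.selectExpr("CAST(value AS STRING) AS message")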
You can use Apache Beam: https://beam.apache.org/
Apache Beam has Python support for Cloud Pub/Sub: https://beam.apache.org/documentation/io/built-in/
There is a Python SDK: https://beam.apache.org/documentation/sdks/python/
And support for Spark: https://beam.apache.org/documentation/runners/capability-matrix/
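As a rough sketch of that approach (not from the original answer), a minimal Beam Python pipeline that reads from Pub/Sub, decodes each message, and applies a fixed window could look like this; the subscription path is a placeholder, and the pipeline needs the apache-beam[gcp] extra and streaming mode:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

# Placeholder subscription path - substitute your own project and subscription.
SUBSCRIPTION = "projects/my-project/subscriptions/my-subscription"

# ReadFromPubSub is an unbounded source, so the pipeline must run in streaming mode.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | "Read" >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)
     | "Decode" >> beam.Map(lambda data: data.decode("utf-8"))
     | "Window" >> beam.WindowInto(window.FixedWindows(60))  # 60-second windows
     | "Print" >> beam.Map(print))
In principle the same pipeline can then be submitted to the Spark runner referenced in the capability matrix above, though IO support varies by runner.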