Spark Structured Streaming using S3 as data source




I have data that is continuously pushed to an S3 bucket.
I want to set up a structured streaming application that uses the S3 bucket as the data source.



My question is: if the application goes down for some reason, will restarting it continue processing data from S3 where it left off?
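A minimal sketch of the setup described above, assuming JSON files land under a hypothetical s3a://my-bucket/incoming/ prefix (bucket, prefix, and schema are illustrative only). The streaming file source needs an explicit schema, and the checkpointLocation is where the query records which files it has already processed, so a restart that reuses the same checkpoint picks up from there:

# Sketch only: Structured Streaming reading files from S3 (names/paths are assumptions)
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("s3-source-stream").getOrCreate()

# Streaming file sources require the schema up front.
schema = StructType([
    StructField("id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("payload", StringType()),
])

events = (spark.readStream
          .schema(schema)
          .json("s3a://my-bucket/incoming/"))      # hypothetical source prefix

query = (events.writeStream
         .format("parquet")
         .option("path", "s3a://my-bucket/output/")             # hypothetical sink path
         .option("checkpointLocation", "s3a://my-bucket/chk/")  # progress log lives here
         .start())

query.awaitTermination()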





You would need to save the files that you've read already if you want processing to continue after failure. Depending on what you are doing with the files, you also need to save the byte offset within that file. Why not just use a Lambda S3 event handler?
– cricket_007
Aug 26 at 8:52






I am planning to stream data from multiple S3 buckets and do a stream-stream join, so a simple Lambda handler won't work. About the byte offset: how do I let Spark use that info?
– Sherif Hamdy
Aug 26 at 9:25
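For the stream-stream join mentioned above, a hedged sketch along these lines might apply, assuming two hypothetical bucket prefixes with distinct column names; Structured Streaming needs watermarks and a time-range condition so the join state can be bounded:

# Sketch only: stream-stream join over two S3 file sources (paths/columns are assumptions)
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("s3-stream-join").getOrCreate()

schema_a = StructType([StructField("key_a", StringType()),
                       StructField("ts_a", TimestampType()),
                       StructField("value_a", StringType())])
schema_b = StructType([StructField("key_b", StringType()),
                       StructField("ts_b", TimestampType()),
                       StructField("value_b", StringType())])

# Each bucket/prefix becomes its own streaming DataFrame, with an event-time watermark.
left = (spark.readStream.schema(schema_a).json("s3a://bucket-a/incoming/")
        .withWatermark("ts_a", "10 minutes"))
right = (spark.readStream.schema(schema_b).json("s3a://bucket-b/incoming/")
         .withWatermark("ts_b", "10 minutes"))

# Inner join on the key, constrained to a time window so old state can be dropped.
joined = left.join(
    right,
    expr("key_a = key_b AND "
         "ts_b BETWEEN ts_a - interval 5 minutes AND ts_a + interval 5 minutes"))

query = (joined.writeStream
         .format("parquet")
         .option("path", "s3a://bucket-out/joined/")
         .option("checkpointLocation", "s3a://bucket-out/chk/")
         .start())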






You should be able to use Lambda... S3 PUT Event -> Parse File -> Write to Kinesis/Kafka. Then do your stream-stream join on that. In any case, something like wholeTextFiles could help you out on the Spark side.
– cricket_007
Aug 26 at 9:45
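A rough sketch of the pipeline that comment describes: an S3 PUT notification triggers a Lambda that forwards the bucket/key to a Kinesis stream, which Spark can then consume. The stream name and record layout here are assumptions, not anything prescribed above:

# Sketch only: S3 PUT event -> Lambda -> Kinesis (stream name and fields are assumptions)
import json
import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "s3-file-events"  # hypothetical stream name

def handler(event, context):
    # An S3 notification may carry several records in one invocation.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        kinesis.put_record(
            StreamName=STREAM_NAME,
            Data=json.dumps({"bucket": bucket, "key": key}),
            PartitionKey=key,
        )

On the Spark side, each referenced object could then be fetched in a batch step; sc.wholeTextFiles(path) returns (filename, content) pairs, which is what the comment alludes to.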











