Spark Structured Streaming using S3 as data source
I have data that is continuously pushed to an S3 bucket.
I want to set up a structured streaming application that uses the S3 bucket as the data source.
My question is: if the application goes down for some reason, will restarting it continue processing data from S3 where it left off?
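For the file source, Structured Streaming records which input files it has already processed in the query's checkpoint, so a restarted query that reuses the same checkpoint location continues from where the previous run stopped. A minimal PySpark sketch, assuming JSON files land under a hypothetical s3a://my-bucket/incoming/ prefix and that the S3A connector and credentials are already configured:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("s3-file-stream").getOrCreate()

# Streaming file sources need an explicit schema (hypothetical fields here).
schema = (StructType()
          .add("event_id", StringType())
          .add("payload", StringType())
          .add("event_time", TimestampType()))

events = (spark.readStream
          .schema(schema)
          .json("s3a://my-bucket/incoming/"))   # hypothetical bucket/prefix

query = (events.writeStream
         .format("parquet")
         .option("path", "s3a://my-bucket/output/")
         # The checkpoint records which input files have already been
         # processed; a restart with the same location resumes from there.
         .option("checkpointLocation", "s3a://my-bucket/checkpoints/s3-file-stream/")
         .start())

query.awaitTermination()
```

The caveat is that the file source treats each file as an atomic unit: a file is either fully processed or not, so there is no notion of a byte offset inside a partially read file.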
You would need to save the files that you've read already if you want processing to continue after failure. Depending on what you are doing with the files, you also need to save the byte offset within that file. Why not just use a Lambda S3 event handler?
– cricket_007
Aug 26 at 8:52
I am planning to stream data from multiple S3 buckets and do a stream-stream join, so a simple Lambda handler won't work. About the byte offset: how do I let Spark use that info?
– Sherif Hamdy
Aug 26 at 9:25
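A rough sketch of what joining two S3 file-source streams directly in Structured Streaming could look like; the bucket names, column names, watermarks, and join window are all hypothetical placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("s3-stream-stream-join").getOrCreate()

# Hypothetical schemas for the two buckets; both need explicit schemas.
impression_schema = (StructType()
                     .add("impression_ad_id", StringType())
                     .add("impression_time", TimestampType()))
click_schema = (StructType()
                .add("click_ad_id", StringType())
                .add("click_time", TimestampType()))

impressions = (spark.readStream.schema(impression_schema)
               .json("s3a://bucket-a/impressions/")
               .withWatermark("impression_time", "1 hour"))

clicks = (spark.readStream.schema(click_schema)
          .json("s3a://bucket-b/clicks/")
          .withWatermark("click_time", "2 hours"))

# Inner stream-stream join with a time-range condition so Spark can
# eventually drop old state on both sides.
joined = impressions.join(
    clicks,
    expr("""
        click_ad_id = impression_ad_id AND
        click_time BETWEEN impression_time AND impression_time + interval 1 hour
    """))

query = (joined.writeStream
         .format("parquet")
         .option("path", "s3a://bucket-a/joined/")
         .option("checkpointLocation", "s3a://bucket-a/checkpoints/join/")
         .start())
```

As above, the checkpoint is what lets a restarted query resume both input streams without reprocessing files it has already consumed.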
You should be able to use Lambda... S3 PUT Event -> Parse File -> Write to Kinesis/Kafka. Then do your stream-stream join on that. In any case, something like wholeTextFiles could help you out on the Spark side.
– cricket_007
Aug 26 at 9:45
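A minimal sketch of the Lambda route, assuming the kafka-python client, a hypothetical topic name s3-events, and newline-delimited JSON objects in each file; the real parsing step depends on the file format:

```python
import json
import os

import boto3
from kafka import KafkaProducer  # kafka-python; any producer client works

s3 = boto3.client("s3")
producer = KafkaProducer(
    bootstrap_servers=os.environ["KAFKA_BOOTSTRAP_SERVERS"],  # hypothetical env var
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def handler(event, context):
    # Triggered by an S3 PUT event; each record names one new object.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        # Hypothetical parsing step: one JSON document per line.
        for line in body.splitlines():
            if line.strip():
                producer.send("s3-events", json.loads(line))

    producer.flush()
```

wholeTextFiles is the batch RDD API (sc.wholeTextFiles), which returns each file as a (path, content) pair and is handy when a file has to be parsed as a whole rather than line by line. If the records end up in Kafka, the Structured Streaming Kafka source checkpoints its consumer offsets, so the restart-and-resume question is handled the same way as with the file source.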