Spark Structured Streaming using S3 as data source




I have data that is continuously pushed to an S3 bucket.
I want to set up a structured streaming application that uses the S3 bucket as the data source.



My question is: if the application goes down for some reason, will restarting it continue processing data from S3 where it left off?
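A minimal sketch of the setup described above, assuming JSON files land under a hypothetical s3a://my-bucket/incoming/ prefix (bucket, prefix, and schema are illustrative only). The streaming file source needs an explicit schema, and the checkpointLocation is where the query records which files it has already processed, so a restart that reuses the same checkpoint picks up from there:

# Sketch only: Structured Streaming reading files from S3 (names/paths are assumptions)
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("s3-source-stream").getOrCreate()

# Streaming file sources require the schema up front.
schema = StructType([
    StructField("id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("payload", StringType()),
])

events = (spark.readStream
          .schema(schema)
          .json("s3a://my-bucket/incoming/"))      # hypothetical source prefix

query = (events.writeStream
         .format("parquet")
         .option("path", "s3a://my-bucket/output/")             # hypothetical sink path
         .option("checkpointLocation", "s3a://my-bucket/chk/")  # progress log lives here
         .start())

query.awaitTermination()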





You would need to save the files that you've read already if you want processing to continue after failure. Depending on what you are doing with the files, you also need to save the byte offset within that file. Why not just use a Lambda S3 event handler?
– cricket_007
Aug 26 at 8:52






I am planning to stream data from multiple S3 buckets and do a stream-stream join, so a simple Lambda handler won't work. About the byte offset: how do I let Spark use that info?
– Sherif Hamdy
Aug 26 at 9:25
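For the stream-stream join mentioned above, a hedged sketch along these lines might apply, assuming two hypothetical bucket prefixes with distinct column names; Structured Streaming needs watermarks and a time-range condition so the join state can be bounded:

# Sketch only: stream-stream join over two S3 file sources (paths/columns are assumptions)
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("s3-stream-join").getOrCreate()

schema_a = StructType([StructField("key_a", StringType()),
                       StructField("ts_a", TimestampType()),
                       StructField("value_a", StringType())])
schema_b = StructType([StructField("key_b", StringType()),
                       StructField("ts_b", TimestampType()),
                       StructField("value_b", StringType())])

# Each bucket/prefix becomes its own streaming DataFrame, with an event-time watermark.
left = (spark.readStream.schema(schema_a).json("s3a://bucket-a/incoming/")
        .withWatermark("ts_a", "10 minutes"))
right = (spark.readStream.schema(schema_b).json("s3a://bucket-b/incoming/")
         .withWatermark("ts_b", "10 minutes"))

# Inner join on the key, constrained to a time window so old state can be dropped.
joined = left.join(
    right,
    expr("key_a = key_b AND "
         "ts_b BETWEEN ts_a - interval 5 minutes AND ts_a + interval 5 minutes"))

query = (joined.writeStream
         .format("parquet")
         .option("path", "s3a://bucket-out/joined/")
         .option("checkpointLocation", "s3a://bucket-out/chk/")
         .start())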






You should be able to use Lambda... S3 PUT Event -> Parse File -> Write to Kinesis/Kafka. Then do your stream-stream join on that. In any case, something like wholeTextFiles could help you out on the Spark side.
– cricket_007
Aug 26 at 9:45
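A rough sketch of the pipeline that comment describes: an S3 PUT notification triggers a Lambda that forwards the bucket/key to a Kinesis stream, which Spark can then consume. The stream name and record layout here are assumptions, not anything prescribed above:

# Sketch only: S3 PUT event -> Lambda -> Kinesis (stream name and fields are assumptions)
import json
import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "s3-file-events"  # hypothetical stream name

def handler(event, context):
    # An S3 notification may carry several records in one invocation.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        kinesis.put_record(
            StreamName=STREAM_NAME,
            Data=json.dumps({"bucket": bucket, "key": key}),
            PartitionKey=key,
        )

On the Spark side, each referenced object could then be fetched in a batch step; sc.wholeTextFiles(path) returns (filename, content) pairs, which is what the comment alludes to.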











