Will Spark pick up new files from a directory once processing has started?




If I use


sc.textFile("/my/dir1")



to make RDDs for all files in a directory, and another application is already writing into it (so, if the processing takes long, new files will be added), will Spark also pick up the new files, or only those found at startup? (I'd really need the latter...)




1 Answer



The short answer is no. An RDD or DataFrame is an immutable data structure: once you have created it, there is no way to append to it.



When you read the data in a directory, Spark creates an RDD that keeps track of the partitions of the data it read. Because that RDD is immutable, Spark continues the execution with only the partitions found at startup.
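The snapshot behavior can be illustrated without Spark at all. A hypothetical sketch in plain Python: listing a directory once, like the file discovery `sc.textFile` does when the RDD is created, captures a fixed set of files; anything written afterwards is not part of that set.

```python
import os
import tempfile

# Set up a directory with two files already in it.
tmp = tempfile.mkdtemp()
for name in ("a.txt", "b.txt"):
    open(os.path.join(tmp, name), "w").close()

# Take a one-time listing -- analogous to the file set fixed at RDD creation.
snapshot = sorted(os.listdir(tmp))

# Another application writes a new file while "processing" is underway.
open(os.path.join(tmp, "c.txt"), "w").close()

print(snapshot)                 # still ['a.txt', 'b.txt']
print(sorted(os.listdir(tmp)))  # now ['a.txt', 'b.txt', 'c.txt']
```

The snapshot stays fixed even though the directory changed, which is exactly the behavior the asker says they need.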



The alternative is Spark Streaming, where new files are discovered as they are added to the directory.






Well, maybe you should look at the streaming option.

– Tomasz Krol
Sep 10 '18 at 16:18






I don't need the streaming option because I don't want it to pick up the new files. Thanks

– gotch4
Sep 11 '18 at 12:05



