Different default persist for Rdd and Dataset

Different default persist for Rdd and Dataset



I was trying to find a good answer why default persist for RDD is MEMORY_ONLY and for Dataset MEMORY_AND_DISK. But couldnt find it. I am wondering if any of you know a good reason behind?



Thanks




2 Answers
2



For rdd the default storage level for persist api is MEMORY and for dataset is MEMORY_AND_DISK



Please check the below



[SPARK-3824][SQL] Sets in-memory table default storage level to MEMORY_AND_DISK



As mentioned by @user6910411 "Spark SQL currently uses MEMORY_ONLY as the default format. Due to the use of column buffers however, there is a huge cost to having to recompute blocks, much more so than Spark core." i.e dataset/dataframe apis use column buffers to store the column datattype and column details about the raw data so in case while caching the data does not fit in to memory then it will not cache the rest of the partition and will recompute whenever needed.So in the case of dataset/dataframe the recomputation cost is more compared to rdd due to its columnar structure.So the default persist option changed to MEMORY_AND_DISK so that the blocks that does not fit in to memory will spill to disk and it will retrieved from disk whenever needed rather than recomputing next time.



Simply because MEMORY_ONLY is rarely useful - it is not that common in practice to have enough memory to store all required data, so you're often have to evict some of the blocks or cache data only partially.


MEMORY_ONLY



Compared to that DISK_AND_MEMORY evicts data to disk, so no cached block is lost.


DISK_AND_MEMORY



The exact reason behind choosing MEMORY_AND_DISK as a default caching mode is explained by, SPARK-3824 (Spark SQL should cache in MEMORY_AND_DISK by default):


MEMORY_AND_DISK


MEMORY_AND_DISK



Spark SQL currently uses MEMORY_ONLY as the default format. Due to the use of column buffers however, there is a huge cost to having to recompute blocks, much more so than Spark core. Especially since now we are more conservative about caching blocks and sometimes won't cache blocks we think might exceed memory, it seems good to keep persisted blocks on disk by default.



Thanks for contributing an answer to Stack Overflow!



But avoid



To learn more, see our tips on writing great answers.



Some of your past answers have not been well-received, and you're in danger of being blocked from answering.



Please pay close attention to the following guidance:



But avoid



To learn more, see our tips on writing great answers.



Required, but never shown



Required, but never shown




By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Popular posts from this blog

𛂒𛀶,𛀽𛀑𛂀𛃧𛂓𛀙𛃆𛃑𛃷𛂟𛁡𛀢𛀟𛁤𛂽𛁕𛁪𛂟𛂯,𛁞𛂧𛀴𛁄𛁠𛁼𛂿𛀤 𛂘,𛁺𛂾𛃭𛃭𛃵𛀺,𛂣𛃍𛂖𛃶 𛀸𛃀𛂖𛁶𛁏𛁚 𛂢𛂞 𛁰𛂆𛀔,𛁸𛀽𛁓𛃋𛂇𛃧𛀧𛃣𛂐𛃇,𛂂𛃻𛃲𛁬𛃞𛀧𛃃𛀅 𛂭𛁠𛁡𛃇𛀷𛃓𛁥,𛁙𛁘𛁞𛃸𛁸𛃣𛁜,𛂛,𛃿,𛁯𛂘𛂌𛃛𛁱𛃌𛂈𛂇 𛁊𛃲,𛀕𛃴𛀜 𛀶𛂆𛀶𛃟𛂉𛀣,𛂐𛁞𛁾 𛁷𛂑𛁳𛂯𛀬𛃅,𛃶𛁼

Edmonton

Crossroads (UK TV series)