uniformly partition a rdd in spark

uniformly partition a rdd in spark



I have a text file in HDFS, which has about 10 million records. I am trying to read the file do some transformations on that data. I am trying to uniformly partition the data before I do the processing on it. here is the sample code


var myRDD = sc.textFile("input file location")

myRDD = myRDD.repartition(10000)



and when I do my transformations on this re-partitioned data, I see that one partition has abnormally large number of records and others have very little data. (image of the distribution)



So the load is high on only one executor
I also tried and got the same result


myRDD.coalesce(10000, shuffle = true)



is there a way to uniformly distribute records among partitions.



Attached is the shuffle read size/ number of records on that particular executor
the circled one has a lot more records to process than the others



any help is appreciated thank you.




1 Answer
1



To deal with the skew, you can repartition your data using distribute by(or using repartition as you used). For the expression to partition by, choose something that you know will evenly distribute the data.



You can even use the primary key of the DataFrame(RDD).



Even this approach will not guarantee that data will be distributed evenly between partitions. It all depends on the hash of the expression by which we distribute.
Spark : how can evenly distribute my records in all partition



Salting can be used which involves adding a new "fake" key and using alongside the current key for better distribution of data.
(here is link for salting)






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Popular posts from this blog

𛂒𛀶,𛀽𛀑𛂀𛃧𛂓𛀙𛃆𛃑𛃷𛂟𛁡𛀢𛀟𛁤𛂽𛁕𛁪𛂟𛂯,𛁞𛂧𛀴𛁄𛁠𛁼𛂿𛀤 𛂘,𛁺𛂾𛃭𛃭𛃵𛀺,𛂣𛃍𛂖𛃶 𛀸𛃀𛂖𛁶𛁏𛁚 𛂢𛂞 𛁰𛂆𛀔,𛁸𛀽𛁓𛃋𛂇𛃧𛀧𛃣𛂐𛃇,𛂂𛃻𛃲𛁬𛃞𛀧𛃃𛀅 𛂭𛁠𛁡𛃇𛀷𛃓𛁥,𛁙𛁘𛁞𛃸𛁸𛃣𛁜,𛂛,𛃿,𛁯𛂘𛂌𛃛𛁱𛃌𛂈𛂇 𛁊𛃲,𛀕𛃴𛀜 𛀶𛂆𛀶𛃟𛂉𛀣,𛂐𛁞𛁾 𛁷𛂑𛁳𛂯𛀬𛃅,𛃶𛁼

How do I collapse sections of code in Visual Studio Code for Windows?

Node.js puppeteer - Use values from array in a loop to cycle through pages