uniformly partition a rdd in spark

I have a text file in HDFS, which has about 10 million records. I am trying to read the file do some transformations on that data. I am trying to uniformly partition the data before I do the processing on it. here is the sample code

var myRDD = sc.textFile("input file location") myRDD = myRDD.repartition(10000)

and when I do my transformations on this re-partitioned data, I see that one partition has abnormally large number of records and others have very little data. (image of the distribution)

So the load is high on only one executor
I also tried and got the same result

myRDD.coalesce(10000, shuffle = true)

is there a way to uniformly distribute records among partitions.

Attached is the shuffle read size/ number of records on that particular executor
the circled one has a lot more records to process than the others

any help is appreciated thank you.

1 Answer
1

To deal with the skew, you can repartition your data using distribute by(or using repartition as you used). For the expression to partition by, choose something that you know will evenly distribute the data.

You can even use the primary key of the DataFrame(RDD).

Even this approach will not guarantee that data will be distributed evenly between partitions. It all depends on the hash of the expression by which we distribute.
Spark : how can evenly distribute my records in all partition

Salting can be used which involves adding a new "fake" key and using alongside the current key for better distribution of data.
(here is link for salting)

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

搜尋此網誌

Dfyjkt