Fill null columns dynamically in dataframe using pyspark


I have a dataframe with 3 columns, and column3 may contain nulls. The dataframe has 2 million records in total.
I need to fill these null values with a value from a MySQL database (essentially calling a function that returns a value). I could loop over each row, but that would be very time-consuming given the amount of data.
How can I achieve this? I know how to fill nulls with a static value, but here the fill value is completely dynamic.
Thanks for the help
Regards,
Robin






You want to replace the nulls with one single value returned by your function, right? Or am I interpreting the question incorrectly?

– Ankit Kumar Namdeo
Sep 8 '18 at 14:21






The function would return a different value each time it is called. It would return a sequence, and I would need to assign one value to each row.

– Python Newbie
Sep 8 '18 at 18:48






Please read how to create good reproducible Apache Spark DataFrame examples.

– pault
Sep 8 '18 at 19:23




2 Answers



If I understand your question correctly, you want a unique value in the column wherever there was a null before. One possible approach is the following code, which checks the value column for nulls: where it finds a null, it substitutes monotonically_increasing_id(); otherwise the original value is kept.

from pyspark.sql.functions import col, when, monotonically_increasing_id

test_df = spark.createDataFrame([
    ('a', '2018'),
    ('b', None),
    ('c', None)
], ("col_name", "value"))

test_df.withColumn(
    "col3",
    when(col("value").isNull(), monotonically_increasing_id()).otherwise(col("value"))
).show(truncate=False)



Result:


+--------+-----+------------+
|col_name|value|col3 |
+--------+-----+------------+
|a |2018 |2018 |
|b |null |403726925824|
|c |null |609885356032|
+--------+-----+------------+



PS: For future questions, it would help if you included a sample of your data set and your desired output. That often makes the problem easier to understand.






Thanks, I will try to format and also post an example; your explanation helped me find the solution. However, row_number() didn't work for me (I'm using 2.3.0); perhaps the name differs in upper or lower case, and I tried that as well. I am going to use monotonically_increasing_id() to get incremental values for each row. One last question: I don't want to create a new column, I want to update the existing column.

– Python Newbie
Sep 11 '18 at 9:57




For the above test case, I would update the existing value column for just the two null rows with the command below: test_df.withColumn("value", when(col("value").isNull(), monotonically_increasing_id()).otherwise(col("value"))).show(truncate=False)



Thanks for all the comments and help.



