Fill null columns dynamically in dataframe using pyspark

I have a DataFrame with three columns, and the third column (column3) may contain nulls. The DataFrame has 2 million records in total.
I need to fill these nulls with values from a MySQL database (basically by calling a function that returns a value). I could loop over each row, but that would be very time consuming given the amount of data.
How can I achieve this? I know how to fill the nulls with a static value, but here the value is completely dynamic.
Thanks for the help
Regards,
Robin

Do you want to replace the nulls with one single value returned by your function, or am I interpreting the question incorrectly?

– Ankit Kumar Namdeo
Sep 8 '18 at 14:21

The function would return a different value each time it's called. It would return a sequence, and I would need to assign those values to each row.

– Python Newbie
Sep 8 '18 at 18:48

Please read how to create good reproducible apache spark dataframe examples.

– pault
Sep 8 '18 at 19:23

2 Answers

If I understand your question correctly, you want a unique value in a column wherever it currently holds a null. One possible method is the following code, which checks for nulls in the value column. Where it finds a null, it uses monotonically_increasing_id() to replace it; otherwise the original value is kept.

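    from pyspark.sql.functions import col, monotonically_increasing_id, when

    test_df = spark.createDataFrame(
        [('a', '2018'), ('b', None), ('c', None)],
        ("col_name", "value"))

    # row_number() is a window function and needs .over(<window spec>);
    # monotonically_increasing_id() does not, and it matches the result below.
    test_df.withColumn(
        "col3",
        when(col("value").isNull(), monotonically_increasing_id())
        .otherwise(col("value"))
    ).show(truncate=False)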

Result:

    +--------+-----+------------+
    |col_name|value|col3        |
    +--------+-----+------------+
    |a       |2018 |2018        |
    |b       |null |403726925824|
    |c       |null |609885356032|
    +--------+-----+------------+
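The large values in col3 are expected for monotonically_increasing_id(). According to the Spark documentation, the current implementation puts the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. As a small plain-Python sketch (the helper name decode_id is made up for illustration and is not part of PySpark), the two generated values above can be split apart like this:

```python
# Bit layout as described in the Spark docs for monotonically_increasing_id():
# upper 31 bits = partition ID, lower 33 bits = record number in that partition.
RECORD_BITS = 33


def decode_id(generated_id):
    """Hypothetical helper: split a generated ID into (partition_id, record_number)."""
    partition_id = generated_id >> RECORD_BITS
    record_number = generated_id & ((1 << RECORD_BITS) - 1)
    return partition_id, record_number


print(decode_id(403726925824))  # (47, 0)
print(decode_id(609885356032))  # (71, 0)
```

So the IDs are unique and increasing, but not consecutive: rows b and c were simply the first records of two different partitions.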

PS: For future requests, it would be good if you could include a sample from your data set and your desired output. This often helps to understand the problem.

Thanks, I will try to format the question and post an example; your answer helped me find the solution. However, row_number() didn't work for me (I'm using 2.3.0). I thought the casing of row_number() might have changed, so I tried lowercase and uppercase variants as well. I am going to use monotonically_increasing_id() to get incremental values for each row. My last question: I don't want to create a new column; I want to update the existing column.

– Python Newbie
Sep 11 '18 at 9:57

For the above test case, I would update the existing value column (only the two null rows change) with the command below:

    test_df.withColumn(
        "value",
        when(col("value").isNull(), monotonically_increasing_id())
        .otherwise(col("value"))
    ).show(truncate=False)

Thanks for all the comments and help.
