PySpark - remove duplicates from dataframe keeping the last appearance
I'm trying to dedupe a Spark DataFrame, keeping only the latest appearance.
The duplicates are identified by three columns:
NAME
ID
DOB
I succeeded in Pandas with the following:
df_dedupe = df.drop_duplicates(subset=['NAME','ID','DOB'], keep='last', inplace=False)
But in spark I tried the following:
df_dedupe = df.dropDuplicates(['NAME', 'ID', 'DOB'], keep='last')
I get this error:
TypeError: dropDuplicates() got an unexpected keyword argument 'keep'
Any ideas?
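For context, here is a self-contained version of the pandas call that behaves as described; the sample data is invented for illustration:

```python
import pandas as pd

# Two 'Bob' rows share NAME/ID/DOB; keep='last' retains the later one
df = pd.DataFrame({
    'NAME': ['Bob', 'Alice', 'Bob'],
    'ID': ['10', '10', '10'],
    'DOB': ['1542189668', '1425298030', '1542189668'],
    'Height': ['0', '154', '178'],
})
df_dedupe = df.drop_duplicates(subset=['NAME', 'ID', 'DOB'], keep='last', inplace=False)
print(df_dedupe)  # Alice's row and the second Bob row (Height 178) remain
```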
pandas dataframe pyspark
asked Nov 13 '18 at 16:00 – David Kon
quynhcodes.wordpress.com/2016/07/29/…
– karma4917
Nov 13 '18 at 16:07
Based on the [documentation][1], dropDuplicates does not have a keep parameter in pyspark.
– Wen-Ben
Nov 13 '18 at 16:09
2 Answers
As you can see in the documentation at http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html, the function dropDuplicates(subset=None) only accepts a subset as a parameter. Why would you want to keep the last one, if they're all equal?
EDIT
As @W-B pointed out, you want the other columns as well. My solution would be to sort the original dataframe in reverse order, and use df_dedupe in an inner join on the three repeated columns to preserve only the last values:
df_dedupe.join(original_df,['NAME','ID','DOB'],'inner')
Because there are still other columns, and he needs the last value for those
– Wen-Ben
Nov 13 '18 at 16:11
Oh, I understand. I will edit my answer.
– Manrique
Nov 13 '18 at 16:13
answered Nov 13 '18 at 16:10, edited Nov 13 '18 at 16:12 – Manrique
Thanks for your help.
I followed your directions, but the outcome was not as expected:
d1 = [('Bob', '10', '1542189668', '0', '0'), ('Alice', '10', '1425298030', '154', '39'), ('Bob', '10', '1542189668', '178', '42')]
df1 = spark.createDataFrame(d1, ['NAME', 'ID', 'DOB', 'Height', 'ShoeSize'])
df_dedupe = df1.dropDuplicates(['NAME', 'ID', 'DOB'])
df_reverse = df1.sort(["NAME", "ID", "DOB"], ascending=False)
# Note: the join result is never assigned, so df_dedupe below is unchanged
df_dedupe.join(df_reverse, ['NAME', 'ID', 'DOB'], 'inner')
df_dedupe.show(100, False)
The outcome was:
+-----+---+----------+------+--------+
|NAME |ID |DOB |Height|ShoeSize|
+-----+---+----------+------+--------+
|Bob |10 |1542189668|0 |0 |
|Alice|10 |1425298030|154 |39 |
+-----+---+----------+------+--------+
This kept the "Bob" row with the stale values.
Finally, I changed my approach: I converted the DataFrame to pandas and then back to Spark:
from pyspark.sql.types import StructType, StructField, StringType

p_schema = StructType([
    StructField('NAME', StringType(), True),
    StructField('ID', StringType(), True),
    StructField('DOB', StringType(), True),
    StructField('Height', StringType(), True),
    StructField('ShoeSize', StringType(), True)])
d1 = [('Bob', '10', '1542189668', '0', '0'), ('Alice', '10', '1425298030', '154', '39'), ('Bob', '10', '1542189668', '178', '42')]
df = spark.createDataFrame(d1, p_schema)
pdf = df.toPandas()
df_dedupe = pdf.drop_duplicates(subset=['NAME','ID','DOB'], keep='last', inplace=False)
df_spark = spark.createDataFrame(df_dedupe, p_schema)
df_spark.show(100, False)
This finally brought the correct "Bob":
+-----+---+----------+------+--------+
|NAME |ID |DOB |Height|ShoeSize|
+-----+---+----------+------+--------+
|Alice|10 |1425298030|154 |39 |
|Bob |10 |1542189668|178 |42 |
+-----+---+----------+------+--------+
Of course, I'd still prefer a purely Spark solution, but the lack of row indexing seems to be problematic in Spark.
Thanks!
answered Nov 14 '18 at 13:44, edited Nov 15 '18 at 7:02 – David Kon