PySpark - remove duplicates from dataframe keeping the last appearance
I'm trying to dedupe a Spark DataFrame, keeping only the latest appearance.
The duplicates are identified by three columns:
NAME
ID
DOB
I succeeded in Pandas with the following:
df_dedupe = df.drop_duplicates(subset=['NAME','ID','DOB'], keep='last', inplace=False)
But in spark I tried the following:
df_dedupe = df.dropDuplicates(['NAME', 'ID', 'DOB'], keep='last')
I get this error:
TypeError: dropDuplicates() got an unexpected keyword argument 'keep'
Any ideas?
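For context, here is a self-contained version of the pandas call that behaves as described; the sample data is invented for illustration:

```python
import pandas as pd

# Two 'Bob' rows share NAME/ID/DOB; keep='last' retains the later one
df = pd.DataFrame({
    'NAME': ['Bob', 'Alice', 'Bob'],
    'ID': ['10', '10', '10'],
    'DOB': ['1542189668', '1425298030', '1542189668'],
    'Height': ['0', '154', '178'],
})
df_dedupe = df.drop_duplicates(subset=['NAME', 'ID', 'DOB'], keep='last', inplace=False)
print(df_dedupe)  # Alice's row and the second Bob row (Height 178) remain
```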
pandas dataframe pyspark
asked Nov 13 '18 at 16:00 – David Kon
quynhcodes.wordpress.com/2016/07/29/…
– karma4917
Nov 13 '18 at 16:07
Based on the [documentation][1], dropDuplicates does not have a keep parameter in pyspark.
– Wen-Ben
Nov 13 '18 at 16:09
2 Answers
As you can see in the documentation at http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html, the function dropDuplicates(subset=None) only accepts a subset as a parameter. Why would you want to keep the last one, if they're all equal?
EDIT
As @W-B pointed out, you want the other columns as well. My solution would be to sort the original dataframe in reverse order, and use df_dedupe in an inner join on the three repeated columns to preserve only the last values:
df_dedupe.join(original_df,['NAME','ID','DOB'],'inner')
Because there are still other columns, and he needs the last value for those
– Wen-Ben
Nov 13 '18 at 16:11
Oh, I understand. I will edit my answer.
– Manrique
Nov 13 '18 at 16:13
answered Nov 13 '18 at 16:10, edited Nov 13 '18 at 16:12 – Manrique
Thanks for your help.
I followed your directions, but the outcome was not as expected:
d1 = [('Bob', '10', '1542189668', '0', '0'), ('Alice', '10', '1425298030', '154', '39'), ('Bob', '10', '1542189668', '178', '42')]
df1 = spark.createDataFrame(d1, ['NAME', 'ID', 'DOB', 'Height', 'ShoeSize'])
df_dedupe = df1.dropDuplicates(['NAME', 'ID', 'DOB'])
df_reverse = df1.sort(["NAME", "ID", "DOB"], ascending=False)
# Note: the join result is never assigned, so df_dedupe below is unchanged
df_dedupe.join(df_reverse, ['NAME', 'ID', 'DOB'], 'inner')
df_dedupe.show(100, False)
The outcome was:
+-----+---+----------+------+--------+
|NAME |ID |DOB |Height|ShoeSize|
+-----+---+----------+------+--------+
|Bob |10 |1542189668|0 |0 |
|Alice|10 |1425298030|154 |39 |
+-----+---+----------+------+--------+
This kept the "Bob" row with the stale values.
Finally, I changed my approach: I converted the DataFrame to pandas and then back to Spark:
from pyspark.sql.types import StructType, StructField, StringType

p_schema = StructType([
    StructField('NAME', StringType(), True),
    StructField('ID', StringType(), True),
    StructField('DOB', StringType(), True),
    StructField('Height', StringType(), True),
    StructField('ShoeSize', StringType(), True)])
d1 = [('Bob', '10', '1542189668', '0', '0'), ('Alice', '10', '1425298030', '154', '39'), ('Bob', '10', '1542189668', '178', '42')]
df = spark.createDataFrame(d1, p_schema)
pdf = df.toPandas()
df_dedupe = pdf.drop_duplicates(subset=['NAME','ID','DOB'], keep='last', inplace=False)
df_spark = spark.createDataFrame(df_dedupe, p_schema)
df_spark.show(100, False)
This finally brought the correct "Bob":
+-----+---+----------+------+--------+
|NAME |ID |DOB |Height|ShoeSize|
+-----+---+----------+------+--------+
|Alice|10 |1425298030|154 |39 |
|Bob |10 |1542189668|178 |42 |
+-----+---+----------+------+--------+
Of course, I'd still prefer a purely Spark solution, but the lack of row indexing seems to be problematic in Spark.
Thanks!
answered Nov 14 '18 at 13:44, edited Nov 15 '18 at 7:02 – David Kon