fuzzy outer join/merge in R

fuzzy outer join/merge in R



I have 2 datasets and want to do fuzzy join.

Here is the two datasets.


library(data.table)
# data1
dt1 <- fread("NAME State type
ABERCOMBIE TOWNSHIP ND TS
ABERDEEN TOWNSHIP NJ TS
ABERDEEN TOWNSHIP SD TS
ABBOTSFORD CITY WI CI
ABERDEEN CITY WA CI
ADA TOWNSHIP MI TS
ADAMS IL TS", header = T)
# data2
dt2 <- fread("NAME State type
ABERDEEN TWP N J NJ TS
ABERDEEN WASH WA CI
ABBOTSFORD WIS WI CI
ADA TWP MICH MI TS
ADA OHIO OH CI
ADAMS MASS MA CI
ADAMSVILLE ALA AL CI", header = T)



Two datasets have the same characters in State and type; however, columns NAME are not the same. They are similar.

Although I can subtract columns NAME on each data with 3 or 4 charters and then merge them, it seems that the correct ratio may not high due to large observations.


State


type


NAME


NAME


dt1$NameSubstr <- substr(dt1$NAME, 1, 4)
dt2$NameSubstr <- substr(dt2$NAME, 1, 4)
merge(dt1, dt2, by = c("NameSubstr", "State", "type"), all = T)



The method is bad.



I check for package fuzzyjoin. But not sure whether I am correct or not.


fuzzyjoin


library(fuzzyjoin)
fuzzy_full_join(dt1, dt2, by = c("NAME" = "NAME", "State" = "State", "type" = "type"), match_fun = list(`!=`, `==`, `==`))

# Results
NAME.x State.x type.x NAME.y State.y type.y
1: ABERDEEN TOWNSHIP NJ TS ABERDEEN TWP N J NJ TS
2: ABBOTSFORD CITY WI CI ABBOTSFORD WIS WI CI
3: ABERDEEN CITY WA CI ABERDEEN WASH WA CI
4: ADA TOWNSHIP MI TS ADA TWP MICH MI TS
5: ABERCOMBIE TOWNSHIP ND TS <NA> <NA> <NA>
6: ABERDEEN TOWNSHIP SD TS <NA> <NA> <NA>
7: ADAMS IL TS <NA> <NA> <NA>
8: <NA> <NA> <NA> ADA OHIO OH CI
9: <NA> <NA> <NA> ADAMS MASS MA CI
10: <NA> <NA> <NA> ADAMSVILLE ALA AL CI



The result in this exercise is correct, see below. But If any NAME in these two data are the same, the answer will not correct.

I create a new observation in these two data.


dt1 <- fread("NAME State type
ABERCOMBIE TOWNSHIP ND TS
ABERDEEN TOWNSHIP NJ TS
ABERDEEN TOWNSHIP SD TS
ABBOTSFORD CITY WI CI
ABERDEEN CITY WA CI
ADA TOWNSHIP MI TS
ADAMS IL TS
THE SAME AA BB
", header = T)
dt2 <- fread("NAME State type
ABERDEEN TWP N J NJ TS
ABERDEEN WASH WA CI
ABBOTSFORD WIS WI CI
ADA TWP MICH MI TS
ADA OHIO OH CI
ADAMS MASS MA CI
ADAMSVILLE ALA AL CI
THE SAME AA BB
", header = T)

fuzzy_full_join(dt1, dt2, by = c("NAME" = "NAME", "State" = "State", "type" = "type"), match_fun = list(`!=`, `==`, `==`))

NAME.x State.x type.x NAME.y State.y type.y
1: ABERDEEN TOWNSHIP NJ TS ABERDEEN TWP N J NJ TS
2: ABBOTSFORD CITY WI CI ABBOTSFORD WIS WI CI
3: ABERDEEN CITY WA CI ABERDEEN WASH WA CI
4: ADA TOWNSHIP MI TS ADA TWP MICH MI TS
5: ABERCOMBIE TOWNSHIP ND TS <NA> <NA> <NA>
6: ABERDEEN TOWNSHIP SD TS <NA> <NA> <NA>
7: ADAMS IL TS <NA> <NA> <NA>
8: THE SAME AA BB <NA> <NA> <NA>
9: <NA> <NA> <NA> ADA OHIO OH CI
10: <NA> <NA> <NA> ADAMS MASS MA CI
11: <NA> <NA> <NA> ADAMSVILLE ALA AL CI
12: <NA> <NA> <NA> THE SAME AA BB



This is incorrect result.
Any suggestion?



It seems that I cannot use fuzzy_full_join.


fuzzy_full_join





Hi. This is not clear. Please read & act on Minimal, Complete, and Verifiable example. That includes expected output for problem input & a clear description of what rows output based on what rows are in the input.
– philipxy
Aug 22 at 4:54









By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Popular posts from this blog

𛂒𛀶,𛀽𛀑𛂀𛃧𛂓𛀙𛃆𛃑𛃷𛂟𛁡𛀢𛀟𛁤𛂽𛁕𛁪𛂟𛂯,𛁞𛂧𛀴𛁄𛁠𛁼𛂿𛀤 𛂘,𛁺𛂾𛃭𛃭𛃵𛀺,𛂣𛃍𛂖𛃶 𛀸𛃀𛂖𛁶𛁏𛁚 𛂢𛂞 𛁰𛂆𛀔,𛁸𛀽𛁓𛃋𛂇𛃧𛀧𛃣𛂐𛃇,𛂂𛃻𛃲𛁬𛃞𛀧𛃃𛀅 𛂭𛁠𛁡𛃇𛀷𛃓𛁥,𛁙𛁘𛁞𛃸𛁸𛃣𛁜,𛂛,𛃿,𛁯𛂘𛂌𛃛𛁱𛃌𛂈𛂇 𛁊𛃲,𛀕𛃴𛀜 𛀶𛂆𛀶𛃟𛂉𛀣,𛂐𛁞𛁾 𛁷𛂑𛁳𛂯𛀬𛃅,𛃶𛁼

Edmonton

Crossroads (UK TV series)