remove rows in R data frame based on group and another data frame
remove rows in R data frame based on group and another data frame
Have two data frames
dat1 <- data.frame(group= c(11,11,12,12,13,13,14,14,15,15,16,16,17,17,17,18,18,18),name= c("A","B","C","D","E","F","G","H","I","J","A","B","E","F","W","A","B","V"))
dat2 <- data.frame(ID=c(1,1,2,2,3,3),name =c("A","B","E","F","X","Y"))
Second data frame have combinations of two values grouped by ID Column. And based on second data frame (dat2) ,need to remove rows in first data fame (dat1) if that particular group combination exists in dat2.
For example: If "A" and "B" both exists in dat1 and then it should get removed.
Thus, the desired output is
desiredat <- data.frame(group= c(12,12,13,13,15,15),name= c("C","D","G","H","I","J"))
Looking for ways in R to achieve the same.
desiredat
2 Answers
2
This can be solved by an anti-join. However, we need to identify which group ids group have to be removed from dat1.
group
dat1
library(data.table)
# count names per ID
setDT(dat2)[, n.id := .N, by = ID]
# identify groups to remove by joining and ...
groups_to_remove <- dat2[setDT(dat1), on = "name", nomatch = 0L][
# ... check which groups have a match with the complete set of names
, which(n.id == .N), by = .(ID, group)]
# anti join
dat1[!groups_to_remove, on = "group"]
group name
1: 12 C
2: 12 D
3: 14 G
4: 14 H
5: 15 I
6: 15 J
7: 19 A
8: 19 X
Group 19 is not removed because names "A" and "X" belong to different IDs in dat2.
dat2
A more streamlined approach uses all() instead of counting the unique names:
all()
library(data.table)
setDT(dat1)
setDT(dat2)
groups_to_remove <- dat1[dat2, on = "name"][, which(all(ID == ID[1])), by = group]
dat1[!groups_to_remove, on = "group"]
group name
1: 12 C
2: 12 D
3: 14 G
4: 14 H
5: 15 I
6: 15 J
7: 19 A
8: 19 X
The same as above in dplyr syntax:
dplyr
library(dplyr)
dat2 %>%
left_join(dat1, by = "name") %>%
group_by(group) %>%
summarise(all_have_same_id = all(ID == ID[1L])) %>%
filter(all_have_same_id) %>%
anti_join(dat1, ., by = "group")
group name
1 12 C
2 12 D
3 14 G
4 14 H
5 15 I
6 15 J
7 19 A
8 19 X
Warning message:
Column `name` joining factors with different levels, coercing to character vector
The sample dataset dat1 provided by the OP consists of groups where either no name is included in dat2 or all names are in one ID of dat2 (perhaps plus an additional name) but it lacks a use case where only one name is contained in an ID of dat2. Therefore, I have added this use case (as group 19):
dat1
dat2
dat2
dat2
dat1 <- data.frame(
group= c(11,11,12,12,13,13,14,14,15,15,16,16,17,17,17,18,18,18,19,19),
name= c("A","B","C","D","E","F","G","H","I","J","A","B","E","F","W","A","B","V","A","X"))
dat2 <- data.frame(ID=c(1,1,2,2,3,3),name =c("A","B","E","F","X","Y"))
perfect .... Thank you!
– string
Aug 29 at 17:29
Something like this...?
dat1[dat1$name %in% setdiff(dat1$name, dat2$name), ]
3 12 C
4 12 D
7 14 G
8 14 H
9 15 I
10 15 J
15 17 W
18 18 V
Required, but never shown
Required, but never shown
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
check your
desiredatI think there are some values missed– Jilber Urbina
Aug 29 at 15:58