remove rows in R data frame based on group and another data frame

Have two data frames

dat1 <- data.frame(group= c(11,11,12,12,13,13,14,14,15,15,16,16,17,17,17,18,18,18),name= c("A","B","C","D","E","F","G","H","I","J","A","B","E","F","W","A","B","V")) dat2 <- data.frame(ID=c(1,1,2,2,3,3),name =c("A","B","E","F","X","Y"))

Second data frame have combinations of two values grouped by ID Column. And based on second data frame (dat2) ,need to remove rows in first data fame (dat1) if that particular group combination exists in dat2.

For example: If "A" and "B" both exists in dat1 and then it should get removed.

Thus, the desired output is

desiredat <- data.frame(group= c(12,12,13,13,15,15),name= c("C","D","G","H","I","J"))

Looking for ways in R to achieve the same.

check your desiredat I think there are some values missed
– Jilber Urbina
Aug 29 at 15:58

desiredat

2 Answers
2

This can be solved by an anti-join. However, we need to identify which group ids group have to be removed from dat1.

group

dat1

library(data.table) # count names per ID setDT(dat2)[, n.id := .N, by = ID] # identify groups to remove by joining and ... groups_to_remove <- dat2[setDT(dat1), on = "name", nomatch = 0L][ # ... check which groups have a match with the complete set of names , which(n.id == .N), by = .(ID, group)] # anti join dat1[!groups_to_remove, on = "group"]

group name 1: 12 C 2: 12 D 3: 14 G 4: 14 H 5: 15 I 6: 15 J 7: 19 A 8: 19 X

Group 19 is not removed because names "A" and "X" belong to different IDs in dat2.

dat2

A more streamlined approach uses all() instead of counting the unique names:

all()

library(data.table) setDT(dat1) setDT(dat2) groups_to_remove <- dat1[dat2, on = "name"][, which(all(ID == ID[1])), by = group] dat1[!groups_to_remove, on = "group"]

group name 1: 12 C 2: 12 D 3: 14 G 4: 14 H 5: 15 I 6: 15 J 7: 19 A 8: 19 X

The same as above in dplyr syntax:

dplyr

library(dplyr) dat2 %>% left_join(dat1, by = "name") %>% group_by(group) %>% summarise(all_have_same_id = all(ID == ID[1L])) %>% filter(all_have_same_id) %>% anti_join(dat1, ., by = "group")

group name 1 12 C 2 12 D 3 14 G 4 14 H 5 15 I 6 15 J 7 19 A 8 19 X Warning message: Column `name` joining factors with different levels, coercing to character vector

The sample dataset dat1 provided by the OP consists of groups where either no name is included in dat2 or all names are in one ID of dat2 (perhaps plus an additional name) but it lacks a use case where only one name is contained in an ID of dat2. Therefore, I have added this use case (as group 19):

dat1

dat2

dat1 <- data.frame( group= c(11,11,12,12,13,13,14,14,15,15,16,16,17,17,17,18,18,18,19,19), name= c("A","B","C","D","E","F","G","H","I","J","A","B","E","F","W","A","B","V","A","X")) dat2 <- data.frame(ID=c(1,1,2,2,3,3),name =c("A","B","E","F","X","Y"))

perfect .... Thank you!
– string
Aug 29 at 17:29

Something like this...?

dat1[dat1$name %in% setdiff(dat1$name, dat2$name), ] 3 12 C 4 12 D 7 14 G 8 14 H 9 15 I 10 15 J 15 17 W 18 18 V

Required, but never shown

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

搜尋此網誌

Dfyjkt