Comparing 2 datasets in order to keep only participants who completed 75% of trials using R



I have a large dataset in which participants completed trials of a task: 100 regular trials and 10 practice trials. For this task we only want to keep the trials that people got correct, so I made a separate dataset with the outliers and incorrect trials removed. Now I am stuck, because I need a way to keep only the participants who still have at least 75% of their data.



To keep things simple, rather than post the entire dataset, here is roughly what it looks like:


subject latency
0003 454
0003 500
0003 600
0004 457
0004 600
0005 700



So subjects are in one column and their latency is in another column. The second dataset is smaller because trials were removed. I couldn't really find a good way to compare the 2 datasets and only keep subject IDs that kept 75% or more of their data.



Thank you all!




2 Answers



If your two data sets are called dt1 and dt2:




First find the number of trials per subject and merge the before and after tables:


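library(data.table)
setDT(dt1)  # convert both data frames to data.tables by reference
setDT(dt2)

# Count trials per subject in each table, then join the counts.
# After the merge, N.x is the count from dt1 and N.y the count from dt2.
dt3 <- merge(
  dt1[, .N, by = subject],
  dt2[, .N, by = subject],
  by = "subject"
)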



The subjects you want to keep are those with at least 75% of their observations remaining:


subjToKeep <- dt3[, percRemaining := N.y / N.x][percRemaining >= 0.75, subject]

dt2[subject %in% subjToKeep]
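The same count-and-compare idea can also be checked without any packages. The following is only a sketch in base R on made-up toy data; the names `full`, `cleaned`, `prop_remaining` and `subj_to_keep` are illustrative and not part of the answer above.

```r
# Toy data, invented for illustration: the full dataset ...
full <- data.frame(
  subject = c("0003", "0003", "0003", "0003",
              "0004", "0004", "0004", "0004",
              "0005", "0005"),
  latency = c(454, 500, 600, 610, 457, 600, 620, 630, 700, 710),
  stringsAsFactors = FALSE
)
# ... and the cleaned dataset: 0003 keeps 3 of 4 trials,
# 0004 keeps 1 of 4, 0005 keeps 2 of 2
cleaned <- full[c(1, 2, 3, 5, 9, 10), ]

# Count trials per subject. Sharing the same factor levels means a subject
# with no remaining trials gets a count of 0 instead of disappearing.
subjects  <- sort(unique(full$subject))
n_full    <- table(factor(full$subject, levels = subjects))
n_cleaned <- table(factor(cleaned$subject, levels = subjects))

# Proportion of each subject's trials that survived cleaning
prop_remaining <- as.numeric(n_cleaned) / as.numeric(n_full)
names(prop_remaining) <- subjects

# Keep subjects with at least 75% of their trials remaining
subj_to_keep <- subjects[prop_remaining >= 0.75]
result <- cleaned[cleaned$subject %in% subj_to_keep, ]
```

With this toy data, `prop_remaining` is 0.75, 0.25 and 1 for subjects 0003, 0004 and 0005, so `result` holds only the rows for 0003 and 0005.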



Here's a simple dplyr solution:


# example of full dataset
df_full <- data.frame(subject = c(1,1,1,1,2,2,2,2,3,3,3,3,4),
                      latency = 1:13)

# example of smaller dataset
df_small <- data.frame(subject = c(1,2,2,2,3,3,3),
                       latency = c(2,5,6,7,8,10,12))


library(dplyr)

subj_vec <- df_full %>%
  count(subject) %>%                                       # rows per subject in the full dataset (n.x after the join)
  left_join(df_small %>% count(subject), by = "subject") %>%  # rows per subject in the small dataset (n.y)
  filter(n.y / n.x >= 0.75) %>%                            # keep subjects with 75% or more of their data;
                                                           # subjects absent from df_small have n.y = NA and are dropped
  pull(subject)                                            # save the subjects as a vector

# use that vector to filter your smaller dataset
df_small %>% filter(subject %in% subj_vec)

# subject latency
# 1 2 5
# 2 2 6
# 3 2 7
# 4 3 8
# 5 3 10
# 6 3 12






Thank you so much, success!

– Mary Smirnova
Sep 19 '18 at 21:02


