Comparing 2 datasets in order to keep only participants who completed 75% of trials using R

I have a large dataset where participants completed trials of a task. There are 100 regular trials and 10 practice trials. For this task we only want to keep the trials that people got correct. I have made a separate dataset that has my data without the outliers and incorrect trials. Now, I am stuck because I need to find a way to only keep the participants who still have at least 75% of their data.

To keep things simple, rather than posting the entire dataset, here is roughly what it looks like:

    subject latency
    0003    454
    0003    500
    0003    600
    0004    457
    0004    600
    0005    700

So subjects are in one column and their latency is in another column. The second dataset is smaller because trials were removed. I couldn't really find a good way to compare the 2 datasets and only keep subject IDs that kept 75% or more of their data.

Thank you all!

2 Answers

If your two data sets are called dt1 and dt2:

First find the number of trials per subject and merge the before and after tables:

    library(data.table)
    setDT(dt1)
    setDT(dt2)

    dt3 <- merge(
      dt1[, .N, subject],
      dt2[, .N, subject],
      by = "subject"
    )

The subjects you want to keep are those with at least 75% of their observations remaining (N.x is the count before cleaning, N.y the count after):

    subjToKeep <- dt3[, percRemaining := N.y / N.x][percRemaining >= 0.75, subject]
    dt2[subject %in% subjToKeep]
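Since the contents of dt1 and dt2 are not shown, here is a minimal end-to-end sketch of the same steps on toy data. The latency values and the rows removed are made up for illustration and are not the asker's real data; it also assumes the data.table package is installed.

```r
library(data.table)

# Toy "before" data: hypothetical values, 4 trials per subject
dt1 <- data.table(
  subject = c("0003", "0003", "0003", "0003", "0004", "0004", "0004", "0004"),
  latency = c(454, 500, 600, 620, 457, 600, 610, 640)
)

# Toy "after" data: subject 0003 keeps 3 of 4 trials, subject 0004 keeps 2 of 4
dt2 <- dt1[c(1, 2, 3, 5, 6)]

# Trials per subject before (N.x) and after (N.y) cleaning
counts <- merge(dt1[, .N, by = subject], dt2[, .N, by = subject], by = "subject")
counts[, percRemaining := N.y / N.x]

# Subjects that retained at least 75% of their trials
subjToKeep <- counts[percRemaining >= 0.75, subject]

# Cleaned data restricted to those subjects
kept <- dt2[subject %in% subjToKeep]
print(kept)
```

Note that merge() does an inner join by default, so a subject with no trials left in dt2 simply drops out of the counts table, which is the desired outcome here.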

Here's a simple dplyr solution

    # example of full dataset
    df_full = data.frame(subject = c(1,1,1,1,2,2,2,2,3,3,3,3,4),
                         latency = 1:13)

    # example of smaller dataset
    df_small = data.frame(subject = c(1,2,2,2,3,3,3),
                          latency = c(2,5,6,7,8,10,12))

    library(dplyr)

    df_full %>%
      count(subject) %>%                                        # count rows for each subject in full dataset
      left_join(df_small %>% count(subject), by="subject") %>%  # count rows for each subject in small dataset and join
      filter(n.y / n.x >= 0.75) %>%                             # keep only subjects where we have 75% or more of their data
      pull(subject) -> subj_vec                                 # save the subjects as a vector

    # use that vector to filter your smaller dataset
    df_small %>% filter(subject %in% subj_vec)

    #   subject latency
    # 1       2       5
    # 2       2       6
    # 3       2       7
    # 4       3       8
    # 5       3      10
    # 6       3      12
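If you would rather not load any packages, the same idea can be sketched in base R with table(). This is only an illustrative alternative and not part of the answer above; it reuses the example data frames from that answer, and the names ids, n_full, n_small, prop_kept and df_kept are made up for this sketch.

```r
# Same example data as in the dplyr answer
df_full  <- data.frame(subject = c(1,1,1,1,2,2,2,2,3,3,3,3,4), latency = 1:13)
df_small <- data.frame(subject = c(1,2,2,2,3,3,3), latency = c(2,5,6,7,8,10,12))

# Use the full set of subject IDs as factor levels, so that a subject
# with no rows left in the smaller dataset is counted as 0 rather than dropped
ids     <- sort(unique(df_full$subject))
n_full  <- table(factor(df_full$subject,  levels = ids))
n_small <- table(factor(df_small$subject, levels = ids))

# Proportion of each subject's rows that survived the cleaning
prop_kept <- as.vector(n_small / n_full)

# Subjects with at least 75% of their rows remaining
subj_vec <- ids[prop_kept >= 0.75]

# Smaller dataset restricted to those subjects
df_kept <- df_small[df_small$subject %in% subj_vec, ]
print(df_kept)
```

Fixing the factor levels up front is what keeps the two count tables aligned, so the division compares each subject with itself.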

Thank you so much, success!

– Mary Smirnova
Sep 19 '18 at 21:02
