Comparing 2 datasets in order to keep only participants who completed 75% of trials using R
I have a large dataset where participants completed trials of a task. There are 100 regular trials and 10 practice trials. For this task we only want to keep the trials that people got correct. I have made a separate dataset that has my data without the outliers and incorrect trials. Now, I am stuck because I need to find a way to only keep the participants who still have at least 75% of their data.
To avoid posting the entire dataset, here is a simplified version of what it looks like:
subject latency
0003 454
0003 500
0003 600
0004 457
0004 600
0005 700
So subjects are in one column and their latency is in another column. The second dataset is smaller because trials were removed. I couldn't really find a good way to compare the 2 datasets and only keep subject IDs that kept 75% or more of their data.
Thank you all!
2 Answers
Suppose your two data sets are called dt1 (the full data) and dt2 (the data after trials were removed).
First find the number of trials per subject and merge the before and after tables:
library(data.table)
setDT(dt1)
setDT(dt2)
dt3 <- merge(
dt1[, .N, subject],
dt2[, .N, subject],
by = "subject"
)
The subjects you want to keep are those with at least 75% of their observations remaining (a proportion >= 0.75):
subjToKeep <- dt3[, percRemaining := N.y / N.x][percRemaining >= 0.75, subject]
dt2[subject %in% subjToKeep]
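To see the steps above on concrete numbers, here is a minimal sketch on made-up data. The subject IDs and latencies are hypothetical, and dt2 is built by dropping a few rows from dt1 purely for illustration. Because merge() does an inner join by default, a subject with no trials left in dt2 drops out of the comparison table, which is the desired outcome here.

```r
library(data.table)

# Hypothetical full data: subjects 0003 and 0004 have 4 trials each, 0005 has 2
dt1 <- data.table(
  subject = c("0003", "0003", "0003", "0003",
              "0004", "0004", "0004", "0004",
              "0005", "0005"),
  latency = c(454, 500, 600, 610, 457, 600, 620, 630, 700, 710)
)

# Hypothetical cleaned data: pretend rows 4, 7, 8 and 10 were removed
dt2 <- dt1[c(1, 2, 3, 5, 6, 9)]

# Trials per subject before (N.x) and after (N.y) cleaning
counts <- merge(dt1[, .N, by = subject], dt2[, .N, by = subject], by = "subject")
counts[, percRemaining := N.y / N.x]

# Subjects with at least 75% of their trials remaining
subjToKeep <- counts[percRemaining >= 0.75, subject]
result <- dt2[subject %in% subjToKeep]
print(counts)
print(result)
```

In this toy case only subject 0003 keeps 3 of 4 trials (0.75), so it is the only one retained.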
Here's a simple dplyr solution:
# example of full dataset
df_full = data.frame(subject = c(1,1,1,1,2,2,2,2,3,3,3,3,4),
latency = 1:13)
# example of smaller dataset
df_small = data.frame(subject = c(1,2,2,2,3,3,3),
latency = c(2,5,6,7,8,10,12))
library(dplyr)
df_full %>% count(subject) %>% # count rows for each subject in full dataset
left_join(df_small %>% count(subject), by="subject") %>% # count rows for each subject in small dataset and join
filter(n.y / n.x >= 0.75) %>% # keep only subjects where we have 75% or more of their data
pull(subject) -> subj_vec # save the subjects as a vector
# use that vector to filter your smaller dataset
df_small %>% filter(subject %in% subj_vec)
# subject latency
# 1 2 5
# 2 2 6
# 3 2 7
# 4 3 8
# 5 3 10
# 6 3 12
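If you would rather not load a package, the same count-and-compare idea can be sketched in base R with table(). This is only an illustrative variant, not part of the answer above; the names n_full, n_small, prop_kept and df_kept are made up for the example, and the toy data is the same as in the dplyr example.

```r
# Same toy data as in the dplyr example
df_full  <- data.frame(subject = c(1,1,1,1,2,2,2,2,3,3,3,3,4),
                       latency = 1:13)
df_small <- data.frame(subject = c(1,2,2,2,3,3,3),
                       latency = c(2,5,6,7,8,10,12))

# Trials per subject in the full dataset
n_full <- table(df_full$subject)

# Trials per subject in the smaller dataset; using the full dataset's
# subjects as factor levels gives a count of 0 for subjects with no rows left
n_small <- table(factor(df_small$subject, levels = names(n_full)))

# Proportion of trials remaining, in the same subject order as n_full
prop_kept <- as.vector(n_small) / as.vector(n_full)

# Subjects (as character labels) with 75% or more of their data
subj_keep <- names(n_full)[prop_kept >= 0.75]

# Filter the smaller dataset to those subjects
df_kept <- df_small[as.character(df_small$subject) %in% subj_keep, ]
print(df_kept)
```

Subjects 2 and 3 each keep 3 of 4 trials, so they are retained; subject 1 (1 of 4) and subject 4 (0 of 1) are not.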
Thank you so much, success!
– Mary Smirnova
Sep 19 '18 at 21:02