labelling rows in one data set based on the date of the measurement compared to two other dates in another dataset
library(data.table)
testset <- data.table(date = as.Date(c("2013-07-02", "2013-08-03", "2013-09-04",
                                       "2013-10-05", "2013-11-06")),
                      yr = c(2013, 2013, 2013, 2013, 2013),
                      mo = c(07, 08, 09, 10, 11),
                      da = c(02, 03, 04, 05, 06),
                      plant = LETTERS[1:5],
                      product = as.factor(letters[26:22]),
                      rating = runif(25))
For each row of this dataset I want to create a category (label) that depends on the date column. I want to compare this date with the dates in another dataset:
library(lubridate)
splitDates <- ymd(c("2013-06-10", "2013-08-15", "2013-10-06"))
Using splitDates, I want to find which value in splitDates came last before the measurement was taken. (Imagine that a new experiment ran from 2013-06-10 up to but not including 2013-08-15; I want to decide which experiment each measurement belongs to.)
As far as I can see, the first five rows of this new column should look like this:
NewColumn <- c("2013-06-10", "2013-06-10", "2013-08-15", "2013-08-15", "2013-10-06")
date yr mo da plant product rating NewColumn
1: 2013-07-02 2013 7 2 A z 0.02522850 2013-06-10
2: 2013-08-03 2013 8 3 B y 0.28274066 2013-06-10
3: 2013-09-04 2013 9 4 C x 0.86314441 2013-08-15
4: 2013-10-05 2013 10 5 D w 0.01670862 2013-08-15
5: 2013-11-06 2013 11 6 E v 0.16034175 2013-10-06
...
I can't figure out how to do this.
testset[, v := splitDates[findInterval(date, splitDates)]] seems to work? Related: stackoverflow.com/q/15712826
– Frank
Sep 4 '18 at 18:06
3 Answers
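Here's my take:
library(dplyr)
# Lookup table: interval number -> start date of that interval
dta <- data.frame(NewColumn = splitDates, newvar = 1:3)
# Interval number for each measurement date (ifelse is vectorised, so no sapply is needed)
testset$newvar <- ifelse(testset$date < splitDates[2], 1L, ifelse(testset$date < splitDates[3], 2L, 3L))
# left_join (not semi_join) so that NewColumn is actually added to testset
final_data <- left_join(testset, dta, by = "newvar")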
Data:
testset <- data.table(date = as.Date(c("2013-07-02", "2013-08-03", "2013-09-04",
                                       "2013-10-05", "2013-11-06")),
                      yr = c(2013, 2013, 2013, 2013, 2013),
                      mo = c(07, 08, 09, 10, 11),
                      da = c(02, 03, 04, 05, 06),
                      plant = LETTERS[1:5],
                      product = as.factor(letters[26:22]),
                      rating = runif(25))
splitDates <- ymd(c("2013-06-10", "2013-08-15", "2013-10-06"))
For me, understanding your question was more difficult than solving it. Please review the answer and give me feedback. It has 3 steps:
Make a function to return the latest date from the other dataset:
findLatest<-function(date)which.min( abs( splitDates-date ))
Then call the function over all the dates in testset:
names<-splitDates[ sapply(testset[,1], findLatest ) ]
Add the result to the dataset:
testset$names<-names
So, the first 10 rows are:
date yr mo da plant product rating V8
1 2013-07-02 2013 7 2 A z 0.75801493 2013-06-10
2 2013-08-03 2013 8 3 B y 0.06370597 2013-08-15
3 2013-09-04 2013 9 4 C x 0.25375231 2013-08-15
4 2013-10-05 2013 10 5 D w 0.42900236 2013-10-06
5 2013-11-06 2013 11 6 E v 0.97613291 2013-10-06
6 2013-07-02 2013 7 2 A z 0.78094927 2013-06-10
7 2013-08-03 2013 8 3 B y 0.91312684 2013-08-15
8 2013-09-04 2013 9 4 C x 0.29345599 2013-08-15
9 2013-10-05 2013 10 5 D w 0.80870134 2013-10-06
10 2013-11-06 2013 11 6 E v 0.18735280 2013-10-06
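Two things are worth noting about the approach above. First, which.min(abs(splitDates - date)) selects the nearest split date, which can lie after the measurement (see row 2 of the output). Second, when testset is a data.table, testset[,1] is a one-column table rather than a vector, so sapply passes the whole column to findLatest at once. A minimal base R sketch of a variant that only considers split dates on or before each measurement could look like the following. The helper name findLatestBefore and the standalone dates vector are hypothetical, and splitDates is assumed to be sorted in ascending order.

```r
# Stand-ins for the thread's data (base R only, so the sketch runs on its own)
splitDates <- as.Date(c("2013-06-10", "2013-08-15", "2013-10-06"))
dates <- as.Date(c("2013-07-02", "2013-08-03", "2013-09-04",
                   "2013-10-05", "2013-11-06"))

# Hypothetical helper: index of the last split date that is on or before d,
# or NA if d precedes every split date. Assumes splitDates is sorted ascending.
findLatestBefore <- function(d) {
  idx <- which(splitDates <= d)
  if (length(idx) == 0L) NA_integer_ else max(idx)
}

# Iterate over the date vector element by element
idx <- vapply(seq_along(dates),
              function(i) findLatestBefore(dates[i]),
              integer(1))
experiment <- splitDates[idx]
print(experiment)
```

With the thread's data this would be applied to testset$date instead of the standalone dates vector.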
I get the following message from names<-splitDates[ sapply(testset[,1], findLatest ) ]:
Advarselsbesked (Warning message): In unclass(time1) - unclass(time2) : longer object length is not a multiple of shorter object length
I like the logic behind the solution. However, already in the second row of your output the V8 value is after the date, which shouldn't be possible, considering that V8 is the initiation of the experiment and the date is when the result of that experiment was measured.
– Jakn09ab
Sep 5 '18 at 6:28
Sorry, what is the error on names<-splitDates[ sapply(testset[,1], findLatest ) ]? Did you load the findLatest function beforehand?
– Salman Lashkarara
Sep 5 '18 at 6:58
Yes, I did. "Advarselsbesked" just means "warning message" in Danish.
– Jakn09ab
Sep 6 '18 at 6:23
ok, please upvote me if it was helpful. @Jakn09ab
– Salman Lashkarara
Sep 6 '18 at 6:31
I have to hand the answer to Frank, who commented on my question:
testset[, v := splitDates[findInterval(date, splitDates)]]
does the trick.
I admire this solution as well. But it is not good to post @Frank's comment as your own solution. Maybe you can make it a community wiki answer.
– Salman Lashkarara
Sep 5 '18 at 7:07
Also, although it is short, I cannot fully understand how it works.
– Salman Lashkarara
Sep 5 '18 at 7:09
Me neither. I have not marked it as the solution, but I hope Frank will come and claim it. I created a new entry to make it easier for others to find the answer.
– Jakn09ab
Sep 6 '18 at 10:07
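Since the thread leaves the one-liner unexplained, here is a small base R sketch of what findInterval appears to be doing in it. For each date, it returns the number of elements of the sorted vector splitDates that are less than or equal to that date, which is also the position of the most recent split date. The standalone dates vector is a stand-in for testset$date. One caveat: a date earlier than the first split date gets index 0, and index 0 is silently dropped when subsetting, so the sketch converts such indices to NA first.

```r
splitDates <- as.Date(c("2013-06-10", "2013-08-15", "2013-10-06"))
dates <- as.Date(c("2013-07-02", "2013-08-03", "2013-09-04",
                   "2013-10-05", "2013-11-06"))

# Position of the interval [splitDates[i], splitDates[i + 1]) each date falls in
idx <- findInterval(dates, splitDates)
print(idx)   # indices 1 1 2 2 3

# Dates before the first split date would get 0; make them NA so the
# result keeps one element per input date
idx[idx == 0L] <- NA

# Subsetting by position yields the start date of each measurement's interval
v <- splitDates[idx]
print(v)
```

Inside testset[, v := ...], the same expression is evaluated with date referring to the column of testset, and := adds the result as a new column v by reference.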
If I understand correctly, the values always come from splitDates?
– Salman Lashkarara
Sep 4 '18 at 17:37