Timestamps in Ridge Regression Scikit Learn
I am trying to transform data for use in regression, most likely the Ridge or Lasso technique implemented in sklearn.linear_model.
My training data contains timestamps, which I believe may have predictive power. The timestamps reflect the time that a user placed an order for pizza. The field containing the target data / labels is elapsed_time, which is expressed in seconds. Here is an example:
import pandas as pd
import sklearn.linear_model as linear_model

# Three example orders; elapsed_time (seconds) is the regression target.
delivery_data = {
    'order_time': ['2018-09-12 21:43:08', '2018-09-13 06:33:04', '2018-09-13 09:12:18'],
    'price': [34.54, 8.63, 21.24],
    'miles': [6, 3, 7],
    'home_type': ['apartment', 'house', 'apartment'],
    'elapsed_time': [2023, 1610, 1918],
}
df = pd.DataFrame(delivery_data)
# Parse the strings into datetime64 values so the .dt accessor is available.
df['order_time'] = pd.to_datetime(df['order_time'])
The resulting DataFrame looks like this:
            order_time  price  miles  home_type  elapsed_time
0  2018-09-12 21:43:08  34.54      6  apartment          2023
1  2018-09-13 06:33:04   8.63      3      house          1610
2  2018-09-13 09:12:18  21.24      7  apartment          1918
I am trying to predict the time to deliver pizza (elapsed_time) given timestamp, quantitative, and categorical data.
I suspect that time of day is predictive but that date is less predictive.
So far, I am considering extracting only the hour from the timestamp. In this example, order_time would become [21, 6, 9]. My first concern is that 23:59 has an hour of 23 and 00:01 has an hour of 0. The two values are far apart, even though the order times are two minutes apart.
Is there a better way to transform this datetime data?
Does it make a difference that the dataset contains other quantitative data (price, miles_from_store) and categorical data (home_type)?
2 Answers
There's no need to round time to the nearest hour, as it's a continuous variable and rounding just discards information. If the store is only open for a certain period during the day, then you can express time as a fraction of this interval (e.g. 0=opening time, 1=closing time, 0.5=halfway through).
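A minimal sketch of the fixed-opening-hours case, assuming pandas. The opening and closing hours (10:00 and 22:00), the timestamps, and the helper name `fraction_of_open_interval` are hypothetical values chosen for illustration, not part of the question's data.

```python
import pandas as pd

# Hypothetical opening hours, chosen only for illustration.
OPEN_HOUR = 10.0   # store opens at 10:00
CLOSE_HOUR = 22.0  # store closes at 22:00


def fraction_of_open_interval(timestamps: pd.Series) -> pd.Series:
    """Map each timestamp to 0 at opening time and 1 at closing time."""
    # Time of day as fractional hours, keeping minutes and seconds.
    hours = (
        timestamps.dt.hour
        + timestamps.dt.minute / 60.0
        + timestamps.dt.second / 3600.0
    )
    return (hours - OPEN_HOUR) / (CLOSE_HOUR - OPEN_HOUR)


# Hypothetical order times: at opening, halfway through, and at closing.
order_time = pd.Series(pd.to_datetime([
    "2018-09-12 10:00:00",
    "2018-09-12 16:00:00",
    "2018-09-12 22:00:00",
]))
open_fraction = fraction_of_open_interval(order_time)
```

Because no rounding is applied, two orders a few minutes apart get slightly different feature values rather than the same hour.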
If the store is open 24 hours, then things are more complicated because time is a circular variable (e.g. 23:59 and 00:01 are only two minutes apart, as you mentioned). In this case, one option is to transform time into two features that properly preserve the relative distance between timepoints. Suppose $t$ is the time in hours, and can take fractional values (e.g. 21.5 corresponds to 21:30). Then, let new features $t_x$ and $t_y$ be the Cartesian coordinates after mapping time onto the unit circle:
$$t_x = \cos \left( \frac{\pi}{12} t \right), \quad
t_y = \sin \left( \frac{\pi}{12} t \right)$$
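The mapping above can be sketched with NumPy and pandas as follows. The helper name `encode_time_of_day` and the three timestamps are assumptions made for illustration; the timestamps are picked to straddle midnight so the effect on distances is visible.

```python
import numpy as np
import pandas as pd


def encode_time_of_day(timestamps: pd.Series) -> pd.DataFrame:
    """Return t_x = cos(pi/12 * t) and t_y = sin(pi/12 * t), t in fractional hours."""
    t = (
        timestamps.dt.hour
        + timestamps.dt.minute / 60.0
        + timestamps.dt.second / 3600.0
    )
    angle = np.pi / 12.0 * t  # 24 hours correspond to one full turn (2*pi)
    return pd.DataFrame(
        {"t_x": np.cos(angle), "t_y": np.sin(angle)},
        index=timestamps.index,
    )


# Hypothetical timestamps: two minutes either side of midnight, plus noon.
order_time = pd.Series(pd.to_datetime([
    "2018-09-12 23:59:00",
    "2018-09-13 00:01:00",
    "2018-09-13 12:00:00",
]))
features = encode_time_of_day(order_time)

# Euclidean distances between the encoded points.
near_midnight = np.linalg.norm(
    features.iloc[0].to_numpy() - features.iloc[1].to_numpy()
)
midnight_vs_noon = np.linalg.norm(
    features.iloc[1].to_numpy() - features.iloc[2].to_numpy()
)
```

Here `near_midnight` comes out small while `midnight_vs_noon` is close to the circle's diameter, which is the behaviour the raw hour value fails to give. Both columns would need to go into the model together, since neither one alone identifies the time of day.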
From order_time you could also extract a categorical variable day of week or binary workday, assuming that traffic is heavier during the workdays.
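As a rough sketch of those two calendar features, assuming pandas: the column names `day_of_week` and `workday` and the `dow` prefix are illustrative choices, and the third timestamp is a hypothetical Saturday added so that both workday values appear. Treating Monday to Friday as workdays is itself an assumption that ignores public holidays.

```python
import pandas as pd

order_time = pd.Series(pd.to_datetime([
    "2018-09-12 21:43:08",  # Wednesday (from the question's sample)
    "2018-09-13 06:33:04",  # Thursday (from the question's sample)
    "2018-09-15 09:12:18",  # Saturday (hypothetical, not in the sample)
]))

calendar_features = pd.DataFrame({
    # Categorical feature: name of the weekday.
    "day_of_week": order_time.dt.day_name(),
    # Binary feature: dayofweek is 0 for Monday through 6 for Sunday,
    # so values below 5 are Monday to Friday. Holidays are not handled.
    "workday": (order_time.dt.dayofweek < 5).astype(int),
})

# One indicator column per weekday that occurs in the data.
day_dummies = pd.get_dummies(calendar_features["day_of_week"], prefix="dow")
```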
If you want to use the hour, you need to transform it into a categorical variable using one-hot encoding. Instead of just taking the hour, though, you could bin the timestamp more finely by splitting every day into $n$ chunks, e.g. taking $10$-minute intervals gives $144$ time slots per day, as they do in this example: http://radiostud.io/beat-rush-hour-traffic-with-tensorflow-machine-learning/
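One possible way to build such 10-minute bins and one-hot encode them with pandas is sketched below. This is not the linked article's code; the names `BIN_MINUTES`, `slot`, and the `slot` column prefix are assumptions for illustration. The timestamps are the three from the question.

```python
import pandas as pd

BIN_MINUTES = 10
N_SLOTS = 24 * 60 // BIN_MINUTES  # 144 slots per day

order_time = pd.Series(pd.to_datetime([
    "2018-09-12 21:43:08",
    "2018-09-13 06:33:04",
    "2018-09-13 09:12:18",
]))

# Index of the 10-minute slot within the day, an integer from 0 to 143.
minutes_since_midnight = order_time.dt.hour * 60 + order_time.dt.minute
slot = minutes_since_midnight // BIN_MINUTES

# Declare all 144 categories up front so that every slot gets a column,
# including slots that never occur in this small sample.
slot_categorical = pd.Series(
    pd.Categorical(slot, categories=list(range(N_SLOTS)))
)
slot_one_hot = pd.get_dummies(slot_categorical, prefix="slot")
```

With only three rows almost every column is all zeros, which illustrates how quickly the number of parameters grows relative to the data when the bins are this fine.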
On the other hand, you could create a broader categorical variable, like part of day, with values such as morning, noon, evening, and night.
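A small sketch of a part-of-day feature, assuming pandas. The hour boundaries inside `part_of_day` are hypothetical cut-offs picked for illustration; the answer does not specify them, and different choices would give different categories.

```python
import pandas as pd


def part_of_day(hour: int) -> str:
    """Assign an hour (0-23) to a coarse part of the day.

    The boundaries are illustrative assumptions, not taken from the answer.
    """
    if 6 <= hour < 11:
        return "morning"
    if 11 <= hour < 15:
        return "noon"
    if 15 <= hour < 21:
        return "evening"
    return "night"  # 21:00 to 05:59, wrapping around midnight


order_time = pd.Series(pd.to_datetime([
    "2018-09-12 21:43:08",
    "2018-09-13 06:33:04",
    "2018-09-13 09:12:18",
]))

part = order_time.dt.hour.map(part_of_day)
# One indicator column per part of day that occurs in the data.
part_one_hot = pd.get_dummies(part, prefix="part")
```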
There are problems with making time of day a categorical variable: with too many categories, you might lose many degrees of freedom (creating a non-parsimonious model); with too few, you might lose too much precision. Categorizing it also wipes out all information about its circular nature, which is the issue in question. Often there are better solutions.
– whuber♦
Sep 13 '18 at 23:07
+1 for pointing out that day of week may be more useful than I had considered, and for mentioning one-hot encoding as one option while linking to an article that handles categories by assigning an integer to each category.
– Jacob Quisenberry
Sep 14 '18 at 0:48
I think this solution is promising because it handles the problem of the midnight boundary. I am also glad it allows me to treat time as continuous, rather than a set of categories of arbitrary size.
– Jacob Quisenberry
Sep 14 '18 at 0:51