Timestamps in Ridge Regression Scikit Learn


I am trying to transform data for use in regression, most likely the Ridge or Lasso technique implemented in sklearn.linear_model.



My training data contains time stamps, which I believe may have predictive power. The time stamps reflect the time that a user placed an order for pizza.

The field containing the target data / labels is elapsed_time, which is expressed in seconds. Here is an example:

import pandas as pd
import sklearn.linear_model as linear_model

# Three sample orders; elapsed_time (in seconds) is the target.
delivery_data = {
    'order_time': ['2018-09-12 21:43:08', '2018-09-13 06:33:04', '2018-09-13 09:12:18'],
    'price': [34.54, 8.63, 21.24],
    'miles': [6, 3, 7],
    'home_type': ['apartment', 'house', 'apartment'],
    'elapsed_time': [2023, 1610, 1918],
}

df = pd.DataFrame(delivery_data)
# Parse the order_time strings into datetime64 values.
df['order_time'] = pd.to_datetime(df['order_time'])



The resulting DataFrame looks like this:


            order_time  price  miles  home_type  elapsed_time
0  2018-09-12 21:43:08  34.54      6  apartment          2023
1  2018-09-13 06:33:04   8.63      3      house          1610
2  2018-09-13 09:12:18  21.24      7  apartment          1918



I am trying to predict the time to deliver pizza (elapsed_time) given timestamp, quantitative, and categorical data.



I suspect that time of day is predictive but that date is less predictive.



So far, I am considering extracting only the hour from the time stamp. In this example, order_time would become [21, 6, 9]. My first concern is that 23:59 has an hour of 23 and 00:01 has an hour of 0. The two values are far apart, even though the order times are two minutes apart.
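To make the concern concrete, here is a minimal sketch of the hour-only extraction using pandas' `.dt.hour` accessor. The two timestamps around midnight are made up for illustration and are not part of the dataset.

```python
import pandas as pd

order_time = pd.to_datetime(pd.Series([
    '2018-09-12 21:43:08', '2018-09-13 06:33:04', '2018-09-13 09:12:18',
]))

# Integer hour of day (0-23); minutes and seconds are discarded.
hours = order_time.dt.hour

# Illustrative pair of orders placed two minutes apart, either side of midnight.
around_midnight = pd.to_datetime(pd.Series([
    '2018-09-12 23:59:00', '2018-09-13 00:01:00',
]))
midnight_hours = around_midnight.dt.hour

# As a plain number, the hour feature puts these two orders 23 units apart.
hour_gap = abs(int(midnight_hours.iloc[0]) - int(midnight_hours.iloc[1]))
print(hours.tolist(), hour_gap)
```

A linear model such as Ridge sees only that numeric gap, so it has no way to learn that the two orders are close in time.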



Is there a better way to transform this datetime data?



Does it make a difference that the dataset contains other quantitative data (price, miles) and categorical data (home_type)?




2 Answers



There's no need to round time to the nearest hour, as it's a continuous variable and rounding just discards information. If the store is only open for a certain period during the day, then you can express time as a fraction of this interval (e.g. 0=opening time, 1=closing time, 0.5=halfway through).
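As a rough sketch of the fraction-of-interval idea, the snippet below rescales time of day to the range 0 to 1. The opening and closing hours (`OPEN_HOUR`, `CLOSE_HOUR`) are hypothetical values picked so that the sample orders fall inside them; the question does not state the store's real hours.

```python
import pandas as pd

# Hypothetical opening hours, chosen only for illustration.
OPEN_HOUR = 6.0    # 06:00
CLOSE_HOUR = 23.0  # 23:00

order_time = pd.to_datetime(pd.Series([
    '2018-09-12 21:43:08', '2018-09-13 06:33:04', '2018-09-13 09:12:18',
]))

# Time of day in fractional hours, e.g. 21:30 -> 21.5.
hour_of_day = (
    order_time.dt.hour
    + order_time.dt.minute / 60
    + order_time.dt.second / 3600
)

# 0 = opening time, 1 = closing time, 0.5 = halfway through the day.
open_fraction = (hour_of_day - OPEN_HOUR) / (CLOSE_HOUR - OPEN_HOUR)
print(open_fraction.round(3).tolist())
```

Orders placed outside the assumed interval would fall below 0 or above 1, so real opening hours should be checked before relying on this feature.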



If the store is open 24 hours, then things are more complicated because time is a circular variable (e.g. 23:59 and 00:01 are only two minutes apart, as you mentioned). In this case, one option is to transform time into two features that properly preserve the relative distance between timepoints. Suppose $t$ is the time in hours, and can take fractional values (e.g. 21.5 corresponds to 21:30). Then, let new features $t_x$ and $t_y$ be the Cartesian coordinates after mapping time onto the unit circle:



$$t_x = \cos \left( \frac{\pi}{12} t \right), \quad
t_y = \sin \left( \frac{\pi}{12} t \right)$$
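A minimal sketch of this mapping in pandas and NumPy follows. The helper name `encode_time_of_day` and the two timestamps around midnight are illustrative assumptions, not something taken from the question.

```python
import numpy as np
import pandas as pd

def encode_time_of_day(timestamps: pd.Series) -> pd.DataFrame:
    """Map time of day onto the unit circle (hypothetical helper)."""
    # t is the time of day in fractional hours, e.g. 21:30 -> 21.5.
    t = (
        timestamps.dt.hour
        + timestamps.dt.minute / 60
        + timestamps.dt.second / 3600
    )
    angle = np.pi / 12 * t  # 24 hours correspond to one full turn
    return pd.DataFrame(
        {'t_x': np.cos(angle), 't_y': np.sin(angle)},
        index=timestamps.index,
    )

# Illustrative orders two minutes either side of midnight.
around_midnight = pd.to_datetime(pd.Series([
    '2018-09-12 23:59:00', '2018-09-13 00:01:00',
]))
coords = encode_time_of_day(around_midnight)

# Euclidean distance between the two encoded points.
gap = float(np.linalg.norm(coords.iloc[0].to_numpy() - coords.iloc[1].to_numpy()))
print(coords, gap)
```

The two encoded points end up very close together, which is the behaviour the raw hour cannot provide. The columns `t_x` and `t_y` can be placed alongside price and miles as ordinary numeric features.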





I think this solution is promising because it handles the problem of the midnight boundary. I am also glad it allows me to treat time as continuous, rather than a set of categories of arbitrary size.
– Jacob Quisenberry
Sep 14 '18 at 0:51



From order_time you could also extract a categorical variable such as day of week, or a binary workday indicator, assuming that traffic is heavier on workdays.
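Both features can be sketched with pandas' datetime accessors. The workday rule below (Monday to Friday, no holiday calendar) is a simplifying assumption.

```python
import pandas as pd

order_time = pd.to_datetime(pd.Series([
    '2018-09-12 21:43:08', '2018-09-13 06:33:04', '2018-09-13 09:12:18',
]))

# Day of week as an integer: Monday=0 ... Sunday=6.
day_of_week = order_time.dt.dayofweek

# Binary workday flag; assumes Mon-Fri are workdays and ignores holidays.
is_workday = (day_of_week < 5).astype(int)

# One-hot encoding of the day name, suitable for a linear model.
day_dummies = pd.get_dummies(order_time.dt.day_name(), prefix='dow')
print(day_of_week.tolist(), is_workday.tolist(), list(day_dummies.columns))
```

With only three sample rows, just two weekdays appear; on the full dataset the one-hot frame would have up to seven columns.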





If you want to use the hour, you need to transform it into a categorical variable using one-hot encoding. Instead of taking just the hour, you could map the timestamp to finer-grained time slots by splitting every day into $n$ chunks, e.g. $10$-minute intervals give $144$ time slots per day, as they do in this example: http://radiostud.io/beat-rush-hour-traffic-with-tensorflow-machine-learning/
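One way to sketch the 10-minute binning with one-hot encoding in pandas is shown below. The constant `BIN_MINUTES` and the column prefix are illustrative choices, and this is not the code from the linked article.

```python
import pandas as pd

BIN_MINUTES = 10                      # illustrative bin width
N_BINS = 24 * 60 // BIN_MINUTES       # 144 bins per day

order_time = pd.to_datetime(pd.Series([
    '2018-09-12 21:43:08', '2018-09-13 06:33:04', '2018-09-13 09:12:18',
]))

# Minutes since midnight, then the index of the 10-minute bin (0..143).
minute_of_day = order_time.dt.hour * 60 + order_time.dt.minute
time_bin = minute_of_day // BIN_MINUTES

# Declare all 144 categories so every bin gets a column, even if unseen.
bin_categories = pd.Series(
    pd.Categorical(time_bin, categories=list(range(N_BINS)))
)
one_hot = pd.get_dummies(bin_categories, prefix='bin')
print(time_bin.tolist(), one_hot.shape)
```

Declaring the full category list keeps the feature matrix the same width for training and prediction data.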



On the other hand, you could create a broader categorical variable, like part of day, with values such as morning, noon, evening, and night.
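A small sketch of such a part-of-day feature is given below. The hour boundaries in `part_of_day` are arbitrary assumptions; the answer does not specify where each part starts or ends.

```python
import pandas as pd

def part_of_day(hour: int) -> str:
    """Assign a coarse label to an hour of day (boundaries are assumptions)."""
    if 6 <= hour < 11:
        return 'morning'
    if 11 <= hour < 15:
        return 'noon'
    if 15 <= hour < 22:
        return 'evening'
    return 'night'  # 22:00-05:59, wrapping around midnight

order_time = pd.to_datetime(pd.Series([
    '2018-09-12 21:43:08', '2018-09-13 06:33:04', '2018-09-13 09:12:18',
]))

parts = order_time.dt.hour.map(part_of_day)

# One-hot encode the labels for use in a linear model.
part_dummies = pd.get_dummies(parts, prefix='part')
print(parts.tolist())
```

Using a plain function keeps the night category contiguous across midnight, which interval-based binning on the raw hour does not do directly.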







There are problems with making time of day a categorical variable: with too many categories, you might lose many degrees of freedom (creating a non-parsimonious model); with too few, you might lose too much precision. Categorizing it also wipes out all information about its circular nature, which is the issue in question. Often there are better solutions.
– whuber
Sep 13 '18 at 23:07





+1 for pointing out that day of week may be more useful than I had considered, and for mentioning one-hot encoding as one option while linking to an article that handles categories by assigning an integer to each category.
– Jacob Quisenberry
Sep 14 '18 at 0:48

