Pandas: filling missing values by the time occurrence of an event
Pandas: filling missing values by the time occurrence of an event
I already asked a similar question (see here), but unfortunately it was not clear enough, so I decided it was better to create a new one with a better dataset for example and a new explanation of the desired output - an edit would have been really a major change.
So, I have the following dataset (it's already sorted by date and player):
d = 'player': ['1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '3', '3', '3', '3', '3', '3'],
'date': ['2018-01-01 00:17:01', '2018-01-01 00:17:05','2018-01-01 00:19:05', '2018-01-01 00:21:07', '2018-01-01 00:22:09',
'2018-01-01 00:22:17', '2018-01-01 00:25:09', '2018-01-01 00:25:11', '2018-01-01 00:27:28', '2018-01-01 00:29:29',
'2018-01-01 00:30:35', '2018-02-01 00:31:16', '2018-02-01 00:35:22', '2018-02-01 00:38:16',
'2018-02-01 00:38:20', '2018-02-01 00:55:15', '2018-01-03 00:55:22',
'2018-01-03 00:58:16', '2018-01-03 00:58:21', '2018-03-01 01:00:35', '2018-03-01 01:20:16', '2018-03-01 01:31:16'],
'id': [np.nan, np.nan, 'a', 'a', 'b', np.nan, 'b', 'c', 'c', 'c', 'c', 'd', 'd', 'e', 'e', np.nan, 'f', 'f',
'g', np.nan, 'f', 'g']
#create dataframe
df = pd.DataFrame(data=d)
#change date to datetime
df['date'] = pd.to_datetime(df['date'])
df
player date id
0 1 2018-01-01 00:17:01 NaN
1 1 2018-01-01 00:17:05 NaN
2 1 2018-01-01 00:19:05 a
3 1 2018-01-01 00:21:07 a
4 1 2018-01-01 00:22:09 b
5 1 2018-01-01 00:22:07 NaN
6 1 2018-01-01 00:25:09 b
7 1 2018-01-01 00:25:11 c
8 1 2018-01-01 00:27:28 c
9 1 2018-01-01 00:29:29 c
10 1 2018-01-01 00:30:35 c
11 2 2018-02-01 00:31:16 d
12 2 2018-02-01 00:35:22 d
13 2 2018-02-01 00:38:16 e
14 2 2018-02-01 00:38:20 e
15 2 2018-02-01 00:55:15 NaN
16 3 2018-01-03 00:55:22 f
17 3 2018-01-03 00:58:16 f
18 3 2018-01-03 00:58:21 g
19 3 2018-03-01 01:00:35 NaN
20 3 2018-03-01 01:20:16 f
21 3 2018-03-01 01:31:16 g
So, these are my three columns:
Ok, so here I have my missing values:
df.loc[df.id.isnull(),'date']
0 2018-01-01 00:17:01
1 2018-01-01 00:17:05
5 2018-01-01 00:22:07
15 2018-02-01 00:55:15
19 2018-03-01 01:00:35
Please note that I have the player code for each one of them: what I miss is just the sessioncode. So, I want to compare the timestamp of each missing value with the sessioncode timestamp of the corresponding players.
I was thinking of computing with a groupby the first and last action for each session, for each player (but I do not know if it is the best approach).
my_agg = df.groupby(['player', 'id']).date.agg([min, max])
my_agg
min max
player id
1 a 2018-01-01 00:19:05 2018-01-01 00:21:07
b 2018-01-01 00:22:09 2018-01-01 00:25:09
c 2018-01-01 00:25:11 2018-01-01 00:30:35
2 d 2018-02-01 00:31:16 2018-02-01 00:35:22
e 2018-02-01 00:38:16 2018-02-01 00:38:20
3 f 2018-01-03 00:55:22 2018-03-01 01:20:16
g 2018-01-03 00:58:21 2018-03-01 01:31:16
Then I would like to match the Nan by player id, and compare the timestamps of each missing values with the range of each session for that player.
In the dataset I try to illustrate three possible scenarios I am interested in:
Desired output:
to sum it up, the outcome should look like this df:
player date id
0 1 2018-01-01 00:17:01 0
1 1 2018-01-01 00:17:05 0
2 1 2018-01-01 00:19:05 a
3 1 2018-01-01 00:21:07 a
4 1 2018-01-01 00:22:09 b
5 1 2018-01-01 00:22:07 b
6 1 2018-01-01 00:25:09 b
7 1 2018-01-01 00:25:11 c
8 1 2018-01-01 00:27:28 c
9 1 2018-01-01 00:29:29 c
10 1 2018-01-01 00:30:35 c
11 2 2018-02-01 00:31:16 d
12 2 2018-02-01 00:35:22 d
13 2 2018-02-01 00:38:16 e
14 2 2018-02-01 00:38:20 e
15 2 2018-02-01 00:55:15 0
16 3 2018-01-03 00:55:22 f
17 3 2018-01-03 00:58:16 f
18 3 2018-01-03 00:58:21 g
19 3 2018-03-01 01:00:35 -99
20 3 2018-03-01 01:20:16 f
21 3 2018-03-01 01:31:16 g
2 Answers
2
May not be the best approach but it does work. basically I am creating some new columns using shift and then used your conditions you mentioned with np.select
:
np.select
df['shift'] = df['id'].shift(1)
df['shift-1'] = df['id'].shift(-1)
df['merge'] = df[['shift','shift-1']].values.tolist()
df.drop(columns=['shift','shift-1'], inplace=True)
alpha = np.nan:0,'a':1,'b':2,'c':3,'d':4,'e':5,'f':6,'g':7,'h':8
diff =
for i in range(len(df)):
diff.append(alpha[df['merge'][i][1]] - alpha[df['merge'][i][0]])
df['diff'] = diff
conditions = [(df['id'].shift(1).eq(df['id'].shift(-1)) & (df['id'].isna()) & (df['player'].shift(1).eq(df['player'].shift(-1)))),
(~df['id'].shift(1).eq(df['id'].shift(-1)) & (df['id'].isna()) & (df['player'].shift(1).eq(df['player']) |
df['player'].shift(-1).eq(df['player'])) &
(~df['diff'] < 0)),
(~df['id'].shift(1).eq(df['id'].shift(-1)) & (df['id'].isna()) & (df['player'].shift(1).eq(df['player']) |
df['player'].shift(-1).eq(df['player'])) &
(df['diff'] < 0)),
]
choices = [df['id'].ffill(),
0,
-99
]
df['id'] = np.select(conditions, choices, default = df['id'])
df.drop(columns=['merge','diff'], inplace=True)
df
out:
player date id
0 1 2018-01-01 00:17:01 0
1 1 2018-01-01 00:17:05 0
2 1 2018-01-01 00:19:05 a
3 1 2018-01-01 00:21:07 a
4 1 2018-01-01 00:22:09 b
5 1 2018-01-01 00:22:07 b
6 1 2018-01-01 00:25:09 b
7 1 2018-01-01 00:25:11 c
8 1 2018-01-01 00:27:28 c
9 1 2018-01-01 00:29:29 c
10 1 2018-01-01 00:30:35 c
11 2 2018-02-01 00:31:16 d
12 2 2018-02-01 00:35:22 d
13 2 2018-02-01 00:38:16 e
14 2 2018-02-01 00:38:20 e
15 2 2018-02-01 00:55:15 0
16 3 2018-01-03 00:55:22 f
17 3 2018-01-03 00:58:16 f
18 3 2018-01-03 00:58:21 g
19 3 2018-03-01 01:00:35 -99
20 3 2018-03-01 01:20:16 f
21 3 2018-03-01 01:31:16 g
In my solution I just had to work a bit to apply correctly the function wrote by @ysearka in a previous stackoverflow question - see here. The basic challenge was to apply his function player by player.
#define a function to sort the missing values (ysearka function from stackoverflow)
def my_custom_function(time):
#compare every date event with the range of the sessions.
current_sessions = my_agg.loc[(my_agg['min']<time) & (my_agg['max']>time)]
#store length, that is the number of matches.
count = len(current_sessions)
#How many matches are there for any missing id value?
# if 0 it means that no matches are found: the event lies outside all the possible ranges
if count == 0:
return 0
#if more than one, it is impossible to say to which session the event belongs
if count > 1:
return -99
#equivalent to if count == 1 return: in this case the event belongs clearly to just one session
return current_sessions.index[0][1]
#create a list storing all the player ids
plist = list(df.player.unique())
#ignore settingcopywarning: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas
pd.options.mode.chained_assignment = None
# create an empty new dataframe, where to store the results
final = pd.DataFrame()
#with this for loop iterate over the part of the dataset corresponding to one player at a time
for i in plist:
#slice the dataset by player
players = df.loc[df['player'] == i]
#for every player, take the dates where we are missing the id
mv_per_player = players.loc[players.id.isnull(),'date']
#for each player, groupby player id, and compute the first and last event
my_agg = players.groupby(['player', 'id']).date.agg([min, max])
#apply the function to each chunk of the dataset. You obtain a series, with all the imputed values for the Nan
ema = mv_per_player.apply(my_custom_function)
#now we can sobstitute the missing id with the new imputed values...
players.loc[players.id.isnull(),'id'] = ema.values
#append new values stored in players to the new dataframe
final = final.append(players)
#...and check the new dataset
final
player date id
0 1 2018-01-01 00:17:01 0
1 1 2018-01-01 00:17:05 0
2 1 2018-01-01 00:19:05 a
3 1 2018-01-01 00:21:07 a
4 1 2018-01-01 00:22:09 b
5 1 2018-01-01 00:22:17 b
6 1 2018-01-01 00:25:09 b
7 1 2018-01-01 00:25:11 c
8 1 2018-01-01 00:27:28 c
9 1 2018-01-01 00:29:29 c
10 1 2018-01-01 00:30:35 c
11 2 2018-02-01 00:31:16 d
12 2 2018-02-01 00:35:22 d
13 2 2018-02-01 00:38:16 e
14 2 2018-02-01 00:38:20 e
15 2 2018-02-01 00:55:15 0
16 3 2018-01-03 00:55:22 f
17 3 2018-01-03 00:58:16 f
18 3 2018-01-03 00:58:21 g
19 3 2018-03-01 01:00:35 -99
20 3 2018-03-01 01:20:16 f
21 3 2018-03-01 01:31:16 g
I do not think that my solution is the best, and still would appreciate other ideas, especially if they are more easily scalable (I have a large dataset).
Thanks for contributing an answer to Stack Overflow!
But avoid …
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
But avoid …
To learn more, see our tips on writing great answers.
Required, but never shown
Required, but never shown
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
Thank you @Chris for your help, but actually ffill it's not was I was expecting. What I needed was to see if every row with a missing session id falls within range of the available sessions, and I need to do this comparison player by player. I think I found a solution (see my code below).
– EAMC
Sep 3 at 9:37