Pandas: filling missing values by the time occurrence of an event

I already asked a similar question (see here), but unfortunately it was not clear enough, so I decided it was better to create a new one with a better dataset for example and a new explanation of the desired output - an edit would have been really a major change.
So, I have the following dataset (it's already sorted by date and player):

d = 'player': ['1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '3', '3', '3', '3', '3', '3'], 'date': ['2018-01-01 00:17:01', '2018-01-01 00:17:05','2018-01-01 00:19:05', '2018-01-01 00:21:07', '2018-01-01 00:22:09', '2018-01-01 00:22:17', '2018-01-01 00:25:09', '2018-01-01 00:25:11', '2018-01-01 00:27:28', '2018-01-01 00:29:29', '2018-01-01 00:30:35', '2018-02-01 00:31:16', '2018-02-01 00:35:22', '2018-02-01 00:38:16', '2018-02-01 00:38:20', '2018-02-01 00:55:15', '2018-01-03 00:55:22', '2018-01-03 00:58:16', '2018-01-03 00:58:21', '2018-03-01 01:00:35', '2018-03-01 01:20:16', '2018-03-01 01:31:16'], 'id': [np.nan, np.nan, 'a', 'a', 'b', np.nan, 'b', 'c', 'c', 'c', 'c', 'd', 'd', 'e', 'e', np.nan, 'f', 'f', 'g', np.nan, 'f', 'g'] #create dataframe df = pd.DataFrame(data=d) #change date to datetime df['date'] = pd.to_datetime(df['date']) df player date id 0 1 2018-01-01 00:17:01 NaN 1 1 2018-01-01 00:17:05 NaN 2 1 2018-01-01 00:19:05 a 3 1 2018-01-01 00:21:07 a 4 1 2018-01-01 00:22:09 b 5 1 2018-01-01 00:22:07 NaN 6 1 2018-01-01 00:25:09 b 7 1 2018-01-01 00:25:11 c 8 1 2018-01-01 00:27:28 c 9 1 2018-01-01 00:29:29 c 10 1 2018-01-01 00:30:35 c 11 2 2018-02-01 00:31:16 d 12 2 2018-02-01 00:35:22 d 13 2 2018-02-01 00:38:16 e 14 2 2018-02-01 00:38:20 e 15 2 2018-02-01 00:55:15 NaN 16 3 2018-01-03 00:55:22 f 17 3 2018-01-03 00:58:16 f 18 3 2018-01-03 00:58:21 g 19 3 2018-03-01 01:00:35 NaN 20 3 2018-03-01 01:20:16 f 21 3 2018-03-01 01:31:16 g

So, these are my three columns:

Ok, so here I have my missing values:

df.loc[df.id.isnull(),'date'] 0 2018-01-01 00:17:01 1 2018-01-01 00:17:05 5 2018-01-01 00:22:07 15 2018-02-01 00:55:15 19 2018-03-01 01:00:35

Please note that I have the player code for each one of them: what I miss is just the sessioncode. So, I want to compare the timestamp of each missing value with the sessioncode timestamp of the corresponding players.
I was thinking of computing with a groupby the first and last action for each session, for each player (but I do not know if it is the best approach).

my_agg = df.groupby(['player', 'id']).date.agg([min, max]) my_agg min max player id 1 a 2018-01-01 00:19:05 2018-01-01 00:21:07 b 2018-01-01 00:22:09 2018-01-01 00:25:09 c 2018-01-01 00:25:11 2018-01-01 00:30:35 2 d 2018-02-01 00:31:16 2018-02-01 00:35:22 e 2018-02-01 00:38:16 2018-02-01 00:38:20 3 f 2018-01-03 00:55:22 2018-03-01 01:20:16 g 2018-01-03 00:58:21 2018-03-01 01:31:16

Then I would like to match the Nan by player id, and compare the timestamps of each missing values with the range of each session for that player.

In the dataset I try to illustrate three possible scenarios I am interested in:

Desired output:
to sum it up, the outcome should look like this df:

player date id 0 1 2018-01-01 00:17:01 0 1 1 2018-01-01 00:17:05 0 2 1 2018-01-01 00:19:05 a 3 1 2018-01-01 00:21:07 a 4 1 2018-01-01 00:22:09 b 5 1 2018-01-01 00:22:07 b 6 1 2018-01-01 00:25:09 b 7 1 2018-01-01 00:25:11 c 8 1 2018-01-01 00:27:28 c 9 1 2018-01-01 00:29:29 c 10 1 2018-01-01 00:30:35 c 11 2 2018-02-01 00:31:16 d 12 2 2018-02-01 00:35:22 d 13 2 2018-02-01 00:38:16 e 14 2 2018-02-01 00:38:20 e 15 2 2018-02-01 00:55:15 0 16 3 2018-01-03 00:55:22 f 17 3 2018-01-03 00:58:16 f 18 3 2018-01-03 00:58:21 g 19 3 2018-03-01 01:00:35 -99 20 3 2018-03-01 01:20:16 f 21 3 2018-03-01 01:31:16 g

2 Answers
2

May not be the best approach but it does work. basically I am creating some new columns using shift and then used your conditions you mentioned with np.select:

np.select

df['shift'] = df['id'].shift(1) df['shift-1'] = df['id'].shift(-1) df['merge'] = df[['shift','shift-1']].values.tolist() df.drop(columns=['shift','shift-1'], inplace=True) alpha = np.nan:0,'a':1,'b':2,'c':3,'d':4,'e':5,'f':6,'g':7,'h':8 diff = for i in range(len(df)): diff.append(alpha[df['merge'][i][1]] - alpha[df['merge'][i][0]]) df['diff'] = diff conditions = [(df['id'].shift(1).eq(df['id'].shift(-1)) & (df['id'].isna()) & (df['player'].shift(1).eq(df['player'].shift(-1)))), (~df['id'].shift(1).eq(df['id'].shift(-1)) & (df['id'].isna()) & (df['player'].shift(1).eq(df['player']) | df['player'].shift(-1).eq(df['player'])) & (~df['diff'] < 0)), (~df['id'].shift(1).eq(df['id'].shift(-1)) & (df['id'].isna()) & (df['player'].shift(1).eq(df['player']) | df['player'].shift(-1).eq(df['player'])) & (df['diff'] < 0)), ] choices = [df['id'].ffill(), 0, -99 ] df['id'] = np.select(conditions, choices, default = df['id']) df.drop(columns=['merge','diff'], inplace=True) df

out:

Thank you @Chris for your help, but actually ffill it's not was I was expecting. What I needed was to see if every row with a missing session id falls within range of the available sessions, and I need to do this comparison player by player. I think I found a solution (see my code below).
– EAMC
Sep 3 at 9:37

In my solution I just had to work a bit to apply correctly the function wrote by @ysearka in a previous stackoverflow question - see here. The basic challenge was to apply his function player by player.

#define a function to sort the missing values (ysearka function from stackoverflow) def my_custom_function(time): #compare every date event with the range of the sessions. current_sessions = my_agg.loc[(my_agg['min']<time) & (my_agg['max']>time)] #store length, that is the number of matches. count = len(current_sessions) #How many matches are there for any missing id value? # if 0 it means that no matches are found: the event lies outside all the possible ranges if count == 0: return 0 #if more than one, it is impossible to say to which session the event belongs if count > 1: return -99 #equivalent to if count == 1 return: in this case the event belongs clearly to just one session return current_sessions.index[0][1] #create a list storing all the player ids plist = list(df.player.unique()) #ignore settingcopywarning: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas pd.options.mode.chained_assignment = None # create an empty new dataframe, where to store the results final = pd.DataFrame() #with this for loop iterate over the part of the dataset corresponding to one player at a time for i in plist: #slice the dataset by player players = df.loc[df['player'] == i] #for every player, take the dates where we are missing the id mv_per_player = players.loc[players.id.isnull(),'date'] #for each player, groupby player id, and compute the first and last event my_agg = players.groupby(['player', 'id']).date.agg([min, max]) #apply the function to each chunk of the dataset. You obtain a series, with all the imputed values for the Nan ema = mv_per_player.apply(my_custom_function) #now we can sobstitute the missing id with the new imputed values... players.loc[players.id.isnull(),'id'] = ema.values #append new values stored in players to the new dataframe final = final.append(players) #...and check the new dataset final player date id 0 1 2018-01-01 00:17:01 0 1 1 2018-01-01 00:17:05 0 2 1 2018-01-01 00:19:05 a 3 1 2018-01-01 00:21:07 a 4 1 2018-01-01 00:22:09 b 5 1 2018-01-01 00:22:17 b 6 1 2018-01-01 00:25:09 b 7 1 2018-01-01 00:25:11 c 8 1 2018-01-01 00:27:28 c 9 1 2018-01-01 00:29:29 c 10 1 2018-01-01 00:30:35 c 11 2 2018-02-01 00:31:16 d 12 2 2018-02-01 00:35:22 d 13 2 2018-02-01 00:38:16 e 14 2 2018-02-01 00:38:20 e 15 2 2018-02-01 00:55:15 0 16 3 2018-01-03 00:55:22 f 17 3 2018-01-03 00:58:16 f 18 3 2018-01-03 00:58:21 g 19 3 2018-03-01 01:00:35 -99 20 3 2018-03-01 01:20:16 f 21 3 2018-03-01 01:31:16 g

I do not think that my solution is the best, and still would appreciate other ideas, especially if they are more easily scalable (I have a large dataset).

Thanks for contributing an answer to Stack Overflow!

But avoid …

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

But avoid …

To learn more, see our tips on writing great answers.

Required, but never shown

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

搜尋此網誌

Dfyjkt