Pandas: vectorization with function on two dataframes



I'm having trouble with implementing vectorization in pandas. Let me preface this by saying I am a total newbie to vectorization so it's extremely likely that I'm getting some syntax wrong.



Let's say I've got two pandas dataframes.



Dataframe one describes the x,y coordinates of some circles with radius R, with unique IDs.


>>> import numpy as np
>>> import pandas as pd
>>> data1 = {'ID': [1, 2], 'x': [1, 10], 'y': [1, 10], 'R': [4, 5]}
>>> df_1=pd.DataFrame(data=data1)
>>>
>>> df_1
   ID   x   y  R
0   1   1   1  4
1   2  10  10  5



Dataframe two describes the x,y coordinates of some points, also with unique IDs.


>>> data2 = {'ID': [3, 4, 5], 'x': [1, 3, 9], 'y': [2, 5, 9]}
>>> df_2=pd.DataFrame(data=data2)
>>>
>>> df_2
   ID  x  y
0   3  1  2
1   4  3  5
2   5  9  9



Now, imagine plotting the circles and the points on a 2D plane. Some of the points will reside inside the circles. See the image below.



[Image: the circles and points plotted on a 2D plane]



All I want to do is create a new column in df_2 called "host_circle" that indicates the ID of the circle that each point resides in. If the point does not reside in a circle, the value should be "None".



My desired output would be


>>> df_2
   ID  x  y host_circle
0   3  1  2           1
1   4  3  5        None
2   5  9  9           2



First, define a function that checks if a given particle (x2,y2) resides inside a given circle (x1,y1,R1,ID_1). If it does, return the ID of the circle; else, return None.


>>> def func(x1,y1,R1,ID_1,x2,y2):
...     dist = np.sqrt( (x1-x2)**2 + (y1-y2)**2 )
...     if dist < R1:
...         return ID_1
...     else:
...         return None



Next, the actual vectorization. I'm sorta lost here. I think it should be something like


df_2['host']=func(df_1['x'],df_1['y'],df_1['R'],df_1['ID'],df_2['x'],df_2['y'])



but that just throws errors. Can someone help me?
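A likely reason for the errors, sketched here with the example data: the call passes whole Series to a function written for scalars. The two frames have different lengths, so the subtraction aligns on the index and pads with NaN, and the comparison inside `if` cannot be reduced to a single True/False. The function is repeated with the comparison written against the `R1` argument; the exact error message depends on the pandas version, so only the exception type is checked.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({'ID': [1, 2], 'x': [1, 10], 'y': [1, 10], 'R': [4, 5]})
df_2 = pd.DataFrame({'ID': [3, 4, 5], 'x': [1, 3, 9], 'y': [2, 5, 9]})

def func(x1, y1, R1, ID_1, x2, y2):
    dist = np.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)
    if dist < R1:  # here dist and R1 are Series, not single numbers
        return ID_1
    return None

# Subtracting Series of different lengths aligns on the index:
# label 2 exists only in df_2, so the result there is NaN.
diff = df_1['x'] - df_2['x']

try:
    func(df_1['x'], df_1['y'], df_1['R'], df_1['ID'], df_2['x'], df_2['y'])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(diff)
print(error_message)
```

So the function has to be applied to every (point, circle) pair rather than to two columns side by side.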



One final note: My actual data I'm working with is VERY large; tens of millions of rows. Speed is crucial, hence why I'm trying to make vectorization work.





Can you post your desired output? What you have now is not vectorized
– user3483203
Aug 24 at 3:54






^ Done, I've edited my original post.
– Programmer
Aug 24 at 3:58





also added a picture for clarity
– Programmer
Aug 24 at 4:12






What do you want to do if a point lies in multiple circles?
– user3483203
Aug 24 at 4:14





@user3483203 The approach I used, it is the last one that will be assigned as host. I could alter this by reversing the array when I assign. If we wanted to assign the closest? I'd have to sort along an axis, track the argsort positions, and unwind them.
– piRSquared
Aug 24 at 4:31




2 Answers



You might have to install numba with


pip install numba



Then use numba's JIT compiler via the njit function decorator


import numpy as np
from numba import njit

@njit
def distances(point, points):
    # distance from one point to every row of points
    return ((points - point) ** 2).sum(1) ** .5

@njit
def find_my_circle(point, circles):
    points = circles[:, :2]
    radii = circles[:, 2]
    dist = distances(point, points)
    mask = dist < radii
    i = mask.argmax()
    # position of the first containing circle, or -1 if there is none
    return i if mask[i] else -1

@njit
def find_my_circles(points, circles):
    n = len(points)
    out = np.zeros(n, np.int64)
    for i in range(n):
        out[i] = find_my_circle(points[i], circles)
    return out

# position -1 (no host) picks the trailing NaN
ids = np.append(df_1.ID.values, np.nan)
points = df_2[['x', 'y']].values

i = find_my_circles(points, df_1[['x', 'y', 'R']].values)
df_2['host_circle'] = ids[i]

df_2

   ID  x  y  host_circle
0   3  1  2          1.0
1   4  3  5          NaN
2   5  9  9          2.0



This iterates row by row: for one point at a time, it tries to find the host circle. The distance calculation for each point against all circles is still vectorized, and the compiled loop should be very fast. The massive benefit is that you don't occupy tons of memory.



This one is more loopy, but it short-circuits as soon as it finds a host


import numpy as np
from numba import njit

@njit
def distance(a, b):
    return ((a - b) ** 2).sum() ** .5

@njit
def find_my_circles(points, circles):
    n = len(points)
    m = len(circles)

    # -1 means "no host circle found"
    out = -np.ones(n, np.int64)

    centers = circles[:, :2]
    radii = circles[:, 2]

    for i in range(n):
        for j in range(m):
            if distance(points[i], centers[j]) < radii[j]:
                out[i] = j
                break  # stop at the first circle that contains the point

    return out

# position -1 (no host) picks the trailing NaN
ids = np.append(df_1.ID.values, np.nan)
points = df_2[['x', 'y']].values

i = find_my_circles(points, df_1[['x', 'y', 'R']].values)
df_2['host_circle'] = ids[i]

df_2



Vectorized, but still problematic for large data (it needs a lot of memory)


c = ['x', 'y']
centers = df_1[c].values
points = df_2[c].values
radii = df_1['R'].values

i, j = np.where(((points[:, None] - centers) ** 2).sum(2) ** .5 < radii)

df_2.loc[df_2.index[i], 'host_circle'] = df_1['ID'].iloc[j].values

df_2

   ID  x  y  host_circle
0   3  1  2          1.0
1   4  3  5          NaN
2   5  9  9          2.0
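If the full broadcast does not fit in memory, one option is to apply the same idea to a slice of the points at a time. This is only a sketch: the helper name `host_circles_chunked` and the `chunk_size` parameter are made up for illustration, and the chunk size would need tuning to the number of circles. It bounds memory but not the total number of comparisons.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({'ID': [1, 2], 'x': [1, 10], 'y': [1, 10], 'R': [4, 5]})
df_2 = pd.DataFrame({'ID': [3, 4, 5], 'x': [1, 3, 9], 'y': [2, 5, 9]})

def host_circles_chunked(points, centers, radii, ids, chunk_size=1_000):
    """Hypothetical helper: ID of a containing circle per point, NaN if none."""
    out = np.full(len(points), np.nan)
    for start in range(0, len(points), chunk_size):
        chunk = points[start:start + chunk_size]
        # (len(chunk), n_circles) mask built for this chunk only;
        # comparing squared distances avoids the square root
        inside = ((chunk[:, None] - centers) ** 2).sum(2) < radii ** 2
        i, j = np.where(inside)
        # i is relative to the chunk, so shift it by start
        out[start + i] = ids[j]
    return out

c = ['x', 'y']
hosts = host_circles_chunked(
    df_2[c].values, df_1[c].values, df_1['R'].values, df_1['ID'].values,
    chunk_size=2,  # tiny value just to exercise more than one chunk
)
df_2['host_circle'] = hosts
print(df_2)
```

The result for the example data is the same as with the single broadcast; only the peak size of the intermediate arrays changes.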



The distance from a point to the center of a circle is


((x1 - x0) ** 2 + (y1 - y0) ** 2) ** .5



I can use broadcasting if I extend one of my arrays into a third dimension


points[:, None] - centers

array([[[ 0,  1],
        [-9, -8]],

       [[ 2,  4],
        [-7, -5]],

       [[ 8,  8],
        [-1, -1]]])



That is all six combinations of vector differences. Now to calculate the distances.


((points[:, None] - centers) ** 2).sum(2) ** .5

array([[ 1.        , 12.04159458],
       [ 4.47213595,  8.60232527],
       [11.3137085 ,  1.41421356]])



That's all six combinations of distances, and I can compare them against the radii to see which are within the circles


((points[:, None] - centers) ** 2).sum(2) ** .5 < radii

array([[ True, False],
       [False, False],
       [False,  True]])



Ok, I want to find where the True values are. That is a perfect use case for np.where. It will give me two arrays, the first will be the row positions, the second the column positions of where these True values are. Turns out, the row positions are the points and column positions are the circles.


i, j = np.where(((points[:, None] - centers) ** 2).sum(2) ** .5 < radii)



Now I just have to slice df_2 with i somehow and assign to it values I get from df_1 using j somehow... But I showed that above.
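If a point can lie in several circles and the closest one should win, one possible variation is to mask out the circles that do not contain the point and take the `argmin` of the remaining distances. This is a sketch on the example data, not part of the approach above; the names `masked`, `nearest` and `has_host` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({'ID': [1, 2], 'x': [1, 10], 'y': [1, 10], 'R': [4, 5]})
df_2 = pd.DataFrame({'ID': [3, 4, 5], 'x': [1, 3, 9], 'y': [2, 5, 9]})

c = ['x', 'y']
centers = df_1[c].values
points = df_2[c].values
radii = df_1['R'].values

# (n_points, n_circles) distances and containment mask
dist = ((points[:, None] - centers) ** 2).sum(2) ** .5
inside = dist < radii

# circles that do not contain the point are treated as infinitely far away
masked = np.where(inside, dist, np.inf)
nearest = masked.argmin(1)   # position of the closest containing circle
has_host = inside.any(1)     # False where no circle contains the point

host = np.where(has_host, df_1['ID'].values[nearest], np.nan)
df_2['host_circle'] = host
print(df_2)
```

Like the full broadcast, this builds the whole points-by-circles matrix, so it has the same memory cost.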







Seems appropriate you'd answer the circle question :D
– ALollz
Aug 24 at 4:17





I've got this area covered (-:
– piRSquared
Aug 24 at 4:17





This works for this small dataset. However, the actual dataset I'm working with is extremely large; df_1 is 200,000 rows and df_2 is 75 million rows. I run into memory errors on "np.where"
– Programmer
Aug 24 at 4:38





@Programmer it's probably the broadcasting more than np.where
– user3483203
Aug 24 at 4:41
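A rough back-of-the-envelope check of that point, using the row counts quoted in the comment above and assuming 8-byte values (64-bit integers or floats) and a 1-byte boolean per pair:

```python
n_points = 75_000_000   # rows in df_2, as quoted in the comment
n_circles = 200_000     # rows in df_1, as quoted in the comment
bytes_per_value = 8     # assumption: 64-bit integers or floats

# points[:, None] - centers has shape (n_points, n_circles, 2)
diff_bytes = n_points * n_circles * 2 * bytes_per_value
# the boolean mask has shape (n_points, n_circles), 1 byte each
mask_bytes = n_points * n_circles

print(f"difference array: {diff_bytes / 1e12:.0f} TB")
print(f"boolean mask:     {mask_bytes / 1e12:.0f} TB")
```

Under these assumptions the difference array alone is far beyond what fits in RAM, before np.where is even called.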






I'm not surprised. However, you said vectorized and that is what this is. Turns out, you want something that works and won't break your machine but also finishes sometime this year. That I can give you, but it will not be vectorized. Give me a few minutes. I'm supposed to be going to sleep, but...
– piRSquared
Aug 24 at 4:41



Try this. I have modified your function a bit. It returns a list, on the assumption that several circles may contain the same point; you can modify it if that's not the case. The list is empty when the point does not reside in any circle.


def func(df, x2, y2):
    # boolean Series: True for each row (circle) of df that contains the point (x2, y2)
    val = df.apply(lambda row: np.sqrt((row['x']-x2)**2 + (row['y']-y2)**2) < row['R'], axis=1)
    # index labels of the matching rows of df
    return list(val.index[val])

df_2['host'] = df_2.apply(lambda row: func(df_1, row['x'], row['y']), axis=1)





