Pandas: vectorization with function on two dataframes

I'm having trouble with implementing vectorization in pandas. Let me preface this by saying I am a total newbie to vectorization so it's extremely likely that I'm getting some syntax wrong.

Let's say I've got two pandas dataframes.

Dataframe one describes the x,y coordinates of some circles with radius R, with unique IDs.

>>> import pandas as pd
>>> data1 = {'ID': [1, 2], 'x': [1, 10], 'y': [1, 10], 'R': [4, 5]}
>>> df_1 = pd.DataFrame(data=data1)
>>> df_1
 ID   x   y  R
  1   1   1  4
  2  10  10  5

Dataframe two describes the x,y coordinates of some points, also with unique IDs.

>>> data2 = {'ID': [3, 4, 5], 'x': [1, 3, 9], 'y': [2, 5, 9]}
>>> df_2 = pd.DataFrame(data=data2)
>>> df_2
 ID  x  y
  3  1  2
  4  3  5
  5  9  9

Now, imagine plotting the circles and the points on a 2D plane. Some of the points will reside inside the circles. See the image below.

[Image: the circles and points plotted on a 2D plane]

All I want to do is create a new column in df_2 called "host_circle" that indicates the ID of the circle that each point resides in. If the particle does not reside in a circle, the value should be "None".

My desired output would be

>>> df_2
 ID  x  y host_circle
  3  1  2           1
  4  3  5        None
  5  9  9           2

First, define a function that checks if a given particle (x2,y2) resides inside a given circle (x1,y1,R1,ID_1). If it does, return the ID of the circle; else, return None.

>>> import numpy as np
>>> def func(x1, y1, R1, ID_1, x2, y2):
...     dist = np.sqrt((x1 - x2)**2 + (y1 - y2)**2)
...     if dist < R1:
...         return ID_1
...     else:
...         return None

Next, the actual vectorization. I'm sorta lost here. I think it should be something like

df_2['host']=func(df_1['x'],df_1['y'],df_1['R'],df_1['ID'],df_2['x'],df_2['y'])

but that just throws errors. Can someone help me?

One final note: My actual data I'm working with is VERY large; tens of millions of rows. Speed is crucial, hence why I'm trying to make vectorization work.

Can you post your desired output? What you have now is not vectorized
– user3483203
Aug 24 at 3:54

^ Done, I've edited my original post.
– Programmer
Aug 24 at 3:58

also added a picture for clarity
– Programmer
Aug 24 at 4:12

What do you want to do if a point lies in multiple circles?
– user3483203
Aug 24 at 4:14

@user3483203 The approach I used, it is the last one that will be assigned as host. I could alter this by reversing the array when I assign. If we wanted to assign the closest? I'd have to sort along an axis, track the argsort positions, and unwind them.
– piRSquared
Aug 24 at 4:31

2 Answers

You might have to install numba with

    pip install numba

Then use numba's JIT compiler via the njit function decorator

    import numpy as np
    from numba import njit

    @njit
    def distances(point, points):
        # Euclidean distance from one point to every circle center
        return ((points - point) ** 2).sum(1) ** .5

    @njit
    def find_my_circle(point, circles):
        points = circles[:, :2]
        radii = circles[:, 2]
        dist = distances(point, points)
        mask = dist < radii
        i = mask.argmax()
        # argmax returns 0 when nothing matches, so check the mask itself
        return i if mask[i] else -1

    @njit
    def find_my_circles(points, circles):
        n = len(points)
        out = np.zeros(n, np.int64)
        for i in range(n):
            out[i] = find_my_circle(points[i], circles)
        return out

    points = df_2[['x', 'y']].values
    # NaN sits in the last slot, so position -1 maps to "no host"
    ids = np.append(df_1.ID.values, np.nan)
    i = find_my_circles(points, df_1[['x', 'y', 'R']].values)
    df_2['host_circle'] = ids[i]
    df_2

       ID  x  y  host_circle
    0   3  1  2          1.0
    1   4  3  5          NaN
    2   5  9  9          2.0

This iterates row by row, meaning it tries to find the host circle for one point at a time. The search over the circles for each point is still vectorized, and the loop should be very fast because numba compiles it. The big benefit is that you never hold the full points-by-circles array in memory.

This one is more loopy, but it short-circuits as soon as it finds a host:

    import numpy as np
    from numba import njit

    @njit
    def distance(a, b):
        # Euclidean distance between two 2D points
        return ((a - b) ** 2).sum() ** .5

    @njit
    def find_my_circles(points, circles):
        n = len(points)
        m = len(circles)
        out = -np.ones(n, np.int64)  # -1 means "no host found"
        centers = circles[:, :2]
        radii = circles[:, 2]
        for i in range(n):
            for j in range(m):
                if distance(points[i], centers[j]) < radii[j]:
                    out[i] = j
                    break  # stop at the first circle that contains the point
        return out

    points = df_2[['x', 'y']].values
    ids = np.append(df_1.ID.values, np.nan)
    i = find_my_circles(points, df_1[['x', 'y', 'R']].values)
    df_2['host_circle'] = ids[i]
    df_2
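
If numba is not an option, a similar memory bound can be sketched in plain NumPy by broadcasting over one chunk of points at a time. This is not part of the answer above, just a minimal sketch under stated assumptions: `find_hosts_chunked` and `chunk_size` are hypothetical names, the first matching circle wins, and `chunk_size` has to be picked so that `chunk_size * n_circles * 2` values fit in memory. The total number of distance evaluations is unchanged, so this limits memory rather than run time.

```python
import numpy as np

def find_hosts_chunked(points, centers, radii, chunk_size=1000):
    """Position of the first circle containing each point, or -1 if none does."""
    n = len(points)
    out = np.full(n, -1, dtype=np.int64)
    r2 = radii.astype(np.float64) ** 2      # compare squared distances, no sqrt needed
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # (chunk, n_circles, 2) array of differences for this chunk only
        diff = points[start:stop, None, :] - centers[None, :, :]
        inside = (diff ** 2).sum(axis=2) < r2          # (chunk, n_circles) booleans
        first = inside.argmax(axis=1)                  # first True per row (0 if none)
        has_host = inside.any(axis=1)
        out[start:stop] = np.where(has_host, first, -1)
    return out

# Sample data from the question, as plain arrays
points = np.array([[1, 2], [3, 5], [9, 9]])
centers = np.array([[1, 1], [10, 10]])
radii = np.array([4, 5])
circle_ids = np.array([1, 2])

pos = find_hosts_chunked(points, centers, radii, chunk_size=2)
ids = np.append(circle_ids, np.nan)   # position -1 maps to NaN, as in the numba version
host_circle = ids[pos]
print(pos)          # [ 0 -1  1]
```

The resulting `host_circle` array can be assigned to `df_2['host_circle']` in the same way as in the numba versions.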

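Fully vectorized with broadcasting, but still problematic for large data (see the memory errors reported in the comments)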

    c = ['x', 'y']
    centers = df_1[c].values
    points = df_2[c].values
    radii = df_1['R'].values

    i, j = np.where(((points[:, None] - centers) ** 2).sum(2) ** .5 < radii)

    df_2.loc[df_2.index[i], 'host_circle'] = df_1['ID'].iloc[j].values
    df_2

       ID  x  y  host_circle
    0   3  1  2          1.0
    1   4  3  5          NaN
    2   5  9  9          2.0

The distance from a point (x1, y1) to the center of a circle (x0, y0) is

((x1 - x0) ** 2 + (y1 - y0) ** 2) ** .5

I can use broadcasting if I extend one of my arrays into a third dimension

    points[:, None] - centers

    array([[[ 0,  1],
            [-9, -8]],

           [[ 2,  4],
            [-7, -5]],

           [[ 8,  8],
            [-1, -1]]])

That is all six combinations of vector differences. Now to calculate the distances.

    ((points[:, None] - centers) ** 2).sum(2) ** .5

    array([[ 1.        , 12.04159458],
           [ 4.47213595,  8.60232527],
           [11.3137085 ,  1.41421356]])

That's all six combinations of distances, and I can compare them against the radii to see which points are within which circles

    ((points[:, None] - centers) ** 2).sum(2) ** .5 < radii

    array([[ True, False],
           [False, False],
           [False,  True]])
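
As a rough sketch of why this approach gets expensive, the shapes of the three intermediate arrays above can be checked directly. The helper `diff_nbytes` is a hypothetical name introduced here, and its default assumes 8-byte values; it only estimates the size of the first intermediate array.

```python
import numpy as np

points = np.array([[1, 2], [3, 5], [9, 9]])
centers = np.array([[1, 1], [10, 10]])
radii = np.array([4, 5])

diff = points[:, None] - centers      # shape (3, 2, 2): one 2D difference per pair
dist = (diff ** 2).sum(2) ** .5       # shape (3, 2): one distance per pair
mask = dist < radii                   # shape (3, 2): radii broadcast across rows

def diff_nbytes(n_points, n_circles, itemsize=8):
    """Bytes needed for the (n_points, n_circles, 2) difference array (hypothetical helper)."""
    return n_points * n_circles * 2 * itemsize

print(diff.shape, dist.shape, mask.shape)
# The estimate matches NumPy's own accounting for the small example
print(diff_nbytes(3, 2, diff.itemsize) == diff.nbytes)
```

Under those assumptions, 75 million points against 200,000 circles (the sizes quoted in the comments) would need about 2.4e14 bytes for the difference array alone.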

Ok, I want to find where the True values are. That is a perfect use case for np.where. It will give me two arrays, the first will be the row positions, the second the column positions of where these True values are. Turns out, the row positions are the points and column positions are the circles.

i, j = np.where(((points[:, None] - centers) ** 2).sum(2) ** .5 < radii)

Now I just have to slice df_2 with i somehow and assign to it values I get from df_1 using j somehow... But I showed that above.
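
The comments on the question raise the case where a point lies in several circles. One possible way to pick the closest containing circle is sketched below; it is an illustration and not the code from this answer. The third circle, the fourth point, the `circle_ids` array and the use of -1 for "no host" are all assumptions added for the example.

```python
import numpy as np

# Question data plus one extra point and one overlapping circle (both hypothetical)
points = np.array([[1, 2], [3, 5], [9, 9], [20, 20]])
centers = np.array([[1, 1], [10, 10], [2, 3]])
radii = np.array([4, 5, 3])
circle_ids = np.array([1, 2, 6])

dist = ((points[:, None] - centers) ** 2).sum(2) ** .5
inside = dist < radii

# Ignore circles that do not contain the point, then take the nearest center
masked = np.where(inside, dist, np.inf)
closest = masked.argmin(axis=1)
has_host = inside.any(axis=1)

host = np.where(has_host, circle_ids[closest], -1)   # -1 marks "no host"
print(host)   # [ 1  6  2 -1]
```

The first point lies in circles 1 and 6 and is assigned to circle 1 because that center is nearer. This still builds the full points-by-circles distance array, so it shares the memory cost of the broadcasting approach.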

Seems appropriate you'd answer the circle question :D
– ALollz
Aug 24 at 4:17

I've got this area covered (-:
– piRSquared
Aug 24 at 4:17

This works for this small dataset. However, the actual dataset I'm working with is extremely large; df_1 is 200,000 rows and df_2 is 75 million rows. I run into memory errors on "np.where"
– Programmer
Aug 24 at 4:38

@Programmer it's probably the broadcasting more than np.where
– user3483203
Aug 24 at 4:41

I'm not surprised. However, you said vectorized and that is what this is. Turns out, you want something that works and won't break your machine but also finishes sometime this year. That I can give you, but it will not be vectorized. Give me a few minutes. I'm supposed to be going to sleep, but...
– piRSquared
Aug 24 at 4:41

Try this. I have modified your function a bit for the calculation. It returns a list, on the assumption that several circles may contain the same point; you can modify it if that's not the case. The list is empty if the particle does not reside in any circle.

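    import numpy as np

    def func(df, x2, y2):
        # True for every circle (row of df) that contains the point (x2, y2)
        val = df.apply(lambda row: np.sqrt((row['x'] - x2)**2 + (row['y'] - y2)**2) < row['R'], axis=1)
        # Returns the index labels of the matching rows of df, not the 'ID' values
        return list(val.index[val == True])

    df_2['host'] = df_2.apply(lambda row: func(df_1, row['x'], row['y']), axis=1)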
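
Since the lists above hold index labels of df_1 rather than circle IDs, one extra lookup is needed to get the IDs the question asks for. The following is a minimal sketch of that step on the sample data; the `host_ids` column name is an assumption made for illustration. Note that `apply` with `axis=1` still runs one Python call per row, so this is unlikely to scale to tens of millions of rows.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({'ID': [1, 2], 'x': [1, 10], 'y': [1, 10], 'R': [4, 5]})
df_2 = pd.DataFrame({'ID': [3, 4, 5], 'x': [1, 3, 9], 'y': [2, 5, 9]})

def func(df, x2, y2):
    # Boolean Series: does each circle in df contain the point (x2, y2)?
    val = df.apply(lambda row: np.sqrt((row['x'] - x2) ** 2 + (row['y'] - y2) ** 2) < row['R'], axis=1)
    return list(val.index[val.values])

df_2['host'] = df_2.apply(lambda row: func(df_1, row['x'], row['y']), axis=1)

# Translate df_1 index labels into circle IDs (hypothetical extra column)
df_2['host_ids'] = df_2['host'].apply(lambda labels: df_1.loc[labels, 'ID'].tolist())

print(df_2['host'].tolist())      # [[0], [], [1]]
print(df_2['host_ids'].tolist())  # [[1], [], [2]]
```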
