Storing multi-dimensional attributes (columns) in pandas DataFrame

What is the best way of storing items with several entries (fixed length) in a pandas DataFrame column? I’m thinking of something like a 3D position vector. For example, if my DataFrame is storing data about a bunch of physical objects, it could look like this:

pandas

DataFrame

df = pandas.DataFrame( 'type': [1, 2, 1, 1, 3], 'mass': [1.1, 2.2, 3.3, 4.4, 5.5], 'pos': [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]], 'vel': [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]] ) # mass pos type vel # 0 1.1 [1, 2, 3] 1 [1, 2, 3] # 1 2.2 [4, 5, 6] 2 [4, 5, 6] # 2 3.3 [7, 8, 9] 1 [7, 8, 9] # 3 4.4 [10, 11, 12] 1 [10, 11, 12] # 4 5.5 [13, 14, 15] 3 [13, 14, 15]

Here, the columns 'pos' and 'vel' are the positions and velocities of the objects in 3D space.

'pos'

'vel'

I have thought of several options, none of which seem ideal or even work:

Storing Python lists as values in the column. This is basically what I showed in the example above. Unfortunately, this is very inefficient.

Splitting the column into several different ones:

df = pandas.DataFrame( 'type': [1, 2, 1, 1, 3], 'mass': [1.1, 2.2, 3.3, 4.4, 5.5], 'x': [1, 4, 7, 10, 13], 'y': [2, 5, 8, 11, 14], 'z': [3, 6, 8, 12, 15], 'vx': [1, 4, 7, 10, 13], 'vy': [2, 5, 8, 11, 14], 'vz': [3, 6, 8, 12, 15] ) # mass type vx vy vz x y z # 0 1.1 1 1 2 3 1 2 3 # 1 2.2 2 4 5 6 4 5 6 # 2 3.3 1 7 8 8 7 8 8 # 3 4.4 1 10 11 12 10 11 12 # 4 5.5 3 13 14 15 13 14 15

This seems like it would become cumbersome for larger attributes. But at least it’s efficient?

I’ve also tried assigning a multi-dimensional numpy array to the column, but unfortunately, pandas refuses:

numpy

pandas

pos = numpy.array([[11, 12, 13], [22, 23, 24], [33, 34, 35], [44, 45, 46], [55, 56, 57]]) df.loc[:, 'pos'] = pos # --------------------------------------------------------------------------- # ValueError Traceback (most recent call last) # <ipython-input-228-2ee95dd5aa19> in <module>() # ----> 1 df.loc[:, 'pos'] = pos # # /opt/anaconda-3/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py in __setitem__(self, key, value) # 177 key = com._apply_if_callable(key, self.obj) # 178 indexer = self._get_setitem_indexer(key) # --> 179 self._setitem_with_indexer(indexer, value) # 180 # 181 def _has_valid_type(self, k, axis): # # /opt/anaconda-3/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value) # 561 value = np.array(value, dtype=object) # 562 if len(labels) != value.shape[1]: # --> 563 raise ValueError('Must have equal len keys and value ' # 564 'when setting with an ndarray') # 565 # # ValueError: Must have equal len keys and value when setting with an ndarray

using numpy arrays you can store them, as of now I can say that
– CSMaverick
Aug 29 at 13:11

If want use vectorized operations only possible solution is 2. Splitting the column into several different ones or use numpy 2d arrays
– jezrael
Aug 29 at 13:17

2. Splitting the column into several different ones

numpy 2d arrays

2 Answers
2

I like this

d = pd.concat([ df[['mass', 'type']], pd.DataFrame(df.pos.tolist(), df.index, ['x', 'y', 'z']), pd.DataFrame(df.vel.tolist(), df.index, ['x', 'y', 'z']) ], axis=1, keys=['Scalar', 'Position', 'Velocity']) d Scalar Position Velocity mass type x y z x y z 0 1.1 1 1 2 3 1 2 3 1 2.2 2 4 5 6 4 5 6 2 3.3 1 7 8 9 7 8 9 3 4.4 1 10 11 12 10 11 12 4 5.5 3 13 14 15 13 14 15

You can easily access from top level

d.Velocity x y z 0 1 2 3 1 4 5 6 2 7 8 9 3 10 11 12 4 13 14 15

Or do math

(d.Velocity + d.Position).div(d.Scalar.mass, axis=0) x y z 0 1.818182 3.636364 5.454545 1 3.636364 4.545455 5.454545 2 4.242424 4.848485 5.454545 3 4.545455 5.000000 5.454545 4 4.727273 5.090909 5.454545

And you still can get easy access to an appropriate Numpy array

d.Position.values array([[ 1, 2, 3], [ 4, 5, 6], [ 7, 8, 9], [10, 11, 12], [13, 14, 15]])

Use Option #2: store coordinates in integer series across multiple columns. It's the only option that makes sense with Pandas.

The main property you should consider is the dtype of your resultant series. With Option #1, you will have object series, which is nothing more than a sequence of pointers. This can be achieved equally well with list and you lose all ability to perform vectorised calculations.

dtype

object

list

With Option #3, Pandas gets confused trying to assign a sequence of NumPy arrays to a single series. The error in itself demonstrates Pandas is not designed to be used in this way.

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

搜尋此網誌

Dfyjkt