Storing multi-dimensional attributes (columns) in pandas DataFrame

Storing multi-dimensional attributes (columns) in pandas DataFrame



What is the best way of storing items with several entries (fixed length) in a pandas DataFrame column? I’m thinking of something like a 3D position vector. For example, if my DataFrame is storing data about a bunch of physical objects, it could look like this:


pandas


DataFrame


DataFrame


df = pandas.DataFrame(
'type': [1, 2, 1, 1, 3],
'mass': [1.1, 2.2, 3.3, 4.4, 5.5],
'pos': [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]],
'vel': [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]]
)
# mass pos type vel
# 0 1.1 [1, 2, 3] 1 [1, 2, 3]
# 1 2.2 [4, 5, 6] 2 [4, 5, 6]
# 2 3.3 [7, 8, 9] 1 [7, 8, 9]
# 3 4.4 [10, 11, 12] 1 [10, 11, 12]
# 4 5.5 [13, 14, 15] 3 [13, 14, 15]



Here, the columns 'pos' and 'vel' are the positions and velocities of the objects in 3D space.


'pos'


'vel'



I have thought of several options, none of which seem ideal or even work:



Storing Python lists as values in the column. This is basically what I showed in the example above. Unfortunately, this is very inefficient.



Splitting the column into several different ones:


df = pandas.DataFrame(
'type': [1, 2, 1, 1, 3],
'mass': [1.1, 2.2, 3.3, 4.4, 5.5],
'x': [1, 4, 7, 10, 13],
'y': [2, 5, 8, 11, 14],
'z': [3, 6, 8, 12, 15],
'vx': [1, 4, 7, 10, 13],
'vy': [2, 5, 8, 11, 14],
'vz': [3, 6, 8, 12, 15]
)
# mass type vx vy vz x y z
# 0 1.1 1 1 2 3 1 2 3
# 1 2.2 2 4 5 6 4 5 6
# 2 3.3 1 7 8 8 7 8 8
# 3 4.4 1 10 11 12 10 11 12
# 4 5.5 3 13 14 15 13 14 15



This seems like it would become cumbersome for larger attributes. But at least it’s efficient?



I’ve also tried assigning a multi-dimensional numpy array to the column, but unfortunately, pandas refuses:


numpy


pandas


pos = numpy.array([[11, 12, 13],
[22, 23, 24],
[33, 34, 35],
[44, 45, 46],
[55, 56, 57]])
df.loc[:, 'pos'] = pos
# ---------------------------------------------------------------------------
# ValueError Traceback (most recent call last)
# <ipython-input-228-2ee95dd5aa19> in <module>()
# ----> 1 df.loc[:, 'pos'] = pos
#
# /opt/anaconda-3/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
# 177 key = com._apply_if_callable(key, self.obj)
# 178 indexer = self._get_setitem_indexer(key)
# --> 179 self._setitem_with_indexer(indexer, value)
# 180
# 181 def _has_valid_type(self, k, axis):
#
# /opt/anaconda-3/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value)
# 561 value = np.array(value, dtype=object)
# 562 if len(labels) != value.shape[1]:
# --> 563 raise ValueError('Must have equal len keys and value '
# 564 'when setting with an ndarray')
# 565
#
# ValueError: Must have equal len keys and value when setting with an ndarray





using numpy arrays you can store them, as of now I can say that
– CSMaverick
Aug 29 at 13:11





If want use vectorized operations only possible solution is 2. Splitting the column into several different ones or use numpy 2d arrays
– jezrael
Aug 29 at 13:17



2. Splitting the column into several different ones


numpy 2d arrays




2 Answers
2



I like this


d = pd.concat([
df[['mass', 'type']],
pd.DataFrame(df.pos.tolist(), df.index, ['x', 'y', 'z']),
pd.DataFrame(df.vel.tolist(), df.index, ['x', 'y', 'z'])
], axis=1, keys=['Scalar', 'Position', 'Velocity'])

d

Scalar Position Velocity
mass type x y z x y z
0 1.1 1 1 2 3 1 2 3
1 2.2 2 4 5 6 4 5 6
2 3.3 1 7 8 9 7 8 9
3 4.4 1 10 11 12 10 11 12
4 5.5 3 13 14 15 13 14 15



You can easily access from top level


d.Velocity

x y z
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
4 13 14 15



Or do math


(d.Velocity + d.Position).div(d.Scalar.mass, axis=0)

x y z
0 1.818182 3.636364 5.454545
1 3.636364 4.545455 5.454545
2 4.242424 4.848485 5.454545
3 4.545455 5.000000 5.454545
4 4.727273 5.090909 5.454545



And you still can get easy access to an appropriate Numpy array


d.Position.values

array([[ 1, 2, 3],
[ 4, 5, 6],
[ 7, 8, 9],
[10, 11, 12],
[13, 14, 15]])



Use Option #2: store coordinates in integer series across multiple columns. It's the only option that makes sense with Pandas.



The main property you should consider is the dtype of your resultant series. With Option #1, you will have object series, which is nothing more than a sequence of pointers. This can be achieved equally well with list and you lose all ability to perform vectorised calculations.


dtype


object


list



With Option #3, Pandas gets confused trying to assign a sequence of NumPy arrays to a single series. The error in itself demonstrates Pandas is not designed to be used in this way.






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Popular posts from this blog

𛂒𛀶,𛀽𛀑𛂀𛃧𛂓𛀙𛃆𛃑𛃷𛂟𛁡𛀢𛀟𛁤𛂽𛁕𛁪𛂟𛂯,𛁞𛂧𛀴𛁄𛁠𛁼𛂿𛀤 𛂘,𛁺𛂾𛃭𛃭𛃵𛀺,𛂣𛃍𛂖𛃶 𛀸𛃀𛂖𛁶𛁏𛁚 𛂢𛂞 𛁰𛂆𛀔,𛁸𛀽𛁓𛃋𛂇𛃧𛀧𛃣𛂐𛃇,𛂂𛃻𛃲𛁬𛃞𛀧𛃃𛀅 𛂭𛁠𛁡𛃇𛀷𛃓𛁥,𛁙𛁘𛁞𛃸𛁸𛃣𛁜,𛂛,𛃿,𛁯𛂘𛂌𛃛𛁱𛃌𛂈𛂇 𛁊𛃲,𛀕𛃴𛀜 𛀶𛂆𛀶𛃟𛂉𛀣,𛂐𛁞𛁾 𛁷𛂑𛁳𛂯𛀬𛃅,𛃶𛁼

Edmonton

Crossroads (UK TV series)