Storing multi-dimensional attributes (columns) in pandas DataFrame
Storing multi-dimensional attributes (columns) in pandas DataFrame
What is the best way of storing items with several entries (fixed length) in a pandas
DataFrame
column? I’m thinking of something like a 3D position vector. For example, if my DataFrame
is storing data about a bunch of physical objects, it could look like this:
pandas
DataFrame
DataFrame
df = pandas.DataFrame(
'type': [1, 2, 1, 1, 3],
'mass': [1.1, 2.2, 3.3, 4.4, 5.5],
'pos': [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]],
'vel': [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]]
)
# mass pos type vel
# 0 1.1 [1, 2, 3] 1 [1, 2, 3]
# 1 2.2 [4, 5, 6] 2 [4, 5, 6]
# 2 3.3 [7, 8, 9] 1 [7, 8, 9]
# 3 4.4 [10, 11, 12] 1 [10, 11, 12]
# 4 5.5 [13, 14, 15] 3 [13, 14, 15]
Here, the columns 'pos'
and 'vel'
are the positions and velocities of the objects in 3D space.
'pos'
'vel'
I have thought of several options, none of which seem ideal or even work:
Storing Python lists as values in the column. This is basically what I showed in the example above. Unfortunately, this is very inefficient.
Splitting the column into several different ones:
df = pandas.DataFrame(
'type': [1, 2, 1, 1, 3],
'mass': [1.1, 2.2, 3.3, 4.4, 5.5],
'x': [1, 4, 7, 10, 13],
'y': [2, 5, 8, 11, 14],
'z': [3, 6, 8, 12, 15],
'vx': [1, 4, 7, 10, 13],
'vy': [2, 5, 8, 11, 14],
'vz': [3, 6, 8, 12, 15]
)
# mass type vx vy vz x y z
# 0 1.1 1 1 2 3 1 2 3
# 1 2.2 2 4 5 6 4 5 6
# 2 3.3 1 7 8 8 7 8 8
# 3 4.4 1 10 11 12 10 11 12
# 4 5.5 3 13 14 15 13 14 15
This seems like it would become cumbersome for larger attributes. But at least it’s efficient?
I’ve also tried assigning a multi-dimensional numpy
array to the column, but unfortunately, pandas
refuses:
numpy
pandas
pos = numpy.array([[11, 12, 13],
[22, 23, 24],
[33, 34, 35],
[44, 45, 46],
[55, 56, 57]])
df.loc[:, 'pos'] = pos
# ---------------------------------------------------------------------------
# ValueError Traceback (most recent call last)
# <ipython-input-228-2ee95dd5aa19> in <module>()
# ----> 1 df.loc[:, 'pos'] = pos
#
# /opt/anaconda-3/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
# 177 key = com._apply_if_callable(key, self.obj)
# 178 indexer = self._get_setitem_indexer(key)
# --> 179 self._setitem_with_indexer(indexer, value)
# 180
# 181 def _has_valid_type(self, k, axis):
#
# /opt/anaconda-3/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value)
# 561 value = np.array(value, dtype=object)
# 562 if len(labels) != value.shape[1]:
# --> 563 raise ValueError('Must have equal len keys and value '
# 564 'when setting with an ndarray')
# 565
#
# ValueError: Must have equal len keys and value when setting with an ndarray
If want use vectorized operations only possible solution is
2. Splitting the column into several different ones
or use numpy 2d arrays
– jezrael
Aug 29 at 13:17
2. Splitting the column into several different ones
numpy 2d arrays
2 Answers
2
I like this
d = pd.concat([
df[['mass', 'type']],
pd.DataFrame(df.pos.tolist(), df.index, ['x', 'y', 'z']),
pd.DataFrame(df.vel.tolist(), df.index, ['x', 'y', 'z'])
], axis=1, keys=['Scalar', 'Position', 'Velocity'])
d
Scalar Position Velocity
mass type x y z x y z
0 1.1 1 1 2 3 1 2 3
1 2.2 2 4 5 6 4 5 6
2 3.3 1 7 8 9 7 8 9
3 4.4 1 10 11 12 10 11 12
4 5.5 3 13 14 15 13 14 15
You can easily access from top level
d.Velocity
x y z
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
4 13 14 15
Or do math
(d.Velocity + d.Position).div(d.Scalar.mass, axis=0)
x y z
0 1.818182 3.636364 5.454545
1 3.636364 4.545455 5.454545
2 4.242424 4.848485 5.454545
3 4.545455 5.000000 5.454545
4 4.727273 5.090909 5.454545
And you still can get easy access to an appropriate Numpy array
d.Position.values
array([[ 1, 2, 3],
[ 4, 5, 6],
[ 7, 8, 9],
[10, 11, 12],
[13, 14, 15]])
Use Option #2: store coordinates in integer series across multiple columns. It's the only option that makes sense with Pandas.
The main property you should consider is the dtype
of your resultant series. With Option #1, you will have object
series, which is nothing more than a sequence of pointers. This can be achieved equally well with list
and you lose all ability to perform vectorised calculations.
dtype
object
list
With Option #3, Pandas gets confused trying to assign a sequence of NumPy arrays to a single series. The error in itself demonstrates Pandas is not designed to be used in this way.
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
using numpy arrays you can store them, as of now I can say that
– CSMaverick
Aug 29 at 13:11