Sparse data structures
Note
SparseSeries and SparseDataFrame have been deprecated. Their purpose
is served equally well by a Series or DataFrame with
sparse values. See Migrating for tips on migrating.
Pandas provides data structures for efficiently storing sparse data.
These are not necessarily sparse in the typical “mostly 0” sense. Rather, you can view these
objects as being “compressed” where any data matching a specific value (NaN / missing value, though any value
can be chosen, including 0) is omitted. The compressed values are not actually stored in the array.
In [1]: arr = np.random.randn(10)
In [2]: arr[2:-2] = np.nan
In [3]: ts = pd.Series(pd.SparseArray(arr))
In [4]: ts
Out[4]:
0    0.469112
1   -0.282863
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8   -0.861849
9   -2.104569
dtype: Sparse[float64, nan]
Notice the dtype, Sparse[float64, nan]. The nan means that elements in the
array that are nan aren’t actually stored; only the non-nan elements are.
Those non-nan elements have a float64 dtype.
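To make the role of the fill value concrete, here is a minimal sketch that inspects what is physically stored, once with nan as the fill value and once with 0. It assumes a pandas version that has the `Series.sparse` accessor and the `pd.arrays.SparseArray` spelling of the constructor (the examples in this section use `pd.SparseArray`). The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Mostly-NaN data: NaN is the fill value, so only non-NaN entries are stored.
dense = np.array([1.5, np.nan, np.nan, np.nan, -2.0])
s_nan = pd.Series(pd.arrays.SparseArray(dense))

# Mostly-zero data: pick 0 as the fill value so the zeros are omitted instead.
zeros = np.array([0, 0, 7, 0, 0, 0, 3, 0])
s_zero = pd.Series(pd.arrays.SparseArray(zeros, fill_value=0))

# sp_values holds only the entries that differ from the fill value.
stored_nan = s_nan.sparse.sp_values
stored_zero = s_zero.sparse.sp_values

print(s_nan.dtype, stored_nan)
print(s_zero.dtype, stored_zero)

# density is the fraction of entries that are actually stored.
print(s_zero.sparse.density)
```

In both cases the Series still has its full length. Only the storage behind it shrinks to the entries that differ from the fill value.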
The sparse objects exist for memory efficiency reasons. Suppose you had a
large, mostly NA DataFrame:
In [5]: df = pd.DataFrame(np.random.randn(10000, 4))
In [6]: df.iloc[:9998] = np.nan
In [7]: sdf = df.astype(pd.SparseDtype("float", np.nan))
In [8]: sdf.head()
Out[8]:
    0   1   2   3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
In [9]: sdf.dtypes