Reshaping and pivot tables¶
Reshaping by pivoting DataFrame objects¶
 
Data is often stored in so-called “stacked” or “record” format:
In [1]: df
Out[1]: 
         date variable     value
0  2000-01-03        A  0.469112
1  2000-01-04        A -0.282863
2  2000-01-05        A -1.509059
3  2000-01-03        B -1.135632
4  2000-01-04        B  1.212112
5  2000-01-05        B -0.173215
6  2000-01-03        C  0.119209
7  2000-01-04        C -1.044236
8  2000-01-05        C -0.861849
9  2000-01-03        D -2.104569
10 2000-01-04        D -0.494929
11 2000-01-05        D  1.071804
For the curious here is how the above DataFrame was created:
import pandas.util.testing as tm
tm.N = 3
def unpivot(frame):
    N, K = frame.shape
    data = {'value': frame.to_numpy().ravel('F'),
            'variable': np.asarray(frame.columns).repeat(N),
            'date': np.tile(np.asarray(frame.index), K)}
    return pd.DataFrame(data, columns=['date', 'variable', 'value'])
df = unpivot(tm.makeTimeDataFrame())
To select out everything for variable A we could do:
In [2]: df[df['variable'] == 'A']
Out[2]: 
        date variable     value
0 2000-01-03        A  0.469112
1 2000-01-04        A -0.282863
2 2000-01-05        A -1.509059
But suppose we wish to do time series operations with the variables. A better
representation would be where the columns are the unique variables and an
index of dates identifies individual observations. To reshape the data into
this form, we use the DataFrame.pivot() method (also implemented as a
top level function pivot()):
In [3]: df.pivot(index='date', columns='variable', values='value')
Out[3]: 
variable           A         B         C         D
date                                              
2000-01-03  0.469112 -1.135632  0.119209 -2.104569
2000-01-04 -0.282863  1.212112 -1.044236 -0.494929
2000-01-05 -1.509059 -0.173215 -0.861849  1.071804
If the values argument is omitted, and the input DataFrame has more than
one column of values which are not used as column or index inputs to pivot,
then the resulting “pivoted” DataFrame will have hierarchical columns whose topmost level indicates the respective value
column:
In [4]: df['value2'] = df['value'] * 2
In [5]: pivoted = df.pivot(index='date', columns='variable')
In [6]: pivoted
Out[6]: 
               value                                  value2                              
variable           A         B         C         D         A         B         C         D
date                                                                                      
2000-01-03  0.469112 -1.135632  0.119209 -2.104569  0.938225 -2.271265  0.238417 -4.209138
2000-01-04 -0.282863  1.212112 -1.044236 -0.494929 -0.565727  2.424224 -2.088472 -0.989859
2000-01-05 -1.509059 -0.173215 -0.861849  1.071804 -3.018117 -0.346429 -1.723698  2.143608
You can then select subsets from the pivoted DataFrame:
In [7]: pivoted['value2']
Out[7]: 
variable           A         B         C         D
date                                              
2000-01-03  0.938225 -2.271265  0.238417 -4.209138
2000-01-04 -0.565727  2.424224 -2.088472 -0.989859
2000-01-05 -3.018117 -0.346429 -1.723698  2.143608
Note that this returns a view on the underlying data in the case where the data are homogeneously-typed.
Note
pivot() will error with a ValueError: Index contains duplicate
entries, cannot reshape if the index/column pair is not unique. In this
case, consider using pivot_table() which is a generalization
of pivot that can handle duplicate values for one index/column pair.
Reshaping by stacking and unstacking¶
 
Closely related to the pivot() method are the related
stack() and unstack() methods available on
Series and DataFrame. These methods are designed to work together with
MultiIndex objects (see the section on hierarchical indexing). Here are essentially what these methods do:
- stack: “pivot” a level of the (possibly hierarchical) column labels, returning a- DataFramewith an index with a new inner-most level of row labels.
- unstack: (inverse operation of- stack) “pivot” a level of the (possibly hierarchical) row index to the column axis, producing a reshaped- DataFramewith a new inner-most level of column labels.
 
The clearest way to explain is by example. Let’s take a prior example data set from the hierarchical indexing section:
In [8]: tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
   ...:                      'foo', 'foo', 'qux', 'qux'],
   ...:                     ['one', 'two', 'one', 'two',
   ...:                      'one', 'two', 'one', 'two']]))
   ...: 
In [9]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
In [10]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
In [11]: df2 = df[:4]
In [12]: df2
Out[12]: 
                     A         B
first second                    
bar   one     0.721555 -0.706771
      two    -1.039575  0.271860
baz   one    -0.424972  0.567020
      two     0.276232 -1.087401
The stack function “compresses” a level in the DataFrame’s columns to
produce either:
- A Series, in the case of a simple column Index.
- A DataFrame, in the case of aMultiIndexin the columns.
If the columns have a MultiIndex, you can choose which level to stack. The
stacked level becomes the new lowest level in a MultiIndex on the columns:
In [13]: stacked = df2.stack()
In [14]: stacked
Out[14]: 
first  second   
bar    one     A    0.721555
               B   -0.706771
       two     A   -1.039575
               B    0.271860
baz    one     A   -0.424972
               B    0.567020
       two     A    0.276232
               B   -1.087401
dtype: float64
With a “stacked” DataFrame or Series (having a MultiIndex as the
index), the inverse operation of stack is unstack, which by default
unstacks the last level:
In [15]: stacked.unstack()
Out[15]: 
                     A         B
first second                    
bar   one     0.721555 -0.706771
      two    -1.039575  0.271860
baz   one    -0.424972  0.567020
      two     0.276232 -1.087401
In [16]: stacked.unstack(1)