MultiIndex / advanced indexing¶
This section covers indexing with a MultiIndex and other advanced indexing features.
See the Indexing and Selecting Data for general indexing documentation.
Warning
Whether a copy or a reference is returned for a setting operation may
depend on the context. This is sometimes called chained assignment and
should be avoided. See Returning a View versus Copy.
See the cookbook for some advanced strategies.
Hierarchical indexing (MultiIndex)¶
Hierarchical / Multi-level indexing is very exciting as it opens the door to some
quite sophisticated data analysis and manipulation, especially for working with
higher dimensional data. In essence, it enables you to store and manipulate
data with an arbitrary number of dimensions in lower dimensional data
structures like Series (1d) and DataFrame (2d).
In this section, we will show what exactly we mean by “hierarchical” indexing and how it integrates with all of the pandas indexing functionality described above and in prior sections. Later, when discussing group by and pivoting and reshaping data, we’ll show non-trivial applications to illustrate how it aids in structuring data for analysis.
See the cookbook for some advanced strategies.
Changed in version 0.24.0: MultiIndex.labels has been renamed to MultiIndex.codes
and MultiIndex.set_labels to MultiIndex.set_codes.
Creating a MultiIndex (hierarchical index) object¶
The MultiIndex object is the hierarchical analogue of the standard
Index object which typically stores the axis labels in pandas objects. You
can think of MultiIndex as an array of tuples where each tuple is unique. A
MultiIndex can be created from a list of arrays (using
MultiIndex.from_arrays()), an array of tuples (using
MultiIndex.from_tuples()), a crossed set of iterables (using
MultiIndex.from_product()), or a DataFrame (using
MultiIndex.from_frame()). The Index constructor will attempt to return
a MultiIndex when it is passed a list of tuples. The following examples
demonstrate different ways to initialize MultiIndexes.
In [1]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
...: ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
...:
In [2]: tuples = list(zip(*arrays))
In [3]: tuples
Out[3]:
[('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')]
In [4]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
In [5]: index
Out[5]:
MultiIndex([('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')],
names=['first', 'second'])
In [6]: s = pd.Series(np.random.randn(8), index=index)
In [7]: s
Out[7]:
first second
bar one 0.469112
two -0.282863
baz one -1.509059
two -1.135632
foo one 1.212112
two -0.173215
qux one 0.119209
two -1.044236
dtype: float64
When you want every pairing of the elements in two iterables, it can be easier
to use the MultiIndex.from_product() method:
In [8]: iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
In [9]: pd.MultiIndex.from_product(iterables, names=['first', 'second'])
Out[9]:
MultiIndex([('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')],
names=['first', 'second'])
You can also construct a MultiIndex from a DataFrame directly, using
the method MultiIndex.from_frame(). This is a complementary method to
MultiIndex.to_frame().
New in version 0.24.0.
In [10]: df = pd.DataFrame([['bar', 'one'], ['bar', 'two'],
....: ['foo', 'one'], ['foo', 'two']],
....: columns=['first', 'second'])
....:
In [11]: pd.MultiIndex.from_frame(df)
Out[11]:
MultiIndex([('bar', 'one'),
('bar', 'two'),
('foo', 'one'),
('foo', 'two')],
names=['first', 'second'])
As a convenience, you can pass a list of arrays directly into Series or
DataFrame to construct a MultiIndex automatically:
In [12]: arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
....: np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
....:
In [13]: s = pd.Series(np.random.randn(8), index=arrays)
In [14]: s
Out[14]:
bar one -0.861849
two -2.104569
baz one -0.494929
two 1.071804
foo one 0.721555
two -0.706771
qux one -1.039575
two 0.271860
dtype: float64
In [15]: df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
In [16]: df
Out[16]:
0 1 2 3
bar one -0.424972 0.567020 0.276232 -1.087401
two -0.673690 0.113648 -1.478427 0.524988
baz one 0.404705 0.577046 -1.715002 -1.039268
two -0.370647 -1.157892 -1.344312 0.844885
foo one 1.075770 -0.109050 1.643563 -1.469388
two 0.357021 -0.674600 -1.776904 -0.968914
qux one -1.294524 0.413738 0.276662 -0.472035
two -0.013960 -0.362543 -0.006154 -0.923061
All of the MultiIndex constructors accept a names argument which stores
string names for the levels themselves. If no names are provided, None will
be assigned:
In [17]: df.index.names
Out[17]: FrozenList([None, None])
This index can back any axis of a pandas object, and the number of levels of the index is up to you:
In [18]: df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)
In [19]: df
Out[19]:
first bar baz foo qux
second one two one two one two one two
A 0.895717 0.805244 -1.206412 2.565646 1.431256 1.340309 -1.170299 -0.226169
B 0.410835 0.813850 0.132003 -0.827317 -0.076467 -1.187678 1.130127 -1.436737
C -1.413681 1.607920 1.024180 0.569605 0.875906 -2.211372 0.974466 -2.006747
In [20]: pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])