dask_expr._collection.DataFrame.set_index
DataFrame.set_index(other, drop=True, sorted=False, npartitions: int | None = None, divisions=None, sort: bool = True, shuffle_method=None, upsample: float = 1.0, partition_size: float = 128000000.0, append: bool = False, **options)
Set the DataFrame index (row labels) using an existing column.
If sort=False, this function operates exactly like pandas.set_index and sets the index on the DataFrame. If sort=True (default), this function also sorts the DataFrame by the new index. This can have a significant impact on performance, because joins, groupbys, lookups, etc. are all much faster on that column. However, this performance increase comes with a cost: sorting a parallel dataset requires expensive shuffles. Often we set_index once directly after data ingest and filtering and then perform many cheap computations off of the sorted dataset.

With sort=True, this function is much more expensive. Under normal operation it makes an initial pass over the index column to compute approximate quantiles to serve as future divisions. It then passes over the data a second time, splitting each input partition into several pieces and sending those pieces to the output partitions, now in sorted order.

In some cases we can alleviate those costs: if your dataset is already sorted we can avoid making many small pieces, and if you know good values at which to split the new index column we can avoid the initial pass over the data. For example, if your new index is a datetime index and your data is already sorted by day, then this entire operation can be done for free. You can control these options with the following parameters.
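As a rough illustrative sketch of the difference (using the toy dataset from dask.datasets; the exact timing of the quantile computation may vary by Dask version), known_divisions reports whether the result has recorded divisions:

>>> import dask
>>> ddf = dask.datasets.timeseries().reset_index()
>>> ddf.set_index("timestamp", sort=False).known_divisions  # per-partition only, no shuffle
False
>>> ddf.set_index("timestamp").known_divisions  # default sort=True: shuffle + recorded divisions
True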
Parameters

- other: string or Dask Series
  Column to use as index.
- drop: boolean, default True
  Delete column to be used as the new index.
- sorted: bool, optional
  Whether the index column is already sorted in increasing order. Defaults to False.
- npartitions: int, None, or 'auto'
  The ideal number of output partitions. If None, use the same as the input. If 'auto', decide by memory use. Only used when divisions is not given. If divisions is given, the number of output partitions will be len(divisions) - 1.
- divisions: list, optional
  The "dividing lines" used to split the new index into partitions. For divisions=[0, 10, 50, 100], there would be three output partitions, where the new index contained [0, 10), [10, 50), and [50, 100), respectively (a short sketch after this parameter list makes this concrete). See https://docs.dask.org/en/latest/dataframe-design.html#partitions. If not given (default), good divisions are calculated by immediately computing the data and looking at the distribution of its values. For large datasets, this can be expensive. Note that if sorted=True, specified divisions are assumed to match the existing partitions in the data; if this is untrue you should leave divisions empty and call repartition after set_index.
- inplace: bool, optional
  Modifying the DataFrame in place is not supported by Dask. Defaults to False.
- sort: bool, optional
  If True, sort the DataFrame by the new index. Otherwise set the index on the individual existing partitions. Defaults to True.
- shuffle_method: {'disk', 'tasks', 'p2p'}, optional
  Either 'disk' for single-node operation or 'tasks' and 'p2p' for distributed operation. Will be inferred by your current scheduler.
- compute: bool, default False
  Whether or not to trigger an immediate computation. Defaults to False. Note that even if you set compute=False, an immediate computation will still be triggered if divisions is None.
- partition_size: int, optional
  Desired size of each partition in bytes. Only used when npartitions='auto'.
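To make the divisions bookkeeping concrete, here is a small illustrative sketch with a toy frame (the column names and values are arbitrary; per-partition counts follow from where the keys fall between the divisions):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({"key": [3, 12, 47, 80, 99], "val": range(5)})
>>> toy = dd.from_pandas(pdf, npartitions=1)
>>> out = toy.set_index("key", divisions=[0, 10, 50, 100])
>>> out.npartitions  # len(divisions) - 1
3
>>> out.map_partitions(len).compute().tolist()  # rows landing in [0, 10), [10, 50), [50, 100]
[1, 2, 2]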
Examples
>>> import dask
>>> ddf = dask.datasets.timeseries(start="2021-01-01", end="2021-01-07", freq="1h").reset_index()
>>> ddf2 = ddf.set_index("x")
>>> ddf2 = ddf.set_index(ddf.x)
>>> ddf2 = ddf.set_index(ddf.timestamp, sorted=True)
A common case is when we have a datetime column that we know to be sorted and is cleanly divided by day. We can set this index for free by specifying both that the column is pre-sorted and the particular divisions along which it is separated.
>>> import pandas as pd
>>> divisions = pd.date_range(start="2021-01-01", end="2021-01-07", freq='1D')
>>> divisions
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06', '2021-01-07'],
              dtype='datetime64[ns]', freq='D')
Note that len(divisions) is equal to npartitions + 1. This is because divisions represents the upper and lower bounds of each partition. The first item is the lower bound of the first partition, the second item is the lower bound of the second partition and the upper bound of the first partition, and so on. The second-to-last item is the lower bound of the last partition, and the last (extra) item is the upper bound of the last partition.

>>> ddf2 = ddf.set_index("timestamp", sorted=True, divisions=divisions.tolist())
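Continuing the example above, a quick sanity check of that rule: the seven division values should yield six output partitions.

>>> len(divisions) - 1
6
>>> ddf2.npartitions
6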
If you’ll be running set_index on the same (or similar) datasets repeatedly, you can save time by letting Dask calculate good divisions once, then copy-pasting them for reuse. This is especially helpful when running in a Jupyter notebook:
>>> ddf2 = ddf.set_index("name")  # slow, calculates data distribution
>>> ddf2.divisions
["Alice", "Laura", "Ursula", "Zelda"]
>>> # ^ Now copy-paste this and edit the line above to:
>>> # ddf2 = ddf.set_index("name", divisions=["Alice", "Laura", "Ursula", "Zelda"])