dask.bag.Bag.groupby

Bag.groupby(grouper, method=None, npartitions=None, blocksize=1048576, max_branch=None, shuffle=None)

Group collection by key function

This requires a full dataset read, serialization, and shuffle, which is expensive. If possible, use Bag.foldby instead.
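As a rough sketch of that cheaper alternative, the same even/odd split can often be expressed as a per-key reduction with Bag.foldby; the summing reduction below is only illustrative:

>>> import dask.bag as db
>>> b = db.from_sequence(range(10))
>>> iseven = lambda x: x % 2 == 0
>>> # foldby reduces each group as it goes instead of materializing the
>>> # full groups, so it avoids the expensive shuffle
>>> sorted(b.foldby(iseven, lambda total, x: total + x, 0).compute())
[(False, 25), (True, 20)]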

Parameters
grouper: function
Function on which to group elements

shuffle: str
Either ‘disk’ for an on-disk shuffle or ‘tasks’ to use the task scheduling framework. Use ‘disk’ if you are on a single machine and ‘tasks’ if you are on a distributed cluster.

npartitions: int
If using the disk-based shuffle, the number of output partitions

blocksize: int
If using the disk-based shuffle, the size of shuffle blocks (bytes)

max_branch: int
If using the task-based shuffle, the amount of splitting each partition undergoes. Increase this for fewer copies but more scheduler overhead.
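
As a rough illustration of how these options are passed (the grouping function and parameter values below are arbitrary choices, not recommendations):

>>> import dask.bag as db
>>> b = db.from_sequence(range(1000), npartitions=8)
>>> # task-based shuffle, typically preferred on a distributed cluster
>>> grouped = b.groupby(lambda x: x % 10, shuffle='tasks', max_branch=32)
>>> # disk-based shuffle for a single machine, with explicit output partitions
>>> grouped = b.groupby(lambda x: x % 10, shuffle='disk', npartitions=10)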

See also

Bag.foldby

Examples

>>> import dask.bag as db
>>> b = db.from_sequence(range(10))
>>> iseven = lambda x: x % 2 == 0
>>> dict(b.groupby(iseven).compute())  # element order within groups may vary
{True: [0, 2, 4, 6, 8], False: [1, 3, 5, 7, 9]}
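
The grouped bag is itself a bag of (key, group) pairs, so it can feed further bag operations; this follow-up count is only a sketch:

>>> counts = b.groupby(iseven).map(lambda kv: (kv[0], len(kv[1])))
>>> sorted(counts.compute())
[(False, 5), (True, 5)]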