dask.bag.Bag.groupby

Bag.groupby(grouper, method=None, npartitions=None, blocksize=1048576, max_branch=None, shuffle=None)

Group collection by key function

This requires a full dataset read, serialization, and shuffle, which is expensive. If possible, use Bag.foldby instead.
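As a rough sketch of that cheaper alternative, the same even/odd split can often be expressed as a per-key reduction with Bag.foldby; the summing reduction below is only illustrative:

>>> import dask.bag as db
>>> b = db.from_sequence(range(10))
>>> iseven = lambda x: x % 2 == 0
>>> # foldby reduces each group as it goes instead of materializing the
>>> # full groups, so it avoids the expensive shuffle
>>> sorted(b.foldby(iseven, lambda total, x: total + x, 0).compute())
[(False, 25), (True, 20)]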

Parameters
grouper: function
Function on which to group elements

shuffle: str
Either ‘disk’ for an on-disk shuffle or ‘tasks’ to use the task scheduling framework. Use ‘disk’ if you are on a single machine and ‘tasks’ if you are on a distributed cluster.

npartitions: int
If using the disk-based shuffle, the number of output partitions

blocksize: int
If using the disk-based shuffle, the size of shuffle blocks (bytes)

max_branch: int
If using the task-based shuffle, the amount of splitting each partition undergoes. Increase this for fewer copies but more scheduler overhead.
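
As a rough illustration of how these options are passed (the grouping function and parameter values below are arbitrary choices, not recommendations):

>>> import dask.bag as db
>>> b = db.from_sequence(range(1000), npartitions=8)
>>> # task-based shuffle, typically preferred on a distributed cluster
>>> grouped = b.groupby(lambda x: x % 10, shuffle='tasks', max_branch=32)
>>> # disk-based shuffle for a single machine, with explicit output partitions
>>> grouped = b.groupby(lambda x: x % 10, shuffle='disk', npartitions=10)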

See also

Bag.foldby

Examples

>>> import dask.bag as db
>>> b = db.from_sequence(range(10))
>>> iseven = lambda x: x % 2 == 0
>>> dict(b.groupby(iseven).compute())  # element order within groups may vary
{True: [0, 2, 4, 6, 8], False: [1, 3, 5, 7, 9]}
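
The grouped bag is itself a bag of (key, group) pairs, so it can feed further bag operations; this follow-up count is only a sketch:

>>> counts = b.groupby(iseven).map(lambda kv: (kv[0], len(kv[1])))
>>> sorted(counts.compute())
[(False, 5), (True, 5)]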