dask.dataframe.groupby.Aggregation

class dask.dataframe.groupby.Aggregation(name, chunk, agg, finalize=None)[source]

User defined groupby-aggregation.

This class allows users to define their own custom aggregation in terms of operations on Pandas dataframes in a map-reduce style. You need to specify what operation to do on each chunk of data, how to combine those chunks of data together, and then how to finalize the result.

See Aggregate for more.

Parameters
name : str

The name of the aggregation. It should be unique, since intermediate results will be identified by this name.

chunk : callable

A function that will be called with the grouped column of each partition; it takes a Pandas SeriesGroupBy as input. It can return either a single Series or a tuple of Series. The index has to be equal to the groups.

agg : callable

A function that will be called to aggregate the results of each chunk. Again, the argument(s) will be a Pandas SeriesGroupBy. If chunk returned a tuple, agg will be called with all of them as individual positional arguments.

finalize : callable

An optional finalizer that will be called with the results from the aggregation.

Examples

We could implement sum as follows:

>>> custom_sum = dd.Aggregation(
...     name='custom_sum',
...     chunk=lambda s: s.sum(),
...     agg=lambda s0: s0.sum()
... )  
>>> df.groupby('g').agg(custom_sum)  

We can implement mean as follows:

>>> custom_mean = dd.Aggregation(
...     name='custom_mean',
...     chunk=lambda s: (s.count(), s.sum()),
...     agg=lambda count, sum: (count.sum(), sum.sum()),
...     finalize=lambda count, sum: sum / count,
... )  
>>> df.groupby('g').agg(custom_mean)  

Though of course, both of these are built-in and so you don’t need to implement them yourself.
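To see what the three stages do, the map-reduce mechanics behind the custom_mean example can be emulated with plain pandas (no Dask required). This is a minimal sketch, not Dask's actual implementation; the partition data and variable names are illustrative:

```python
import pandas as pd

# Two "partitions" of the same logical dataframe.
part1 = pd.DataFrame({"g": ["a", "a", "b"], "x": [1.0, 2.0, 3.0]})
part2 = pd.DataFrame({"g": ["a", "b", "b"], "x": [4.0, 5.0, 6.0]})

# chunk: called once per partition with the grouped column,
# returning a tuple of Series indexed by the groups.
def chunk(s):
    return s.count(), s.sum()

# agg: combines the per-partition intermediates, which arrive
# grouped again by the same keys.
def agg(count, total):
    return count.sum(), total.sum()

# finalize: turns the aggregated intermediates into the final result.
def finalize(count, total):
    return total / count

# Stage 1: apply chunk to each partition's grouped column.
chunks = [chunk(p.groupby("g")["x"]) for p in (part1, part2)]

# Stage 2: regroup each intermediate across partitions and apply agg.
counts = pd.concat([c for c, _ in chunks]).groupby(level=0)
totals = pd.concat([t for _, t in chunks]).groupby(level=0)
count, total = agg(counts, totals)

# Stage 3: finalize.
result = finalize(count, total)
```

Here `result` matches the group means computed over all partitions at once, which is exactly the guarantee a correct chunk/agg/finalize triple must provide.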

__init__(name, chunk, agg, finalize=None)[source]

Methods

__init__(name, chunk, agg[, finalize])