dask.dataframe.groupby.Aggregation
class dask.dataframe.groupby.Aggregation(name, chunk, agg, finalize=None)
User-defined groupby-aggregation.
This class allows users to define their own custom aggregations, in terms of operations on pandas dataframes, in a map-reduce style. You need to specify what operation to perform on each chunk of data, how to combine those chunk results together, and then how to finalize the result.
See Aggregate for more.
Parameters
- name : str
  The name of the aggregation. It should be unique, since intermediate results will be identified by this name.
- chunk : callable
  A function that will be called with the grouped column of each partition; it takes a pandas SeriesGroupBy as input. It can return either a single series or a tuple of series. The index has to be equal to the groups.
- agg : callable
  A function that will be called to aggregate the results of each chunk. Again, the argument(s) will be a pandas SeriesGroupBy. If chunk returned a tuple, agg will be called with all of them as individual positional arguments.
- finalize : callable, optional
  An optional finalizer that will be called with the results from the aggregation.
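Conceptually, these callables compose in a map-reduce pipeline: chunk maps over partitions, agg reduces the intermediates, and finalize (when given) post-processes the reduced result. A minimal sketch of that flow in plain pandas, where the hypothetical part1 and part2 stand in for the partitioning that Dask performs automatically:

```python
import pandas as pd

# Illustrative sketch (not Dask internals): the split into partitions
# is simulated by hand here; Dask performs it automatically.
part1 = pd.DataFrame({'g': ['a', 'a', 'b'], 'x': [1, 2, 3]})
part2 = pd.DataFrame({'g': ['a', 'b', 'b'], 'x': [4, 5, 6]})

chunk = lambda s: s.sum()   # runs on each partition's SeriesGroupBy
agg = lambda s0: s0.sum()   # combines the per-partition results

# Map step: apply chunk to the grouped column of every partition.
intermediates = [chunk(p.groupby('g')['x']) for p in (part1, part2)]

# Reduce step: concatenate the intermediates, regroup by their index
# (the groups), and let agg combine them into the final result.
combined = pd.concat(intermediates)
result = agg(combined.groupby(combined.index))
```

Note that chunk's output is indexed by the groups, which is what lets the reduce step regroup the concatenated intermediates correctly.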
Examples
We could implement sum as follows:

>>> custom_sum = dd.Aggregation(
...     name='custom_sum',
...     chunk=lambda s: s.sum(),
...     agg=lambda s0: s0.sum(),
... )
>>> df.groupby('g').agg(custom_sum)
We can implement mean as follows:

>>> custom_mean = dd.Aggregation(
...     name='custom_mean',
...     chunk=lambda s: (s.count(), s.sum()),
...     agg=lambda count, sum: (count.sum(), sum.sum()),
...     finalize=lambda count, sum: sum / count,
... )
>>> df.groupby('g').agg(custom_mean)
Though of course, both of these are built-in and so you don’t need to implement them yourself.
Methods
__init__(name, chunk, agg[, finalize])