API

Dataframe

DataFrame(dsk, name, meta, divisions) Parallel Pandas DataFrame
DataFrame.add(other[, axis, level, fill_value]) Addition of dataframe and other, element-wise (binary operator add).
DataFrame.append(other) Append rows of other to the end of this frame, returning a new object.
DataFrame.apply(func[, axis, args, meta]) Parallel version of pandas.DataFrame.apply
DataFrame.assign(**kwargs) Assign new columns to a DataFrame, returning a new object (a copy) with all the original columns in addition to the new ones.
DataFrame.astype(dtype) Cast a pandas object to a specified dtype dtype.
DataFrame.categorize(df[, columns, index, ...]) Convert columns of the DataFrame to category dtype.
DataFrame.columns
DataFrame.compute(**kwargs) Compute this dask collection
DataFrame.corr([method, min_periods, ...]) Compute pairwise correlation of columns, excluding NA/null values
DataFrame.count([axis, split_every]) Return Series with number of non-NA/null observations over requested axis.
DataFrame.cov([min_periods, split_every]) Compute pairwise covariance of columns, excluding NA/null values
DataFrame.cummax([axis, skipna]) Return cumulative max over requested axis.
DataFrame.cummin([axis, skipna]) Return cumulative minimum over requested axis.
DataFrame.cumprod([axis, skipna]) Return cumulative product over requested axis.
DataFrame.cumsum([axis, skipna]) Return cumulative sum over requested axis.
DataFrame.describe([split_every]) Generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
DataFrame.div(other[, axis, level, fill_value]) Floating division of dataframe and other, element-wise (binary operator truediv).
DataFrame.drop(labels[, axis, errors]) Return new object with labels in requested axis removed.
DataFrame.drop_duplicates([split_every, ...]) Return DataFrame with duplicate rows removed, optionally only considering certain columns
DataFrame.dropna([how, subset]) Return object with labels on given axis omitted where alternately any or all of the data are missing
DataFrame.dtypes Return data types
DataFrame.fillna([value, method, limit, axis]) Fill NA/NaN values using the specified method
DataFrame.floordiv(other[, axis, level, ...]) Integer division of dataframe and other, element-wise (binary operator floordiv).
DataFrame.get_partition(n) Get a dask DataFrame/Series representing the nth partition.
DataFrame.groupby([by]) Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.
DataFrame.head([n, npartitions, compute]) First n rows of the dataset
DataFrame.index Return dask Index instance
DataFrame.iterrows() Iterate over DataFrame rows as (index, Series) pairs.
DataFrame.itertuples() Iterate over DataFrame rows as namedtuples, with index value as first element of the tuple.
DataFrame.join(other[, on, how, lsuffix, ...]) Join columns with other DataFrame either on index or on a key column.
DataFrame.known_divisions Whether divisions are already known
DataFrame.loc Purely label-location based indexer for selection by label.
DataFrame.map_partitions(func, *args, **kwargs) Apply Python function on each DataFrame partition.
DataFrame.mask(cond[, other]) Return an object of same shape as self and whose corresponding entries are from self where cond is False and otherwise are from other.
DataFrame.max([axis, skipna, split_every]) This method returns the maximum of the values in the object.
DataFrame.mean([axis, skipna, split_every]) Return the mean of the values for the requested axis
DataFrame.merge(right[, how, on, left_on, ...]) Merge DataFrame objects by performing a database-style join operation by columns or indexes.
DataFrame.min([axis, skipna, split_every]) This method returns the minimum of the values in the object.
DataFrame.mod(other[, axis, level, fill_value]) Modulo of dataframe and other, element-wise (binary operator mod).
DataFrame.mul(other[, axis, level, fill_value]) Multiplication of dataframe and other, element-wise (binary operator mul).
DataFrame.ndim Return dimensionality
DataFrame.nlargest([n, columns, split_every]) Get the rows of a DataFrame sorted by the n largest values of columns.
DataFrame.npartitions Return number of partitions
DataFrame.pow(other[, axis, level, fill_value]) Exponential power of dataframe and other, element-wise (binary operator pow).
DataFrame.quantile([q, axis]) Approximate row-wise and precise column-wise quantiles of DataFrame
DataFrame.query(expr, **kwargs) Filter dataframe with complex expression
DataFrame.radd(other[, axis, level, fill_value]) Addition of dataframe and other, element-wise (binary operator radd).
DataFrame.random_split(frac[, random_state]) Pseudorandomly split dataframe into different pieces row-wise
DataFrame.rdiv(other[, axis, level, fill_value]) Floating division of dataframe and other, element-wise (binary operator rtruediv).
DataFrame.rename([index, columns]) Alter axes labels.
DataFrame.repartition([divisions, ...]) Repartition dataframe along new divisions
DataFrame.reset_index([drop]) Reset the index to the default index.
DataFrame.rfloordiv(other[, axis, level, ...]) Integer division of dataframe and other, element-wise (binary operator rfloordiv).
DataFrame.rmod(other[, axis, level, fill_value]) Modulo of dataframe and other, element-wise (binary operator rmod).
DataFrame.rmul(other[, axis, level, fill_value]) Multiplication of dataframe and other, element-wise (binary operator rmul).
DataFrame.rpow(other[, axis, level, fill_value]) Exponential power of dataframe and other, element-wise (binary operator rpow).
DataFrame.rsub(other[, axis, level, fill_value]) Subtraction of dataframe and other, element-wise (binary operator rsub).
DataFrame.rtruediv(other[, axis, level, ...]) Floating division of dataframe and other, element-wise (binary operator rtruediv).
DataFrame.sample(frac[, replace, random_state]) Random sample of items
DataFrame.set_index(other[, drop, sorted, ...]) Set the DataFrame index (row labels) using an existing column
DataFrame.std([axis, skipna, ddof, split_every]) Return sample standard deviation over requested axis.
DataFrame.sub(other[, axis, level, fill_value]) Subtraction of dataframe and other, element-wise (binary operator sub).
DataFrame.sum([axis, skipna, split_every]) Return the sum of the values for the requested axis
DataFrame.tail([n, compute]) Last n rows of the dataset
DataFrame.to_bag([index]) Create Dask Bag from a Dask DataFrame
DataFrame.to_csv(filename, **kwargs) Store Dask DataFrame to CSV files
DataFrame.to_delayed() Create Dask Delayed objects from a Dask Dataframe
DataFrame.to_hdf(path_or_buf, key[, mode, ...]) Store Dask Dataframe to Hierarchical Data Format (HDF) files
DataFrame.to_records([index]) Create Dask Array from a Dask Dataframe
DataFrame.truediv(other[, axis, level, ...]) Floating division of dataframe and other, element-wise (binary operator truediv).
DataFrame.values Return a dask.array of the values of this dataframe
DataFrame.var([axis, skipna, ddof, split_every]) Return unbiased variance over requested axis.
DataFrame.visualize([filename, format, ...]) Render the computation of this object’s task graph using graphviz.
DataFrame.where(cond[, other]) Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.

Series

Series(dsk, name, meta, divisions) Parallel Pandas Series
Series.add(other[, level, fill_value, axis]) Addition of series and other, element-wise (binary operator add).
Series.align(other[, join, axis, fill_value]) Align two objects on their axes with the specified join method for each axis Index
Series.all([axis, skipna, split_every]) Return whether all elements are True over requested axis
Series.any([axis, skipna, split_every]) Return whether any element is True over requested axis
Series.append(other) Concatenate two or more Series.
Series.apply(func[, convert_dtype, meta, args]) Parallel version of pandas.Series.apply
Series.astype(dtype) Cast a pandas object to a specified dtype dtype.
Series.autocorr([lag, split_every]) Lag-N autocorrelation
Series.between(left, right[, inclusive]) Return boolean Series equivalent to left <= series <= right.
Series.bfill([axis, limit]) Synonym for DataFrame.fillna(method='bfill')
Series.cat
Series.clear_divisions() Forget division information
Series.clip([lower, upper, out]) Trim values at input threshold(s).
Series.clip_lower(threshold) Return copy of the input with values below given value(s) truncated.
Series.clip_upper(threshold) Return copy of input with values above given value(s) truncated.
Series.compute(**kwargs) Compute this dask collection
Series.copy() Make a copy of the dataframe
Series.corr(other[, method, min_periods, ...]) Compute correlation with other Series, excluding missing values
Series.count([split_every]) Return number of non-NA/null observations in the Series
Series.cov(other[, min_periods, split_every]) Compute covariance with Series, excluding missing values
Series.cummax([axis, skipna]) Return cumulative max over requested axis.
Series.cummin([axis, skipna]) Return cumulative minimum over requested axis.
Series.cumprod([axis, skipna]) Return cumulative product over requested axis.
Series.cumsum([axis, skipna]) Return cumulative sum over requested axis.
Series.describe([split_every]) Generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
Series.diff([periods, axis]) 1st discrete difference of object
Series.div(other[, level, fill_value, axis]) Floating division of series and other, element-wise (binary operator truediv).
Series.drop_duplicates([split_every, split_out]) Return DataFrame with duplicate rows removed, optionally only considering certain columns
Series.dropna() Return Series without null values
Series.dt
Series.dtype Return data type
Series.eq(other[, level, axis]) Equal to of series and other, element-wise (binary operator eq).
Series.ffill([axis, limit]) Synonym for DataFrame.fillna(method='ffill')
Series.fillna([value, method, limit, axis]) Fill NA/NaN values using the specified method
Series.first(offset) Convenience method for subsetting initial periods of time series data based on a date offset.
Series.floordiv(other[, level, fill_value, axis]) Integer division of series and other, element-wise (binary operator floordiv).
Series.ge(other[, level, axis]) Greater than or equal to of series and other, element-wise (binary operator ge).
Series.get_partition(n) Get a dask DataFrame/Series representing the nth partition.
Series.groupby([by]) Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.
Series.gt(other[, level, axis]) Greater than of series and other, element-wise (binary operator gt).
Series.head([n, npartitions, compute]) First n rows of the dataset
Series.idxmax([axis, skipna, split_every]) Return index of first occurrence of maximum over requested axis.
Series.idxmin([axis, skipna, split_every]) Return index of first occurrence of minimum over requested axis.
Series.isin(values) Return a boolean Series showing whether each element in the Series is exactly contained in the passed sequence of values.
Series.isnull() Return a boolean same-sized object indicating if the values are NA.
Series.iteritems() Lazily iterate over (index, value) tuples
Series.known_divisions Whether divisions are already known
Series.last(offset) Convenience method for subsetting final periods of time series data based on a date offset.
Series.le(other[, level, axis]) Less than or equal to of series and other, element-wise (binary operator le).
Series.loc Purely label-location based indexer for selection by label.
Series.lt(other[, level, axis]) Less than of series and other, element-wise (binary operator lt).
Series.map(arg[, na_action, meta]) Map values of Series using input correspondence (which can be a dict, Series, or function).
Series.map_overlap(func, before, after, ...) Apply a function to each partition, sharing rows with adjacent partitions.
Series.map_partitions(func, *args, **kwargs) Apply Python function on each DataFrame partition.
Series.mask(cond[, other]) Return an object of same shape as self and whose corresponding entries are from self where cond is False and otherwise are from other.
Series.max([axis, skipna, split_every]) This method returns the maximum of the values in the object.
Series.mean([axis, skipna, split_every]) Return the mean of the values for the requested axis
Series.memory_usage([index, deep]) Memory usage of the Series
Series.min([axis, skipna, split_every]) This method returns the minimum of the values in the object.
Series.mod(other[, level, fill_value, axis]) Modulo of series and other, element-wise (binary operator mod).
Series.mul(other[, level, fill_value, axis]) Multiplication of series and other, element-wise (binary operator mul).
Series.nbytes Number of bytes
Series.ndim Return dimensionality
Series.ne(other[, level, axis]) Not equal to of series and other, element-wise (binary operator ne).
Series.nlargest([n, split_every]) Return the largest n elements.
Series.notnull() Return a boolean same-sized object indicating if the values are not NA.
Series.nsmallest([n, split_every]) Return the smallest n elements.
Series.nunique([split_every]) Return number of unique elements in the object.
Series.nunique_approx([split_every]) Approximate number of unique rows.
Series.persist(**kwargs) Persist this dask collection into memory
Series.pipe(func, *args, **kwargs) Apply func(self, *args, **kwargs)
Series.pow(other[, level, fill_value, axis]) Exponential power of series and other, element-wise (binary operator pow).
Series.prod([axis, skipna, split_every]) Return the product of the values for the requested axis
Series.quantile([q]) Approximate quantiles of Series
Series.radd(other[, level, fill_value, axis]) Addition of series and other, element-wise (binary operator radd).
Series.random_split(frac[, random_state]) Pseudorandomly split dataframe into different pieces row-wise
Series.rdiv(other[, level, fill_value, axis]) Floating division of series and other, element-wise (binary operator rtruediv).
Series.reduction(chunk[, aggregate, ...]) Generic row-wise reductions.
Series.repartition([divisions, npartitions, ...]) Repartition dataframe along new divisions
Series.resample(rule[, how, closed, label]) Convenience method for frequency conversion and resampling of time series.
Series.reset_index([drop]) Reset the index to the default index.
Series.rolling(window[, min_periods, freq, ...]) Provides rolling transformations.
Series.round([decimals]) Round each value in a Series to the given number of decimals.
Series.sample(frac[, replace, random_state]) Random sample of items
Series.sem([axis, skipna, ddof, split_every]) Return unbiased standard error of the mean over requested axis.
Series.shift([periods, freq, axis]) Shift index by desired number of periods with an optional time freq
Series.size Size of the series
Series.std([axis, skipna, ddof, split_every]) Return sample standard deviation over requested axis.
Series.str
Series.sub(other[, level, fill_value, axis]) Subtraction of series and other, element-wise (binary operator sub).
Series.sum([axis, skipna, split_every]) Return the sum of the values for the requested axis
Series.to_bag([index]) Create a Dask Bag from a Series
Series.to_csv(filename, **kwargs) Store Dask DataFrame to CSV files
Series.to_delayed() Create Dask Delayed objects from a Dask Dataframe
Series.to_frame([name]) Convert Series to DataFrame
Series.to_hdf(path_or_buf, key[, mode, ...]) Store Dask Dataframe to Hierarchical Data Format (HDF) files
Series.to_parquet(path, *args, **kwargs) Store Dask.dataframe to Parquet files
Series.to_string([max_rows]) Render a string representation of the Series
Series.to_timestamp([freq, how, axis]) Cast to DatetimeIndex of timestamps, at beginning of period
Series.truediv(other[, level, fill_value, axis]) Floating division of series and other, element-wise (binary operator truediv).
Series.unique([split_every, split_out]) Return Series of unique values in the object.
Series.value_counts([split_every, split_out]) Returns object containing counts of unique values.
Series.values Return a dask.array of the values of this dataframe
Series.var([axis, skipna, ddof, split_every]) Return unbiased variance over requested axis.
Series.visualize([filename, format, ...]) Render the computation of this object’s task graph using graphviz.
Series.where(cond[, other]) Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.

Groupby Operations

DataFrameGroupBy.aggregate(arg[, ...]) Aggregate using callable, string, dict, or list of string/callables
DataFrameGroupBy.apply(func[, meta]) Parallel version of pandas GroupBy.apply
DataFrameGroupBy.count([split_every, split_out]) Compute count of group, excluding missing values
DataFrameGroupBy.cumcount([axis]) Number each item in each group from 0 to the length of that group - 1.
DataFrameGroupBy.cumprod([axis]) Cumulative product for each group
DataFrameGroupBy.cumsum([axis]) Cumulative sum for each group
DataFrameGroupBy.get_group(key) Constructs NDFrame from group with provided name
DataFrameGroupBy.max([split_every, split_out]) Compute max of group values
DataFrameGroupBy.mean([split_every, split_out]) Compute mean of groups, excluding missing values
DataFrameGroupBy.min([split_every, split_out]) Compute min of group values
DataFrameGroupBy.size([split_every, split_out]) Compute group sizes
DataFrameGroupBy.std([ddof, split_every, ...]) Compute standard deviation of groups, excluding missing values
DataFrameGroupBy.sum([split_every, split_out]) Compute sum of group values
DataFrameGroupBy.var([ddof, split_every, ...]) Compute variance of groups, excluding missing values
SeriesGroupBy.aggregate(arg[, split_every, ...]) Aggregate using callable, string, dict, or list of string/callables
SeriesGroupBy.apply(func[, meta]) Parallel version of pandas GroupBy.apply
SeriesGroupBy.count([split_every, split_out]) Compute count of group, excluding missing values
SeriesGroupBy.cumcount([axis]) Number each item in each group from 0 to the length of that group - 1.
SeriesGroupBy.cumprod([axis]) Cumulative product for each group
SeriesGroupBy.cumsum([axis]) Cumulative sum for each group
SeriesGroupBy.get_group(key) Constructs NDFrame from group with provided name
SeriesGroupBy.max([split_every, split_out]) Compute max of group values
SeriesGroupBy.mean([split_every, split_out]) Compute mean of groups, excluding missing values
SeriesGroupBy.min([split_every, split_out]) Compute min of group values
SeriesGroupBy.nunique([split_every, split_out])
SeriesGroupBy.size([split_every, split_out]) Compute group sizes
SeriesGroupBy.std([ddof, split_every, split_out]) Compute standard deviation of groups, excluding missing values
SeriesGroupBy.sum([split_every, split_out]) Compute sum of group values
SeriesGroupBy.var([ddof, split_every, split_out]) Compute variance of groups, excluding missing values

Rolling Operations

rolling.map_overlap(func, df, before, after, ...) Apply a function to each partition, sharing rows with adjacent partitions.
rolling.rolling_apply(arg, window, func[, ...]) Generic moving function application.
rolling.rolling_count(arg, window, **kwargs) Rolling count of number of non-NaN observations inside provided window.
rolling.rolling_kurt(arg, window[, ...]) Unbiased moving kurtosis.
rolling.rolling_max(arg, window[, ...]) Moving maximum.
rolling.rolling_mean(arg, window[, ...]) Moving mean.
rolling.rolling_median(arg, window[, ...]) Moving median.
rolling.rolling_min(arg, window[, ...]) Moving minimum.
rolling.rolling_quantile(arg, window, quantile) Moving quantile.
rolling.rolling_skew(arg, window[, ...]) Unbiased moving skewness.
rolling.rolling_std(arg, window[, ...]) Moving standard deviation.
rolling.rolling_sum(arg, window[, ...]) Moving sum.
rolling.rolling_var(arg, window[, ...]) Moving variance.
rolling.rolling_window(arg[, window, ...]) Applies a moving window of type window_type and size window on the data.

Create DataFrames

read_csv(urlpath[, blocksize, collection, ...]) Read CSV files into a Dask.DataFrame
read_table(urlpath[, blocksize, collection, ...]) Read delimited files into a Dask.DataFrame
read_parquet(path[, columns, filters, ...]) Read ParquetFile into a Dask DataFrame
read_hdf(pattern, key[, start, stop, ...]) Read HDF files into a Dask DataFrame
read_sql_table(table, uri, index_col[, ...]) Create dataframe from an SQL table.
from_array(x[, chunksize, columns]) Read any sliceable array into a Dask Dataframe
from_bcolz(x[, chunksize, categorize, ...]) Read BColz CTable into a Dask Dataframe
from_dask_array(x[, columns]) Create a Dask DataFrame from a Dask Array.
from_delayed(dfs[, meta, divisions, prefix]) Create Dask DataFrame from many Dask Delayed objects
from_pandas(data[, npartitions, chunksize, ...]) Construct a Dask DataFrame from a Pandas DataFrame
dask.bag.core.Bag.to_dataframe([meta, columns]) Create Dask Dataframe from a Dask Bag.

Store DataFrames

to_csv(df, filename[, name_function, ...]) Store Dask DataFrame to CSV files
to_parquet(df, path[, engine, compression, ...]) Store Dask.dataframe to Parquet files
to_hdf(df, path, key[, mode, append, get, ...]) Store Dask Dataframe to Hierarchical Data Format (HDF) files
to_records(df) Create Dask Array from a Dask Dataframe
to_bag(df[, index]) Create Dask Bag from a Dask DataFrame
to_delayed(df) Create Dask Delayed objects from a Dask Dataframe

DataFrame Methods

class dask.dataframe.DataFrame(dsk, name, meta, divisions)

Parallel Pandas DataFrame

Do not use this class directly. Instead use functions like dd.read_csv, dd.read_parquet, or dd.from_pandas.

Parameters:

dsk: dict

The dask graph to compute this DataFrame

name: str

The key prefix that specifies which keys in the dask comprise this particular DataFrame

meta: pandas.DataFrame

An empty pandas.DataFrame with names, dtypes, and index matching the expected output.

divisions: tuple of index values

Values along which we partition our blocks on the index
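
A minimal sketch of obtaining a DataFrame through dd.from_pandas rather than the constructor (the frame and column names below are illustrative):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({'x': range(6), 'y': list('aabbcc')})
>>> ddf = dd.from_pandas(pdf, npartitions=3)   # partitions the frame along its index
>>> ddf.npartitions
3
>>> ddf.known_divisions
True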

abs()

Return an object with absolute value taken–only applicable to objects that are all numeric.

Returns:

abs : type of caller
add(other, axis='columns', level=None, fill_value=None)

Addition of dataframe and other, element-wise (binary operator add).

Equivalent to dataframe + other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

See also

DataFrame.radd

Notes

Mismatched indices will be unioned together
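
A small, hedged sketch of the fill_value behaviour on two single-partition frames (the values are illustrative):

>>> import numpy as np
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> left = dd.from_pandas(pd.DataFrame({'x': [1.0, np.nan, 3.0]}), npartitions=1)
>>> right = dd.from_pandas(pd.DataFrame({'x': [10.0, 20.0, np.nan]}), npartitions=1)
>>> left.add(right, fill_value=0).compute()   # a NaN on one side is treated as 0
      x
0  11.0
1  20.0
2   3.0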

align(other, join='outer', axis=None, fill_value=None)

Align two objects on their axes with the specified join method for each axis Index

Parameters:

other : DataFrame or Series

join : {‘outer’, ‘inner’, ‘left’, ‘right’}, default ‘outer’

axis : allowed axis of the other object, default None

Align on index (0), columns (1), or both (None)

level : int or level name, default None

Broadcast across a level, matching Index values on the passed MultiIndex level

copy : boolean, default True

Always returns new objects. If copy=False and no reindexing is required then original objects are returned.

fill_value : scalar, default np.NaN

Value to use for missing values. Defaults to NaN, but can be any “compatible” value

method : str, default None

limit : int, default None

fill_axis : {0 or ‘index’, 1 or ‘columns’}, default 0

Filling axis, method and limit

broadcast_axis : {0 or ‘index’, 1 or ‘columns’}, default None

Broadcast values along this axis, if aligning two objects of different dimensions

New in version 0.17.0.

Returns:

(left, right) : (DataFrame, type of other)

Aligned objects

Notes

Dask doesn’t support the following argument(s).

  • level
  • copy
  • method
  • limit
  • fill_axis
  • broadcast_axis
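
A short sketch of aligning two Dask frames on the union of their indexes (illustrative frames; both results stay lazy until computed):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> a = dd.from_pandas(pd.DataFrame({'x': [1, 2, 3]}, index=[0, 1, 2]), npartitions=1)
>>> b = dd.from_pandas(pd.DataFrame({'x': [4, 5]}, index=[1, 2]), npartitions=1)
>>> left, right = a.align(b, join='outer', fill_value=0)
>>> left.compute().index.equals(right.compute().index)   # both share the unioned index
True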
all(axis=None, skipna=True, split_every=False)

Return whether all elements are True over requested axis

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

bool_only : boolean, default None

Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.

Returns:

all : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • bool_only
  • level
any(axis=None, skipna=True, split_every=False)

Return whether any element is True over requested axis

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

bool_only : boolean, default None

Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.

Returns:

any : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • bool_only
  • level
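
A quick sketch of both reductions on a small boolean frame (illustrative data):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'a': [True, True, False],
...                                    'b': [True, True, True]}), npartitions=2)
>>> ddf.all().compute()
a    False
b     True
dtype: bool
>>> ddf.any().compute()
a    True
b    True
dtype: bool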
append(other)

Append rows of other to the end of this frame, returning a new object. Columns not in this frame are added as new columns.

Parameters:

other : DataFrame or Series/dict-like object, or list of these

The data to append.

ignore_index : boolean, default False

If True, do not use the index labels.

verify_integrity : boolean, default False

If True, raise ValueError on creating index with duplicates.

Returns:

appended : DataFrame

See also

pandas.concat
General function to concatenate DataFrame, Series or Panel objects

Notes

If a list of dict/series is passed and the keys are all contained in the DataFrame’s index, the order of the columns in the resulting DataFrame will be unchanged.

Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.

Examples

>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))  
>>> df  
   A  B
0  1  2
1  3  4
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))  
>>> df.append(df2)  
   A  B
0  1  2
1  3  4
0  5  6
1  7  8

With ignore_index set to True:

>>> df.append(df2, ignore_index=True)  
   A  B
0  1  2
1  3  4
2  5  6
3  7  8

The following, while not recommended methods for generating DataFrames, show two ways to generate a DataFrame from multiple data sources.

Less efficient:

>>> df = pd.DataFrame(columns=['A'])  
>>> for i in range(5):  
...     df = df.append({'A': i}, ignore_index=True)
>>> df  
   A
0  0
1  1
2  2
3  3
4  4

More efficient:

>>> pd.concat([pd.DataFrame([i], columns=['A']) for i in range(5)],  
...           ignore_index=True)
   A
0  0
1  1
2  2
3  3
4  4
apply(func, axis=0, args=(), meta='__no_default__', **kwds)

Parallel version of pandas.DataFrame.apply

This mimics the pandas version except for the following:

  1. Only axis=1 is supported (and must be specified explicitly).
  2. The user should provide output metadata via the meta keyword.
Parameters:

func : function

Function to apply to each column/row

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

  • 0 or ‘index’: apply function to each column (NOT SUPPORTED)
  • 1 or ‘columns’: apply function to each row

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

args : tuple

Positional arguments to pass to function in addition to the array/series

Additional keyword arguments will be passed as keywords to the function

Returns:

applied : Series or DataFrame

See also

dask.DataFrame.map_partitions

Examples

>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
...                    'y': [1., 2., 3., 4., 5.]})
>>> ddf = dd.from_pandas(df, npartitions=2)

Apply a function to row-wise passing in extra arguments in args and kwargs:

>>> def myadd(row, a, b=1):
...     return row.sum() + a + b
>>> res = ddf.apply(myadd, axis=1, args=(2,), b=1.5)

By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the meta keyword. This can be specified in many forms, for more information see dask.dataframe.utils.make_meta.

Here we specify the output is a Series with name 'x', and dtype float64:

>>> res = ddf.apply(myadd, axis=1, args=(2,), b=1.5, meta=('x', 'f8'))

In the case where the metadata doesn’t change, you can also pass in the object itself directly:

>>> res = ddf.apply(lambda row: row + 1, axis=1, meta=ddf)
applymap(func, meta='__no_default__')

Apply a function to a DataFrame that is intended to operate elementwise, i.e. like doing map(func, series) for each series in the DataFrame

Parameters:

func : function

Python function, returns a single value from a single value

Returns:

applied : DataFrame

See also

DataFrame.apply
For operations on rows/columns

Examples

>>> df = pd.DataFrame(np.random.randn(3, 3))  
>>> df  
    0         1          2
0  -0.029638  1.081563   1.280300
1   0.647747  0.831136  -1.549481
2   0.513416 -0.884417   0.195343
>>> df = df.applymap(lambda x: '%.2f' % x)  
>>> df  
    0         1          2
0  -0.03      1.08       1.28
1   0.65      0.83      -1.55
2   0.51     -0.88       0.20
assign(**kwargs)

Assign new columns to a DataFrame, returning a new object (a copy) with all the original columns in addition to the new ones.

Parameters:

kwargs : keyword, value pairs

keywords are the column names. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not change input DataFrame (though pandas doesn’t check it). If the values are not callable, (e.g. a Series, scalar, or array), they are simply assigned.

Returns:

df : DataFrame

A new DataFrame with the new columns in addition to all the existing columns.

Notes

For python 3.6 and above, the columns are inserted in the order of **kwargs. For python 3.5 and earlier, since **kwargs is unordered, the columns are inserted in alphabetical order at the end of your DataFrame. Assigning multiple columns within the same assign is possible, but you cannot reference other columns created within the same assign call.

Examples

>>> df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})  

Where the value is a callable, evaluated on df:

>>> df.assign(ln_A = lambda x: np.log(x.A))  
    A         B      ln_A
0   1  0.426905  0.000000
1   2 -0.780949  0.693147
2   3 -0.418711  1.098612
3   4 -0.269708  1.386294
4   5 -0.274002  1.609438
5   6 -0.500792  1.791759
6   7  1.649697  1.945910
7   8 -1.495604  2.079442
8   9  0.549296  2.197225
9  10 -0.758542  2.302585

Where the value already exists and is inserted:

>>> newcol = np.log(df['A'])  
>>> df.assign(ln_A=newcol)  
    A         B      ln_A
0   1  0.426905  0.000000
1   2 -0.780949  0.693147
2   3 -0.418711  1.098612
3   4 -0.269708  1.386294
4   5 -0.274002  1.609438
5   6 -0.500792  1.791759
6   7  1.649697  1.945910
7   8 -1.495604  2.079442
8   9  0.549296  2.197225
9  10 -0.758542  2.302585
astype(dtype)

Cast a pandas object to a specified dtype dtype.

Parameters:

dtype : data type, or dict of column name -> data type

Use a numpy.dtype or Python type to cast entire pandas object to the same type. Alternatively, use {col: dtype, ...}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.

copy : bool, default True.

Return a copy when copy=True (be very careful setting copy=False as changes to values then may propagate to other pandas objects).

errors : {‘raise’, ‘ignore’}, default ‘raise’.

Control raising of exceptions on invalid data for provided dtype.

  • raise : allow exceptions to be raised
  • ignore : suppress exceptions. On error return original object

New in version 0.20.0.

raise_on_error : raise on invalid input

Deprecated since version 0.20.0: Use errors instead

kwargs : keyword arguments to pass on to the constructor

Returns:

casted : type of caller

See also

pandas.to_datetime
Convert argument to datetime.
pandas.to_timedelta
Convert argument to timedelta.
pandas.to_numeric
Convert argument to a numeric type.
numpy.ndarray.astype
Cast a numpy array to a specified type.

Examples

>>> ser = pd.Series([1, 2], dtype='int32')  
>>> ser  
0    1
1    2
dtype: int32
>>> ser.astype('int64')  
0    1
1    2
dtype: int64

Convert to categorical type:

>>> ser.astype('category')  
0    1
1    2
dtype: category
Categories (2, int64): [1, 2]

Convert to ordered categorical type with custom ordering:

>>> ser.astype('category', ordered=True, categories=[2, 1])  
0    1
1    2
dtype: category
Categories (2, int64): [2 < 1]

Note that using copy=False and changing data on a new pandas object may propagate changes:

>>> s1 = pd.Series([1,2])  
>>> s2 = s1.astype('int', copy=False)  
>>> s2[0] = 10  
>>> s1  # note that s1[0] has changed too  
0    10
1     2
dtype: int64
bfill(axis=None, limit=None)

Synonym for DataFrame.fillna(method='bfill')

Notes

Dask doesn’t support the following argument(s).

  • inplace
  • downcast
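
A minimal sketch, using a single partition and an illustrative column, of bfill next to its forward-filling counterpart:

>>> import numpy as np
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1.0, np.nan, np.nan, 4.0]}), npartitions=1)
>>> ddf.bfill().compute()   # same as ddf.fillna(method='bfill')
     x
0  1.0
1  4.0
2  4.0
3  4.0
>>> ddf.ffill().compute()   # propagate the last valid value forward instead
     x
0  1.0
1  1.0
2  1.0
3  4.0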
categorize(df, columns=None, index=None, split_every=None, **kwargs)

Convert columns of the DataFrame to category dtype.

Parameters:

columns : list, optional

A list of column names to convert to categoricals. By default any column with an object dtype is converted to a categorical, and any unknown categoricals are made known.

index : bool, optional

Whether to categorize the index. By default, object indices are converted to categorical, and unknown categorical indices are made known. Set True to always categorize the index, False to never.

split_every : int, optional

Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used. Default is 16.

kwargs

Keyword arguments are passed on to compute.
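
A hedged sketch of converting an object column (illustrative data; note that categorize triggers a computation to discover the categories):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'fruit': ['apple', 'pear', 'apple'],
...                                    'count': [4, 2, 7]}), npartitions=2)
>>> ddf2 = ddf.categorize(columns=['fruit'])
>>> ddf2.fruit.cat.known   # the categories are now known on every partition
True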

clear_divisions()

Forget division information

clip(lower=None, upper=None, out=None)

Trim values at input threshold(s).

Parameters:

lower : float or array_like, default None

upper : float or array_like, default None

axis : int or string axis name, optional

Align object with lower and upper along the given axis.

inplace : boolean, default False

Whether to perform the operation in place on the data

New in version 0.21.0.

Returns:

clipped : Series

Notes

Dask doesn’t support the following argument(s).

  • axis
  • inplace

Examples

>>> df  
          0         1
0  0.335232 -1.256177
1 -1.367855  0.746646
2  0.027753 -1.176076
3  0.230930 -0.679613
4  1.261967  0.570967
>>> df.clip(-1.0, 0.5)  
          0         1
0  0.335232 -1.000000
1 -1.000000  0.500000
2  0.027753 -1.000000
3  0.230930 -0.679613
4  0.500000  0.500000
>>> t  
0   -0.3
1   -0.2
2   -0.1
3    0.0
4    0.1
dtype: float64
>>> df.clip(t, t + 1, axis=0)  
          0         1
0  0.335232 -0.300000
1 -0.200000  0.746646
2  0.027753 -0.100000
3  0.230930  0.000000
4  1.100000  0.570967
clip_lower(threshold)

Return copy of the input with values below given value(s) truncated.

Parameters:

threshold : float or array_like

axis : int or string axis name, optional

Align object with threshold along the given axis.

inplace : boolean, default False

Whether to perform the operation in place on the data

New in version 0.21.0.

Returns:

clipped : same type as input

See also

clip

Notes

Dask doesn’t support the following argument(s).

  • axis
  • inplace
clip_upper(threshold)

Return copy of input with values above given value(s) truncated.

Parameters:

threshold : float or array_like

axis : int or string axis name, optional

Align object with threshold along the given axis.

inplace : boolean, default False

Whether to perform the operation in place on the data

New in version 0.21.0.

Returns:

clipped : same type as input

See also

clip

Notes

Dask doesn’t support the following argument(s).

  • axis
  • inplace
combine(other, func, fill_value=None, overwrite=True)

Add two DataFrame objects and do not propagate NaN values, so if for a (column, time) one frame is missing a value, it will default to the other frame’s value (which might be NaN as well)

Parameters:

other : DataFrame

func : function

fill_value : scalar value

overwrite : boolean, default True

If True then overwrite values for common keys in the calling frame

Returns:

result : DataFrame

combine_first(other)

Combine two DataFrame objects and default to non-null values in frame calling the method. Result index columns will be the union of the respective indexes and columns

Parameters:

other : DataFrame

Returns:

combined : DataFrame

Examples

a’s values prioritized, use values from b to fill holes:

>>> a.combine_first(b)  
compute(**kwargs)

Compute this dask collection

This turns a lazy Dask collection into its in-memory equivalent. For example a Dask.array turns into a NumPy array and a Dask.dataframe turns into a Pandas dataframe. The entire dataset must fit into memory before calling this operation.

Parameters:

get : callable, optional

A scheduler get function to use. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.

optimize_graph : bool, optional

If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.

kwargs

Extra keywords to forward to the scheduler get function.

See also

dask.base.compute
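
A minimal sketch of the lazy-then-compute pattern (illustrative frame):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2, 3, 4]}), npartitions=2)
>>> total = ddf.x.sum()   # still lazy; nothing has executed yet
>>> total.compute()       # run the task graph and return an in-memory result
10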

copy()

Make a copy of the dataframe

This is strictly a shallow copy of the underlying computational graph. It does not affect the underlying data

corr(method='pearson', min_periods=None, split_every=False)

Compute pairwise correlation of columns, excluding NA/null values

Parameters:

method : {‘pearson’, ‘kendall’, ‘spearman’}

  • pearson : standard correlation coefficient
  • kendall : Kendall Tau correlation coefficient
  • spearman : Spearman rank correlation

min_periods : int, optional

Minimum number of observations required per pair of columns to have a valid result. Currently only available for pearson and spearman correlation

Returns:

y : DataFrame
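
A small sketch on a perfectly correlated pair of columns (illustrative data; the result stays lazy until computed):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2, 3, 4],
...                                    'y': [2, 4, 6, 8]}), npartitions=2)
>>> ddf.corr().compute()
     x    y
x  1.0  1.0
y  1.0  1.0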

count(axis=None, split_every=False)

Return Series with number of non-NA/null observations over requested axis. Works with non-floating point data as well (detects NaN and None)

Parameters:

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame

numeric_only : boolean, default False

Include only float, int, boolean data

Returns:

count : Series (or DataFrame if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
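
A short sketch counting non-null values per column (illustrative data):

>>> import numpy as np
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1.0, np.nan, 3.0],
...                                    'y': ['a', 'b', None]}), npartitions=2)
>>> ddf.count().compute()
x    2
y    2
dtype: int64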
cov(min_periods=None, split_every=False)

Compute pairwise covariance of columns, excluding NA/null values

Parameters:

min_periods : int, optional

Minimum number of observations required per pair of columns to have a valid result.

Returns:

y : DataFrame

Notes

y contains the covariance matrix of the DataFrame’s time series. The covariance is normalized by N-1 (unbiased estimator).

cummax(axis=None, skipna=True)

Return cumulative max over requested axis.

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

Returns:

cummax : Series

See also

pandas.core.window.Expanding.max
Similar functionality but ignores NaN values.
cummin(axis=None, skipna=True)

Return cumulative minimum over requested axis.

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

Returns:

cummin : Series

See also

pandas.core.window.Expanding.min
Similar functionality but ignores NaN values.
cumprod(axis=None, skipna=True)

Return cumulative product over requested axis.

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

Returns:

cumprod : Series

See also

pandas.core.window.Expanding.prod
Similar functionality but ignores NaN values.
cumsum(axis=None, skipna=True)

Return cumulative sum over requested axis.

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

Returns:

cumsum : Series

See also

pandas.core.window.Expanding.sum
Similar functionality but ignores NaN values.
describe(split_every=False)

Generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.

Parameters:

percentiles : list-like of numbers, optional

The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.

include : ‘all’, list-like of dtypes or None (default), optional

A white list of data types to include in the result. Ignored for Series. Here are the options:

  • ‘all’ : All columns of the input will be included in the output.
  • A list-like of dtypes : Limits the results to the provided data types. To limit the result to numeric types submit numpy.number. To limit it instead to object columns submit the numpy.object data type. Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'
  • None (default) : The result will include all numeric columns.

exclude : list-like of dtypes or None (default), optional,

A black list of data types to omit from the result. Ignored for Series. Here are the options:

  • A list-like of dtypes : Excludes the provided data types from the result. To exclude numeric types submit numpy.number. To exclude object columns submit the data type numpy.object. Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To exclude pandas categorical columns, use 'category'
  • None (default) : The result will exclude nothing.
Returns:

summary: Series/DataFrame of summary statistics

Notes

For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.

For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.

If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.

For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the dataframe consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.

The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.

Examples

Describing a numeric Series.

>>> s = pd.Series([1, 2, 3])  
>>> s.describe()  
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64

Describing a categorical Series.

>>> s = pd.Series(['a', 'a', 'b', 'c'])  
>>> s.describe()  
count     4
unique    3
top       a
freq      2
dtype: object

Describing a timestamp Series.

>>> s = pd.Series([  
...   np.datetime64("2000-01-01"),
...   np.datetime64("2010-01-01"),
...   np.datetime64("2010-01-01")
... ])
>>> s.describe()  
count                       3
unique                      2
top       2010-01-01 00:00:00
freq                        2
first     2000-01-01 00:00:00
last      2010-01-01 00:00:00
dtype: object

Describing a DataFrame. By default only numeric fields are returned.

>>> df = pd.DataFrame({ 'object': ['a', 'b', 'c'],  
...                     'numeric': [1, 2, 3],
...                     'categorical': pd.Categorical(['d','e','f'])
...                   })
>>> df.describe()  
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns of a DataFrame regardless of data type.

>>> df.describe(include='all')  
        categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      c
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

Describing a column from a DataFrame by accessing it as an attribute.

>>> df.numeric.describe()  
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64

Including only numeric columns in a DataFrame description.

>>> df.describe(include=[np.number])  
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Including only string columns in a DataFrame description.

>>> df.describe(include=[np.object])  
       object
count       3
unique      3
top         c
freq        1

Including only categorical columns from a DataFrame description.

>>> df.describe(include=['category'])  
       categorical
count            3
unique           3
top              f
freq             1

Excluding numeric columns from a DataFrame description.

>>> df.describe(exclude=[np.number])  
       categorical object
count            3      3
unique           3      3
top              f      c
freq             1      1

Excluding object columns from a DataFrame description.

>>> df.describe(exclude=[np.object])  
        categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0
diff(periods=1, axis=0)

1st discrete difference of object

Parameters:

periods : int, default 1

Periods to shift for forming difference

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

Take difference over rows (0) or columns (1).

Returns:

diffed : DataFrame
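
A quick sketch on illustrative data; rows that straddle a partition boundary are handled internally:

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 4, 9, 16]}), npartitions=2)
>>> ddf.diff().compute()
     x
0  NaN
1  3.0
2  5.0
3  7.0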

div(other, axis='columns', level=None, fill_value=None)

Floating division of dataframe and other, element-wise (binary operator truediv).

Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

Notes

Mismatched indices will be unioned together

drop(labels, axis=0, errors='raise')

Return new object with labels in requested axis removed.

Parameters:

labels : single label or list-like

Index or column labels to drop.

axis : int or axis name

Whether to drop labels from the index (0 / ‘index’) or columns (1 / ‘columns’).

index, columns : single label or list-like

Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).

New in version 0.21.0.

level : int or level name, default None

For MultiIndex

inplace : bool, default False

If True, do operation inplace and return None.

errors : {‘ignore’, ‘raise’}, default ‘raise’

If ‘ignore’, suppress error and existing labels are dropped.

Returns:

dropped : type of caller

Notes

Specifying both labels and index or columns will raise a ValueError.

Examples

>>> df = pd.DataFrame(np.arange(12).reshape(3,4),  
...                   columns=['A', 'B', 'C', 'D'])
>>> df  
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

Drop columns

>>> df.drop(['B', 'C'], axis=1)  
   A   D
0  0   3
1  4   7
2  8  11
>>> df.drop(columns=['B', 'C'])  
   A   D
0  0   3
1  4   7
2  8  11

Drop a row by index

>>> df.drop([0, 1])  
   A  B   C   D
2  8  9  10  11
drop_duplicates(split_every=None, split_out=1, **kwargs)

Return DataFrame with duplicate rows removed, optionally only considering certain columns

Parameters:

subset : column label or sequence of labels, optional

Only consider certain columns for identifying duplicates, by default use all of the columns

keep : {‘first’, ‘last’, False}, default ‘first’

  • first : Drop duplicates except for the first occurrence.
  • last : Drop duplicates except for the last occurrence.
  • False : Drop all duplicates.

inplace : boolean, default False

Whether to drop duplicates in place or to return a copy

Returns:

deduplicated : DataFrame

Notes

Dask doesn’t support the following argument(s).

  • subset
  • keep
  • inplace
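
A minimal sketch on a small frame with repeated rows (illustrative data; duplicates are removed within each partition and then across partitions):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 1, 2, 2, 3],
...                                    'y': ['a', 'a', 'b', 'b', 'c']}), npartitions=2)
>>> ddf.drop_duplicates().compute()
   x  y
0  1  a
2  2  b
4  3  c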
dropna(how='any', subset=None)

Return object with labels on given axis omitted where alternately any or all of the data are missing

Parameters:

axis : {0 or ‘index’, 1 or ‘columns’}, or tuple/list thereof

Pass tuple or list to drop on multiple axes

how : {‘any’, ‘all’}

  • any : if any NA values are present, drop that label
  • all : if all values are NA, drop that label

thresh : int, default None

int value : require that many non-NA values

subset : array-like

Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include

inplace : boolean, default False

If True, do operation inplace and return None.

Returns:

dropped : DataFrame

Notes

Dask doesn’t support the following argument(s).

  • axis
  • thresh
  • inplace

Examples

>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0], [3, 4, np.nan, 1],  
...                    [np.nan, np.nan, np.nan, 5]],
...                   columns=list('ABCD'))
>>> df  
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5

Drop the columns where all elements are nan:

>>> df.dropna(axis=1, how='all')  
     A    B  D
0  NaN  2.0  0
1  3.0  4.0  1
2  NaN  NaN  5

Drop the columns where any of the elements is nan

>>> df.dropna(axis=1, how='any')  
   D
0  0
1  1
2  5

Drop the rows where all of the elements are nan (there is no row to drop, so df stays the same):

>>> df.dropna(axis=0, how='all')  
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5

Keep only the rows with at least 2 non-na values:

>>> df.dropna(thresh=2)  
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
dtypes

Return data types

eq(other, axis='columns', level=None)

Wrapper for flexible comparison methods eq

eval(expr, inplace=None, **kwargs)

Evaluate an expression in the context of the calling DataFrame instance.

Parameters:

expr : string

The expression string to evaluate.

inplace : bool, default False

If the expression contains an assignment, whether to perform the operation inplace and mutate the existing DataFrame. Otherwise, a new DataFrame is returned.

New in version 0.18.0.

kwargs : dict

See the documentation for eval() for complete details on the keyword arguments accepted by query().

Returns:

ret : ndarray, scalar, or pandas object

See also

pandas.DataFrame.query, pandas.DataFrame.assign, pandas.eval

Notes

For more details see the API documentation for eval(). For detailed examples see enhancing performance with eval.

Examples

>>> from numpy.random import randn  
>>> from pandas import DataFrame  
>>> df = DataFrame(randn(10, 2), columns=list('ab'))  
>>> df.eval('a + b')  
>>> df.eval('c = a + b')  
ffill(axis=None, limit=None)

Synonym for DataFrame.fillna(method='ffill')

Notes

Dask doesn’t support the following argument(s).

  • inplace
  • downcast
fillna(value=None, method=None, limit=None, axis=None)

Fill NA/NaN values using the specified method

Parameters:

value : scalar, dict, Series, or DataFrame

Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). (values not in the dict/Series/DataFrame will not be filled). This value cannot be a list.

method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None

Method to use for filling holes in reindexed Series. pad / ffill: propagate last valid observation forward to next valid. backfill / bfill: use NEXT valid observation to fill gap

axis : {0 or ‘index’, 1 or ‘columns’}

inplace : boolean, default False

If True, fill in place. Note: this will modify any other views on this object, (e.g. a no-copy slice for a column in a DataFrame).

limit : int, default None

If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.

downcast : dict, default is None

a dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible)

Returns:

filled : DataFrame

See also

reindex, asfreq

Notes

Dask doesn’t support the following argument(s).

  • inplace
  • downcast

Examples

>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],  
...                    [3, 4, np.nan, 1],
...                    [np.nan, np.nan, np.nan, 5],
...                    [np.nan, 3, np.nan, 4]],
...                    columns=list('ABCD'))
>>> df  
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4

Replace all NaN elements with 0s.

>>> df.fillna(0)  
    A   B   C   D
0   0.0 2.0 0.0 0
1   3.0 4.0 0.0 1
2   0.0 0.0 0.0 5
3   0.0 3.0 0.0 4

We can also propagate non-null values forward or backward.

>>> df.fillna(method='ffill')  
    A   B   C   D
0   NaN 2.0 NaN 0
1   3.0 4.0 NaN 1
2   3.0 4.0 NaN 5
3   3.0 3.0 NaN 4

Replace all NaN elements in column ‘A’, ‘B’, ‘C’, and ‘D’, with 0, 1, 2, and 3 respectively.

>>> values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}  
>>> df.fillna(value=values)  
    A   B   C   D
0   0.0 2.0 2.0 0
1   3.0 4.0 2.0 1
2   0.0 1.0 2.0 5
3   0.0 3.0 2.0 4

Only replace the first NaN element.

>>> df.fillna(value=values, limit=1)  
    A   B   C   D
0   0.0 2.0 2.0 0
1   3.0 4.0 NaN 1
2   NaN 1.0 NaN 5
3   NaN 3.0 NaN 4
first(offset)

Convenience method for subsetting initial periods of time series data based on a date offset.

Parameters:

offset : string, DateOffset, dateutil.relativedelta

Returns:

subset : type of caller

Examples

ts.first(‘10D’) -> First 10 days

floordiv(other, axis='columns', level=None, fill_value=None)

Integer division of dataframe and other, element-wise (binary operator floordiv).

Equivalent to dataframe // other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

Notes

Mismatched indices will be unioned together

ge(other, axis='columns', level=None)

Wrapper for flexible comparison methods ge

get_dtype_counts()

Return the counts of dtypes in this object.

get_ftype_counts()

Return the counts of ftypes in this object.

get_partition(n)

Get a dask DataFrame/Series representing the nth partition.
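
Examples

As a minimal illustrative sketch (toy data built with dd.from_pandas; variable names are hypothetical), the result is itself a lazy dask DataFrame:

>>> import pandas as pd  
>>> import dask.dataframe as dd  
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(10)}), npartitions=5)  
>>> part = ddf.get_partition(2)   # still lazy; covers only the third partition  
>>> part.compute()                # concrete pandas DataFrame for that partition  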

groupby(by=None, **kwargs)

Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.

Parameters:

by : mapping, function, str, or iterable

Used to determine the groups for the groupby. If by is a function, it’s called on each value of the object’s index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series’ values are first aligned; see .align() method). If an ndarray is passed, the values are used as-is to determine the groups. A str or list of strs may be passed to group by the columns in self

axis : int, default 0

level : int, level name, or sequence of such, default None

If the axis is a MultiIndex (hierarchical), group by a particular level or levels

as_index : boolean, default True

For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output

sort : boolean, default True

Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. groupby preserves the order of rows within each group.

group_keys : boolean, default True

When calling apply, add group keys to index to identify pieces

squeeze : boolean, default False

reduce the dimensionality of the return type if possible, otherwise return a consistent type

Returns:

GroupBy object

Notes

Dask doesn’t support the following argument(s).

  • axis
  • level
  • as_index
  • sort
  • group_keys
  • squeeze

Examples

DataFrame results

>>> data.groupby(func, axis=0).mean()  
>>> data.groupby(['col1', 'col2'])['col3'].mean()  

DataFrame with hierarchical index

>>> data.groupby(['col1', 'col2']).mean()  
gt(other, axis='columns', level=None)

Wrapper for flexible comparison methods gt

head(n=5, npartitions=1, compute=True)

First n rows of the dataset

Parameters:

n : int, optional

The number of rows to return. Default is 5.

npartitions : int, optional

Elements are only taken from the first npartitions, with a default of 1. If there are fewer than n rows in the first npartitions a warning will be raised and any found rows returned. Pass -1 to use all partitions.

compute : bool, optional

Whether to compute the result, default is True.
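
Examples

A brief sketch (toy data split with dd.from_pandas; names are illustrative) showing how npartitions limits where rows are taken from:

>>> import pandas as pd  
>>> import dask.dataframe as dd  
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(8)}), npartitions=4)  
>>> ddf.head(2)                   # looks only at the first partition  
>>> ddf.head(5, npartitions=-1)   # searches all partitions for 5 rows  
>>> ddf.head(5, compute=False)    # returns a lazy dask DataFrame instead  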

idxmax(axis=None, skipna=True, split_every=False)

Return index of first occurrence of maximum over requested axis. NA/null values are excluded.

Parameters:

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

Returns:

idxmax : Series

See also

Series.idxmax

Notes

This method is the DataFrame version of ndarray.argmax.
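
Examples

A minimal sketch (assumed toy data) showing the lazy Series of column-wise index labels:

>>> import pandas as pd  
>>> import dask.dataframe as dd  
>>> df = pd.DataFrame({'a': [1, 5, 3], 'b': [9, 2, 4]})  
>>> ddf = dd.from_pandas(df, npartitions=2)  
>>> ddf.idxmax().compute()  
a    1
b    0
dtype: int64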

idxmin(axis=None, skipna=True, split_every=False)

Return index of first occurrence of minimum over requested axis. NA/null values are excluded.

Parameters:

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

Returns:

idxmin : Series

See also

Series.idxmin

Notes

This method is the DataFrame version of ndarray.argmin.

index

Return dask Index instance

info(buf=None, verbose=False, memory_usage=False)

Concise summary of a Dask DataFrame.

isin(values)

Return boolean DataFrame showing whether each element in the DataFrame is contained in values.

Parameters:

values : iterable, Series, DataFrame or dictionary

The result will only be true at a location if all the labels match. If values is a Series, that’s the index. If values is a dictionary, the keys must be the column names, which must match. If values is a DataFrame, then both the index and column labels must match.

Returns:

DataFrame of booleans

Examples

When values is a list:

>>> df = DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})  
>>> df.isin([1, 3, 12, 'a'])  
       A      B
0   True   True
1  False  False
2   True  False

When values is a dict:

>>> df = DataFrame({'A': [1, 2, 3], 'B': [1, 4, 7]})  
>>> df.isin({'A': [1, 3], 'B': [4, 7, 12]})  
       A      B
0   True  False  # Note that B didn't match the 1 here.
1  False   True
2   True   True

When values is a Series or DataFrame:

>>> df = DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})  
>>> other = DataFrame({'A': [1, 3, 3, 2], 'B': ['e', 'f', 'f', 'e']})  
>>> df.isin(other)  
       A      B
0   True  False
1  False  False  # Column A in `other` has a 3, but not at index 1.
2   True   True
isnull()

Return a boolean same-sized object indicating if the values are NA.

See also

DataFrame.notna
boolean inverse of isna
DataFrame.isnull
alias of isna
isna
top-level isna
iterrows()

Iterate over DataFrame rows as (index, Series) pairs.

Returns:

it : generator

A generator that iterates over the rows of the frame.

See also

itertuples
Iterate over DataFrame rows as namedtuples of the values.
iteritems
Iterate over (column name, Series) pairs.

Notes

  1. Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). For example,

    >>> df = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])  
    >>> row = next(df.iterrows())[1]  
    >>> row  
    int      1.0
    float    1.5
    Name: 0, dtype: float64
    >>> print(row['int'].dtype)  
    float64
    >>> print(df['int'].dtype)  
    int64
    

    To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally faster than iterrows.

  2. You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.

itertuples()

Iterate over DataFrame rows as namedtuples, with index value as first element of the tuple.

Parameters:

index : boolean, default True

If True, return the index as the first element of the tuple.

name : string, default “Pandas”

The name of the returned namedtuples or None to return regular tuples.

See also

iterrows
Iterate over DataFrame rows as (index, Series) pairs.
iteritems
Iterate over (column name, Series) pairs.

Notes

The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore. With a large number of columns (>255), regular tuples are returned.

Examples

>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]},  
...                   index=['a', 'b'])
>>> df  
   col1  col2
a     1   0.1
b     2   0.2
>>> for row in df.itertuples():  
...     print(row)
...
Pandas(Index='a', col1=1, col2=0.10000000000000001)
Pandas(Index='b', col1=2, col2=0.20000000000000001)
join(other, on=None, how='left', lsuffix='', rsuffix='', npartitions=None, shuffle=None)

Join columns with other DataFrame either on index or on a key column. Efficiently Join multiple DataFrame objects by index at once by passing a list.

Parameters:

other : DataFrame, Series with name field set, or list of DataFrame

Index should be similar to one of the columns in this one. If a Series is passed, its name attribute must be set, and that will be used as the column name in the resulting joined DataFrame

on : column name, tuple/list of column names, or array-like

Column(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiple columns are given, the passed DataFrame must have a MultiIndex. Can pass an array as the join key if not already contained in the calling DataFrame. Like an Excel VLOOKUP operation

how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default: ‘left’

How to handle the operation of the two objects.

  • left: use calling frame’s index (or column if on is specified)
  • right: use other frame’s index
  • outer: form union of calling frame’s index (or column if on is specified) with other frame’s index, and sort it lexicographically
  • inner: form intersection of calling frame’s index (or column if on is specified) with other frame’s index, preserving the order of the calling’s one

lsuffix : string

Suffix to use from left frame’s overlapping columns

rsuffix : string

Suffix to use from right frame’s overlapping columns

sort : boolean, default False

Order result DataFrame lexicographically by the join key. If False, the order of the join key depends on the join type (how keyword)

Returns:

joined : DataFrame

See also

DataFrame.merge
For column(s)-on-columns(s) operations

Notes

on, lsuffix, and rsuffix options are not supported when passing a list of DataFrame objects

Examples

>>> caller = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],  
...                        'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
>>> caller  
    A key
0  A0  K0
1  A1  K1
2  A2  K2
3  A3  K3
4  A4  K4
5  A5  K5
>>> other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],  
...                       'B': ['B0', 'B1', 'B2']})
>>> other  
    B key
0  B0  K0
1  B1  K1
2  B2  K2

Join DataFrames using their indexes.

>>> caller.join(other, lsuffix='_caller', rsuffix='_other')  
    A key_caller    B key_other
0  A0         K0   B0        K0
1  A1         K1   B1        K1
2  A2         K2   B2        K2
3  A3         K3  NaN       NaN
4  A4         K4  NaN       NaN
5  A5         K5  NaN       NaN

If we want to join using the key columns, we need to set key to be the index in both caller and other. The joined DataFrame will have key as its index.

>>> caller.set_index('key').join(other.set_index('key'))  
      A    B
key
K0   A0   B0
K1   A1   B1
K2   A2   B2
K3   A3  NaN
K4   A4  NaN
K5   A5  NaN

Another option to join using the key columns is to use the on parameter. DataFrame.join always uses other’s index but we can use any column in the caller. This method preserves the original caller’s index in the result.

>>> caller.join(other.set_index('key'), on='key')  
    A key    B
0  A0  K0   B0
1  A1  K1   B1
2  A2  K2   B2
3  A3  K3  NaN
4  A4  K4  NaN
5  A5  K5  NaN
known_divisions

Whether divisions are already known
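
As a short sketch (toy data): a frame built with dd.from_pandas knows its divisions, while clear_divisions() discards them.

>>> import pandas as pd  
>>> import dask.dataframe as dd  
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(10)}), npartitions=2)  
>>> ddf.known_divisions  
True
>>> ddf.clear_divisions().known_divisions  
False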

last(offset)

Convenience method for subsetting final periods of time series data based on a date offset.

Parameters:

offset : string, DateOffset, dateutil.relativedelta

Returns:

subset : type of caller

Examples

ts.last('5M') -> Last 5 months

le(other, axis='columns', level=None)

Wrapper for flexible comparison methods le

loc

Purely label-location based indexer for selection by label.

>>> df.loc["b"]  
>>> df.loc["b":"d"]  
lt(other, axis='columns', level=None)

Wrapper for flexible comparison methods lt

map_overlap(func, before, after, *args, **kwargs)

Apply a function to each partition, sharing rows with adjacent partitions.

This can be useful for implementing windowing functions such as df.rolling(...).mean() or df.diff().

Parameters:

func : function

Function applied to each partition.

before : int

The number of rows to prepend to partition i from the end of partition i - 1.

after : int

The number of rows to append to partition i from the beginning of partition i + 1.

args, kwargs :

Arguments and keywords to pass to the function. The partition will be the first argument, and these will be passed after.

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Notes

Given positive integers before and after, and a function func, map_overlap does the following:

  1. Prepend before rows to each partition i from the end of partition i - 1. The first partition has no rows prepended.
  2. Append after rows to each partition i from the beginning of partition i + 1. The last partition has no rows appended.
  3. Apply func to each partition, passing in any extra args and kwargs if provided.
  4. Trim before rows from the beginning of all but the first partition.
  5. Trim after rows from the end of all but the last partition.

Note that the index and divisions are assumed to remain unchanged.

Examples

Given a DataFrame, Series, or Index, such as:

>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 4, 7, 11],
...                    'y': [1., 2., 3., 4., 5.]})
>>> ddf = dd.from_pandas(df, npartitions=2)

A rolling sum with a trailing moving window of size 2 can be computed by overlapping 2 rows before each partition, and then mapping calls to df.rolling(2).sum():

>>> ddf.compute()
    x    y
0   1  1.0
1   2  2.0
2   4  3.0
3   7  4.0
4  11  5.0
>>> ddf.map_overlap(lambda df: df.rolling(2).sum(), 2, 0).compute()
      x    y
0   NaN  NaN
1   3.0  3.0
2   6.0  5.0
3  11.0  7.0
4  18.0  9.0

The pandas diff method computes a discrete difference shifted by a number of periods (can be positive or negative). This can be implemented by mapping calls to df.diff to each partition after prepending/appending that many rows, depending on sign:

>>> def diff(df, periods=1):
...     before, after = (periods, 0) if periods > 0 else (0, -periods)
...     return df.map_overlap(lambda df, periods=1: df.diff(periods),
...                           before, after, periods=periods)
>>> diff(ddf, 1).compute()
     x    y
0  NaN  NaN
1  1.0  1.0
2  2.0  1.0
3  3.0  1.0
4  4.0  1.0

If you have a DatetimeIndex, you can use a timedelta for time-based windows.

>>> ts = pd.Series(range(10), index=pd.date_range('2017', periods=10))
>>> dts = dd.from_pandas(ts, npartitions=2)
>>> dts.map_overlap(lambda df: df.rolling('2D').sum(),
...                 pd.Timedelta('2D'), 0).compute()
2017-01-01     0.0
2017-01-02     1.0
2017-01-03     3.0
2017-01-04     5.0
2017-01-05     7.0
2017-01-06     9.0
2017-01-07    11.0
2017-01-08    13.0
2017-01-09    15.0
2017-01-10    17.0
dtype: float64

map_partitions(func, *args, **kwargs)

Apply Python function on each DataFrame partition.

Note that the index and divisions are assumed to remain unchanged.

Parameters:

func : function

Function applied to each partition.

args, kwargs :

Arguments and keywords to pass to the function. The partition will be the first argument, and these will be passed after.

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Examples

Given a DataFrame, Series, or Index, such as:

>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
...                    'y': [1., 2., 3., 4., 5.]})
>>> ddf = dd.from_pandas(df, npartitions=2)

One can use map_partitions to apply a function on each partition. Extra arguments and keywords can optionally be provided, and will be passed to the function after the partition.

Here we apply a function with arguments and keywords to a DataFrame, resulting in a Series:

>>> def myadd(df, a, b=1):
...     return df.x + df.y + a + b
>>> res = ddf.map_partitions(myadd, 1, b=2)
>>> res.dtype
dtype('float64')

By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the meta keyword. This can be specified in many forms, for more information see dask.dataframe.utils.make_meta.

Here we specify the output is a Series with no name, and dtype float64:

>>> res = ddf.map_partitions(myadd, 1, b=2, meta=(None, 'f8'))

Here we map a function that takes in a DataFrame, and returns a DataFrame with a new column:

>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y))
>>> res.dtypes
x      int64
y    float64
z    float64
dtype: object

As before, the output metadata can also be specified manually. This time we pass in a dict, as the output is a DataFrame:

>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y),
...                          meta={'x': 'i8', 'y': 'f8', 'z': 'f8'})

In the case where the metadata doesn’t change, you can also pass in the object itself directly:

>>> res = ddf.map_partitions(lambda df: df.head(), meta=df)

Also note that the index and divisions are assumed to remain unchanged. If the function you’re mapping changes the index/divisions, you’ll need to clear them afterwards:

>>> ddf.map_partitions(func).clear_divisions()  
mask(cond, other=nan)

Return an object of same shape as self and whose corresponding entries are from self where cond is False and otherwise are from other.

Parameters:

cond : boolean NDFrame, array-like, or callable

Where cond is False, keep the original value. Where True, replace with corresponding value from other. If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn’t check it).

New in version 0.18.1: A callable can be used as cond.

other : scalar, NDFrame, or callable

Entries where cond is True are replaced with corresponding value from other. If other is callable, it is computed on the NDFrame and should return scalar or NDFrame. The callable must not change input NDFrame (though pandas doesn’t check it).

New in version 0.18.1: A callable can be used as other.

inplace : boolean, default False

Whether to perform the operation in place on the data

axis : alignment axis if needed, default None

level : alignment level if needed, default None

errors : str, {‘raise’, ‘ignore’}, default ‘raise’

  • raise : allow exceptions to be raised
  • ignore : suppress exceptions. On error return original object

Note that currently this parameter won’t affect the results and will always coerce to a suitable dtype.

try_cast : boolean, default False

try to cast the result back to the input type (if possible),

raise_on_error : boolean, default True

Whether to raise on invalid data types (e.g. trying to where on strings)

Deprecated since version 0.21.0.

Returns:

wh : same type as caller

Notes

The mask method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is False the element is used; otherwise the corresponding element from the DataFrame other is used.

The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).

For further details and examples see the mask documentation in indexing.

Examples

>>> s = pd.Series(range(5))  
>>> s.where(s > 0)  
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
>>> s.mask(s > 0)  
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
>>> s.where(s > 1, 10)  
0    10.0
1    10.0
2    2.0
3    3.0
4    4.0
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])  
>>> m = df % 3 == 0  
>>> df.where(m, -df)  
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> df.where(m, -df) == np.where(m, df, -df)  
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
>>> df.where(m, -df) == df.mask(~m, -df)  
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
max(axis=None, skipna=True, split_every=False)

This method returns the maximum of the values in the object. If you want the index of the maximum, use idxmax. This is the equivalent of the numpy.ndarray method argmax.

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA or empty, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

numeric_only : boolean, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns:

max : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
mean(axis=None, skipna=True, split_every=False)

Return the mean of the values for the requested axis

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA or empty, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

numeric_only : boolean, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns:

mean : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
memory_usage(index=True, deep=False)

Memory usage of DataFrame columns.

Parameters:

index : bool

Specifies whether to include memory usage of the DataFrame’s index in the returned Series. If index=True (the default), the first entry of the returned Series is ‘Index’, giving the memory usage of the index.

deep : bool

Introspect the data deeply, interrogate object dtypes for system-level memory consumption

Returns:

sizes : Series

A series with column names as index and memory usage of columns with units of bytes.

See also

numpy.ndarray.nbytes

Notes

Memory usage does not include memory consumed by elements that are not components of the array if deep=False
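
Examples

A brief usage sketch (toy data; exact byte counts depend on the platform and pandas version, so no output is shown):

>>> import pandas as pd  
>>> import dask.dataframe as dd  
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(1000),  
...                                    'y': ['a'] * 1000}), npartitions=4)  
>>> ddf.memory_usage(deep=True).compute()        # bytes per column (and the index)  
>>> ddf.memory_usage(deep=True).sum().compute()  # total bytes  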

merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, suffixes=('_x', '_y'), indicator=False, npartitions=None, shuffle=None)

Merge DataFrame objects by performing a database-style join operation by columns or indexes.

If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on.

Parameters:

right : DataFrame

how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’

  • left: use only keys from left frame, similar to a SQL left outer join; preserve key order
  • right: use only keys from right frame, similar to a SQL right outer join; preserve key order
  • outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically
  • inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys

on : label or list

Field names to join on. Must be found in both DataFrames. If on is None and not merging on indexes, then it merges on the intersection of the columns by default.

left_on : label or list, or array-like

Field names to join on in left DataFrame. Can be a vector or list of vectors of the length of the DataFrame to use a particular vector as the join key instead of columns

right_on : label or list, or array-like

Field names to join on in right DataFrame or vector/list of vectors per left_on docs

left_index : boolean, default False

Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels

right_index : boolean, default False

Use the index from the right DataFrame as the join key. Same caveats as left_index

sort : boolean, default False

Sort the join keys lexicographically in the result DataFrame. If False, the order of the join keys depends on the join type (how keyword)

suffixes : 2-length sequence (tuple, list, ...)

Suffix to apply to overlapping column names in the left and right side, respectively

copy : boolean, default True

If False, do not copy data unnecessarily

indicator : boolean or string, default False

If True, adds a column to output DataFrame called “_merge” with information on the source of each row. If string, column with information on source of each row will be added to output DataFrame, and column will be named value of string. Information column is Categorical-type and takes on a value of “left_only” for observations whose merge key only appears in ‘left’ DataFrame, “right_only” for observations whose merge key only appears in ‘right’ DataFrame, and “both” if the observation’s merge key is found in both.

New in version 0.17.0.

validate : string, default None

If specified, checks if merge is of specified type.

  • “one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.
  • “one_to_many” or “1:m”: check if merge keys are unique in left dataset.
  • “many_to_one” or “m:1”: check if merge keys are unique in right dataset.
  • “many_to_many” or “m:m”: allowed, but does not result in checks.

New in version 0.21.0.

Returns:

merged : DataFrame

The output type will be the same as ‘left’, if it is a subclass of DataFrame.

See also

merge_ordered, merge_asof

Notes

Dask doesn’t support the following argument(s).

  • sort
  • copy
  • validate

Examples

>>> A              >>> B  
    lkey value         rkey value
0   foo  1         0   foo  5
1   bar  2         1   bar  6
2   baz  3         2   qux  7
3   foo  4         3   bar  8
>>> A.merge(B, left_on='lkey', right_on='rkey', how='outer')  
   lkey  value_x  rkey  value_y
0  foo   1        foo   5
1  foo   4        foo   5
2  bar   2        bar   6
3  bar   2        bar   8
4  baz   3        NaN   NaN
5  NaN   NaN      qux   7
min(axis=None, skipna=True, split_every=False)

This method returns the minimum of the values in the object. If you want the index of the minimum, use idxmin. This is the equivalent of the numpy.ndarray method argmin.

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA or empty, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

numeric_only : boolean, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns:

min : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
mod(other, axis='columns', level=None, fill_value=None)

Modulo of dataframe and other, element-wise (binary operator mod).

Equivalent to dataframe % other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

See also

DataFrame.rmod

Notes

Mismatched indices will be unioned together

mul(other, axis='columns', level=None, fill_value=None)

Multiplication of dataframe and other, element-wise (binary operator mul).

Equivalent to dataframe * other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

See also

DataFrame.rmul

Notes

Mismatched indices will be unioned together

ndim

Return dimensionality

ne(other, axis='columns', level=None)

Wrapper for flexible comparison methods ne

nlargest(n=5, columns=None, split_every=None)

Get the rows of a DataFrame sorted by the n largest values of columns.

New in version 0.17.0.

Parameters:

n : int

Number of items to retrieve

columns : list or str

Column name or names to order by

keep : {‘first’, ‘last’, False}, default ‘first’

Where there are duplicate values:

  • first : take the first occurrence.
  • last : take the last occurrence.

Returns:

DataFrame

Notes

Dask doesn’t support the following argument(s).

  • keep

Examples

>>> df = DataFrame({'a': [1, 10, 8, 11, -1],  
...                 'b': list('abdce'),
...                 'c': [1.0, 2.0, np.nan, 3.0, 4.0]})
>>> df.nlargest(3, 'a')  
    a  b   c
3  11  c   3
1  10  b   2
2   8  d NaN
notnull()

Return a boolean same-sized object indicating if the values are not NA.

See also

DataFrame.isna
boolean inverse of notna
DataFrame.notnull
alias of notna
notna
top-level notna
npartitions

Return number of partitions

nsmallest(n=5, columns=None, split_every=None)

Get the rows of a DataFrame sorted by the n smallest values of columns.

New in version 0.17.0.

Parameters:

n : int

Number of items to retrieve

columns : list or str

Column name or names to order by

keep : {‘first’, ‘last’, False}, default ‘first’

Where there are duplicate values:

  • first : take the first occurrence.
  • last : take the last occurrence.

Returns:

DataFrame

Notes

Dask doesn’t support the following argument(s).

  • keep

Examples

>>> df = DataFrame({'a': [1, 10, 8, 11, -1],  
...                 'b': list('abdce'),
...                 'c': [1.0, 2.0, np.nan, 3.0, 4.0]})
>>> df.nsmallest(3, 'a')  
   a  b   c
4 -1  e   4
0  1  a   1
2  8  d NaN
nunique_approx(split_every=None)

Approximate number of unique rows.

This method uses the HyperLogLog algorithm for cardinality estimation to compute the approximate number of unique rows. The approximate error is 0.406%.

Parameters:

split_every : int, optional

Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used. Default is 8.

Returns:

a float representing the approximate number of elements
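
Examples

A short sketch (toy data with four distinct rows): the result is a lazy scalar, so call compute() to obtain the estimate, which should come out close to 4 here:

>>> import pandas as pd  
>>> import dask.dataframe as dd  
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [0, 1, 2, 3] * 25}), npartitions=4)  
>>> ddf.nunique_approx().compute()  # a float close to 4.0  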

persist(**kwargs)

Persist this dask collection into memory

This turns a lazy Dask collection into a Dask collection with the same metadata, but now with the results fully computed or actively computing in the background.

Parameters:

get : callable, optional

A scheduler get function to use. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.

optimize_graph : bool, optional

If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.

**kwargs

Extra keywords to forward to the scheduler get function.

Returns:

New dask collections backed by in-memory data

See also

dask.base.persist
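
Examples

A typical pattern, shown as a minimal sketch (ddf and the column x are assumed to already exist): persist after an expensive step, then reuse the in-memory partitions for several cheap computations.

>>> ddf2 = ddf[ddf.x > 0]      # lazy selection, nothing computed yet  
>>> ddf2 = ddf2.persist()      # compute (or start computing) and keep the results in memory  
>>> ddf2.x.mean().compute()    # these now reuse the persisted partitions  
>>> ddf2.x.max().compute()  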

pipe(func, *args, **kwargs)

Apply func(self, *args, **kwargs)

Parameters:

func : function

function to apply to the NDFrame. args and kwargs are passed into func. Alternatively a (callable, data_keyword) tuple where data_keyword is a string indicating the keyword of callable that expects the NDFrame.

args : iterable, optional

positional arguments passed into func.

kwargs : mapping, optional

a dictionary of keyword arguments passed into func.

Returns:

object : the return type of func.

See also

pandas.DataFrame.apply, pandas.DataFrame.applymap, pandas.Series.map

Notes

Use .pipe when chaining together functions that expect Series, DataFrames or GroupBy objects. Instead of writing

>>> f(g(h(df), arg1=a), arg2=b, arg3=c)  

You can write

>>> (df.pipe(h)  
...    .pipe(g, arg1=a)
...    .pipe(f, arg2=b, arg3=c)
... )

If you have a function that takes the data as (say) the second argument, pass a tuple indicating which keyword expects the data. For example, suppose f takes its data as arg2:

>>> (df.pipe(h)  
...    .pipe(g, arg1=a)
...    .pipe((f, 'arg2'), arg1=a, arg3=c)
...  )
pivot_table(index=None, columns=None, values=None, aggfunc='mean')

Create a spreadsheet-style pivot table as a DataFrame. Target columns must have category dtype to infer result’s columns. index, columns, values and aggfunc must be all scalar.

Parameters:

values : scalar

column to aggregate

index : scalar

column to be index

columns : scalar

column to be columns

aggfunc : {‘mean’, ‘sum’, ‘count’}, default ‘mean’

Returns:

table : DataFrame
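
Examples

A small sketch (toy data; names are illustrative) of the category-dtype requirement: the columns column is converted with categorize() before pivoting.

>>> import pandas as pd  
>>> import dask.dataframe as dd  
>>> df = pd.DataFrame({'kind': ['a', 'b', 'a', 'b'],  
...                    'id': [1, 1, 2, 2],  
...                    'value': [1., 2., 3., 4.]})  
>>> ddf = dd.from_pandas(df, npartitions=2)  
>>> ddf = ddf.categorize(columns=['kind'])  
>>> ddf.pivot_table(index='id', columns='kind',  
...                 values='value', aggfunc='mean').compute()  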

pow(other, axis='columns', level=None, fill_value=None)

Exponential power of dataframe and other, element-wise (binary operator pow).

Equivalent to dataframe ** other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

See also

DataFrame.rpow

Notes

Mismatched indices will be unioned together

prod(axis=None, skipna=True, split_every=False)

Return the product of the values for the requested axis

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA or empty, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

numeric_only : boolean, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns:

prod : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
quantile(q=0.5, axis=0)

Approximate row-wise and precise column-wise quantiles of DataFrame

Parameters:

q : list/array of floats, default 0.5 (50%)

Iterable of numbers ranging from 0 to 1 for the desired quantiles

axis : {0, 1, ‘index’, ‘columns’} (default 0)

0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise
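
Examples

A brief sketch (toy data): a single q yields a lazy Series of per-column quantiles, while a list of quantiles yields a DataFrame.

>>> import pandas as pd  
>>> import dask.dataframe as dd  
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(100),  
...                                    'y': range(100, 200)}), npartitions=4)  
>>> ddf.quantile(0.5).compute()           # approximate median of each column  
>>> ddf.quantile([0.25, 0.75]).compute()  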

query(expr, **kwargs)

Filter dataframe with complex expression

Blocked version of pd.DataFrame.query

This is like the sequential version except that this will also happen in many threads. This may conflict with numexpr, which will use multiple threads itself. We recommend that you set numexpr to use a single thread:

import numexpr
numexpr.set_nthreads(1)

See also

pandas.DataFrame.query
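
Examples

A minimal usage sketch (toy data; column names are illustrative):

>>> import pandas as pd  
>>> import dask.dataframe as dd  
>>> df = pd.DataFrame({'x': range(10), 'y': range(10, 20)})  
>>> ddf = dd.from_pandas(df, npartitions=2)  
>>> ddf.query('x > 3 and y < 18').compute()  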

radd(other, axis='columns', level=None, fill_value=None)

Addition of dataframe and other, element-wise (binary operator radd).

Equivalent to other + dataframe, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

See also

DataFrame.add

Notes

Mismatched indices will be unioned together

random_split(frac, random_state=None)

Pseudorandomly split dataframe into different pieces row-wise

Parameters:

frac : list

List of floats that should sum to one.

random_state : int or np.random.RandomState

If an int, create a new RandomState with this as the seed. Otherwise, draw from the passed RandomState.

See also

dask.DataFrame.sample

Examples

50/50 split

>>> a, b = df.random_split([0.5, 0.5])  

80/10/10 split, consistent random_state

>>> a, b, c = df.random_split([0.8, 0.1, 0.1], random_state=123)  
rdiv(other, axis='columns', level=None, fill_value=None)

Floating division of dataframe and other, element-wise (binary operator rtruediv).

Equivalent to other / dataframe, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

Notes

Mismatched indices will be unioned together

reduction(chunk, aggregate=None, combine=None, meta='__no_default__', token=None, split_every=None, chunk_kwargs=None, aggregate_kwargs=None, combine_kwargs=None, **kwargs)

Generic row-wise reductions.

Parameters:

chunk : callable

Function to operate on each partition. Should return a pandas.DataFrame, pandas.Series, or a scalar.

aggregate : callable, optional

Function to operate on the concatenated result of chunk. If not specified, defaults to chunk. Used to do the final aggregation in a tree reduction.

The input to aggregate depends on the output of chunk. If the output of chunk is a:

  • scalar: Input is a Series, with one row per partition.
  • Series: Input is a DataFrame, with one row per partition. Columns are the rows in the output series.
  • DataFrame: Input is a DataFrame, with one row per partition. Columns are the columns in the output dataframes.

Should return a pandas.DataFrame, pandas.Series, or a scalar.

combine : callable, optional

Function to operate on intermediate concatenated results of chunk in a tree-reduction. If not provided, defaults to aggregate. The input/output requirements should match that of aggregate described above.

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

token : str, optional

The name to use for the output keys.

split_every : int, optional

Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used, and all intermediates will be concatenated and passed to aggregate. Default is 8.

chunk_kwargs : dict, optional

Keyword arguments to pass on to chunk only.

aggregate_kwargs : dict, optional

Keyword arguments to pass on to aggregate only.

combine_kwargs : dict, optional

Keyword arguments to pass on to combine only.

kwargs :

All remaining keywords will be passed to chunk, combine, and aggregate.

Examples

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': range(50), 'y': range(50, 100)})
>>> ddf = dd.from_pandas(df, npartitions=4)

Count the number of rows in a DataFrame. To do this, count the number of rows in each partition, then sum the results:

>>> res = ddf.reduction(lambda x: x.count(),
...                     aggregate=lambda x: x.sum())
>>> res.compute()
x    50
y    50
dtype: int64

Count the number of rows in a Series with elements greater than or equal to a value (provided via a keyword).

>>> def count_greater(x, value=0):
...     return (x >= value).sum()
>>> res = ddf.x.reduction(count_greater, aggregate=lambda x: x.sum(),
...                       chunk_kwargs={'value': 25})
>>> res.compute()
25

Aggregate both the sum and count of a Series at the same time:

>>> def sum_and_count(x):
...     return pd.Series({'sum': x.sum(), 'count': x.count()})
>>> res = ddf.x.reduction(sum_and_count, aggregate=lambda x: x.sum())
>>> res.compute()
count      50
sum      1225
dtype: int64

Doing the same, but for a DataFrame. Here chunk returns a DataFrame, meaning the input to aggregate is a DataFrame with an index with non-unique entries for both ‘x’ and ‘y’. We groupby the index, and sum each group to get the final result.

>>> def sum_and_count(x):
...     return pd.DataFrame({'sum': x.sum(), 'count': x.count()})
>>> res = ddf.reduction(sum_and_count,
...                     aggregate=lambda x: x.groupby(level=0).sum())
>>> res.compute()
   count   sum
x     50  1225
y     50  3725
rename(index=None, columns=None)

Alter axes labels.

Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error.

See the user guide for more.

Parameters:

mapper, index, columns : dict-like or function, optional

dict-like or functions transformations to apply to that axis’ values. Use either mapper and axis to specify the axis to target with mapper, or index and columns.

axis : int or str, optional

Axis to target with mapper. Can be either the axis name (‘index’, ‘columns’) or number (0, 1). The default is ‘index’.

copy : boolean, default True

Also copy underlying data

inplace : boolean, default False

Whether to return a new DataFrame. If True then value of copy is ignored.

level : int or level name, default None

In case of a MultiIndex, only rename labels in the specified level.

Returns:

renamed : DataFrame

See also

pandas.DataFrame.rename_axis

Examples

DataFrame.rename supports two calling conventions

  • (index=index_mapper, columns=columns_mapper, ...)
  • (mapper, axis={'index', 'columns'}, ...)

We highly recommend using keyword arguments to clarify your intent.

>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})  
>>> df.rename(index=str, columns={"A": "a", "B": "c"})  
   a  c
0  1  4
1  2  5
2  3  6
>>> df.rename(index=str, columns={"A": "a", "C": "c"})  
   a  B
0  1  4
1  2  5
2  3  6

Using axis-style parameters

>>> df.rename(str.lower, axis='columns')  
   a  b
0  1  4
1  2  5
2  3  6
>>> df.rename({1: 2, 2: 4}, axis='index')  
   A  B
0  1  4
2  2  5
4  3  6
repartition(divisions=None, npartitions=None, freq=None, force=False)

Repartition dataframe along new divisions

Parameters:

divisions : list, optional

List of partitions to be used. If specified npartitions will be ignored.

npartitions : int, optional

Number of partitions of output, must be less than npartitions of input. Only used if divisions isn’t specified.

freq : str, pd.Timedelta

A period on which to partition timeseries data like '7D' or '12h' or pd.Timedelta(hours=12). Assumes a datetime index.

force : bool, default False

Allows the expansion of the existing divisions. If False then the new divisions lower and upper bounds must be the same as the old divisions.

Examples

>>> df = df.repartition(npartitions=10)  
>>> df = df.repartition(divisions=[0, 5, 10, 20])  
>>> df = df.repartition(freq='7d')  
resample(rule, how=None, closed=None, label=None)

Convenience method for frequency conversion and resampling of time series. Object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or pass datetime-like values to the on or level keyword.

Parameters:

rule : string

the offset string or object representing target conversion

axis : int, optional, default 0

closed : {‘right’, ‘left’}

Which side of bin interval is closed. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.

label : {‘right’, ‘left’}

Which bin edge label to label bucket with. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.

convention : {‘start’, ‘end’, ‘s’, ‘e’}

For PeriodIndex only, controls whether to use the start or end of rule

loffset : timedelta

Adjust the resampled time labels

base : int, default 0

For frequencies that evenly subdivide 1 day, the “origin” of the aggregated intervals. For example, for ‘5min’ frequency, base could range from 0 through 4. Defaults to 0

on : string, optional

For a DataFrame, column to use instead of index for resampling. Column must be datetime-like.

New in version 0.19.0.

level : string or int, optional

For a MultiIndex, level (name or number) to use for resampling. Level must be datetime-like.

New in version 0.19.0.

Notes

To learn more about the offset strings, please see the pandas documentation on offset aliases.

Examples

Start by creating a series with 9 one minute timestamps.

>>> index = pd.date_range('1/1/2000', periods=9, freq='T')  
>>> series = pd.Series(range(9), index=index)  
>>> series  
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64

Downsample the series into 3 minute bins and sum the values of the timestamps falling into a bin.

>>> series.resample('3T').sum()  
2000-01-01 00:00:00     3
2000-01-01 00:03:00    12
2000-01-01 00:06:00    21
Freq: 3T, dtype: int64

Downsample the series into 3 minute bins as above, but label each bin using the right edge instead of the left. Please note that the value in the bucket used as the label is not included in the bucket, which it labels. For example, in the original series the bucket 2000-01-01 00:03:00 contains the value 3, but the summed value in the resampled bucket with the label 2000-01-01 00:03:00 does not include 3 (if it did, the summed value would be 6, not 3). To include this value close the right side of the bin interval as illustrated in the example below this one.

>>> series.resample('3T', label='right').sum()  
2000-01-01 00:03:00     3
2000-01-01 00:06:00    12
2000-01-01 00:09:00    21
Freq: 3T, dtype: int64

Downsample the series into 3 minute bins as above, but close the right side of the bin interval.

>>> series.resample('3T', label='right', closed='right').sum()  
2000-01-01 00:00:00     0
2000-01-01 00:03:00     6
2000-01-01 00:06:00    15
2000-01-01 00:09:00    15
Freq: 3T, dtype: int64

Upsample the series into 30 second bins.

>>> series.resample('30S').asfreq()[0:5] #select first 5 rows  
2000-01-01 00:00:00   0.0
2000-01-01 00:00:30   NaN
2000-01-01 00:01:00   1.0
2000-01-01 00:01:30   NaN
2000-01-01 00:02:00   2.0
Freq: 30S, dtype: float64

Upsample the series into 30 second bins and fill the NaN values using the pad method.

>>> series.resample('30S').pad()[0:5]  
2000-01-01 00:00:00    0
2000-01-01 00:00:30    0
2000-01-01 00:01:00    1
2000-01-01 00:01:30    1
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64

Upsample the series into 30 second bins and fill the NaN values using the bfill method.

>>> series.resample('30S').bfill()[0:5]  
2000-01-01 00:00:00    0
2000-01-01 00:00:30    1
2000-01-01 00:01:00    1
2000-01-01 00:01:30    2
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64

Pass a custom function via apply

>>> def custom_resampler(array_like):  
...     return np.sum(array_like)+5
>>> series.resample('3T').apply(custom_resampler)  
2000-01-01 00:00:00     8
2000-01-01 00:03:00    17
2000-01-01 00:06:00    26
Freq: 3T, dtype: int64

For a Series with a PeriodIndex, the keyword convention can be used to control whether to use the start or end of rule.

>>> s = pd.Series([1, 2], index=pd.period_range('2012-01-01',  
...                                             freq='A',
...                                             periods=2))
>>> s  
2012    1
2013    2
Freq: A-DEC, dtype: int64

Resample by month using ‘start’ convention. Values are assigned to the first month of the period.

>>> s.resample('M', convention='start').asfreq().head()  
2012-01    1.0
2012-02    NaN
2012-03    NaN
2012-04    NaN
2012-05    NaN
Freq: M, dtype: float64

Resample by month using ‘end’ convention. Values are assigned to the last month of the period.

>>> s.resample('M', convention='end').asfreq()  
2012-12    1.0
2013-01    NaN
2013-02    NaN
2013-03    NaN
2013-04    NaN
2013-05    NaN
2013-06    NaN
2013-07    NaN
2013-08    NaN
2013-09    NaN
2013-10    NaN
2013-11    NaN
2013-12    2.0
Freq: M, dtype: float64

For DataFrame objects, the keyword on can be used to specify the column instead of the index for resampling.

>>> df = pd.DataFrame(data=9*[range(4)], columns=['a', 'b', 'c', 'd'])  
>>> df['time'] = pd.date_range('1/1/2000', periods=9, freq='T')  
>>> df.resample('3T', on='time').sum()  
                     a  b  c  d
time
2000-01-01 00:00:00  0  3  6  9
2000-01-01 00:03:00  0  3  6  9
2000-01-01 00:06:00  0  3  6  9

For a DataFrame with a MultiIndex, the keyword level can be used to specify the level on which the resampling needs to take place.

>>> time = pd.date_range('1/1/2000', periods=5, freq='T')  
>>> df2 = pd.DataFrame(data=10*[range(4)],  
...                    columns=['a', 'b', 'c', 'd'],
...                    index=pd.MultiIndex.from_product([time, [1, 2]])
...                    )
>>> df2.resample('3T', level=0).sum()  
                     a  b   c   d
2000-01-01 00:00:00  0  6  12  18
2000-01-01 00:03:00  0  4   8  12
reset_index(drop=False)

Reset the index to the default index.

Note that unlike in pandas, the reset dask.dataframe index will not be monotonically increasing from 0. Instead, it will restart at 0 for each partition (e.g. index1 = [0, ..., 10], index2 = [0, ...]). This is due to the inability to statically know the full length of the index.

For DataFrame with multi-level index, returns a new DataFrame with labeling information in the columns under the index names, defaulting to ‘level_0’, ‘level_1’, etc. if any are None. For a standard index, the index name will be used (if set), otherwise a default ‘index’ or ‘level_0’ (if ‘index’ is already taken) will be used.

Parameters:

drop : boolean, default False

Do not try to insert index into dataframe columns.
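
Examples

A small sketch (toy data, two partitions) of the per-partition restart described above:

>>> import pandas as pd  
>>> import dask.dataframe as dd  
>>> df = pd.DataFrame({'x': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])  
>>> ddf = dd.from_pandas(df, npartitions=2)  
>>> ddf.reset_index(drop=True).compute()   # the index restarts at 0 in the second partition  
    x
0  10
1  20
0  30
1  40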

rfloordiv(other, axis='columns', level=None, fill_value=None)

Integer division of dataframe and other, element-wise (binary operator rfloordiv).

Equivalent to other // dataframe, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

Notes

Mismatched indices will be unioned together

rmod(other, axis='columns', level=None, fill_value=None)

Modulo of dataframe and other, element-wise (binary operator rmod).

Equivalent to other % dataframe, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

See also

DataFrame.mod

Notes

Mismatched indices will be unioned together

rmul(other, axis='columns', level=None, fill_value=None)

Multiplication of dataframe and other, element-wise (binary operator rmul).

Equivalent to other * dataframe, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

See also

DataFrame.mul

Notes

Mismatched indices will be unioned together

rolling(window, min_periods=None, freq=None, center=False, win_type=None, axis=0)

Provides rolling transformations.

Parameters:

window : int, str, offset

Size of the moving window. This is the number of observations used for calculating the statistic. The window size must not be so large as to span more than one adjacent partition. If using an offset or offset alias like ‘5D’, the data must have a DatetimeIndex

Changed in version 0.15.0: Now accepts offsets and string offset aliases

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

center : boolean, default False

Set the labels at the center of the window.

win_type : string, default None

Provide a window type. The recognized window types are identical to pandas.

axis : int, default 0

Returns:

a Rolling object on which to call a method to compute a statistic

Notes

The freq argument is not supported.
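
Examples

A minimal usage sketch (toy data; the window is kept small relative to the partition size, as required above):

>>> import pandas as pd  
>>> import dask.dataframe as dd  
>>> df = pd.DataFrame({'x': [1., 2., 3., 4., 5., 6.]})  
>>> ddf = dd.from_pandas(df, npartitions=2)  
>>> ddf.rolling(3).mean().compute()                # fixed-size window of 3 rows  
>>> ddf.rolling(3, min_periods=1).sum().compute()  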

round(decimals=0)

Round a DataFrame to a variable number of decimal places.

New in version 0.17.0.

Parameters:

decimals : int, dict, Series

Number of decimal places to round each column to. If an int is given, round each column to the same number of places. Otherwise dict and Series round to variable numbers of places. Column names should be in the keys if decimals is a dict-like, or in the index if decimals is a Series. Any columns not included in decimals will be left as is. Elements of decimals which are not columns of the input will be ignored.

Returns:

DataFrame object

See also

numpy.around, Series.round

Examples

>>> df = pd.DataFrame(np.random.random([3, 3]),  
...     columns=['A', 'B', 'C'], index=['first', 'second', 'third'])
>>> df  
               A         B         C
first   0.028208  0.992815  0.173891
second  0.038683  0.645646  0.577595
third   0.877076  0.149370  0.491027
>>> df.round(2)  
           A     B     C
first   0.03  0.99  0.17
second  0.04  0.65  0.58
third   0.88  0.15  0.49
>>> df.round({'A': 1, 'C': 2})  
          A         B     C
first   0.0  0.992815  0.17
second  0.0  0.645646  0.58
third   0.9  0.149370  0.49
>>> decimals = pd.Series([1, 0, 2], index=['A', 'B', 'C'])  
>>> df.round(decimals)  
          A  B     C
first   0.0  1  0.17
second  0.0  1  0.58
third   0.9  0  0.49
rpow(other, axis='columns', level=None, fill_value=None)

Exponential power of dataframe and other, element-wise (binary operator rpow).

Equivalent to other ** dataframe, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

See also

DataFrame.pow

Notes

Mismatched indices will be unioned together

rsub(other, axis='columns', level=None, fill_value=None)

Subtraction of dataframe and other, element-wise (binary operator rsub).

Equivalent to other - dataframe, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

See also

DataFrame.sub

Notes

Mismatched indices will be unioned together

rtruediv(other, axis='columns', level=None, fill_value=None)

Floating division of dataframe and other, element-wise (binary operator rtruediv).

Equivalent to other / dataframe, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

Notes

Mismatched indices will be unioned together

sample(frac, replace=False, random_state=None)

Random sample of items

Parameters:

frac : float, optional

Fraction of axis items to return.

replace: boolean, optional

Sample with or without replacement. Default = False.

random_state: int or ``np.random.RandomState``

If an int, we create a new RandomState with this as the seed. Otherwise we draw from the passed RandomState.

See also

DataFrame.random_split, pandas.DataFrame.sample
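
A brief sketch (not from the original docstring), assuming a Dask DataFrame named ddf already exists:

>>> sampled = ddf.sample(frac=0.1, random_state=42)   # keep roughly 10% of the rows, reproducibly
>>> boot = ddf.sample(frac=1.0, replace=True)         # resample with replacement (bootstrap-style)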

select_dtypes(include=None, exclude=None)

Return a subset of a DataFrame including/excluding columns based on their dtype.

Parameters:

include, exclude : scalar or list-like

A selection of dtypes or strings to be included/excluded. At least one of these parameters must be supplied.

Returns:

subset : DataFrame

The subset of the frame including the dtypes in include and excluding the dtypes in exclude.

Raises:

ValueError

  • If both of include and exclude are empty
  • If include and exclude have overlapping elements
  • If any kind of string dtype is passed in.

Notes

  • To select all numeric types use the numpy dtype numpy.number
  • To select strings you must use the object dtype, but note that this will return all object dtype columns
  • See the numpy dtype hierarchy
  • To select datetimes, use np.datetime64, ‘datetime’ or ‘datetime64’
  • To select timedeltas, use np.timedelta64, ‘timedelta’ or ‘timedelta64’
  • To select Pandas categorical dtypes, use ‘category’
  • To select Pandas datetimetz dtypes, use ‘datetimetz’ (new in 0.20.0), or a ‘datetime64[ns, tz]’ string

Examples

>>> df = pd.DataFrame({'a': np.random.randn(6).astype('f4'),  
...                    'b': [True, False] * 3,
...                    'c': [1.0, 2.0] * 3})
>>> df  
        a      b  c
0  0.3962   True  1
1  0.1459  False  2
2  0.2623   True  1
3  0.0764  False  2
4 -0.9703   True  1
5 -1.2094  False  2
>>> df.select_dtypes(include='bool')  
       b
0   True
1  False
2   True
3  False
4   True
5  False
>>> df.select_dtypes(include=['float64'])  
   c
0  1
1  2
2  1
3  2
4  1
5  2
>>> df.select_dtypes(exclude=['floating'])  
       b
0   True
1  False
2   True
3  False
4   True
5  False
sem(axis=None, skipna=None, ddof=1, split_every=False)

Return unbiased standard error of the mean over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

ddof : int, default 1

degrees of freedom

numeric_only : boolean, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns:

sem : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
set_index(other, drop=True, sorted=False, npartitions=None, divisions=None, **kwargs)

Set the DataFrame index (row labels) using an existing column

This realigns the dataset to be sorted by a new column. This can have a significant impact on performance, because joins, groupbys, lookups, etc. are all much faster on that column. However, this performance increase comes with a cost: sorting a parallel dataset requires expensive shuffles. Often we set_index once directly after data ingest and filtering and then perform many cheap computations off of the sorted dataset.

This function operates exactly like pandas.set_index except with different performance costs (it is much more expensive). Under normal operation this function does an initial pass over the index column to compute approximate quantiles to serve as future divisions. It then passes over the data a second time, splitting up each input partition into several pieces and distributing those pieces to all of the output partitions, now in sorted order.

In some cases we can alleviate those costs, for example if your dataset is sorted already then we can avoid making many small pieces or if you know good values to split the new index column then we can avoid the initial pass over the data. For example if your new index is a datetime index and your data is already sorted by day then this entire operation can be done for free. You can control these options with the following parameters.

Parameters:

df: Dask DataFrame

index: string or Dask Series

npartitions: int, None, or ‘auto’

The ideal number of output partitions. If None use the same as the input. If ‘auto’ then decide by memory use.

shuffle: string, optional

Either 'disk' for single-node operation or 'tasks' for distributed operation. Will be inferred by your current scheduler.

sorted: bool, optional

If the index column is already sorted in increasing order. Defaults to False

divisions: list, optional

Known values on which to separate index values of the partitions. See http://dask.pydata.org/en/latest/dataframe-design.html#partitions Defaults to computing this with a single pass over the data. Note that if sorted=True, specified divisions are assumed to match the existing partitions in the data. If this is untrue, you should leave divisions empty and call repartition after set_index.

compute: bool

Whether or not to trigger an immediate computation. Defaults to False.

Examples

>>> df2 = df.set_index('x')  
>>> df2 = df.set_index(df.x)  
>>> df2 = df.set_index(df.timestamp, sorted=True)  

A common case is when we have a datetime column that we know to be sorted and is cleanly divided by day. We can set this index for free by specifying both that the column is pre-sorted and the particular divisions along which it is divided.

>>> import pandas as pd
>>> divisions = pd.date_range('2000', '2010', freq='1D')
>>> df2 = df.set_index('timestamp', sorted=True, divisions=divisions)  
shift(periods=1, freq=None, axis=0)

Shift index by desired number of periods with an optional time freq

Parameters:

periods : int

Number of periods to move, can be positive or negative

freq : DateOffset, timedelta, or time rule string, optional

Increment to use from the tseries module or time rule (e.g. ‘EOM’). See Notes.

axis : {0 or ‘index’, 1 or ‘columns’}

Returns:

shifted : DataFrame

Notes

If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data.
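
As an illustrative sketch (not part of the original docstring), assuming a Dask DataFrame ddf with a DatetimeIndex and a column x:

>>> ddf.x.shift(1)          # values move down one position; the first entry becomes NaN
>>> ddf.shift(2, freq='D')  # with freq, the index itself is shifted and the data stays aligned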

size

Size of the series

std(axis=None, skipna=True, ddof=1, split_every=False)

Return sample standard deviation over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

ddof : int, default 1

degrees of freedom

numeric_only : boolean, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns:

std : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
sub(other, axis='columns', level=None, fill_value=None)

Subtraction of dataframe and other, element-wise (binary operator sub).

Equivalent to dataframe - other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

See also

DataFrame.rsub

Notes

Mismatched indices will be unioned together

sum(axis=None, skipna=True, split_every=False)

Return the sum of the values for the requested axis

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA or empty, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

numeric_only : boolean, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns:

sum : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
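
A minimal sketch (not from the original docstring), assuming an all-numeric Dask DataFrame ddf with a column x:

>>> ddf.sum().compute()                  # column-wise sums, returned as a pandas Series
>>> ddf.sum(axis=1)                      # row-wise sums, returned as a dask Series
>>> ddf.x.sum(split_every=4).compute()   # control the width of the tree reduction
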
tail(n=5, compute=True)

Last n rows of the dataset

Caveat: this only checks the last n rows of the last partition.

to_bag(index=False)

Create Dask Bag from a Dask DataFrame

Parameters:

index : bool, optional

If True, the elements are tuples of (index, value), otherwise they’re just the value. Default is False.

Examples

>>> bag = df.to_bag()  
to_csv(filename, **kwargs)

Store Dask DataFrame to CSV files

One filename per partition will be created. You can specify the filenames in a variety of ways.

Use a globstring:

>>> df.to_csv('/path/to/data/export-*.csv')  

The * will be replaced by the increasing sequence 0, 1, 2, ...

/path/to/data/export-0.csv
/path/to/data/export-1.csv

Use a globstring and a name_function= keyword argument. The name_function function should expect an integer and produce a string. Strings produced by name_function must preserve the order of their respective partition indices.

>>> from datetime import date, timedelta
>>> def name(i):
...     return str(date(2015, 1, 1) + i * timedelta(days=1))
>>> name(0)
'2015-01-01'
>>> name(15)
'2015-01-16'
>>> df.to_csv('/path/to/data/export-*.csv', name_function=name)  
/path/to/data/export-2015-01-01.csv
/path/to/data/export-2015-01-02.csv
...

You can also provide an explicit list of paths:

>>> paths = ['/path/to/data/alice.csv', '/path/to/data/bob.csv', ...]  
>>> df.to_csv(paths) 
Parameters:

filename : string

Path glob indicating the naming scheme for the output files

name_function : callable, default None

Function accepting an integer (partition index) and producing a string to replace the asterisk in the given filename globstring. Should preserve the lexicographic order of partitions

compression : string or None

String like ‘gzip’ or ‘xz’. Must support efficient random access. Filenames with extensions corresponding to known compression algorithms (gz, bz2) will be compressed accordingly automatically

sep : character, default ‘,’

Field delimiter for the output file

na_rep : string, default ‘’

Missing data representation

float_format : string, default None

Format string for floating point numbers

columns : sequence, optional

Columns to write

header : boolean or list of string, default True

Write out column names. If a list of string is given it is assumed to be aliases for the column names

index : boolean, default True

Write row names (index)

index_label : string or sequence, or False, default None

Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex. If False do not print fields for index names. Use index_label=False for easier importing in R

nanRep : None

deprecated, use na_rep

mode : str

Python write mode, default ‘w’

encoding : string, optional

A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3.

compression : string, optional

a string representing the compression to use in the output file, allowed values are ‘gzip’, ‘bz2’, ‘xz’, only used when the first argument is a filename

line_terminator : string, default ‘\n’

The newline character or character sequence to use in the output file

quoting : optional constant from csv module

defaults to csv.QUOTE_MINIMAL

quotechar : string (length 1), default ‘”’

character used to quote fields

doublequote : boolean, default True

Control quoting of quotechar inside a field

escapechar : string (length 1), default None

character used to escape sep and quotechar when appropriate

chunksize : int or None

rows to write at a time

tupleize_cols : boolean, default False

Write MultiIndex columns as a list of tuples (if True) or in the new, expanded format (if False)

date_format : string, default None

Format string for datetime objects

decimal: string, default ‘.’

Character recognized as decimal separator. E.g. use ‘,’ for European data

storage_options: dict

Parameters passed on to the backend filesystem class.

Returns:

The names of the files written if they were computed immediately; otherwise, the delayed tasks associated with writing the files.

to_delayed()

Create Dask Delayed objects from a Dask Dataframe

Returns a list of delayed values, one value per partition.

Examples

>>> partitions = df.to_delayed()  
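
As a hedged sketch of how the per-partition values can be used (not part of the original docstring):

>>> import dask
>>> partitions = df.to_delayed()                  # one Delayed per partition, each a pandas DataFrame
>>> lengths = [dask.delayed(len)(p) for p in partitions]
>>> total_rows = dask.delayed(sum)(lengths).compute()
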
to_hdf(path_or_buf, key, mode='a', append=False, get=None, **kwargs)

Store Dask Dataframe to Hierarchical Data Format (HDF) files

This is a parallel version of the Pandas function of the same name. Please see the Pandas docstring for more detailed information about shared keyword arguments.

This function differs from the Pandas version by saving the many partitions of a Dask DataFrame in parallel, either to many files, or to many datasets within the same file. You may specify this parallelism with an asterisk * within the filename or datapath, and an optional name_function. The asterisk will be replaced with an increasing sequence of integers starting from 0 or with the result of calling name_function on each of those integers.

This function only supports the Pandas 'table' format, not the more specialized 'fixed' format.

Parameters:

path: string

Path to a target filename. May contain a * to denote many filenames

key: string

Datapath within the files. May contain a * to denote many locations

name_function: function

A function to convert the * in the above options to a string. Should take in a number from 0 to the number of partitions and return a string. (see examples below)

compute: bool

Whether or not to execute immediately. If False then this returns a dask.Delayed value.

lock: Lock, optional

Lock to use to prevent concurrency issues. By default a threading.Lock, multiprocessing.Lock or SerializableLock will be used depending on your scheduler if a lock is required. See dask.utils.get_scheduler_lock for more information about lock selection.

**other:

See pandas.to_hdf for more information

Returns:

None: if compute == True

delayed value: if compute == False

See also

read_hdf, to_parquet

Examples

Save Data to a single file

>>> df.to_hdf('output.hdf', '/data')            

Save data to multiple datapaths within the same file:

>>> df.to_hdf('output.hdf', '/data-*')          

Save data to multiple files:

>>> df.to_hdf('output-*.hdf', '/data')          

Save data to multiple files, using the multiprocessing scheduler:

>>> df.to_hdf('output-*.hdf', '/data', get=dask.multiprocessing.get) 

Specify a custom naming scheme. This writes files as ‘2000-01-01.hdf’, ‘2000-01-02.hdf’, ‘2000-01-03.hdf’, etc.

>>> from datetime import date, timedelta
>>> base = date(year=2000, month=1, day=1)
>>> def name_function(i):
...     ''' Convert integer 0 to n to a string '''
...     return str(base + timedelta(days=i))
>>> df.to_hdf('*.hdf', '/data', name_function=name_function) 
to_html(max_rows=5)

Render a DataFrame as an HTML table.

to_html-specific options:

bold_rows : boolean, default True
Make the row labels bold in the output
classes : str or list or tuple, default None
CSS class(es) to apply to the resulting html table
escape : boolean, default True
Convert the characters <, >, and & to HTML-safe sequences.
max_rows : int, optional
Maximum number of rows to show before truncating. If None, show all.
max_cols : int, optional
Maximum number of columns to show before truncating. If None, show all.
decimal : string, default ‘.’

Character recognized as decimal separator, e.g. ‘,’ in Europe

New in version 0.18.0.

border : int

A border=border attribute is included in the opening <table> tag. Default pd.options.html.border.

New in version 0.19.0.

Parameters:

buf : StringIO-like, optional

buffer to write to

columns : sequence, optional

the subset of columns to write; default None writes all columns

col_space : int, optional

the minimum width of each column

header : bool, optional

whether to print column labels, default True

index : bool, optional

whether to print index (row) labels, default True

na_rep : string, optional

string representation of NAN to use, default ‘NaN’

formatters : list or dict of one-parameter functions, optional

formatter functions to apply to columns’ elements by position or name, default None. The result of each function must be a unicode string. List must be of length equal to the number of columns.

float_format : one-parameter function, optional

formatter function to apply to columns’ elements if they are floats, default None. The result of this function must be a unicode string.

sparsify : bool, optional

Set to False for a DataFrame with a hierarchical index to print every multiindex key at each row, default True

index_names : bool, optional

Prints the names of the indexes, default True

line_width : int, optional

Width to wrap a line in characters, default no wrap

justify : {‘left’, ‘right’, ‘center’, ‘justify’, ‘justify-all’, ‘start’, ‘end’, ‘inherit’, ‘match-parent’, ‘initial’, ‘unset’}, default None

How to justify the column labels. If None uses the option from the print configuration (controlled by set_option), ‘right’ out of the box.

Returns:

formatted : string (or unicode, depending on data and options)
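
A small sketch (not from the original docstring); the output path preview.html is hypothetical:

>>> html = df.to_html(max_rows=5)          # string containing an HTML table of the leading rows
>>> with open('preview.html', 'w') as f:
...     f.write(html)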

to_timestamp(freq=None, how='start', axis=0)

Cast to DatetimeIndex of timestamps, at beginning of period

Parameters:

freq : string, default frequency of PeriodIndex

Desired frequency

how : {‘s’, ‘e’, ‘start’, ‘end’}

Convention for converting period to timestamp; start of period vs. end

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

The axis to convert (the index by default)

copy : boolean, default True

If false then underlying input data is not copied

Returns:

df : DataFrame with DatetimeIndex

Notes

Dask doesn’t support the following argument(s).

  • copy
truediv(other, axis='columns', level=None, fill_value=None)

Floating division of dataframe and other, element-wise (binary operator truediv).

Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series, DataFrame, or constant

axis : {0, 1, ‘index’, ‘columns’}

For Series input, axis to match Series index on

fill_value : None or float value, default None

Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : DataFrame

Notes

Mismatched indices will be unioned together

values

Return a dask.array of the values of this dataframe

Warning: This creates a dask.array without precise shape information. Operations that depend on shape information, like slicing or reshaping, will not work.
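
A minimal sketch (not part of the original docstring), assuming an all-numeric Dask DataFrame df:

>>> arr = df.values          # lazy dask.array backed by the same partitions
>>> arr.sum().compute()      # reductions that do not need exact shape information still work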

var(axis=None, skipna=True, ddof=1, split_every=False)

Return unbiased variance over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

ddof : int, default 1

degrees of freedom

numeric_only : boolean, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns:

var : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
visualize(filename='mydask', format=None, optimize_graph=False, **kwargs)

Render the computation of this object’s task graph using graphviz.

Requires graphviz to be installed.

Parameters:

filename : str or None, optional

The name (without an extension) of the file to write to disk. If filename is None, no file will be written, and we communicate with dot using only pipes.

format : {‘png’, ‘pdf’, ‘dot’, ‘svg’, ‘jpeg’, ‘jpg’}, optional

Format in which to write output file. Default is ‘png’.

optimize_graph : bool, optional

If True, the graph is optimized before rendering. Otherwise, the graph is displayed as is. Default is False.

**kwargs

Additional keyword arguments to forward to to_graphviz.

Returns:

result : IPython.display.Image, IPython.display.SVG, or None

See dask.dot.dot_graph for more information.

See also

dask.base.visualize, dask.dot.dot_graph

Notes

For more information on optimization see here:

http://dask.pydata.org/en/latest/optimize.html
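
A brief sketch (not from the original docstring); the filename graph is arbitrary and graphviz must be installed:

>>> df.visualize()                                # writes mydask.png and returns an image object
>>> df.visualize(filename='graph', format='svg')  # writes graph.svg instead
>>> df.visualize(optimize_graph=True)             # render the optimized task graph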

where(cond, other=nan)

Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.

Parameters:

cond : boolean NDFrame, array-like, or callable

Where cond is True, keep the original value. Where False, replace with corresponding value from other. If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn’t check it).

New in version 0.18.1: A callable can be used as cond.

other : scalar, NDFrame, or callable

Entries where cond is False are replaced with corresponding value from other. If other is callable, it is computed on the NDFrame and should return scalar or NDFrame. The callable must not change input NDFrame (though pandas doesn’t check it).

New in version 0.18.1: A callable can be used as other.

inplace : boolean, default False

Whether to perform the operation in place on the data

axis : alignment axis if needed, default None

level : alignment level if needed, default None

errors : str, {‘raise’, ‘ignore’}, default ‘raise’

  • raise : allow exceptions to be raised
  • ignore : suppress exceptions. On error return original object

Note that currently this parameter won’t affect the results and will always coerce to a suitable dtype.

try_cast : boolean, default False

Try to cast the result back to the input type (if possible).

raise_on_error : boolean, default True

Whether to raise on invalid data types (e.g. trying to where on strings)

Deprecated since version 0.21.0.

Returns:

wh : same type as caller

See also

DataFrame.mask()

Notes

The where method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is True the element is used; otherwise the corresponding element from the DataFrame other is used.

The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).

For further details and examples see the where documentation in indexing.

Examples

>>> s = pd.Series(range(5))  
>>> s.where(s > 0)  
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
>>> s.mask(s > 0)  
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
>>> s.where(s > 1, 10)  
0    10.0
1    10.0
2    2.0
3    3.0
4    4.0
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])  
>>> m = df % 3 == 0  
>>> df.where(m, -df)  
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> df.where(m, -df) == np.where(m, df, -df)  
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
>>> df.where(m, -df) == df.mask(~m, -df)  
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True

Series Methods

class dask.dataframe.Series(dsk, name, meta, divisions)

Parallel Pandas Series

Do not use this class directly. Instead use functions like dd.read_csv, dd.read_parquet, or dd.from_pandas.

Parameters:

dsk: dict

The dask graph to compute this Series

_name: str

The key prefix that specifies which keys in the dask comprise this particular Series

meta: pandas.Series

An empty pandas.Series with names, dtypes, and index matching the expected output.

divisions: tuple of index values

Values along which we partition our blocks on the index
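
As an illustrative sketch (not part of the original docstring), a dask Series is normally created from an existing pandas object or a file reader rather than by calling this class directly:

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> s = pd.Series(range(8), name='x')
>>> ds = dd.from_pandas(s, npartitions=4)   # a dask Series split into 4 partitions
>>> ds.npartitions
4
>>> ds.sum().compute()
28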

abs()

Return an object with the absolute value of each element; only applicable to objects that are all numeric.

Returns:

abs : type of caller
add(other, level=None, fill_value=None, axis=0)

Addition of series and other, element-wise (binary operator add).

Equivalent to series + other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.radd

align(other, join='outer', axis=None, fill_value=None)

Align two objects on their axes with the specified join method for each axis Index

Parameters:

other : DataFrame or Series

join : {‘outer’, ‘inner’, ‘left’, ‘right’}, default ‘outer’

axis : allowed axis of the other object, default None

Align on index (0), columns (1), or both (None)

level : int or level name, default None

Broadcast across a level, matching Index values on the passed MultiIndex level

copy : boolean, default True

Always returns new objects. If copy=False and no reindexing is required then original objects are returned.

fill_value : scalar, default np.NaN

Value to use for missing values. Defaults to NaN, but can be any “compatible” value

method : str, default None

limit : int, default None

fill_axis : {0, ‘index’}, default 0

Filling axis, method and limit

broadcast_axis : {0, ‘index’}, default None

Broadcast values along this axis, if aligning two objects of different dimensions

New in version 0.17.0.

Returns:

(left, right) : (Series, type of other)

Aligned objects

Notes

Dask doesn’t support the following argument(s).

  • level
  • copy
  • method
  • limit
  • fill_axis
  • broadcast_axis
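
A minimal sketch (not from the original docstring), with pandas as pd and dask.dataframe as dd imported; the names a and b are arbitrary:

>>> a = dd.from_pandas(pd.Series([1, 2, 3], index=[1, 2, 3]), npartitions=1)
>>> b = dd.from_pandas(pd.Series([10, 20], index=[2, 3]), npartitions=1)
>>> left, right = a.align(b, join='outer', fill_value=0)
>>> left.compute()   # indexed 1, 2, 3
>>> right.compute()  # index 1 is missing in b, so it is filled with 0
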
all(axis=None, skipna=True, split_every=False)

Return whether all elements are True over requested axis

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

bool_only : boolean, default None

Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.

Returns:

all : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • bool_only
  • level
any(axis=None, skipna=True, split_every=False)

Return whether any element is True over requested axis

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

bool_only : boolean, default None

Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.

Returns:

any : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • bool_only
  • level
append(other)

Concatenate two or more Series.

Parameters:

to_append : Series or list/tuple of Series

ignore_index : boolean, default False

If True, do not use the index labels.

verify_integrity : boolean, default False

If True, raise Exception on creating index with duplicates

Returns:

appended : Series

See also

pandas.concat
General function to concatenate DataFrame, Series or Panel objects

Notes

Iteratively appending to a Series can be more computationally intensive than a single concatenate. A better solution is to append values to a list and then concatenate the list with the original Series all at once.

Examples

>>> s1 = pd.Series([1, 2, 3])  
>>> s2 = pd.Series([4, 5, 6])  
>>> s3 = pd.Series([4, 5, 6], index=[3,4,5])  
>>> s1.append(s2)  
0    1
1    2
2    3
0    4
1    5
2    6
dtype: int64
>>> s1.append(s3)  
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64

With ignore_index set to True:

>>> s1.append(s2, ignore_index=True)  
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64

With verify_integrity set to True:

>>> s1.append(s2, verify_integrity=True)  
Traceback (most recent call last):
...
ValueError: Indexes have overlapping values: [0, 1, 2]
apply(func, convert_dtype=True, meta='__no_default__', args=(), **kwds)

Parallel version of pandas.Series.apply

Parameters:

func : function

Function to apply

convert_dtype : boolean, default True

Try to find better dtype for elementwise function results. If False, leave as dtype=object.

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

args : tuple

Positional arguments to pass to function in addition to the value.

Additional keyword arguments will be passed as keywords to the function.

Returns:

applied : Series or DataFrame if func returns a Series.

See also

dask.Series.map_partitions

Examples

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> s = pd.Series(range(5), name='x')
>>> ds = dd.from_pandas(s, npartitions=2)

Apply a function elementwise across the Series, passing in extra arguments in args and kwargs:

>>> def myadd(x, a, b=1):
...     return x + a + b
>>> res = ds.apply(myadd, args=(2,), b=1.5)

By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the meta keyword. This can be specified in many forms, for more information see dask.dataframe.utils.make_meta.

Here we specify the output is a Series with name 'x', and dtype float64:

>>> res = ds.apply(myadd, args=(2,), b=1.5, meta=('x', 'f8'))

In the case where the metadata doesn’t change, you can also pass in the object itself directly:

>>> res = ds.apply(lambda x: x + 1, meta=ds)
astype(dtype)

Cast a pandas object to a specified dtype dtype.

Parameters:

dtype : data type, or dict of column name -> data type

Use a numpy.dtype or Python type to cast entire pandas object to the same type. Alternatively, use {col: dtype, ...}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.

copy : bool, default True.

Return a copy when copy=True (be very careful setting copy=False as changes to values then may propagate to other pandas objects).

errors : {‘raise’, ‘ignore’}, default ‘raise’.

Control raising of exceptions on invalid data for provided dtype.

  • raise : allow exceptions to be raised
  • ignore : suppress exceptions. On error return original object

New in version 0.20.0.

raise_on_error : raise on invalid input

Deprecated since version 0.20.0: Use errors instead

kwargs : keyword arguments to pass on to the constructor

Returns:

casted : type of caller

See also

pandas.to_datetime
Convert argument to datetime.
pandas.to_timedelta
Convert argument to timedelta.
pandas.to_numeric
Convert argument to a numeric type.
numpy.ndarray.astype
Cast a numpy array to a specified type.

Examples

>>> ser = pd.Series([1, 2], dtype='int32')  
>>> ser  
0    1
1    2
dtype: int32
>>> ser.astype('int64')  
0    1
1    2
dtype: int64

Convert to categorical type:

>>> ser.astype('category')  
0    1
1    2
dtype: category
Categories (2, int64): [1, 2]

Convert to ordered categorical type with custom ordering:

>>> ser.astype('category', ordered=True, categories=[2, 1])  
0    1
1    2
dtype: category
Categories (2, int64): [2 < 1]

Note that using copy=False and changing data on a new pandas object may propagate changes:

>>> s1 = pd.Series([1,2])  
>>> s2 = s1.astype('int', copy=False)  
>>> s2[0] = 10  
>>> s1  # note that s1[0] has changed too  
0    10
1     2
dtype: int64
autocorr(lag=1, split_every=False)

Lag-N autocorrelation

Parameters:

lag : int, default 1

Number of lags to apply before performing autocorrelation.

Returns:

autocorr : float
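
A small sketch (not part of the original docstring), with pandas as pd and dask.dataframe as dd imported:

>>> ds = dd.from_pandas(pd.Series([1.0, 2.0, 3.0, 4.0, 5.0]), npartitions=2)
>>> ds.autocorr(lag=1).compute()   # 1.0 for this perfectly linear series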

between(left, right, inclusive=True)

Return boolean Series equivalent to left <= series <= right. NA values will be treated as False

Parameters:

left : scalar

Left boundary

right : scalar

Right boundary

Returns:

is_between : Series
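
As a hedged sketch (not from the original docstring), with pandas as pd and dask.dataframe as dd imported:

>>> ds = dd.from_pandas(pd.Series([1, 2, 3, 4, 5]), npartitions=2)
>>> ds.between(2, 4).compute()       # boolean Series: True where 2 <= value <= 4
>>> ds[ds.between(2, 4)].compute()   # use the mask to filter the series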

bfill(axis=None, limit=None)

Synonym for DataFrame.fillna(method='bfill')

Notes

Dask doesn’t support the following argument(s).

  • inplace
  • downcast
clear_divisions()

Forget division information

clip(lower=None, upper=None, out=None)

Trim values at input threshold(s).

Parameters:

lower : float or array_like, default None

upper : float or array_like, default None

axis : int or string axis name, optional

Align object with lower and upper along the given axis.

inplace : boolean, default False

Whether to perform the operation in place on the data

New in version 0.21.0.

Returns:

clipped : Series

Notes

Dask doesn’t support the following argument(s).

  • axis
  • inplace

Examples

>>> df  
          0         1
0  0.335232 -1.256177
1 -1.367855  0.746646
2  0.027753 -1.176076
3  0.230930 -0.679613
4  1.261967  0.570967
>>> df.clip(-1.0, 0.5)  
          0         1
0  0.335232 -1.000000
1 -1.000000  0.500000
2  0.027753 -1.000000
3  0.230930 -0.679613
4  0.500000  0.500000
>>> t  
0   -0.3
1   -0.2
2   -0.1
3    0.0
4    0.1
dtype: float64
>>> df.clip(t, t + 1, axis=0)  
          0         1
0  0.335232 -0.300000
1 -0.200000  0.746646
2  0.027753 -0.100000
3  0.230930  0.000000
4  1.100000  0.570967
clip_lower(threshold)

Return copy of the input with values below given value(s) truncated.

Parameters:

threshold : float or array_like

axis : int or string axis name, optional

Align object with threshold along the given axis.

inplace : boolean, default False

Whether to perform the operation in place on the data

New in version 0.21.0.

Returns:

clipped : same type as input

See also

clip

Notes

Dask doesn’t support the following argument(s).

  • axis
  • inplace
clip_upper(threshold)

Return copy of input with values above given value(s) truncated.

Parameters:

threshold : float or array_like

axis : int or string axis name, optional

Align object with threshold along the given axis.

inplace : boolean, default False

Whether to perform the operation in place on the data

New in version 0.21.0.

Returns:

clipped : same type as input

See also

clip

Notes

Dask doesn’t support the following argument(s).

  • axis
  • inplace
combine(other, func, fill_value=None)

Perform elementwise binary operation on two Series using given function with optional fill value when an index is missing from one Series or the other

Parameters:

other : Series or scalar value

func : function

fill_value : scalar value

Returns:

result : Series
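
A minimal sketch (not part of the original docstring), with pandas as pd and dask.dataframe as dd imported; any two-argument scalar function can be passed as func:

>>> s1 = dd.from_pandas(pd.Series([1, 5, 2]), npartitions=1)
>>> s2 = dd.from_pandas(pd.Series([3, 2, 4]), npartitions=1)
>>> s1.combine(s2, max).compute()   # elementwise maximum: 3, 5, 4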

combine_first(other)

Combine Series values, choosing the calling Series’s values first. Result index will be the union of the two indexes

Parameters:

other : Series

Returns:

y : Series
compute(**kwargs)

Compute this dask collection

This turns a lazy Dask collection into its in-memory equivalent. For example a Dask.array turns into a NumPy array and a Dask.dataframe turns into a Pandas dataframe. The entire dataset must fit into memory before calling this operation.

Parameters:

get : callable, optional

A scheduler get function to use. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.

optimize_graph : bool, optional

If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.

kwargs

Extra keywords to forward to the scheduler get function.

See also

dask.base.compute
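
As an illustrative sketch (not from the original docstring), assuming a Dask DataFrame ddf with a numeric column x; remember that the computed result must fit in memory:

>>> total = ddf.x.sum()   # lazy: builds a task graph, nothing runs yet
>>> total.compute()       # runs the graph and returns a concrete number
>>> pdf = ddf.compute()   # materialize the whole collection as a pandas DataFrame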

copy()

Make a copy of the dataframe

This is strictly a shallow copy of the underlying computational graph. It does not affect the underlying data

corr(other, method='pearson', min_periods=None, split_every=False)

Compute correlation with other Series, excluding missing values

Parameters:

other : Series

method : {‘pearson’, ‘kendall’, ‘spearman’}

  • pearson : standard correlation coefficient
  • kendall : Kendall Tau correlation coefficient
  • spearman : Spearman rank correlation

min_periods : int, optional

Minimum number of observations needed to have a valid result

Returns:

correlation : float
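
A brief sketch (not part of the original docstring), with pandas as pd and dask.dataframe as dd imported:

>>> a = dd.from_pandas(pd.Series([1.0, 2.0, 3.0, 4.0]), npartitions=2)
>>> b = dd.from_pandas(pd.Series([2.0, 4.0, 6.0, 8.0]), npartitions=2)
>>> a.corr(b).compute()   # Pearson correlation; exactly 1.0 for this linear pair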

count(split_every=False)

Return number of non-NA/null observations in the Series

Parameters:

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a smaller Series

Returns:

nobs : int or Series (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
cov(other, min_periods=None, split_every=False)

Compute covariance with Series, excluding missing values

Parameters:

other : Series

min_periods : int, optional

Minimum number of observations needed to have a valid result

Returns:

covariance : float

Normalized by N-1 (unbiased estimator).

cummax(axis=None, skipna=True)

Return cumulative max over requested axis.

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

Returns:

cummax : Series

See also

pandas.core.window.Expanding.max
Similar functionality but ignores NaN values.
cummin(axis=None, skipna=True)

Return cumulative minimum over requested axis.

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

Returns:

cummin : Series

See also

pandas.core.window.Expanding.min
Similar functionality but ignores NaN values.
cumprod(axis=None, skipna=True)

Return cumulative product over requested axis.

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

Returns:

cumprod : Series

See also

pandas.core.window.Expanding.prod
Similar functionality but ignores NaN values.
cumsum(axis=None, skipna=True)

Return cumulative sum over requested axis.

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

Returns:

cumsum : Series

See also

pandas.core.window.Expanding.sum
Similar functionality but ignores NaN values.
describe(split_every=False)

Generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.

Parameters:

percentiles : list-like of numbers, optional

The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.

include : ‘all’, list-like of dtypes or None (default), optional

A white list of data types to include in the result. Ignored for Series. Here are the options:

  • ‘all’ : All columns of the input will be included in the output.
  • A list-like of dtypes : Limits the results to the provided data types. To limit the result to numeric types submit numpy.number. To limit it instead to object columns submit the numpy.object data type. Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'
  • None (default) : The result will include all numeric columns.

exclude : list-like of dtypes or None (default), optional,

A black list of data types to omit from the result. Ignored for Series. Here are the options:

  • A list-like of dtypes : Excludes the provided data types from the result. To exclude numeric types submit numpy.number. To exclude object columns submit the data type numpy.object. Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To exclude pandas categorical columns, use 'category'
  • None (default) : The result will exclude nothing.
Returns:

summary: Series/DataFrame of summary statistics

Notes

For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.

For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.

If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.

For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the dataframe consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.

The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.

Examples

Describing a numeric Series.

>>> s = pd.Series([1, 2, 3])  
>>> s.describe()  
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0

Describing a categorical Series.

>>> s = pd.Series(['a', 'a', 'b', 'c'])  
>>> s.describe()  
count     4
unique    3
top       a
freq      2
dtype: object

Describing a timestamp Series.

>>> s = pd.Series([  
...   np.datetime64("2000-01-01"),
...   np.datetime64("2010-01-01"),
...   np.datetime64("2010-01-01")
... ])
>>> s.describe()  
count                       3
unique                      2
top       2010-01-01 00:00:00
freq                        2
first     2000-01-01 00:00:00
last      2010-01-01 00:00:00
dtype: object

Describing a DataFrame. By default only numeric fields are returned.

>>> df = pd.DataFrame({ 'object': ['a', 'b', 'c'],  
...                     'numeric': [1, 2, 3],
...                     'categorical': pd.Categorical(['d','e','f'])
...                   })
>>> df.describe()  
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns of a DataFrame regardless of data type.

>>> df.describe(include='all')  
        categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      c
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

Describing a column from a DataFrame by accessing it as an attribute.

>>> df.numeric.describe()  
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64

Including only numeric columns in a DataFrame description.

>>> df.describe(include=[np.number])  
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Including only string columns in a DataFrame description.

>>> df.describe(include=[np.object])  
       object
count       3
unique      3
top         c
freq        1

Including only categorical columns from a DataFrame description.

>>> df.describe(include=['category'])  
       categorical
count            3
unique           3
top              f
freq             1

Excluding numeric columns from a DataFrame description.

>>> df.describe(exclude=[np.number])  
       categorical object
count            3      3
unique           3      3
top              f      c
freq             1      1

Excluding object columns from a DataFrame description.

>>> df.describe(exclude=[np.object])  
        categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0
diff(periods=1, axis=0)

1st discrete difference of object

Parameters:

periods : int, default 1

Periods to shift for forming difference

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

Take difference over rows (0) or columns (1).

Returns:

diffed : DataFrame
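
A minimal sketch (not from the original docstring), with pandas as pd and dask.dataframe as dd imported:

>>> ds = dd.from_pandas(pd.Series([1, 3, 6, 10]), npartitions=2)
>>> ds.diff().compute()            # NaN, 2.0, 3.0, 4.0
>>> ds.diff(periods=2).compute()   # difference with the value two positions back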

div(other, level=None, fill_value=None, axis=0)

Floating division of series and other, element-wise (binary operator truediv).

Equivalent to series / other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.rtruediv

drop_duplicates(split_every=None, split_out=1, **kwargs)

Return DataFrame with duplicate rows removed, optionally only considering certain columns

Parameters:

subset : column label or sequence of labels, optional

Only consider certain columns for identifying duplicates, by default use all of the columns

keep : {‘first’, ‘last’, False}, default ‘first’

  • first : Drop duplicates except for the first occurrence.
  • last : Drop duplicates except for the last occurrence.
  • False : Drop all duplicates.

inplace : boolean, default False

Whether to drop duplicates in place or to return a copy

Returns:

deduplicated : DataFrame

Notes

Dask doesn’t support the following argument(s).

  • subset
  • keep
  • inplace
dropna()

Return Series without null values

Parameters:

inplace : boolean, default False

Do operation in place.

Returns:

valid : Series

Notes

Dask doesn’t support the following argument(s).

  • axis
  • inplace
dtype

Return data type

eq(other, level=None, axis=0)

Equal to of series and other, element-wise (binary operator eq).

Equivalent to series == other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

ffill(axis=None, limit=None)

Synonym for DataFrame.fillna(method='ffill')

Notes

Dask doesn’t support the following argument(s).

  • inplace
  • downcast
fillna(value=None, method=None, limit=None, axis=None)

Fill NA/NaN values using the specified method

Parameters:

value : scalar, dict, Series, or DataFrame

Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). (values not in the dict/Series/DataFrame will not be filled). This value cannot be a list.

method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None

Method to use for filling holes in reindexed Series pad / ffill: propagate last valid observation forward to next valid backfill / bfill: use NEXT valid observation to fill gap

axis : {0 or ‘index’, 1 or ‘columns’}

inplace : boolean, default False

If True, fill in place. Note: this will modify any other views on this object, (e.g. a no-copy slice for a column in a DataFrame).

limit : int, default None

If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.

downcast : dict, default is None

a dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible)

Returns:

filled : DataFrame

See also

reindex, asfreq

Notes

Dask doesn’t support the following argument(s).

  • inplace
  • downcast

Examples

>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],  
...                    [3, 4, np.nan, 1],
...                    [np.nan, np.nan, np.nan, 5],
...                    [np.nan, 3, np.nan, 4]],
...                    columns=list('ABCD'))
>>> df  
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4

Replace all NaN elements with 0s.

>>> df.fillna(0)  
    A   B   C   D
0   0.0 2.0 0.0 0
1   3.0 4.0 0.0 1
2   0.0 0.0 0.0 5
3   0.0 3.0 0.0 4

We can also propagate non-null values forward or backward.

>>> df.fillna(method='ffill')  
    A   B   C   D
0   NaN 2.0 NaN 0
1   3.0 4.0 NaN 1
2   3.0 4.0 NaN 5
3   3.0 3.0 NaN 4

Replace all NaN elements in column ‘A’, ‘B’, ‘C’, and ‘D’, with 0, 1, 2, and 3 respectively.

>>> values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}  
>>> df.fillna(value=values)  
    A   B   C   D
0   0.0 2.0 2.0 0
1   3.0 4.0 2.0 1
2   0.0 1.0 2.0 5
3   0.0 3.0 2.0 4

Only replace the first NaN element.

>>> df.fillna(value=values, limit=1)  
    A   B   C   D
0   0.0 2.0 2.0 0
1   3.0 4.0 NaN 1
2   NaN 1.0 NaN 5
3   NaN 3.0 NaN 4
first(offset)

Convenience method for subsetting initial periods of time series data based on a date offset.

Parameters:

offset : string, DateOffset, dateutil.relativedelta

Returns:

subset : type of caller

Examples

ts.first('10D') -> First 10 days

floordiv(other, level=None, fill_value=None, axis=0)

Integer division of series and other, element-wise (binary operator floordiv).

Equivalent to series // other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.rfloordiv

ge(other, level=None, axis=0)

Greater than or equal to of series and other, element-wise (binary operator ge).

Equivalent to series >= other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

get_partition(n)

Get a dask DataFrame/Series representing the nth partition.

groupby(by=None, **kwargs)

Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.

Parameters:

by : mapping, function, str, or iterable

Used to determine the groups for the groupby. If by is a function, it’s called on each value of the object’s index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series’ values are first aligned; see .align() method). If an ndarray is passed, the values are used as-is to determine the groups. A str or list of strs may be passed to group by the columns in self

axis : int, default 0

level : int, level name, or sequence of such, default None

If the axis is a MultiIndex (hierarchical), group by a particular level or levels

as_index : boolean, default True

For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output

sort : boolean, default True

Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. groupby preserves the order of rows within each group.

group_keys : boolean, default True

When calling apply, add group keys to index to identify pieces

squeeze : boolean, default False

reduce the dimensionality of the return type if possible, otherwise return a consistent type

Returns:

GroupBy object

Notes

Dask doesn’t support the following argument(s).

  • axis
  • level
  • as_index
  • sort
  • group_keys
  • squeeze

Examples

DataFrame results

>>> data.groupby(func, axis=0).mean()  
>>> data.groupby(['col1', 'col2'])['col3'].mean()  

DataFrame with hierarchical index

>>> data.groupby(['col1', 'col2']).mean()  
gt(other, level=None, axis=0)

Greater than of series and other, element-wise (binary operator gt).

Equivalent to series > other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

head(n=5, npartitions=1, compute=True)

First n rows of the dataset

Parameters:

n : int, optional

The number of rows to return. Default is 5.

npartitions : int, optional

Elements are only taken from the first npartitions, with a default of 1. If there are fewer than n rows in the first npartitions a warning will be raised and any found rows returned. Pass -1 to use all partitions.

compute : bool, optional

Whether to compute the result, default is True.
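
As a sketch (assuming ddf is any dask DataFrame), the default call inspects only the first partition, while npartitions=-1 searches every partition for up to n rows:

>>> ddf.head()                    # 5 rows from the first partition  
>>> ddf.head(10, npartitions=-1)  # look across all partitions  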

idxmax(axis=None, skipna=True, split_every=False)

Return index of first occurrence of maximum over requested axis. NA/null values are excluded.

Parameters:

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

Returns:

idxmax : Series

See also

Series.idxmax

Notes

This method is the DataFrame version of ndarray.argmax.

idxmin(axis=None, skipna=True, split_every=False)

Return index of first occurrence of minimum over requested axis. NA/null values are excluded.

Parameters:

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

Returns:

idxmin : Series

See also

Series.idxmin

Notes

This method is the DataFrame version of ndarray.argmin.

index

Return dask Index instance

isin(values)

Return a boolean Series showing whether each element in the Series is exactly contained in the passed sequence of values.

Parameters:

values : set or list-like

The sequence of values to test. Passing in a single string will raise a TypeError. Instead, turn a single string into a list of one element.

New in version 0.18.1: Support for values as a set.

Returns:

isin : Series (bool dtype)

Raises:

TypeError

  • If values is a string

See also

pandas.DataFrame.isin

Examples

>>> s = pd.Series(list('abc'))  
>>> s.isin(['a', 'c', 'e'])  
0     True
1    False
2     True
dtype: bool

Passing a single string as s.isin('a') will raise an error. Use a list of one element instead:

>>> s.isin(['a'])  
0     True
1    False
2    False
dtype: bool
isnull()

Return a boolean same-sized object indicating if the values are NA.

See also

DataFrame.notna
boolean inverse of isna
DataFrame.isnull
alias of isna
isna
top-level isna
iteritems()

Lazily iterate over (index, value) tuples
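
A minimal sketch, assuming s is a dask Series (for example ddf.x): partitions are computed one at a time as the iterator advances, so the whole Series never needs to fit in memory at once.

>>> for idx, value in s.iteritems():  
...     print(idx, value)  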

known_divisions

Whether divisions are already known

last(offset)

Convenience method for subsetting final periods of time series data based on a date offset.

Parameters:

offset : string, DateOffset, dateutil.relativedelta

Returns:

subset : type of caller

Examples

ts.last(‘5M’) -> Last 5 months

le(other, level=None, axis=0)

Less than or equal to of series and other, element-wise (binary operator le).

Equivalent to series <= other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series


loc

Purely label-location based indexer for selection by label.

>>> df.loc["b"]  
>>> df.loc["b":"d"]  
lt(other, level=None, axis=0)

Less than of series and other, element-wise (binary operator lt).

Equivalent to series < other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series


map(arg, na_action=None, meta='__no_default__')

Map values of Series using input correspondence (which can be a dict, Series, or function)

Parameters:

arg : function, dict, or Series

na_action : {None, ‘ignore’}

If ‘ignore’, propagate NA values, without passing them to the mapping function

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Returns:

y : Series

same index as caller

See also

Series.apply
For applying more complex functions on a Series
DataFrame.apply
Apply a function row-/column-wise
DataFrame.applymap
Apply a function elementwise on a whole DataFrame

Notes

When arg is a dictionary, values in Series that are not in the dictionary (as keys) are converted to NaN. However, if the dictionary is a dict subclass that defines __missing__ (i.e. provides a method for default values), then this default is used rather than NaN:

>>> from collections import Counter  
>>> counter = Counter()  
>>> counter['bar'] += 1  
>>> y.map(counter)  
1    0
2    1
3    0
dtype: int64

Examples

Map inputs to outputs (both of type Series)

>>> x = pd.Series([1,2,3], index=['one', 'two', 'three'])  
>>> x  
one      1
two      2
three    3
dtype: int64
>>> y = pd.Series(['foo', 'bar', 'baz'], index=[1,2,3])  
>>> y  
1    foo
2    bar
3    baz
>>> x.map(y)  
one   foo
two   bar
three baz

If arg is a dictionary, return a new Series with values converted according to the dictionary’s mapping:

>>> z = {1: 'A', 2: 'B', 3: 'C'}  
>>> x.map(z)  
one   A
two   B
three C

Use na_action to control whether NA values are affected by the mapping function.

>>> s = pd.Series([1, 2, 3, np.nan])  
>>> s2 = s.map('this is a string {}'.format, na_action=None)  
0    this is a string 1.0
1    this is a string 2.0
2    this is a string 3.0
3    this is a string nan
dtype: object
>>> s3 = s.map('this is a string {}'.format, na_action='ignore')  
0    this is a string 1.0
1    this is a string 2.0
2    this is a string 3.0
3                     NaN
dtype: object
map_overlap(func, before, after, *args, **kwargs)

Apply a function to each partition, sharing rows with adjacent partitions.

This can be useful for implementing windowing functions such as df.rolling(...).mean() or df.diff().

Parameters:

func : function

Function applied to each partition.

before : int

The number of rows to prepend to partition i from the end of partition i - 1.

after : int

The number of rows to append to partition i from the beginning of partition i + 1.

args, kwargs :

Arguments and keywords to pass to the function. The partition will be the first argument, and these will be passed after.

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Notes

Given positive integers before and after, and a function func, map_overlap does the following:

  1. Prepend before rows to each partition i from the end of partition i - 1. The first partition has no rows prepended.
  2. Append after rows to each partition i from the beginning of partition i + 1. The last partition has no rows appended.
  3. Apply func to each partition, passing in any extra args and kwargs if provided.
  4. Trim before rows from the beginning of all but the first partition.
  5. Trim after rows from the end of all but the last partition.

Note that the index and divisions are assumed to remain unchanged.

Examples

Given a DataFrame, Series, or Index, such as:

>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 4, 7, 11],
...                    'y': [1., 2., 3., 4., 5.]})
>>> ddf = dd.from_pandas(df, npartitions=2)

A rolling sum with a trailing moving window of size 2 can be computed by overlapping 2 rows before each partition, and then mapping calls to df.rolling(2).sum():

>>> ddf.compute()
    x    y
0   1  1.0
1   2  2.0
2   4  3.0
3   7  4.0
4  11  5.0
>>> ddf.map_overlap(lambda df: df.rolling(2).sum(), 2, 0).compute()
      x    y
0   NaN  NaN
1   3.0  3.0
2   6.0  5.0
3  11.0  7.0
4  18.0  9.0

The pandas diff method computes a discrete difference shifted by a number of periods (can be positive or negative). This can be implemented by mapping calls to df.diff to each partition after prepending/appending that many rows, depending on sign:

>>> def diff(df, periods=1):
...     before, after = (periods, 0) if periods > 0 else (0, -periods)
...     return df.map_overlap(lambda df, periods=1: df.diff(periods),
...                           before, after, periods=periods)
>>> diff(ddf, 1).compute()
     x    y
0  NaN  NaN
1  1.0  1.0
2  2.0  1.0
3  3.0  1.0
4  4.0  1.0

If you have a DatetimeIndex, you can use a timedelta for time-based windows.

>>> ts = pd.Series(range(10), index=pd.date_range('2017', periods=10))
>>> dts = dd.from_pandas(ts, npartitions=2)
>>> dts.map_overlap(lambda df: df.rolling('2D').sum(),
...                 pd.Timedelta('2D'), 0).compute()
2017-01-01     0.0
2017-01-02     1.0
2017-01-03     3.0
2017-01-04     5.0
2017-01-05     7.0
2017-01-06     9.0
2017-01-07    11.0
2017-01-08    13.0
2017-01-09    15.0
2017-01-10    17.0
dtype: float64

map_partitions(func, *args, **kwargs)

Apply Python function on each DataFrame partition.

Note that the index and divisions are assumed to remain unchanged.

Parameters:

func : function

Function applied to each partition.

args, kwargs :

Arguments and keywords to pass to the function. The partition will be the first argument, and these will be passed after.

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Examples

Given a DataFrame, Series, or Index, such as:

>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
...                    'y': [1., 2., 3., 4., 5.]})
>>> ddf = dd.from_pandas(df, npartitions=2)

One can use map_partitions to apply a function on each partition. Extra arguments and keywords can optionally be provided, and will be passed to the function after the partition.

Here we apply a function with arguments and keywords to a DataFrame, resulting in a Series:

>>> def myadd(df, a, b=1):
...     return df.x + df.y + a + b
>>> res = ddf.map_partitions(myadd, 1, b=2)
>>> res.dtype
dtype('float64')

By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the meta keyword. This can be specified in many forms, for more information see dask.dataframe.utils.make_meta.

Here we specify the output is a Series with no name, and dtype float64:

>>> res = ddf.map_partitions(myadd, 1, b=2, meta=(None, 'f8'))

Here we map a function that takes in a DataFrame, and returns a DataFrame with a new column:

>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y))
>>> res.dtypes
x      int64
y    float64
z    float64
dtype: object

As before, the output metadata can also be specified manually. This time we pass in a dict, as the output is a DataFrame:

>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y),
...                          meta={'x': 'i8', 'y': 'f8', 'z': 'f8'})

In the case where the metadata doesn’t change, you can also pass in the object itself directly:

>>> res = ddf.map_partitions(lambda df: df.head(), meta=df)

Also note that the index and divisions are assumed to remain unchanged. If the function you’re mapping changes the index/divisions, you’ll need to clear them afterwards:

>>> ddf.map_partitions(func).clear_divisions()  
mask(cond, other=nan)

Return an object of same shape as self and whose corresponding entries are from self where cond is False and otherwise are from other.

Parameters:

cond : boolean NDFrame, array-like, or callable

Where cond is False, keep the original value. Where True, replace with corresponding value from other. If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn’t check it).

New in version 0.18.1: A callable can be used as cond.

other : scalar, NDFrame, or callable

Entries where cond is True are replaced with corresponding value from other. If other is callable, it is computed on the NDFrame and should return scalar or NDFrame. The callable must not change input NDFrame (though pandas doesn’t check it).

New in version 0.18.1: A callable can be used as other.

inplace : boolean, default False

Whether to perform the operation in place on the data

axis : alignment axis if needed, default None

level : alignment level if needed, default None

errors : str, {‘raise’, ‘ignore’}, default ‘raise’

  • raise : allow exceptions to be raised
  • ignore : suppress exceptions. On error return original object

Note that currently this parameter won’t affect the results and will always coerce to a suitable dtype.

try_cast : boolean, default False

try to cast the result back to the input type (if possible),

raise_on_error : boolean, default True

Whether to raise on invalid data types (e.g. trying to where on strings)

Deprecated since version 0.21.0.

Returns:

wh : same type as caller

Notes

The mask method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is False the element is used; otherwise the corresponding element from the DataFrame other is used.

The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).

For further details and examples see the mask documentation in indexing.

Examples

>>> s = pd.Series(range(5))  
>>> s.where(s > 0)  
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
>>> s.mask(s > 0)  
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
>>> s.where(s > 1, 10)  
0    10.0
1    10.0
2    2.0
3    3.0
4    4.0
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])  
>>> m = df % 3 == 0  
>>> df.where(m, -df)  
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> df.where(m, -df) == np.where(m, df, -df)  
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
>>> df.where(m, -df) == df.mask(~m, -df)  
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
max(axis=None, skipna=True, split_every=False)

This method returns the maximum of the values in the object. If you want the index of the maximum, use idxmax. This is the equivalent of the numpy.ndarray method argmax.

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA or empty, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

numeric_only : boolean, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns:

max : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
mean(axis=None, skipna=True, split_every=False)

Return the mean of the values for the requested axis

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA or empty, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

numeric_only : boolean, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns:

mean : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
memory_usage(index=True, deep=False)

Memory usage of the Series

Parameters:

index : bool

Specifies whether to include memory usage of Series index

deep : bool

Introspect the data deeply, interrogate object dtypes for system-level memory consumption

Returns:

scalar bytes of memory consumed

See also

numpy.ndarray.nbytes

Notes

Memory usage does not include memory consumed by elements that are not components of the array if deep=False
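
For illustration (a sketch, assuming s is a dask Series such as ddf.x), the result is a lazy scalar, so call compute to get the byte count:

>>> s.memory_usage(deep=True).compute()  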

min(axis=None, skipna=True, split_every=False)

This method returns the minimum of the values in the object. If you want the index of the minimum, use idxmin. This is the equivalent of the numpy.ndarray method argmin.

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA or empty, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

numeric_only : boolean, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns:

min : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
mod(other, level=None, fill_value=None, axis=0)

Modulo of series and other, element-wise (binary operator mod).

Equivalent to series % other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.rmod

mul(other, level=None, fill_value=None, axis=0)

Multiplication of series and other, element-wise (binary operator mul).

Equivalent to series * other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.rmul

nbytes

Number of bytes

ndim

Return dimensionality

ne(other, level=None, axis=0)

Not equal to of series and other, element-wise (binary operator ne).

Equivalent to series != other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series


nlargest(n=5, split_every=None)

Return the largest n elements.

Parameters:

n : int

Return this many descending sorted values

keep : {‘first’, ‘last’, False}, default ‘first’

Where there are duplicate values:

  • first : take the first occurrence.
  • last : take the last occurrence.

Returns:

top_n : Series

The n largest values in the Series, in sorted order

See also

Series.nsmallest

Notes

Faster than .sort_values(ascending=False).head(n) for small n relative to the size of the Series object.

Examples

>>> import pandas as pd  
>>> import numpy as np  
>>> s = pd.Series(np.random.randn(10**6))  
>>> s.nlargest(10)  # only sorts up to the N requested  
219921    4.644710
82124     4.608745
421689    4.564644
425277    4.447014
718691    4.414137
43154     4.403520
283187    4.313922
595519    4.273635
503969    4.250236
121637    4.240952
dtype: float64
notnull()

Return a boolean same-sized object indicating if the values are not NA.

See also

DataFrame.isna
boolean inverse of notna
DataFrame.notnull
alias of notna
notna
top-level notna
npartitions

Return number of partitions

nsmallest(n=5, split_every=None)

Return the smallest n elements.

Parameters:

n : int

Return this many ascending sorted values

keep : {‘first’, ‘last’, False}, default ‘first’

Where there are duplicate values:

  • first : take the first occurrence.
  • last : take the last occurrence.

Returns:

bottom_n : Series

The n smallest values in the Series, in sorted order

See also

Series.nlargest

Notes

Faster than .sort_values().head(n) for small n relative to the size of the Series object.

Examples

>>> import pandas as pd  
>>> import numpy as np  
>>> s = pd.Series(np.random.randn(10**6))  
>>> s.nsmallest(10)  # only sorts up to the N requested  
288532   -4.954580
732345   -4.835960
64803    -4.812550
446457   -4.609998
501225   -4.483945
669476   -4.472935
973615   -4.401699
621279   -4.355126
773916   -4.347355
359919   -4.331927
dtype: float64
nunique(split_every=None)

Return number of unique elements in the object.

Excludes NA values by default.

Parameters:

dropna : boolean, default True

Don’t include NaN in the count.

Returns:

nunique : int

Notes

Dask doesn’t support the following argument(s).

  • dropna
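
For illustration (a sketch, assuming s is a dask Series such as ddf.x):

>>> s.nunique().compute()  
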
nunique_approx(split_every=None)

Approximate number of unique rows.

This method uses the HyperLogLog algorithm for cardinality estimation to compute the approximate number of unique rows. The approximate error is 0.406%.

Parameters:

split_every : int, optional

Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used. Default is 8.

Returns:

a float representing the approximate number of elements
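
As a sketch (assuming ddf is a large dask DataFrame), the approximate count is usually much cheaper than an exact nunique because only small HyperLogLog sketches are combined across partitions:

>>> ddf.nunique_approx().compute()  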

persist(**kwargs)

Persist this dask collection into memory

This turns a lazy Dask collection into a Dask collection with the same metadata, but now with the results fully computed or actively computing in the background.

Parameters:

get : callable, optional

A scheduler get function to use. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.

optimize_graph : bool, optional

If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.

**kwargs

Extra keywords to forward to the scheduler get function.

Returns:

New dask collections backed by in-memory data

See also

dask.base.persist
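
A minimal sketch (assuming ddf is a dask DataFrame): rebinding the name to the persisted collection lets later operations work off the in-memory (or actively computing) results rather than re-reading the original source.

>>> ddf = ddf.persist()  
>>> ddf.x.sum().compute()  # reuses the persisted partitions  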

pipe(func, *args, **kwargs)

Apply func(self, *args, **kwargs)

Parameters:

func : function

function to apply to the NDFrame. args and kwargs are passed into func. Alternatively a (callable, data_keyword) tuple where data_keyword is a string indicating the keyword of callable that expects the NDFrame.

args : iterable, optional

positional arguments passed into func.

kwargs : mapping, optional

a dictionary of keyword arguments passed into func.

Returns:

object : the return type of func.

See also

pandas.DataFrame.apply, pandas.DataFrame.applymap, pandas.Series.map

Notes

Use .pipe when chaining together functions that expect Series, DataFrames or GroupBy objects. Instead of writing

>>> f(g(h(df), arg1=a), arg2=b, arg3=c)  

You can write

>>> (df.pipe(h)  
...    .pipe(g, arg1=a)
...    .pipe(f, arg2=b, arg3=c)
... )

If you have a function that takes the data as (say) the second argument, pass a tuple indicating which keyword expects the data. For example, suppose f takes its data as arg2:

>>> (df.pipe(h)  
...    .pipe(g, arg1=a)
...    .pipe((f, 'arg2'), arg1=a, arg3=c)
...  )
pow(other, level=None, fill_value=None, axis=0)

Exponential power of series and other, element-wise (binary operator pow).

Equivalent to series ** other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.rpow

prod(axis=None, skipna=True, split_every=False)

Return the product of the values for the requested axis

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA or empty, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

numeric_only : boolean, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns:

prod : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
quantile(q=0.5)

Approximate quantiles of Series

Parameters:

q : list/array of floats, default 0.5 (50%)

Iterable of numbers ranging from 0 to 1 for the desired quantiles
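
For illustration (a sketch, assuming s is a dask Series such as ddf.x), a list of quantiles returns a small Series indexed by q:

>>> s.quantile([0.25, 0.5, 0.75]).compute()  
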
radd(other, level=None, fill_value=None, axis=0)

Addition of series and other, element-wise (binary operator radd).

Equivalent to other + series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.add

random_split(frac, random_state=None)

Pseudorandomly split dataframe into different pieces row-wise

Parameters:

frac : list

List of floats that should sum to one.

random_state: int or np.random.RandomState

If int create a new RandomState with this as the seed

Otherwise draw from the passed RandomState

See also

dask.DataFrame.sample

Examples

50/50 split

>>> a, b = df.random_split([0.5, 0.5])  

80/10/10 split, consistent random_state

>>> a, b, c = df.random_split([0.8, 0.1, 0.1], random_state=123)  
rdiv(other, level=None, fill_value=None, axis=0)

Floating division of series and other, element-wise (binary operator rtruediv).

Equivalent to other / series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.truediv

reduction(chunk, aggregate=None, combine=None, meta='__no_default__', token=None, split_every=None, chunk_kwargs=None, aggregate_kwargs=None, combine_kwargs=None, **kwargs)

Generic row-wise reductions.

Parameters:

chunk : callable

Function to operate on each partition. Should return a pandas.DataFrame, pandas.Series, or a scalar.

aggregate : callable, optional

Function to operate on the concatenated result of chunk. If not specified, defaults to chunk. Used to do the final aggregation in a tree reduction.

The input to aggregate depends on the output of chunk. If the output of chunk is a:

  • scalar: Input is a Series, with one row per partition.
  • Series: Input is a DataFrame, with one row per partition. Columns are the rows in the output series.
  • DataFrame: Input is a DataFrame, with one row per partition. Columns are the columns in the output dataframes.

Should return a pandas.DataFrame, pandas.Series, or a scalar.

combine : callable, optional

Function to operate on intermediate concatenated results of chunk in a tree-reduction. If not provided, defaults to aggregate. The input/output requirements should match that of aggregate described above.

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

token : str, optional

The name to use for the output keys.

split_every : int, optional

Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used, and all intermediates will be concatenated and passed to aggregate. Default is 8.

chunk_kwargs : dict, optional

Keyword arguments to pass on to chunk only.

aggregate_kwargs : dict, optional

Keyword arguments to pass on to aggregate only.

combine_kwargs : dict, optional

Keyword arguments to pass on to combine only.

kwargs :

All remaining keywords will be passed to chunk, combine, and aggregate.

Examples

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': range(50), 'y': range(50, 100)})
>>> ddf = dd.from_pandas(df, npartitions=4)

Count the number of rows in a DataFrame. To do this, count the number of rows in each partition, then sum the results:

>>> res = ddf.reduction(lambda x: x.count(),
...                     aggregate=lambda x: x.sum())
>>> res.compute()
x    50
y    50
dtype: int64

Count the number of rows in a Series with elements greater than or equal to a value (provided via a keyword).

>>> def count_greater(x, value=0):
...     return (x >= value).sum()
>>> res = ddf.x.reduction(count_greater, aggregate=lambda x: x.sum(),
...                       chunk_kwargs={'value': 25})
>>> res.compute()
25

Aggregate both the sum and count of a Series at the same time:

>>> def sum_and_count(x):
...     return pd.Series({'sum': x.sum(), 'count': x.count()})
>>> res = ddf.x.reduction(sum_and_count, aggregate=lambda x: x.sum())
>>> res.compute()
count      50
sum      1225
dtype: int64

Doing the same, but for a DataFrame. Here chunk returns a DataFrame, meaning the input to aggregate is a DataFrame with an index with non-unique entries for both ‘x’ and ‘y’. We groupby the index, and sum each group to get the final result.

>>> def sum_and_count(x):
...     return pd.DataFrame({'sum': x.sum(), 'count': x.count()})
>>> res = ddf.reduction(sum_and_count,
...                     aggregate=lambda x: x.groupby(level=0).sum())
>>> res.compute()
   count   sum
x     50  1225
y     50  3725
repartition(divisions=None, npartitions=None, freq=None, force=False)

Repartition dataframe along new divisions

Parameters:

divisions : list, optional

List of partitions to be used. If specified npartitions will be ignored.

npartitions : int, optional

Number of partitions of output, must be less than npartitions of input. Only used if divisions isn’t specified.

freq : str, pd.Timedelta

A period on which to partition timeseries data like '7D' or '12h' or pd.Timedelta(hours=12). Assumes a datetime index.

force : bool, default False

Allows the expansion of the existing divisions. If False then the new divisions lower and upper bounds must be the same as the old divisions.

Examples

>>> df = df.repartition(npartitions=10)  
>>> df = df.repartition(divisions=[0, 5, 10, 20])  
>>> df = df.repartition(freq='7d')  
resample(rule, how=None, closed=None, label=None)

Convenience method for frequency conversion and resampling of time series. Object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or pass datetime-like values to the on or level keyword.

Parameters:

rule : string

the offset string or object representing target conversion

axis : int, optional, default 0

closed : {‘right’, ‘left’}

Which side of bin interval is closed. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.

label : {‘right’, ‘left’}

Which bin edge label to label bucket with. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.

convention : {‘start’, ‘end’, ‘s’, ‘e’}

For PeriodIndex only, controls whether to use the start or end of rule

loffset : timedelta

Adjust the resampled time labels

base : int, default 0

For frequencies that evenly subdivide 1 day, the “origin” of the aggregated intervals. For example, for ‘5min’ frequency, base could range from 0 through 4. Defaults to 0

on : string, optional

For a DataFrame, column to use instead of index for resampling. Column must be datetime-like.

New in version 0.19.0.

level : string or int, optional

For a MultiIndex, level (name or number) to use for resampling. Level must be datetime-like.

New in version 0.19.0.

Notes

To learn more about the offset strings, please see the pandas documentation on offset aliases.

Examples

Start by creating a series with 9 one minute timestamps.

>>> index = pd.date_range('1/1/2000', periods=9, freq='T')  
>>> series = pd.Series(range(9), index=index)  
>>> series  
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64

Downsample the series into 3 minute bins and sum the values of the timestamps falling into a bin.

>>> series.resample('3T').sum()  
2000-01-01 00:00:00     3
2000-01-01 00:03:00    12
2000-01-01 00:06:00    21
Freq: 3T, dtype: int64

Downsample the series into 3 minute bins as above, but label each bin using the right edge instead of the left. Please note that the value in the bucket used as the label is not included in the bucket, which it labels. For example, in the original series the bucket 2000-01-01 00:03:00 contains the value 3, but the summed value in the resampled bucket with the label 2000-01-01 00:03:00 does not include 3 (if it did, the summed value would be 6, not 3). To include this value close the right side of the bin interval as illustrated in the example below this one.

>>> series.resample('3T', label='right').sum()  
2000-01-01 00:03:00     3
2000-01-01 00:06:00    12
2000-01-01 00:09:00    21
Freq: 3T, dtype: int64

Downsample the series into 3 minute bins as above, but close the right side of the bin interval.

>>> series.resample('3T', label='right', closed='right').sum()  
2000-01-01 00:00:00     0
2000-01-01 00:03:00     6
2000-01-01 00:06:00    15
2000-01-01 00:09:00    15
Freq: 3T, dtype: int64

Upsample the series into 30 second bins.

>>> series.resample('30S').asfreq()[0:5] #select first 5 rows  
2000-01-01 00:00:00   0.0
2000-01-01 00:00:30   NaN
2000-01-01 00:01:00   1.0
2000-01-01 00:01:30   NaN
2000-01-01 00:02:00   2.0
Freq: 30S, dtype: float64

Upsample the series into 30 second bins and fill the NaN values using the pad method.

>>> series.resample('30S').pad()[0:5]  
2000-01-01 00:00:00    0
2000-01-01 00:00:30    0
2000-01-01 00:01:00    1
2000-01-01 00:01:30    1
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64

Upsample the series into 30 second bins and fill the NaN values using the bfill method.

>>> series.resample('30S').bfill()[0:5]  
2000-01-01 00:00:00    0
2000-01-01 00:00:30    1
2000-01-01 00:01:00    1
2000-01-01 00:01:30    2
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64

Pass a custom function via apply

>>> def custom_resampler(array_like):  
...     return np.sum(array_like)+5
>>> series.resample('3T').apply(custom_resampler)  
2000-01-01 00:00:00     8
2000-01-01 00:03:00    17
2000-01-01 00:06:00    26
Freq: 3T, dtype: int64

For a Series with a PeriodIndex, the keyword convention can be used to control whether to use the start or end of rule.

>>> s = pd.Series([1, 2], index=pd.period_range('2012-01-01',  
...                                             freq='A',
...                                             periods=2))
>>> s  
2012    1
2013    2
Freq: A-DEC, dtype: int64

Resample by month using ‘start’ convention. Values are assigned to the first month of the period.

>>> s.resample('M', convention='start').asfreq().head()  
2012-01    1.0
2012-02    NaN
2012-03    NaN
2012-04    NaN
2012-05    NaN
Freq: M, dtype: float64

Resample by month using ‘end’ convention. Values are assigned to the last month of the period.

>>> s.resample('M', convention='end').asfreq()  
2012-12    1.0
2013-01    NaN
2013-02    NaN
2013-03    NaN
2013-04    NaN
2013-05    NaN
2013-06    NaN
2013-07    NaN
2013-08    NaN
2013-09    NaN
2013-10    NaN
2013-11    NaN
2013-12    2.0
Freq: M, dtype: float64

For DataFrame objects, the keyword on can be used to specify the column instead of the index for resampling.

>>> df = pd.DataFrame(data=9*[range(4)], columns=['a', 'b', 'c', 'd'])  
>>> df['time'] = pd.date_range('1/1/2000', periods=9, freq='T')  
>>> df.resample('3T', on='time').sum()  
                     a  b  c  d
time
2000-01-01 00:00:00  0  3  6  9
2000-01-01 00:03:00  0  3  6  9
2000-01-01 00:06:00  0  3  6  9

For a DataFrame with MultiIndex, the keyword level can be used to specify on which level the resampling needs to take place.

>>> time = pd.date_range('1/1/2000', periods=5, freq='T')  
>>> df2 = pd.DataFrame(data=10*[range(4)],  
...                    columns=['a', 'b', 'c', 'd'],
...                    index=pd.MultiIndex.from_product([time, [1, 2]]))
>>> df2.resample('3T', level=0).sum()  
                     a  b   c   d
2000-01-01 00:00:00  0  6  12  18
2000-01-01 00:03:00  0  4   8  12
reset_index(drop=False)

Reset the index to the default index.

Note that unlike in pandas, the reset dask.dataframe index will not be monotonically increasing from 0. Instead, it will restart at 0 for each partition (e.g. index1 = [0, ..., 10], index2 = [0, ...]). This is due to the inability to statically know the full length of the index.

For DataFrame with multi-level index, returns a new DataFrame with labeling information in the columns under the index names, defaulting to ‘level_0’, ‘level_1’, etc. if any are None. For a standard index, the index name will be used (if set), otherwise a default ‘index’ or ‘level_0’ (if ‘index’ is already taken) will be used.

Parameters:

drop : boolean, default False

Do not try to insert index into dataframe columns.
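
As a sketch (assuming ddf is any dask DataFrame); note that, per the caveat above, the new default index restarts at 0 within each partition:

>>> ddf2 = ddf.reset_index(drop=True)  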

rfloordiv(other, level=None, fill_value=None, axis=0)

Integer division of series and other, element-wise (binary operator rfloordiv).

Equivalent to other // series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.floordiv

rmod(other, level=None, fill_value=None, axis=0)

Modulo of series and other, element-wise (binary operator rmod).

Equivalent to other % series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.mod

rmul(other, level=None, fill_value=None, axis=0)

Multiplication of series and other, element-wise (binary operator rmul).

Equivalent to other * series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.mul

rolling(window, min_periods=None, freq=None, center=False, win_type=None, axis=0)

Provides rolling transformations.

Parameters:

window : int, str, offset

Size of the moving window. This is the number of observations used for calculating the statistic. The window size must not be so large as to span more than one adjacent partition. If using an offset or offset alias like ‘5D’, the data must have a DatetimeIndex

Changed in version 0.15.0: Now accepts offsets and string offset aliases

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

center : boolean, default False

Set the labels at the center of the window.

win_type : string, default None

Provide a window type. The recognized window types are identical to pandas.

axis : int, default 0

Returns:

a Rolling object on which to call a method to compute a statistic

Notes

The freq argument is not supported.
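
For illustration (a sketch, assuming ddf is a dask DataFrame with numeric columns and partitions larger than the window), call an aggregation on the returned Rolling object:

>>> ddf.rolling(window=3, min_periods=1).sum()  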

round(decimals=0)

Round each value in a Series to the given number of decimals.

Parameters:

decimals : int

Number of decimal places to round to (default: 0). If decimals is negative, it specifies the number of positions to the left of the decimal point.

Returns:

Series object

See also

numpy.around, DataFrame.round

rpow(other, level=None, fill_value=None, axis=0)

Exponential power of series and other, element-wise (binary operator rpow).

Equivalent to other ** series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.pow

rsub(other, level=None, fill_value=None, axis=0)

Subtraction of series and other, element-wise (binary operator rsub).

Equivalent to other - series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.sub

rtruediv(other, level=None, fill_value=None, axis=0)

Floating division of series and other, element-wise (binary operator rtruediv).

Equivalent to other / series, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.truediv

sample(frac, replace=False, random_state=None)

Random sample of items

Parameters:

frac : float, optional

Fraction of axis items to return.

replace: boolean, optional

Sample with or without replacement. Default = False.

random_state: int or np.random.RandomState

If int, we create a new RandomState with this as the seed. Otherwise we draw from the passed RandomState.

See also

DataFrame.random_split, pandas.DataFrame.sample
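
A minimal sketch (assuming ddf is a dask DataFrame); passing an integer random_state makes the sample reproducible across runs:

>>> sampled = ddf.sample(frac=0.1, random_state=42)  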

sem(axis=None, skipna=None, ddof=1, split_every=False)

Return unbiased standard error of the mean over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

ddof : int, default 1

degrees of freedom

numeric_only : boolean, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns:

sem : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
shift(periods=1, freq=None, axis=0)

Shift index by desired number of periods with an optional time freq

Parameters:

periods : int

Number of periods to move, can be positive or negative

freq : DateOffset, timedelta, or time rule string, optional

Increment to use from the tseries module or time rule (e.g. ‘EOM’). See Notes.

axis : {0 or ‘index’, 1 or ‘columns’}

Returns:

shifted : DataFrame

Notes

If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data.
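
For illustration (a sketch, assuming ddf is a dask DataFrame; the freq form assumes a datetime-like index):

>>> ddf.shift(1)            # shift the data down by one row  
>>> ddf.shift(2, freq='D')  # shift the index forward two days instead  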

size

Size of the series

std(axis=None, skipna=True, ddof=1, split_every=False)

Return sample standard deviation over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

ddof : int, default 1

degrees of freedom

numeric_only : boolean, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns:

std : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
sub(other, level=None, fill_value=None, axis=0)

Subtraction of series and other, element-wise (binary operator sub).

Equivalent to series - other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.rsub

sum(axis=None, skipna=True, split_every=False)

Return the sum of the values for the requested axis

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA or empty, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

numeric_only : boolean, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns:

sum : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
tail(n=5, compute=True)

Last n rows of the dataset

Caveat: this only checks the last n rows of the last partition.

to_bag(index=False)

Create a Dask Bag from a Series
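
For illustration (a sketch, assuming s is a dask Series such as ddf.x), the resulting bag holds the Series values and supports the usual dask.bag operations:

>>> b = s.to_bag()  
>>> b.take(3)  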

to_csv(filename, **kwargs)

Store Dask DataFrame to CSV files

One filename per partition will be created. You can specify the filenames in a variety of ways.

Use a globstring:

>>> df.to_csv('/path/to/data/export-*.csv')  

The * will be replaced by the increasing sequence 0, 1, 2, ...

/path/to/data/export-0.csv
/path/to/data/export-1.csv

Use a globstring and a name_function= keyword argument. The name_function function should expect an integer and produce a string. Strings produced by name_function must preserve the order of their respective partition indices.

>>> from datetime import date, timedelta
>>> def name(i):
...     return str(date(2015, 1, 1) + i * timedelta(days=1))
>>> name(0)
'2015-01-01'
>>> name(15)
'2015-01-16'
>>> df.to_csv('/path/to/data/export-*.csv', name_function=name)  
/path/to/data/export-2015-01-01.csv
/path/to/data/export-2015-01-02.csv
...

You can also provide an explicit list of paths:

>>> paths = ['/path/to/data/alice.csv', '/path/to/data/bob.csv', ...]  
>>> df.to_csv(paths) 
Parameters:

filename : string

Path glob indicating the naming scheme for the output files

name_function : callable, default None

Function accepting an integer (partition index) and producing a string to replace the asterisk in the given filename globstring. Should preserve the lexicographic order of partitions

compression : string or None

String like ‘gzip’ or ‘xz’. Must support efficient random access. Filenames with extensions corresponding to known compression algorithms (gz, bz2) will be compressed accordingly automatically

sep : character, default ‘,’

Field delimiter for the output file

na_rep : string, default ‘’

Missing data representation

float_format : string, default None

Format string for floating point numbers

columns : sequence, optional

Columns to write

header : boolean or list of string, default True

Write out column names. If a list of string is given it is assumed to be aliases for the column names

index : boolean, default True

Write row names (index)

index_label : string or sequence, or False, default None

Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex. If False do not print fields for index names. Use index_label=False for easier importing in R

nanRep : None

deprecated, use na_rep

mode : str

Python write mode, default ‘w’

encoding : string, optional

A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3.

compression : string, optional

a string representing the compression to use in the output file, allowed values are ‘gzip’, ‘bz2’, ‘xz’, only used when the first argument is a filename

line_terminator : string, default ‘\n’

The newline character or character sequence to use in the output file

quoting : optional constant from csv module

defaults to csv.QUOTE_MINIMAL

quotechar : string (length 1), default ‘”’

character used to quote fields

doublequote : boolean, default True

Control quoting of quotechar inside a field

escapechar : string (length 1), default None

character used to escape sep and quotechar when appropriate

chunksize : int or None

rows to write at a time

tupleize_cols : boolean, default False

Write MultiIndex columns as a list of tuples (if True) or in the new, expanded format (if False)

date_format : string, default None

Format string for datetime objects

decimal: string, default ‘.’

Character recognized as decimal separator. E.g. use ‘,’ for European data

storage_options: dict

Parameters passed on to the backend filesystem class.

Returns:

The names of the files written, if they were computed right away.

If not, the delayed tasks associated with writing the files.

to_delayed()

Create Dask Delayed objects from a Dask Dataframe

Returns a list of delayed values, one value per partition.

Examples

>>> partitions = df.to_delayed()  
to_frame(name=None)

Convert Series to DataFrame

Parameters:

name : object, default None

The passed name should substitute for the series name (if it has one).

Returns:

data_frame : DataFrame

to_hdf(path_or_buf, key, mode='a', append=False, get=None, **kwargs)

Store Dask Dataframe to Hierarchical Data Format (HDF) files

This is a parallel version of the Pandas function of the same name. Please see the Pandas docstring for more detailed information about shared keyword arguments.

This function differs from the Pandas version by saving the many partitions of a Dask DataFrame in parallel, either to many files, or to many datasets within the same file. You may specify this parallelism with an asterisk * within the filename or datapath, and an optional name_function. The asterisk will be replaced with an increasing sequence of integers starting from 0 or with the result of calling name_function on each of those integers.

This function only supports the Pandas 'table' format, not the more specialized 'fixed' format.

Parameters:

path: string

Path to a target filename. May contain a * to denote many filenames

key: string

Datapath within the files. May contain a * to denote many locations

name_function: function

A function to convert the * in the above options to a string. Should take in a number from 0 to the number of partitions and return a string. (see examples below)

compute: bool

Whether or not to execute immediately. If False then this returns a dask.Delayed value.

lock: Lock, optional

Lock to use to prevent concurrency issues. By default a threading.Lock, multiprocessing.Lock or SerializableLock will be used depending on your scheduler if a lock is required. See dask.utils.get_scheduler_lock for more information about lock selection.

**other:

See pandas.to_hdf for more information

Returns:

None: if compute == True

delayed value: if compute == False

See also

read_hdf, to_parquet

Examples

Save Data to a single file

>>> df.to_hdf('output.hdf', '/data')            

Save data to multiple datapaths within the same file:

>>> df.to_hdf('output.hdf', '/data-*')          

Save data to multiple files:

>>> df.to_hdf('output-*.hdf', '/data')          

Save data to multiple files, using the multiprocessing scheduler:

>>> df.to_hdf('output-*.hdf', '/data', get=dask.multiprocessing.get) 

Specify custom naming scheme. This writes files as ‘2000-01-01.hdf’, ‘2000-01-02.hdf’, ‘2000-01-03.hdf’, etc..

>>> from datetime import date, timedelta
>>> base = date(year=2000, month=1, day=1)
>>> def name_function(i):
...     ''' Convert integer 0 to n to a string '''
...     return str(base + timedelta(days=i))
>>> df.to_hdf('*.hdf', '/data', name_function=name_function) 
to_parquet(path, *args, **kwargs)

Store Dask.dataframe to Parquet files

Parameters:

df : dask.dataframe.DataFrame

path : string

Destination directory for data. Prepend with protocol like s3:// or hdfs:// for remote data.

engine : {‘auto’, ‘fastparquet’, ‘pyarrow’}, default ‘auto’

Parquet library to use. If only one library is installed, it will use that one; if both, it will use ‘fastparquet’.

compression : string or dict, optional

Either a string like "snappy" or a dictionary mapping column names to compressors like {"name": "gzip", "values": "snappy"}. The default is "default", which uses the default compression for whichever engine is selected.

write_index : boolean, optional

Whether or not to write the index. Defaults to True if divisions are known.

append : bool, optional

If False (default), construct data-set from scratch. If True, add new row-group(s) to an existing data-set. In the latter case, the data-set must exist, and the schema must match the input data.

ignore_divisions : bool, optional

If False (default) raises error when previous divisions overlap with the new appended divisions. Ignored if append=False.

partition_on : list, optional

Construct directory-based partitioning by splitting on these fields’ values. Each dask partition will result in one or more datafiles, there will be no global groupby.

storage_options : dict, optional

Key/value pairs to be passed on to the file-system backend, if any.

compute : bool, optional

If True (default) then the result is computed immediately. If False then a dask.delayed object is returned for future computation.

**kwargs

Extra options to be passed on to the specific backend.

See also

read_parquet
Read parquet data to dask.dataframe

Notes

Each partition will be written to a separate file.

Examples

>>> df = dd.read_csv(...)  
>>> to_parquet('/path/to/output/', df, compression='snappy')  
to_string(max_rows=5)

Render a string representation of the Series

Parameters:

buf : StringIO-like, optional

buffer to write to

na_rep : string, optional

string representation of NAN to use, default ‘NaN’

float_format : one-parameter function, optional

formatter function to apply to columns’ elements if they are floats default None

header: boolean, default True

Add the Series header (index name)

index : bool, optional

Add index (row) labels, default True

length : boolean, default False

Add the Series length

dtype : boolean, default False

Add the Series dtype

name : boolean, default False

Add the Series name if not None

max_rows : int, optional

Maximum number of rows to show before truncating. If None, show all.

Returns:

formatted : string (if not buffer passed)

Notes

Dask doesn’t support the following argument(s).

  • buf
  • na_rep
  • float_format
  • header
  • index
  • length
  • dtype
  • name
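
Examples

A minimal sketch of the supported usage (only max_rows is passed; the series is an illustrative one built with from_pandas):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> s = dd.from_pandas(pd.Series([1.5, 2.5, 3.5, 4.5]), npartitions=2)
>>> print(s.to_string(max_rows=3))  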
to_timestamp(freq=None, how='start', axis=0)

Cast to DatetimeIndex of timestamps, at beginning of period

Parameters:

freq : string, default frequency of PeriodIndex

Desired frequency

how : {‘s’, ‘e’, ‘start’, ‘end’}

Convention for converting period to timestamp; start of period vs. end

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

The axis to convert (the index by default)

copy : boolean, default True

If false then underlying input data is not copied

Returns:

df : DataFrame with DatetimeIndex

Notes

Dask doesn’t support the following argument(s).

  • copy
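
Examples

A minimal sketch, assuming a period-indexed frame built with from_pandas (the column name x is illustrative):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({'x': [1, 2, 3]},
...                    index=pd.period_range('2000-01', periods=3, freq='M'))
>>> ddf = dd.from_pandas(pdf, npartitions=1)
>>> ddf.to_timestamp(how='start').compute()  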
truediv(other, level=None, fill_value=None, axis=0)

Floating division of series and other, element-wise (binary operator truediv).

Equivalent to series / other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:

other : Series or scalar value

fill_value : None or float value, default None (NaN)

Fill missing (NaN) values with this value. If both Series are missing, the result will be missing

level : int or name

Broadcast across a level, matching Index values on the passed MultiIndex level

Returns:

result : Series

See also

Series.rtruediv
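
Examples

A minimal sketch showing fill_value substituting for missing data on one side (the values are illustrative):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> a = dd.from_pandas(pd.Series([1.0, 2.0, None]), npartitions=1)
>>> b = dd.from_pandas(pd.Series([10.0, None, 30.0]), npartitions=1)
>>> a.truediv(b, fill_value=1.0).compute()  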

unique(split_every=None, split_out=1)

Return Series of unique values in the object. Includes NA values.

Returns:

uniques : Series
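
Examples

A minimal sketch (the values are illustrative):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> s = dd.from_pandas(pd.Series(['a', 'a', 'b', None]), npartitions=2)
>>> s.unique().compute()  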
value_counts(split_every=None, split_out=1)

Returns object containing counts of unique values.

The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.

Parameters:

normalize : boolean, default False

If True then the object returned will contain the relative frequencies of the unique values.

sort : boolean, default True

Sort by values

ascending : boolean, default False

Sort in ascending order

bins : integer, optional

Rather than count values, group them into half-open bins, a convenience for pd.cut, only works with numeric data

dropna : boolean, default True

Don’t include counts of NaN.

Returns:

counts : Series

Notes

Dask doesn’t support the following argument(s).

  • normalize
  • sort
  • ascending
  • bins
  • dropna
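
Examples

A minimal sketch using only the supported defaults (the values are illustrative; the unsupported keywords listed above are not passed):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> s = dd.from_pandas(pd.Series(['a', 'a', 'b', 'c', 'a']), npartitions=2)
>>> s.value_counts().compute()  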
values

Return a dask.array of the values of this dataframe

Warning: This creates a dask.array without precise shape information. Operations that depend on shape information, like slicing or reshaping, will not work.
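
Examples

A minimal sketch converting a small frame to a dask array (the column names are illustrative):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]}),
...                      npartitions=2)
>>> arr = ddf.values  
>>> arr.compute()  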

var(axis=None, skipna=True, ddof=1, split_every=False)

Return unbiased variance over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument

Parameters:

axis : {index (0), columns (1)}

skipna : boolean, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA

level : int or level name, default None

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

ddof : int, default 1

degrees of freedom

numeric_only : boolean, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns:

var : Series or DataFrame (if level specified)

Notes

Dask doesn’t support the following argument(s).

  • level
  • numeric_only
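
Examples

A minimal sketch using the supported ddof keyword (the frame is illustrative):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0]}), npartitions=2)
>>> ddf.var().compute()  
>>> ddf.var(ddof=0).compute()  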
visualize(filename='mydask', format=None, optimize_graph=False, **kwargs)

Render the computation of this object’s task graph using graphviz.

Requires graphviz to be installed.

Parameters:

filename : str or None, optional

The name (without an extension) of the file to write to disk. If filename is None, no file will be written, and we communicate with dot using only pipes.

format : {‘png’, ‘pdf’, ‘dot’, ‘svg’, ‘jpeg’, ‘jpg’}, optional

Format in which to write output file. Default is ‘png’.

optimize_graph : bool, optional

If True, the graph is optimized before rendering. Otherwise, the graph is displayed as is. Default is False.

**kwargs

Additional keyword arguments to forward to to_graphviz.

Returns:

result : IPython.display.Image, IPython.display.SVG, or None

See dask.dot.dot_graph for more information.

See also

dask.base.visualize, dask.dot.dot_graph

Notes

For more information on optimization see here:

http://dask.pydata.org/en/latest/optimize.html
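
Examples

A minimal sketch, assuming ddf is any dask DataFrame; graphviz must be installed and the filename is illustrative (the file is written to the current working directory):

>>> ddf.visualize(filename='my-graph', format='svg')  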

where(cond, other=nan)

Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.

Parameters:

cond : boolean NDFrame, array-like, or callable

Where cond is True, keep the original value. Where False, replace with corresponding value from other. If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn’t check it).

New in version 0.18.1: A callable can be used as cond.

other : scalar, NDFrame, or callable

Entries where cond is False are replaced with corresponding value from other. If other is callable, it is computed on the NDFrame and should return scalar or NDFrame. The callable must not change input NDFrame (though pandas doesn’t check it).

New in version 0.18.1: A callable can be used as other.

inplace : boolean, default False

Whether to perform the operation in place on the data

axis : alignment axis if needed, default None

level : alignment level if needed, default None

errors : str, {‘raise’, ‘ignore’}, default ‘raise’

  • raise : allow exceptions to be raised
  • ignore : suppress exceptions. On error return original object

Note that currently this parameter won’t affect the results and will always coerce to a suitable dtype.

try_cast : boolean, default False

try to cast the result back to the input type (if possible),

raise_on_error : boolean, default True

Whether to raise on invalid data types (e.g. trying to where on strings)

Deprecated since version 0.21.0.

Returns:

wh : same type as caller

See also

DataFrame.mask()

Notes

The where method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is True the element is used; otherwise the corresponding element from the DataFrame other is used.

The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).

For further details and examples see the where documentation in indexing.

Examples

>>> s = pd.Series(range(5))  
>>> s.where(s > 0)  
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
>>> s.mask(s > 0)  
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
>>> s.where(s > 1, 10)  
0    10.0
1    10.0
2    2.0
3    3.0
4    4.0
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])  
>>> m = df % 3 == 0  
>>> df.where(m, -df)  
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> df.where(m, -df) == np.where(m, df, -df)  
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
>>> df.where(m, -df) == df.mask(~m, -df)  
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True

DataFrameGroupBy

class dask.dataframe.groupby.DataFrameGroupBy(df, by=None, slice=None)
agg(arg, split_every=None, split_out=1)

Aggregate using callable, string, dict, or list of string/callables

Parameters:

func : callable, string, dictionary, or list of string/callables

Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply. For a DataFrame, can pass a dict, if the keys are DataFrame column names.

Accepted Combinations are:

  • string function name
  • function
  • list of functions
  • dict of column names -> functions (or list of functions)
Returns:

aggregated : DataFrame

See also

pandas.DataFrame.groupby.apply, pandas.DataFrame.groupby.transform, pandas.DataFrame.aggregate

Notes

Numpy functions mean/median/prod/sum/std/var are special cased so the default behavior is applying the function along axis=0 (e.g., np.mean(arr_2d, axis=0)) as opposed to mimicking the default Numpy behavior (e.g., np.mean(arr_2d)).

agg is an alias for aggregate. Use the alias.

Examples

>>> df = pd.DataFrame({'A': [1, 1, 2, 2],  
...                    'B': [1, 2, 3, 4],
...                    'C': np.random.randn(4)})
>>> df  
   A  B         C
0  1  1  0.362838
1  1  2  0.227877
2  2  3  1.267767
3  2  4 -0.562860

The aggregation is for each column.

>>> df.groupby('A').agg('min')  
   B         C
A
1  1  0.227877
2  3 -0.562860

Multiple aggregations

>>> df.groupby('A').agg(['min', 'max'])  
    B             C
  min max       min       max
A
1   1   2  0.227877  0.362838
2   3   4 -0.562860  1.267767

Select a column for aggregation

>>> df.groupby('A').B.agg(['min', 'max'])  
   min  max
A
1    1    2
2    3    4

Different aggregations per column

>>> df.groupby('A').agg({'B': ['min', 'max'], 'C': 'sum'})  
    B             C
  min max       sum
A
1   1   2  0.590716
2   3   4  0.704907
aggregate(arg, split_every=None, split_out=1)

Aggregate using callable, string, dict, or list of string/callables

Parameters:

func : callable, string, dictionary, or list of string/callables

Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply. For a DataFrame, can pass a dict, if the keys are DataFrame column names.

Accepted Combinations are:

  • string function name
  • function
  • list of functions
  • dict of column names -> functions (or list of functions)
Returns:

aggregated : DataFrame

See also

pandas.DataFrame.groupby.apply, pandas.DataFrame.groupby.transform, pandas.DataFrame.aggregate

Notes

Numpy functions mean/median/prod/sum/std/var are special cased so the default behavior is applying the function along axis=0 (e.g., np.mean(arr_2d, axis=0)) as opposed to mimicking the default Numpy behavior (e.g., np.mean(arr_2d)).

agg is an alias for aggregate. Use the alias.

Examples

>>> df = pd.DataFrame({'A': [1, 1, 2, 2],  
...                    'B': [1, 2, 3, 4],
...                    'C': np.random.randn(4)})
>>> df  
   A  B         C
0  1  1  0.362838
1  1  2  0.227877
2  2  3  1.267767
3  2  4 -0.562860

The aggregation is for each column.

>>> df.groupby('A').agg('min')  
   B         C
A
1  1  0.227877
2  3 -0.562860

Multiple aggregations

>>> df.groupby('A').agg(['min', 'max'])  
    B             C
  min max       min       max
A
1   1   2  0.227877  0.362838
2   3   4 -0.562860  1.267767

Select a column for aggregation

>>> df.groupby('A').B.agg(['min', 'max'])  
   min  max
A
1    1    2
2    3    4

Different aggregations per column

>>> df.groupby('A').agg({'B': ['min', 'max'], 'C': 'sum'})  
    B             C
  min max       sum
A
1   1   2  0.590716
2   3   4  0.704907
apply(func, meta='__no_default__')

Parallel version of pandas GroupBy.apply

This mimics the pandas version except for the following:

  1. The user should provide output metadata.
  2. If the grouper does not align with the index then this causes a full shuffle. The order of rows within each group may not be preserved.
Parameters:

func: function

Function to apply

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Returns:

applied : Series or DataFrame depending on columns keyword
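
Examples

A minimal sketch passing meta as a dict of output column dtypes (the frame and the demean function are illustrative):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1.0, 2.0, 3.0, 4.0]})
>>> ddf = dd.from_pandas(pdf, npartitions=2)
>>> def demean(group):
...     # subtract the group mean from column B
...     return group.assign(B=group.B - group.B.mean())
>>> ddf.groupby('A').apply(demean, meta={'A': 'int64', 'B': 'float64'}).compute()  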

count(split_every=None, split_out=1)

Compute count of group, excluding missing values

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

cumcount(axis=None)

Number each item in each group from 0 to the length of that group - 1.

Essentially this is equivalent to

>>> self.apply(lambda x: Series(np.arange(len(x)), x.index))  
Parameters:

ascending : bool, default True

If False, number in reverse, from length of group - 1 to 0.

See also

ngroup
Number the groups themselves.

Notes

Dask doesn’t support the following argument(s).

  • ascending

Examples

>>> df = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']],  
...                   columns=['A'])
>>> df  
   A
0  a
1  a
2  a
3  b
4  b
5  a
>>> df.groupby('A').cumcount()  
0    0
1    1
2    2
3    0
4    1
5    3
dtype: int64
>>> df.groupby('A').cumcount(ascending=False)  
0    3
1    2
2    1
3    1
4    0
5    0
dtype: int64
cumprod(axis=0)

Cumulative product for each group

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

cumsum(axis=0)

Cumulative sum for each group

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

get_group(key)

Constructs NDFrame from group with provided name

Parameters:

name : object

the name of the group to get as a DataFrame

obj : NDFrame, default None

the NDFrame to take the DataFrame out of. If it is None, the object groupby was called on will be used

Returns:

group : type of obj

Notes

Dask doesn’t support the following argument(s).

  • name
  • obj
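
Examples

A minimal sketch selecting the rows belonging to one group key (the frame is illustrative):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'A': [1, 1, 2], 'B': [10, 20, 30]}),
...                      npartitions=2)
>>> ddf.groupby('A').get_group(1).compute()  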
max(split_every=None, split_out=1)

Compute max of group values

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

mean(split_every=None, split_out=1)

Compute mean of groups, excluding missing values

For multiple groupings, the result index will be a MultiIndex

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

min(split_every=None, split_out=1)

Compute min of group values

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

size(split_every=None, split_out=1)

Compute group sizes

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

std(ddof=1, split_every=None, split_out=1)

Compute standard deviation of groups, excluding missing values

For multiple groupings, the result index will be a MultiIndex

Parameters:

ddof : integer, default 1

degrees of freedom

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

sum(split_every=None, split_out=1)

Compute sum of group values

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

var(ddof=1, split_every=None, split_out=1)

Compute variance of groups, excluding missing values

For multiple groupings, the result index will be a MultiIndex

Parameters:

ddof : integer, default 1

degrees of freedom

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

SeriesGroupBy

class dask.dataframe.groupby.SeriesGroupBy(df, by=None, slice=None)
agg(arg, split_every=None, split_out=1)

Aggregate using callable, string, dict, or list of string/callables

Parameters:

func : callable, string, dictionary, or list of string/callables

Function to use for aggregating the data. If a function, must either work when passed a Series or when passed to Series.apply. For a DataFrame, can pass a dict, if the keys are DataFrame column names.

Accepted Combinations are:

  • string function name
  • function
  • list of functions
  • dict of column names -> functions (or list of functions)
Returns:

aggregated : Series

See also

pandas.Series.groupby.apply, pandas.Series.groupby.transform, pandas.Series.aggregate

Notes

Numpy functions mean/median/prod/sum/std/var are special cased so the default behavior is applying the function along axis=0 (e.g., np.mean(arr_2d, axis=0)) as opposed to mimicking the default Numpy behavior (e.g., np.mean(arr_2d)).

agg is an alias for aggregate. Use the alias.

Examples

>>> s = Series([1, 2, 3, 4])  
>>> s  
0    1
1    2
2    3
3    4
dtype: int64
>>> s.groupby([1, 1, 2, 2]).min()  
1    1
2    3
dtype: int64
>>> s.groupby([1, 1, 2, 2]).agg('min')  
1    1
2    3
dtype: int64
>>> s.groupby([1, 1, 2, 2]).agg(['min', 'max'])  
   min  max
1    1    2
2    3    4
aggregate(arg, split_every=None, split_out=1)

Aggregate using callable, string, dict, or list of string/callables

Parameters:

func : callable, string, dictionary, or list of string/callables

Function to use for aggregating the data. If a function, must either work when passed a Series or when passed to Series.apply. For a DataFrame, can pass a dict, if the keys are DataFrame column names.

Accepted Combinations are:

  • string function name
  • function
  • list of functions
  • dict of column names -> functions (or list of functions)
Returns:

aggregated : Series

See also

pandas.Series.groupby.apply, pandas.Series.groupby.transform, pandas.Series.aggregate

Notes

Numpy functions mean/median/prod/sum/std/var are special cased so the default behavior is applying the function along axis=0 (e.g., np.mean(arr_2d, axis=0)) as opposed to mimicking the default Numpy behavior (e.g., np.mean(arr_2d)).

agg is an alias for aggregate. Use the alias.

Examples

>>> s = Series([1, 2, 3, 4])  
>>> s  
0    1
1    2
2    3
3    4
dtype: int64
>>> s.groupby([1, 1, 2, 2]).min()  
1    1
2    3
dtype: int64
>>> s.groupby([1, 1, 2, 2]).agg('min')  
1    1
2    3
dtype: int64
>>> s.groupby([1, 1, 2, 2]).agg(['min', 'max'])  
   min  max
1    1    2
2    3    4
apply(func, meta='__no_default__')

Parallel version of pandas GroupBy.apply

This mimics the pandas version except for the following:

  1. The user should provide output metadata.
  2. If the grouper does not align with the index then this causes a full shuffle. The order of rows within each group may not be preserved.
Parameters:

func: function

Function to apply

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Returns:

applied : Series or DataFrame depending on columns keyword

count(split_every=None, split_out=1)

Compute count of group, excluding missing values

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

cumcount(axis=None)

Number each item in each group from 0 to the length of that group - 1.

Essentially this is equivalent to

>>> self.apply(lambda x: Series(np.arange(len(x)), x.index))  
Parameters:

ascending : bool, default True

If False, number in reverse, from length of group - 1 to 0.

See also

ngroup
Number the groups themselves.

Notes

Dask doesn’t support the following argument(s).

  • ascending

Examples

>>> df = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']],  
...                   columns=['A'])
>>> df  
   A
0  a
1  a
2  a
3  b
4  b
5  a
>>> df.groupby('A').cumcount()  
0    0
1    1
2    2
3    0
4    1
5    3
dtype: int64
>>> df.groupby('A').cumcount(ascending=False)  
0    3
1    2
2    1
3    1
4    0
5    0
dtype: int64
cumprod(axis=0)

Cumulative product for each group

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

cumsum(axis=0)

Cumulative sum for each group

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

get_group(key)

Constructs NDFrame from group with provided name

Parameters:

name : object

the name of the group to get as a DataFrame

obj : NDFrame, default None

the NDFrame to take the DataFrame out of. If it is None, the object groupby was called on will be used

Returns:

group : type of obj

Notes

Dask doesn’t support the following argument(s).

  • name
  • obj
max(split_every=None, split_out=1)

Compute max of group values

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

mean(split_every=None, split_out=1)

Compute mean of groups, excluding missing values

For multiple groupings, the result index will be a MultiIndex

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

min(split_every=None, split_out=1)

Compute min of group values

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

size(split_every=None, split_out=1)

Compute group sizes

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

std(ddof=1, split_every=None, split_out=1)

Compute standard deviation of groups, excluding missing values

For multiple groupings, the result index will be a MultiIndex

Parameters:

ddof : integer, default 1

degrees of freedom

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

sum(split_every=None, split_out=1)

Compute sum of group values

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

var(ddof=1, split_every=None, split_out=1)

Compute variance of groups, excluding missing values

For multiple groupings, the result index will be a MultiIndex

Parameters:

ddof : integer, default 1

degrees of freedom

See also

pandas.Series.groupby, pandas.DataFrame.groupby, pandas.Panel.groupby

Storage and Conversion

dask.dataframe.read_csv(urlpath, blocksize=64000000, collection=True, lineterminator=None, compression=None, sample=256000, enforce=False, assume_missing=False, storage_options=None, **kwargs)

Read CSV files into a Dask.DataFrame

This parallelizes the pandas.read_csv function in the following ways:

  • It supports loading many files at once using globstrings:

    >>> df = dd.read_csv('myfiles.*.csv')  
    
  • In some cases it can break up large files:

    >>> df = dd.read_csv('largefile.csv', blocksize=25e6)  # 25MB chunks  
    
  • It can read CSV files from external resources (e.g. S3, HDFS) by providing a URL:

    >>> df = dd.read_csv('s3://bucket/myfiles.*.csv')  
    >>> df = dd.read_csv('hdfs:///myfiles.*.csv')  
    >>> df = dd.read_csv('hdfs://namenode.example.com/myfiles.*.csv')  
    

Internally dd.read_csv uses pandas.read_csv and supports many of the same keyword arguments with the same performance guarantees. See the docstring for pandas.read_csv for more information on available keyword arguments.

Parameters:

urlpath : string

Absolute or relative filepath, URL (may include protocols like s3://), or globstring for CSV files.

blocksize : int or None, optional

Number of bytes by which to cut up larger files. Default value is computed based on available physical memory and the number of cores. If None, use a single block for each file.

collection : boolean, optional

Return a dask.dataframe if True or list of dask.delayed objects if False

sample : int, optional

Number of bytes to use when determining dtypes

assume_missing : bool, optional

If True, all integer columns that aren’t specified in dtype are assumed to contain missing values, and are converted to floats. Default is False.

storage_options : dict, optional

Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc.

**kwargs

Extra keyword arguments to forward to pandas.read_csv.

Notes

Dask dataframe tries to infer the dtype of each column by reading a sample from the start of the file (or of the first file if it’s a glob). Usually this works fine, but if the dtype is different later in the file (or in other files) this can cause issues. For example, if all the rows in the sample had integer dtypes, but later on there was a NaN, then this would error at compute time. To fix this, you have a few options:

  • Provide explicit dtypes for the offending columns using the dtype keyword. This is the recommended solution.
  • Use the assume_missing keyword to assume that all columns inferred as integers contain missing values, and convert them to floats.
  • Increase the size of the sample using the sample keyword.

It should also be noted that this function may fail if a CSV file includes quoted strings that contain the line terminator. To get around this you can specify blocksize=None to not split files into multiple partitions, at the cost of reduced parallelism.
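
For example, the dtype and assume_missing options can be passed like this (the column name user_id is hypothetical):

>>> df = dd.read_csv('myfiles.*.csv', dtype={'user_id': 'float64'})  
>>> df = dd.read_csv('myfiles.*.csv', assume_missing=True)  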

dask.dataframe.read_table(urlpath, blocksize=64000000, collection=True, lineterminator=None, compression=None, sample=256000, enforce=False, assume_missing=False, storage_options=None, **kwargs)

Read delimited files into a Dask.DataFrame

This parallelizes the pandas.read_table function in the following ways:

  • It supports loading many files at once using globstrings:

    >>> df = dd.read_table('myfiles.*.csv')  
    
  • In some cases it can break up large files:

    >>> df = dd.read_table('largefile.csv', blocksize=25e6)  # 25MB chunks  
    
  • It can read CSV files from external resources (e.g. S3, HDFS) by providing a URL:

    >>> df = dd.read_table('s3://bucket/myfiles.*.csv')  
    >>> df = dd.read_table('hdfs:///myfiles.*.csv')  
    >>> df = dd.read_table('hdfs://namenode.example.com/myfiles.*.csv')  
    

Internally dd.read_table uses pandas.read_table and supports many of the same keyword arguments with the same performance guarantees. See the docstring for pandas.read_table for more information on available keyword arguments.

Parameters:

urlpath : string

Absolute or relative filepath, URL (may include protocols like s3://), or globstring for delimited files.

blocksize : int or None, optional

Number of bytes by which to cut up larger files. Default value is computed based on available physical memory and the number of cores. If None, use a single block for each file.

collection : boolean, optional

Return a dask.dataframe if True or list of dask.delayed objects if False

sample : int, optional

Number of bytes to use when determining dtypes

assume_missing : bool, optional

If True, all integer columns that aren’t specified in dtype are assumed to contain missing values, and are converted to floats. Default is False.

storage_options : dict, optional

Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc.

**kwargs

Extra keyword arguments to forward to pandas.read_table.

Notes

Dask dataframe tries to infer the dtype of each column by reading a sample from the start of the file (or of the first file if it’s a glob). Usually this works fine, but if the dtype is different later in the file (or in other files) this can cause issues. For example, if all the rows in the sample had integer dtypes, but later on there was a NaN, then this would error at compute time. To fix this, you have a few options:

  • Provide explicit dtypes for the offending columns using the dtype keyword. This is the recommended solution.
  • Use the assume_missing keyword to assume that all columns inferred as integers contain missing values, and convert them to floats.
  • Increase the size of the sample using the sample keyword.

It should also be noted that this function may fail if a delimited file includes quoted strings that contain the line terminator. To get around this you can specify blocksize=None to not split files into multiple partitions, at the cost of reduced parallelism.

dask.dataframe.read_parquet(path, columns=None, filters=None, categories=None, index=None, storage_options=None, engine='auto')

Read ParquetFile into a Dask DataFrame

This reads a directory of Parquet data into a Dask.dataframe, one file per partition. It selects the index among the sorted columns if any exist.

Parameters:

path : string

Source directory for data. May be a glob string. Prepend with protocol like s3:// or hdfs:// for remote data.

columns: list or None

List of column names to load

filters: list

List of filters to apply, like [('x', '>', 0), ...]. This implements row-group (partition)-level filtering only, i.e., to prevent the loading of some chunks of the data, and only if relevant statistics have been included in the metadata.

index: string or None (default) or False

Name of index column to use if that column is sorted; False to force dask to not use any column as the index

categories: list, dict or None

For any fields listed here, if the parquet encoding is Dictionary, the column will be created with dtype category. Use only if it is guaranteed that the column is encoded as dictionary in all row-groups. If a list, assumes up to 2**16-1 labels; if a dict, specify the number of labels expected; if None, will load categories automatically for data written by dask/fastparquet, not otherwise.

storage_options : dict

Key/value pairs to be passed on to the file-system backend, if any.

engine : {‘auto’, ‘fastparquet’, ‘pyarrow’}, default ‘auto’

Parquet reader library to use. If only one library is installed, it will use that one; if both, it will use ‘fastparquet’

See also

to_parquet

Examples

>>> df = read_parquet('s3://bucket/my-parquet-data')  
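
Columns can also be restricted at read time; the column names here are hypothetical:

>>> df = read_parquet('s3://bucket/my-parquet-data', columns=['x', 'y'])  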
dask.dataframe.read_hdf(pattern, key, start=0, stop=None, columns=None, chunksize=1000000, sorted_index=False, lock=True, mode='a')

Read HDF files into a Dask DataFrame

Read hdf files into a dask dataframe. This function is like pandas.read_hdf, except it can read from a single large file, or from multiple files, or from multiple keys from the same file.

Parameters:

pattern : string, list

File pattern (string), buffer to read from, or list of file paths. Can contain wildcards.

key : group identifier in the store. Can contain wildcards

start : optional, integer (defaults to 0), row number to start at

stop : optional, integer (defaults to None, the last row), row number to stop at

columns : list of columns, optional

A list of columns that if not None, will limit the return columns (default is None)

chunksize : positive integer, optional

Maximal number of rows per partition (default is 1000000).

sorted_index : boolean, optional

Option to specify whether or not the input hdf files have a sorted index (default is False).

lock : boolean, optional

Option to use a lock to prevent concurrency issues (default is True).

mode : {‘a’, ‘r’, ‘r+’}, default ‘a’. Mode to use when opening file(s).

  • ‘r’: Read-only; no data can be modified.
  • ‘a’: Append; an existing file is opened for reading and writing, and if the file does not exist it is created.
  • ‘r+’: Similar to ‘a’, but the file must already exist.

Returns:

dask.DataFrame

Examples

Load single file

>>> dd.read_hdf('myfile.1.hdf5', '/x')  

Load multiple files

>>> dd.read_hdf('myfile.*.hdf5', '/x')  
>>> dd.read_hdf(['myfile.1.hdf5', 'myfile.2.hdf5'], '/x')  

Load multiple datasets

>>> dd.read_hdf('myfile.1.hdf5', '/*')  
dask.dataframe.read_sql_table(table, uri, index_col, divisions=None, npartitions=None, limits=None, columns=None, bytes_per_chunk=268435456, **kwargs)

Create dataframe from an SQL table.

If neither divisions nor npartitions is given, the memory footprint of the first five rows will be determined, and partitions of size ~256MB will be used.

Parameters:

table : string or sqlalchemy expression

Select columns from here.

uri : string

Full sqlalchemy URI for the database connection

index_col : string

Column which becomes the index, and defines the partitioning. Should be an indexed, numerical column in the SQL server. Could be a function to return a value, e.g., sql.func.abs(sql.column('value')).label('abs(value)'). Labeling columns created by functions or arithmetic operations is required.

divisions: sequence

Values of the index column to split the table by.

npartitions : int

Number of partitions, if divisions is not given. Will split the values of the index column linearly between limits, if given, or the column max/min.

limits: 2-tuple or None

Manually give upper and lower range of values for use with npartitions; if None, first fetches max/min from the DB. Upper limit, if given, is inclusive.

columns : list of strings or None

Which columns to select; if None, gets all; can include sqlalchemy functions, e.g., sql.func.abs(sql.column('value')).label('abs(value)'). Labeling columns created by functions or arithmetic operations is recommended.

bytes_per_chunk: int

If both divisions and npartitions are None, this is the target size of each partition, in bytes

kwargs : dict

Additional parameters to pass to pd.read_sql()

Returns:

dask.dataframe

Examples

>>> df = dd.read_sql_table('accounts', 'sqlite:///path/to/bank.db',
...                        npartitions=10, index_col='id')  
dask.dataframe.from_array(x, chunksize=50000, columns=None)

Read any slicable array into a Dask Dataframe

Uses getitem syntax to pull slices out of the array. The array need not be a NumPy array but must support slicing syntax

x[50000:100000]

and have 2 dimensions:

x.ndim == 2

or have a record dtype:

x.dtype == [(‘name’, ‘O’), (‘balance’, ‘i8’)]
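
Examples

A minimal sketch using a small NumPy array (the shape and column names are illustrative):

>>> import numpy as np
>>> import dask.dataframe as dd
>>> x = np.arange(10).reshape(5, 2)
>>> df = dd.from_array(x, chunksize=2, columns=['a', 'b'])
>>> df.compute()  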
dask.dataframe.from_pandas(data, npartitions=None, chunksize=None, sort=True, name=None)

Construct a Dask DataFrame from a Pandas DataFrame

This splits an in-memory Pandas dataframe into several parts and constructs a dask.dataframe from those parts on which Dask.dataframe can operate in parallel.

Note that, despite parallelism, Dask.dataframe may not always be faster than Pandas. We recommend that you stay with Pandas for as long as possible before switching to Dask.dataframe.

Parameters:

df : pandas.DataFrame or pandas.Series

The DataFrame/Series with which to construct a Dask DataFrame/Series

npartitions : int, optional

The number of partitions of the index to create. Note that depending on the size and index of the dataframe, the output may have fewer partitions than requested.

chunksize : int, optional

The number of rows per index partition to use.

sort: bool

If True, sort the input first to obtain cleanly divided partitions; if False, skip sorting and accept partitions without clean divisions

name: string, optional

An optional keyname for the dataframe. Defaults to hashing the input

Returns:

dask.DataFrame or dask.Series

A dask DataFrame/Series partitioned along the index

Raises:

TypeError

If something other than a pandas.DataFrame or pandas.Series is passed in.

See also

from_array
Construct a dask.DataFrame from an array that has record dtype
read_csv
Construct a dask.DataFrame from a CSV file

Examples

>>> df = pd.DataFrame(dict(a=list('aabbcc'), b=list(range(6))),
...                   index=pd.date_range(start='20100101', periods=6))
>>> ddf = from_pandas(df, npartitions=3)
>>> ddf.divisions  
(Timestamp('2010-01-01 00:00:00', freq='D'),
 Timestamp('2010-01-03 00:00:00', freq='D'),
 Timestamp('2010-01-05 00:00:00', freq='D'),
 Timestamp('2010-01-06 00:00:00', freq='D'))
>>> ddf = from_pandas(df.a, npartitions=3)  # Works with Series too!
>>> ddf.divisions  
(Timestamp('2010-01-01 00:00:00', freq='D'),
 Timestamp('2010-01-03 00:00:00', freq='D'),
 Timestamp('2010-01-05 00:00:00', freq='D'),
 Timestamp('2010-01-06 00:00:00', freq='D'))
dask.dataframe.from_bcolz(x, chunksize=None, categorize=True, index=None, lock=<unlocked _thread.lock object>, **kwargs)

Read BColz CTable into a Dask Dataframe

BColz is a fast on-disk compressed column store with careful attention given to compression. https://bcolz.readthedocs.io/en/latest/

Parameters:

x : bcolz.ctable

chunksize : int, optional

The size(rows) of blocks to pull out from ctable.

categorize : bool, defaults to True

Automatically categorize all string dtypes

index : string, optional

Column to make the index

lock: bool or Lock

Lock to use when reading or False for no lock (not-thread-safe)

See also

from_array
more generic function not optimized for bcolz
dask.dataframe.from_dask_array(x, columns=None)

Create a Dask DataFrame from a Dask Array.

Converts a 2d array into a DataFrame and a 1d array into a Series.

Parameters:

x: da.Array

columns: list or string

list of column names if DataFrame, single string if Series

See also

dask.bag.to_dataframe
from dask.bag
dask.dataframe._Frame.values
Reverse conversion
dask.dataframe._Frame.to_records
Reverse conversion

Examples

>>> import dask.array as da
>>> import dask.dataframe as dd
>>> x = da.ones((4, 2), chunks=(2, 2))
>>> df = dd.io.from_dask_array(x, columns=['a', 'b'])
>>> df.compute()
     a    b
0  1.0  1.0
1  1.0  1.0
2  1.0  1.0
3  1.0  1.0
dask.dataframe.from_delayed(dfs, meta=None, divisions=None, prefix='from-delayed')

Create Dask DataFrame from many Dask Delayed objects

Parameters:

dfs : list of Delayed

An iterable of dask.delayed.Delayed objects, such as come from dask.delayed These comprise the individual partitions of the resulting dataframe.

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

divisions : tuple, str, optional

Partition boundaries along the index. For a tuple, see http://dask.pydata.org/en/latest/dataframe-design.html#partitions. For the string ‘sorted’, the delayed values will be computed to find the index values; this assumes the indexes are mutually sorted. If None, index information will not be used.

prefix : str, optional

Prefix to prepend to the keys.
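
Examples

A minimal sketch building three delayed pandas partitions (the column name x and the meta dict are illustrative):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> from dask import delayed
>>> parts = [delayed(pd.DataFrame)({'x': [i, i + 1]}) for i in range(3)]
>>> ddf = dd.from_delayed(parts, meta={'x': 'int64'})
>>> ddf.compute()  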

dask.dataframe.to_delayed(df)

Create Dask Delayed objects from a Dask Dataframe

Returns a list of delayed values, one value per partition.

Examples

>>> partitions = df.to_delayed()  
dask.dataframe.to_records(df)

Create Dask Array from a Dask Dataframe

Warning: This creates a dask.array without precise shape information. Operations that depend on shape information, like slicing or reshaping, will not work.

See also

dask.dataframe._Frame.values, dask.dataframe.from_dask_array

Examples

>>> df.to_records()  
dask.array<shape=(nan,), dtype=(numpy.record, [('ind', '<f8'), ('x', 'O'), ('y', '<i8')]), chunksize=(nan,)>
dask.dataframe.to_csv(df, filename, name_function=None, compression=None, compute=True, get=None, storage_options=None, **kwargs)

Store Dask DataFrame to CSV files

One filename per partition will be created. You can specify the filenames in a variety of ways.

Use a globstring:

>>> df.to_csv('/path/to/data/export-*.csv')  

The * will be replaced by the increasing sequence 0, 1, 2, ...

/path/to/data/export-0.csv
/path/to/data/export-1.csv

Use a globstring and a name_function= keyword argument. The name_function function should expect an integer and produce a string. Strings produced by name_function must preserve the order of their respective partition indices.

>>> from datetime import date, timedelta
>>> def name(i):
...     return str(date(2015, 1, 1) + i * timedelta(days=1))
>>> name(0)
'2015-01-01'
>>> name(15)
'2015-01-16'
>>> df.to_csv('/path/to/data/export-*.csv', name_function=name)  
/path/to/data/export-2015-01-01.csv
/path/to/data/export-2015-01-02.csv
...

You can also provide an explicit list of paths:

>>> paths = ['/path/to/data/alice.csv', '/path/to/data/bob.csv', ...]  
>>> df.to_csv(paths) 
Parameters:

filename : string

Path glob indicating the naming scheme for the output files

name_function : callable, default None

Function accepting an integer (partition index) and producing a string to replace the asterisk in the given filename globstring. Should preserve the lexicographic order of partitions

compression : string or None

String like ‘gzip’ or ‘xz’. Must support efficient random access. Filenames with extensions corresponding to known compression algorithms (gz, bz2) will be compressed accordingly automatically

sep : character, default ‘,’

Field delimiter for the output file

na_rep : string, default ‘’

Missing data representation

float_format : string, default None

Format string for floating point numbers

columns : sequence, optional

Columns to write

header : boolean or list of string, default True

Write out column names. If a list of string is given it is assumed to be aliases for the column names

index : boolean, default True

Write row names (index)

index_label : string or sequence, or False, default None

Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex. If False do not print fields for index names. Use index_label=False for easier importing in R

nanRep : None

deprecated, use na_rep

mode : str

Python write mode, default ‘w’

encoding : string, optional

A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3.

compression : string, optional

a string representing the compression to use in the output file, allowed values are ‘gzip’, ‘bz2’, ‘xz’, only used when the first argument is a filename

line_terminator : string, default ‘\n’

The newline character or character sequence to use in the output file

quoting : optional constant from csv module

defaults to csv.QUOTE_MINIMAL

quotechar : string (length 1), default ‘”’

character used to quote fields

doublequote : boolean, default True

Control quoting of quotechar inside a field

escapechar : string (length 1), default None

character used to escape sep and quotechar when appropriate

chunksize : int or None

rows to write at a time

tupleize_cols : boolean, default False

write MultiIndex columns as a list of tuples (if True) or in the new, expanded format (if False)

date_format : string, default None

Format string for datetime objects

decimal: string, default ‘.’

Character recognized as decimal separator. E.g. use ‘,’ for European data

storage_options: dict

Parameters passed on to the backend filesystem class.

Returns:

The names of the files written, if they were computed immediately; otherwise, the delayed tasks associated with writing the files.

dask.dataframe.to_bag(df, index=False)

Create Dask Bag from a Dask DataFrame

Parameters:

index : bool, optional

If True, the elements are tuples of (index, value), otherwise they’re just the value. Default is False.

Examples

>>> bag = df.to_bag()  
dask.dataframe.to_hdf(df, path, key, mode='a', append=False, get=None, name_function=None, compute=True, lock=None, dask_kwargs={}, **kwargs)

Store Dask Dataframe to Hierarchical Data Format (HDF) files

This is a parallel version of the Pandas function of the same name. Please see the Pandas docstring for more detailed information about shared keyword arguments.

This function differs from the Pandas version by saving the many partitions of a Dask DataFrame in parallel, either to many files, or to many datasets within the same file. You may specify this parallelism with an asterisk * within the filename or datapath, and an optional name_function. The asterisk will be replaced with an increasing sequence of integers starting from 0 or with the result of calling name_function on each of those integers.

This function only supports the Pandas 'table' format, not the more specialized 'fixed' format.

Parameters:

path: string

Path to a target filename. May contain a * to denote many filenames

key: string

Datapath within the files. May contain a * to denote many locations

name_function: function

A function to convert the * in the above options to a string. Should take in a number from 0 to the number of partitions and return a string. (see examples below)

compute: bool

Whether or not to execute immediately. If False then this returns a dask.Delayed value.

lock: Lock, optional

Lock to use to prevent concurrency issues. By default a threading.Lock, multiprocessing.Lock or SerializableLock will be used depending on your scheduler if a lock is required. See dask.utils.get_scheduler_lock for more information about lock selection.

**other:

See pandas.to_hdf for more information

Returns:

None: if compute == True

delayed value: if compute == False

See also

read_hdf, to_parquet

Examples

Save Data to a single file

>>> df.to_hdf('output.hdf', '/data')            

Save data to multiple datapaths within the same file:

>>> df.to_hdf('output.hdf', '/data-*')          

Save data to multiple files:

>>> df.to_hdf('output-*.hdf', '/data')          

Save data to multiple files, using the multiprocessing scheduler:

>>> df.to_hdf('output-*.hdf', '/data', get=dask.multiprocessing.get) 

Specify a custom naming scheme. This writes files as ‘2000-01-01.hdf’, ‘2000-01-02.hdf’, ‘2000-01-03.hdf’, etc.

>>> from datetime import date, timedelta
>>> base = date(year=2000, month=1, day=1)
>>> def name_function(i):
...     ''' Convert integer 0 to n to a string '''
...     return str(base + timedelta(days=i))
>>> df.to_hdf('*.hdf', '/data', name_function=name_function) 
dask.dataframe.to_parquet(df, path, engine='auto', compression='default', write_index=None, append=False, ignore_divisions=False, partition_on=None, storage_options=None, compute=True, **kwargs)

Store Dask.dataframe to Parquet files

Parameters:

df : dask.dataframe.DataFrame

path : string

Destination directory for data. Prepend with protocol like s3:// or hdfs:// for remote data.

engine : {‘auto’, ‘fastparquet’, ‘pyarrow’}, default ‘auto’

Parquet library to use. If only one library is installed, it will use that one; if both, it will use ‘fastparquet’.

compression : string or dict, optional

Either a string like "snappy" or a dictionary mapping column names to compressors like {"name": "gzip", "values": "snappy"}. The default is "default", which uses the default compression for whichever engine is selected.

write_index : boolean, optional

Whether or not to write the index. Defaults to True if divisions are known.

append : bool, optional

If False (default), construct data-set from scratch. If True, add new row-group(s) to an existing data-set. In the latter case, the data-set must exist, and the schema must match the input data.

ignore_divisions : bool, optional

If False (default) raises error when previous divisions overlap with the new appended divisions. Ignored if append=False.

partition_on : list, optional

Construct directory-based partitioning by splitting on these fields’ values. Each dask partition will result in one or more datafiles, there will be no global groupby.

storage_options : dict, optional

Key/value pairs to be passed on to the file-system backend, if any.

compute : bool, optional

If True (default) then the result is computed immediately. If False then a dask.delayed object is returned for future computation.

**kwargs

Extra options to be passed on to the specific backend.

See also

read_parquet
Read parquet data to dask.dataframe

Notes

Each partition will be written to a separate file.

Examples

>>> df = dd.read_csv(...)  
>>> to_parquet('/path/to/output/', df, compression='snappy')  

Rolling

dask.dataframe.rolling.rolling_apply(arg, window, func, min_periods=None, freq=None, center=False, args=(), kwargs={})

Generic moving function application.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

func : function

Must produce a single value from an ndarray input

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Whether the label should correspond with center of window

args : tuple

Passed on to func

kwargs : dict

Passed on to func

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

To learn more about the frequency strings, please see this link.
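
Examples

A rough sketch following the documented signature; these module-level rolling_* helpers mirror the older pandas top-level rolling functions, so the .rolling(...) method interface may be preferable where available:

>>> import numpy as np
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> from dask.dataframe.rolling import rolling_apply
>>> s = dd.from_pandas(pd.Series(range(10)), npartitions=2)
>>> rolling_apply(s, 3, np.sum).compute()  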

dask.dataframe.rolling.map_overlap(func, df, before, after, *args, **kwargs)

Apply a function to each partition, sharing rows with adjacent partitions.

Parameters:

func : function

Function applied to each partition.

df : dd.DataFrame, dd.Series

before : int or timedelta

The rows to prepend to partition i from the end of partition i - 1.

after : int or timedelta

The rows to append to partition i from the beginning of partition i + 1.

args, kwargs :

Arguments and keywords to pass to the function. The partition will be the first argument, and these will be passed after.

See also

dd.DataFrame.map_overlap
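
Examples

A minimal sketch that prepends the last two rows of each partition to the next one, so a rolling window of three is complete at partition boundaries (the data is illustrative):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> from dask.dataframe.rolling import map_overlap
>>> s = dd.from_pandas(pd.Series(range(10)), npartitions=2)
>>> map_overlap(lambda part: part.rolling(3).sum(), s, 2, 0).compute()  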

dask.dataframe.rolling.rolling_count(arg, window, **kwargs)

Rolling count of number of non-NaN observations inside provided window.

Parameters:

arg : DataFrame or numpy ndarray-like

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Whether the label should correspond with center of window

how : string, default ‘mean’

Method for down- or re-sampling

Returns:

rolling_count : type of caller

Notes

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

To learn more about the frequency strings, please see this link.

dask.dataframe.rolling.rolling_kurt(arg, window, min_periods=None, freq=None, center=False, **kwargs)

Unbiased moving kurtosis.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Set the labels at the center of the window.

how : string, default ‘None’

Method for down- or re-sampling

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

dask.dataframe.rolling.rolling_max(arg, window, min_periods=None, freq=None, center=False, **kwargs)

Moving maximum.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Set the labels at the center of the window.

how : string, default ‘max’

Method for down- or re-sampling

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

dask.dataframe.rolling.rolling_mean(arg, window, min_periods=None, freq=None, center=False, **kwargs)

Moving mean.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Set the labels at the center of the window.

how : string, default ‘None’

Method for down- or re-sampling

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

dask.dataframe.rolling.rolling_median(arg, window, min_periods=None, freq=None, center=False, **kwargs)

Moving median.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Set the labels at the center of the window.

how : string, default ‘median’

Method for down- or re-sampling

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

dask.dataframe.rolling.rolling_min(arg, window, min_periods=None, freq=None, center=False, **kwargs)

Moving minimum.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Set the labels at the center of the window.

how : string, default ‘min’

Method for down- or re-sampling

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

dask.dataframe.rolling.rolling_quantile(arg, window, quantile, min_periods=None, freq=None, center=False)

Moving quantile.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

quantile : float

0 <= quantile <= 1

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Whether the label should correspond with center of window

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

To learn more about the frequency strings, please see the pandas documentation on offset aliases.
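
A short sketch, with an illustrative series and partition count; quantile=0.5 requests the median of each three-observation window:

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> from dask.dataframe.rolling import rolling_quantile
>>> s = dd.from_pandas(pd.Series([5.0, 1.0, 4.0, 2.0, 3.0]), npartitions=2)
>>> rolling_quantile(s, window=3, quantile=0.5).compute()
0    NaN
1    NaN
2    4.0
3    2.0
4    3.0
dtype: float64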

dask.dataframe.rolling.rolling_skew(arg, window, min_periods=None, freq=None, center=False, **kwargs)

Unbiased moving skewness.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Set the labels at the center of the window.

how : string, default ‘None’

Method for down- or re-sampling

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

dask.dataframe.rolling.rolling_std(arg, window, min_periods=None, freq=None, center=False, **kwargs)

Moving standard deviation.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Set the labels at the center of the window.

how : string, default ‘None’

Method for down- or re-sampling

ddof : int, default 1

Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).
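
A sketch of forwarding ddof through **kwargs (the series and npartitions are illustrative); with the default ddof=1, every three-point window of consecutive integers has sample standard deviation 1.0:

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> from dask.dataframe.rolling import rolling_std
>>> s = dd.from_pandas(pd.Series([1.0, 2.0, 3.0, 4.0, 5.0]), npartitions=2)
>>> rolling_std(s, window=3).compute()   # ddof defaults to 1 (sample standard deviation)
0    NaN
1    NaN
2    1.0
3    1.0
4    1.0
dtype: float64
>>> population = rolling_std(s, window=3, ddof=0)   # ddof forwarded via **kwargs; divisor becomes N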

dask.dataframe.rolling.rolling_sum(arg, window, min_periods=None, freq=None, center=False, **kwargs)

Moving sum.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Set the labels at the center of the window.

how : string, default ‘None’

Method for down- or re-sampling

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

dask.dataframe.rolling.rolling_var(arg, window, min_periods=None, freq=None, center=False, **kwargs)

Moving variance.

Parameters:

arg : Series, DataFrame

window : int

Size of the moving window. This is the number of observations used for calculating the statistic.

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Set the labels at the center of the window.

how : string, default ‘None’

Method for down- or re-sampling

ddof : int, default 1

Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

Returns:

y : type of input argument

Notes

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

dask.dataframe.rolling.rolling_window(arg, window=None, win_type=None, min_periods=None, freq=None, center=False, mean=True, axis=0, how=None, **kwargs)

Applies a moving window of type win_type and size window on the data.

Parameters:

arg : Series, DataFrame

window : int or ndarray

Weighting window specification. If the window is an integer, then it is treated as the window length and win_type is required

win_type : str, default None

Window type (see Notes)

min_periods : int, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

freq : string or DateOffset object, optional (default None)

Frequency to conform the data to before computing the statistic. Specified as a frequency string or DateOffset object.

center : boolean, default False

Whether the label should correspond with center of window

mean : boolean, default True

If True computes weighted mean, else weighted sum

axis : {0, 1}, default 0

how : string, default ‘mean’

Method for down- or re-sampling

Returns:

y : type of input argument

Notes

The recognized window types are:

  • boxcar
  • triang
  • blackman
  • hamming
  • bartlett
  • parzen
  • bohman
  • blackmanharris
  • nuttall
  • barthann
  • kaiser (needs beta)
  • gaussian (needs std)
  • general_gaussian (needs power, width)
  • slepian (needs width).

By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.

The freq keyword is used to conform time series data to a specified frequency by resampling the data. This is done with the default parameters of resample() (i.e. using the mean).

To learn more about the frequency strings, please see the pandas documentation on offset aliases.
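
A sketch of the weighted window, assuming scipy is installed for the window types listed above (the series, npartitions, and choice of win_type are illustrative); ‘boxcar’ weights are all equal, so the weighted mean reproduces a plain rolling mean:

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> from dask.dataframe.rolling import rolling_window
>>> s = dd.from_pandas(pd.Series([1.0, 2.0, 3.0, 4.0, 5.0]), npartitions=2)
>>> rolling_window(s, window=3, win_type='boxcar').compute()   # equal weights: plain rolling mean
0    NaN
1    NaN
2    2.0
3    3.0
4    4.0
dtype: float64
>>> weighted = rolling_window(s, window=3, win_type='triang')  # triangular weights; a weighted mean since mean=True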

Other functions

dask.dataframe.compute(*args, **kwargs)

Compute several dask collections at once.

Parameters:

args : object

Any number of objects. If it is a dask object, it’s computed and the result is returned. By default, python builtin collections are also traversed to look for dask objects (for more information see the traverse keyword). Non-dask arguments are passed through unchanged.

traverse : bool, optional

By default dask traverses builtin python collections looking for dask objects passed to compute. For large collections this can be expensive. If none of the arguments contain any dask objects, set traverse=False to avoid doing this traversal.

get : callable, optional

A scheduler get function to use. If not provided, the default is to check the global settings first, and then fall back to defaults for the collections.

optimize_graph : bool, optional

If True [default], the optimizations for each collection are applied before computation. Otherwise the graph is run as is. This can be useful for debugging.

kwargs

Extra keywords to forward to the scheduler get function.

Examples

>>> import dask.array as da
>>> a = da.arange(10, chunks=2).sum()
>>> b = da.arange(10, chunks=2).mean()
>>> compute(a, b)
(45, 4.5)

By default, dask objects inside python collections will also be computed:

>>> compute({'a': a, 'b': b, 'c': 1})  
({'a': 45, 'b': 4.5, 'c': 1},)
dask.dataframe.map_partitions(func, *args, **kwargs)

Apply Python function on each DataFrame partition.

Parameters:

func : function

Function applied to each partition.

args, kwargs :

Arguments and keywords to pass to the function. At least one of the args should be a dask DataFrame.

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
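
A minimal sketch of the free-function form with an explicit meta, using the dict-of-dtypes shorthand described above (the frame, the lambda, and the column names are illustrative):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [10, 20, 30, 40]})
>>> ddf = dd.from_pandas(df, npartitions=2)
>>> result = dd.map_partitions(lambda part: part.assign(z=part.x + part.y),
...                            ddf, meta={'x': 'i8', 'y': 'i8', 'z': 'i8'})
>>> result.dtypes
x    int64
y    int64
z    int64
dtype: object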

dask.dataframe.multi.concat(dfs, axis=0, join='outer', interleave_partitions=False)

Concatenate DataFrames along rows.

  • When axis=0 (default), concatenate DataFrames row-wise:
    • If all divisions are known and ordered, concatenate DataFrames keeping divisions. When divisions are not ordered, specifying interleave_partitions=True allows concatenating the divisions one by one.
    • If any divisions are unknown, concatenate DataFrames, resetting the divisions to unknown (None)
  • When axis=1, concatenate DataFrames column-wise:
    • Allowed if all divisions are known.
    • If any divisions are unknown, a ValueError is raised. (A column-wise sketch follows the row-wise examples below.)
Parameters:

dfs : list

List of dask.DataFrames to be concatenated

axis : {0, 1, ‘index’, ‘columns’}, default 0

The axis to concatenate along

join : {‘inner’, ‘outer’}, default ‘outer’

How to handle indexes on other axis

interleave_partitions : bool, default False

Whether to concatenate DataFrames ignoring their division order. If True, the divisions are interleaved and concatenated one by one.

Examples

If all divisions are known and ordered, divisions are kept.

>>> a                                               
dd.DataFrame<x, divisions=(1, 3, 5)>
>>> b                                               
dd.DataFrame<y, divisions=(6, 8, 10)>
>>> dd.concat([a, b])                               
dd.DataFrame<concat-..., divisions=(1, 3, 6, 8, 10)>

Unable to concatenate if divisions are not ordered.

>>> a                                               
dd.DataFrame<x, divisions=(1, 3, 5)>
>>> b                                               
dd.DataFrame<y, divisions=(2, 3, 6)>
>>> dd.concat([a, b])                               
ValueError: All inputs have known divisions which cannot be concatenated
in order. Specify interleave_partitions=True to ignore order

Specify interleave_partitions=True to ignore the division order.

>>> dd.concat([a, b], interleave_partitions=True)   
dd.DataFrame<concat-..., divisions=(1, 2, 3, 5, 6)>

If any divisions are unknown, the resulting divisions will be unknown.

>>> a                                               
dd.DataFrame<x, divisions=(None, None)>
>>> b                                               
dd.DataFrame<y, divisions=(1, 4, 10)>
>>> dd.concat([a, b])                               
dd.DataFrame<concat-..., divisions=(None, None, None, None)>
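
A sketch of the column-wise (axis=1) case mentioned earlier, assuming two small frames with known, matching divisions (the data and column names are placeholders):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> x = dd.from_pandas(pd.DataFrame({'a': [1, 2, 3]}), npartitions=1)
>>> y = dd.from_pandas(pd.DataFrame({'b': [4, 5, 6]}), npartitions=1)
>>> dd.concat([x, y], axis=1).compute()
   a  b
0  1  4
1  2  5
2  3  6
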
dask.dataframe.multi.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

Merge DataFrame objects by performing a database-style join operation by columns or indexes.

If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on.

Parameters:

left : DataFrame

right : DataFrame

how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’

  • left: use only keys from left frame, similar to a SQL left outer join; preserve key order
  • right: use only keys from right frame, similar to a SQL right outer join; preserve key order
  • outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically
  • inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys

on : label or list

Field names to join on. Must be found in both DataFrames. If on is None and not merging on indexes, then it merges on the intersection of the columns by default.

left_on : label or list, or array-like

Field names to join on in left DataFrame. Can be a vector or list of vectors of the length of the DataFrame to use a particular vector as the join key instead of columns

right_on : label or list, or array-like

Field names to join on in right DataFrame or vector/list of vectors per left_on docs

left_index : boolean, default False

Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels

right_index : boolean, default False

Use the index from the right DataFrame as the join key. Same caveats as left_index

sort : boolean, default False

Sort the join keys lexicographically in the result DataFrame. If False, the order of the join keys depends on the join type (how keyword)

suffixes : 2-length sequence (tuple, list, ...)

Suffix to apply to overlapping column names in the left and right side, respectively

copy : boolean, default True

If False, do not copy data unnecessarily

indicator : boolean or string, default False

If True, adds a column to output DataFrame called “_merge” with information on the source of each row. If string, column with information on source of each row will be added to output DataFrame, and column will be named value of string. Information column is Categorical-type and takes on a value of “left_only” for observations whose merge key only appears in ‘left’ DataFrame, “right_only” for observations whose merge key only appears in ‘right’ DataFrame, and “both” if the observation’s merge key is found in both.

New in version 0.17.0.

validate : string, default None

If specified, checks if merge is of specified type.

  • “one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.
  • “one_to_many” or “1:m”: check if merge keys are unique in left dataset.
  • “many_to_one” or “m:1”: check if merge keys are unique in right dataset.
  • “many_to_many” or “m:m”: allowed, but does not result in checks.

New in version 0.21.0.

Returns:

merged : DataFrame

The output type will be the same as ‘left’, if it is a subclass of DataFrame.

See also

merge_ordered, merge_asof

Examples

>>> A              >>> B
    lkey value         rkey value
0   foo  1         0   foo  5
1   bar  2         1   bar  6
2   baz  3         2   qux  7
3   foo  4         3   bar  8
>>> A.merge(B, left_on='lkey', right_on='rkey', how='outer')
   lkey  value_x  rkey  value_y
0  foo   1        foo   5
1  foo   4        foo   5
2  bar   2        bar   6
3  bar   2        bar   8
4  baz   3        NaN   NaN
5  NaN   NaN      qux   7
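
The same call pattern applies to dask DataFrames via dask.dataframe.multi.merge (also available as dd.merge). A hedged sketch follows; the frames, the key column k, and npartitions are illustrative, and the row order of the computed result is not guaranteed:

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> left = dd.from_pandas(pd.DataFrame({'k': [1, 2, 3], 'a': [10, 20, 30]}), npartitions=2)
>>> right = dd.from_pandas(pd.DataFrame({'k': [2, 3, 4], 'b': [200, 300, 400]}), npartitions=2)
>>> joined = dd.merge(left, right, on='k', how='inner')   # lazy; a shuffle-based join is built
>>> joined.columns
Index(['k', 'a', 'b'], dtype='object')
>>> result = joined.compute()   # the two surviving keys are k == 2 and k == 3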