Dask DataFrame API with Logical Query Planning
DataFrame
|
DataFrame-like Expr Collection. |
Return a Series/DataFrame with absolute numeric value of each element. |
|
|
|
|
Align two objects on their axes with the specified join method. |
|
Return whether all elements are True, potentially over an axis. |
|
Return whether any element is True, potentially over an axis. |
|
Parallel version of pandas.DataFrame.apply |
|
Assign new columns to a DataFrame. |
|
Cast a pandas object to a specified dtype |
|
Fill NA/NaN values by using the next valid observation to fill the gap. |
|
Convert columns of the DataFrame to category dtype. |
|
Compute this DataFrame. |
|
Make a copy of the dataframe |
|
Compute pairwise correlation of columns, excluding NA/null values. |
|
Count non-NA cells for each column or row. |
|
Compute pairwise covariance of columns, excluding NA/null values. |
|
Return cumulative maximum over a DataFrame or Series axis. |
|
Return cumulative minimum over a DataFrame or Series axis. |
|
Return cumulative product over a DataFrame or Series axis. |
|
Return cumulative sum over a DataFrame or Series axis. |
|
Generate descriptive statistics. |
|
First discrete difference of element. |
|
|
|
|
Tuple of |
|
|
Drop specified labels from rows or columns. |
|
Return DataFrame with duplicate rows removed. |
|
Remove missing values. |
Return data types |
|
|
|
|
Evaluate a string describing operations on DataFrame columns. |
|
Transform each element of a list-like to a row, replicating index values. |
|
Fill NA/NaN values by propagating the last valid observation to next valid. |
|
Fill NA/NaN values using the specified method. |
|
|
|
|
Get a dask DataFrame/Series representing the nth partition. |
|
|
Group DataFrame using a mapper or by a Series of columns. |
|
|
|
First n rows of the dataset |
|
Return index of first occurrence of maximum over requested axis. |
|
Return index of first occurrence of minimum over requested axis. |
Purely integer-location based indexing for selection by position. |
|
Return dask Index instance |
|
|
Concise summary of a Dask DataFrame |
|
Whether each element in the DataFrame is contained in values. |
Detect missing values. |
|
DataFrame.isnull is an alias for DataFrame.isna. |
|
Iterate over (column name, Series) pairs. |
|
Iterate over DataFrame rows as (index, Series) pairs. |
|
|
Iterate over DataFrame rows as namedtuples. |
|
Join columns of another DataFrame. |
Whether the divisions are known. |
|
|
|
Purely label-location based indexer for selection by label. |
|
|
|
|
Apply a Python function to each partition |
|
Replace values where the condition is True. |
|
Return the maximum of the values over the requested axis. |
|
Return the mean of the values over the requested axis. |
|
Return the median of the values over the requested axis. |
|
Return the approximate median of the values over the requested axis. |
|
Unpivot a DataFrame from wide to long format, optionally leaving identifiers set. |
|
Return the memory usage of each column in bytes. |
Return the memory usage of each partition |
|
|
Merge the DataFrame with another DataFrame |
|
Return the minimum of the values over the requested axis. |
|
|
|
Get the mode(s) of each element along the selected axis. |
|
|
Return dimensionality |
|
|
|
|
Return the first n rows ordered by columns in descending order. |
Return number of partitions |
|
|
Return the first n rows ordered by columns in ascending order. |
Slice dataframe by partitions |
|
|
Persist this dask collection into memory |
|
Create a spreadsheet-style pivot table as a DataFrame. |
|
Return item and drop from frame. |
|
|
|
Return the product of the values over the requested axis. |
|
Approximate row-wise and precise column-wise quantiles of DataFrame |
|
Filter dataframe with complex expression |
|
|
|
Pseudorandomly split dataframe into different pieces row-wise |
|
|
|
Rename columns or index labels. |
|
Set the name of the axis for the index or columns. |
|
Repartition a collection |
|
Replace values given in to_replace with value. |
|
Resample time-series data. |
|
Reset the index to the default index. |
|
|
|
|
|
|
|
Round a DataFrame to a variable number of decimal places. |
|
|
|
|
|
|
|
Random sample of items |
|
Return a subset of the DataFrame's columns based on the column dtypes. |
|
Return unbiased standard error of the mean over requested axis. |
|
Set the DataFrame index (row labels) using an existing column. |
|
Rearrange DataFrame into new partitions |
Size of the Series or DataFrame as a Delayed object. |
|
|
Sort the dataset by a single column. |
|
Squeeze 1 dimensional axis objects into scalars. |
|
Return sample standard deviation over requested axis. |
|
|
|
Return the sum of the values over the requested axis. |
|
Last n rows of the dataset |
|
Move to a new DataFrame backend |
|
Create a Dask Bag from a Series |
|
See dd.to_csv docstring for more information |
|
Convert a dask DataFrame to a dask array. |
|
Convert to a legacy dask-dataframe collection |
|
Convert into a list of |
|
See dd.to_hdf docstring for more information |
|
Render a DataFrame as an HTML table. |
|
See dd.to_json docstring for more information |
|
Convert to a legacy dask-dataframe collection |
|
|
|
|
|
Render a DataFrame to a console-friendly tabular output. |
|
|
|
Cast to DatetimeIndex of timestamps, at beginning of period. |
|
|
Return a dask.array of the values of this dataframe |
|
|
Return unbiased variance over requested axis. |
|
Visualize the expression or task graph |
|
Replace values where the condition is False. |
Series
|
Series-like Expr Collection. |
|
|
|
Align two objects on their axes with the specified join method. |
|
Return whether all elements are True, potentially over an axis. |
|
Return whether any element is True, potentially over an axis. |
|
Parallel version of pandas.Series.apply |
|
Cast a pandas object to a specified dtype |
|
Compute the lag-N autocorrelation. |
|
Return boolean Series equivalent to left <= series <= right. |
|
Fill NA/NaN values by using the next valid observation to fill the gap. |
Forget division information. |
|
|
Trim values at input threshold(s). |
|
Compute this DataFrame. |
|
Make a copy of the dataframe |
|
Compute correlation with other Series, excluding missing values. |
|
Count non-NA cells for each column or row. |
|
Compute covariance with Series, excluding missing values. |
|
Return cumulative maximum over a DataFrame or Series axis. |
|
Return cumulative minimum over a DataFrame or Series axis. |
|
Return cumulative product over a DataFrame or Series axis. |
|
Return cumulative sum over a DataFrame or Series axis. |
|
Generate descriptive statistics. |
|
First discrete difference of element. |
|
|
|
|
Return a new Series with missing values removed. |
|
|
|
Transform each element of a list-like to a row. |
|
|
Fill NA/NaN values by propagating the last valid observation to next valid. |
|
Fill NA/NaN values using the specified method. |
|
|
|
|
Get a dask DataFrame/Series representing the nth partition. |
|
|
Group Series using a mapper or by a Series of columns. |
|
|
|
First n rows of the dataset |
|
Return index of first occurrence of maximum over requested axis. |
|
Return index of first occurrence of minimum over requested axis. |
|
Whether each element in the DataFrame is contained in values. |
Detect missing values. |
|
DataFrame.isnull is an alias for DataFrame.isna. |
|
Whether the divisions are known. |
|
|
|
Purely label-location based indexer for selection by label. |
|
|
|
|
Map values of Series according to an input mapping or function. |
|
Apply a function to each partition, sharing rows with adjacent partitions. |
|
Apply a Python function to each partition |
|
Replace values where the condition is True. |
|
Return the maximum of the values over the requested axis. |
|
Return the mean of the values over the requested axis. |
Return the median of the values over the requested axis. |
|
|
Return the approximate median of the values over the requested axis. |
|
Return the memory usage of the Series. |
|
Return the memory usage of each partition |
|
Return the minimum of the values over the requested axis. |
|
|
|
|
Number of bytes |
|
Return dimensionality |
|
|
|
|
Return the largest n elements. |
DataFrame.notnull is an alias for DataFrame.notna. |
|
|
Return the smallest n elements. |
|
Return number of unique elements in the object. |
|
Approximate number of unique rows. |
|
Persist this dask collection into memory |
|
Apply chainable functions that expect Series or DataFrames. |
|
|
|
Return the product of the values over the requested axis. |
|
Approximate quantiles of Series |
|
|
|
Pseudorandomly split dataframe into different pieces row-wise |
|
|
|
Repartition a collection |
|
Replace values given in to_replace with value. |
|
Alter Series index labels or name |
|
Resample time-series data. |
|
Reset the index to the default index. |
|
Provides rolling transformations. |
|
Round a DataFrame to a variable number of decimal places. |
|
Random sample of items |
|
Return unbiased standard error of the mean over requested axis. |
Return a tuple representing the dimensionality of the DataFrame. |
|
|
Shift index by desired number of periods with an optional time freq. |
Size of the Series or DataFrame as a Delayed object. |
|
|
Return sample standard deviation over requested axis. |
|
|
|
Return the sum of the values over the requested axis. |
|
Move to a new DataFrame backend |
|
Create a Dask Bag from a Series |
|
See dd.to_csv docstring for more information |
|
Convert a dask DataFrame to a dask array. |
|
Convert into a list of |
|
Convert Series to DataFrame. |
|
See dd.to_hdf docstring for more information |
|
Render a string representation of the Series. |
|
Cast to DatetimeIndex of timestamps, at beginning of period. |
|
|
|
Return Series of unique values in the object. |
|
Return a Series containing counts of unique values. |
Return a dask.array of the values of this dataframe |
|
|
Return unbiased variance over requested axis. |
|
Visualize the expression or task graph |
|
Replace values where the condition is False. |
Index
|
Index-like Expr Collection. |
|
|
|
Align two objects on their axes with the specified join method. |
|
Return whether all elements are True, potentially over an axis. |
|
Return whether any element is True, potentially over an axis. |
|
Parallel version of pandas.Series.apply |
|
Cast a pandas object to a specified dtype |
|
Compute the lag-N autocorrelation. |
|
Return boolean Series equivalent to left <= series <= right. |
|
Fill NA/NaN values by using the next valid observation to fill the gap. |
Forget division information. |
|
|
Trim values at input threshold(s). |
|
Compute this DataFrame. |
|
Make a copy of the dataframe |
|
Compute correlation with other Series, excluding missing values. |
|
Count non-NA cells for each column or row. |
|
Compute covariance with Series, excluding missing values. |
|
Return cumulative maximum over a DataFrame or Series axis. |
|
Return cumulative minimum over a DataFrame or Series axis. |
|
Return cumulative product over a DataFrame or Series axis. |
|
Return cumulative sum over a DataFrame or Series axis. |
|
Generate descriptive statistics. |
|
First discrete difference of element. |
|
|
|
|
Return a new Series with missing values removed. |
|
|
|
Transform each element of a list-like to a row. |
|
|
Fill NA/NaN values by propagating the last valid observation to next valid. |
|
Fill NA/NaN values using the specified method. |
|
|
|
|
Get a dask DataFrame/Series representing the nth partition. |
|
|
Group Series using a mapper or by a Series of columns. |
|
|
|
First n rows of the dataset |
Return boolean if values in the object are monotonically decreasing. |
|
Return boolean if values in the object are monotonically increasing. |
|
|
Whether each element in the DataFrame is contained in values. |
Detect missing values. |
|
DataFrame.isnull is an alias for DataFrame.isna. |
|
Whether the divisions are known. |
|
|
|
Purely label-location based indexer for selection by label. |
|
|
|
|
Map values using an input mapping or function. |
|
Apply a function to each partition, sharing rows with adjacent partitions. |
|
Apply a Python function to each partition |
|
Replace values where the condition is True. |
|
Return the maximum of the values over the requested axis. |
Return the median of the values over the requested axis. |
|
|
Return the approximate median of the values over the requested axis. |
|
Memory usage of the values. |
|
Return the memory usage of each partition |
|
Return the minimum of the values over the requested axis. |
|
|
|
|
Number of bytes |
|
Return dimensionality |
|
|
|
|
Return the largest n elements. |
DataFrame.notnull is an alias for DataFrame.notna. |
|
|
Return the smallest n elements. |
|
Return number of unique elements in the object. |
|
Approximate number of unique rows. |
|
Persist this dask collection into memory |
|
Apply chainable functions that expect Series or DataFrames. |
|
|
|
Approximate quantiles of Series |
|
|
|
Pseudorandomly split dataframe into different pieces row-wise |
|
|
|
Alter Series index labels or name |
|
Repartition a collection |
|
Replace values given in to_replace with value. |
|
Resample time-series data. |
|
Reset the index to the default index. |
|
Provides rolling transformations. |
|
Round a DataFrame to a variable number of decimal places. |
|
Random sample of items |
|
Return unbiased standard error of the mean over requested axis. |
Return a tuple representing the dimensionality of the DataFrame. |
|
|
Shift index by desired number of periods with an optional time freq. |
Size of the Series or DataFrame as a Delayed object. |
|
|
|
|
Move to a new DataFrame backend |
|
Create a Dask Bag from a Series |
|
See dd.to_csv docstring for more information |
|
Convert a dask DataFrame to a dask array. |
|
Convert into a list of |
|
Create a DataFrame with a column containing the Index. |
|
See dd.to_hdf docstring for more information |
|
Create a Series with both index and values equal to the index keys. |
|
Render a string representation of the Series. |
|
Cast to DatetimeIndex of timestamps, at beginning of period. |
|
|
|
Return Series of unique values in the object. |
|
Return a Series containing counts of unique values. |
Return a dask.array of the values of this dataframe |
|
|
Visualize the expression or task graph |
|
Replace values where the condition is False. |
|
Create a DataFrame with a column containing the Index. |
Accessors
Similar to pandas, Dask provides dtype-specific methods under various accessors.
These are separate namespaces within Series
that only apply to specific data types.
The accessor implementations are consistent with the current Dask DataFrame implementation.
Groupby Operations
DataFrame Groupby
|
Aggregate using one or more specified operations |
|
Parallel version of pandas GroupBy.apply |
|
Backward fill the values. |
|
Compute count of group, excluding missing values. |
Number each item in each group from 0 to the length of that group - 1. |
|
|
Cumulative product for each group. |
|
Cumulative sum for each group. |
|
Forward fill the values. |
|
Construct DataFrame from group with provided name. |
|
Compute max of group values. |
|
Compute mean of groups, excluding missing values. |
|
Compute min of group values. |
|
Compute group sizes. |
|
Compute standard deviation of groups, excluding missing values. |
|
Compute sum of group values. |
|
Compute variance of groups, excluding missing values. |
|
Compute pairwise covariance of columns, excluding NA/null values. |
|
Compute pairwise correlation of columns, excluding NA/null values. |
|
Compute the first entry of each column within each group. |
|
Compute the last entry of each column within each group. |
|
Return index of first occurrence of minimum over requested axis. |
|
Return index of first occurrence of maximum over requested axis. |
|
Provides rolling transformations. |
|
Parallel version of pandas GroupBy.transform |
Series Groupby
|
Aggregate using one or more specified operations |
|
Parallel version of pandas GroupBy.apply |
|
Backward fill the values. |
|
Compute count of group, excluding missing values. |
Number each item in each group from 0 to the length of that group - 1. |
|
|
Cumulative product for each group. |
|
Cumulative sum for each group. |
|
Forward fill the values. |
Construct DataFrame from group with provided name. |
|
|
Compute max of group values. |
|
Compute mean of groups, excluding missing values. |
|
Compute min of group values. |
|
Return number of unique elements in the group. |
|
Compute group sizes. |
|
Compute standard deviation of groups, excluding missing values. |
|
Compute sum of group values. |
|
Compute variance of groups, excluding missing values. |
|
Compute the first entry of each column within each group. |
|
Compute the last entry of each column within each group. |
|
Return index of first occurrence of minimum over requested axis. |
|
Return index of first occurrence of maximum over requested axis. |
|
Provides rolling transformations. |
|
Parallel version of pandas GroupBy.transform |
Custom Aggregation
|
User defined groupby-aggregation. |
Rolling Operations
|
Provides rolling transformations. |
|
Provides rolling transformations. |
|
Calculate the rolling custom aggregation function. |
Calculate the rolling count of non NaN observations. |
|
Calculate the rolling Fisher's definition of kurtosis without bias. |
|
Calculate the rolling maximum. |
|
Calculate the rolling mean. |
|
Calculate the rolling median. |
|
Calculate the rolling minimum. |
|
Calculate the rolling quantile. |
|
Calculate the rolling unbiased skewness. |
|
Calculate the rolling standard deviation. |
|
Calculate the rolling sum. |
|
Calculate the rolling variance. |
Create DataFrames
|
|
|
|
|
|
|
Read a Parquet file into a Dask DataFrame |
|
|
|
Create a dataframe from a set of JSON files |
|
Read dataframe from ORC file(s) |
|
Read SQL database table into a DataFrame. |
|
Read SQL query into a DataFrame. |
|
Read SQL query or database table into a DataFrame. |
|
Read any sliceable array into a Dask DataFrame |
|
Create a Dask DataFrame from a Dask Array. |
|
Create Dask DataFrame from many Dask Delayed objects |
|
Create a DataFrame collection from a custom function map. |
|
Construct a Dask DataFrame from a Pandas DataFrame |
|
Construct a Dask DataFrame from a Python Dictionary |
Store DataFrames
|
Store Dask DataFrame to CSV files |
|
Store Dask DataFrame to Parquet files |
|
Store Dask DataFrame to Hierarchical Data Format (HDF) files |
|
Create Dask Array from a Dask DataFrame |
|
Store Dask DataFrame to a SQL table |
|
Write dataframe into JSON text files |
Convert DataFrames
|
Create a Dask Bag from a Series |
|
Convert a dask DataFrame to a dask array. |
|
Convert into a list of |
Convert from/to legacy DataFrames
|
Convert to a legacy dask-dataframe collection |
|
Create a dask-expr collection from a legacy dask-dataframe collection |
Reshape DataFrames
|
Convert categorical variable into dummy/indicator variables. |
|
Create a spreadsheet-style pivot table as a DataFrame. |
|
Concatenate DataFrames
|
Merge the DataFrame with another DataFrame |
|
Concatenate DataFrames along rows. |
|
Merge DataFrame or named Series objects with a database-style join. |
|
Perform a merge by key distance. |
Resampling
|
Aggregate using one or more operations |
|
Aggregate using one or more operations over the specified axis. |
Compute count of group, excluding missing values. |
|
Compute the first entry of each column within each group. |
|
Compute the last entry of each column within each group. |
|
Compute max value of group. |
|
Compute mean of groups, excluding missing values. |
|
Compute median of groups, excluding missing values. |
|
Compute min value of group. |
|
Return number of unique elements in the group. |
|
Compute open, high, low and close values of a group, excluding missing values. |
|
Compute prod of group values. |
|
Return value at the given quantile. |
|
Compute standard error of the mean of groups, excluding missing values. |
|
Compute group sizes. |
|
Compute standard deviation of groups, excluding missing values. |
|
Compute sum of group values. |
|
Compute variance of groups, excluding missing values. |
Dask Metadata
|
This method creates meta-data based on the type of |
Query Planning and Optimization
|
Create a graph representation of the Expression. |
|
Visualize the expression or task graph |
|
Outputs statistics about every node in the expression. |
Other functions
|
Compute several dask collections at once. |
|
Apply Python function on each DataFrame partition. |
|
Apply a function to each partition, sharing rows with adjacent partitions. |
Convert argument to datetime. |
|
|
Convert argument to a numeric type. |
Convert argument to timedelta. |