Changelog

Note

This is not exhaustive. For an exhaustive list of changes, see the git log.

2025.2.0

Highlights

This release includes a critical fix for a deadlock that can arise when seceded tasks are rescheduled, or cancelled and resubmitted, e.g. due to a worker being lost.

See distributed#8991 by Hendrik Makait for more details.

Additional changes

2025.1.0

Highlights

Legacy Dask DataFrame Implementation removed

This release drops the legacy Dask DataFrame implementation. The API with query planning is now the only available Dask DataFrame implementation.

This enforces the deprecation of the configuration:

dask.config.set({"dataframe.query-planning": False})

Dask-Expr was merged into the dask package as well as the dask/dask repository. It is no longer necessary to install dask-expr separately.

Reducing Memory Pressure for Xarray Workloads

In 2022, Dask introduced a mechanism called root task queuing. This mechanism allows Dask to detect tasks that read data from storage and schedule them defensively to avoid memory pressure on the cluster caused by overproducing these tasks. The underlying mechanism was very fragile and failed for specific types of computations, like opening multiple Zarr stores or loading a large number of NetCDF files.

The recent changes in Dask’s task graph representation allow for more robust detection of root tasks. This change makes the detection mechanism independent of the workload running and is especially beneficial for Xarray workloads.

This results in significantly better memory stability and a reduced memory footprint for workloads where root task detection previously failed, and it makes the expected memory profile deterministic and independent of the topology of the task graph.

2024.12.1

Highlights

Improved scheduler responsiveness for large task graphs

This release reduces the number of Python object references related to tracking tasks by the Dask scheduler. This increases scheduler responsiveness by reducing the time needed to run garbage collection on the scheduler.

See dask#8958, dask#11608, dask#11600, dask#11598, dask#11597, and distributed#8963 from Hendrik Makait for more details.

Additional changes

2024.12.0

Highlights

Python 3.13 Support

This release adds support for Python 3.13. Dask now supports Python 3.10-3.13.

See dask#11456 and distributed#8904 from Patrick Hoefler and James Bourbeau for more details.

Additional changes

2024.11.2

Note

Versions 2024.11.0 and 2024.11.1 included a critical performance regression and should be skipped by every user.

Highlights

Legacy Dask DataFrame Deprecated

This release deprecates the legacy Dask DataFrame implementation. The old implementation will be removed completely in a future release. Users are encouraged to switch to the new implementation now and to report any issues they are facing.

Users are also encouraged to check that they are only importing functions from dask.dataframe and not from any of its submodules.

New quantile methods for Dask Array API

Dask Array added new quantile and nanquantile methods. Previously, Dask dispatched to the NumPy implementation, which held the GIL for long stretches. This caused large slowdowns on workers with more than one thread and could lead to runtimes of over 200 s per chunk.

The new quantile implementation avoids these problems and reduces the runtime to around 1 s per chunk, independent of the number of threads.
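
For illustration, a minimal sketch assuming the new functions mirror the NumPy signature (the array shape and chunks below are made up):

import dask.array as da

x = da.random.random((1_000, 10_000), chunks=(1_000, 1_000))
# Per-column quantiles computed with the new implementation;
# nanquantile ignores NaN values the same way NumPy does
q = da.nanquantile(x, [0.1, 0.5, 0.9], axis=0)
q.compute()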

Consistent chunksize in Xarray rolling-construct

Using Xarray's rolling(...).construct(...) with Dask Arrays led to very large chunksizes that rarely fit into memory on a single worker.

The underlying operation is a view onto the smaller NumPy array, but anything that triggers a copy of the data leads to very large memory usage.

import xarray as xr
import dask.array as da

arr = xr.DataArray(
    da.ones((93504, 721, 1440), chunks=("auto", -1, -1)),
    dims=["time", "lat", "longitude"],
)   # Initial chunks are ~128 MiB
arr.rolling(time=30).construct("window_dim")

Previously

Individual chunks are exploding to 10 GiB, likely causing out of memory errors.

Now

Dask will now automatically split individual chunks into multiple chunks that keep roughly the same chunksize as the input, within a small tolerance.

Individual chunks are now roughly the same size

Improved efficiency of map overlap

map_overlap now creates smaller and more efficient task graphs.

The previous version injected many tasks that weren't necessary, increasing the number of tasks by a factor of 2-10x over what was actually needed. This put a lot of stress on the scheduler.
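
For context, a short self-contained example of the affected API, following the standard overlapping-computation pattern (the data and stencil are arbitrary):

import dask.array as da
import numpy as np

def derivative(block):
    # Each task sees its chunk plus one element of overlap on each side
    return block - np.roll(block, 1)

x = da.from_array(np.array([1, 1, 2, 3, 3, 3, 2, 1, 1]), chunks=5)
y = x.map_overlap(derivative, depth=1, boundary=0)
y.compute()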

Consistent chunksizes for Einstein summation

Einstein summation historically led to very large chunksizes when applied to more than one Dask Array. This behavior was inherited from NumPy but led to out-of-memory errors on workers:

import dask.array as da
arr = da.random.random((1024, 64, 64, 64, 64), chunks=(256, 16, 16, 16, 16)) # Initial chunks are 128 MiB
result = da.einsum("aijkl,amnop->ijklmnop", arr, arr)

Previously

Individual chunks are exploding to 32 GiB, very likely causing out of memory errors.

Now

The operation keeps individual chunksizes the same.

Individual chunks are now roughly the same size

Additional changes

2024.10.0

Notable Changes

  • Zarr-Python 3 compatibility (dask#11388)

  • Avoid exponentially increasing taskgraph in overlap (dask#11423)

  • Ensure numba tokenization does not use slow pickle path (dask#11419)

Additional changes

2024.9.1

Highlights

Improved adaptive scaling resilience

Adaptive scaling clusters now recover from spurious errors during scaling.

See distributed#8871 by Hendrik Makait for more details.

Additional changes

2024.9.0

Highlights

Bump Bokeh minimum version to 3.1.0

bokeh>=3.1.0 is now required for diagnostics and the distributed cluster dashboard.

See dask#11375 and distributed#8861 by James Bourbeau for more details.

Introduce new Task class

Add a Task class to replace tuples for task specification.

See dask#11248 by Florian Jetter for more details.

Additional changes

2024.8.2

Highlights

Automatic selection of rechunking method

To enable users to rechunk data at larger scales than before, Dask now automatically chooses an appropriate rechunking method when rechunking on a cluster. This requires no additional configuration and is enabled by default.

Specifically, Dask chooses between task-based and P2P rechunking. While task-based rechunking has been the previous default, P2P rechunking is beneficial when rechunking requires almost all-to-all communication between the old and new chunks, e.g., when changing between spatial and temporal chunking. In these cases, P2P rechunking offers constant memory usage and creates smaller task graphs. As a result, it works for cases where task-based rechunking would have previously failed.

To disable automatic selection, users can select their preferred method via the configuration

import dask.config
# Choose either "tasks" or "p2p"
dask.config.set({"array.rechunk.method": "tasks"})

or when rechunking

import dask.array as da
arr = da.random.random(size=(1000, 1000, 365), chunks=(-1, -1, "auto"))
# Choose either "tasks" or "p2p"
arr = arr.rechunk(("auto", "auto", -1), method="tasks")

See dask#11337 by Hendrik Makait for more details.

New shuffle API for Dask Arrays

Dask added a shuffle API to Dask Arrays. This API allows shuffling the data along a single dimension and ensures that every group of elements along this dimension ends up in exactly one chunk. This is a very useful operation for GroupBy-Map patterns in Xarray. See shuffle() for more information and the API signature.

See dask#11267, dask#11311 and dask#11326 by Patrick Hoefler for more details.
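
A minimal sketch of the intent, assuming an indexer made of groups of positional indices along the given axis (the data and groups below are made up; see shuffle() for the authoritative signature):

import dask.array as da

x = da.random.random((6, 10), chunks=(3, 10))
# Every inner list forms one group; all elements of a group end up in one chunk
groups = [[0, 3], [1, 4], [2, 5]]
y = x.shuffle(groups, axis=0)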

New blockwise_reshape API for Dask Arrays

The new reshape_blockwise() enables an embarrassingly parallel reshaping operation for cases where you don't care about the order of the underlying array. It doesn't trigger a rechunking operation under the hood anymore. This is useful if a reduction is applied to the array afterwards or if the reshaping is only temporary.

import dask.array as da
from dask.array import reshape_blockwise

arr = da.random.random(size=(100, 100, 48_000), chunks=(1000, 100, 83))
result = reshape_blockwise(arr, (10_000, 48_000))
result.sum()

# or: do something that preserves the shape of each chunk

result = reshape_blockwise(result, (100, 100, 48_000), chunks=arr.chunks)

Dask will automatically calculate the resulting chunks if the number of dimensions is reduced, but you have to specify the resulting chunks if the number of dimensions is increased.

Reshaping a Dask Array often creates a very complicated computation with rechunk operations in between, because Dask respects the C ordering of the array by default. This ensures that the resulting Dask Array is returned in the same order as the corresponding NumPy array. However, it can lead to very inefficient computations. The blockwise reshape is a lot more efficient than the default implementation if you don't care about the order.

Warning

Blockwise reshape operations are more efficient than the default, but they will return an array that is ordered differently. Use with care!

See dask#11328 by Patrick Hoefler for more details.

Multidimensional positional indexing keeping chunksizes consistent

Indexing a Dask Array with vindex() previously created a single output chunk along the dimensions that were indexed. vindex is commonly used in Xarray when indexing multiple dimensions in a single step, e.g.:

import dask.array as da
import xarray as xr

arr = xr.DataArray(
    da.random.random((100, 100, 100), chunks=(5, 5, 50)),
    dims=["a", "b", "c"],
)
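
The selection step itself is omitted above; a hypothetical example of indexing two dimensions in a single step (the index values and the "points" dimension name are made up), which Xarray dispatches to Dask's vindex:

import numpy as np

idx = xr.DataArray(np.arange(0, 100, 2), dims="points")
# Vectorized (pointwise) indexing over the "a" and "b" dimensions at once
subset = arr.isel(a=idx, b=idx)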

Previously, this put the indexed dimensions into a single chunk:

Size of each individual chunk increases to over 1GB

Dask now uses an improved algorithm that ensures that the chunksizes are kept consistent:

Size of each individual chunk stays consistent

See dask#11330 by Patrick Hoefler for more details.

Additional changes

2024.8.1

Highlights

Improve output chunksizes for reshaping Dask Arrays

Reshaping a Dask Array oftentimes squashed the dimensions to be reshaped into a single chunk. This caused very large output chunks and, subsequently, a lot of out-of-memory errors and performance issues.

import dask.array as da

arr = da.ones(shape=(1000, 100, 48_000), chunks=(1000, 100, 83))
arr.reshape(1000, 100, 4, 12_000)

Previously, this put the last dimension into a single chunk of size 12_000.

Size of each individual chunk increases to over 1GB

The new algorithm ensures that the chunksize of input and output stays the same. This avoids large increases in chunksize and fragmentation of chunks.

Size of each individual chunk stays the same

Improve scheduling efficiency for Xarray Rechunk-GroupBy-Reduce patterns

The scheduler previously created an inefficient execution graph for Xarray GroupBy-Reduction patterns that use the cohorts strategy:

import xarray as xr
from xarray.groupers import TimeResampler

arr = xr.open_zarr(...)
arr.chunk(time=TimeResampler("ME")).groupby("time.month").mean()

An issue in the algorithm that creates the execution order of the task graph led to an inefficient execution strategy that accumulated a lot of unnecessary memory on the cluster. The improvement is very similar to the previous ordering improvement in 2024.8.0.

Drop support for Python 3.9

This release drops support for Python 3.9 in accordance with NEP 29. Python 3.10 is now the required minimum version to run Dask.

See dask#11245 and distributed#8793 by Patrick Hoefler for more details.

Additional changes

2024.8.0

Highlights

Improve efficiency and performance of slicing with positional indexers

Performance improvement for slicing a Dask Array with a positional indexer. Random access patterns are now more stable and produce easier-to-use results.

x[slice(None), [1, 1, 3, 6, 3, 4, 5]]
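
A self-contained version of this pattern (shape and chunks chosen arbitrarily for illustration):

import dask.array as da

x = da.random.random((8, 100), chunks=(4, 25))
# All rows, plus a repeated and unsorted selection of columns;
# chunk sizes along the indexed axis now stay close to the input's
y = x[slice(None), [1, 1, 3, 6, 3, 4, 5]]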

Using a positional indexer was previously prone to drastically increasing the number of output chunks and generating a very large task graph. This has been fixed with a more efficient algorithm.

The new algorithm keeps the chunksizes along the indexed axis the same to avoid fragmentation of chunks or a large increase in chunksize.

See dask#11262 and dask#11267 by Patrick Hoefler for more details and performance benchmarks.

Improve scheduling efficiency for Xarray GroupBy-Reduce patterns

The scheduler previously created an inefficient execution graph for Xarray GroupBy-Reduction patterns like:

import xarray as xr

arr = xr.open_zarr(...)
arr.groupby("time.month").mean()

An issue in the algorithm that creates the execution order of the task graph led to an inefficient execution strategy that accumulated a lot of unnecessary memory on the cluster.

Memory keeps accumulating on the cluster when running an embarrassingly parallel operation.

The operation itself is embarrassingly parallel. Using the proper execution strategy, the scheduler can now execute the operation with constant memory, avoiding spilling and allowing us to scale to larger datasets.

The same operation now runs with constant memory usage for the whole computation and can scale to bigger datasets.

See distributed#8818 by Patrick Hoefler for more details and examples.

Additional changes

2024.7.1

Highlights

More resilient distributed lock

distributed.Lock is now resilient to worker failures. Previously, deadlocks were possible in cases where a lock-holding worker was lost and/or failed to release the lock due to an error.

See distributed#8770 by Florian Jetter for more details.
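
For reference, a minimal sketch of the affected API (assumes a running cluster; the lock name is arbitrary):

from distributed import Client, Lock

client = Client()  # placeholder: connect to or start a cluster
lock = Lock("shared-resource")

with lock:
    # If the worker or client holding the lock dies here, the lock is now
    # released instead of blocking all other waiters forever.
    ...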

Additional changes

2024.7.0

Highlights

Drop support for pandas 1.x

This release drops support for pandas<2. pandas 2.0 is now the required minimum version to run Dask DataFrame.

The minimum version of partd was also raised to 1.4.0. Versions before 1.4 are not compatible with pandas 2.

See dask#11199 by Patrick Hoefler for more details.

Publish-subscribe APIs deprecated

distributed.Pub and distributed.Sub have been deprecated and will be removed in a future release. Please switch to distributed.Client.log_event() and distributed.Worker.log_event() instead.

See distributed#8724 by Hendrik Makait for more details.
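
A minimal sketch of the suggested replacement (assumes an existing client; the topic name and payload are made up):

from distributed import Client

client = Client()  # placeholder: connect to an existing cluster
# Publish a structured event to a named topic ...
client.log_event("app-progress", {"stage": "load", "partitions_done": 10})
# ... and retrieve the events recorded for that topic later
events = client.get_events("app-progress")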

Additional changes

2024.6.2

This is a patch release to update an issue with dask and distributed version pinning in the 2024.6.1 release.

Additional changes

2024.6.1

Highlights

This release includes a critical fix that fixes a deadlock that can arise when dependencies of root-ish tasks are rescheduled, e.g. due to a worker being lost.

See distributed#8703 by Hendrik Makait for more details.

Additional changes

2024.6.0

Highlights

memmap array tokenization

Tokenizing memmap arrays will now avoid materializing the array into memory.

See dask#11161 by Florian Jetter for more details.

Additional changes

2024.5.2

This release primarily contains minor bug fixes.

Additional changes

2024.5.1

Highlights

NumPy 2.0 support

This release contains compatibility updates for the upcoming NumPy 2.0 release.

See dask#11096 by Benjamin Zaitlen and dask#11106 by James Bourbeau for more details.

Increased Zarr store support

This release adds support for MutableMapping-backed Zarr stores such as zarr.storage.DirectoryStore.

See dask#10422 by Greg M. Fleishman for more details.

Additional changes

2024.5.0

Highlights

This release primarily contains minor bugfixes.

Additional changes

2024.4.2

Highlights

Trivial Merge Implementation

The query optimizer will inspect queries to determine whether a merge(...) or groupby(...).apply(...) requires a shuffle. A shuffle can be avoided if the DataFrame was shuffled on the same columns in a previous step, without any operations in between that change the partitioning layout or the relevant values in each partition.

>>> result = df.merge(df2, on="a")
>>> result = result.merge(df3, on="a")

The query optimizer will identify that result was previously shuffled on "a" as well, and thus only shuffle df3 in the second merge operation before doing a blockwise merge.

Auto-partitioning in read_parquet

The query optimizer will automatically repartition datasets read from Parquet files if individual partitions are too small. This reduces the number of partitions and, consequently, also the size of the task graph.

The optimizer aims to produce partitions of at least 75 MB and will combine multiple files if necessary to reach this threshold. The value can be configured with:

>>> dask.config.set({"dataframe.parquet.minimum-partition-size": 100_000_000})

The value is given in bytes. The default threshold is relatively conservative to avoid memory issues on worker nodes with a relatively small amount of memory per thread.

Additional changes

2024.4.1

This is a minor bugfix release that fixes an error when importing dask.dataframe with Python 3.11.9.

See dask#11035 and dask#11039 from Richard (Rick) Zamora for details.

Additional changes

2024.4.0

Highlights

Query planning fixes

This release contains a variety of bugfixes in Dask DataFrame’s new query planner.

GPU metric dashboard fixes

GPU memory and utilization dashboard functionality has been restored. Previously these plots were unintentionally left blank.

See distributed#8572 from Benjamin Zaitlen for details.

Additional changes

2024.3.1

This is a minor release that primarily demotes an exception to a warning if dask-expr is not installed when upgrading.

Additional changes

2024.3.0

Released on March 11, 2024

Highlights

Query planning

This release enables query planning by default for all users of dask.dataframe.

The query planning functionality represents a rewrite of the DataFrame using dask-expr. This is a drop-in replacement and we expect that most users will not have to adjust any of their code. Any feedback can be reported on the Dask issue tracker or on the query planning feedback issue.

If you are encountering any issues, you can still opt out by setting

>>> import dask
>>> dask.config.set({'dataframe.query-planning': False})

Sunset of Pandas 1.X support

The new query planning backend requires at least pandas 2.0. This pandas version will be installed automatically if you install from conda or if you install dask[complete] or dask[dataframe] with pip.

The legacy DataFrame implementation still supports pandas 1.x if you install dask without extras.

Additional changes

2024.2.1

Released on February 23, 2024

Highlights

Allow silencing dask.DataFrame deprecation warning

The last release contained a DeprecationWarning that alerts users to an upcoming switch of dask.dataframe to the new backend with support for query planning (see also dask#10934).

This DeprecationWarning is triggered upon import of the dask.dataframe module, and the community raised concerns about it being too verbose.

It is now possible to silence this warning:

# via Python
>>> dask.config.set({'dataframe.query-planning-warning': False})

# via CLI
dask config set dataframe.query-planning-warning False

See dask#10936 and dask#10925 from Miles for details.

More robust distributed scheduler for rare key collisions

Blockwise fusion optimization can cause a task key collision that is not handled properly by the distributed scheduler (see dask#9888). Users typically notice this as one of various internal exceptions that cause a system deadlock or critical failure. While the root cause could not be fixed, the scheduler now implements a mechanism that should mitigate most occurrences and issues a warning if the issue is detected.

See distributed#8185 from crusaderky and Florian Jetter for details.

Over the course of this work, various improvements to tokenization have been implemented. See dask#10913, dask#10884, dask#10919, dask#10896 and primarily dask#10883 from crusaderky for more details.

More robust adaptive scaling on large clusters

Adaptive scaling could previously lose data during downscaling if many tasks had to be moved. This typically, but not exclusively, occurred on large clusters and would manifest as recomputation of tasks; it could cause clusters to oscillate between upscaling and downscaling without ever finishing.

See distributed#8522 from crusaderky for more details.

Additional changes

2024.2.0

Released on February 9, 2024

Highlights

Deprecate Dask DataFrame implementation

The current Dask DataFrame implementation is deprecated. In a future release, Dask DataFrame will use a new implementation that contains several improvements, including logical query planning. The user-facing DataFrame API will remain unchanged.

The new implementation is already available and can be enabled by installing the dask-expr library:

$ pip install dask-expr

and turning the query planning option on:

>>> import dask
>>> dask.config.set({'dataframe.query-planning': True})
>>> import dask.dataframe as dd

API documentation for the new implementation is available at https://docs.dask.org/en/stable/dataframe-api.html

Any feedback can be reported on the Dask issue tracker https://github.com/dask/dask/issues

See dask#10912 from Patrick Hoefler for details.

Improved tokenization

This release contains several improvements to Dask’s object tokenization logic. More objects now produce deterministic tokens, which can lead to improved performance through caching of intermediate results.

See dask#10898, dask#10904, dask#10876, dask#10874, and dask#10865 from crusaderky for details.
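
For context, tokens can be inspected directly with dask.base.tokenize; deterministic tokenization means equal inputs produce equal keys, which is what allows intermediate results to be reused:

import numpy as np
from dask.base import tokenize

# Equal inputs now reliably produce equal tokens
assert tokenize(np.arange(5)) == tokenize(np.arange(5))
assert tokenize({"a": 1, "b": [1, 2, 3]}) == tokenize({"a": 1, "b": [1, 2, 3]})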

Additional changes

2024.1.1

Released on January 26, 2024

Highlights

Pandas 2.2 and Scipy 1.12 support

This release contains compatibility updates for the latest pandas and scipy releases.

See dask#10834, dask#10849, dask#10845, and distributed#8474 from crusaderky for details.

Deprecations

Additional changes

2024.1.0

Released on January 12, 2024

Highlights

Partial rechunks within P2P

P2P rechunking now utilizes the relationships between input and output chunks. For situations that do not require all-to-all data transfer, this may significantly reduce the runtime and memory/disk footprint. It also enables task culling.

See distributed#8330 from Hendrik Makait for details.

Fastparquet engine deprecated

The fastparquet Parquet engine has been deprecated. Users should migrate to the pyarrow engine by installing PyArrow and removing engine="fastparquet" in read_parquet or to_parquet calls.

See dask#10743 from crusaderky for details.
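
A sketch of the migration (the paths are placeholders):

import dask.dataframe as dd

# Before (deprecated):
# df = dd.read_parquet("data/*.parquet", engine="fastparquet")

# After: drop the keyword or request pyarrow explicitly
df = dd.read_parquet("data/*.parquet", engine="pyarrow")
df.to_parquet("output/", engine="pyarrow")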

Improved serialization for arbitrary data

This release improves serialization robustness for arbitrary data. Previously there were some cases where serialization could fail for data that is not msgpack-serializable. In those cases we now fall back to using pickle.

See dask#8447 from Hendrik Makait for details.

Additional deprecations

Additional changes

2023.12.1

Released on December 15, 2023

Highlights

Logical Query Planning now available for Dask DataFrames

Dask DataFrames are now much more performant by using a logical query planner. This feature is currently off by default, but can be turned on with:

dask.config.set({"dataframe.query-planning": True})

You also need to have dask-expr installed:

pip install dask-expr

We’ve seen promising performance improvements so far, see this blog post and these regularly updated benchmarks for more information. A more detailed explanation of how the query optimizer works can be found in this blog post.

This feature is still under active development and the API isn’t stable yet, so breaking changes can occur. We expect to make the query optimizer the default early next year.

See dask#10634 from Patrick Hoefler for details.

Dtype inference in read_parquet

read_parquet will now infer the Arrow types pa.date32(), pa.date64() and pa.decimal() as an ArrowDtype in pandas. These dtypes are backed by the original Arrow array and thus avoid the conversion to NumPy object dtype. Additionally, read_parquet will no longer infer nested and binary types as strings; they will be stored in NumPy object arrays instead.

See dask#10698 and dask#10705 from Patrick Hoefler for details.

Scheduling improvements to reduce memory usage

This release includes a major rewrite to a core part of our scheduling logic. It includes a new approach to the topological sorting algorithm in dask.order, which determines the order in which tasks are run. Improper ordering is known to be a major contributor to excessive memory pressure on clusters.

Updates in this release fix a couple of performance regressions that were introduced in the release 2023.10.0 (see dask#10535). Generally, computations should now be much more eager to release data if it is no longer required in memory.

See dask#10660, dask#10697 from Florian Jetter for details.

Improved P2P-based merging robustness and performance

This release contains several updates that fix a possible deadlock introduced in 2023.9.2 and improve the robustness of P2P-based merging when the cluster is dynamically scaling up.

See distributed#8415, distributed#8416, and distributed#8414 from Hendrik Makait for details.

Removed disabling pickle option

The distributed.scheduler.pickle configuration option is no longer supported. As of the 2023.4.0 release, pickle is used to transmit task graphs, so can no longer be disabled. We now raise an informative error when distributed.scheduler.pickle is set to False.

See distributed#8401 from Florian Jetter for details.

Additional changes

2023.12.0

Released on December 1, 2023

Highlights

PipInstall restart and environment variables

The distributed.PipInstall plugin now has more robust restart logic and also supports environment variables.

Below shows how users can use the distributed.PipInstall plugin and a TOKEN environment variable to securely install a package from a private repository:

from dask.distributed import PipInstall

plugin = PipInstall(packages=["private_package@git+https://${TOKEN}@github.com/dask/private_package.git"])
client.register_plugin(plugin)

See distributed#8374, distributed#8357, and distributed#8343 from Hendrik Makait for details.

Bokeh 3.3.0 compatibility

This release contains compatibility updates for using bokeh>=3.3.0 with proxied Dask dashboards. Previously the contents of dashboard plots wouldn’t be displayed.

See distributed#8347 and distributed#8381 from Jacob Tomlinson for details.

Additional changes

2023.11.0

Released on November 10, 2023

Highlights

Zero-copy P2P Array Rechunking

Users should see significant performance improvements when using in-memory P2P array rechunking. This is due to no longer copying underlying data buffers.

Below shows a simple example where we compare performance of different rechunking methods.

import dask
import dask.array as da

shape = (30_000, 6_000, 150)  # 201.17 GiB
input_chunks = (60, -1, -1)   # 411.99 MiB
output_chunks = (-1, 6, -1)   # 205.99 MiB

with dask.config.set({
    "array.rechunk.method": "p2p",
    "distributed.p2p.disk": True,
}):
    (
        da.random.random(shape, chunks=input_chunks)
        .rechunk(output_chunks)
        .sum()
        .compute()
    )

A comparison of rechunking performance between the different methods tasks, p2p with disk and p2p without disk on different cluster sizes. The graph shows that p2p without disk is up to 60% faster than the default tasks based approach.

See distributed#8282, distributed#8318, distributed#8321 from crusaderky and (distributed#8322) from Hendrik Makait for details.

Deprecating PyArrow <14.0.1

pyarrow<14.0.1 usage is deprecated starting in this release. It’s recommended for all users to upgrade their version of pyarrow or install pyarrow-hotfix. See this CVE for full details.

See dask#10622 from Florian Jetter for details.

Improved PyArrow filesystem for Parquet

Using filesystem="arrow" when reading Parquet datasets now properly inferrs the correct cloud region when accessing remote, cloud-hosted data.

See dask#10590 from Richard (Rick) Zamora for details.

Improve Type Reconciliation in P2P Shuffling

See distributed#8332 from Hendrik Makait for details.

Additional changes

2023.10.1

Released on October 27, 2023

Highlights

Python 3.12

This release adds official support for Python 3.12.

See dask#10544 and distributed#8223 from Thomas Grainger for details.

Additional changes

2023.10.0

Released on October 13, 2023

Highlights

Reduced memory pressure for multi array reductions

This release contains major updates to Dask’s task graph scheduling logic. The updates here significantly reduce memory pressure on array reductions. We anticipate this will have a strong impact on the array computing community.

See dask#10535 from Florian Jetter for details.

Improved P2P shuffling robustness

There are several updates (listed below) that make P2P shuffling much more robust and less likely to fail.

See distributed#8262, distributed#8264, distributed#8242, distributed#8244, and distributed#8235 from Hendrik Makait and distributed#8124 from Charles Blackmon-Luca for details.

Reduced scheduler CPU load for large graphs

Users should see reduced CPU load on their scheduler when computing large task graphs.

See distributed#8238 and dask#10547 from Florian Jetter and distributed#8240 from crusaderky for details.

Additional changes

2023.9.3

Released on September 29, 2023

Highlights

Restore previous configuration override behavior

The 2023.9.2 release introduced an unintentional breaking change in how configuration options are overridden in dask.config.get with the override_with= keyword (see dask#10519). This release restores the previous behavior.

See dask#10521 from crusaderky for details.

Complex dtypes in Dask Array reductions

This release includes improved support for using common reductions in Dask Array (e.g. var, std, moment) with complex dtypes.

See dask#10009 from wkrasnicki for details.

Additional changes

2023.9.2

Released on September 15, 2023

Highlights

P2P shuffling now raises when outdated PyArrow is installed

Previously the default shuffling method would silently fall back from P2P to task-based shuffling if an older version of pyarrow was installed. Now we raise an informative error with the minimum required pyarrow version for P2P instead of silently falling back.

See dask#10496 from Hendrik Makait for details.

Deprecation cycle for admin.traceback.shorten

The 2023.9.0 release modified the admin.traceback.shorten configuration option without introducing a deprecation cycle. This resulted in failures to create Dask clusters in some cases. This release introduces a deprecation cycle for this configuration change.

See dask#10509 from crusaderky for details.

Additional changes

2023.9.1

Released on September 6, 2023

Note

This is a hotfix release that fixes a P2P shuffling bug introduced in the 2023.9.0 release (see dask#10493).

Enhancements

Bug Fixes

Maintenance

2023.9.0

Released on September 1, 2023

Bug Fixes

Documentation

Maintenance

2023.8.1

Released on August 18, 2023

Enhancements

Bug Fixes

  • Fix ValueError when running to_csv in append mode with single_file as True (dask#10441) Ben

Maintenance

2023.8.0

Released on August 4, 2023

Enhancements

Documentation

Maintenance

2023.7.1

Released on July 20, 2023

Note

This release updates Dask DataFrame to automatically convert text data using object data types to string[pyarrow] if pandas>=2 and pyarrow>=12 are installed.

This should result in significantly reduced memory consumption and increased computation performance in many workflows that deal with text data.

You can disable this change by setting the dataframe.convert-string configuration value to False with

dask.config.set({"dataframe.convert-string": False})

Enhancements

Bug Fixes

2023.7.0

Released on July 7, 2023

Enhancements

Bug Fixes

Documentation

Maintenance

2023.6.1

Released on June 26, 2023

Enhancements

Bug Fixes

Deprecations

Maintenance

2023.6.0

Released on June 9, 2023

Enhancements

Bug Fixes

Documentation

Maintenance

2023.5.1

Released on May 26, 2023

Note

This release drops support for Python 3.8. As of this release Dask supports Python 3.9, 3.10, and 3.11. See this community issue for more details.

Enhancements

Bug Fixes

Documentation

Maintenance

2023.5.0

Released on May 12, 2023

Enhancements

Bug Fixes

Documentation

Maintenance

2023.4.1

Released on April 28, 2023

Enhancements

Bug Fixes

Documentation

Maintenance

2023.4.0

Released on April 14, 2023

Enhancements

Bug Fixes

Deprecations

Documentation

Maintenance

2023.3.2

Released on March 24, 2023

Enhancements

Bug Fixes

Documentation

Maintenance

2023.3.1

Released on March 10, 2023

Enhancements

Bug Fixes

Documentation

Maintenance

2023.3.0

Released on March 1, 2023

Bug Fixes

Documentation

Maintenance

2023.2.1

Released on February 24, 2023

Note

This release changes the default DataFrame shuffle algorithm to p2p to improve stability and performance. Learn more here and please provide any feedback on this discussion.

If you encounter issues with this new algorithm, please see the documentation for more information, and how to switch back to the old mode.

Enhancements

Bug Fixes

Documentation

Maintenance

2023.2.0

Released on February 10, 2023

Enhancements

Bug Fixes

Documentation

Maintenance

2023.1.1

Released on January 27, 2023

Enhancements

Bug Fixes

Documentation

Maintenance

2023.1.0

Released on January 13, 2023

Enhancements

Documentation

Maintenance

2022.12.1

Released on December 16, 2022

Enhancements

Bug Fixes

Documentation

Maintenance

2022.12.0

Released on December 2, 2022

Enhancements

Bug Fixes

Maintenance

2022.11.1

Released on November 18, 2022

Enhancements

Maintenance

2022.11.0

Released on November 15, 2022

Enhancements

Bug Fixes

Documentation

Maintenance

2022.10.2

Released on October 31, 2022

This was a hotfix and has no changes in this repository. The necessary fix was in dask/distributed, but we decided to bump this version number for consistency.

2022.10.1

Released on October 28, 2022

Enhancements

Bug Fixes

Documentation

Maintenance

2022.10.0

Released on October 14, 2022

New Features

Enhancements

Bug Fixes

Documentation

Maintenance

2022.9.2

Released on September 30, 2022

Enhancements

Documentation

Maintenance

2022.9.1

Released on September 16, 2022

New Features

Enhancements

Bug Fixes

Deprecations

  • Allow split_out to be None, which then defaults to 1 in groupby().aggregate() (dask#9491) Ian Rose

Documentation

Maintenance

2022.9.0

Released on September 2, 2022

Enhancements

Bug Fixes

Documentation

Maintenance

2022.8.1

Released on August 19, 2022

New Features

Enhancements

Bug Fixes

Documentation

Maintenance

2022.8.0

Released on August 5, 2022

Enhancements

Bug Fixes

Documentation

Maintenance

2022.7.1

Released on July 22, 2022

Enhancements

Bug Fixes

Documentation

Maintenance

2022.7.0

Released on July 8, 2022

Enhancements

Bug Fixes

Documentation

Maintenance

2022.6.1

Released on June 24, 2022

Enhancements

Bug Fixes

Deprecations

Documentation

Maintenance

2022.6.0

Released on June 10, 2022

Enhancements

Bug Fixes

Documentation

Maintenance

2022.05.2

Released on May 26, 2022

Enhancements

Documentation

Maintenance

2022.05.1

Released on May 24, 2022

New Features

Enhancements

Bug Fixes

Deprecations

Documentation

Maintenance

2022.05.0

Released on May 2, 2022

Highlights

This is a bugfix release for this issue.

Documentation

2022.04.2

Released on April 29, 2022

Highlights

This release includes several deprecations/breaking API changes to dask.dataframe.read_parquet and dask.dataframe.to_parquet (see the example after this list):

  • to_parquet no longer writes _metadata files by default. If you want to write a _metadata file, you can pass in write_metadata_file=True.

  • read_parquet now defaults to split_row_groups=False, which results in one Dask dataframe partition per parquet file when reading in a parquet dataset. If you’re working with large parquet files you may need to set split_row_groups=True to reduce your partition size.

  • read_parquet no longer calculates divisions by default. If you require read_parquet to return dataframes with known divisions, please set calculate_divisions=True.

  • read_parquet has deprecated the gather_statistics keyword argument. Please use the calculate_divisions keyword argument instead.

  • read_parquet has deprecated the require_extensions keyword argument. Please use the parquet_file_extension keyword argument instead.
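
A sketch of how the new keywords restore the previous behavior where it is still wanted (the paths are placeholders):

import dask.dataframe as dd

df = dd.read_parquet(
    "s3://bucket/dataset/",
    split_row_groups=True,       # one partition per row group, as before
    calculate_divisions=True,    # compute divisions from parquet statistics
)
df.to_parquet("s3://bucket/output/", write_metadata_file=True)  # keep writing _metadata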

New Features

Enhancements

Bug Fixes

Deprecations

Documentation

Maintenance

2022.04.1

Released on April 15, 2022

New Features

Enhancements

Bug Fixes

Deprecations

Documentation

Maintenance

2022.04.0

Released on April 1, 2022

Note

This is the first release with support for Python 3.10

New Features

Enhancements

Bug Fixes

Deprecations

Documentation

Maintenance

2022.03.0

Released on March 18, 2022

New Features

Enhancements

Bug Fixes

Deprecations

Documentation

Maintenance

2022.02.1

Released on February 25, 2022

New Features

Enhancements

Bug Fixes

Deprecations

Documentation

Maintenance

2022.02.0

Released on February 11, 2022

Note

This is the last release with support for Python 3.7

New Features

Enhancements

Bug Fixes

Deprecations

Documentation

Maintenance

2022.01.1

Released on January 28, 2022

New Features

Enhancements

Bug Fixes

Deprecations

Documentation

Maintenance

2022.01.0

Released on January 14, 2022

New Features

Enhancements

Bug Fixes

Deprecations

Documentation

Maintenance

2021.12.0

Released on December 10, 2021

New Features

Enhancements

Bug Fixes

Deprecations

Documentation

Maintenance

2021.11.2

Released on November 19, 2021

2021.11.1

Released on November 8, 2021

Patch release to update distributed dependency to version 2021.11.1.

2021.11.0

Released on November 5, 2021

2021.10.0

Released on October 22, 2021

2021.09.1

Released on September 21, 2021

2021.09.0

Released on September 3, 2021

2021.08.1

Released on August 20, 2021

2021.08.0

Released on August 13, 2021

2021.07.2

Released on July 30, 2021

Note

This is the last release with support for NumPy 1.17 and pandas 0.25. Beginning with the next release, NumPy 1.18 and pandas 1.0 will be the minimum supported versions.

2021.07.1

Released on July 23, 2021

2021.07.0

Released on July 9, 2021

2021.06.2

Released on June 22, 2021

2021.06.1

Released on June 18, 2021

2021.06.0

Released on June 4, 2021

2021.05.1

Released on May 28, 2021

2021.05.0

Released on May 14, 2021

2021.04.1

Released on April 23, 2021

2021.04.0

Released on April 2, 2021

2021.03.1

Released on March 26, 2021

2021.03.0

Released on March 5, 2021

Note

This is the first release with support for Python 3.9 and the last release with support for Python 3.6

2021.02.0

Released on February 5, 2021

2021.01.1

Released on January 22, 2021

2021.01.0

Released on January 15, 2021

2020.12.0

Released on December 10, 2020

Highlights

  • Switched to CalVer for versioning scheme.

  • Introduced new APIs for HighLevelGraph to enable sending high-level representations of task graphs to the distributed scheduler.

  • Introduced new HighLevelGraph layer objects including BasicLayer, Blockwise, BlockwiseIO, ShuffleLayer, and more.

  • Added support for applying custom Layer-level annotations like priority, retries, etc. with the dask.annotate context manager (see the sketch after this list).

  • Updated minimum supported version of pandas to 0.25.0 and NumPy to 1.15.1.

  • Support for the pyarrow.dataset API to read_parquet.

  • Several fixes to Dask Array’s SVD.
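
A small sketch of the annotations API mentioned above (the annotation values are arbitrary):

import dask
import dask.array as da

# Annotations attach scheduler hints to the layers created inside the block
with dask.annotate(priority=10, retries=3):
    x = da.ones((1000, 1000), chunks=(100, 100))

result = x.sum().compute()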

All changes

2.30.0 / 2020-10-06

Array

2.29.0 / 2020-10-02

Array

Bag

Core

DataFrame

Documentation

2.28.0 / 2020-09-25

Array

Core

DataFrame

2.27.0 / 2020-09-18

Array

Core

DataFrame

Documentation

2.26.0 / 2020-09-11

Array

Core

DataFrame

Documentation

2.25.0 / 2020-08-28

Core

DataFrame

Documentation

2.24.0 / 2020-08-22

Array

Dataframe

Core

2.23.0 / 2020-08-14

Array

Bag

Core

DataFrame

Documentation

2.22.0 / 2020-07-31

Array

Core

DataFrame

Documentation

2.21.0 / 2020-07-17

Array

Bag

Core

DataFrame

Documentation

2.20.0 / 2020-07-02

Array

DataFrame

Documentation

2.19.0 / 2020-06-19

Array

Core

DataFrame

Documentation

2.18.1 / 2020-06-09

Array

Core

Documentation

2.18.0 / 2020-06-05

Array

Bag

DataFrame

Documentation

2.17.2 / 2020-05-28

Core

DataFrame

2.17.1 / 2020-05-28

Array

Core

DataFrame

2.17.0 / 2020-05-26

Array

Bag

Core

DataFrame

Documentation

2.16.0 / 2020-05-08

Array

Core

DataFrame

Documentation

2.15.0 / 2020-04-24

Array

Core

DataFrame

Documentation

2.14.0 / 2020-04-03

Array

Core

DataFrame

Documentation

2.13.0 / 2020-03-25

Array

Bag

Core

DataFrame

Documentation

2.12.0 / 2020-03-06

Array

Core

DataFrame

Documentation

2.11.0 / 2020-02-19

Array

Bag

Core

DataFrame

Documentation

2.10.1 / 2020-01-30

2.10.0 / 2020-01-28

2.9.2 / 2020-01-16

Array

Core

DataFrame

Documentation

2.9.1 / 2019-12-27

Array

Core

DataFrame

Documentation

2.9.0 / 2019-12-06

Array

Core

DataFrame

Documentation

2.8.1 / 2019-11-22

Array

Core

DataFrame

Documentation

2.8.0 / 2019-11-14

Array

Bag

Core

DataFrame

Documentation

2.7.0 / 2019-11-08

This release drops support for Python 3.5

Array

Core

DataFrame

Documentation

2.6.0 / 2019-10-15

Core

DataFrame

Documentation

2.5.2 / 2019-10-04

Array

DataFrame

Documentation

2.5.0 / 2019-09-27

Core

DataFrame

Documentation

2.4.0 / 2019-09-13

Array

Core

DataFrame

Documentation

2.3.0 / 2019-08-16

Array

Bag

Core

DataFrame

Documentation

2.2.0 / 2019-08-01

Array

Bag

Core

DataFrame

Documentation

2.1.0 / 2019-07-08

Array

Core

DataFrame

Documentation

2.0.0 / 2019-06-25

Array

Core

DataFrame

Documentation

1.2.2 / 2019-05-08

Array

Bag

Core

DataFrame

Documentation

1.2.1 / 2019-04-29

Array

Core

DataFrame

Documentation

1.2.0 / 2019-04-12

Array

Core

DataFrame

Documentation

1.1.5 / 2019-03-29

Array

Core

DataFrame

Documentation

1.1.4 / 2019-03-08

Array

Core

DataFrame

Documentation

1.1.3 / 2019-03-01

Array

DataFrame

Documentation

1.1.2 / 2019-02-25

Array

Bag

DataFrame

Documentation

Core

1.1.1 / 2019-01-31

Array

DataFrame

Delayed

Documentation

Core

  • Work around psutil 5.5.0 not allowing pickling Process objects Janne Vuorela

1.1.0 / 2019-01-18

Array

DataFrame

Documentation

Core

1.0.0 / 2018-11-28

Array

DataFrame

Documentation

Core

0.20.2 / 2018-11-15

Array

Dataframe

Documentation

0.20.1 / 2018-11-09

Array

Core

Dataframe

Documentation

0.20.0 / 2018-10-26

Array

Bag

Core

Dataframe

Documentation

0.19.4 / 2018-10-09

Array

Bag

Dataframe

Core

Documentation

0.19.3 / 2018-10-05

Array

Bag

Dataframe

Core

Documentation

0.19.2 / 2018-09-17

Array

Core

Documentation

0.19.1 / 2018-09-06

Array

Dataframe

Documentation

0.19.0 / 2018-08-29

Array

DataFrame

Core

Docs

0.18.2 / 2018-07-23

Array

Bag

Dataframe

Delayed

Core

0.18.1 / 2018-06-22

Array

DataFrame

Core

0.18.0 / 2018-06-14

Array

Dataframe

Bag

Core

0.17.5 / 2018-05-16

Array

DataFrame

0.17.4 / 2018-05-03

Dataframe

0.17.3 / 2018-05-02

Array

DataFrame

Core

0.17.2 / 2018-03-21

Array

DataFrame

Bag

Core

0.17.1 / 2018-02-22

Array

DataFrame

Core

0.17.0 / 2018-02-09

Array

DataFrame

Bag

  • Document that the bag.map_partitions function may receive either a list or generator. (dask#3150) Nir

Core

0.16.1 / 2018-01-09

Array

DataFrame

Core

0.16.0 / 2017-11-17

This is a major release. It includes breaking changes, new protocols, and a large number of bug fixes.

Array

DataFrame

Core

0.15.4 / 2017-10-06

Array

  • da.random.choice now works with array arguments (dask#2781)

  • Support indexing in arrays with np.int (fixes regression) (dask#2719)

  • Handle zero dimension with rechunking (dask#2747)

  • Support -1 as an alias for “size of the dimension” in chunks (dask#2749)

  • Call mkdir in array.to_npy_stack (dask#2709)

DataFrame

  • Added the .str accessor to Categoricals with string categories (dask#2743)

  • Support int96 (spark) datetimes in parquet writer (dask#2711)

  • Pass on file scheme to fastparquet (dask#2714)

  • Support Pandas 0.21 (dask#2737)

Bag

  • Add tree reduction support for foldby (dask#2710)

Core

  • Drop s3fs from pip install dask[complete] (dask#2750)

0.15.3 / 2017-09-24

Array

DataFrame

  • Added Series.str[index] (dask#2634)

  • Allow the groupby by param to handle columns and index levels (dask#2636)

  • DataFrame.to_csv and Bag.to_textfiles now return the filenames to which they have written (dask#2655)

  • Fix combination of partition_on and append in to_parquet (dask#2645)

  • Fix for parquet file schemes (dask#2667)

  • Repartition works with mixed categoricals (dask#2676)

Core

0.15.2 / 2017-08-25

Array

Bag

  • Remove deprecated Bag behaviors (dask#2525)

DataFrame

Core

  • Remove bare except: blocks everywhere (dask#2590)

0.15.1 / 2017-07-08

0.15.0 / 2017-06-09

Array

Bag

  • Fix bug where reductions on bags with no partitions would fail (dask#2324)

  • Add broadcasting and variadic db.map top-level function. Also remove auto-expansion of tuples as map arguments (dask#2339)

  • Rename Bag.concat to Bag.flatten (dask#2402)

DataFrame

Core

  • Move dask.async module to dask.local (dask#2318)

  • Support callbacks with nested scheduler calls (dask#2397)

  • Support pathlib.Path objects as uris (dask#2310)

0.14.3 / 2017-05-05

DataFrame

  • Pandas 0.20.0 support

0.14.2 / 2017-05-03

Array

  • Add da.indices (dask#2268), da.tile (dask#2153), da.roll (dask#2135)

  • Simultaneously support drop_axis and new_axis in da.map_blocks (dask#2264)

  • Rechunk and concatenate work with unknown chunksizes (dask#2235) and (dask#2251)

  • Support non-numpy container arrays, notably sparse arrays (dask#2234)

  • Tensordot contracts over multiple axes (dask#2186)

  • Allow delayed targets in da.store (dask#2181)

  • Support interactions against lists and tuples (dask#2148)

  • Constructor plugins for debugging (dask#2142)

  • Multi-dimensional FFTs (single chunk) (dask#2116)

Bag

  • to_dataframe enforces consistent types (dask#2199)

DataFrame

Core

0.14.1 / 2017-03-22

Array

  • Micro-optimize optimizations (dask#2058)

  • Change slicing optimizations to avoid fusing raw numpy arrays (dask#2075) (dask#2080)

  • Dask.array operations now work on numpy arrays (dask#2079)

  • Reshape now works in a much broader set of cases (dask#2089)

  • Support deepcopy python protocol (dask#2090)

  • Allow user-provided FFT implementations in da.fft (dask#2093)

DataFrame

  • Fix to_parquet with empty partitions (dask#2020)

  • Optional npartitions='auto' mode in set_index (dask#2025)

  • Optimize shuffle performance (dask#2032)

  • Support efficient repartitioning along time windows like repartition(freq='12h') (dask#2059)

  • Improve speed of categorize (dask#2010)

  • Support single-row dataframe arithmetic (dask#2085)

  • Automatically avoid shuffle when setting index with a sorted column (dask#2091)

  • Improve handling of integer-na handling in read_csv (dask#2098)

Delayed

  • Repeated attribute access on delayed objects uses the same key (dask#2084)

Core

  • Improve naming of nodes in dot visuals to avoid generic apply (dask#2070)

  • Ensure that worker processes have different random seeds (dask#2094)

0.14.0 / 2017-02-24

Array

Bag

DataFrame

Delayed

  • Add traverse= keyword to delayed to optionally avoid traversing nested data structures (dask#1899)

  • Support Futures in from_delayed functions (dask#1961)

  • Improve serialization of decorated delayed functions (dask#1969)

Core

  • Improve windows path parsing in corner cases (dask#1910)

  • Rename tasks when fusing (dask#1919)

  • Add top level persist function (dask#1927)

  • Propagate errors= keyword in byte handling (dask#1954)

  • Dask.compute traverses Python collections (dask#1975)

  • Structural sharing between graphs in dask.array and dask.delayed (dask#1985)

0.13.0 / 2017-01-02

Array

  • Mandatory dtypes on dask.array. All operations maintain dtype information and UDF functions like map_blocks now require a dtype= keyword if it cannot be inferred (see the sketch after this list). (dask#1755)

  • Support arrays without known shapes, such as arises when slicing arrays with arrays or converting dataframes to arrays (dask#1838)

  • Support mutation by setting one array with another (dask#1840)

  • Tree reductions for covariance and correlations. (dask#1758)

  • Add SerializableLock for better use with distributed scheduling (dask#1766)

  • Improved atop support (dask#1800)

  • Rechunk optimization (dask#1737), (dask#1827)
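
A brief sketch of the dtype= requirement mentioned in the first item above (values are arbitrary):

import numpy as np
import dask.array as da

x = da.ones((4, 4), chunks=(2, 2))
# dtype= must be given explicitly when it cannot be inferred from the function
y = x.map_blocks(lambda block: block.astype(np.float32), dtype=np.float32)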

Bag

  • Avoid wrong results when recomputing the same groupby twice (dask#1867)

DataFrame

Delayed

  • Changed behaviour for delayed(nout=0) and delayed(nout=1): delayed(nout=1) does not default to out=None anymore, and delayed(nout=0) is also enabled. That is, functions returning tuples of length 1 or 0 can now be handled correctly. This is especially handy if functions with a variable number of outputs are wrapped by delayed. E.g. a trivial example: delayed(lambda *args: args, nout=len(vals))(*vals)

Core

0.12.0 / 2016-11-03

DataFrame

  • Return a series when functions given to dataframe.map_partitions return scalars (dask#1515)

  • Fix type size inference for series (dask#1513)

  • dataframe.DataFrame.categorize no longer includes missing values in the categories. This is for compatibility with a pandas change (dask#1565)

  • Fix head parser error in dataframe.read_csv when some lines have quotes (dask#1495)

  • Add dataframe.reduction and series.reduction methods to apply generic row-wise reduction to dataframes and series (dask#1483)

  • Add dataframe.select_dtypes, which mirrors the pandas method (dask#1556)

  • dataframe.read_hdf now supports reading Series (dask#1564)

  • Support Pandas 0.19.0 (dask#1540)

  • Implement select_dtypes (dask#1556)

  • String accessor works with indexes (dask#1561)

  • Add pipe method to dask.dataframe (dask#1567)

  • Add indicator keyword to merge (dask#1575)

  • Support Series in read_hdf (dask#1575)

  • Support Categories with missing values (dask#1578)

  • Support inplace operators like df.x += 1 (dask#1585)

  • Str accessor passes through args and kwargs (dask#1621)

  • Improved groupby support for single-machine multiprocessing scheduler (dask#1625)

  • Tree reductions (dask#1663)

  • Pivot tables (dask#1665)

  • Add clip (dask#1667), align (dask#1668), combine_first (dask#1725), and any/all (dask#1724)

  • Improved handling of divisions on dask-pandas merges (dask#1666)

  • Add groupby.aggregate method (dask#1678)

  • Add dd.read_table function (dask#1682)

  • Improve support for multi-level columns (dask#1697) (dask#1712)

  • Support 2d indexing in loc (dask#1726)

  • Extend resample to include DataFrames (dask#1741)

  • Support dask.array ufuncs on dask.dataframe objects (dask#1669)

Array

  • Add information about how dask.array chunks argument work (dask#1504)

  • Fix field access with non-scalar fields in dask.array (dask#1484)

  • Add concatenate= keyword to atop to concatenate chunks of contracted dimensions

  • Optimized slicing performance (dask#1539) (dask#1731)

  • Extend atop with a concatenate= (dask#1609) new_axes= (dask#1612) and adjust_chunks= (dask#1716) keywords

  • Add clip (dask#1610) swapaxes (dask#1611) round (dask#1708) repeat

  • Automatically align chunks in atop-backed operations (dask#1644)

  • Cull dask.arrays on slicing (dask#1709)

Bag

  • Fix issue with callables in bag.from_sequence being interpreted as tasks (dask#1491)

  • Avoid non-lazy memory use in reductions (dask#1747)

Administration

  • Added changelog (dask#1526)

  • Create new threadpool when operating from thread (dask#1487)

  • Unify example documentation pages into one (dask#1520)

  • Add versioneer for git-commit based versions (dask#1569)

  • Pass through node_attr and edge_attr keywords in dot visualization (dask#1614)

  • Add continuous testing for Windows with Appveyor (dask#1648)

  • Remove use of multiprocessing.Manager (dask#1653)

  • Add global optimizations keyword to compute (dask#1675)

  • Micro-optimize get_dependencies (dask#1722)

0.11.0 / 2016-08-24

Major Points

DataFrames now enforce knowing full metadata (columns, dtypes) everywhere. Previously we would operate in an ambiguous state when functions lost dtype information (such as apply). Now all dataframes always know their dtypes and raise errors asking for information if they are unable to infer (which they usually can). Some internal attributes like _pd and _pd_nonempty have been moved.

The internals of the distributed scheduler have been refactored to transition tasks between explicit states. This improves resilience, reasoning about scheduling, plugin operation, and logging. It also makes the scheduler code easier to understand for newcomers.

Breaking Changes

  • The distributed.s3 and distributed.hdfs namespaces are gone. Use protocols in normal methods like read_text('s3://...') instead.

  • Dask.array.reshape now errs in some cases where previously it would have created a very large number of tasks

0.10.2 / 2016-07-27

  • More Dataframe shuffles now work in distributed settings, ranging from setting-index to hash joins, to sorted joins and groupbys.

  • Dask passes the full test suite when run under Python's optimized -OO mode.

  • On-disk shuffles were found to produce wrong results in some highly-concurrent situations, especially on Windows. This has been resolved by a fix to the partd library.

  • Fixed a growth of open file descriptors that occurred under large data communications

  • Support ports in the --bokeh-whitelist option of dask-scheduler for better routing of web interface messages behind non-trivial network settings

  • Some improvements to resilience to worker failure (though other known failures persist)

  • You can now start an IPython kernel on any worker for improved debugging and analysis

  • Improvements to dask.dataframe.read_hdf, especially when reading from multiple files, along with documentation improvements

0.10.0 / 2016-06-13

Major Changes

  • This version drops support for Python 2.6

  • Conda packages are built and served from conda-forge

  • The dask.distributed executables have been renamed from dfoo to dask-foo. For example dscheduler is renamed to dask-scheduler

  • Both Bag and DataFrame include a preliminary distributed shuffle.

Bag

  • Add task-based shuffle for distributed groupbys

  • Add accumulate for cumulative reductions

DataFrame

  • Add a task-based shuffle suitable for distributed joins, groupby-applys, and set_index operations. The single-machine shuffle remains untouched (and much more efficient.)

  • Add support for new Pandas rolling API with improved communication performance on distributed systems.

  • Add groupby.std/var

  • Pass through S3/HDFS storage options in read_csv

  • Improve categorical partitioning

  • Add eval, info, isnull, notnull for dataframes

Distributed

  • Rename executables like dscheduler to dask-scheduler

  • Improve scheduler performance in the many-fast-tasks case (important for shuffling)

  • Improve work stealing to be aware of expected function run-times and data sizes. This drastically increases the breadth of algorithms that can be efficiently run on the distributed scheduler without significant user expertise.

  • Support maximum buffer sizes in streaming queues

  • Improve Windows support when using the Bokeh diagnostic web interface

  • Support compression of very-large-bytestrings in protocol

  • Support clean cancellation of submitted futures in Joblib interface

Other

  • All dask-related projects (dask, distributed, s3fs, hdfs, partd) are now building conda packages on conda-forge.

  • Change credential handling in s3fs to only pass around delegated credentials if explicitly given secret/key. The default now is to rely on managed environments. This can be changed back by explicitly providing a keyword argument. Anonymous mode must be explicitly declared if desired.

0.9.0 / 2016-05-11

API Changes

  • dask.do and dask.value have been renamed to dask.delayed

  • dask.bag.from_filenames has been renamed to dask.bag.read_text

  • All S3/HDFS data ingest functions like db.from_s3 or distributed.s3.read_csv have been moved into the plain read_text, read_csv functions, which now support protocols, like dd.read_csv('s3://bucket/keys*.csv')

Array

  • Add support for scipy.LinearOperator

  • Improve optional locking to on-disk data structures

  • Change rechunk to expose the intermediate chunks

Bag

  • Rename from_filenames to read_text

  • Remove from_s3 in favor of read_text('s3://...')

DataFrame

  • Fixed numerical stability issue for correlation and covariance

  • Allow no-hash from_pandas for speedy round-trips to and from-pandas objects

  • Generally reengineered read_csv to be more in line with Pandas behavior

  • Support fast set_index operations for sorted columns

Delayed

  • Rename do/value to delayed

  • Rename to/from_imperative to to/from_delayed

Distributed

  • Move s3 and hdfs functionality into the dask repository

  • Adaptively oversubscribe workers for very fast tasks

  • Improve PyPy support

  • Improve work stealing for unbalanced workers

  • Scatter data efficiently with tree-scatters

Other

  • Add lzma/xz compression support

  • Raise a warning when trying to split unsplittable compression types, like gzip or bz2

  • Improve hashing for single-machine shuffle operations

  • Add new callback method for start state

  • General performance tuning

0.8.1 / 2016-03-11

Array

  • Bugfix for range slicing that could periodically lead to incorrect results.

  • Improved support and resiliency of arg reductions (argmin, argmax, etc.)

Bag

  • Add zip function

DataFrame

  • Add corr and cov functions

  • Add melt function

  • Bugfixes for io to bcolz and hdf5

0.8.0 / 2016-02-20

Array

  • Changed default array reduction split from 32 to 4

  • Linear algebra, tril, triu, LU, inv, cholesky, solve, solve_triangular, eye, lstsq, diag, corrcoef.

Bag

  • Add tree reductions

  • Add range function

  • drop from_hdfs function (better functionality now exists in hdfs3 and distributed projects)

DataFrame

  • Refactor dask.dataframe to include a full empty pandas dataframe as metadata. Drop the .columns attribute on Series

  • Add Series categorical accessor, series.nunique, drop the .columns attribute for series.

  • read_csv fixes (multi-column parse_dates, integer column names, etc. )

  • Internal changes to improve graph serialization

Other

  • Documentation updates

  • Add from_imperative and to_imperative functions for all collections

  • Aesthetic changes to profiler plots

  • Moved the dask project to a new dask organization

0.7.6 / 2016-01-05

Array

  • Improve thread safety

  • Tree reductions

  • Add view, compress, hstack, dstack, vstack methods

  • map_blocks can now remove and add dimensions

DataFrame

  • Improve thread safety

  • Extend sampling to include replacement options

Imperative

  • Removed optimization passes that fused results.

Core

  • Removed dask.distributed

  • Improved performance of blocked file reading

  • Serialization improvements

  • Test Python 3.5

0.7.4 / 2015-10-23

This was mostly a bugfix release. Some notable changes:

  • Fix minor bugs associated with the release of numpy 1.10 and pandas 0.17

  • Fixed a bug with random number generation that would cause repeated blocks due to the birthday paradox

  • Use locks in dask.dataframe.read_hdf by default to avoid concurrency issues

  • Change dask.get to point to dask.async.get_sync by default

  • Allow visualization functions to accept general graphviz graph options like rankdir=’LR’

  • Add reshape and ravel to dask.array

  • Support the creation of dask.arrays from dask.imperative objects

Deprecation

This release also includes a deprecation warning for dask.distributed, which will be removed in the next version.

Future development in distributed computing for dask is happening here: https://distributed.dask.org . General feedback on that project is most welcome from this community.

0.7.3 / 2015-09-25

Diagnostics

  • A utility for profiling memory and cpu usage has been added to the dask.diagnostics module.

DataFrame

This release improves coverage of the pandas API. Among other things it includes nunique, nlargest, quantile. Fixes encoding issues with reading non-ascii csv files. Performance improvements and bug fixes with resample. More flexible read_hdf with globbing. And many more. Various bug fixes in dask.imperative and dask.bag.

0.7.0 / 2015-08-15

DataFrame

This release includes significant bugfixes and alignment with the Pandas API. This has resulted both from use and from recent involvement by Pandas core developers.

  • New operations: query, rolling operations, drop

  • Improved operations: quantiles, arithmetic on full dataframes, dropna, constructor logic, merge/join, elemwise operations, groupby aggregations

Bag

  • Fixed a bug in fold with a null default argument

Array

  • New operations: da.fft module, da.image.imread

Infrastructure

  • The array and dataframe collections create graphs with deterministic keys. These tend to be longer (hash strings) but should be consistent between computations. This will be useful for caching in the future.

  • All collections (Array, Bag, DataFrame) inherit from common subclass

0.6.1 / 2015-07-23

Distributed

  • Improved (though not yet sufficient) resiliency for dask.distributed when workers die

DataFrame

  • Improved writing to various formats, including to_hdf, to_castra, and to_csv

  • Improved creation of dask DataFrames from dask Arrays and Bags

  • Improved support for categoricals and various other methods

Array

  • Various bug fixes

  • Histogram function

Scheduling

  • Added tie-breaking ordering of tasks within parallel workloads to better handle and clear intermediate results

Other

  • Added the dask.do function for explicit construction of graphs with normal python code

  • Traded pydot for graphviz library for graph printing to support Python3

  • There is also a gitter chat room and a stackoverflow tag