Dask enables parallel computing through task scheduling and blocked algorithms. This allows developers to write complex parallel algorithms and execute them in parallel either on a modern multi-core machine or on a distributed cluster.
On a single machine dask increases the scale of comfortable data from fits-in-memory to fits-on-disk by intelligently streaming data from disk and by leveraging all the cores of a modern CPU.
Users interact with dask either by making graphs directly or through the dask collections which provide larger-than-memory counterparts to existing popular libraries:
map, filter, toolz+
Dask primarily targets parallel computations that run on a single machine. It integrates nicely with the existing PyData ecosystem and is trivial to setup and use:
conda install dask or pip install dask
Operations on dask collections (array, bag, dataframe) produce task graphs that encode blocked algorithms. Task schedulers execute these task graphs in parallel in a variety of contexts.
Dask collections are the main interaction point for users. They look like NumPy and pandas but generate dask graphs internally. If you are a dask user then you should start here.
Dask graphs encode algorithms in a simple format involving Python dicts, tuples, and functions. This graph format can be used in isolation from the dask collections. If you are a developer then you should start here.
Schedulers execute task graphs. After a collection produces a graph we execute this graph in parallel, either using all of the cores on a single workstation or using a distributed cluster.
Inspecting and Diagnosing Graphs
Parallel code can be tricky to debug and profile. Dask provides a few tools to help make debugging and profiling graph execution easier.
- For user questions please tag StackOverflow questions with the #dask tag.
- For bug reports and feature requests please use the GitHub issue tracker
- For community discussion please use email@example.com