# Dask Parallel and Distributed Computing

Scale pandas/NumPy workflows beyond memory and across clusters.
## When to Use

- Datasets exceed available RAM
- Need to parallelize pandas or NumPy operations
- Processing multiple files efficiently (CSVs, Parquet)
- Building custom parallel workflows
- Distributing workloads across multiple cores/machines
## Dask Collections

| Collection | Like | Use Case |
|---|---|---|
| DataFrame | pandas | Tabular data, CSV/Parquet |
| Array | NumPy | Numerical arrays, matrices |
| Bag | list | Unstructured data, JSON logs |
| Delayed | Custom | Arbitrary Python functions |

Key concept: All collections are lazy; computation happens only when you call `.compute()`.
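A minimal sketch of each collection; the paths are hypothetical, and nothing is read or computed until `.compute()` is called:

```python
import dask.dataframe as dd
import dask.array as da
import dask.bag as db
from dask import delayed

# All four constructors are lazy -- they only record work to be done
ddf = dd.read_parquet("data/*.parquet")                           # pandas-like DataFrame
arr = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))   # NumPy-like Array
bag = db.read_text("logs/*.json")                                 # list-like Bag
task = delayed(sum)([1, 2, 3])                                    # wrap any Python call

print(arr.mean().compute())   # execution happens only here
```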
## Lazy Evaluation

| Function | Behavior | Use |
|---|---|---|
| `dd.read_csv()` | Lazy load | Large CSVs |
| `dd.read_parquet()` | Lazy load | Large Parquet |
| Operations | Build graph | Chain transforms |
| `.compute()` | Execute | Get final result |

Key concept: Dask builds a task graph of operations, optimizes it, then executes it in parallel. Call `.compute()` once at the end, not after every operation.
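A minimal sketch of this pattern (file path and column names are hypothetical):

```python
import dask.dataframe as dd

ddf = dd.read_csv("events/*.csv")            # lazy: only builds a task graph
ddf = ddf[ddf["status"] == "ok"]             # lazy: adds a filter to the graph
daily = ddf.groupby("date")["value"].sum()   # lazy: adds the aggregation

result = daily.compute()                     # graph is optimized and run in parallel once
```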
## Schedulers

| Scheduler | Best For | Start |
|---|---|---|
| threaded | NumPy/pandas (releases GIL) | Default |
| processes | Pure Python (GIL-bound) | `scheduler='processes'` |
| synchronous | Debugging | `scheduler='synchronous'` |
| distributed | Monitoring, scaling, clusters | `Client()` |
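Schedulers can be chosen per `compute()` call or set globally; a sketch with a hypothetical dataset:

```python
import dask
import dask.dataframe as dd

total = dd.read_parquet("data/*.parquet")["value"].sum()

total.compute()                           # threaded scheduler (the default for DataFrames)
total.compute(scheduler="processes")      # process pool: better for GIL-bound pure Python
total.compute(scheduler="synchronous")    # single thread: easiest to step through with pdb

dask.config.set(scheduler="processes")    # or change the default globally
```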
## Distributed Scheduler

| Feature | Benefit |
|---|---|
| Dashboard | Real-time progress monitoring |
| Cluster scaling | Add/remove workers |
| Fault tolerance | Retry failed tasks |
| Worker resources | Memory management |
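A sketch of starting a local distributed cluster (worker counts and memory limits are illustrative):

```python
from dask.distributed import Client

# Starts a local cluster; once a Client exists, Dask collections use it automatically
client = Client(n_workers=4, threads_per_worker=2, memory_limit="4GB")
print(client.dashboard_link)   # open in a browser to watch tasks, memory, and workers

# ... run computations ...

client.close()
```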
## Chunking Concepts

### DataFrame Partitions

| Concept | Description |
|---|---|
| Partition | Subset of rows (like a mini DataFrame) |
| `npartitions` | Number of partitions |
| `divisions` | Index boundaries between partitions |
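A sketch of inspecting and rebalancing partitions (path is hypothetical):

```python
import dask.dataframe as dd

ddf = dd.read_csv("data/*.csv")
print(ddf.npartitions)   # how many pandas-sized pieces back this DataFrame
print(ddf.divisions)     # index boundaries, or (None, ..., None) if unknown

ddf = ddf.repartition(partition_size="100MB")   # rebalance toward ~100 MB partitions
```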
### Array Chunks

| Concept | Description |
|---|---|
| Chunk | Subset of the array (an n-dimensional block) |
| `chunks` | Tuple of chunk sizes per dimension |
| Optimal size | ~100 MB per chunk |

Key concept: Chunk size is critical. Too small = scheduling overhead; too large = memory issues. Target ~100 MB.
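A sketch of specifying and adjusting chunks:

```python
import dask.array as da

# Each chunk is an independent NumPy block; 2,000 x 2,000 float64 is ~32 MB per chunk
arr = da.ones((20_000, 20_000), chunks=(2_000, 2_000))
print(arr.chunks)       # chunk sizes along each dimension
print(arr.chunksize)    # (2000, 2000)

arr = arr.rechunk((4_000, 4_000))   # ~128 MB float64 blocks, near the ~100 MB target
```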
## DataFrame Operations

### Supported (parallel)

| Category | Operations |
|---|---|
| Selection | `filter`, `loc`, column selection |
| Aggregation | `groupby`, `sum`, `mean`, `count` |
| Transforms | `apply` (row-wise), `map_partitions` |
| Joins | `merge`, `join` (shuffles data) |
| I/O | `read_csv`, `read_parquet`, `to_parquet` |
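A sketch combining these operations (datasets and columns are hypothetical):

```python
import dask.dataframe as dd

orders = dd.read_parquet("orders/*.parquet")
users = dd.read_parquet("users/*.parquet")

recent = orders[orders["year"] >= 2023]               # selection: runs per partition
joined = recent.merge(users, on="user_id")            # join: shuffles data between partitions
totals = joined.groupby("country")["amount"].sum()    # aggregation: parallel tree reduction

print(totals.compute())                               # single compute at the end
```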
### Avoid or Use Carefully

| Operation | Issue | Alternative |
|---|---|---|
| `iterrows` | Kills parallelism | `map_partitions` |
| `apply(axis=1)` | Slow | `map_partitions` |
| Repeated `compute()` | Inefficient | Single `compute()` at the end |
| `sort_values` | Expensive shuffle | Avoid if possible |
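A sketch of replacing a row-wise `apply` with `map_partitions` (columns are hypothetical):

```python
import pandas as pd
import dask.dataframe as dd

ddf = dd.read_parquet("data/*.parquet")

# Instead of ddf.apply(lambda row: row["a"] * row["b"], axis=1), which runs Python row by row,
# operate on each partition as an ordinary pandas DataFrame:
def add_product(part: pd.DataFrame) -> pd.DataFrame:
    part = part.copy()
    part["product"] = part["a"] * part["b"]   # vectorized within the partition
    return part

result = ddf.map_partitions(add_product).compute()
```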
## Common Patterns

### ETL Pipeline

- `dd.read_*` (lazy load)
- Chain filters and transforms
- Single `.compute()` or `.to_parquet()` (see the sketch below)
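A minimal end-to-end sketch (paths and column names are hypothetical):

```python
import dask.dataframe as dd

# 1. Lazy load
ddf = dd.read_csv("raw/*.csv")

# 2. Chain filters and transforms (still lazy)
ddf = ddf[ddf["amount"] > 0]
ddf["amount_usd"] = ddf["amount"] * ddf["fx_rate"]
daily = ddf.groupby("date")["amount_usd"].sum()

# 3. Single materialization step: writing triggers the whole pipeline
daily.to_frame().to_parquet("curated/daily/")
```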
### Multi-File Processing

| Pattern | Description |
|---|---|
| Glob patterns | `dd.read_csv('data/*.csv')` |
| Partition per file | Natural parallelism |
| Output partitioned | `to_parquet('output/')` |
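A sketch (paths are hypothetical):

```python
import dask.dataframe as dd

# Each matching CSV becomes one or more partitions, processed in parallel
ddf = dd.read_csv("data/2024-*.csv")

# Writes one Parquet file per partition into the output directory
ddf.to_parquet("output/", write_index=False)
```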
### Custom Operations

| Method | Use Case |
|---|---|
| `map_partitions` | Apply a function to each DataFrame partition |
| `map_blocks` | Apply a function to each array block |
| `delayed` | Wrap arbitrary Python functions |
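A sketch of a custom parallel workflow built with `delayed` (the functions and inputs are placeholders):

```python
from dask import delayed

@delayed
def load(path):                     # any ordinary Python function
    return list(range(10))

@delayed
def transform(data):
    return [x * 2 for x in data]

@delayed
def summarize(parts):
    return sum(sum(p) for p in parts)

# Build the whole graph lazily, then run it in parallel with one compute()
tasks = [transform(load(p)) for p in ["a.csv", "b.csv", "c.csv"]]
print(summarize(tasks).compute())
```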
## Best Practices

| Practice | Why |
|---|---|
| Don't load data locally first | Let Dask handle loading |
| Single `compute()` at the end | Avoid redundant computation |
| Use Parquet | Faster than CSV, columnar |
| Match partitions to files | One partition per file |
| Check task graph size | Keep `len(ddf.__dask_graph__())` well under 100k tasks |
| Use the distributed scheduler for debugging | Dashboard shows progress |
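A quick sanity check on graph size (dataset is hypothetical):

```python
import dask.dataframe as dd

ddf = dd.read_csv("data/*.csv")

# A huge task graph usually means too many tiny partitions or repeated recomputation
print(ddf.npartitions)
print(len(ddf.__dask_graph__()))   # aim to stay well under ~100k tasks
```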
## Common Pitfalls

| Pitfall | Solution |
|---|---|
| Loading with pandas first | Use `dd.read_*` directly |
| `compute()` in loops | Collect all results, then a single `compute()` |
| Too many partitions | Repartition to ~100 MB each |
| Memory errors | Reduce chunk size, add workers |
| Slow shuffles | Avoid sorts/joins when possible |
## vs Alternatives

| Tool | Best For | Trade-off |
|---|---|---|
| Dask | Scaling pandas/NumPy, clusters | Setup complexity |
| Polars | Fast in-memory processing | Must fit in RAM |
| Vaex | Out-of-core on a single machine | Limited operations |
| Spark | Enterprise, SQL-heavy workloads | Infrastructure overhead |
## Resources

- Docs: https://docs.dask.org/
- Best Practices: https://docs.dask.org/en/stable/best-practices.html
- Examples: https://examples.dask.org/