Daft Distributed Scaling

Scale single-node workflows to distributed execution.

Core Strategies

| Strategy | API | Use Case | Pros/Cons |
| --- | --- | --- | --- |
| Shuffle | repartition(N) | Light data (e.g. file paths), joins | Pro: global balance. Con: high memory usage (materializes data). |
| Streaming | into_batches(N) | Heavy data (images, tensors) | Pro: low memory (streaming). Con: high scheduling overhead if batches are too small. |

Quick Recipes

Shuffle (repartition): best for distributing file paths before heavy reads.

import daft
# Create enough partitions to saturate workers
df = daft.read_parquet("s3://metadata").repartition(100)
# read_heavy_data stands in for a user-defined UDF that loads each path
df = df.with_column("data", read_heavy_data(df["path"]))
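As a rough illustration of why spreading light rows such as file paths over N partitions balances the later heavy reads, here is a pure-Python sketch. It assumes a simple round-robin assignment; Daft's actual shuffle may distribute rows differently, and the helper name `round_robin_partitions` and the example paths are hypothetical.

```python
from typing import List, Sequence


def round_robin_partitions(paths: Sequence[str], num_partitions: int) -> List[List[str]]:
    """Assign paths to partitions in round-robin order (conceptual only)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    parts: List[List[str]] = [[] for _ in range(num_partitions)]
    for i, path in enumerate(paths):
        parts[i % num_partitions].append(path)
    return parts


# Hypothetical file paths; only the light path strings are moved around.
paths = [f"s3://bucket/img_{i}.png" for i in range(10)]
parts = round_robin_partitions(paths, 4)
sizes = [len(p) for p in parts]  # partition sizes differ by at most one
```

Because only path strings are shuffled, materializing them is cheap, and each worker then receives a similar number of heavy reads.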

Streaming (into_batches): best for processing large partitions without running out of memory (OOM).

import daft
# Stream each large (~1 GB) partition in 64-row batches to bound memory
df = daft.read_parquet("heavy_data").into_batches(64)
# model.predict stands in for a user-defined UDF
df = df.with_column("embed", model.predict(df["img"]))
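To make the streaming idea concrete, the following is a minimal pure-Python sketch of fixed-size batching over an iterator. It only illustrates the concept behind into_batches(N) and is not Daft's implementation; the helper name `iter_batches` is hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Only one batch is held in memory at a time, however large the input is.
batch_lengths = [len(b) for b in iter_batches(range(150), 64)]
```

The last batch is simply shorter when the row count is not a multiple of the batch size.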

Advanced Tuning: The ByteDance Formula

Target: Keep all actors busy without OOM or scheduling bottlenecks.

Formula 1: Repartitioning (Light Data / Paths)

Calculate the Max Partition Count to ensure each task has enough data to feed local actors.

Example:

df = df.repartition(1000) # Balanced fan-out

Formula 2: Streaming (Heavy Data / Images)

Avoid creating tiny partitions. Use into_batches to stream data within larger partitions.

Strategy: Keep partitions large (e.g. 1 GB or more) and use into_batches(batch_size) to bound how many rows are in memory at once.

df = df.into_batches(64).with_column("preds", model(max_concurrency=16).predict(df["img"]))
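The trade-off between memory and scheduling overhead can be estimated with simple arithmetic. The sketch below assumes uniformly sized rows; the function name `batch_plan`, the 2 MiB per-row figure, and the row count are illustrative assumptions, not values from Daft or from the formula above.

```python
import math


def batch_plan(partition_rows: int, bytes_per_row: int, batch_size: int) -> dict:
    """Estimate batch count and per-batch memory for one partition."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    num_batches = math.ceil(partition_rows / batch_size)
    # A batch never holds more rows than the partition contains.
    peak_batch_bytes = min(batch_size, partition_rows) * bytes_per_row
    return {"num_batches": num_batches, "peak_batch_bytes": peak_batch_bytes}


MIB = 1024 ** 2

# 512 rows of 2 MiB each form a 1 GiB partition (assumed sizes).
plan_64 = batch_plan(partition_rows=512, bytes_per_row=2 * MIB, batch_size=64)
plan_1 = batch_plan(partition_rows=512, bytes_per_row=2 * MIB, batch_size=1)
```

Under these assumptions, a batch size of 64 gives 8 batches of 128 MiB each, while a batch size of 1 gives 512 batches of 2 MiB each: far less memory per batch, but 64 times as many units of work to schedule.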