daft-distributed-scaling

Daft Distributed Scaling

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "daft-distributed-scaling" with this command: npx skills add eventual-inc/daft/eventual-inc-daft-daft-distributed-scaling


Scale single-node workflows to distributed execution.

Core Strategies

| Strategy | API | Use Case | Pros/Cons |
| --- | --- | --- | --- |
| Shuffle | repartition(N) | Light data (e.g. file paths), joins | Global balance. High memory usage (materializes data). |
| Streaming | into_batches(N) | Heavy data (images, tensors) | Low memory (streaming). High scheduling overhead if batches are too small. |

Quick Recipes

  1. Light Data: Repartitioning

Best for distributing file paths before heavy reads.

import daft

# Create enough partitions to saturate workers
df = daft.read_parquet("s3://metadata").repartition(100)
# read_heavy_data is a user-defined UDF (not shown here)
df = df.with_column("data", read_heavy_data(df["path"]))

  2. Heavy Data: Streaming Batches

Best for processing large partitions without OOM.

# Stream a 1GB partition in 64-row batches to control memory
df = daft.read_parquet("heavy_data").into_batches(64)
df = df.with_column("embed", model.predict(df["img"]))
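
To make the batching idea concrete without depending on Daft, here is a minimal pure-Python sketch of fixed-size batching. `iter_batches` is a hypothetical helper, not part of Daft's API, and the 1,000-row input is a placeholder. It only illustrates that a consumer sees one bounded batch at a time, and that the number of batches (units of work to schedule) grows as the batch size shrinks.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `batch_size` rows (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# 1,000 placeholder rows standing in for one large partition.
# The batches are collected into a list here only so they can be counted;
# a real consumer would process each batch and then drop it.
batches = list(iter_batches(range(1000), 64))
num_batches = len(batches)         # 16 batches
last_batch_len = len(batches[-1])  # the final batch is partial: 40 rows
```

Halving the batch size to 32 would roughly double the number of batches, which is the scheduling-overhead side of the trade-off.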

Advanced Tuning: The ByteDance Formula

Target: Keep all actors busy without OOM or scheduling bottlenecks.

Formula 1: Repartitioning (Light Data / Paths)

Calculate the Max Partition Count to ensure each task has enough data to feed local actors.

  • Min Rows Per Partition = Batch Size * (Total Concurrency / Nodes)

  • Max Partitions = Total Rows / Min Rows Per Partition

Example:

  • 1M rows, 4 nodes, 16 total concurrency, Batch Size 64.

  • Min Rows: 64 * (16/4) = 256.

  • Max Partitions: 1,000,000 / 256 ≈ 3906.

  • Recommendation: Use ~1000 partitions to run multiple batches per task.

df = df.repartition(1000) # Balanced fan-out
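
The arithmetic in Formula 1 can be sketched as a small helper. The function name `partition_bounds` and its parameter names are illustrative assumptions, not part of Daft. It simply reproduces the worked example above.

```python
import math


def partition_bounds(total_rows: int, nodes: int,
                     total_concurrency: int, batch_size: int) -> tuple:
    """Return (min_rows_per_partition, max_partitions) per Formula 1.

    Hypothetical helper: names and rounding choices are assumptions.
    """
    # Actors per node = Total Concurrency / Nodes
    actors_per_node = total_concurrency / nodes
    # Each partition should hold at least one batch per local actor
    min_rows = math.ceil(batch_size * actors_per_node)
    # More partitions than this would leave tasks with too little data
    max_parts = total_rows // min_rows
    return min_rows, max_parts


# Worked example from the text: 1M rows, 4 nodes, 16 total concurrency, batch size 64
min_rows, max_parts = partition_bounds(1_000_000, 4, 16, 64)
# min_rows is 256 and max_parts is 3906; the text then recommends going well
# below that ceiling (about 1000) so each task runs several batches.
```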

Formula 2: Streaming (Heavy Data / Images)

Avoid creating tiny partitions. Use into_batches to stream data within larger partitions.

Strategy: Keep partitions large (e.g. 1GB+), use into_batches(Batch Size) to control memory.

# Stream batches to control memory usage per actor
df = df.into_batches(64).with_column("preds", model(max_concurrency=16).predict(df["img"]))
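
As a rough back-of-the-envelope check on this strategy, the sketch below estimates how much input data is in flight when batches are streamed to actors. The 1 GiB partition size, batch size of 64, and concurrency of 16 come from the text; the 4,096 rows per partition is a made-up figure, and `batch_footprint` is a hypothetical helper. The estimate ignores decoding overhead, model weights, and output buffers, so treat it as a lower bound.

```python
def batch_footprint(partition_bytes: int, rows_per_partition: int,
                    batch_size: int, concurrency: int) -> tuple:
    """Estimate (bytes per batch, bytes in flight across all actors).

    Hypothetical helper: assumes rows are roughly uniform in size.
    """
    avg_row_bytes = partition_bytes / rows_per_partition
    bytes_per_batch = avg_row_bytes * batch_size
    # One batch held by each concurrently running actor
    bytes_in_flight = bytes_per_batch * concurrency
    return bytes_per_batch, bytes_in_flight


MIB = 1024 ** 2
GIB = 1024 ** 3

# Assumed: 4,096 rows in a 1 GiB partition (about 256 KiB per row)
per_batch, in_flight = batch_footprint(GIB, 4096, 64, 16)
per_batch_mib = per_batch / MIB   # 16.0 MiB per batch
in_flight_mib = in_flight / MIB   # 256.0 MiB across 16 actors
```

Under these assumptions each actor holds about 16 MiB of input at a time rather than the full 1 GiB partition.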

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
