Databricks Local Dev Loop
Overview
Set up a fast, reproducible local development workflow for Databricks.
Prerequisites
- Completed databricks-install-auth setup
- Python 3.8+ with pip
- VS Code or PyCharm IDE
- Access to a running cluster
Instructions
Step 1: Project Structure
```text
my-databricks-project/
├── src/
│   ├── __init__.py
│   ├── pipelines/
│   │   ├── __init__.py
│   │   ├── bronze.py          # Raw data ingestion
│   │   ├── silver.py          # Data cleansing
│   │   └── gold.py            # Business aggregations
│   └── utils/
│       ├── __init__.py
│       └── helpers.py
├── tests/
│   ├── __init__.py
│   ├── unit/
│   │   └── test_helpers.py
│   └── integration/
│       └── test_pipelines.py
├── notebooks/                 # Databricks notebooks
│   └── exploration.py
├── resources/                 # Asset Bundle configs
│   └── jobs.yml
├── databricks.yml             # Asset Bundle project config
├── .env.local                 # Local secrets (git-ignored)
├── .env.example               # Template for team
├── pyproject.toml
└── requirements.txt
```
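The `.env.local` file keeps the workspace URL, token, and cluster ID out of version control. The Databricks VS Code extension reads it directly (see Step 6); for plain `python` runs, a small loader like the sketch below works, assuming the `python-dotenv` package (not part of the Step 2 install list) is available.

```python
# Sketch: load the git-ignored .env.local for runs outside VS Code.
# Assumes `pip install python-dotenv`; the file name comes from the tree above.
from dotenv import load_dotenv

load_dotenv(".env.local")  # puts DATABRICKS_HOST, DATABRICKS_TOKEN, etc. into os.environ
```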
Step 2: Install Development Tools
```bash
set -euo pipefail

# Install the Databricks SDK and the legacy CLI
pip install databricks-sdk databricks-cli

# Install dbx for deployment
pip install dbx

# Install Databricks Connect v2 (for local Spark)
pip install databricks-connect==14.3.*

# Install testing tools
pip install pytest pytest-cov

# Note: the `databricks bundle ...` commands used later require the new
# Databricks CLI (v0.205+), distributed as a standalone binary (install
# script, Homebrew, or winget); the PyPI `databricks-cli` package is the
# legacy CLI and does not include bundle support.
```
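To confirm the tools landed in the active virtual environment, a quick check such as the sketch below prints the installed versions (package names taken from the pip commands above).

```python
# Print installed versions of the local dev dependencies.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("databricks-sdk", "databricks-connect", "dbx", "pytest"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```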
Step 3: Configure Databricks Connect
```bash
# Databricks Connect v2 (DBR 13+) reads its connection settings from a
# Databricks config profile (~/.databrickscfg) or from environment variables;
# the interactive `databricks-connect configure` command only exists in the
# legacy v1 client.

# Configure via environment variables for local development
export DATABRICKS_HOST="https://adb-1234567890.1.azuredatabricks.net"
export DATABRICKS_TOKEN="dapi..."
export DATABRICKS_CLUSTER_ID="1234-567890-abcde123"
```
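With the variables set, a one-off smoke test confirms that Databricks Connect can reach the cluster before wiring up tests; a minimal sketch:

```python
# Smoke test: run a trivial query on the remote cluster via Databricks Connect.
# Assumes DATABRICKS_HOST, DATABRICKS_TOKEN, and DATABRICKS_CLUSTER_ID are set as above.
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()
print(spark.range(5).count())  # prints 5 if the cluster is reachable
```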
Step 4: Create databricks.yml (Asset Bundle)
```yaml
# databricks.yml
bundle:
  name: my-databricks-project

workspace:
  host: ${DATABRICKS_HOST}

variables:
  catalog:
    description: Unity Catalog name
    default: main
  schema:
    description: Schema name
    default: default

targets:
  dev:
    default: true
    mode: development
    workspace:
      root_path: /Users/${workspace.current_user.userName}/.bundle/${bundle.name}/dev

  staging:
    mode: development
    workspace:
      root_path: /Shared/.bundle/${bundle.name}/staging

  prod:
    mode: production
    workspace:
      root_path: /Shared/.bundle/${bundle.name}/prod
```
Step 5: Local Testing Setup
```python
# tests/conftest.py
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    """Local SparkSession for unit tests (the Delta configs require delta-spark installed locally)."""
    return (
        SparkSession.builder
        .master("local[*]")
        .appName("unit-tests")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )


@pytest.fixture(scope="session")
def dbx_spark():
    """Connect to a Databricks cluster for integration tests."""
    from databricks.connect import DatabricksSession
    return DatabricksSession.builder.getOrCreate()
```
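A unit test then receives the `spark` fixture by name. The helper under test below (`uppercase_column`) is hypothetical; in practice, import whatever lives in `src/utils/helpers.py`.

```python
# tests/unit/test_helpers.py -- minimal example against the local `spark` fixture.
from pyspark.sql import DataFrame, functions as F


def uppercase_column(df: DataFrame, col: str) -> DataFrame:
    """Hypothetical helper; substitute the real one from src.utils.helpers."""
    return df.withColumn(col, F.upper(F.col(col)))


def test_uppercase_column(spark):
    df = spark.createDataFrame([("foo",), ("bar",)], ["name"])
    result = uppercase_column(df, "name")
    assert [row.name for row in result.collect()] == ["FOO", "BAR"]
```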
Step 6: VS Code Configuration
```jsonc
// .vscode/settings.json
{
  "python.defaultInterpreterPath": "${workspaceFolder}/.venv/bin/python",
  "python.testing.pytestEnabled": true,
  "python.testing.pytestArgs": ["tests"],
  "python.linting.enabled": true,
  "python.linting.pylintEnabled": true,
  "editor.formatOnSave": true,
  "[python]": {
    "editor.defaultFormatter": "ms-python.black-formatter"
  },
  "databricks.python.envFile": "${workspaceFolder}/.env.local"
}
```
```jsonc
// .vscode/launch.json
{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Python: Current File (Databricks Connect)",
      "type": "python",
      "request": "launch",
      "program": "${file}",
      "console": "integratedTerminal",
      "env": {
        "DATABRICKS_HOST": "${env:DATABRICKS_HOST}",
        "DATABRICKS_TOKEN": "${env:DATABRICKS_TOKEN}",
        "DATABRICKS_CLUSTER_ID": "${env:DATABRICKS_CLUSTER_ID}"
      }
    }
  ]
}
```
Output
- Working local development environment
- Databricks Connect configured for remote execution
- Unit and integration test setup
- VS Code/PyCharm integration ready
Error Handling
| Error | Cause | Solution |
| --- | --- | --- |
| Cluster not running | Cluster auto-terminated | Start the cluster first |
| Version mismatch | databricks-connect version differs from the cluster's DBR | Match the databricks-connect version to the cluster's DBR version |
| Module not found | Local package not installed | Run `pip install -e .` |
| Connection timeout | Network/firewall | Check VPN and firewall rules |
| SparkSession already exists | Multiple sessions created | Use the `getOrCreate()` pattern |
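For the last row in particular, a single session factory keeps local runs and Databricks Connect runs from creating competing sessions. A minimal sketch (the `get_spark` name is illustrative, not part of this project):

```python
# Sketch: one getOrCreate() entry point for both local Spark and Databricks Connect.
import os
from pyspark.sql import SparkSession


def get_spark():
    """Return a Databricks Connect session when configured, else a local one."""
    if os.getenv("DATABRICKS_HOST") and os.getenv("DATABRICKS_CLUSTER_ID"):
        from databricks.connect import DatabricksSession
        return DatabricksSession.builder.getOrCreate()
    return SparkSession.builder.master("local[*]").appName("local-dev").getOrCreate()
```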
Examples
Run Tests Locally
```bash
# Unit tests (local Spark)
pytest tests/unit/ -v

# Integration tests (Databricks Connect)
pytest tests/integration/ -v --tb=short

# With coverage
pytest tests/ --cov=src --cov-report=html
```
Deploy with Asset Bundles
```bash
# Validate bundle
databricks bundle validate

# Deploy to dev
databricks bundle deploy -t dev

# Run job
databricks bundle run -t dev my-job
```
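After a deploy, the Databricks SDK installed in Step 2 can confirm what the bundle created; a small sketch (the `my-job` name matches the run example above; adjust to your bundle):

```python
# List workspace jobs and find the bundle's job by name (reads DATABRICKS_HOST/TOKEN).
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
for job in w.jobs.list():
    # Development-mode bundles prefix job names, hence the substring match.
    if job.settings and job.settings.name and "my-job" in job.settings.name:
        print(f"Found job {job.settings.name} (id={job.job_id})")
```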
Interactive Development
```python
# src/pipelines/bronze.py
from pyspark.sql import SparkSession, DataFrame

def ingest_raw_data(spark: SparkSession, source_path: str) -> DataFrame:
    """Ingest raw data from the source path."""
    return spark.read.format("json").load(source_path)

if __name__ == "__main__":
    # Works locally with Databricks Connect
    from databricks.connect import DatabricksSession
    spark = DatabricksSession.builder.getOrCreate()
    df = ingest_raw_data(spark, "/mnt/raw/events")
    df.show()
```
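The same pattern extends to the downstream layers: keep each transformation a pure function of DataFrames so it runs unchanged under the local test fixture, Databricks Connect, or a deployed job. A hypothetical `silver.py` sketch (column names are assumptions):

```python
# src/pipelines/silver.py -- illustrative cleansing step.
from pyspark.sql import DataFrame, functions as F

def cleanse_events(raw: DataFrame) -> DataFrame:
    """Drop bad records, normalise the timestamp column, and deduplicate."""
    return (
        raw.dropna(subset=["event_id"])
        .withColumn("event_ts", F.to_timestamp("event_ts"))
        .dropDuplicates(["event_id"])
    )
```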
Hot Reload with dbx
```bash
# Watch for changes and sync
dbx sync --watch

# Or use Asset Bundles
databricks bundle sync -t dev --watch
```
Resources
- Databricks Connect
- Asset Bundles
- VS Code Extension
- Testing Notebooks
Next Steps
See databricks-sdk-patterns for production-ready code patterns.