Monitoring Dashboard

Purpose: Provide complete monitoring dashboard templates and setup scripts for ML training with TensorBoard and Weights & Biases (WandB).

Activation Triggers:

  • Setting up training monitoring dashboards

  • Tracking experiments and metrics in real-time

  • Comparing multiple training runs

  • Visualizing hyperparameters and results

  • Integrating monitoring into existing training pipelines

  • Logging custom metrics, images, and model artifacts

Key Resources:

  • scripts/setup-tensorboard.sh: Install and configure TensorBoard

  • scripts/setup-wandb.sh: Install and configure Weights & Biases

  • scripts/launch-monitoring.sh: Launch monitoring dashboards

  • templates/tensorboard-config.yaml: TensorBoard configuration template

  • templates/wandb-config.py: WandB integration template

  • templates/logging-config.json: Unified logging configuration

  • examples/tensorboard-integration.md: Complete TensorBoard integration guide

  • examples/wandb-integration.md: Complete WandB integration guide

Quick Start

  1. Choose Monitoring Solution

TensorBoard (Local/Open Source):

  • Free, runs locally

  • Best for: Single-user development, offline work

  • Features: Metrics, histograms, graphs, images, embeddings

  • Storage: Local filesystem

Weights & Biases (Cloud/Collaboration):

  • Free tier available, cloud-hosted

  • Best for: Team collaboration, experiment comparison, production

  • Features: All TensorBoard features + collaboration, alerts, reports

  • Storage: Cloud-hosted run history

Both (Recommended for Production):

  • Use TensorBoard for local development

  • Use WandB for team collaboration and production tracking
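For pipelines that need to switch between these options at runtime, a minimal sketch of a backend selector; the `MONITOR_BACKENDS` environment variable is an illustrative convention of this sketch, not part of the skill's scripts:

```python
import os

def select_backends(env=None):
    """Return the monitoring backends to enable, read from MONITOR_BACKENDS.

    MONITOR_BACKENDS is a comma-separated list, e.g. "tensorboard,wandb".
    Defaults to TensorBoard only, matching the local-development case.
    """
    env = os.environ if env is None else env
    raw = env.get("MONITOR_BACKENDS", "tensorboard")
    backends = {b.strip().lower() for b in raw.split(",") if b.strip()}
    unknown = backends - {"tensorboard", "wandb"}
    if unknown:
        raise ValueError(f"Unknown monitoring backends: {sorted(unknown)}")
    return sorted(backends)
```

A training script can then initialize a SummaryWriter, a WandB run, or both, depending on the returned list.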

  2. Setup TensorBoard

```bash
# Install and configure TensorBoard
./scripts/setup-tensorboard.sh

# Launch TensorBoard
./scripts/launch-monitoring.sh tensorboard --logdir ./runs
```

Access: Open browser to http://localhost:6006

  3. Setup Weights & Biases

```bash
# Install and configure WandB
./scripts/setup-wandb.sh

# Login with API key
wandb login

# Launch monitoring
./scripts/launch-monitoring.sh wandb
```

Access: Dashboard at https://wandb.ai/your-username/your-project

TensorBoard Integration

Basic Setup

Template: templates/tensorboard-config.yaml

```python
from torch.utils.tensorboard import SummaryWriter
import datetime

# Create TensorBoard writer
log_dir = f"runs/experiment_{datetime.datetime.now().strftime('%Y%m%d-%H%M%S')}"
writer = SummaryWriter(log_dir=log_dir)

# Log scalar metrics
writer.add_scalar('Loss/train', train_loss, epoch)
writer.add_scalar('Loss/validation', val_loss, epoch)
writer.add_scalar('Accuracy/train', train_acc, epoch)
writer.add_scalar('Accuracy/validation', val_acc, epoch)

# Log learning rate
writer.add_scalar('Learning_Rate', optimizer.param_groups[0]['lr'], epoch)

# Close writer when done
writer.close()
```

Advanced Logging

Histograms (Weight Distributions):

```python
# Log model weight and gradient distributions
for name, param in model.named_parameters():
    writer.add_histogram(f'weights/{name}', param, epoch)
    if param.grad is not None:  # grad is None for frozen params or before backward
        writer.add_histogram(f'gradients/{name}', param.grad, epoch)
```

Images:

```python
# Log sample predictions
writer.add_image('predictions', image_grid, epoch)
writer.add_images('batch_samples', image_batch, epoch)
```

Text:

```python
# Log hyperparameters as text
config_text = '\n'.join([f'{k}: {v}' for k, v in config.items()])
writer.add_text('hyperparameters', config_text, 0)
```

Model Graph:

```python
# Log model architecture
writer.add_graph(model, input_tensor)
```

Embeddings (t-SNE, PCA):

```python
# Visualize embeddings
writer.add_embedding(embeddings, metadata=labels, label_img=images)
```

Launch TensorBoard

```bash
# Basic launch
tensorboard --logdir runs

# Specify port
tensorboard --logdir runs --port 6007

# Load faster (sample data)
tensorboard --logdir runs --samples_per_plugin scalars=1000

# Enable reload
tensorboard --logdir runs --reload_interval 5
```

Weights & Biases Integration

Basic Setup

Template: templates/wandb-config.py

```python
import wandb
from datetime import datetime

# Initialize WandB run
wandb.init(
    project="my-ml-project",
    name=f"experiment-{datetime.now().strftime('%Y%m%d-%H%M%S')}",
    config={
        "learning_rate": 0.001,
        "epochs": 100,
        "batch_size": 32,
        "model": "resnet50",
        "dataset": "imagenet"
    }
)

# Log metrics
wandb.log({
    "train_loss": train_loss,
    "val_loss": val_loss,
    "train_acc": train_acc,
    "val_acc": val_acc,
    "epoch": epoch
})

# Finish run
wandb.finish()
```

Advanced Features

Log Media:

```python
# Log images
wandb.log({"predictions": [wandb.Image(img, caption=f"Pred: {pred}")]})

# Log tables
table = wandb.Table(columns=["epoch", "loss", "accuracy"], data=data)
wandb.log({"results_table": table})

# Log audio
wandb.log({"audio": wandb.Audio(audio_array, sample_rate=16000)})

# Log videos
wandb.log({"video": wandb.Video(video_path, fps=30)})
```

Track Model Artifacts:

```python
# Save model checkpoint
artifact = wandb.Artifact('model-checkpoint', type='model')
artifact.add_file('model.pth')
wandb.log_artifact(artifact)

# Load model from artifact
artifact = wandb.use_artifact('model-checkpoint:latest')
model_path = artifact.download()
```

Hyperparameter Sweeps:

```python
# Define sweep configuration
sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'val_loss', 'goal': 'minimize'},
    'parameters': {
        'learning_rate': {'min': 0.0001, 'max': 0.1},
        'batch_size': {'values': [16, 32, 64]},
        'optimizer': {'values': ['adam', 'sgd', 'adamw']}
    }
}

# Initialize sweep
sweep_id = wandb.sweep(sweep_config, project="my-project")

# Run sweep agent
wandb.agent(sweep_id, function=train_model, count=10)
```
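Sweep configs tend to fail late when keys are misspelled. A small structural check can catch the common mistakes before wandb.sweep is called; this helper is hypothetical (not part of the wandb API) and only checks shape, not semantics:

```python
def check_sweep_config(cfg):
    """Validate the basic shape of a wandb-style sweep config dict.

    Returns a list of problems; an empty list means the config looks
    well-formed.
    """
    problems = []
    if cfg.get("method") not in {"grid", "random", "bayes"}:
        problems.append("method must be one of: grid, random, bayes")
    metric = cfg.get("metric", {})
    if "name" not in metric:
        problems.append("metric.name is required")
    if metric.get("goal") not in {"minimize", "maximize"}:
        problems.append("metric.goal must be minimize or maximize")
    params = cfg.get("parameters", {})
    if not params:
        problems.append("parameters must not be empty")
    for name, spec in params.items():
        # Each parameter needs either a discrete 'values' list or a 'min'/'max' range
        if not ({"values"} <= spec.keys() or {"min", "max"} <= spec.keys()):
            problems.append(f"parameter {name!r} needs 'values' or 'min'/'max'")
    return problems
```

Calling it on the sweep_config above should return an empty list; a typo such as `'method': 'bayesian'` is reported immediately instead of surfacing as a failed sweep.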

Custom Charts:

```python
# Create custom plot
data = [[x, y] for (x, y) in zip(x_values, y_values)]
table = wandb.Table(data=data, columns=["x", "y"])
wandb.log({
    "custom_plot": wandb.plot.line(table, "x", "y", title="Custom Plot")
})
```

Alerts:

```python
# Alert on metric threshold
if val_loss < 0.1:
    wandb.alert(
        title="Low Validation Loss",
        text=f"Validation loss dropped to {val_loss:.4f}",
        level=wandb.AlertLevel.INFO
    )
```

Unified Logging Configuration

Template: templates/logging-config.json

Use this configuration to log to both TensorBoard and WandB simultaneously:

```python
import wandb
from torch.utils.tensorboard import SummaryWriter

class UnifiedLogger:
    def __init__(self, project_name, experiment_name, config):
        # TensorBoard
        self.tb_writer = SummaryWriter(log_dir=f"runs/{experiment_name}")

        # WandB
        wandb.init(
            project=project_name,
            name=experiment_name,
            config=config
        )

    def log_metrics(self, metrics_dict, step):
        """Log to both TensorBoard and WandB"""
        # TensorBoard
        for key, value in metrics_dict.items():
            self.tb_writer.add_scalar(key, value, step)

        # WandB
        wandb.log(metrics_dict, step=step)

    def log_images(self, images_dict, step):
        """Log images to both platforms"""
        for key, image in images_dict.items():
            # TensorBoard
            self.tb_writer.add_image(key, image, step)

            # WandB
            wandb.log({key: wandb.Image(image)}, step=step)

    def log_model(self, model, input_sample):
        """Log model architecture"""
        # TensorBoard graph
        self.tb_writer.add_graph(model, input_sample)

        # WandB watches gradients
        wandb.watch(model, log="all", log_freq=100)

    def close(self):
        """Close both loggers"""
        self.tb_writer.close()
        wandb.finish()
```

Usage

```python
logger = UnifiedLogger(
    project_name="my-project",
    experiment_name="exp-001",
    config={"lr": 0.001, "batch_size": 32}
)

logger.log_metrics({
    "train_loss": 0.5,
    "val_loss": 0.6
}, step=epoch)

logger.close()
```

Common Monitoring Patterns

  1. Training Loop Integration

```python
for epoch in range(num_epochs):
    # Training phase
    model.train()
    train_loss = 0
    for batch_idx, (data, target) in enumerate(train_loader):
        loss = train_step(model, data, target, optimizer)
        train_loss += loss.item()

        # Log batch-level metrics
        global_step = epoch * len(train_loader) + batch_idx
        logger.log_metrics({
            "batch_loss": loss.item(),
            "learning_rate": optimizer.param_groups[0]['lr']
        }, step=global_step)

    # Validation phase
    model.eval()
    val_loss, val_acc = validate(model, val_loader)

    # Log epoch-level metrics
    logger.log_metrics({
        "epoch": epoch,
        "train_loss": train_loss / len(train_loader),
        "val_loss": val_loss,
        "val_acc": val_acc
    }, step=epoch)

    # Log model weights distribution
    for name, param in model.named_parameters():
        logger.tb_writer.add_histogram(f'weights/{name}', param, epoch)
```

2. Experiment Comparison

TensorBoard:

```bash
# Compare multiple runs
tensorboard --logdir_spec exp1:runs/experiment_1,exp2:runs/experiment_2,exp3:runs/experiment_3
```

WandB:

  • Automatically compares all runs in the project dashboard

  • Filter and group runs by tags, config values, or custom fields
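Run comparison can also be scripted. A minimal sketch that picks the best run by a summary metric; the run dicts here stand in for what e.g. wandb's public API (`wandb.Api().runs(...)`) returns, so their exact shape is an assumption of this sketch:

```python
def best_run(runs, metric, goal="minimize"):
    """Return the run with the best summary value for `metric`.

    Each run is a dict like {"name": ..., "summary": {...}}; runs that
    never logged the metric are skipped.
    """
    scored = [r for r in runs if metric in r.get("summary", {})]
    if not scored:
        raise ValueError(f"no run logged metric {metric!r}")
    pick = max if goal == "maximize" else min
    return pick(scored, key=lambda r: r["summary"][metric])
```

This mirrors the sweep config's `metric.goal` convention, so the same goal string can drive both the sweep and the post-hoc comparison.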

  3. Real-Time Monitoring

TensorBoard:

```bash
# Auto-reload new data
tensorboard --logdir runs --reload_interval 5
```

WandB:

Real-time by default. Enable email/Slack alerts for key metrics:

```python
wandb.alert(
    title="Training Alert",
    text=f"Accuracy reached {acc:.2%}",
    level=wandb.AlertLevel.INFO
)
```

Best Practices

  1. Metric Naming Conventions

Organize by category:

```python
# Good: hierarchical naming
"Loss/train"
"Loss/validation"
"Accuracy/train"
"Accuracy/validation"
"Metrics/precision"
"Metrics/recall"

# Bad: flat naming
"train_loss"
"validation_loss"
"train_accuracy"
```
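The hierarchical convention is easy to enforce in code rather than by hand; a small hypothetical helper that rewrites flat names before they reach the logger:

```python
def namespace_metrics(metrics, split):
    """Map flat metric names to hierarchical ones, e.g. 'loss' -> 'Loss/train'.

    `split` is the data split the metrics came from ('train', 'validation', ...).
    """
    return {f"{name.capitalize()}/{split}": value for name, value in metrics.items()}
```

For example, `namespace_metrics({"loss": 0.5}, "train")` yields `{"Loss/train": 0.5}`, which TensorBoard then groups under a single "Loss" chart panel.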

  2. Logging Frequency

Guidelines:

  • Scalars: Every batch or every N batches

  • Histograms: Every epoch

  • Images: Every epoch or every N epochs

  • Model graph: Once at start

  • Embeddings: Once per major checkpoint

```python
# Log batch metrics every 10 batches
if batch_idx % 10 == 0:
    logger.log_metrics({"batch_loss": loss}, step)

# Log epoch metrics at the end of each epoch
if batch_idx == len(train_loader) - 1:
    logger.log_metrics({"epoch_loss": epoch_loss}, epoch)

# Log images every 5 epochs
if epoch % 5 == 0:
    logger.log_images({"samples": sample_images}, epoch)
```

  3. Disk Space Management

TensorBoard:

```bash
# Limit log retention (delete run dirs older than 30 days)
find runs/ -type d -mtime +30 -exec rm -rf {} +

# Compress old logs
tar -czf archive_$(date +%Y%m%d).tar.gz runs/old_experiments/
rm -rf runs/old_experiments/
```

WandB:

  • Cloud storage handles retention; configure it in the project settings

  • Download important runs for a local backup:

```python
wandb.restore('model.pth', run_path="user/project/run_id")
```

  4. Security & Privacy

TensorBoard:

```bash
# Restrict access to localhost only
tensorboard --logdir runs --host 127.0.0.1

# Or use an SSH tunnel for remote access
ssh -L 6006:localhost:6006 user@remote-server
```

WandB:

```python
# Use private projects
wandb.init(project="my-project", entity="private-team")

# Disable cloud sync for sensitive data
wandb.init(mode="offline")  # Logs locally only
```

Troubleshooting

TensorBoard Issues

Problem: Dashboard not updating

```bash
# Force reload
tensorboard --logdir runs --reload_interval 1

# Clear cache
rm -rf /tmp/.tensorboard-info/
```

Problem: Port already in use

```bash
# Use a different port
tensorboard --logdir runs --port 6007

# Or kill the existing process
pkill -f tensorboard
```

WandB Issues

Problem: Login fails

```bash
# Re-login with API key
wandb login --relogin

# Or set via environment variable
export WANDB_API_KEY=your_api_key
```

Problem: Slow logging

```python
# Reduce logging overhead
wandb.init(settings=wandb.Settings(
    _disable_stats=True,  # Disable system metrics
    _disable_meta=True    # Disable metadata
))
```
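Another lever is to batch metrics client-side and flush them in windows, so the training loop makes one log call per window instead of one per metric. The buffer class below is a hypothetical sketch (not a wandb feature), relying only on the fact that a single `wandb.log` call accepts a dict of many metrics:

```python
class MetricBuffer:
    """Accumulate per-step metrics and hand them back in batches of
    `flush_every` distinct steps, so the caller logs once per batch."""

    def __init__(self, flush_every=10):
        self.flush_every = flush_every
        self._pending = {}

    def add(self, step, metrics):
        """Buffer metrics for a step; return any flushed (step, metrics) pairs."""
        self._pending.setdefault(step, {}).update(metrics)
        if len(self._pending) >= self.flush_every:
            return self.flush()
        return []

    def flush(self):
        """Return all buffered (step, metrics) pairs in step order, then clear."""
        out = sorted(self._pending.items())
        self._pending = {}
        return out

# Usage sketch: for step, metrics in buf.add(step, {"loss": loss}):
#     wandb.log(metrics, step=step)
```

Call `flush()` once more at the end of training so trailing steps are not dropped.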

Scripts Usage

Setup TensorBoard

```bash
./scripts/setup-tensorboard.sh
```

Verifies the Python environment and the TensorBoard installation, and creates the default log directory structure.

Setup WandB

```bash
./scripts/setup-wandb.sh
```

Verifies the WandB installation and API key configuration, and creates the wandb config file.

Launch Monitoring

```bash
# TensorBoard
./scripts/launch-monitoring.sh tensorboard --logdir ./runs --port 6006

# WandB (opens browser to dashboard)
./scripts/launch-monitoring.sh wandb --project my-project

# Both
./scripts/launch-monitoring.sh both --logdir ./runs --project my-project
```

Resources

Scripts:

  • setup-tensorboard.sh: Install and configure TensorBoard

  • setup-wandb.sh: Install and configure WandB

  • launch-monitoring.sh: Launch monitoring dashboards

Templates:

  • tensorboard-config.yaml: TensorBoard setup configuration

  • wandb-config.py: WandB integration template

  • logging-config.json: Unified logging configuration

Examples:

  • tensorboard-integration.md: Complete TensorBoard integration

  • wandb-integration.md: Complete WandB integration with sweeps

Supported Frameworks: PyTorch, TensorFlow, JAX, Hugging Face Transformers
Python Version: 3.8+
Best Practice: Use both TensorBoard (local dev) and WandB (team collaboration)
