scVelo — RNA Velocity Analysis

Overview

scVelo is the leading Python package for RNA velocity analysis in single-cell RNA-seq data. It infers cell state transitions by modeling the kinetics of mRNA splicing — using the ratio of unspliced (pre-mRNA) to spliced (mature mRNA) abundances to determine whether a gene is being upregulated or downregulated in each cell. This allows reconstruction of developmental trajectories and identification of cell fate decisions without requiring time-course data.

Installation: pip install scvelo

Key resources:

Documentation: https://scvelo.readthedocs.io/
GitHub: https://github.com/theislab/scvelo
Paper: Bergen et al. (2020) Nature Biotechnology. PMID: 32747759

When to Use This Skill

Use scVelo when:

Trajectory inference from snapshot data: Determine which direction cells are differentiating
Cell fate prediction: Identify progenitor cells and their downstream fates
Driver gene identification: Find genes whose dynamics best explain observed trajectories
Developmental biology: Model hematopoiesis, neurogenesis, epithelial-to-mesenchymal transitions
Latent time estimation: Order cells along a pseudotime derived from splicing dynamics
Complement to Scanpy: Add directional information to UMAP embeddings

Prerequisites

scVelo requires count matrices for both unspliced and spliced RNA. These are generated by:

STARsolo or kallisto|bustools with lamanno mode
velocyto CLI: velocyto run10x / velocyto run
alevin-fry / simpleaf with spliced/unspliced output

Data is stored in an AnnData object with layers["spliced"] and layers["unspliced"] .

Standard RNA Velocity Workflow

Setup and Data Loading

import scvelo as scv import scanpy as sc import numpy as np import matplotlib.pyplot as plt

Configure settings

scv.settings.verbosity = 3 # Show computation steps scv.settings.presenter_view = True scv.settings.set_figure_params('scvelo')

Load data (AnnData with spliced/unspliced layers)

Option A: Load from loom (velocyto output)

adata = scv.read("cellranger_output.loom", cache=True)

Option B: Merge velocyto loom with Scanpy-processed AnnData

adata_processed = sc.read_h5ad("processed.h5ad") # Has UMAP, clusters adata_velocity = scv.read("velocyto.loom") adata = scv.utils.merge(adata_processed, adata_velocity)

Verify layers

print(adata)

obs × var: N × G

layers: 'spliced', 'unspliced' (required)

obsm['X_umap'] (required for visualization)

Preprocessing

Filter and normalize (follows Scanpy conventions)

scv.pp.filter_and_normalize( adata, min_shared_counts=20, # Minimum counts in spliced+unspliced n_top_genes=2000 # Top highly variable genes )

Compute first and second order moments (means and variances)

knn_connectivities must be computed first

sc.pp.neighbors(adata, n_neighbors=30, n_pcs=30) scv.pp.moments( adata, n_pcs=30, n_neighbors=30 )

Velocity Estimation — Stochastic Model

The stochastic model is fast and suitable for exploratory analysis:

Stochastic velocity (faster, less accurate)

scv.tl.velocity(adata, mode='stochastic') scv.tl.velocity_graph(adata)

Visualize

scv.pl.velocity_embedding_stream( adata, basis='umap', color='leiden', title="RNA Velocity (Stochastic)" )

Velocity Estimation — Dynamical Model (Recommended)

The dynamical model fits the full splicing kinetics and is more accurate:

Recover dynamics (computationally intensive; ~10-30 min for 10K cells)

scv.tl.recover_dynamics(adata, n_jobs=4)

Compute velocity from dynamical model

scv.tl.velocity(adata, mode='dynamical') scv.tl.velocity_graph(adata)

Latent Time

The dynamical model enables computation of a shared latent time (pseudotime):

Compute latent time

scv.tl.latent_time(adata)

Visualize latent time on UMAP

scv.pl.scatter( adata, color='latent_time', color_map='gnuplot', size=80, title='Latent time' )

Identify top genes ordered by latent time

top_genes = adata.var['fit_likelihood'].sort_values(ascending=False).index[:300] scv.pl.heatmap( adata, var_names=top_genes, sortby='latent_time', col_color='leiden', n_convolve=100 )

Driver Gene Analysis

Identify genes with highest velocity fit

scv.tl.rank_velocity_genes(adata, groupby='leiden', min_corr=0.3) df = scv.DataFrame(adata.uns['rank_velocity_genes']['names']) print(df.head(10))

Speed and coherence

scv.tl.velocity_confidence(adata) scv.pl.scatter( adata, c=['velocity_length', 'velocity_confidence'], cmap='coolwarm', perc=[5, 95] )

Phase portraits for specific genes

scv.pl.velocity(adata, ['Cpe', 'Gnao1', 'Ins2'], ncols=3, figsize=(16, 4))

Velocity Arrows and Pseudotime

Arrow plot on UMAP

scv.pl.velocity_embedding( adata, arrow_length=3, arrow_size=2, color='leiden', basis='umap' )

Stream plot (cleaner visualization)

scv.pl.velocity_embedding_stream( adata, basis='umap', color='leiden', smooth=0.8, min_mass=4 )

Velocity pseudotime (alternative to latent time)

scv.tl.velocity_pseudotime(adata) scv.pl.scatter(adata, color='velocity_pseudotime', cmap='gnuplot')

PAGA Trajectory Graph

PAGA graph with velocity-informed transitions

scv.tl.paga(adata, groups='leiden') df = scv.get_df(adata, 'paga/transitions_confidence', precision=2).T df.style.background_gradient(cmap='Blues').format('{:.2g}')

Plot PAGA with velocity

scv.pl.paga( adata, basis='umap', size=50, alpha=0.1, min_edge_width=2, node_size_scale=1.5 )

Complete Workflow Script

import scvelo as scv import scanpy as sc

def run_rna_velocity(adata, n_top_genes=2000, mode='dynamical', n_jobs=4): """ Complete RNA velocity workflow.

Args:
    adata: AnnData with 'spliced' and 'unspliced' layers, UMAP in obsm
    n_top_genes: Number of top HVGs for velocity
    mode: 'stochastic' (fast) or 'dynamical' (accurate)
    n_jobs: Parallel jobs for dynamical model

Returns:
    Processed AnnData with velocity information
"""
scv.settings.verbosity = 2

# 1. Preprocessing
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=n_top_genes)

if 'neighbors' not in adata.uns:
    sc.pp.neighbors(adata, n_neighbors=30)

scv.pp.moments(adata, n_pcs=30, n_neighbors=30)

# 2. Velocity estimation
if mode == 'dynamical':
    scv.tl.recover_dynamics(adata, n_jobs=n_jobs)

scv.tl.velocity(adata, mode=mode)
scv.tl.velocity_graph(adata)

# 3. Downstream analyses
if mode == 'dynamical':
    scv.tl.latent_time(adata)
    scv.tl.rank_velocity_genes(adata, groupby='leiden', min_corr=0.3)

scv.tl.velocity_confidence(adata)
scv.tl.velocity_pseudotime(adata)

return adata

Key Output Fields in AnnData

After running the workflow, the following fields are added:

Location Key Description

adata.layers

velocity

RNA velocity per gene per cell

adata.layers

fit_t

Fitted latent time per gene per cell

adata.obsm

velocity_umap

2D velocity vectors on UMAP

adata.obs

velocity_pseudotime

Pseudotime from velocity

adata.obs

latent_time

Latent time from dynamical model

adata.obs

velocity_length

Speed of each cell

adata.obs

velocity_confidence

Confidence score per cell

adata.var

fit_likelihood

Gene-level model fit quality

adata.var

fit_alpha

Transcription rate

adata.var

fit_beta

Splicing rate

adata.var

fit_gamma

Degradation rate

adata.uns

velocity_graph

Cell-cell transition probability matrix

Velocity Models Comparison

Model Speed Accuracy When to Use

stochastic

Fast Moderate Exploratory; large datasets

deterministic

Medium Moderate Simple linear kinetics

dynamical

Slow High Publication-quality; identifies driver genes

Best Practices

Start with stochastic mode for exploration; switch to dynamical for final analysis
Need good coverage of unspliced reads: Short reads (< 100 bp) may miss intron coverage
Minimum 2,000 cells: RNA velocity is noisy with fewer cells
Velocity should be coherent: Arrows should follow known biology; randomness indicates issues
k-NN bandwidth matters: Too few neighbors → noisy velocity; too many → oversmoothed
Sanity check: Root cells (progenitors) should have high unspliced/spliced ratios for marker genes
Dynamical model requires distinct kinetic states: Works best for clear differentiation processes

Troubleshooting

Problem Solution

Missing unspliced layer Re-run velocyto or use STARsolo with --soloFeatures Gene Velocyto

Very few velocity genes Lower min_shared_counts ; check sequencing depth

Random-looking arrows Try different n_neighbors or velocity model

Memory error with dynamical Set n_jobs=1 ; reduce n_top_genes

Negative velocity everywhere Check that spliced/unspliced layers are not swapped

Additional Resources

scVelo documentation: https://scvelo.readthedocs.io/
Tutorial notebooks: https://scvelo.readthedocs.io/tutorials/
GitHub: https://github.com/theislab/scvelo
Paper: Bergen V et al. (2020) Nature Biotechnology. PMID: 32747759
velocyto (preprocessing): http://velocyto.org/
CellRank (fate prediction, extends scVelo): https://cellrank.readthedocs.io/
dynamo (metabolic labeling alternative): https://dynamo-release.readthedocs.io/

scvelo

Safety Notice

Copy this and send it to your AI assistant to learn

Configure settings

Load data (AnnData with spliced/unspliced layers)

Option A: Load from loom (velocyto output)

Option B: Merge velocyto loom with Scanpy-processed AnnData

Verify layers

obs × var: N × G

layers: 'spliced', 'unspliced' (required)

obsm['X_umap'] (required for visualization)

Filter and normalize (follows Scanpy conventions)

Compute first and second order moments (means and variances)

knn_connectivities must be computed first

Stochastic velocity (faster, less accurate)

Visualize

Recover dynamics (computationally intensive; ~10-30 min for 10K cells)

Compute velocity from dynamical model

Compute latent time

Visualize latent time on UMAP

Identify top genes ordered by latent time

Identify genes with highest velocity fit

Speed and coherence

Phase portraits for specific genes

Arrow plot on UMAP

Stream plot (cleaner visualization)

Velocity pseudotime (alternative to latent time)

PAGA graph with velocity-informed transitions

Plot PAGA with velocity

Source Transparency

Related Skills

research-lookup

market-research-reports

statistical-analysis

research-grants