TCGA Bulk Data Preprocessing with OmicVerse
Overview
Use this skill for loading TCGA data from GDC downloads, building normalised expression matrices, attaching clinical metadata, and running survival analyses through ov.bulk.pyTCGA.
Instructions
- Gather required downloads
Confirm the user has three items from the GDC Data Portal:
  - gdc_sample_sheet.<date>.tsv: the sample sheet export
  - Decompressed gdc_download_xxxxx/ directory with expression archives
  - clinical.cart.<date>/ directory with clinical XML/JSON files
- Initialise the TCGA helper
import omicverse as ov
import scanpy as sc
ov.plot_set()

aml_tcga = ov.bulk.pyTCGA(sample_sheet_path, download_dir, clinical_dir)
aml_tcga.adata_init()  # Builds AnnData with raw counts, FPKM, and TPM layers
- Persist and reload
aml_tcga.adata.write_h5ad('data/ov_tcga_raw.h5ad', compression='gzip')
To reload later:
new_tcga = ov.bulk.pyTCGA(sample_sheet_path, download_dir, clinical_dir)
new_tcga.adata_read('data/ov_tcga_raw.h5ad')
- Initialise metadata and survival
aml_tcga.adata_meta_init()  # Gene ID → symbol mapping, patient info
aml_tcga.survial_init()  # NOTE: "survial" spelling; see Critical API Reference below
- Run survival analysis
Single gene
aml_tcga.survival_analysis('MYC', layer='deseq_normalize', plot=True)
All genes (can take minutes for large gene sets)
aml_tcga.survial_analysis_all() # NOTE: "survial" spelling
- Export results
aml_tcga.adata.write_h5ad('data/ov_tcga_survival.h5ad', compression='gzip')
Critical API Reference
IMPORTANT: Method Name Spelling Inconsistency
The pyTCGA API spells its method names inconsistently. Two methods use "survial" (missing the second 'v'), while one uses the correct "survival":
| Method | Spelling | Purpose |
| --- | --- | --- |
| survial_init() | survial (second 'v' missing) | Initialize survival metadata columns |
| survival_analysis(gene, layer, plot) | survival (correct) | Single-gene Kaplan-Meier curve |
| survial_analysis_all() | survial (second 'v' missing) | Sweep all genes for survival significance |
CORRECT — use the exact method names as documented
aml_tcga.survial_init()  # "survial" (second 'v' missing)
aml_tcga.survival_analysis('MYC', layer='deseq_normalize', plot=True)  # "survival" (correct)
aml_tcga.survial_analysis_all()  # "survial" (second 'v' missing)
WRONG — these will raise AttributeError
aml_tcga.survival_init() # AttributeError! Use survial_init()
aml_tcga.survival_analysis_all() # AttributeError! Use survial_analysis_all()
Survival Analysis Methodology
survival_analysis() performs Kaplan-Meier analysis:
- Splits patients into high/low expression groups using the median as cutoff
- Computes a log-rank test p-value to assess significance
- If plot=True, renders survival curves with confidence intervals
Layer selection matters: Use layer='deseq_normalize' (recommended) because DESeq2 normalization accounts for library size and composition bias, making expression comparable across samples. Alternative: layer='tpm' for TPM-normalized values.
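To make the median split and log-rank test described above concrete, here is a minimal standard-library sketch. It is not OmicVerse's implementation: the function names `median_split` and `logrank_test` are hypothetical, placing values equal to the median in the low group is an assumption, and the p-value comes from a chi-square distribution with one degree of freedom.

```python
import math
from statistics import median


def median_split(expression):
    """Return True for samples above the median (high group), else False.

    Assumption: values equal to the median fall into the low group.
    """
    cutoff = median(expression)
    return [value > cutoff for value in expression]


def logrank_test(times, events, is_high):
    """Two-group log-rank test; returns (chi-square statistic, p-value).

    times   -- follow-up time per patient
    events  -- truthy if the event (death) was observed, falsy if censored
    is_high -- group membership, e.g. the output of median_split()
    """
    observed_minus_expected = 0.0
    variance = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        # Patients still at risk just before time t
        n = sum(1 for ti in times if ti >= t)
        n_high = sum(1 for ti, g in zip(times, is_high) if ti >= t and g)
        # Events occurring exactly at time t
        d = sum(1 for ti, e in zip(times, events) if ti == t and e)
        d_high = sum(1 for ti, e, g in zip(times, events, is_high)
                     if ti == t and e and g)
        observed_minus_expected += d_high - d * n_high / n
        if n > 1:
            # Hypergeometric variance of the high-group event count
            variance += d * (n_high / n) * (1 - n_high / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of chi-square with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Illustrative values only, not real TCGA data
expression = [2.1, 8.4, 3.3, 9.0, 1.7, 7.6]
days = [900, 200, 750, 150, 1100, 320]
died = [1, 1, 0, 1, 1, 1]
chi2, p_value = logrank_test(days, died, median_split(expression))
print(f"chi2={chi2:.2f}, p={p_value:.4f}")
```

A small p-value indicates that the high and low expression groups have different survival curves; it does not say which group fares better, which is what the plotted curves show.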
Defensive Validation Patterns
import os
Before pyTCGA init: verify all paths exist
for name, path in [('sample_sheet', sample_sheet_path),
                   ('downloads', download_dir),
                   ('clinical', clinical_dir)]:
    if not os.path.exists(path):
        raise FileNotFoundError(f"TCGA {name} path not found: {path}")
After adata_init(): verify expected layers were created
expected_layers = ['counts', 'fpkm', 'tpm']
for layer in expected_layers:
    if layer not in aml_tcga.adata.layers:
        print(f"WARNING: Missing layer '{layer}'; check if TCGA archives are fully extracted")
Before survival analysis: verify metadata is initialized
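# Heuristic: fewer than 5 obs columns suggests clinical metadata is not attached yet.
# (Checking dir(aml_tcga) for 'survial_init' is not informative: the method always exists.)
if aml_tcga.adata.obs.shape[1] < 5:
    print("WARNING: Run adata_meta_init() and survial_init() before survival analysis")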
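In the same spirit, the sample sheet header can be checked before initialisation. The sketch below uses only the standard library; `missing_sample_sheet_columns` is a hypothetical helper, and the column names in the usage comment are an assumption about what a GDC sample sheet contains, not a list of what pyTCGA requires.

```python
import csv


def missing_sample_sheet_columns(path, required):
    """Return the required column names absent from a tab-separated sheet.

    A comma-separated file parsed with a tab delimiter yields one long
    header cell, so every required column is reported as missing.
    """
    with open(path, newline='', encoding='utf-8') as handle:
        header = next(csv.reader(handle, delimiter='\t'), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Hypothetical usage; adjust the names to the columns your export contains:
# missing = missing_sample_sheet_columns(sample_sheet_path,
#                                        ['File ID', 'File Name', 'Sample ID'])
# if missing:
#     print(f"WARNING: sample sheet lacks columns: {missing}")
```

Returning the missing names rather than raising keeps the check usable both as a warning and as a hard failure, depending on the caller.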
Troubleshooting
- AttributeError: 'pyTCGA' object has no attribute 'survival_init': Use the misspelled name survial_init() (missing the second 'v'). The same applies to survial_analysis_all(). See Critical API Reference above.
- KeyError during adata_meta_init(): Gene IDs in the expression matrix don't match the expected format. TCGA uses ENSG IDs; the method maps them to symbols internally. Ensure archives are from the same GDC download.
- Empty survival plot or NaN p-values: Clinical XML files are missing date fields (days_to_death, days_to_last_follow_up). Check that the clinical.cart.* directory contains complete XML files, not just metadata JSONs.
- survial_analysis_all() runs very slowly: This tests every gene individually. For a genome with ~20,000 genes, expect 5-15 minutes. Consider filtering to genes of interest first.
- Sample sheet column mismatch: Verify the TSV uses tab separators and the header row matches GDC's expected format. Re-download from GDC if column names differ.
- Missing deseq_normalize layer: This layer is created during adata_meta_init(). If absent, re-run the metadata initialization step.
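For the slow all-gene sweep noted above, one way to restrict the work to genes of interest is to loop over a short list and call the single-gene method. This is a minimal sketch: `sweep_selected_genes` is a hypothetical helper, not part of OmicVerse, and because the return value of survival_analysis() is not documented here, results are stored exactly as returned.

```python
def sweep_selected_genes(genes, analyse):
    """Call analyse(gene) for each gene; collect results and failures separately.

    analyse -- any callable taking a gene symbol, for example a lambda
               wrapping aml_tcga.survival_analysis (see the comment below).
    """
    results, failures = {}, {}
    for gene in genes:
        try:
            results[gene] = analyse(gene)
        except Exception as exc:  # e.g. symbol absent from the expression matrix
            failures[gene] = repr(exc)
    return results, failures


# Hypothetical usage with the object built earlier in this guide:
# results, failures = sweep_selected_genes(
#     ['MYC', 'TP53', 'FLT3'],
#     lambda gene: aml_tcga.survival_analysis(gene, layer='deseq_normalize', plot=False),
# )
```

Collecting failures instead of stopping at the first one means a single unmapped symbol does not abort the whole run.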
Examples
- "Read my TCGA OV download, initialise metadata, and plot MYC survival curves using DESeq-normalised counts."
- "Reload a saved AnnData file, attach survival annotations, and export the updated .h5ad."
- "Run survival analysis for all genes and store the enriched dataset."
References
- Tutorial notebook: t_tcga.ipynb
- Quick copy/paste commands: reference.md