AlphaFold Database

Overview

AlphaFold DB is a public repository of AI-predicted 3D protein structures for over 200 million proteins, maintained by DeepMind and EMBL-EBI. Access structure predictions with confidence metrics, download coordinate files, retrieve bulk datasets, and integrate predictions into computational workflows.

When to Use This Skill

This skill should be used when working with AI-predicted protein structures in scenarios such as:

Retrieving protein structure predictions by UniProt ID or protein name
Downloading PDB/mmCIF coordinate files for structural analysis
Analyzing prediction confidence metrics (pLDDT, PAE) to assess reliability
Accessing bulk proteome datasets via Google Cloud Platform
Comparing predicted structures with experimental data
Performing structure-based drug discovery or protein engineering
Building structural models for proteins lacking experimental structures
Integrating AlphaFold predictions into computational pipelines

Core Capabilities

Searching and Retrieving Predictions

Using Biopython (Recommended):

The Biopython library provides the simplest interface for retrieving AlphaFold structures:

from Bio.PDB import alphafold_db

# Get all predictions for a UniProt accession
predictions = list(alphafold_db.get_predictions("P00520"))

# Download structure file (mmCIF format)
for prediction in predictions:
    cif_file = alphafold_db.download_cif_for(prediction, directory="./structures")
    print(f"Downloaded: {cif_file}")

# Get Structure objects directly
structures = list(alphafold_db.get_structural_models_for("P00520"))

Direct API Access:

Query predictions using REST endpoints:

import requests

# Get prediction metadata for a UniProt accession
uniprot_id = "P00520"
api_url = f"https://alphafold.ebi.ac.uk/api/prediction/{uniprot_id}"
response = requests.get(api_url, timeout=30)
response.raise_for_status()  # Fail early on an unknown accession
prediction_data = response.json()

# Extract AlphaFold ID (the response is a list of prediction records)
alphafold_id = prediction_data[0]['entryId']
print(f"AlphaFold ID: {alphafold_id}")

Using UniProt to Find Accessions:

Search UniProt to find protein accessions first:

import urllib.parse
import urllib.request

def get_uniprot_ids(query, query_type='PDB_ID'):
    """Query UniProt to get accession IDs."""
    # Note: 'uploadlists' is the legacy UniProt ID-mapping endpoint. If it no
    # longer responds, use the current REST API at https://rest.uniprot.org/.
    url = 'https://www.uniprot.org/uploadlists/'
    params = {
        'from': query_type,
        'to': 'ACC',
        'format': 'txt',
        'query': query
    }
    data = urllib.parse.urlencode(params).encode('ascii')
    with urllib.request.urlopen(urllib.request.Request(url, data)) as response:
        return response.read().decode('utf-8').splitlines()

# Example: Find UniProt IDs for a protein name
protein_ids = get_uniprot_ids("hemoglobin", query_type="GENE_NAME")

Downloading Structure Files

AlphaFold provides multiple file formats for each prediction:

File Types Available:

Model coordinates (model_v4.cif): Atomic coordinates in mmCIF/PDBx format
Confidence scores (confidence_v4.json): Per-residue pLDDT scores (0-100)
Predicted Aligned Error (predicted_aligned_error_v4.json): PAE matrix for residue pair confidence

Download URLs:

import requests

alphafold_id = "AF-P00520-F1"
version = "v4"

# Model coordinates (mmCIF)
model_url = f"https://alphafold.ebi.ac.uk/files/{alphafold_id}-model_{version}.cif"
response = requests.get(model_url)
with open(f"{alphafold_id}.cif", "w") as f:
    f.write(response.text)

# Confidence scores (JSON)
confidence_url = f"https://alphafold.ebi.ac.uk/files/{alphafold_id}-confidence_{version}.json"
response = requests.get(confidence_url)
confidence_data = response.json()

# Predicted Aligned Error (JSON)
pae_url = f"https://alphafold.ebi.ac.uk/files/{alphafold_id}-predicted_aligned_error_{version}.json"
response = requests.get(pae_url)
pae_data = response.json()

PDB Format (Alternative):

# Download as PDB format instead of mmCIF
pdb_url = f"https://alphafold.ebi.ac.uk/files/{alphafold_id}-model_{version}.pdb"
response = requests.get(pdb_url)
with open(f"{alphafold_id}.pdb", "wb") as f:
    f.write(response.content)

Working with Confidence Metrics

AlphaFold predictions include confidence estimates critical for interpretation:

pLDDT (per-residue confidence):

import requests

# Load confidence scores
alphafold_id = "AF-P00520-F1"
confidence_url = f"https://alphafold.ebi.ac.uk/files/{alphafold_id}-confidence_v4.json"
confidence = requests.get(confidence_url).json()

# Extract pLDDT scores
plddt_scores = confidence['confidenceScore']

# Interpret confidence levels
# pLDDT > 90: Very high confidence
# pLDDT 70-90: High confidence
# pLDDT 50-70: Low confidence
# pLDDT < 50: Very low confidence

high_confidence_residues = [i for i, score in enumerate(plddt_scores) if score > 90]
print(f"High confidence residues: {len(high_confidence_residues)}/{len(plddt_scores)}")
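The four pLDDT bands listed above can be summarized per structure. The following is a minimal sketch, assuming the scores are already loaded as a plain list of numbers. The function names `plddt_band` and `plddt_band_fractions` are illustrative, and assigning the boundary values 50 and 70 to the higher band (and exactly 90 to "high") is a choice made here, not something AlphaFold DB prescribes.

```python
from collections import Counter

BANDS = ("very_high", "high", "low", "very_low")

def plddt_band(score):
    """Map one pLDDT score (0-100) to its confidence band."""
    if score > 90:
        return "very_high"
    if score >= 70:
        return "high"
    if score >= 50:
        return "low"
    return "very_low"

def plddt_band_fractions(plddt_scores):
    """Return the fraction of residues falling in each confidence band."""
    if not plddt_scores:
        raise ValueError("plddt_scores is empty")
    counts = Counter(plddt_band(score) for score in plddt_scores)
    total = len(plddt_scores)
    return {band: counts.get(band, 0) / total for band in BANDS}

# Made-up scores, for illustration only
example_scores = [95.2, 91.0, 88.4, 72.1, 65.0, 48.3, 30.9, 97.5]
fractions = plddt_band_fractions(example_scores)
print(fractions)
```

A large "very_low" fraction is a hint that much of the chain may be disordered, which is worth knowing before running any downstream structural analysis.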

PAE (Predicted Aligned Error):

PAE indicates confidence in relative domain positions:

import requests
import numpy as np
import matplotlib.pyplot as plt

# Load PAE matrix
alphafold_id = "AF-P00520-F1"
pae_url = f"https://alphafold.ebi.ac.uk/files/{alphafold_id}-predicted_aligned_error_v4.json"
pae = requests.get(pae_url).json()

# Visualize PAE matrix
# The v4 PAE file is a one-element list; the matrix sits under 'predicted_aligned_error'
pae_matrix = np.array(pae[0]['predicted_aligned_error'])
plt.figure(figsize=(10, 8))
plt.imshow(pae_matrix, cmap='viridis_r', vmin=0, vmax=30)
plt.colorbar(label='PAE (Å)')
plt.title(f'Predicted Aligned Error: {alphafold_id}')
plt.xlabel('Residue')
plt.ylabel('Residue')
plt.savefig(f'{alphafold_id}_pae.png', dpi=300, bbox_inches='tight')

# Low PAE values (<5 Å) indicate confident relative positioning
# High PAE values (>15 Å) suggest uncertain domain arrangements
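To judge whether two domains are confidently placed relative to each other, one option is to average the PAE values in the off-diagonal blocks that connect them. The following is a minimal sketch, assuming the matrix is a square list of lists (or array) and that the domain boundaries are already known. The function name `mean_interdomain_pae` and the residue ranges are hypothetical, and the toy matrix is synthetic.

```python
def mean_interdomain_pae(pae_matrix, domain_a, domain_b):
    """Average PAE between two residue ranges (1-based, inclusive).

    Both off-diagonal blocks are included because PAE is not symmetric.
    """
    rows_a = range(domain_a[0] - 1, domain_a[1])
    rows_b = range(domain_b[0] - 1, domain_b[1])
    values = [pae_matrix[i][j] for i in rows_a for j in rows_b]
    values += [pae_matrix[j][i] for i in rows_a for j in rows_b]
    return sum(values) / len(values)

# Synthetic 6-residue matrix: two rigid 3-residue domains, uncertain arrangement
toy_pae = [[2.0 if (i < 3) == (j < 3) else 20.0 for j in range(6)]
           for i in range(6)]

within = mean_interdomain_pae(toy_pae, (1, 3), (1, 3))
between = mean_interdomain_pae(toy_pae, (1, 3), (4, 6))
print(f"within domain: {within:.1f} Å, between domains: {between:.1f} Å")
```

In this toy case the within-domain value falls under the 5 Å guideline while the between-domain value exceeds 15 Å, the pattern expected for well-folded domains joined by a flexible linker.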

Bulk Data Access via Google Cloud

For large-scale analyses, use Google Cloud datasets:

Google Cloud Storage:

# Install gsutil
uv pip install gsutil

# List available data
gsutil ls gs://public-datasets-deepmind-alphafold-v4/

# Download entire proteomes (by taxonomy ID)
gsutil -m cp gs://public-datasets-deepmind-alphafold-v4/proteomes/proteome-tax_id-9606-*.tar .

# Download specific files
gsutil cp gs://public-datasets-deepmind-alphafold-v4/accession_ids.csv .

BigQuery Metadata Access:

from google.cloud import bigquery

# Initialize client
client = bigquery.Client()

# Query metadata (the table name needs backticks because the project ID contains hyphens)
query = """
SELECT entryId, uniprotAccession, organismScientificName, globalMetricValue, fractionPlddtVeryHigh
FROM `bigquery-public-data.deepmind_alphafold.metadata`
WHERE organismScientificName = 'Homo sapiens'
  AND fractionPlddtVeryHigh > 0.8
LIMIT 100
"""

results = client.query(query).to_dataframe()
print(f"Found {len(results)} high-confidence human proteins")

Download by Species:

import os
import subprocess

def download_proteome(taxonomy_id, output_dir="./proteomes"):
    """Download all AlphaFold predictions for a species."""
    os.makedirs(output_dir, exist_ok=True)  # Make sure the target directory exists
    pattern = f"gs://public-datasets-deepmind-alphafold-v4/proteomes/proteome-tax_id-{taxonomy_id}-*_v4.tar"
    # Pass arguments as a list so no shell is involved; gsutil expands the wildcard itself
    subprocess.run(["gsutil", "-m", "cp", pattern, f"{output_dir}/"], check=True)

# Download E. coli proteome (tax ID: 83333)
download_proteome(83333)

# Download human proteome (tax ID: 9606)
download_proteome(9606)
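Each downloaded proteome is a tar archive, and it can be read without unpacking it to disk. The following is a minimal sketch, assuming the archive members are gzip-compressed files with names like `AF-<accession>-F1-model_v4.cif.gz`. That layout is an assumption; list a real archive with `tar -tf` to confirm it. The helper name `iter_archive_models` is illustrative, and the demo builds a tiny synthetic archive so the code runs without any download.

```python
import gzip
import io
import os
import tarfile
import tempfile

def iter_archive_models(tar_path, suffix=".cif.gz"):
    """Yield (member_name, decompressed_text) for matching members of a tar archive."""
    with tarfile.open(tar_path, "r") as archive:
        for member in archive:
            if not member.isfile() or not member.name.endswith(suffix):
                continue
            raw = archive.extractfile(member).read()
            yield member.name, gzip.decompress(raw).decode("utf-8")

# Demo on a synthetic archive (assumed layout, made-up content)
with tempfile.TemporaryDirectory() as tmp:
    tar_path = os.path.join(tmp, "demo.tar")
    payload = gzip.compress(b"data_AF-TEST-F1\n")
    with tarfile.open(tar_path, "w") as archive:
        info = tarfile.TarInfo("AF-TEST-F1-model_v4.cif.gz")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    models = list(iter_archive_models(tar_path))

print(models)
```

Streaming members this way avoids writing many small files to disk, which matters for proteome-scale archives.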

Parsing and Analyzing Structures

Work with downloaded AlphaFold structures using BioPython:

from Bio.PDB import MMCIFParser
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Parse mmCIF file
parser = MMCIFParser(QUIET=True)
structure = parser.get_structure("protein", "AF-P00520-F1-model_v4.cif")

# Extract coordinates
coords = []
for model in structure:
    for chain in model:
        for residue in chain:
            if 'CA' in residue:  # Alpha carbons only
                coords.append(residue['CA'].get_coord())

coords = np.array(coords)
print(f"Structure has {len(coords)} residues")

# Calculate distances
distance_matrix = squareform(pdist(coords))

# Identify contacts (< 8 Å); each pair appears twice in the symmetric matrix, hence // 2
contacts = np.where((distance_matrix > 0) & (distance_matrix < 8))
print(f"Number of contacts: {len(contacts[0]) // 2}")

Extract B-factors (pLDDT values):

AlphaFold stores pLDDT scores in the B-factor column:

from Bio.PDB import MMCIFParser

parser = MMCIFParser(QUIET=True)
structure = parser.get_structure("protein", "AF-P00520-F1-model_v4.cif")

# Extract pLDDT from B-factors
plddt_scores = []
for model in structure:
    for chain in model:
        for residue in chain:
            if 'CA' in residue:
                plddt_scores.append(residue['CA'].get_bfactor())

# Identify high-confidence regions
high_conf_regions = [(i, score) for i, score in enumerate(plddt_scores, 1) if score > 90]
print(f"High confidence residues: {len(high_conf_regions)}")
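Individual high-confidence residues are often less useful than contiguous stretches of them. The following is a minimal sketch that merges per-residue scores into ranges, assuming residues are numbered consecutively from 1 in list order. The function name `confident_segments` and the `min_length` cutoff of 3 are illustrative choices, and the example scores are made up.

```python
def confident_segments(plddt_scores, threshold=90.0, min_length=3):
    """Return (start, end) residue ranges (1-based, inclusive) with pLDDT > threshold."""
    segments = []
    start = None
    for position, score in enumerate(plddt_scores, 1):
        if score > threshold:
            if start is None:
                start = position  # Open a new segment
        elif start is not None:
            if position - start >= min_length:
                segments.append((start, position - 1))
            start = None  # Close the current segment
    # Close a segment that runs through the last residue
    if start is not None and len(plddt_scores) - start + 1 >= min_length:
        segments.append((start, len(plddt_scores)))
    return segments

example = [95, 96, 97, 60, 92, 93, 40, 91, 94, 95, 98]
segments = confident_segments(example)
print(segments)
```

The two-residue stretch at positions 5-6 is dropped because it is shorter than `min_length`; lowering that parameter keeps shorter runs.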

Batch Processing Multiple Proteins

Process multiple predictions efficiently:

import numpy as np
import pandas as pd
import requests
from Bio.PDB import alphafold_db

uniprot_ids = ["P00520", "P12931", "P04637"]  # Multiple proteins
results = []

for uniprot_id in uniprot_ids:
    try:
        # Get prediction
        predictions = list(alphafold_db.get_predictions(uniprot_id))

        if predictions:
            pred = predictions[0]

            # Download structure
            cif_file = alphafold_db.download_cif_for(pred, directory="./batch_structures")

            # Get confidence data
            alphafold_id = pred['entryId']
            conf_url = f"https://alphafold.ebi.ac.uk/files/{alphafold_id}-confidence_v4.json"
            conf_data = requests.get(conf_url).json()

            # Calculate statistics
            plddt_scores = conf_data['confidenceScore']
            avg_plddt = np.mean(plddt_scores)
            high_conf_fraction = sum(1 for s in plddt_scores if s > 90) / len(plddt_scores)

            results.append({
                'uniprot_id': uniprot_id,
                'alphafold_id': alphafold_id,
                'avg_plddt': avg_plddt,
                'high_conf_fraction': high_conf_fraction,
                'length': len(plddt_scores)
            })
    except Exception as e:
        print(f"Error processing {uniprot_id}: {e}")

# Create summary DataFrame
df = pd.DataFrame(results)
print(df)

Installation and Setup

Python Libraries

# Install Biopython for structure access
uv pip install biopython

# Install requests for API access
uv pip install requests

# For visualization and analysis
uv pip install numpy matplotlib pandas scipy

# For Google Cloud access (optional)
uv pip install google-cloud-bigquery gsutil

3D-Beacons API Alternative

AlphaFold can also be accessed via the 3D-Beacons federated API:

import requests

# Query via 3D-Beacons
uniprot_id = "P00520"
url = f"https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/api/uniprot/summary/{uniprot_id}.json"
response = requests.get(url)
data = response.json()

# Filter for AlphaFold structures (the provider is nested under each entry's 'summary')
af_structures = [s for s in data['structures'] if s['summary']['provider'] == 'AlphaFold DB']

Common Use Cases

Structural Proteomics

Download complete proteome predictions for analysis
Identify high-confidence structural regions across proteins
Compare predicted structures with experimental data
Build structural models for protein families

Drug Discovery

Retrieve target protein structures for docking studies
Analyze binding site conformations
Identify druggable pockets in predicted structures
Compare structures across homologs

Protein Engineering

Identify stable/unstable regions using pLDDT
Design mutations in high-confidence regions
Analyze domain architectures using PAE
Model protein variants and mutations

Evolutionary Studies

Compare ortholog structures across species
Analyze conservation of structural features
Study domain evolution patterns
Identify functionally important regions

Key Concepts

UniProt Accession: Primary identifier for proteins (e.g., "P00520"). Required for querying AlphaFold DB.

AlphaFold ID: Internal identifier format: AF-[UniProt accession]-F[fragment number] (e.g., "AF-P00520-F1").

pLDDT (predicted Local Distance Difference Test): Per-residue confidence metric (0-100). Higher values indicate more confident predictions.

PAE (Predicted Aligned Error): Matrix indicating confidence in relative positions between residue pairs. Low values (<5 Å) suggest confident relative positioning.

Database Version: Current version is v4. File URLs include version suffix (e.g., model_v4.cif).

Fragment Number: Large proteins may be split into fragments. Fragment number appears in AlphaFold ID (e.g., F1, F2).
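When working with directories of downloaded files, it helps to recover the accession, fragment number, and version from a name. The following is a minimal sketch based on the ID format described above. The names `AlphaFoldName` and `parse_alphafold_name` are illustrative, and the pattern assumes plain alphanumeric accessions, so it would need extending for any accession containing other characters.

```python
import re
from typing import NamedTuple, Optional

class AlphaFoldName(NamedTuple):
    accession: str
    fragment: int
    version: Optional[int]  # None when parsing a bare AlphaFold ID

# Matches "AF-P00520-F1" and file names such as "AF-P00520-F1-model_v4.cif"
_NAME_PATTERN = re.compile(
    r"^AF-(?P<accession>[A-Z0-9]+)-F(?P<fragment>\d+)"
    r"(?:-[a-z_]+_v(?P<version>\d+)\.[a-z]+)?$"
)

def parse_alphafold_name(name):
    """Split an AlphaFold ID or file name into accession, fragment, and version."""
    match = _NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Not an AlphaFold ID or file name: {name!r}")
    version = match.group("version")
    return AlphaFoldName(
        accession=match.group("accession"),
        fragment=int(match.group("fragment")),
        version=int(version) if version is not None else None,
    )

print(parse_alphafold_name("AF-P00520-F1"))
print(parse_alphafold_name("AF-P00520-F1-predicted_aligned_error_v4.json"))
```

Grouping parsed names by accession makes it easy to spot proteins that were split into several fragments.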

Confidence Interpretation Guidelines

pLDDT Thresholds:

>90: Very high confidence - suitable for detailed analysis
70-90: High confidence - generally reliable backbone structure
50-70: Low confidence - use with caution, flexible regions
<50: Very low confidence - likely disordered or unreliable

PAE Guidelines:

<5 Å: Confident relative positioning of domains
5-10 Å: Moderate confidence in arrangement
>15 Å: Uncertain relative positions, domains may be mobile

Resources

references/api_reference.md

Comprehensive API documentation covering:

Complete REST API endpoint specifications
File format details and data schemas
Google Cloud dataset structure and access patterns
Advanced query examples and batch processing strategies
Rate limiting, caching, and best practices
Troubleshooting common issues

Consult this reference for detailed API information, bulk download strategies, or when working with large-scale datasets.

Important Notes

Data Usage and Attribution

AlphaFold DB is freely available under CC-BY-4.0 license
Cite: Jumper et al. (2021) Nature and Varadi et al. (2024) Nucleic Acids Research
Predictions are computational models, not experimental structures
Always assess confidence metrics before downstream analysis

Version Management

Current database version: v4 (as of 2024-2025)
File URLs include version suffix (e.g., _v4.cif)
Check for database updates regularly
Older versions may be deprecated over time
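Because the version suffix changes between releases, hardcoding `v4` in URLs can break after a database update. One option is to read file locations from the prediction metadata and fall back to the URL pattern only when they are absent. The following is a minimal sketch, assuming each metadata record carries `latestVersion`, `cifUrl`, `pdbUrl`, and `paeDocUrl` fields; these field names are assumptions to check against the API documentation, the function name `resolve_file_urls` is illustrative, and the sample record is made up.

```python
def resolve_file_urls(record, fallback_version="v4"):
    """Prefer URLs given in the metadata record; otherwise build them from the pattern."""
    entry_id = record["entryId"]
    if "latestVersion" in record:
        version = f"v{record['latestVersion']}"
    else:
        version = fallback_version
    base = f"https://alphafold.ebi.ac.uk/files/{entry_id}"
    return {
        "version": version,
        "cif": record.get("cifUrl", f"{base}-model_{version}.cif"),
        "pdb": record.get("pdbUrl", f"{base}-model_{version}.pdb"),
        "pae": record.get("paeDocUrl", f"{base}-predicted_aligned_error_{version}.json"),
    }

# Made-up record containing only some of the assumed fields
sample_record = {
    "entryId": "AF-P00520-F1",
    "latestVersion": 4,
    "cifUrl": "https://alphafold.ebi.ac.uk/files/AF-P00520-F1-model_v4.cif",
}
urls = resolve_file_urls(sample_record)
print(urls)
```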

Data Quality Considerations

High pLDDT doesn't guarantee functional accuracy
Low confidence regions may be disordered in vivo
PAE indicates relative domain confidence, not absolute positioning
Predictions lack ligands, post-translational modifications, and cofactors
Multi-chain complexes are not predicted (single chains only)

Performance Tips

Use Biopython for simple single-protein access
Use Google Cloud for bulk downloads (much faster than individual files)
Cache downloaded files locally to avoid repeated downloads
BigQuery free tier: 1 TB processed data per month
Consider network bandwidth for large-scale downloads
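The local caching tip above can be sketched as a small download helper. This is a minimal illustration using only the standard library; the function name `fetch_cached` and the `./af_cache` directory are hypothetical, and the cache key is simply the file name at the end of the URL, which assumes file names are unique across the URLs being fetched.

```python
import urllib.request
from pathlib import Path

def fetch_cached(url, cache_dir="./af_cache", timeout=30):
    """Download url into cache_dir unless a non-empty copy exists; return the local path."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / url.rsplit("/", 1)[-1]
    if target.exists() and target.stat().st_size > 0:
        return target  # Cache hit: no network request is made
    with urllib.request.urlopen(url, timeout=timeout) as response:
        data = response.read()
    # Write under a temporary name first so an interrupted run never leaves
    # a partial file that looks like a complete download
    partial = target.with_name(target.name + ".part")
    partial.write_bytes(data)
    partial.replace(target)
    return target
```

A call such as `fetch_cached("https://alphafold.ebi.ac.uk/files/AF-P00520-F1-model_v4.cif")` downloads the file once and returns the cached path on later calls.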

Additional Resources

AlphaFold DB Website: https://alphafold.ebi.ac.uk/
API Documentation: https://alphafold.ebi.ac.uk/api-docs
Google Cloud Dataset: https://cloud.google.com/blog/products/ai-machine-learning/alphafold-protein-structure-database
3D-Beacons API: https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/
AlphaFold Papers:
Nature (2021): https://doi.org/10.1038/s41586-021-03819-2
Nucleic Acids Research (2024): https://doi.org/10.1093/nar/gkad1011
Biopython Documentation: https://biopython.org/docs/dev/api/Bio.PDB.alphafold_db.html
GitHub Repository: https://github.com/google-deepmind/alphafold
