GEO Data

Query and access Gene Expression Omnibus datasets using Biopython's Entrez module.

Required Setup

from Bio import Entrez

Entrez.email = 'your.email@example.com' # Required by NCBI Entrez.api_key = 'your_api_key' # Optional

GEO Database Types

Database db value Description

GEO DataSets gds

Curated datasets (GDS*)

GEO Profiles geoprofiles

Individual gene profiles

GEO Record Types:

Prefix Type Description

GSE Series Complete study/experiment

GSM Sample Individual sample

GPL Platform Array/sequencing platform

GDS DataSet Curated, normalized dataset

Searching GEO

Search GEO DataSets (GDS)

from Bio import Entrez

Entrez.email = 'your.email@example.com'

Search curated datasets

handle = Entrez.esearch(db='gds', term='breast cancer AND Homo sapiens[orgn]', retmax=10) record = Entrez.read(handle) handle.close()

print(f"Found {record['Count']} datasets") print(f"IDs: {record['IdList']}")

Search GEO Series (GSE)

Search GEO Series via gds database

Use entry_type filter

handle = Entrez.esearch(db='gds', term='RNA-seq[title] AND human[orgn] AND gse[entry_type]', retmax=10) record = Entrez.read(handle) handle.close()

Common Search Fields

Field Description Example

[orgn]

Organism human[orgn]

[title]

Dataset title breast cancer[title]

[description]

Description text stem cell[description]

[platform]

Platform GPL GPL570[platform]

[entry_type]

Record type gse[entry_type] , gds[entry_type]

[gdstype]

Study type expression profiling[gdstype]

[pubmed]

PubMed ID 35412348[pubmed]

[pdat]

Publication date 2024[pdat]

GDS Types

Expression profiling by array

term = 'expression profiling by array[gdstype] AND cancer'

RNA-seq expression

term = 'expression profiling by high throughput sequencing[gdstype]'

ChIP-seq

term = 'genome binding/occupancy profiling[gdstype]'

Fetching GEO Information

Get GEO DataSet Summary

Fetch summary for GDS records

handle = Entrez.esummary(db='gds', id='200024320') record = Entrez.read(handle) handle.close()

summary = record[0] print(f"Accession: {summary['Accession']}") print(f"Title: {summary['title']}") print(f"Summary: {summary['summary'][:200]}...") print(f"Organism: {summary['taxon']}") print(f"Platform: {summary['GPL']}") print(f"Samples: {summary['n_samples']}")

Summary Fields

summary['Accession'] # GSE/GDS accession summary['title'] # Dataset title summary['summary'] # Description summary['taxon'] # Organism summary['GPL'] # Platform ID summary['n_samples'] # Number of samples summary['FTPLink'] # FTP download link summary['PubMedIds'] # Associated publications summary['gdsType'] # Dataset type summary['ptechType'] # Platform technology

Code Patterns

Search and List GEO Series

from Bio import Entrez

Entrez.email = 'your.email@example.com'

def search_geo(term, entry_type='gse', max_results=20): full_term = f'{term} AND {entry_type}[entry_type]' handle = Entrez.esearch(db='gds', term=full_term, retmax=max_results) search = Entrez.read(handle) handle.close()

if not search['IdList']:
    return []

handle = Entrez.esummary(db='gds', id=','.join(search['IdList']))
summaries = Entrez.read(handle)
handle.close()

results = []
for s in summaries:
    results.append({
        'accession': s['Accession'],
        'title': s['title'],
        'organism': s['taxon'],
        'samples': s['n_samples'],
        'platform': s['GPL']
    })
return results

datasets = search_geo('breast cancer RNA-seq AND human[orgn]') for ds in datasets: print(f"{ds['accession']}: {ds['title'][:60]}... ({ds['samples']} samples)")

Find RNA-Seq Datasets

def find_rnaseq_datasets(organism, keywords, max_results=20): term = f'{keywords} AND {organism}[orgn] AND expression profiling by high throughput sequencing[gdstype] AND gse[entry_type]'

handle = Entrez.esearch(db='gds', term=term, retmax=max_results)
search = Entrez.read(handle)
handle.close()

if not search['IdList']:
    return []

handle = Entrez.esummary(db='gds', id=','.join(search['IdList']))
summaries = Entrez.read(handle)
handle.close()

return summaries

datasets = find_rnaseq_datasets('Homo sapiens', 'COVID-19') for ds in datasets: print(f"{ds['Accession']}: {ds['n_samples']} samples - {ds['title'][:50]}...")

Get GSE Download Link

def get_geo_ftp(gse_accession): '''Get FTP download link for a GSE''' handle = Entrez.esearch(db='gds', term=f'{gse_accession}[accn]') search = Entrez.read(handle) handle.close()

if not search['IdList']:
    return None

handle = Entrez.esummary(db='gds', id=search['IdList'][0])
summary = Entrez.read(handle)[0]
handle.close()

return summary.get('FTPLink')

ftp_link = get_geo_ftp('GSE123456') print(f"Download from: {ftp_link}")

Link GEO to SRA

Many GEO RNA-seq datasets have associated SRA data.

def geo_to_sra(gse_accession): '''Find SRA runs associated with a GEO series''' # Search GEO handle = Entrez.esearch(db='gds', term=f'{gse_accession}[accn]') search = Entrez.read(handle) handle.close()

if not search['IdList']:
    return []

# Link to SRA
handle = Entrez.elink(dbfrom='gds', db='sra', id=search['IdList'][0])
links = Entrez.read(handle)
handle.close()

if not links[0]['LinkSetDb']:
    return []

sra_ids = [link['Id'] for link in links[0]['LinkSetDb'][0]['Link']]

# Get SRA accessions
handle = Entrez.esummary(db='sra', id=','.join(sra_ids[:50]))
summaries = Entrez.read(handle)
handle.close()

runs = []
for s in summaries:
    expxml = s.get('ExpXml', '')
    if 'SRR' in str(expxml) or 'SRX' in str(expxml):
        runs.append(s)
return runs

sra_data = geo_to_sra('GSE123456') print(f"Found {len(sra_data)} SRA records")

Search by PubMed ID

def geo_from_pubmed(pmid): '''Find GEO datasets associated with a publication''' handle = Entrez.elink(dbfrom='pubmed', db='gds', id=pmid) links = Entrez.read(handle) handle.close()

if not links[0]['LinkSetDb']:
    return []

gds_ids = [link['Id'] for link in links[0]['LinkSetDb'][0]['Link']]

handle = Entrez.esummary(db='gds', id=','.join(gds_ids))
summaries = Entrez.read(handle)
handle.close()

return summaries

datasets = geo_from_pubmed('35412348') for ds in datasets: print(f"{ds['Accession']}: {ds['title']}")

Download GEO Data (GEOparse)

For full data parsing, use the GEOparse library:

pip install GEOparse

import GEOparse

Download and parse GSE

gse = GEOparse.get_GEO('GSE123456')

Access metadata

print(f"Title: {gse.metadata['title'][0]}") print(f"Samples: {len(gse.gsms)}")

Get sample metadata

for gsm_name, gsm in gse.gsms.items(): print(f"{gsm_name}: {gsm.metadata['title'][0]}")

Get expression table

if gse.gpls: gpl_name = list(gse.gpls.keys())[0] expression_table = gse.pivot_samples('VALUE')

Download Options

Direct FTP Download

Download entire GSE

wget -r -np -nd ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/

Download specific file types

wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/suppl/counts.txt.gz

Series Matrix Files

import gzip import urllib.request

def download_series_matrix(gse): '''Download series matrix file''' gse_prefix = gse[:len(gse)-3] + 'nnn' url = f'https://ftp.ncbi.nlm.nih.gov/geo/series/{gse_prefix}/{gse}/matrix/{gse}_series_matrix.txt.gz'

filename = f'{gse}_series_matrix.txt.gz'
urllib.request.urlretrieve(url, filename)
return filename

Common Errors

Error Cause Solution

Empty results Wrong entry_type Add gse[entry_type] or gds[entry_type]

No FTPLink Superseries or no data Check if series has supplementary files

No SRA link Microarray data SRA only for sequencing data

Decision Tree

Need GEO expression data? ├── Looking for curated datasets? │ └── Search gds with [entry_type]=gds ├── Looking for any experiment? │ └── Search gds with [entry_type]=gse ├── Want RNA-seq specifically? │ └── Add 'expression profiling by high throughput sequencing[gdstype]' ├── Have a publication? │ └── Link pubmed -> gds ├── Need raw sequencing data? │ └── Link gds -> sra, then use sra-data skill ├── Need processed expression matrix? │ └── Download series matrix or use GEOparse └── Need full metadata? └── Use GEOparse library

Related Skills

entrez-search - General database searching
entrez-link - Link GEO to SRA and other databases
sra-data - Download raw sequencing data from linked SRA
batch-downloads - Download multiple GEO records

bio-geo-data

Safety Notice

Copy this and send it to your AI assistant to learn

Search curated datasets

Search GEO Series via gds database

Use entry_type filter

Expression profiling by array

RNA-seq expression

ChIP-seq

Fetch summary for GDS records

pip install GEOparse

Download and parse GSE

Access metadata

Get sample metadata

Get expression table

Download entire GSE

Download specific file types

Source Transparency

Related Skills

bio-clinical-databases-dbsnp-queries

bio-workflows-clip-pipeline

bio-clinical-databases-variant-prioritization

bio-clip-seq-clip-peak-calling