TCGA Downloader

A Python package and CLI that queries public TCGA (The Cancer Genome Atlas) files and downloads them via gdc-client, with retry logic, checksum verification, and progress logging.

Features

  • Query TCGA files by project, data type, sample type, and platform
  • Generate reproducible manifests (TSV/JSON) for downloads
  • Reliable downloads with automatic retries and checksum verification
  • Data statistics (file count, total size, data types breakdown)
  • Comprehensive logging with verbose mode support
  • Resumable and concurrent downloads via gdc-client

Requirements

  • Python 3.11+
  • gdc-client (external tool, see Installation)
  • pip or uv for package installation

Installation

1. Install gdc-client

Download and install the GDC Data Transfer Tool from https://gdc.cancer.gov/access-data/gdc-data-transfer-tool

Make sure gdc-client is on your PATH:

gdc-client --version

2. Install this package

git clone <repository-url>
cd tcga-downloader
pip install -e .

For development:

pip install -e ".[dev]"

This installs additional development tools:

  • pytest (testing)
  • pytest-cov (coverage)
  • black (code formatting)
  • ruff (linting)
  • mypy (type checking)

Usage

Basic Query

Query TCGA files by project and data type:

tcga-downloader query \
  --project TCGA-BRCA \
  --data-type "Gene Expression" \
  --out manifest.tsv

Advanced Query with Filters

Filter by sample type and platform:

tcga-downloader query \
  --project TCGA-BRCA \
  --data-type "Gene Expression" \
  --sample-type "Primary Tumor" \
  --platform "Illumina HiSeq" \
  --out manifest.tsv
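
How the tool translates these flags internally is not documented here. As a rough illustration, assuming they map onto the filter JSON accepted by the public GDC API files endpoint, the four options could be combined as below. The function name `build_filters` and the exact field names are assumptions, and GDC's own vocabulary for a value may differ from the CLI value (for example, GDC uses "Gene Expression Quantification" as a data type).

```python
import json


def build_filters(project, data_type, sample_type=None, platform=None):
    """Combine the given options into one GDC-style "and" filter, skipping unset ones."""
    # Field names are assumptions based on the public GDC files endpoint.
    candidates = {
        "cases.project.project_id": project,
        "data_type": data_type,
        "cases.samples.sample_type": sample_type,
        "platform": platform,
    }
    clauses = [
        {"op": "in", "content": {"field": field, "value": [value]}}
        for field, value in candidates.items()
        if value is not None
    ]
    return {"op": "and", "content": clauses}


filters = build_filters(
    "TCGA-BRCA",
    "Gene Expression",
    sample_type="Primary Tumor",
    platform="Illumina HiSeq",
)
print(json.dumps(filters, indent=2))
```

Optional filters that are left unset simply produce no clause, which mirrors how the basic query omits --sample-type and --platform.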

Download Files

Download files using a manifest:

tcga-downloader download \
  --manifest manifest.tsv \
  --out-dir ./data
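
The tool verifies checksums itself, but an independent check after a download can be useful. The sketch below assumes the manifest is a TSV with `id`, `filename`, and `md5` columns (the layout of a standard GDC manifest) and that each file lands in `<out-dir>/<id>/<filename>`, as gdc-client does by default. Both are assumptions about this tool's output, and the function names are illustrative.

```python
import csv
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(manifest: Path, out_dir: Path) -> list[str]:
    """Return the IDs of files that are missing or whose MD5 differs from the manifest."""
    bad = []
    with manifest.open(newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            # Assumed layout: <out_dir>/<id>/<filename>
            target = out_dir / row["id"] / row["filename"]
            if not target.is_file() or md5_of(target) != row["md5"]:
                bad.append(row["id"])
    return bad
```

Calling `find_mismatches(Path("manifest.tsv"), Path("./data"))` returns an empty list when every file is present and intact.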

One-Step Query and Download

Query and download in a single command:

tcga-downloader run \
  --project TCGA-BRCA \
  --data-type "Gene Expression" \
  --out manifest.tsv \
  --out-dir ./data

Verbose Mode

Enable detailed logging:

tcga-downloader query \
  --project TCGA-BRCA \
  --data-type "Gene Expression" \
  --out manifest.tsv \
  --verbose

Log to File

Save logs to a file:

tcga-downloader query \
  --project TCGA-BRCA \
  --data-type "Gene Expression" \
  --out manifest.tsv \
  --log-file download.log

Custom Download Options

Adjust concurrency and retry settings:

tcga-downloader download \
  --manifest manifest.tsv \
  --out-dir ./data \
  --processes 8 \
  --retries 5
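
What --retries does internally is not described here. A common pattern for this kind of setting is retrying with exponential backoff, sketched below under that assumption. The name `with_retries`, the delay schedule, and the reading of "retries" as attempts after the first one are all illustrative choices, not the tool's confirmed behavior.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    action: Callable[[], T],
    retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action; on failure wait base_delay * 2**attempt, then try again.

    The action runs at most retries + 1 times. The last failure is re-raised.
    """
    for attempt in range(retries + 1):
        try:
            return action()
        except Exception:
            if attempt == retries:
                raise  # retries exhausted, surface the last error
            sleep(base_delay * 2**attempt)
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the backoff schedule can be tested without actually waiting.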

Downloading Controlled-Access Data

Some TCGA data types (e.g., Clinical Supplement) require controlled access. To download these files:

  1. Obtain an authentication token from the GDC Data Portal.

  2. Pass it with the --token parameter when downloading:

tcga-downloader download \
  --manifest manifest.tsv \
  --out-dir ./data \
  --token /path/to/gdc-user-token.txt

Alternatively, set the token path in your config file:

{
  "download": {
    "out_dir": "./data",
    "processes": 4,
    "retries": 3,
    "token": "/path/to/gdc-user-token.txt"
  }
}
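
To illustrate how such a config might interact with command-line flags, the sketch below merges three layers: built-in defaults, the config file's "download" section, and explicit CLI values. The precedence order (CLI over config over defaults), the default values, and the function name `resolve_download_options` are assumptions for illustration, not the package's documented behavior.

```python
import json

# Hypothetical defaults; the real package may use different values.
DEFAULTS = {"out_dir": "./data", "processes": 4, "retries": 3, "token": None}


def resolve_download_options(config_text: str, cli_overrides: dict) -> dict:
    """Merge defaults, the config's "download" section, and CLI overrides."""
    section = json.loads(config_text).get("download", {})
    # Flags the user did not pass arrive as None and must not mask the config.
    explicit = {key: value for key, value in cli_overrides.items() if value is not None}
    return {**DEFAULTS, **section, **explicit}


config_text = '{"download": {"retries": 3, "token": "/path/to/gdc-user-token.txt"}}'
options = resolve_download_options(config_text, {"processes": 8, "token": None})
print(options)
```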

Note: Open-access data does not require a token. Only controlled-access files (marked as such in the GDC Data Portal) need authentication.

Common TCGA Projects

  • TCGA-BRCA: Breast invasive carcinoma
  • TCGA-LUAD: Lung adenocarcinoma
  • TCGA-COAD: Colon adenocarcinoma
  • TCGA-PRAD: Prostate adenocarcinoma
  • TCGA-SKCM: Skin cutaneous melanoma

Common Data Types

  • Gene Expression
  • Copy Number Variation
  • DNA Methylation
  • miRNA Expression
  • Protein Expression
  • Somatic Mutation

Development

Set up pre-commit hooks

pip install pre-commit
pre-commit install

Run tests

pytest

With coverage:

pytest --cov=tcga_downloader --cov-report=html

Code formatting, linting, and type checking

black .
ruff check --fix .
mypy tcga_downloader

CI/CD

The project uses GitHub Actions for CI:

  • Linting with black, ruff, mypy
  • Testing with pytest
  • Coverage reporting

License

MIT License