TCGA Downloader
A Python package and CLI for querying public TCGA (The Cancer Genome Atlas) files and downloading them via gdc-client, with retry logic, checksum verification, and progress logging.
Features
- Query TCGA files by project, data type, sample type, and platform
- Generate reproducible manifests (TSV/JSON) for downloads
- Reliable downloads with automatic retries and checksum verification
- Data statistics (file count, total size, data types breakdown)
- Comprehensive logging with verbose mode support
- Resumable and concurrent downloads via gdc-client
Requirements
- Python 3.11+
- gdc-client (external tool, see Installation)
- pip or uv for package installation
Installation
1. Install gdc-client
Download and install the GDC Data Transfer Tool from https://gdc.cancer.gov/access-data/gdc-data-transfer-tool
Make sure gdc-client is on your PATH:
gdc-client --version
2. Install this package
git clone <repository-url>
cd tcga-downloader
pip install -e .
For development:
pip install -e ".[dev]"
This installs additional development tools:
- pytest (testing)
- pytest-cov (coverage)
- black (code formatting)
- ruff (linting)
- mypy (type checking)
Usage
Basic Query
Query TCGA files by project and data type:
tcga-downloader query \
--project TCGA-BRCA \
--data-type "Gene Expression" \
--out manifest.tsv
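Once a manifest exists, the file count and total size listed under Features can be derived from it. The sketch below assumes the TSV manifest has GDC-style columns (id, filename, md5, size); the actual column names written by tcga-downloader are not documented here, and summarize_manifest is a hypothetical helper, not part of the package.

```python
import csv
import io


def summarize_manifest(handle):
    """Return (file_count, total_bytes) for a tab-separated manifest.

    Assumes a header row containing a 'size' column in bytes.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    count = 0
    total = 0
    for row in reader:
        count += 1
        total += int(row["size"])
    return count, total


# Made-up rows for illustration only; these are not real GDC identifiers.
sample = io.StringIO(
    "id\tfilename\tmd5\tsize\n"
    "example-id-1\ta.tsv\td41d8cd98f00b204e9800998ecf8427e\t1024\n"
    "example-id-2\tb.tsv\td41d8cd98f00b204e9800998ecf8427e\t2048\n"
)
count, total = summarize_manifest(sample)
print(f"{count} files, {total} bytes")
```

For a real manifest, pass an open file instead of the in-memory sample, for example open("manifest.tsv", newline="").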
Advanced Query with Filters
Filter by sample type and platform:
tcga-downloader query \
--project TCGA-BRCA \
--data-type "Gene Expression" \
--sample-type "Primary Tumor" \
--platform "Illumina HiSeq" \
--out manifest.tsv
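For readers curious how such filters could translate into a request, here is a minimal sketch of a filter payload in the style of the public GDC API files endpoint. The field names are GDC field names, but how tcga-downloader maps its flags onto them is an assumption, and build_filters is a hypothetical helper. The values are copied from the command above; whether the tool translates them into GDC's own vocabulary is not known.

```python
import json

# GDC files-endpoint field names; the flag-to-field mapping is an assumption.
FIELD_MAP = {
    "project": "cases.project.project_id",
    "data_type": "data_type",
    "sample_type": "cases.samples.sample_type",
    "platform": "platform",
}


def build_filters(**criteria):
    """Combine every non-empty criterion into one GDC-style 'and' filter."""
    clauses = [
        {"op": "in", "content": {"field": FIELD_MAP[key], "value": [value]}}
        for key, value in criteria.items()
        if value is not None
    ]
    return {"op": "and", "content": clauses}


filters = build_filters(
    project="TCGA-BRCA",
    data_type="Gene Expression",
    sample_type="Primary Tumor",
    platform="Illumina HiSeq",
)
print(json.dumps(filters, indent=2))
```

Omitted or None criteria simply drop out of the filter, which mirrors how the optional flags behave on the command line.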
Download Files
Download files using a manifest:
tcga-downloader download \
--manifest manifest.tsv \
--out-dir ./data
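Checksum verification after a download can be pictured roughly as follows. This is a sketch under the assumption that the manifest carries an MD5 digest per file (as GDC manifests do); md5_of and verify are hypothetical names, not the package's API.

```python
import hashlib
import os
import tempfile


def md5_of(path, chunk_size=1 << 20):
    """Hash the file in 1 MiB chunks so large files never sit fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True when the file's MD5 matches the expected hex digest."""
    return md5_of(path) == expected_md5.strip().lower()


# Self-contained demo on a temporary file rather than a real download.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.bin")
    with open(path, "wb") as handle:
        handle.write(b"hello")
    expected = hashlib.md5(b"hello").hexdigest()
    ok = verify(path, expected)
    bad = verify(path, "0" * 32)

print(ok, bad)
```

A file that fails verification would typically be deleted and fetched again.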
One-Step Query and Download
Query and download in a single command:
tcga-downloader run \
--project TCGA-BRCA \
--data-type "Gene Expression" \
--out manifest.tsv \
--out-dir ./data
Verbose Mode
Enable detailed logging:
tcga-downloader query \
--project TCGA-BRCA \
--data-type "Gene Expression" \
--out manifest.tsv \
--verbose
Log to File
Save logs to a file:
tcga-downloader query \
--project TCGA-BRCA \
--data-type "Gene Expression" \
--out manifest.tsv \
--log-file download.log
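The two logging flags can be thought of as switches on a standard Python logging setup. The sketch below is one way to wire them up, assuming --verbose lowers the level to DEBUG and --log-file adds a file handler; configure_logging and the logger name are hypothetical and may differ from the package's internals.

```python
import logging


def configure_logging(verbose=False, log_file=None):
    """Set up a logger that writes to the console and optionally to a file."""
    logger = logging.getLogger("tcga_downloader")
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)
    logger.handlers.clear()  # avoid duplicate output on repeated calls

    handlers = [logging.StreamHandler()]
    if log_file:
        handlers.append(logging.FileHandler(log_file))

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


logger = configure_logging(verbose=True)
logger.debug("querying project %s", "TCGA-BRCA")
```

Passing log_file="download.log" would add a second handler so the same messages also land in that file.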
Custom Download Options
Adjust concurrency and retry settings:
tcga-downloader download \
--manifest manifest.tsv \
--out-dir ./data \
--processes 8 \
--retries 5
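To make the two options concrete, here is a rough sketch of how a wrapper might build the gdc-client command and retry it with exponential backoff. The gdc-client flags used (-m, -d, -n, -t) are that tool's own; everything else, including the function names, the backoff schedule, and the reading of --retries as a total attempt count, is an assumption rather than the package's documented behavior.

```python
import time


def build_command(manifest, out_dir, processes=4, token=None):
    """Assemble a gdc-client download invocation as an argument list."""
    cmd = ["gdc-client", "download", "-m", manifest, "-d", out_dir, "-n", str(processes)]
    if token:
        cmd += ["-t", token]
    return cmd


def with_retries(action, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call action() up to `retries` times, doubling the wait after each failure."""
    for attempt in range(1, retries + 1):
        try:
            return action()
        except Exception:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))


# Demo with a stand-in action that fails twice before succeeding.
calls = []


def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient failure")
    return "done"


result = with_retries(flaky, retries=5, sleep=lambda seconds: None)
command = build_command("manifest.tsv", "./data", processes=8)
print(result, len(calls), command)
```

In real use the action would be something like lambda: subprocess.run(command, check=True), so that a non-zero exit from gdc-client triggers the next attempt.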
Downloading Controlled-Access Data
Some TCGA data (e.g., aligned sequencing reads in BAM format) is controlled-access. To download these files:
- Obtain an authentication token from the GDC Data Portal:
  - Log in to https://portal.gdc.cancer.gov/
  - Click on your username → "Download Token"
  - Save the token file to a secure location
- Use the --token parameter when downloading:
tcga-downloader download \
--manifest manifest.tsv \
--out-dir ./data \
--token /path/to/gdc-user-token.txt
Alternatively, set the token path in your config file:
{
"download": {
"out_dir": "./data",
"processes": 4,
"retries": 3,
"token": "/path/to/gdc-user-token.txt"
}
}
Note: Open-access data does not require a token. Only controlled-access files (marked as such in the GDC Data Portal) need authentication.
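A config file like the one above is typically merged over built-in defaults. The sketch below shows one way to do that, assuming the defaults match the example values and that a missing token means open-access only; DEFAULTS and load_config are hypothetical, and the package's real loader and config file location are not documented here.

```python
import json

# Assumed defaults, taken from the example config values above.
DEFAULTS = {
    "download": {"out_dir": "./data", "processes": 4, "retries": 3, "token": None},
}


def load_config(text):
    """Parse JSON config text and overlay it on a copy of the defaults."""
    user = json.loads(text)
    merged = {section: dict(values) for section, values in DEFAULTS.items()}
    for section, values in user.items():
        merged.setdefault(section, {}).update(values)
    return merged


config = load_config(
    '{"download": {"processes": 8, "token": "/path/to/gdc-user-token.txt"}}'
)
print(config["download"])
```

Keys absent from the user's file keep their default values, so a config only needs to list what it changes.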
Common TCGA Projects
- TCGA-BRCA: Breast invasive carcinoma
- TCGA-LUAD: Lung adenocarcinoma
- TCGA-COAD: Colon adenocarcinoma
- TCGA-PRAD: Prostate adenocarcinoma
- TCGA-SKCM: Skin cutaneous melanoma
Common Data Types
- Gene Expression
- Copy Number Variation
- DNA Methylation
- miRNA Expression
- Protein Expression
- Somatic Mutation
Development
Set up pre-commit hooks
pip install pre-commit
pre-commit install
Run tests
pytest
With coverage:
pytest --cov=tcga_downloader --cov-report=html
Code formatting, linting, and type checking
black .
ruff check --fix .
mypy tcga_downloader
CI/CD
The project uses GitHub Actions for CI:
- Linting with black, ruff, mypy
- Testing with pytest
- Coverage reporting
License
MIT License