tcga-downloader/README.md
yunpeng.zhang a01a59b371
feat: add interactive cli
2026-02-09 13:13:39 +08:00


# TCGA Downloader
A Python package and CLI for querying public TCGA (The Cancer Genome Atlas) files and downloading them via gdc-client, with retry logic, checksum verification, and progress logging.
## Features
- Query TCGA files by project, data type, sample type, and platform
- Generate reproducible manifests (TSV/JSON) for downloads
- Reliable downloads with automatic retries and checksum verification
- Data statistics (file count, total size, data types breakdown)
- Comprehensive logging with verbose mode support
- Resumable and concurrent downloads via gdc-client
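
As a rough illustration of the statistics feature, the sketch below computes file count, total size, and a per-data-type breakdown from a TSV manifest. The column names `size` and `data_type` are assumptions made for this example; the manifest this package writes may use different headers.

```python
import csv
import io
from collections import Counter


def summarize_manifest(tsv_text: str) -> dict:
    """Return file count, total size in bytes, and a count per data type."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    return {
        "file_count": len(rows),
        # Assumed column names; adjust to the manifest's real header.
        "total_size": sum(int(row["size"]) for row in rows),
        "data_types": dict(Counter(row["data_type"] for row in rows)),
    }


# Hypothetical manifest content, for illustration only.
sample = (
    "id\tfilename\tsize\tdata_type\n"
    "a1\tx.tsv\t100\tGene Expression\n"
    "b2\ty.tsv\t250\tGene Expression\n"
    "c3\tz.tsv\t50\tDNA Methylation\n"
)
stats = summarize_manifest(sample)
print(stats)
```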
## Requirements
- Python 3.11+
- gdc-client (external tool, see [Installation](#installation))
- pip or uv for package installation
## Installation
### 1. Install gdc-client
Download and install the GDC Data Transfer Tool from [https://gdc.cancer.gov/access-data/gdc-data-transfer-tool](https://gdc.cancer.gov/access-data/gdc-data-transfer-tool).
Make sure `gdc-client` is on your PATH:
```bash
gdc-client version
```
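
The same check can be done from Python before starting a long download. This is a minimal sketch using the standard library; `find_gdc_client` is a hypothetical helper, not part of this package's API.

```python
import shutil


def find_gdc_client(name: str = "gdc-client") -> str:
    """Return the full path of the executable, or raise if it is not on PATH."""
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(
            f"{name!r} was not found on PATH; install the GDC Data Transfer Tool first"
        )
    return path
```

Calling `find_gdc_client()` early gives a clear error message instead of a failure partway through a run.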
### 2. Install this package
```bash
git clone <repository-url>
cd tcga-downloader
pip install -e .
```
For development:
```bash
pip install -e ".[dev]"
```
This installs additional development tools:
- pytest (testing)
- pytest-cov (coverage)
- black (code formatting)
- ruff (linting)
- mypy (type checking)
## Usage
### Basic Query
Query TCGA files by project and data type:
```bash
tcga-downloader query \
--project TCGA-BRCA \
--data-type "Gene Expression" \
--out manifest.tsv
```
### Advanced Query with Filters
Filter by sample type and platform:
```bash
tcga-downloader query \
--project TCGA-BRCA \
--data-type "Gene Expression" \
--sample-type "Primary Tumor" \
--platform "Illumina HiSeq" \
--out manifest.tsv
```
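
For readers curious how such a query could translate to the GDC API, here is a sketch that builds a filter object in the nested `op`/`content` format accepted by the GDC `files` endpoint. The field names are GDC ones, but how this package maps its CLI options (including values like `"Gene Expression"`) onto them is an assumption; `build_gdc_filters` is a hypothetical helper.

```python
import json


def build_gdc_filters(project, data_type, sample_type=None, platform=None):
    """Combine the given criteria into a single GDC-style 'and' filter."""
    criteria = {
        "cases.project.project_id": project,
        "data_type": data_type,
        "cases.samples.sample_type": sample_type,
        "platform": platform,
    }
    clauses = [
        {"op": "in", "content": {"field": field, "value": [value]}}
        for field, value in criteria.items()
        if value is not None  # optional filters are skipped when unset
    ]
    return {"op": "and", "content": clauses}


filters = build_gdc_filters(
    "TCGA-BRCA", "Gene Expression", sample_type="Primary Tumor"
)
print(json.dumps(filters, indent=2))
```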
### Download Files
Download files using a manifest:
```bash
tcga-downloader download \
--manifest manifest.tsv \
--out-dir ./data
```
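
Checksum verification after a download can be pictured roughly as follows. This sketch assumes each manifest entry carries an MD5 checksum, as GDC manifests do; the helper names are hypothetical and the package's actual verification code may differ.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large downloads are not read into memory at once."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: Path, expected_md5: str) -> bool:
    """True only if the file exists and its MD5 matches the expected value."""
    return path.is_file() and md5_of(path) == expected_md5.lower()


# Small self-check with a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"
    sample_path.write_bytes(b"hello")
    ok = verify_file(sample_path, "5d41402abc4b2a76b9719d911017c592")
    bad = verify_file(sample_path, "0" * 32)
print(ok, bad)
```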
### One-Step Query and Download
Query and download in a single command:
```bash
tcga-downloader run \
--project TCGA-BRCA \
--data-type "Gene Expression" \
--out manifest.tsv \
--out-dir ./data
```
### Verbose Mode
Enable detailed logging:
```bash
tcga-downloader query \
--project TCGA-BRCA \
--data-type "Gene Expression" \
--out manifest.tsv \
--verbose
```
### Log to File
Save logs to a file:
```bash
tcga-downloader query \
--project TCGA-BRCA \
--data-type "Gene Expression" \
--out manifest.tsv \
--log-file download.log
```
### Custom Download Options
Adjust concurrency and retry settings:
```bash
tcga-downloader download \
--manifest manifest.tsv \
--out-dir ./data \
--processes 8 \
--retries 5
```
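
The following sketch shows one way these options could be passed to gdc-client, with a retry loop around the call. It uses gdc-client's `-m`, `-d`, `-n`, and `-t` options; the exponential backoff and the reading of `retries` as a total attempt count are assumptions, and both helper names are hypothetical.

```python
import subprocess
import time


def build_download_command(manifest, out_dir, processes=4, token=None):
    """Assemble a gdc-client invocation as an argument list."""
    command = ["gdc-client", "download", "-m", manifest, "-d", out_dir, "-n", str(processes)]
    if token:
        command += ["-t", token]
    return command


def run_with_retries(action, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call action() until it succeeds or `retries` attempts have failed."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(1, retries + 1):
        try:
            return action()
        except (subprocess.CalledProcessError, OSError):
            if attempt == retries:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # wait 1s, 2s, 4s, ...
```

A call might look like `run_with_retries(lambda: subprocess.run(build_download_command("manifest.tsv", "./data", processes=8), check=True), retries=5)`.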
### Downloading Controlled-Access Data
Some TCGA files (for example, aligned sequencing reads) are controlled-access. To download them:
1. **Obtain an authentication token** from the GDC Data Portal:
   - Log in to https://portal.gdc.cancer.gov/
   - Click on your username → "Download Token"
   - Save the token file to a secure location
2. **Use the `--token` parameter** when downloading:
```bash
tcga-downloader download \
--manifest manifest.tsv \
--out-dir ./data \
--token /path/to/gdc-user-token.txt
```
Or add `token` to your config file:
```json
{
  "download": {
    "out_dir": "./data",
    "processes": 4,
    "retries": 3,
    "token": "/path/to/gdc-user-token.txt"
  }
}
```
**Note**: Open-access data does not require a token. Only controlled-access files (marked as such in the GDC Data Portal) need authentication.
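
To illustrate how a config file like the one above could be read, the sketch below merges the `download` section over defaults and checks that the token file exists. The default values are copied from the example and are assumptions, as is the helper name; the package's own loader may behave differently.

```python
import json
from pathlib import Path

# Assumed defaults, taken from the example config above.
DEFAULTS = {"out_dir": "./data", "processes": 4, "retries": 3, "token": None}


def load_download_config(text: str) -> dict:
    """Parse JSON config text and return the merged download settings."""
    settings = {**DEFAULTS, **json.loads(text).get("download", {})}
    token = settings["token"]
    if token is not None and not Path(token).is_file():
        raise FileNotFoundError(f"token file not found: {token}")
    return settings


settings = load_download_config('{"download": {"processes": 8}}')
print(settings)
```

Failing early on a missing token file avoids starting a controlled-access download that cannot authenticate.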
## Common TCGA Projects
- TCGA-BRCA: Breast invasive carcinoma
- TCGA-LUAD: Lung adenocarcinoma
- TCGA-COAD: Colon adenocarcinoma
- TCGA-PRAD: Prostate adenocarcinoma
- TCGA-SKCM: Skin cutaneous melanoma
## Common Data Types
- Gene Expression
- Copy Number Variation
- DNA Methylation
- miRNA Expression
- Protein Expression
- Somatic Mutation
## Development
### Set up pre-commit hooks
```bash
pip install pre-commit
pre-commit install
```
### Run tests
```bash
pytest
```
With coverage:
```bash
pytest --cov=tcga_downloader --cov-report=html
```
### Formatting, linting, and type checking
```bash
black .
ruff check --fix .
mypy tcga_downloader
```
### CI/CD
The project uses GitHub Actions for CI:
- Linting with black, ruff, mypy
- Testing with pytest
- Coverage reporting
## License
MIT License