Add TCGA downloader design

yunpeng.zhang 2026-01-16 14:05:26 +08:00
commit 88383a168a

@ -0,0 +1,64 @@
# TCGA Downloader Design
## Summary
Build a Python package and CLI to download open-access (non-controlled) TCGA data using a hybrid approach: `GenomicDataCommons` for queries and manifest generation, and `gdc-client` for reliable downloading.
## Goals
- Provide a Python API and CLI for querying TCGA files by project/cancer type and data type.
- Generate reproducible manifests (TSV/JSON) for download.
- Support resumable downloads, checksum verification, concurrency, and retries.
- Work on macOS and Linux.
## Non-Goals
- Controlled-access data support.
- Raw sequencing data (FASTQ/BAM/CRAM).
- Rich GUI.
## Architecture
### Modules
1) `tcga_downloader.query`
- Uses `GenomicDataCommons` to query GDC Files endpoint.
- Filters by `project` and `data_type`.
- Returns file metadata: `file_id`, `file_name`, `data_type`, `data_format`, `size`, `md5`.
2) `tcga_downloader.manifest`
- Normalizes metadata into manifest (TSV/JSON).
- Validates required fields and types.
- Loads/saves manifest for reproducible downloads.
3) `tcga_downloader.download`
- Calls `gdc-client download -m <manifest>`.
- Enables concurrency, resume, checksum verification, and retries.
- Parses output to report failures and optionally retry.
4) `tcga_downloader.cli`
- `tcga-downloader query`: build manifest from filters.
- `tcga-downloader download`: download from manifest.
- `tcga-downloader run`: query + download in one step.
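
The three subcommands above could be wired up with `argparse` roughly as follows. This is a minimal sketch, not the final interface: the option names (`--project`, `--data-type`, `--manifest`, `--out-dir`) and the `build_parser` helper are assumptions made for illustration and are not fixed by this design.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the top-level parser with query, download and run subcommands."""
    parser = argparse.ArgumentParser(prog="tcga-downloader")
    sub = parser.add_subparsers(dest="command", required=True)

    def add_filters(p: argparse.ArgumentParser) -> None:
        # Hypothetical option names for the two filters named in the design.
        p.add_argument("--project", required=True, help="e.g. TCGA-BRCA")
        p.add_argument("--data-type", required=True, help="GDC data type")

    query = sub.add_parser("query", help="build a manifest from filters")
    add_filters(query)
    query.add_argument("--manifest", required=True, help="manifest path to write")

    download = sub.add_parser("download", help="download from a manifest")
    download.add_argument("--manifest", required=True, help="manifest path to read")
    download.add_argument("--out-dir", default=".", help="target directory")

    run = sub.add_parser("run", help="query and download in one step")
    add_filters(run)
    run.add_argument("--out-dir", default=".", help="target directory")
    return parser
```

Sharing `add_filters` between `query` and `run` keeps the two subcommands from drifting apart as filters are added.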
## Data Flow
1) User inputs filters (project + data_type).
2) `query` fetches metadata from GDC.
3) `manifest` writes TSV/JSON.
4) `download` reads manifest and calls `gdc-client`.
5) Files are stored in the target directory with logs.
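
The flow above can be sketched as four small pure functions. This is an illustration under stated assumptions, not the package's implementation: it does not call `GenomicDataCommons`, and only builds a filter in the JSON form used by the GDC REST API. The function names are hypothetical. The sketch assumes the GDC Files endpoint reports size and checksum as `file_size` and `md5sum`, that a `gdc-client` manifest is a TSV with columns `id`, `filename`, `md5`, `size`, `state`, and that `-d` and `-n` select the output directory and process count; these should be checked against the GDC documentation before use.

```python
import csv
import io

# Assumed gdc-client manifest columns (tab-separated).
MANIFEST_COLUMNS = ["id", "filename", "md5", "size", "state"]


def build_filters(project: str, data_type: str) -> dict:
    """Step 1: turn user filters into a GDC-style filter expression."""
    def is_in(field: str, value: str) -> dict:
        return {"op": "in", "content": {"field": field, "value": [value]}}

    return {
        "op": "and",
        "content": [
            is_in("cases.project.project_id", project),
            is_in("data_type", data_type),
        ],
    }


def hits_to_rows(hits: list) -> list:
    """Steps 2-3: normalize query hits into manifest rows."""
    return [
        {
            "id": hit["file_id"],
            "filename": hit["file_name"],
            "md5": hit["md5sum"],
            "size": int(hit["file_size"]),
            "state": hit.get("state", ""),
        }
        for hit in hits
    ]


def write_manifest(rows: list, handle) -> None:
    """Step 3: write rows as a TSV manifest to an open text handle."""
    writer = csv.DictWriter(
        handle, fieldnames=MANIFEST_COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


def build_download_command(manifest_path: str, out_dir: str, processes: int = 4) -> list:
    """Step 4: argv for gdc-client; the default process count is a placeholder."""
    return ["gdc-client", "download", "-m", manifest_path,
            "-d", out_dir, "-n", str(processes)]
```

Writing to a handle rather than a path lets the manifest step be exercised with `io.StringIO` in unit tests, without touching the filesystem.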
## Error Handling
- Query errors: surface the HTTP error and echo the filters that were used; handle empty results gracefully.
- Manifest errors: validate schema; fail fast with actionable messages.
- Download errors: capture `gdc-client` exit code and stderr; output failed file list; allow retry.
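
One way the download error path might look is sketched below. All names (`DownloadResult`, `run_command`, `missing_files`, `report`) are hypothetical. Two assumptions are baked in and labeled in the comments: that `gdc-client` stores each file under `<out_dir>/<id>/<filename>`, and that a file absent from that location counts as failed. The sketch does not parse `gdc-client` output, whose format is not specified in this design.

```python
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path


@dataclass
class DownloadResult:
    returncode: int
    stderr: str

    @property
    def ok(self) -> bool:
        return self.returncode == 0


def run_command(cmd: list) -> DownloadResult:
    """Run a command, capturing its exit code and stderr instead of raising."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
    except FileNotFoundError:
        # 127 mirrors the shell's "command not found" convention (a choice made here).
        return DownloadResult(127, f"executable not found: {cmd[0]}")
    return DownloadResult(proc.returncode, proc.stderr)


def missing_files(rows: list, out_dir: str) -> list:
    """Return ids of manifest rows whose file is not on disk.

    Assumes the layout <out_dir>/<id>/<filename>.
    """
    base = Path(out_dir)
    return [
        row["id"]
        for row in rows
        if not (base / row["id"] / row["filename"]).is_file()
    ]


def report(result: DownloadResult, rows: list, out_dir: str) -> list:
    """Print a failure summary to stderr and return the ids worth retrying."""
    if result.ok:
        return []
    failed = missing_files(rows, out_dir)
    print(f"gdc-client exited with code {result.returncode}", file=sys.stderr)
    print(result.stderr, file=sys.stderr)
    print("failed files: " + ", ".join(failed), file=sys.stderr)
    return failed
```

The list returned by `report` can be filtered back into a smaller manifest, which is what makes the "allow retry" step cheap.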
## Testing Strategy
- Unit tests: manifest read/write and schema validation; query parameter construction (mocked responses).
- Integration tests: query -> manifest generation with mocked API.
- Smoke tests: download a small public file on macOS/Linux (manual or CI-labeled).
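
A unit test with a mocked response might look like the sketch below. `query_files` and its injected `fetch` callable are hypothetical stand-ins for whatever `tcga_downloader.query` ends up exposing; the only assumption about GDC itself is that results arrive under `data.hits` in the decoded JSON.

```python
import unittest
from unittest import mock


def query_files(fetch, project: str, data_type: str) -> list:
    """Hypothetical query entry point.

    `fetch` is any callable that takes a params dict and returns decoded JSON,
    so tests can replace the network call with a mock.
    """
    response = fetch({"project": project, "data_type": data_type})
    return response["data"]["hits"]


class QueryFilesTest(unittest.TestCase):
    def test_passes_filters_and_returns_hits(self):
        hit = {"file_id": "abc", "file_name": "x.tsv"}
        fetch = mock.Mock(return_value={"data": {"hits": [hit]}})

        hits = query_files(fetch, "TCGA-BRCA", "Clinical Supplement")

        # The filters must reach the fetch layer unchanged.
        fetch.assert_called_once_with(
            {"project": "TCGA-BRCA", "data_type": "Clinical Supplement"}
        )
        self.assertEqual(hits, [hit])

    def test_empty_result_is_an_empty_list(self):
        fetch = mock.Mock(return_value={"data": {"hits": []}})
        self.assertEqual(query_files(fetch, "TCGA-BRCA", "Clinical Supplement"), [])
```

Saved in a test module, this runs with `python -m unittest` and needs no network access.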
## Delivery Plan
- `pyproject.toml` with dependencies (`GenomicDataCommons`) and CLI entry point.
- Package structure: `tcga_downloader/` with `query.py`, `manifest.py`, `download.py`, `cli.py`.
- `docs/` with installation and usage.
- `examples/` with typical workflows.
## Open Questions
- Which data types to support, and the default mapping from common user inputs to those types.
- Default concurrency and retry policy for `gdc-client`.