From 88383a168a567eec17f4512a5a046283138fd19e Mon Sep 17 00:00:00 2001
From: "yunpeng.zhang"
Date: Fri, 16 Jan 2026 14:05:26 +0800
Subject: [PATCH] Add TCGA downloader design

---
 .../2026-01-16-tcga-downloader-design.md | 64 +++++++++++++++++++
 1 file changed, 64 insertions(+)
 create mode 100644 docs/plans/2026-01-16-tcga-downloader-design.md

diff --git a/docs/plans/2026-01-16-tcga-downloader-design.md b/docs/plans/2026-01-16-tcga-downloader-design.md
new file mode 100644
index 0000000..8a50d25
--- /dev/null
+++ b/docs/plans/2026-01-16-tcga-downloader-design.md
@@ -0,0 +1,64 @@
+# TCGA Downloader Design
+
+## Summary
+Build a Python package + CLI to download public TCGA data (non-controlled) using a mixed approach: `GenomicDataCommons` for queries/manifest generation and `gdc-client` for reliable downloading.
+
+## Goals
+- Provide a Python API and CLI for querying TCGA files by project/cancer type and data type.
+- Generate reproducible manifests (TSV/JSON) for download.
+- Support resumable downloads, checksum verification, concurrency, and retries.
+- Work on macOS and Linux.
+
+## Non-Goals
+- Controlled-access data support.
+- Raw sequencing data (FASTQ/BAM/CRAM).
+- Rich GUI.
+
+## Architecture
+### Modules
+1) `tcga_downloader.query`
+- Uses `GenomicDataCommons` to query GDC Files endpoint.
+- Filters by `project` and `data_type`.
+- Returns file metadata: `file_id`, `file_name`, `data_type`, `data_format`, `size`, `md5`.
+
+2) `tcga_downloader.manifest`
+- Normalizes metadata into manifest (TSV/JSON).
+- Validates required fields and types.
+- Loads/saves manifest for reproducible downloads.
+
+3) `tcga_downloader.download`
+- Calls `gdc-client download -m `.
+- Enables concurrency, resume, checksum verification, and retries.
+- Parses output to report failures and optionally retry.
+
+4) `tcga_downloader.cli`
+- `tcga-downloader query`: build manifest from filters.
+- `tcga-downloader download`: download from manifest.
+- `tcga-downloader run`: query + download in one step.
+
+## Data Flow
+1) User inputs filters (project + data_type).
+2) `query` fetches metadata from GDC.
+3) `manifest` writes TSV/JSON.
+4) `download` reads manifest and calls `gdc-client`.
+5) Files are stored in the target directory with logs.
+
+## Error Handling
+- Query errors: surface HTTP error + echo filters; handle empty results gracefully.
+- Manifest errors: validate schema; fail fast with actionable messages.
+- Download errors: capture `gdc-client` exit code and stderr; output failed file list; allow retry.
+
+## Testing Strategy
+- Unit tests: manifest read/write and schema validation; query parameter construction (mocked responses).
+- Integration tests: query -> manifest generation with mocked API.
+- Smoke tests: download a small public file on macOS/Linux (manual or CI-labeled).
+
+## Delivery Plan
+- `pyproject.toml` with dependencies (`GenomicDataCommons`) and CLI entry point.
+- Package structure: `tcga_downloader/` with `query.py`, `manifest.py`, `download.py`, `cli.py`.
+- `docs/` with installation and usage.
+- `examples/` with typical workflows.
+
+## Open Questions
+- Specific data types list and default mapping for common user inputs.
+- Default concurrency and retry policy for `gdc-client`.
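
Illustrative sketch (not part of the patch): one way the `tcga_downloader.manifest` step described in the design could validate records and write a reproducible TSV. The field names come from the design's `query` module. The function names `validate_record` and `write_manifest_tsv`, the type rules (integer `size`, 32-character hex `md5`), and the sort-by-`file_id` ordering are assumptions made for this example. The sketch does not map the output to the column layout that `gdc-client` expects.

```python
import csv
import io
import re

# Field names taken from the design's `query` module; the type rules are assumptions.
REQUIRED_FIELDS = ("file_id", "file_name", "data_type", "data_format", "size", "md5")
_MD5_RE = re.compile(r"[0-9a-f]{32}")


def validate_record(record: dict) -> dict:
    """Return a normalized copy of one file record, or raise ValueError."""
    missing = [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]
    if missing:
        raise ValueError(f"record {record.get('file_id', '?')}: missing fields {missing}")
    try:
        size = int(record["size"])
    except (TypeError, ValueError):
        raise ValueError(f"record {record['file_id']}: size is not an integer") from None
    if size < 0:
        raise ValueError(f"record {record['file_id']}: size is negative")
    md5 = str(record["md5"]).lower()
    if not _MD5_RE.fullmatch(md5):
        raise ValueError(f"record {record['file_id']}: md5 is not 32 hex characters")
    normalized = {f: str(record[f]) for f in REQUIRED_FIELDS}
    normalized["size"] = size
    normalized["md5"] = md5
    return normalized


def write_manifest_tsv(records, out) -> int:
    """Validate every record, then write a TSV manifest to the text stream `out`."""
    # Validate everything first so a bad record fails before any output is written.
    rows = [validate_record(r) for r in records]
    # A fixed sort order keeps the manifest identical across runs for the same input.
    rows.sort(key=lambda r: r["file_id"])
    writer = csv.DictWriter(out, fieldnames=REQUIRED_FIELDS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)
```

Validating the full list before writing follows the "fail fast with actionable messages" item under Error Handling: a schema problem surfaces with the offending `file_id` and no partial manifest is left on disk.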