# TCGA Downloader Design

## Summary

Build a Python package + CLI to download public TCGA data (non-controlled) using a mixed approach: `GenomicDataCommons` for queries/manifest generation and `gdc-client` for reliable downloading.

## Goals

- Provide a Python API and CLI for querying TCGA files by project/cancer type and data type.
- Generate reproducible manifests (TSV/JSON) for download.
- Support resumable downloads, checksum verification, concurrency, and retries.
- Work on macOS and Linux.

## Non-Goals

- Controlled-access data support.
- Raw sequencing data (FASTQ/BAM/CRAM).
- Rich GUI.

## Architecture

### Modules

1) `tcga_downloader.query`
   - Uses `GenomicDataCommons` to query GDC Files endpoint.
   - Filters by `project` and `data_type`.
   - Returns file metadata: `file_id`, `file_name`, `data_type`, `data_format`, `size`, `md5`.

2) `tcga_downloader.manifest`
   - Normalizes metadata into manifest (TSV/JSON).
   - Validates required fields and types.
   - Loads/saves manifest for reproducible downloads.

3) `tcga_downloader.download`
   - Calls `gdc-client download -m `.
   - Enables concurrency, resume, checksum verification, and retries.
   - Parses output to report failures and optionally retry.

4) `tcga_downloader.cli`
   - `tcga-downloader query`: build manifest from filters.
   - `tcga-downloader download`: download from manifest.
   - `tcga-downloader run`: query + download in one step.

## Data Flow

1) User inputs filters (project + data_type).
2) `query` fetches metadata from GDC.
3) `manifest` writes TSV/JSON.
4) `download` reads manifest and calls `gdc-client`.
5) Files are stored in the target directory with logs.

## Error Handling

- Query errors: surface HTTP error + echo filters; handle empty results gracefully.
- Manifest errors: validate schema; fail fast with actionable messages.
- Download errors: capture `gdc-client` exit code and stderr; output failed file list; allow retry.
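
The download error handling described above (capture the exit code and stderr, allow retry) could be sketched roughly as follows. This is a minimal illustration, not the planned implementation: `DownloadResult`, `build_command`, and `run_with_retries` are hypothetical names, and the `-d` output-directory flag is an assumption about how the target directory would be passed to `gdc-client`. Concurrency and retry flags are left out because their defaults are still an open question. Re-running the whole manifest on failure assumes that `gdc-client` resumes or skips files that are already complete.

```python
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path
from typing import List, Sequence


@dataclass
class DownloadResult:
    """Outcome of the last attempt (hypothetical helper type)."""
    returncode: int
    stderr: str
    attempts: int

    @property
    def ok(self) -> bool:
        return self.returncode == 0


def build_command(manifest: Path, out_dir: Path,
                  executable: str = "gdc-client") -> List[str]:
    # Assumption: -m passes the manifest and -d the output directory.
    return [executable, "download", "-m", str(manifest), "-d", str(out_dir)]


def run_with_retries(cmd: Sequence[str], max_attempts: int = 3) -> DownloadResult:
    """Run cmd until it exits with 0 or max_attempts is reached."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        # capture_output keeps stderr so failures can be reported to the user.
        proc = subprocess.run(list(cmd), capture_output=True, text=True)
        if proc.returncode == 0:
            break
    return DownloadResult(proc.returncode, proc.stderr, attempt)


if __name__ == "__main__":
    # Stand-in command so the sketch runs without gdc-client installed.
    failing = [sys.executable, "-c",
               "import sys; sys.stderr.write('boom'); sys.exit(3)"]
    result = run_with_retries(failing, max_attempts=2)
    print(result.ok, result.returncode, result.attempts)
```

Separating command construction from execution keeps the command testable without the external tool, which fits the mocked unit tests planned for this package. A missing `gdc-client` binary would raise `FileNotFoundError` from `subprocess.run`; a real wrapper would likely turn that into an actionable message.
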
## Testing Strategy

- Unit tests: manifest read/write and schema validation; query parameter construction (mocked responses).
- Integration tests: query -> manifest generation with mocked API.
- Smoke tests: download a small public file on macOS/Linux (manual or CI-labeled).

## Delivery Plan

- `pyproject.toml` with dependencies (`GenomicDataCommons`) and CLI entry point.
- Package structure: `tcga_downloader/` with `query.py`, `manifest.py`, `download.py`, `cli.py`.
- `docs/` with installation and usage.
- `examples/` with typical workflows.

## Open Questions

- Specific data types list and default mapping for common user inputs.
- Default concurrency and retry policy for `gdc-client`.
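
As an illustration of the manifest read/write and schema validation that the unit tests would target, here is a minimal sketch. The required fields are the ones listed for `tcga_downloader.query`; the function names (`validate_record`, `write_tsv`, `read_tsv`) and the specific checks (integer `size`, 32-character hexadecimal `md5`) are assumptions made for the example. The column layout is not claimed to match what `gdc-client` itself expects in a manifest.

```python
import csv
import io
from typing import Dict, Iterable, List, TextIO

# Field names taken from the metadata list in the Modules section.
REQUIRED_FIELDS = ("file_id", "file_name", "data_type",
                   "data_format", "size", "md5")


def validate_record(record: Dict[str, object]) -> Dict[str, object]:
    """Return a normalized copy of record, or raise ValueError."""
    missing = [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]
    if missing:
        raise ValueError("manifest record missing fields: " + ", ".join(missing))
    try:
        size = int(record["size"])
    except (TypeError, ValueError):
        raise ValueError(f"size must be an integer, got {record['size']!r}") from None
    if size < 0:
        raise ValueError("size must not be negative")
    md5 = str(record["md5"]).lower()
    if len(md5) != 32 or any(c not in "0123456789abcdef" for c in md5):
        raise ValueError(f"md5 must be 32 hex characters, got {record['md5']!r}")
    clean: Dict[str, object] = {f: str(record[f]) for f in REQUIRED_FIELDS}
    clean["size"] = size
    clean["md5"] = md5
    return clean


def write_tsv(records: Iterable[Dict[str, object]], fh: TextIO) -> None:
    """Validate and write records as tab-separated rows with a header."""
    writer = csv.DictWriter(fh, fieldnames=REQUIRED_FIELDS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(validate_record(record))


def read_tsv(fh: TextIO) -> List[Dict[str, object]]:
    """Read a manifest written by write_tsv, validating every row."""
    return [validate_record(row) for row in csv.DictReader(fh, delimiter="\t")]


if __name__ == "__main__":
    buf = io.StringIO()
    write_tsv([{"file_id": "abc", "file_name": "a.tsv",
                "data_type": "Gene Expression Quantification",
                "data_format": "TSV", "size": "123",
                "md5": "d41d8cd98f00b204e9800998ecf8427e"}], buf)
    buf.seek(0)
    print(read_tsv(buf))
```

Validating on both write and read means a hand-edited manifest fails fast with a specific message before any download starts.
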