Add TCGA downloader design
commit 88383a168a
64
docs/plans/2026-01-16-tcga-downloader-design.md
Normal file
@@ -0,0 +1,64 @@
# TCGA Downloader Design

## Summary

Build a Python package + CLI to download public TCGA data (non-controlled) using a mixed approach: `GenomicDataCommons` for queries/manifest generation and `gdc-client` for reliable downloading.

## Goals

- Provide a Python API and CLI for querying TCGA files by project/cancer type and data type.
- Generate reproducible manifests (TSV/JSON) for download.
- Support resumable downloads, checksum verification, concurrency, and retries.
- Work on macOS and Linux.

## Non-Goals

- Controlled-access data support.
- Raw sequencing data (FASTQ/BAM/CRAM).
- Rich GUI.

## Architecture

### Modules

1) `tcga_downloader.query`
- Uses `GenomicDataCommons` to query the GDC Files endpoint.
- Filters by `project` and `data_type`.
- Returns file metadata: `file_id`, `file_name`, `data_type`, `data_format`, `size`, `md5`.

2) `tcga_downloader.manifest`
- Normalizes metadata into a manifest (TSV/JSON).
- Validates required fields and types.
- Loads/saves the manifest for reproducible downloads.

3) `tcga_downloader.download`
- Calls `gdc-client download -m <manifest>`.
- Enables concurrency, resume, checksum verification, and retries.
- Parses output to report failures and optionally retry.

4) `tcga_downloader.cli`
- `tcga-downloader query`: build manifest from filters.
- `tcga-downloader download`: download from manifest.
- `tcga-downloader run`: query + download in one step.
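
The three CLI subcommands listed above could be wired up roughly as follows. This is a minimal sketch using `argparse` from the standard library; the option names (`--project`, `--data-type`, `--manifest`, `--out-dir`) and their defaults are assumptions for illustration, since the design does not fix them.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the query, download, and run subcommands."""
    parser = argparse.ArgumentParser(prog="tcga-downloader")
    sub = parser.add_subparsers(dest="command", required=True)

    # Hypothetical shared option groups; flag names are not set by the design.
    filters = argparse.ArgumentParser(add_help=False)
    filters.add_argument("--project", required=True, help="e.g. TCGA-BRCA")
    filters.add_argument("--data-type", required=True)

    target = argparse.ArgumentParser(add_help=False)
    target.add_argument("--out-dir", default="downloads")

    # query: build a manifest from filters.
    q = sub.add_parser("query", parents=[filters], help="build manifest from filters")
    q.add_argument("--manifest", default="manifest.tsv")

    # download: download from an existing manifest.
    d = sub.add_parser("download", parents=[target], help="download from manifest")
    d.add_argument("--manifest", required=True)

    # run: query + download in one step.
    r = sub.add_parser("run", parents=[filters, target], help="query + download")
    r.add_argument("--manifest", default="manifest.tsv")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Sharing the filter and target options through parent parsers keeps `run` consistent with `query` and `download` without repeating the definitions.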

## Data Flow

1) User inputs filters (project + data_type).
2) `query` fetches metadata from GDC.
3) `manifest` writes TSV/JSON.
4) `download` reads manifest and calls `gdc-client`.
5) Files are stored in the target directory with logs.
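
Steps 1 to 3 above can be sketched without any network access. The example below builds the filter and query parameters in the shape the GDC REST API expects and turns returned hits into a tab-separated manifest with the columns `gdc-client` reads (`id`, `filename`, `md5`, `size`, `state`). It works with raw API parameters rather than the `GenomicDataCommons` wrapper named in this design, because the wrapper's call signatures are not specified here. The function names are hypothetical, and the GDC field names `file_size` and `md5sum` are assumed to be the sources for the design's `size` and `md5`.

```python
import csv
import io
import json

# Columns of a gdc-client manifest (tab-separated).
MANIFEST_COLUMNS = ["id", "filename", "md5", "size", "state"]

# GDC field names requested from the Files endpoint (assumed mapping).
FIELDS = ["file_id", "file_name", "data_type", "data_format",
          "file_size", "md5sum", "state"]


def build_filters(project: str, data_type: str) -> dict:
    """Filter for open-access files of one project and one data type."""
    def is_in(field: str, value: str) -> dict:
        return {"op": "in", "content": {"field": field, "value": [value]}}

    return {
        "op": "and",
        "content": [
            is_in("cases.project.project_id", project),
            is_in("data_type", data_type),
            is_in("access", "open"),
        ],
    }


def build_params(project: str, data_type: str, size: int = 1000) -> dict:
    """Query-string parameters for a request to the Files endpoint."""
    return {
        "filters": json.dumps(build_filters(project, data_type)),
        "fields": ",".join(FIELDS),
        "format": "JSON",
        "size": str(size),
    }


def hits_to_manifest_tsv(hits: list) -> str:
    """Convert file metadata records into gdc-client manifest text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=MANIFEST_COLUMNS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for hit in hits:
        writer.writerow({
            "id": hit["file_id"],
            "filename": hit["file_name"],
            "md5": hit["md5sum"],
            "size": hit["file_size"],
            "state": hit.get("state", ""),
        })
    return buf.getvalue()
```

Keeping parameter construction separate from the HTTP call makes the query logic testable with mocked responses, which matches the testing strategy in this document.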

## Error Handling

- Query errors: surface the HTTP error and echo the filters that were used; handle empty results gracefully.
- Manifest errors: validate the schema; fail fast with actionable messages.
- Download errors: capture the `gdc-client` exit code and stderr; output the list of failed files; allow retry.
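
The download error handling described above might look like the sketch below. It wraps `gdc-client` with `subprocess`, captures the exit code and stderr, and derives the failed-file list by checking which manifest entries are missing on disk instead of parsing tool output. All function names and the defaults for processes and retries are hypothetical. The flags `-d`, `-n`, and `--retry-amount` are assumed from the GDC Data Transfer Tool and should be confirmed against `gdc-client download --help`, as should the `<out_dir>/<id>/<filename>` layout.

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path


@dataclass
class DownloadResult:
    returncode: int
    stderr: str

    @property
    def ok(self) -> bool:
        return self.returncode == 0


def build_command(manifest, out_dir, processes=4, retries=3) -> list:
    """Assemble the gdc-client invocation (flag names are assumptions)."""
    return [
        "gdc-client", "download",
        "-m", str(manifest),
        "-d", str(out_dir),
        "-n", str(processes),
        "--retry-amount", str(retries),
    ]


def run_download(manifest, out_dir, processes=4, retries=3,
                 runner=subprocess.run) -> DownloadResult:
    """Run gdc-client and capture its exit code and stderr."""
    cmd = build_command(manifest, out_dir, processes, retries)
    try:
        proc = runner(cmd, capture_output=True, text=True, check=False)
    except FileNotFoundError:
        # Raised when the executable is not installed or not on PATH.
        return DownloadResult(127, "gdc-client not found on PATH")
    return DownloadResult(proc.returncode, proc.stderr)


def missing_files(rows, out_dir) -> list:
    """Return manifest rows whose file is absent under out_dir/<id>/<filename>."""
    root = Path(out_dir)
    return [r for r in rows if not (root / r["id"] / r["filename"]).is_file()]
```

The rows returned by `missing_files` can be written out as a smaller manifest and passed to `run_download` again, which gives a simple retry path. The injectable `runner` argument lets tests replace `subprocess.run` with a stub.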

## Testing Strategy

- Unit tests: manifest read/write and schema validation; query parameter construction (mocked responses).
- Integration tests: query -> manifest generation with mocked API.
- Smoke tests: download a small public file on macOS/Linux (manual or CI-labeled).
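
As an illustration of the unit-test level, here is a small sketch of schema validation together with tests for it. The validator, the `ManifestError` exception, and the required-field table are hypothetical; the field names follow the metadata listed for `tcga_downloader.query`. The tests use plain assertions, so they run under pytest or when called directly.

```python
# Required manifest fields and their expected Python types (assumed schema).
REQUIRED_FIELDS = {"file_id": str, "file_name": str, "size": int, "md5": str}


class ManifestError(ValueError):
    """Raised when a manifest row fails validation."""


def validate_row(row: dict, index: int = 0) -> None:
    """Fail fast with a message naming the row and the offending field."""
    for name, expected in REQUIRED_FIELDS.items():
        if name not in row:
            raise ManifestError(f"row {index}: missing required field '{name}'")
        if not isinstance(row[name], expected):
            raise ManifestError(
                f"row {index}: field '{name}' should be {expected.__name__}, "
                f"got {type(row[name]).__name__}"
            )
    if len(row["md5"]) != 32:
        raise ManifestError(f"row {index}: 'md5' should be 32 hex characters")


GOOD_ROW = {
    "file_id": "file-1",
    "file_name": "a.tsv",
    "size": 10,
    "md5": "d41d8cd98f00b204e9800998ecf8427e",
}


def test_valid_row_passes():
    validate_row(GOOD_ROW)


def test_missing_field_is_reported():
    bad = {k: v for k, v in GOOD_ROW.items() if k != "md5"}
    try:
        validate_row(bad, index=3)
    except ManifestError as exc:
        assert "row 3" in str(exc) and "md5" in str(exc)
    else:
        raise AssertionError("expected ManifestError")


def test_wrong_type_is_reported():
    bad = dict(GOOD_ROW, size="10")
    try:
        validate_row(bad)
    except ManifestError as exc:
        assert "size" in str(exc) and "int" in str(exc)
    else:
        raise AssertionError("expected ManifestError")
```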

## Delivery Plan

- `pyproject.toml` with dependencies (`GenomicDataCommons`) and CLI entry point.
- Package structure: `tcga_downloader/` with `query.py`, `manifest.py`, `download.py`, `cli.py`.
- `docs/` with installation and usage.
- `examples/` with typical workflows.

## Open Questions

- Which data types to support, and the default mapping from common user inputs to them.
- Default concurrency and retry policy for `gdc-client`.