TCGA Downloader Design
Summary
Build a Python package and CLI to download public (non-controlled) TCGA data using a mixed approach: GenomicDataCommons for queries and manifest generation, and gdc-client for reliable downloading.
Goals
- Provide a Python API and CLI for querying TCGA files by project/cancer type and data type.
- Generate reproducible manifests (TSV/JSON) for download.
- Support resumable downloads, checksum verification, concurrency, and retries.
- Work on macOS and Linux.
Non-Goals
- Controlled-access data support.
- Raw sequencing data (FASTQ/BAM/CRAM).
- Rich GUI.
Architecture
Modules
tcga_downloader.query
- Uses GenomicDataCommons to query the GDC Files endpoint.
- Filters by project and data_type.
- Returns file metadata: file_id, file_name, data_type, data_format, size, md5.
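To illustrate the query step, here is a minimal sketch that builds the request payload for the GDC files endpoint. It assumes the GDC REST API is called directly rather than through GenomicDataCommons, whose Python interface is not specified in this design. The function name build_files_query and the access filter are illustrative assumptions; the GDC field names file_size and md5sum are taken to correspond to size and md5 above.

```python
import json

# Public GDC files endpoint; the payload below would be sent as a JSON POST body.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"

# GDC field names assumed to map to the metadata listed above
# (size -> file_size, md5 -> md5sum).
FIELDS = ["file_id", "file_name", "data_type", "data_format", "file_size", "md5sum"]


def build_files_query(project, data_type, size=1000):
    """Return a JSON-serializable payload for the GDC files endpoint.

    Hypothetical helper: the name and defaults are not part of the design.
    """
    filters = {
        "op": "and",
        "content": [
            {"op": "in", "content": {"field": "cases.project.project_id", "value": [project]}},
            {"op": "in", "content": {"field": "data_type", "value": [data_type]}},
            # Assumption: restrict to open-access files, per the non-goals.
            {"op": "in", "content": {"field": "access", "value": ["open"]}},
        ],
    }
    return {"filters": filters, "fields": ",".join(FIELDS), "format": "JSON", "size": size}


payload = build_files_query("TCGA-BRCA", "Gene Expression Quantification")
print(json.dumps(payload, indent=2))
```

Keeping payload construction separate from the HTTP call makes the query parameters easy to unit test with no network access.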
tcga_downloader.manifest
- Normalizes metadata into manifest (TSV/JSON).
- Validates required fields and types.
- Loads/saves manifest for reproducible downloads.
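The manifest step could look roughly like the sketch below, which validates records and round-trips them through a TSV file. The column layout (id, filename, md5, size, state) is an assumption modelled on the manifest format gdc-client consumes; the function names and the default state value are hypothetical.

```python
import csv
import io

# Assumed column layout, following the manifest format used by gdc-client.
MANIFEST_COLUMNS = ["id", "filename", "md5", "size", "state"]


def validate_record(rec):
    """Fail fast with an actionable message if a required field is bad."""
    for key in ("file_id", "file_name", "md5"):
        if not isinstance(rec.get(key), str) or not rec[key]:
            raise ValueError(f"missing or empty field {key!r} in record: {rec}")
    if len(rec["md5"]) != 32:
        raise ValueError(f"md5 must be 32 hex characters in record: {rec}")
    if not isinstance(rec.get("size"), int) or rec["size"] < 0:
        raise ValueError(f"size must be a non-negative integer in record: {rec}")


def write_manifest(records, handle):
    """Write validated records as a tab-separated manifest."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(MANIFEST_COLUMNS)
    for rec in records:
        validate_record(rec)
        writer.writerow(
            [rec["file_id"], rec["file_name"], rec["md5"], rec["size"],
             rec.get("state", "released")]  # default state is an assumption
        )


def read_manifest(handle):
    """Load a manifest back as a list of dicts (all values are strings)."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Placeholder record; the ID and checksum are illustrative, not real GDC data.
record = {
    "file_id": "00000000-0000-0000-0000-000000000000",
    "file_name": "example.tsv",
    "md5": "d41d8cd98f00b204e9800998ecf8427e",
    "size": 0,
}
buffer = io.StringIO()
write_manifest([record], buffer)
buffer.seek(0)
rows = read_manifest(buffer)
print(rows[0]["filename"])
```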
tcga_downloader.download
- Calls gdc-client download -m <manifest>.
- Enables concurrency, resume, checksum verification, and retries.
- Parses output to report failures and optionally retry.
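As a sketch of the download wrapper, the code below assembles the gdc-client command line and runs it with captured output. The flags -d, -n, and --retry-amount are written from memory of the gdc-client download options and should be checked against gdc-client download --help for the installed version. The function names and default values are hypothetical, since the default concurrency and retry policy are still open questions.

```python
import subprocess


def build_download_command(manifest_path, out_dir, n_processes=4, retry_amount=3):
    """Assemble the gdc-client invocation as an argument list (no shell).

    Defaults are placeholders; the design leaves the real policy open.
    """
    return [
        "gdc-client", "download",
        "-m", str(manifest_path),            # manifest of files to fetch
        "-d", str(out_dir),                  # target directory
        "-n", str(n_processes),              # concurrent download processes
        "--retry-amount", str(retry_amount), # per-file retry count
    ]


def run_download(manifest_path, out_dir, **options):
    """Run gdc-client and return the completed process for inspection.

    The caller can check returncode and stderr to decide whether to retry.
    """
    cmd = build_download_command(manifest_path, out_dir, **options)
    return subprocess.run(cmd, capture_output=True, text=True)


print(" ".join(build_download_command("manifest.tsv", "downloads")))
```

Passing an argument list rather than a shell string avoids quoting problems with paths that contain spaces.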
tcga_downloader.cli
- tcga-downloader query: build manifest from filters.
- tcga-downloader download: download from manifest.
- tcga-downloader run: query + download in one step.
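The three subcommands could be wired up with argparse along the following lines. This is only a sketch: the option names (--project, --data-type, --out, --manifest, --out-dir) and their defaults are assumptions, not part of the design.

```python
import argparse


def build_parser():
    """Build a parser with query, download, and run subcommands."""
    parser = argparse.ArgumentParser(prog="tcga-downloader")
    sub = parser.add_subparsers(dest="command", required=True)

    def add_filters(p):
        # Filters shared by the query and run subcommands.
        p.add_argument("--project", required=True, help="e.g. TCGA-BRCA")
        p.add_argument("--data-type", required=True, help="GDC data type name")

    query = sub.add_parser("query", help="build manifest from filters")
    add_filters(query)
    query.add_argument("--out", default="manifest.tsv", help="manifest path")

    download = sub.add_parser("download", help="download from manifest")
    download.add_argument("--manifest", required=True)
    download.add_argument("--out-dir", default="downloads")

    run = sub.add_parser("run", help="query + download in one step")
    add_filters(run)
    run.add_argument("--out-dir", default="downloads")
    return parser


args = build_parser().parse_args(
    ["query", "--project", "TCGA-BRCA", "--data-type", "Gene Expression Quantification"]
)
print(args.command, args.project, args.data_type, args.out)
```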
Data Flow
- User inputs filters (project + data_type).
- query fetches metadata from GDC.
- manifest writes TSV/JSON.
- download reads manifest and calls gdc-client.
- Files are stored in the target directory with logs.
Error Handling
- Query errors: surface HTTP error + echo filters; handle empty results gracefully.
- Manifest errors: validate schema; fail fast with actionable messages.
- Download errors: capture gdc-client exit code and stderr; output failed file list; allow retry.
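One way to produce the failed file list without depending on the exact format of gdc-client output is to re-check each manifest entry on disk after the run. The sketch below assumes downloads are laid out as <out_dir>/<id>/<filename> and that manifest entries carry id, filename, and md5 keys; both the layout and the helper names are assumptions.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_failed(entries, out_dir):
    """Return manifest entries whose file is missing or fails the checksum.

    Assumes the layout <out_dir>/<id>/<filename>; adjust if the actual
    download layout differs.
    """
    failed = []
    for entry in entries:
        path = Path(out_dir) / entry["id"] / entry["filename"]
        if not path.is_file() or md5_of(path) != entry["md5"]:
            failed.append(entry)
    return failed
```

The returned entries can be written out as a smaller manifest and passed to a second download attempt.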
Testing Strategy
- Unit tests: manifest read/write and schema validation; query parameter construction (mocked responses).
- Integration tests: query -> manifest generation with mocked API.
- Smoke tests: download a small public file on macOS/Linux (manual or CI-labeled).
Delivery Plan
- pyproject.toml with dependencies (GenomicDataCommons) and CLI entry point.
- Package structure: tcga_downloader/ with query.py, manifest.py, download.py, cli.py.
- docs/ with installation and usage.
- examples/ with typical workflows.
Open Questions
- Specific data types list and default mapping for common user inputs.
- Default concurrency and retry policy for gdc-client.