# TCGA Downloader Design
## Summary
Build a Python package and CLI for downloading open-access (non-controlled) TCGA data using a hybrid approach: `GenomicDataCommons` for queries and manifest generation, and `gdc-client` for reliable downloads.
## Goals
- Provide a Python API and CLI for querying TCGA files by project/cancer type and data type.
- Generate reproducible manifests (TSV/JSON) for download.
- Support resumable downloads, checksum verification, concurrency, and retries.
- Work on macOS and Linux.
## Non-Goals
- Controlled-access data support.
- Raw sequencing data (FASTQ/BAM/CRAM).
- Rich GUI.
## Architecture
### Modules
1) `tcga_downloader.query`
- Uses `GenomicDataCommons` to query the GDC Files endpoint.
- Filters by `project` and `data_type`.
- Returns file metadata: `file_id`, `file_name`, `data_type`, `data_format`, `size`, `md5`.
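
As a rough illustration of the query step, the sketch below builds a request body for the GDC Files endpoint and renames the returned fields to the names listed above. It assembles the raw REST payload with the standard library instead of going through `GenomicDataCommons`, so it is a stand-in and not that package's interface. The helper names `build_files_query` and `normalize_hit`, the `access` filter, the page size, and the mapping of GDC's `file_size` and `md5sum` onto `size` and `md5` are all assumptions.

```python
# GDC-side field names (assumption: `size`/`md5` in this design map to
# GDC's `file_size`/`md5sum`).
GDC_FIELDS = ["file_id", "file_name", "data_type", "data_format", "file_size", "md5sum"]


def _in_filter(field: str, values: list) -> dict:
    """One `in` clause in the GDC filter syntax."""
    return {"op": "in", "content": {"field": field, "value": values}}


def build_files_query(project: str, data_type: str, page_size: int = 100) -> dict:
    """Build a request body for the GDC Files endpoint (hypothetical helper)."""
    filters = {
        "op": "and",
        "content": [
            _in_filter("cases.project.project_id", [project]),
            _in_filter("data_type", [data_type]),
            # Controlled-access data is a non-goal, so keep open files only.
            _in_filter("access", ["open"]),
        ],
    }
    return {
        "filters": filters,
        "fields": ",".join(GDC_FIELDS),
        "format": "JSON",
        "size": page_size,
    }


def normalize_hit(hit: dict) -> dict:
    """Rename GDC field names to the names used in this design."""
    return {
        "file_id": hit["file_id"],
        "file_name": hit["file_name"],
        "data_type": hit["data_type"],
        "data_format": hit["data_format"],
        "size": int(hit["file_size"]),
        "md5": hit["md5sum"],
    }


payload = build_files_query("TCGA-BRCA", "Gene Expression Quantification")
```
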
2) `tcga_downloader.manifest`
- Normalizes metadata into a manifest (TSV/JSON).
- Validates required fields and types.
- Loads/saves manifest for reproducible downloads.
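
One possible shape for the TSV side of this module is sketched below. It assumes the manifest follows the column layout of GDC-generated manifests (`id`, `filename`, `md5`, `size`, `state`) so that `gdc-client` can read it directly, and it sorts rows by `id` so that the same query yields a byte-identical file. The function names and the exact validation rules are hypothetical.

```python
import csv
import io
import re

# Assumed column layout, matching GDC-generated manifests.
MANIFEST_COLUMNS = ["id", "filename", "md5", "size", "state"]
REQUIRED_FIELDS = ("file_id", "file_name", "md5", "size")
_MD5_RE = re.compile(r"[0-9a-f]{32}")


def to_manifest_row(record: dict) -> dict:
    """Validate one metadata record and convert it to a manifest row."""
    for key in REQUIRED_FIELDS:
        if record.get(key) in (None, ""):
            raise ValueError(f"manifest record is missing required field '{key}'")
    md5 = str(record["md5"]).lower()
    if not _MD5_RE.fullmatch(md5):
        raise ValueError(f"invalid md5 for {record['file_id']}: {record['md5']!r}")
    size = int(record["size"])  # raises ValueError for non-numeric sizes
    if size < 0:
        raise ValueError(f"negative size for {record['file_id']}")
    return {
        "id": record["file_id"],
        "filename": record["file_name"],
        "md5": md5,
        "size": size,
        "state": record.get("state", ""),  # copied through if present
    }


def write_manifest(records, fh) -> None:
    """Write validated rows as TSV, sorted by id for reproducibility."""
    rows = sorted((to_manifest_row(r) for r in records), key=lambda row: row["id"])
    writer = csv.DictWriter(fh, fieldnames=MANIFEST_COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


def read_manifest(fh) -> list:
    """Load a TSV manifest, failing fast if expected columns are absent."""
    reader = csv.DictReader(fh, delimiter="\t")
    missing = set(MANIFEST_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"manifest is missing columns: {sorted(missing)}")
    return list(reader)


# Round-trip two records through an in-memory buffer.
buffer = io.StringIO()
write_manifest(
    [
        {"file_id": "b2", "file_name": "b.tsv", "md5": "B" * 32, "size": 20},
        {"file_id": "a1", "file_name": "a.tsv", "md5": "a" * 32, "size": 10},
    ],
    buffer,
)
rows = read_manifest(io.StringIO(buffer.getvalue()))
```
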
3) `tcga_downloader.download`
- Calls `gdc-client download -m <manifest>`.
- Enables concurrency, resume, checksum verification, and retries.
- Parses output to report failures and optionally retry.
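
A minimal sketch of how the wrapper might assemble and run the command is shown below. The flags `-m`, `-d`, `-n`, and `--retry-amount` are the ones I understand `gdc-client download` to accept; they should be confirmed against `gdc-client download --help` for the installed version. The function names and the default values of 4 processes and 3 retries are placeholders, since the default policy is still an open question.

```python
import shutil
import subprocess
from pathlib import Path


def build_download_command(manifest, out_dir, processes=4, retries=3, executable="gdc-client"):
    """Assemble the gdc-client argument list (defaults are placeholders)."""
    return [
        executable,
        "download",
        "-m", str(manifest),       # manifest to download from
        "-d", str(out_dir),        # target directory
        "-n", str(processes),      # number of client connections
        "--retry-amount", str(retries),
    ]


def run_download(manifest, out_dir, **options) -> subprocess.CompletedProcess:
    """Run gdc-client and return the completed process for inspection."""
    command = build_download_command(manifest, out_dir, **options)
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} was not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # check=False: the caller inspects returncode and stderr itself.
    return subprocess.run(command, capture_output=True, text=True, check=False)


command = build_download_command("manifest.tsv", "data", processes=2, retries=5)
```
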
4) `tcga_downloader.cli`
- `tcga-downloader query`: build manifest from filters.
- `tcga-downloader download`: download from manifest.
- `tcga-downloader run`: query + download in one step.
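
The three subcommands could be wired up with `argparse` roughly as follows. Only the subcommand names come from this design; the option names (`--project`, `--data-type`, `--out`, `--manifest`, `--dir`) and their defaults are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser with the query, download, and run subcommands."""
    parser = argparse.ArgumentParser(prog="tcga-downloader")
    sub = parser.add_subparsers(dest="command", required=True)

    def add_filters(p: argparse.ArgumentParser) -> None:
        # Filters shared by `query` and `run`.
        p.add_argument("--project", required=True, help="e.g. TCGA-BRCA")
        p.add_argument("--data-type", required=True, help="GDC data type name")

    query = sub.add_parser("query", help="build a manifest from filters")
    add_filters(query)
    query.add_argument("--out", default="manifest.tsv", help="manifest path to write")

    download = sub.add_parser("download", help="download files listed in a manifest")
    download.add_argument("--manifest", required=True)
    download.add_argument("--dir", default="data", help="target directory")

    run = sub.add_parser("run", help="query and download in one step")
    add_filters(run)
    run.add_argument("--dir", default="data", help="target directory")
    return parser


args = build_parser().parse_args(
    ["query", "--project", "TCGA-BRCA", "--data-type", "Gene Expression Quantification"]
)
```

Sharing `add_filters` between `query` and `run` keeps the two commands from drifting apart in how filters are spelled.
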
## Data Flow
1) User inputs filters (project + data_type).
2) `query` fetches metadata from GDC.
3) `manifest` writes TSV/JSON.
4) `download` reads manifest and calls `gdc-client`.
5) Files are stored in the target directory with logs.
## Error Handling
- Query errors: surface the HTTP error and echo the filters that were used; handle empty results gracefully.
- Manifest errors: validate the schema; fail fast with actionable messages.
- Download errors: capture the `gdc-client` exit code and stderr; output the list of failed files; allow retry.
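
One way to produce the failed-file list without depending on the exact format of `gdc-client` output is to compare the manifest against what is on disk after the run. The sketch below takes that route, which is a substitute for output parsing and not necessarily what the final implementation will do. It assumes each file lands at `<dir>/<id>/<filename>`, the layout I understand `gdc-client` to use, and the helper names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path, chunk_size=1 << 20) -> str:
    """Compute a file's MD5 in chunks so large files are not loaded whole."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_failed(rows, out_dir) -> list:
    """Return manifest rows whose file is missing or fails its checksum."""
    failed = []
    for row in rows:
        # Assumed layout: one subdirectory per file id.
        path = Path(out_dir) / row["id"] / row["filename"]
        if not path.is_file() or md5_of(path) != row["md5"]:
            failed.append(row)
    return failed


# Demo: one file downloaded correctly, one never downloaded.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "id-1" / "a.txt"
    good.parent.mkdir()
    good.write_bytes(b"hello")
    manifest_rows = [
        {"id": "id-1", "filename": "a.txt", "md5": hashlib.md5(b"hello").hexdigest()},
        {"id": "id-2", "filename": "b.txt", "md5": "0" * 32},
    ]
    failed_ids = [row["id"] for row in find_failed(manifest_rows, tmp)]
```

The rows returned by `find_failed` can be written out as a smaller manifest and passed to a second download attempt.
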
## Testing Strategy
- Unit tests: manifest read/write and schema validation; query parameter construction (mocked responses).
- Integration tests: query -> manifest generation with mocked API.
- Smoke tests: download a small public file on macOS/Linux (manual or CI-labeled).
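
To illustrate what a mocked-response unit test might look like, the sketch below injects the HTTP call as a parameter so the test never touches the network. The function `fetch_hits`, its injected `post` callable, and the simplified payload are hypothetical stand-ins; the real query code would send the full GDC filter structure.

```python
from unittest import mock

GDC_FILES_URL = "https://api.gdc.cancer.gov/files"


def fetch_hits(post, project: str, data_type: str) -> list:
    """Query through an injected `post` callable and return the hit list."""
    # Simplified placeholder payload, not the real GDC filter syntax.
    payload = {"project": project, "data_type": data_type}
    response = post(GDC_FILES_URL, json=payload)
    return response["data"]["hits"]


def test_fetch_hits_builds_expected_request():
    # The mock plays the role of the HTTP layer and returns a canned body.
    fake_post = mock.Mock(return_value={"data": {"hits": [{"file_id": "f1"}]}})

    hits = fetch_hits(fake_post, "TCGA-BRCA", "Clinical Supplement")

    fake_post.assert_called_once()
    call_args, call_kwargs = fake_post.call_args
    assert call_args == (GDC_FILES_URL,)
    assert call_kwargs["json"] == {"project": "TCGA-BRCA", "data_type": "Clinical Supplement"}
    assert hits == [{"file_id": "f1"}]
```

The same injection point lets the integration tests feed a recorded API response through query and manifest generation in one pass.
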
## Delivery Plan
- `pyproject.toml` with dependencies (`GenomicDataCommons`) and CLI entry point.
- Package structure: `tcga_downloader/` with `query.py`, `manifest.py`, `download.py`, `cli.py`.
- `docs/` with installation and usage.
- `examples/` with typical workflows.
## Open Questions
- The specific list of supported data types, and the default mapping from common user inputs to them.
- Default concurrency and retry policy for `gdc-client`.