tcga-downloader/docs/plans/2026-01-16-tcga-downloader-design.md
2026-01-16 14:05:26 +08:00

TCGA Downloader Design

Summary

Build a Python package and CLI that downloads open-access (non-controlled) TCGA data using a two-part approach: GenomicDataCommons for queries and manifest generation, and gdc-client for reliable downloads.

Goals

  • Provide a Python API and CLI for querying TCGA files by project/cancer type and data type.
  • Generate reproducible manifests (TSV/JSON) for download.
  • Support resumable downloads, checksum verification, concurrency, and retries.
  • Work on macOS and Linux.

Non-Goals

  • Controlled-access data support.
  • Raw sequencing data (FASTQ/BAM/CRAM).
  • Rich GUI.

Architecture

Modules

  1. tcga_downloader.query
     • Uses GenomicDataCommons to query the GDC Files endpoint.
     • Filters by project and data_type.
     • Returns file metadata: file_id, file_name, data_type, data_format, size, md5.
  2. tcga_downloader.manifest
     • Normalizes metadata into a manifest (TSV/JSON).
     • Validates required fields and types.
     • Loads/saves manifests for reproducible downloads.
  3. tcga_downloader.download
     • Calls gdc-client download -m <manifest>.
     • Enables concurrency, resume, checksum verification, and retries.
     • Parses output to report failures and optionally retry.
  4. tcga_downloader.cli
     • tcga-downloader query: build manifest from filters.
     • tcga-downloader download: download from manifest.
     • tcga-downloader run: query + download in one step.
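
The three CLI subcommands could be laid out roughly as follows. This is a minimal sketch using argparse from the standard library; the option names (--project, --data-type, --manifest, --out-dir) and their defaults are assumptions for illustration, not decisions made in this design.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with query, download, and run subcommands."""
    parser = argparse.ArgumentParser(prog="tcga-downloader")
    sub = parser.add_subparsers(dest="command", required=True)

    # Options shared by subcommands that query GDC (names are assumptions).
    filters = argparse.ArgumentParser(add_help=False)
    filters.add_argument("--project", required=True, help="e.g. TCGA-BRCA")
    filters.add_argument("--data-type", required=True)

    # Options shared by subcommands that write downloaded files.
    target = argparse.ArgumentParser(add_help=False)
    target.add_argument("--out-dir", default="downloads")

    # query: filters in, manifest out.
    query = sub.add_parser("query", parents=[filters])
    query.add_argument("--manifest", default="manifest.tsv")

    # download: manifest in, files out.
    download = sub.add_parser("download", parents=[target])
    download.add_argument("--manifest", required=True)

    # run: query + download in one step, so it takes both option groups.
    sub.add_parser("run", parents=[filters, target])
    return parser
```

Sharing the filter and target options through parent parsers keeps run consistent with query and download without repeating the argument definitions.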

Data Flow

  1. User inputs filters (project + data_type).
  2. query fetches metadata from GDC.
  3. manifest writes TSV/JSON.
  4. download reads manifest and calls gdc-client.
  5. Files are stored in the target directory with logs.
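
Steps 1 to 3 above can be sketched with two pure functions. The sketch builds the filter expression as a plain dictionary in the style of the GDC REST API rather than going through GenomicDataCommons, and it assumes a manifest column order of id, filename, md5, size, state; the function names and the record keys are hypothetical and follow the metadata fields listed under Modules.

```python
import csv
import io

# Assumed manifest column order; confirm against what gdc-client expects.
MANIFEST_COLUMNS = ["id", "filename", "md5", "size", "state"]


def build_filters(project: str, data_type: str) -> dict:
    """Turn user filters into a GDC-style nested filter expression."""

    def is_in(field: str, value: str) -> dict:
        return {"op": "in", "content": {"field": field, "value": [value]}}

    return {
        "op": "and",
        "content": [
            is_in("cases.project.project_id", project),
            is_in("data_type", data_type),
            # Restrict to open-access files, matching the non-goals.
            is_in("access", "open"),
        ],
    }


def to_manifest_tsv(records: list[dict]) -> str:
    """Normalize file metadata records into manifest TSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(MANIFEST_COLUMNS)
    for rec in records:
        writer.writerow(
            [rec["file_id"], rec["file_name"], rec["md5"], rec["size"], rec.get("state", "")]
        )
    return buf.getvalue()
```

Keeping both functions free of network and file I/O means the query-to-manifest path can be unit tested with hand-written records, leaving only the HTTP call and the file write to be mocked.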

Error Handling

  • Query errors: surface the HTTP error and echo the filters that were used; handle empty results gracefully.
  • Manifest errors: validate schema; fail fast with actionable messages.
  • Download errors: capture gdc-client exit code and stderr; output failed file list; allow retry.
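
Capturing the exit code and stderr of the download step might look like the sketch below. It assumes gdc-client is on PATH and that -d selects the output directory; DownloadResult, build_command, and run_command are hypothetical names, and mapping a missing executable to exit code 127 is a convention borrowed from shells, not gdc-client behavior.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class DownloadResult:
    returncode: int
    stderr: str

    @property
    def ok(self) -> bool:
        return self.returncode == 0


def build_command(manifest: str, out_dir: str, executable: str = "gdc-client") -> list[str]:
    """Assemble the argv for a manifest-driven download."""
    return [executable, "download", "-m", manifest, "-d", out_dir]


def run_command(argv: list[str]) -> DownloadResult:
    """Run the command, returning exit code and stderr instead of raising."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, check=False)
    except FileNotFoundError as exc:
        # The executable is not installed or not on PATH.
        return DownloadResult(returncode=127, stderr=str(exc))
    return DownloadResult(returncode=proc.returncode, stderr=proc.stderr)
```

A caller would check result.ok and, on failure, log result.stderr before deciding whether to retry. Extracting the failed file list from the output is left out here because the output format has not been pinned down in this design.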

Testing Strategy

  • Unit tests: manifest read/write and schema validation; query parameter construction (mocked responses).
  • Integration tests: query -> manifest generation with mocked API.
  • Smoke tests: download a small public file on macOS/Linux (manual or CI-labeled).
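
A unit test for schema validation could take the following shape. The validator and its required-field table are a hypothetical sketch, not the final schema; the tests are written as plain functions so that pytest would collect them.

```python
# Assumed required fields and types; the real schema is still to be decided.
REQUIRED_FIELDS = {"file_id": str, "file_name": str, "md5": str, "size": int}


def validate_record(record: dict) -> list[str]:
    """Return human-readable problems; an empty list means the record is valid."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems


def test_valid_record_has_no_problems():
    record = {"file_id": "abc", "file_name": "x.tsv", "md5": "0" * 32, "size": 10}
    assert validate_record(record) == []


def test_missing_and_mistyped_fields_are_reported():
    record = {"file_id": "abc", "size": "10"}
    assert validate_record(record) == [
        "missing field: file_name",
        "missing field: md5",
        "size: expected int, got str",
    ]
```

Returning every problem at once, rather than raising on the first, lets the manifest loader fail fast while still giving the user one complete, actionable message.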

Delivery Plan

  • pyproject.toml with dependencies (GenomicDataCommons) and CLI entry point.
  • Package structure: tcga_downloader/ with query.py, manifest.py, download.py, cli.py.
  • docs/ with installation and usage.
  • examples/ with typical workflows.

Open Questions

  • Specific data types list and default mapping for common user inputs.
  • Default concurrency and retry policy for gdc-client.