tcga-downloader/docs/plans/2026-01-16-tcga-downloader-implementation.md
2026-01-23 16:48:14 +08:00

14 KiB

TCGA Downloader Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Build a Python package + CLI to query public TCGA files, generate manifests, and download via gdc-client with retries, resume, and checksums.

Architecture: A small Python package with three core modules (query, manifest, download) and an argparse CLI. Query uses the GDC REST API via requests to avoid hard-coupling to a specific SDK, while keeping the API surface thin for later swapping to GenomicDataCommons if desired. Downloads are delegated to gdc-client.

Tech Stack: Python 3.11+, requests, pytest, gdc-client (external binary).

Task 1: Package Skeleton + Version

Files:

  • Create: pyproject.toml
  • Create: tcga_downloader/__init__.py
  • Create: tests/test_version.py

Step 1: Write the failing test

# tests/test_version.py
from tcga_downloader import __version__

def test_version_present():
    assert __version__

Step 2: Run test to verify it fails

Run: pytest tests/test_version.py -q Expected: FAIL with ModuleNotFoundError or missing __version__.

Step 3: Write minimal implementation

# pyproject.toml
[project]
name = "tcga-downloader"
version = "0.1.0"
description = "TCGA public data downloader"
requires-python = ">=3.11"
dependencies = ["requests>=2.31"]

[project.optional-dependencies]
dev = ["pytest>=7.4"]

[project.scripts]
tcga-downloader = "tcga_downloader.cli:main"
# tcga_downloader/__init__.py
__version__ = "0.1.0"

Step 4: Run test to verify it passes

Run: pytest tests/test_version.py -q Expected: PASS.

Step 5: Commit

git add pyproject.toml tcga_downloader/__init__.py tests/test_version.py
git commit -m "feat: add package skeleton"

Task 2: Manifest Read/Write + Validation

Files:

  • Create: tcga_downloader/manifest.py
  • Create: tests/test_manifest.py

Step 1: Write the failing test

# tests/test_manifest.py
from tcga_downloader.manifest import ManifestRecord, write_manifest, load_manifest

def test_manifest_roundtrip_tsv(tmp_path):
    records = [
        ManifestRecord(
            file_id="f1",
            file_name="a.tsv",
            data_type="Gene Expression",
            data_format="TSV",
            size=123,
            md5="abc",
        )
    ]
    path = tmp_path / "m.tsv"
    write_manifest(records, path, fmt="tsv")
    loaded = load_manifest(path)
    assert loaded == records

Step 2: Run test to verify it fails

Run: pytest tests/test_manifest.py -q Expected: FAIL with ModuleNotFoundError.

Step 3: Write minimal implementation

# tcga_downloader/manifest.py
from __future__ import annotations

import csv
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, List

REQUIRED_FIELDS = ["file_id", "file_name", "data_type", "data_format", "size", "md5"]

@dataclass(frozen=True)
class ManifestRecord:
    file_id: str
    file_name: str
    data_type: str
    data_format: str
    size: int
    md5: str


def _validate_record(rec: ManifestRecord) -> None:
    if not rec.file_id or not rec.file_name:
        raise ValueError("file_id and file_name are required")
    if rec.size < 0:
        raise ValueError("size must be non-negative")


def write_manifest(records: Iterable[ManifestRecord], path: Path, fmt: str = "tsv") -> None:
    path = Path(path)
    if fmt not in {"tsv", "json"}:
        raise ValueError("fmt must be 'tsv' or 'json'")
    records = list(records)
    for rec in records:
        _validate_record(rec)

    if fmt == "json":
        data = [rec.__dict__ for rec in records]
        path.write_text(json.dumps(data, indent=2))
        return

    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=REQUIRED_FIELDS, delimiter="\t")
        writer.writeheader()
        for rec in records:
            writer.writerow(rec.__dict__)


def load_manifest(path: Path) -> List[ManifestRecord]:
    path = Path(path)
    if path.suffix.lower() == ".json":
        data = json.loads(path.read_text())
        return [ManifestRecord(**row) for row in data]

    with path.open("r", newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        return [ManifestRecord(**row) for row in reader]

Step 4: Run test to verify it passes

Run: pytest tests/test_manifest.py -q Expected: PASS.

Step 5: Commit

git add tcga_downloader/manifest.py tests/test_manifest.py
git commit -m "feat: add manifest read/write"

Task 3: GDC Query (REST API)

Files:

  • Create: tcga_downloader/query.py
  • Create: tests/test_query.py

Step 1: Write the failing test

# tests/test_query.py
from tcga_downloader.query import build_filters

def test_build_filters_project_and_type():
    filters = build_filters(project="TCGA-BRCA", data_type="Gene Expression")
    assert filters["op"] == "and"
    assert filters["content"][0]["content"]["field"] == "cases.project.project_id"

Step 2: Run test to verify it fails

Run: pytest tests/test_query.py -q Expected: FAIL with ModuleNotFoundError.

Step 3: Write minimal implementation

# tcga_downloader/query.py
from __future__ import annotations

import requests

GDC_FILES_URL = "https://api.gdc.cancer.gov/files"


def build_filters(project: str, data_type: str) -> dict:
    return {
        "op": "and",
        "content": [
            {
                "op": "in",
                "content": {
                    "field": "cases.project.project_id",
                    "value": [project],
                },
            },
            {
                "op": "in",
                "content": {
                    "field": "data_type",
                    "value": [data_type],
                },
            },
        ],
    }


def query_files(project: str, data_type: str, fields: list[str] | None = None, size: int = 1000) -> list[dict]:
    if fields is None:
        fields = ["file_id", "file_name", "data_type", "data_format", "file_size", "md5sum"]
    payload = {
        "filters": build_filters(project, data_type),
        "fields": ",".join(fields),
        "format": "JSON",
        "size": size,
    }
    resp = requests.post(GDC_FILES_URL, json=payload, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    return data.get("data", {}).get("hits", [])

Step 4: Run test to verify it passes

Run: pytest tests/test_query.py -q Expected: PASS.

Step 5: Commit

git add tcga_downloader/query.py tests/test_query.py
git commit -m "feat: add GDC query via REST"

Task 4: Download Runner (gdc-client)

Files:

  • Create: tcga_downloader/download.py
  • Create: tests/test_download.py

Step 1: Write the failing test

# tests/test_download.py
from pathlib import Path
from unittest.mock import patch

from tcga_downloader.download import build_gdc_command


def test_build_gdc_command():
    cmd = build_gdc_command(Path("/tmp/m.tsv"), Path("/data"), processes=4, retries=3)
    assert "gdc-client" in cmd[0]
    assert "-m" in cmd

Step 2: Run test to verify it fails

Run: pytest tests/test_download.py -q Expected: FAIL with ModuleNotFoundError.

Step 3: Write minimal implementation

# tcga_downloader/download.py
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path


def build_gdc_command(manifest_path: Path, out_dir: Path, processes: int, retries: int) -> list[str]:
    return [
        "gdc-client",
        "download",
        "-m",
        str(manifest_path),
        "-d",
        str(out_dir),
        "--n-processes",
        str(processes),
        "--retry-count",
        str(retries),
        "--checksum",
    ]


def run_gdc_download(manifest_path: Path, out_dir: Path, processes: int = 4, retries: int = 3) -> None:
    if not shutil.which("gdc-client"):
        raise RuntimeError("gdc-client not found in PATH")
    cmd = build_gdc_command(manifest_path, out_dir, processes, retries)
    subprocess.run(cmd, check=True)

Step 4: Run test to verify it passes

Run: pytest tests/test_download.py -q Expected: PASS.

Step 5: Commit

git add tcga_downloader/download.py tests/test_download.py
git commit -m "feat: add gdc-client downloader"

Task 5: CLI Wiring

Files:

  • Create: tcga_downloader/cli.py
  • Create: tests/test_cli.py

Step 1: Write the failing test

# tests/test_cli.py
from tcga_downloader.cli import build_parser


def test_cli_has_subcommands():
    parser = build_parser()
    subparsers = parser._subparsers
    assert subparsers is not None

Step 2: Run test to verify it fails

Run: pytest tests/test_cli.py -q Expected: FAIL with ModuleNotFoundError.

Step 3: Write minimal implementation

# tcga_downloader/cli.py
from __future__ import annotations

import argparse
from pathlib import Path

from tcga_downloader.download import run_gdc_download
from tcga_downloader.manifest import ManifestRecord, write_manifest
from tcga_downloader.query import query_files


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="tcga-downloader")
    sub = parser.add_subparsers(dest="command", required=True)

    q = sub.add_parser("query")
    q.add_argument("--project", required=True)
    q.add_argument("--data-type", required=True)
    q.add_argument("--out", required=True)
    q.add_argument("--format", choices=["tsv", "json"], default="tsv")

    d = sub.add_parser("download")
    d.add_argument("--manifest", required=True)
    d.add_argument("--out-dir", required=True)
    d.add_argument("--processes", type=int, default=4)
    d.add_argument("--retries", type=int, default=3)

    r = sub.add_parser("run")
    r.add_argument("--project", required=True)
    r.add_argument("--data-type", required=True)
    r.add_argument("--out", required=True)
    r.add_argument("--format", choices=["tsv", "json"], default="tsv")
    r.add_argument("--out-dir", required=True)
    r.add_argument("--processes", type=int, default=4)
    r.add_argument("--retries", type=int, default=3)

    return parser


def _records_from_hits(hits: list[dict]) -> list[ManifestRecord]:
    records = []
    for h in hits:
        records.append(
            ManifestRecord(
                file_id=h["file_id"],
                file_name=h["file_name"],
                data_type=h["data_type"],
                data_format=h["data_format"],
                size=int(h["file_size"]),
                md5=h["md5sum"],
            )
        )
    return records


def main() -> None:
    parser = build_parser()
    args = parser.parse_args()

    if args.command in {"query", "run"}:
        hits = query_files(args.project, args.data_type)
        records = _records_from_hits(hits)
        write_manifest(records, Path(args.out), fmt=args.format)

    if args.command in {"download", "run"}:
        run_gdc_download(Path(args.manifest if args.command == "download" else args.out), Path(args.out_dir), args.processes, args.retries)

Step 4: Run test to verify it passes

Run: pytest tests/test_cli.py -q Expected: PASS.

Step 5: Commit

git add tcga_downloader/cli.py tests/test_cli.py
git commit -m "feat: add CLI entry points"

Task 6: End-to-End Sanity Tests (Mocked)

Files:

  • Modify: tests/test_cli.py

Step 1: Write the failing test

# tests/test_cli.py
from unittest.mock import patch

from tcga_downloader.cli import main


def test_cli_query_writes_manifest(tmp_path, monkeypatch):
    args = [
        "tcga-downloader",
        "query",
        "--project",
        "TCGA-BRCA",
        "--data-type",
        "Gene Expression",
        "--out",
        str(tmp_path / "m.tsv"),
    ]
    monkeypatch.setattr("sys.argv", args)
    with patch("tcga_downloader.cli.query_files") as q:
        q.return_value = [
            {
                "file_id": "f1",
                "file_name": "a.tsv",
                "data_type": "Gene Expression",
                "data_format": "TSV",
                "file_size": 123,
                "md5sum": "abc",
            }
        ]
        main()
    assert (tmp_path / "m.tsv").exists()

Step 2: Run test to verify it fails

Run: pytest tests/test_cli.py::test_cli_query_writes_manifest -q Expected: FAIL if behavior not implemented or module missing.

Step 3: Write minimal implementation

# tcga_downloader/cli.py
# (No code changes expected if Task 5 is complete. If failing, fix args handling.)

Step 4: Run test to verify it passes

Run: pytest tests/test_cli.py::test_cli_query_writes_manifest -q Expected: PASS.

Step 5: Commit

git add tests/test_cli.py
git commit -m "test: add CLI query smoke test"

Task 7: Documentation

Files:

  • Create: README.md

Step 1: Write the failing test

# tests/test_readme.py
from pathlib import Path

def test_readme_present():
    assert Path("README.md").exists()

Step 2: Run test to verify it fails

Run: pytest tests/test_readme.py -q Expected: FAIL with missing README.

Step 3: Write minimal implementation

# README.md

## TCGA Downloader

Python package + CLI to query public TCGA files and download via gdc-client.

### Install

```bash
pip install -e .

Example

tcga-downloader query --project TCGA-BRCA --data-type "Gene Expression" --out manifest.tsv

tcga-downloader download --manifest manifest.tsv --out-dir ./data

**Step 4: Run test to verify it passes**

Run: `pytest tests/test_readme.py -q`
Expected: PASS.

**Step 5: Commit**

```bash
git add README.md tests/test_readme.py
git commit -m "docs: add README"