docs: add implementation plan

This commit is contained in:
yunpeng.zhang 2026-01-23 16:48:14 +08:00
parent 9217a3f20b
commit cd5cf41661

View File

@ -0,0 +1,558 @@
# TCGA Downloader Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
**Goal:** Build a Python package + CLI to query public TCGA files, generate manifests, and download via `gdc-client` with retries, resume, and checksums.
**Architecture:** A small Python package with three core modules (`query`, `manifest`, `download`) and an `argparse` CLI. Query uses the GDC REST API via `requests` to avoid hard-coupling to a specific SDK, while keeping the API surface thin for later swapping to `GenomicDataCommons` if desired. Downloads are delegated to `gdc-client`.
**Tech Stack:** Python 3.11+, `requests`, `pytest`, `gdc-client` (external binary).
### Task 1: Package Skeleton + Version
**Files:**
- Create: `pyproject.toml`
- Create: `tcga_downloader/__init__.py`
- Create: `tests/test_version.py`
**Step 1: Write the failing test**
```python
# tests/test_version.py
from tcga_downloader import __version__
def test_version_present():
assert __version__
```
**Step 2: Run test to verify it fails**
Run: `pytest tests/test_version.py -q`
Expected: FAIL with `ModuleNotFoundError` or missing `__version__`.
**Step 3: Write minimal implementation**
```toml
# pyproject.toml
[project]
name = "tcga-downloader"
version = "0.1.0"
description = "TCGA public data downloader"
requires-python = ">=3.11"
dependencies = ["requests>=2.31"]
[project.optional-dependencies]
dev = ["pytest>=7.4"]
[project.scripts]
tcga-downloader = "tcga_downloader.cli:main"
```
```python
# tcga_downloader/__init__.py
__version__ = "0.1.0"
```
**Step 4: Run test to verify it passes**
Run: `pytest tests/test_version.py -q`
Expected: PASS.
**Step 5: Commit**
```bash
git add pyproject.toml tcga_downloader/__init__.py tests/test_version.py
git commit -m "feat: add package skeleton"
```
### Task 2: Manifest Read/Write + Validation
**Files:**
- Create: `tcga_downloader/manifest.py`
- Create: `tests/test_manifest.py`
**Step 1: Write the failing test**
```python
# tests/test_manifest.py
from tcga_downloader.manifest import ManifestRecord, write_manifest, load_manifest
def test_manifest_roundtrip_tsv(tmp_path):
records = [
ManifestRecord(
file_id="f1",
file_name="a.tsv",
data_type="Gene Expression",
data_format="TSV",
size=123,
md5="abc",
)
]
path = tmp_path / "m.tsv"
write_manifest(records, path, fmt="tsv")
loaded = load_manifest(path)
assert loaded == records
```
**Step 2: Run test to verify it fails**
Run: `pytest tests/test_manifest.py -q`
Expected: FAIL with `ModuleNotFoundError`.
**Step 3: Write minimal implementation**
```python
# tcga_downloader/manifest.py
from __future__ import annotations
import csv
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, List
REQUIRED_FIELDS = ["file_id", "file_name", "data_type", "data_format", "size", "md5"]
@dataclass(frozen=True)
class ManifestRecord:
file_id: str
file_name: str
data_type: str
data_format: str
size: int
md5: str
def _validate_record(rec: ManifestRecord) -> None:
if not rec.file_id or not rec.file_name:
raise ValueError("file_id and file_name are required")
if rec.size < 0:
raise ValueError("size must be non-negative")
def write_manifest(records: Iterable[ManifestRecord], path: Path, fmt: str = "tsv") -> None:
path = Path(path)
if fmt not in {"tsv", "json"}:
raise ValueError("fmt must be 'tsv' or 'json'")
records = list(records)
for rec in records:
_validate_record(rec)
if fmt == "json":
data = [rec.__dict__ for rec in records]
path.write_text(json.dumps(data, indent=2))
return
with path.open("w", newline="") as f:
writer = csv.DictWriter(f, fieldnames=REQUIRED_FIELDS, delimiter="\t")
writer.writeheader()
for rec in records:
writer.writerow(rec.__dict__)
def load_manifest(path: Path) -> List[ManifestRecord]:
path = Path(path)
if path.suffix.lower() == ".json":
data = json.loads(path.read_text())
return [ManifestRecord(**row) for row in data]
with path.open("r", newline="") as f:
reader = csv.DictReader(f, delimiter="\t")
return [ManifestRecord(**row) for row in reader]
```
**Step 4: Run test to verify it passes**
Run: `pytest tests/test_manifest.py -q`
Expected: PASS.
**Step 5: Commit**
```bash
git add tcga_downloader/manifest.py tests/test_manifest.py
git commit -m "feat: add manifest read/write"
```
### Task 3: GDC Query (REST API)
**Files:**
- Create: `tcga_downloader/query.py`
- Create: `tests/test_query.py`
**Step 1: Write the failing test**
```python
# tests/test_query.py
from tcga_downloader.query import build_filters
def test_build_filters_project_and_type():
filters = build_filters(project="TCGA-BRCA", data_type="Gene Expression")
assert filters["op"] == "and"
assert filters["content"][0]["content"]["field"] == "cases.project.project_id"
```
**Step 2: Run test to verify it fails**
Run: `pytest tests/test_query.py -q`
Expected: FAIL with `ModuleNotFoundError`.
**Step 3: Write minimal implementation**
```python
# tcga_downloader/query.py
from __future__ import annotations
import requests
GDC_FILES_URL = "https://api.gdc.cancer.gov/files"
def build_filters(project: str, data_type: str) -> dict:
return {
"op": "and",
"content": [
{
"op": "in",
"content": {
"field": "cases.project.project_id",
"value": [project],
},
},
{
"op": "in",
"content": {
"field": "data_type",
"value": [data_type],
},
},
],
}
def query_files(project: str, data_type: str, fields: list[str] | None = None, size: int = 1000) -> list[dict]:
if fields is None:
fields = ["file_id", "file_name", "data_type", "data_format", "file_size", "md5sum"]
payload = {
"filters": build_filters(project, data_type),
"fields": ",".join(fields),
"format": "JSON",
"size": size,
}
resp = requests.post(GDC_FILES_URL, json=payload, timeout=30)
resp.raise_for_status()
data = resp.json()
return data.get("data", {}).get("hits", [])
```
**Step 4: Run test to verify it passes**
Run: `pytest tests/test_query.py -q`
Expected: PASS.
**Step 5: Commit**
```bash
git add tcga_downloader/query.py tests/test_query.py
git commit -m "feat: add GDC query via REST"
```
### Task 4: Download Runner (gdc-client)
**Files:**
- Create: `tcga_downloader/download.py`
- Create: `tests/test_download.py`
**Step 1: Write the failing test**
```python
# tests/test_download.py
from pathlib import Path
from unittest.mock import patch
from tcga_downloader.download import build_gdc_command
def test_build_gdc_command():
cmd = build_gdc_command(Path("/tmp/m.tsv"), Path("/data"), processes=4, retries=3)
assert "gdc-client" in cmd[0]
assert "-m" in cmd
```
**Step 2: Run test to verify it fails**
Run: `pytest tests/test_download.py -q`
Expected: FAIL with `ModuleNotFoundError`.
**Step 3: Write minimal implementation**
```python
# tcga_downloader/download.py
from __future__ import annotations
import shutil
import subprocess
from pathlib import Path
def build_gdc_command(manifest_path: Path, out_dir: Path, processes: int, retries: int) -> list[str]:
return [
"gdc-client",
"download",
"-m",
str(manifest_path),
"-d",
str(out_dir),
"--n-processes",
str(processes),
"--retry-count",
str(retries),
"--checksum",
]
def run_gdc_download(manifest_path: Path, out_dir: Path, processes: int = 4, retries: int = 3) -> None:
if not shutil.which("gdc-client"):
raise RuntimeError("gdc-client not found in PATH")
cmd = build_gdc_command(manifest_path, out_dir, processes, retries)
subprocess.run(cmd, check=True)
```
**Step 4: Run test to verify it passes**
Run: `pytest tests/test_download.py -q`
Expected: PASS.
**Step 5: Commit**
```bash
git add tcga_downloader/download.py tests/test_download.py
git commit -m "feat: add gdc-client downloader"
```
### Task 5: CLI Wiring
**Files:**
- Create: `tcga_downloader/cli.py`
- Create: `tests/test_cli.py`
**Step 1: Write the failing test**
```python
# tests/test_cli.py
from tcga_downloader.cli import build_parser
def test_cli_has_subcommands():
parser = build_parser()
subparsers = parser._subparsers
assert subparsers is not None
```
**Step 2: Run test to verify it fails**
Run: `pytest tests/test_cli.py -q`
Expected: FAIL with `ModuleNotFoundError`.
**Step 3: Write minimal implementation**
```python
# tcga_downloader/cli.py
from __future__ import annotations
import argparse
from pathlib import Path
from tcga_downloader.download import run_gdc_download
from tcga_downloader.manifest import ManifestRecord, write_manifest
from tcga_downloader.query import query_files
def build_parser() -> argparse.ArgumentParser:
parser = argparse.ArgumentParser(prog="tcga-downloader")
sub = parser.add_subparsers(dest="command", required=True)
q = sub.add_parser("query")
q.add_argument("--project", required=True)
q.add_argument("--data-type", required=True)
q.add_argument("--out", required=True)
q.add_argument("--format", choices=["tsv", "json"], default="tsv")
d = sub.add_parser("download")
d.add_argument("--manifest", required=True)
d.add_argument("--out-dir", required=True)
d.add_argument("--processes", type=int, default=4)
d.add_argument("--retries", type=int, default=3)
r = sub.add_parser("run")
r.add_argument("--project", required=True)
r.add_argument("--data-type", required=True)
r.add_argument("--out", required=True)
r.add_argument("--format", choices=["tsv", "json"], default="tsv")
r.add_argument("--out-dir", required=True)
r.add_argument("--processes", type=int, default=4)
r.add_argument("--retries", type=int, default=3)
return parser
def _records_from_hits(hits: list[dict]) -> list[ManifestRecord]:
records = []
for h in hits:
records.append(
ManifestRecord(
file_id=h["file_id"],
file_name=h["file_name"],
data_type=h["data_type"],
data_format=h["data_format"],
size=int(h["file_size"]),
md5=h["md5sum"],
)
)
return records
def main() -> None:
parser = build_parser()
args = parser.parse_args()
if args.command in {"query", "run"}:
hits = query_files(args.project, args.data_type)
records = _records_from_hits(hits)
write_manifest(records, Path(args.out), fmt=args.format)
if args.command in {"download", "run"}:
run_gdc_download(Path(args.manifest if args.command == "download" else args.out), Path(args.out_dir), args.processes, args.retries)
```
**Step 4: Run test to verify it passes**
Run: `pytest tests/test_cli.py -q`
Expected: PASS.
**Step 5: Commit**
```bash
git add tcga_downloader/cli.py tests/test_cli.py
git commit -m "feat: add CLI entry points"
```
### Task 6: End-to-End Sanity Tests (Mocked)
**Files:**
- Modify: `tests/test_cli.py`
**Step 1: Write the failing test**
```python
# tests/test_cli.py
from unittest.mock import patch
from tcga_downloader.cli import main
def test_cli_query_writes_manifest(tmp_path, monkeypatch):
args = [
"tcga-downloader",
"query",
"--project",
"TCGA-BRCA",
"--data-type",
"Gene Expression",
"--out",
str(tmp_path / "m.tsv"),
]
monkeypatch.setattr("sys.argv", args)
with patch("tcga_downloader.cli.query_files") as q:
q.return_value = [
{
"file_id": "f1",
"file_name": "a.tsv",
"data_type": "Gene Expression",
"data_format": "TSV",
"file_size": 123,
"md5sum": "abc",
}
]
main()
assert (tmp_path / "m.tsv").exists()
```
**Step 2: Run test to verify it fails**
Run: `pytest tests/test_cli.py::test_cli_query_writes_manifest -q`
Expected: FAIL if behavior not implemented or module missing.
**Step 3: Write minimal implementation**
```python
# tcga_downloader/cli.py
# (No code changes expected if Task 5 is complete. If failing, fix args handling.)
```
**Step 4: Run test to verify it passes**
Run: `pytest tests/test_cli.py::test_cli_query_writes_manifest -q`
Expected: PASS.
**Step 5: Commit**
```bash
git add tests/test_cli.py
git commit -m "test: add CLI query smoke test"
```
### Task 7: Documentation
**Files:**
- Create: `README.md`
**Step 1: Write the failing test**
```python
# tests/test_readme.py
from pathlib import Path
def test_readme_present():
assert Path("README.md").exists()
```
**Step 2: Run test to verify it fails**
Run: `pytest tests/test_readme.py -q`
Expected: FAIL with missing README.
**Step 3: Write minimal implementation**
```markdown
# README.md
## TCGA Downloader
Python package + CLI to query public TCGA files and download via gdc-client.
### Install
```bash
pip install -e .
```
### Example
```bash
tcga-downloader query --project TCGA-BRCA --data-type "Gene Expression" --out manifest.tsv
tcga-downloader download --manifest manifest.tsv --out-dir ./data
```
```
**Step 4: Run test to verify it passes**
Run: `pytest tests/test_readme.py -q`
Expected: PASS.
**Step 5: Commit**
```bash
git add README.md tests/test_readme.py
git commit -m "docs: add README"
```