# TCGA Downloader

Python package and CLI to query public TCGA (The Cancer Genome Atlas) files and download them via gdc-client, with retry logic, checksum verification, and progress logging.

## Features

- Query TCGA files by project, data type, sample type, and platform
- Generate reproducible manifests (TSV/JSON) for downloads
- Reliable downloads with automatic retries and checksum verification
- Data statistics (file count, total size, data types breakdown)
- Comprehensive logging with verbose mode support
- Resumable and concurrent downloads via gdc-client

## Requirements

- Python 3.11+
- gdc-client (external tool, see [Installation](#installation))
- pip or uv for package installation

## Installation

### 1. Install gdc-client

Download and install the GDC Data Transfer Tool from [https://gdc.cancer.gov/access-data/gdc-data-transfer-tool](https://gdc.cancer.gov/access-data/gdc-data-transfer-tool).

Make sure `gdc-client` is in your PATH:

```bash
gdc-client version
```

### 2. Install this package

```bash
git clone <repository-url>
cd tcga-downloader
pip install -e .
```

For development:

```bash
pip install -e ".[dev]"
```

This installs additional development tools:

- pytest (testing)
- pytest-cov (coverage)
- black (code formatting)
- ruff (linting)
- mypy (type checking)

## Usage

### Basic Query

Query TCGA files by project and data type:

```bash
tcga-downloader query \
  --project TCGA-BRCA \
  --data-type "Gene Expression" \
  --out manifest.tsv
```

### Advanced Query with Filters

Filter by sample type and platform:

```bash
tcga-downloader query \
  --project TCGA-BRCA \
  --data-type "Gene Expression" \
  --sample-type "Primary Tumor" \
  --platform "Illumina HiSeq" \
  --out manifest.tsv
```

### Download Files

Download files using a manifest:

```bash
tcga-downloader download \
  --manifest manifest.tsv \
  --out-dir ./data
```

### One-Step Query and Download

Query and download in a single command:

```bash
tcga-downloader run \
  --project TCGA-BRCA \
  --data-type "Gene Expression" \
  --out manifest.tsv \
  --out-dir ./data
```

### Verbose Mode

Enable detailed logging:

```bash
tcga-downloader query \
  --project TCGA-BRCA \
  --data-type "Gene Expression" \
  --out manifest.tsv \
  --verbose
```

### Log to File

Save logs to a file:

```bash
tcga-downloader query \
  --project TCGA-BRCA \
  --data-type "Gene Expression" \
  --out manifest.tsv \
  --log-file download.log
```

### Custom Download Options

Adjust concurrency and retry settings:

```bash
tcga-downloader download \
  --manifest manifest.tsv \
  --out-dir ./data \
  --processes 8 \
  --retries 5
```

### Downloading Controlled-Access Data

Some TCGA data types (e.g., Clinical Supplement) require controlled access. To download these files:

1. **Obtain an authentication token** from the GDC Data Portal:
   - Log in to https://portal.gdc.cancer.gov/
   - Click on your username → "Download Token"
   - Save the token file to a secure location

2. **Use the `--token` parameter** when downloading:

   ```bash
   tcga-downloader download \
     --manifest manifest.tsv \
     --out-dir ./data \
     --token /path/to/gdc-user-token.txt
   ```

Or add `token` to your config file:

```json
{
  "download": {
    "out_dir": "./data",
    "processes": 4,
    "retries": 3,
    "token": "/path/to/gdc-user-token.txt"
  }
}
```

**Note**: Open-access data does not require a token. Only controlled-access files (marked as such in the GDC Data Portal) need authentication.

## Common TCGA Projects

- TCGA-BRCA: Breast invasive carcinoma
- TCGA-LUAD: Lung adenocarcinoma
- TCGA-COAD: Colon adenocarcinoma
- TCGA-PRAD: Prostate adenocarcinoma
- TCGA-SKCM: Skin cutaneous melanoma

## Common Data Types

- Gene Expression
- Copy Number Variation
- DNA Methylation
- miRNA Expression
- Protein Expression
- Somatic Mutation

## Development

### Setup pre-commit hooks

```bash
pip install pre-commit
pre-commit install
```

### Run tests

```bash
pytest
```

With coverage:

```bash
pytest --cov=tcga_downloader --cov-report=html
```

### Code formatting

```bash
black .
ruff check --fix .
mypy tcga_downloader
```

### CI/CD

The project uses GitHub Actions for CI:

- Linting with black, ruff, mypy
- Testing with pytest
- Coverage reporting

## License

MIT License
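
## Example: Checking Downloads Against a Manifest

The checksum verification listed under Features can also be reproduced by hand, which may help when auditing an existing download directory. The sketch below is not this package's implementation. It assumes the TSV manifest has `id`, `filename`, and `md5` columns (as in a standard GDC manifest) and that files sit under `<out-dir>/<id>/<filename>`, the layout gdc-client normally produces. The names `md5_of_file` and `verify_manifest` are hypothetical helpers introduced for illustration.

```python
import csv
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path: Path, data_dir: Path) -> dict[str, str]:
    """Map each manifest filename to "ok", "missing", or "mismatch".

    Assumption: the manifest is tab-separated with `id`, `filename`,
    and `md5` columns, and files live at data_dir/<id>/<filename>.
    """
    results: dict[str, str] = {}
    with manifest_path.open(newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            target = data_dir / row["id"] / row["filename"]
            if not target.is_file():
                results[row["filename"]] = "missing"
            elif md5_of_file(target) == row["md5"].strip().lower():
                results[row["filename"]] = "ok"
            else:
                results[row["filename"]] = "mismatch"
    return results
```

Calling `verify_manifest(Path("manifest.tsv"), Path("./data"))` returns a dictionary that can be filtered for anything other than `"ok"` to decide which files to fetch again. If the manifest written by `tcga-downloader query` uses different column names, adjust the keys accordingly.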