# TCGA Downloader

A Python package and CLI for querying public TCGA (The Cancer Genome Atlas) files and downloading them via gdc-client, with retry logic, checksum verification, and progress logging.

## Features

- Query TCGA files by project, data type, sample type, and platform
- Generate reproducible manifests (TSV/JSON) for downloads
- Reliable downloads with automatic retries and checksum verification
- Data statistics (file count, total size, breakdown by data type)
- Comprehensive logging with verbose mode support
- Resumable and concurrent downloads via gdc-client

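To illustrate the checksum verification listed above, here is a minimal sketch. It is not the package's actual implementation. It assumes each file has a known MD5 digest (GDC publishes MD5 checksums for its files); the function names `md5_of_file` and `verify_checksum` are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Reading in chunks keeps memory use flat for multi-gigabyte files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path: Path, expected_md5: str) -> bool:
    """Compare the file's MD5 with the expected digest, ignoring case."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A downloader could call `verify_checksum` after each file completes and schedule a retry when it returns `False`.
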
## Requirements

- Python 3.11+
- gdc-client (external tool, see [Installation](#installation))
- pip or uv for package installation

## Installation

### 1. Install gdc-client

Download and install the GDC Data Transfer Tool from [https://gdc.cancer.gov/access-data/gdc-data-transfer-tool](https://gdc.cancer.gov/access-data/gdc-data-transfer-tool).

Make sure `gdc-client` is on your PATH:

```bash
gdc-client version
```

### 2. Install this package

```bash
git clone <repository-url>
cd tcga-downloader
pip install -e .
```

For development:

```bash
pip install -e ".[dev]"
```

This installs additional development tools:

- pytest (testing)
- pytest-cov (coverage)
- black (code formatting)
- ruff (linting)
- mypy (type checking)

## Usage

### Basic Query

Query TCGA files by project and data type:

```bash
tcga-downloader query \
  --project TCGA-BRCA \
  --data-type "Gene Expression" \
  --out manifest.tsv
```

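Once a manifest exists, statistics such as file count, total size, and a per-type breakdown can be computed from it. The sketch below assumes the TSV has `size` (bytes) and `data_type` columns; those column names are hypothetical, so check the header of your own manifest before relying on them.

```python
import csv
import io
from collections import Counter


def summarize_manifest(tsv_text: str) -> dict:
    """Summarize a tab-separated manifest given as text.

    Assumes (hypothetically) that the manifest has 'size' and 'data_type' columns.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    file_count = 0
    total_size = 0
    by_type: Counter[str] = Counter()
    for row in reader:
        file_count += 1
        total_size += int(row["size"])
        by_type[row["data_type"]] += 1
    return {
        "file_count": file_count,
        "total_size": total_size,
        "data_types": dict(by_type),
    }
```

To use it on a file, read the manifest first, for example `summarize_manifest(Path("manifest.tsv").read_text())` after importing `Path` from `pathlib`.
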
### Advanced Query with Filters

Filter by sample type and platform:

```bash
tcga-downloader query \
  --project TCGA-BRCA \
  --data-type "Gene Expression" \
  --sample-type "Primary Tumor" \
  --platform "Illumina HiSeq" \
  --out manifest.tsv
```

### Download Files

Download files using a manifest:

```bash
tcga-downloader download \
  --manifest manifest.tsv \
  --out-dir ./data
```

### One-Step Query and Download

Query and download in a single command:

```bash
tcga-downloader run \
  --project TCGA-BRCA \
  --data-type "Gene Expression" \
  --out manifest.tsv \
  --out-dir ./data
```

### Verbose Mode

Enable detailed logging:

```bash
tcga-downloader query \
  --project TCGA-BRCA \
  --data-type "Gene Expression" \
  --out manifest.tsv \
  --verbose
```

### Log to File

Save logs to a file:

```bash
tcga-downloader query \
  --project TCGA-BRCA \
  --data-type "Gene Expression" \
  --out manifest.tsv \
  --log-file download.log
```

### Custom Download Options

Adjust concurrency and retry settings:

```bash
tcga-downloader download \
  --manifest manifest.tsv \
  --out-dir ./data \
  --processes 8 \
  --retries 5
```

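To make the retry setting concrete, here is a generic retry loop with exponential backoff. Whether this package uses backoff, and how it counts attempts, is not stated here, so treat the behavior below (one initial attempt plus up to `retries` retries, with the delay doubling each time) as an assumption. `with_retries` is a hypothetical helper.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    action: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], object] = time.sleep,
) -> T:
    """Run action, retrying up to `retries` times with exponential backoff."""
    attempt = 0
    while True:
        try:
            return action()
        except Exception:
            if attempt >= retries:
                raise  # retries exhausted: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2**attempt)
            attempt += 1
```

The `sleep` parameter is injectable so the loop can be exercised in tests without actually waiting.
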
### Downloading Controlled-Access Data

Some TCGA data types (e.g., Clinical Supplement) require controlled access. To download these files:

1. **Obtain an authentication token** from the GDC Data Portal:
   - Log in to https://portal.gdc.cancer.gov/
   - Click your username → "Download Token"
   - Save the token file to a secure location

2. **Pass the token file with `--token`** when downloading:

   ```bash
   tcga-downloader download \
     --manifest manifest.tsv \
     --out-dir ./data \
     --token /path/to/gdc-user-token.txt
   ```

Alternatively, set `token` in your config file:

```json
{
  "download": {
    "out_dir": "./data",
    "processes": 4,
    "retries": 3,
    "token": "/path/to/gdc-user-token.txt"
  }
}
```

**Note**: Open-access data does not require a token. Only controlled-access files (marked as such in the GDC Data Portal) need authentication.

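A config file shaped like the JSON above could be read as follows. This is only a sketch: the fallback values used when a key is missing mirror the example config and are assumptions, and `load_download_settings` is a hypothetical helper rather than this package's API.

```python
import json
from pathlib import Path


def load_download_settings(config_text: str) -> dict:
    """Parse the 'download' section of a JSON config given as text."""
    config = json.loads(config_text)
    download = config.get("download", {})
    token = download.get("token")
    return {
        # Fallbacks below are assumptions that mirror the example config.
        "out_dir": Path(download.get("out_dir", "./data")),
        "processes": int(download.get("processes", 4)),
        "retries": int(download.get("retries", 3)),
        # No token means only open-access files can be downloaded.
        "token": Path(token) if token else None,
    }
```

Keeping `token` as `None` when it is absent lets calling code decide whether to skip authentication or stop with an error.
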
## Common TCGA Projects

- TCGA-BRCA: Breast invasive carcinoma
- TCGA-LUAD: Lung adenocarcinoma
- TCGA-COAD: Colon adenocarcinoma
- TCGA-PRAD: Prostate adenocarcinoma
- TCGA-SKCM: Skin cutaneous melanoma

## Common Data Types

- Gene Expression
- Copy Number Variation
- DNA Methylation
- miRNA Expression
- Protein Expression
- Somatic Mutation

## Development

### Set up pre-commit hooks

```bash
pip install pre-commit
pre-commit install
```

### Run tests

```bash
pytest
```

With coverage:

```bash
pytest --cov=tcga_downloader --cov-report=html
```

### Code formatting

Format with black, lint with ruff, and type-check with mypy:

```bash
black .
ruff check --fix .
mypy tcga_downloader
```

### CI/CD

The project uses GitHub Actions for CI:

- Linting with black, ruff, mypy
- Testing with pytest
- Coverage reporting

## License

MIT License