# Example workflows — the 19K-RGP access cookbook

Runnable, cross-platform example workflows for the 19K Rice Genome Project, plus the
`oryza19k` helper that orchestrates the platforms' existing interfaces. This is the
*example workflows* deliverable for Reviewer #1.6.

| File | Workflow |
|------|----------|
| `oryza19k.py` | the access-cookbook helper (Ensembl REST, remote `tabix`, CLIMtools tables, precomputed summaries, pre-trained models) |
| `workflow_A_mads14_heading_date.py` | **A** — from the *MADS50/MADS14* locus to a heading-date hypothesis (all platforms) |
| `workflow_B_tb1_tiller_number.py` | **B** — a rare *TB1/FC1* high-effect variant → tiller number (with effect-size CI) |
| `workflow_C_imputation_benchmark.py` | **C** — whole-collection trait imputation & benchmarking vs rrBLUP/GBLUP/BayesA–C |

## What `oryza19k` is

`oryza19k` (`oryza19k.py`) is a single-file Python **helper module** — an *access cookbook*, not a
server or a new API. Each function wraps an interface that **already exists**, so a workflow reads as
a few clean calls instead of boilerplate. `import oryza19k as o19` loads it; everything is then
`o19.<function>(...)`.

| Function | Wraps | Needs network? |
|---|---|---|
| `lookup_gene`, `region_features`, `vep_effects` | Ensembl REST (GrameneOryza) | yes |
| `region_genotypes(source="tabix")` | stream a bgzipped VCF with tabix | yes (a public VCF host) |
| `region_genotypes(source="local")` | slice a local parquet/CSV matrix | no |
| `climate_for_gene`, `accession_climate` | read CLIMtools result tables | no (local table) |
| `predict_trait` | the repo's pre-trained model (joblib) | no (local model files) |
| `summary_table` | the precomputed summary tables | no (local); yes for Zenodo |
| `choose_platform` | the "which platform?" decision table | no |

It is **import-safe and offline-friendly**: it imports with only pandas + numpy, and every networked
call degrades gracefully (no hard dependency on `requests` — it falls back to the standard library).

## What it needs — libraries, data, environment

**Libraries** (`pip install -r requirements.txt`) — only what each feature uses:

| Feature you use | Packages |
|---|---|
| Import the module; local tables/genotypes | `pandas`, `numpy` (+ `pyarrow` for parquet) |
| Ensembl REST / Zenodo fetches | nothing extra (`requests` optional → falls back to stdlib `urllib`) |
| `predict_trait` (pre-trained models) | `joblib`, `scikit-learn`, **`xgboost`, `lightgbm`** (the model is an XGBoost+LightGBM ensemble) |
| `region_genotypes(source="tabix")` | `pysam` **or** the `tabix`/`bcftools` CLIs (htslib) |

**Data** (only for the functions you call):

| To use… | You need… | Where it comes from |
|---|---|---|
| `summary_table(...)` | the precomputed `.tsv` files | run `../precomputed_tables/generate_tables.py`, or the Zenodo DOI |
| `predict_trait(...)` | the `.pkl` model + imputer + features CSV | clone `github.com/YongZhou2019/19K-RGP`; point `models_dir` at `AI-drive Predictive Phenotype Modeling` |
| `region_genotypes(source="tabix")` | a public bgzipped + tabix-indexed VCF URL | set `ORYZA19K_VCF_BASE`, or pass `vcf_url=` |
| `region_genotypes(source="local")` | the genotype matrix `.parquet` | your local data, or `local_parquet=` |
| `climate_for_gene(...)` | a CLIMtools GenoCLIM export | download from the app, or ship under `precomputed_tables/` |
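
For the `tabix` row above, the sketch below shows one way a region could be pulled from a remote, tabix-indexed VCF by shelling out to the `tabix` CLI. It is a hedged illustration, not how `region_genotypes` is implemented: the function names are hypothetical, the example URL in the usage note is a placeholder, and remote access assumes an htslib build with network support.

```python
import subprocess


def tabix_command(vcf_url, chrom, start, end, with_header=True):
    """Build the argv for `tabix [-h] <vcf> <chrom>:<start>-<end>`.

    Coordinates are 1-based and inclusive, as in tabix region strings.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    cmd = ["tabix"]
    if with_header:
        cmd.append("-h")          # also print the VCF header lines
    cmd += [vcf_url, f"{chrom}:{start}-{end}"]
    return cmd


def fetch_region(vcf_url, chrom, start, end):
    """Run tabix and return the VCF lines for one region (needs tabix on PATH)."""
    result = subprocess.run(tabix_command(vcf_url, chrom, start, end),
                            check=True, capture_output=True, text=True)
    return result.stdout.splitlines()
```

A call such as `fetch_region("https://example.org/collection.vcf.gz", "3", 100, 200)` would only download the index and the blocks covering that interval. The chromosome label must match the naming used inside the VCF.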

**Environments**, simplest → fullest:
- **Minimal local** — `pip install pandas numpy pyarrow`. Gets you Ensembl, local genotypes, and the precomputed tables. (Remote `tabix` also needs `pysam` or the htslib CLIs; `predict_trait` needs the ML stack + model files.)
- **Full local** — also `pip install -r requirements.txt` and clone the repo for the models.
- **Google Colab** — `pip install -r requirements.txt`, clone the repo, run.
- **Docker (recommended for prediction)** — the project image already bundles scikit-learn / xgboost / lightgbm / RAPIDS / Jupyter; just clone the repo for the model files.

## Always runs for a reader
Every networked step is wrapped so that if a live service is unavailable it **falls back to a
shipped precomputed table** (see `../precomputed_tables/`). The Ensembl REST steps have been
verified against the live service; the SNP-Seek REST paths are pending confirmation with IRRI and,
until that is settled, default to the `tabix`/precomputed path.
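
The fallback described above could be sketched roughly as follows. This is an assumption-laden illustration rather than the helper's real code: `with_fallback` and `read_tsv` are hypothetical names, the table is read with the standard-library `csv` module instead of pandas to keep the example dependency-free, and catching every `Exception` is a deliberate simplification.

```python
import csv


def read_tsv(path):
    """Read a tab-separated table into a list of row dictionaries."""
    with open(path, newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh, delimiter="\t"))


def with_fallback(live_call, fallback_path):
    """Try the live service first; on any failure use the shipped table.

    Returns (rows, source) so the caller can report where the data came from.
    """
    try:
        return live_call(), "live"
    except Exception as exc:      # broad on purpose: any failure means "offline"
        print(f"live service unavailable ({exc!r}); using {fallback_path}")
        return read_tsv(fallback_path), "precomputed"
```

Returning the source alongside the rows lets a workflow state in its output whether a result came from the live service or from the precomputed table.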

## How to run

**Google Colab (recommended for reviewers).** The workflows are stored as
[jupytext](https://jupytext.readthedocs.io) *percent-format* `.py` files, which are valid notebook
sources. Convert to `.ipynb` (one command) and open in Colab:
```bash
pip install jupytext
jupytext --to notebook workflow_*.py     # -> workflow_*.ipynb (then add an "Open in Colab" badge)
```
The published repository ships the executed `.ipynb` with outputs committed.
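
For readers unfamiliar with the percent format, here is a small illustrative file. It is not taken from the actual workflows; the cell contents are invented purely to show the markers.

```python
# %% [markdown]
# # Workflow sketch
# Markdown cells are comment blocks introduced by a `# %% [markdown]` marker.

# %%
# A code cell starts at a bare `# %%` marker and runs until the next marker.
values = [3, 1, 2]
ordered = sorted(values)

# %%
# Run as a plain script, the markers are ordinary comments, so
# `python this_file.py` executes the cells from top to bottom.
print(ordered)
```

The same file therefore serves as both a runnable script and a notebook source, which is why the workflows can be kept under version control as `.py`.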

**Project Docker image** (bundles RAPIDS/cuDF, PyTorch, scikit-learn, Jupyter):
```bash
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --rm \
  -p 10000:8888 -p 8501:8501 -v "${PWD}":/workspace/mycode \
  abdelghafour1/ngc_tf_rapids_25_01_vscode_torch:2025-v3 jupyter lab --ip=0.0.0.0 --allow-root
```

**Locally** (Ensembl steps + precomputed fallbacks; no GPU needed):
```bash
pip install -r requirements.txt        # or minimal: pip install pandas numpy pyarrow
python workflow_A_mads14_heading_date.py
```

## Trait prediction — `predict_trait` *is* the GitHub demo
`predict_trait` does **not** train or replace anything — it loads and runs the team's own
pre-trained `.pkl` files (the same `joblib.load(...)` + `model.predict(...)` as the repo's
`heading_date_predictor.py`), following the `Prediction_on_new_samples/hdg_80head2025/` flow
(model + imputer + the top-1,000 feature list).

The wrapper only adds convenience so the demo composes with the other steps: it auto-detects the ID
column (`ID` vs `IID`), **selects the 1,000 SNP columns the model expects** (from
`combined_selected_features_top_1000.csv`) so you can hand it the full 165,640-SNP matrix, imputes
with `most_freq_imputer_*.pkl`, and returns a tidy `{accession, <trait>_pred}` frame.
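
The column handling described above might look something like this sketch, written against plain column names so it runs without pandas. The function names are hypothetical, and two behaviours are assumptions rather than documented facts: `ID` is preferred when both ID columns are present, and a missing SNP column raises an error.

```python
def detect_id_column(columns, candidates=("ID", "IID")):
    """Return the first candidate ID column present in `columns`."""
    for name in candidates:
        if name in columns:
            return name
    raise KeyError(f"none of {candidates} found among the columns")


def select_model_columns(columns, feature_names):
    """Return the model's expected SNP columns, in the model's own order.

    Raises KeyError if any expected column is absent from the matrix.
    """
    available = set(columns)
    missing = [f for f in feature_names if f not in available]
    if missing:
        raise KeyError(
            f"{len(missing)} expected SNP columns are missing, e.g. {missing[:3]}"
        )
    return list(feature_names)
```

With a pandas frame, the returned list can be used directly as `genotypes[cols]`, which both subsets the full matrix and puts the columns in the order the model was trained on.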

```python
import oryza19k as o19
preds = o19.predict_trait(genotypes, trait="hdg_80head",
                          models_dir="path/to/AI-drive Predictive Phenotype Modeling")
```

`models_dir` must point at a clone of
[github.com/YongZhou2019/19K-RGP](https://github.com/YongZhou2019/19K-RGP); loading the model needs
`joblib`, `scikit-learn`, `xgboost`, and `lightgbm`. Only the heading-date model is public so far;
the other four traits will be added at resubmission.

## Configuration (no credentials, ever)
`oryza19k.CONFIG` reads from environment variables: `ORYZA19K_VCF_BASE` (public bgzipped VCF host
for `tabix`), `ORYZA19K_MODELS_DIR`, `ORYZA19K_DATA_DIR`, and `ORYZA19K_ZENODO`. Review-only
credentials are never placed here; the public VCF host will be set at publication.
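
A minimal sketch of environment-driven configuration of this kind is shown below. The four variable names come from the text above; the dictionary keys, the `load_config` function, and the choice to treat an unset or empty variable as `None` are assumptions made for illustration.

```python
import os

# Environment variable names from this README; the short keys are hypothetical.
ENV_KEYS = {
    "vcf_base": "ORYZA19K_VCF_BASE",
    "models_dir": "ORYZA19K_MODELS_DIR",
    "data_dir": "ORYZA19K_DATA_DIR",
    "zenodo": "ORYZA19K_ZENODO",
}


def load_config(environ=None):
    """Read settings from the environment; unset or empty values become None."""
    env = os.environ if environ is None else environ
    return {key: (env.get(var) or None) for key, var in ENV_KEYS.items()}
```

Passing a mapping explicitly, as in `load_config({"ORYZA19K_VCF_BASE": "https://example.org/vcf"})`, makes the behaviour easy to check without touching the real environment.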