Example workflows — the 19K-RGP access cookbook

Runnable, cross-platform example workflows for the 19K Rice Genome Project, plus the oryza19k helper that orchestrates the platforms' existing interfaces. This is the example workflows deliverable for Reviewer #1.6.

File	Workflow
`oryza19k.py`	the access-cookbook helper (Ensembl REST, remote `tabix`, CLIMtools tables, precomputed summaries, pre-trained models)
`workflow_A_mads14_heading_date.py`	A — from the MADS50/MADS14 locus to a heading-date hypothesis (all platforms)
`workflow_B_tb1_tiller_number.py`	B — a rare TB1/FC1 high-effect variant → tiller number (with effect-size CI)
`workflow_C_imputation_benchmark.py`	C — whole-collection trait imputation & benchmarking vs rrBLUP/GBLUP/BayesA–C

What `oryza19k` is

oryza19k (oryza19k.py) is a single-file Python helper module — an access cookbook, not a server or a new API. Each function wraps an interface that already exists, so a workflow reads as a few clean calls instead of boilerplate. import oryza19k as o19 loads it; everything is then o19.<function>(...).

Function	Wraps	Needs network?
`lookup_gene`, `region_features`, `vep_effects`	Ensembl REST (GrameneOryza)	yes
`region_genotypes(source="tabix")`	stream a bgzipped VCF with tabix	yes (a public VCF host)
`region_genotypes(source="local")`	slice a local parquet/CSV matrix	no
`climate_for_gene`, `accession_climate`	read CLIMtools result tables	no (local table)
`predict_trait`	the repo's pre-trained model (joblib)	no (local model files)
`summary_table`	the precomputed summary tables	no (local); yes for Zenodo
`choose_platform`	the "which platform?" decision table	no
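
As a rough illustration of what the Ensembl-backed helpers wrap, the sketch below builds the REST URLs for a gene lookup by stable ID and for a region overlap query. The base URL `https://rest.ensembl.org`, the species name `oryza_sativa`, the placeholder gene ID, and the function names `lookup_url` and `overlap_url` are assumptions made for illustration; the helper may target a different Ensembl-compatible server and expose different signatures.

```python
# Sketch only: shows the URL shapes of two Ensembl REST endpoints.
# ENSEMBL_BASE and SPECIES are assumptions, not values taken from oryza19k.
ENSEMBL_BASE = "https://rest.ensembl.org"
SPECIES = "oryza_sativa"


def lookup_url(stable_id, base=ENSEMBL_BASE):
    """URL for the `lookup/id/:id` endpoint (gene metadata by stable ID)."""
    return f"{base}/lookup/id/{stable_id}"


def overlap_url(chrom, start, end, feature="gene", species=SPECIES, base=ENSEMBL_BASE):
    """URL for the `overlap/region/:species/:region` endpoint.

    The endpoint requires at least one `feature` parameter (gene, transcript, ...).
    """
    region = f"{chrom}:{start}-{end}"
    return f"{base}/overlap/region/{species}/{region}?feature={feature}"


# Placeholder stable ID, used only to show the URL shape.
print(lookup_url("Os01g0100100"))
print(overlap_url("3", 1000, 2000))
```

Both endpoints return JSON when the request carries an `Accept: application/json` header.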

It is import-safe and offline-friendly: it imports with only pandas + numpy, and every networked call degrades gracefully (no hard dependency on requests — it falls back to the standard library).
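
The optional-dependency pattern described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the function name `get_json` and its `http` parameter are hypothetical, and the real helper may structure its fallback differently.

```python
import json
import urllib.request

try:
    import requests  # optional: used when installed
except ImportError:
    requests = None  # the stdlib path below is used instead


def get_json(url, timeout=30, http=requests):
    """Fetch a URL and decode its JSON body.

    `http` is the `requests` module when available, or None to force the
    standard-library path (hypothetical parameter, handy for testing).
    """
    headers = {"Accept": "application/json"}
    if http is not None:
        resp = http.get(url, headers=headers, timeout=timeout)
        resp.raise_for_status()
        return resp.json()
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req, timeout=timeout) as fh:
        return json.load(fh)
```

Because the import is attempted once at module load, importing the helper never fails on a machine without `requests`; only the transport used for the call changes.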

What it needs — libraries, data, environment

Libraries (pip install -r requirements.txt) — only what each feature uses:

Feature you use	Packages
Import the module; local tables/genotypes	`pandas`, `numpy` (+ `pyarrow` for parquet)
Ensembl REST / Zenodo fetches	nothing extra (`requests` optional → falls back to stdlib `urllib`)
`predict_trait` (pre-trained models)	`joblib`, `scikit-learn`, `xgboost`, `lightgbm` (the model is an XGBoost+LightGBM ensemble)
`region_genotypes(source="tabix")`	`pysam` or the `tabix`/`bcftools` CLIs (htslib)
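
For orientation, here is a hedged sketch of what streaming a region through the `tabix` CLI might look like. The function names (`tabix_command`, `parse_vcf_text`, `stream_region`) are hypothetical and the real `region_genotypes` may use `pysam` or `bcftools` instead; the sketch only assumes that `tabix` is on the `PATH` and that the VCF URL points at a bgzipped file with a `.tbi` index next to it.

```python
import shutil
import subprocess


def tabix_command(vcf_url, chrom, start, end):
    """Build the CLI call; -h makes tabix print the VCF header as well."""
    return ["tabix", "-h", vcf_url, f"{chrom}:{start}-{end}"]


def parse_vcf_text(text):
    """Parse VCF text into (sample_names, records), keeping genotype calls only.

    Assumes GT is the first FORMAT key, as the VCF spec requires when GT is present.
    """
    samples, records = [], []
    for line in text.splitlines():
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the 9 fixed columns
            continue
        records.append({
            "chrom": fields[0],
            "pos": int(fields[1]),
            "ref": fields[3],
            "alt": fields[4],
            "gt": [call.split(":")[0] for call in fields[9:]],
        })
    return samples, records


def stream_region(vcf_url, chrom, start, end):
    """Run tabix against a remote VCF and parse the slice it returns."""
    if shutil.which("tabix") is None:
        raise RuntimeError("tabix CLI not found; install htslib or use pysam")
    result = subprocess.run(
        tabix_command(vcf_url, chrom, start, end),
        check=True, capture_output=True, text=True,
    )
    return parse_vcf_text(result.stdout)
```

Only the requested region is transferred, which is why a remote index is needed alongside the VCF.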

Data (only for the functions you call):

To use…	You need…	Where it comes from
`summary_table(...)`	the precomputed `.tsv` files	run `../precomputed_tables/generate_tables.py`, or the Zenodo DOI
`predict_trait(...)`	the `.pkl` model + imputer + features CSV	clone `github.com/YongZhou2019/19K-RGP`; point `models_dir` at `AI-drive Predictive Phenotype Modeling`
`region_genotypes(source="tabix")`	a public bgzipped + tabix-indexed VCF URL	set `ORYZA19K_VCF_BASE`, or pass `vcf_url=`
`region_genotypes(source="local")`	the genotype matrix `.parquet`	your local data, or `local_parquet=`
`climate_for_gene(...)`	a CLIMtools GenoCLIM export	download from the app, or ship under `precomputed_tables/`

Environments, simplest → fullest:

Minimal local: pip install pandas numpy pyarrow. Gets you Ensembl and the precomputed tables; tabix streaming additionally needs pysam or the tabix/bcftools CLIs. (predict_trait needs the ML stack + model files.)
Full local: also pip install -r requirements.txt and clone the repo for the models.
Google Colab: pip install -r requirements.txt, clone the repo, run.
Docker (recommended for prediction): the project image already bundles scikit-learn / xgboost / lightgbm / RAPIDS / Jupyter; just clone the repo for the model files.

Always runs for a reader

Every networked step is wrapped so that, if a live service is unavailable, it falls back to a shipped precomputed table (see ../precomputed_tables/). The Ensembl REST steps are verified live; the SNP-Seek REST paths are pending confirmation with IRRI and, in the meantime, default to the tabix or precomputed path.
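
The wrap-and-fall-back behaviour can be sketched in a few lines. The names `live_or_precomputed` and `read_tsv` are hypothetical, and the exception types caught here are an assumption about which failures count as "service unavailable"; the module's own wrapper may differ.

```python
import csv
import logging

log = logging.getLogger("oryza19k_sketch")


def read_tsv(path):
    """Read a tab-separated table into a list of row dicts."""
    with open(path, newline="") as fh:
        return list(csv.DictReader(fh, delimiter="\t"))


def live_or_precomputed(live_call, fallback_tsv):
    """Try the live service first; on failure, return the shipped table.

    Returns (rows, source) so the caller can report where the data came from.
    """
    try:
        return live_call(), "live"
    except (OSError, ValueError) as exc:
        # OSError covers connection errors and timeouts (urllib and requests
        # both derive their network errors from it); ValueError covers bad JSON.
        log.warning("live service unavailable (%s); using %s", exc, fallback_tsv)
        return read_tsv(fallback_tsv), "precomputed"
```

Returning the source label alongside the rows lets a workflow state in its output whether a result came from the live service or from the precomputed table.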

How to run

Google Colab (recommended for reviewers). The workflows are stored as jupytext percent-format .py files, which are valid notebook sources. Convert to .ipynb (one command) and open in Colab:

pip install jupytext
jupytext --to notebook workflow_*.py     # -> workflow_*.ipynb (then add an "Open in Colab" badge)

The published repository ships the executed .ipynb with outputs committed.
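
For readers unfamiliar with the percent format, the snippet below shows its general shape: `# %%` starts a code cell and `# %% [markdown]` starts a markdown cell whose lines are comments. The cell contents are made up for illustration and are not taken from the actual workflow files.

```python
# %% [markdown]
# # Workflow sketch
# Markdown cells are comment lines under a `# %% [markdown]` marker.

# %%
# A code cell starts at a bare `# %%` marker.
loci = ["MADS14", "TB1"]

# %%
# Each marker opens a new cell; run as a plain script, the file
# simply executes top to bottom.
summary = {name: len(name) for name in loci}
print(summary)
```

Because the markers are ordinary comments, the same file works both as `python workflow.py` and as a notebook source.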

Project Docker image (bundles RAPIDS/cuDF, PyTorch, scikit-learn, Jupyter):

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --rm \
  -p 10000:8888 -p 8501:8501 -v "${PWD}":/workspace/mycode \
  abdelghafour1/ngc_tf_rapids_25_01_vscode_torch:2025-v3 jupyter lab --ip=0.0.0.0 --allow-root

Locally (Ensembl steps + precomputed fallbacks; no GPU needed):

pip install -r requirements.txt        # or minimal: pip install pandas numpy pyarrow
python workflow_A_mads14_heading_date.py

Trait prediction — `predict_trait` is the GitHub demo

See it live → Code & Models (the heading-date demo, three ways to run it).

predict_trait does not train or replace anything — it loads and runs the team's own pre-trained .pkl files (the same joblib.load(...) + model.predict(...) as the repo's heading_date_predictor.py), following the Prediction_on_new_samples/hdg_80head2025/ flow (model + imputer + the top-1,000 feature list).

The wrapper only adds convenience so the demo composes with the other steps: it auto-detects the ID column (ID vs IID), selects the 1,000 SNP columns the model expects (from combined_selected_features_top_1000.csv) so you can hand it the full 165,640-SNP matrix, imputes with most_freq_imputer_*.pkl, and returns a tidy {accession, <trait>_pred} frame.

import oryza19k as o19
preds = o19.predict_trait(genotypes, trait="hdg_80head",
                          models_dir="path/to/AI-drive Predictive Phenotype Modeling")

models_dir must point at a clone of github.com/YongZhou2019/19K-RGP; loading the model needs joblib, scikit-learn, xgboost, and lightgbm. Only the heading-date model is public so far; the other four traits will be added at resubmission.
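
The convenience steps the wrapper adds (ID-column detection, feature selection, imputation, prediction, tidy output) can be sketched as below. This is an illustration under assumptions, not the module's code: `detect_id_column` and `predict_sketch` are hypothetical names, and `imputer` and `model` stand for any objects with `transform` and `predict` methods, such as the ones the real flow loads with `joblib.load(...)`.

```python
import pandas as pd


def detect_id_column(df, candidates=("ID", "IID")):
    """Return the first accession-ID column present in the genotype frame."""
    for name in candidates:
        if name in df.columns:
            return name
    raise KeyError(f"none of {candidates} found in genotype columns")


def predict_sketch(genotypes, feature_names, imputer, model, trait):
    """Select the model's SNP columns, impute, predict, and tidy the result."""
    id_col = detect_id_column(genotypes)
    missing = [f for f in feature_names if f not in genotypes.columns]
    if missing:
        raise KeyError(
            f"{len(missing)} expected SNP columns are missing, e.g. {missing[:3]}"
        )
    # Index by the feature list so columns arrive in the order the model expects,
    # regardless of how many other SNP columns the input matrix carries.
    X = genotypes[list(feature_names)]
    X_imputed = imputer.transform(X)
    preds = model.predict(X_imputed)
    return pd.DataFrame({
        "accession": genotypes[id_col].to_numpy(),
        f"{trait}_pred": preds,
    })
```

In the real flow the feature list would be read from combined_selected_features_top_1000.csv and the imputer from most_freq_imputer_*.pkl; selecting columns by name is what allows the full SNP matrix to be passed in unchanged.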

Configuration (no credentials, ever)

oryza19k.CONFIG reads from environment variables: ORYZA19K_VCF_BASE (public bgzipped VCF host for tabix), ORYZA19K_MODELS_DIR, ORYZA19K_DATA_DIR, and ORYZA19K_ZENODO. Review-only credentials are never placed here; the public VCF host will be set at publication.
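
A minimal sketch of environment-driven configuration along these lines is shown below. The four variable names come from the text above; the dictionary keys, `load_config`, and `require` are hypothetical, and the real `CONFIG` object may be shaped differently.

```python
import os

# Maps a short config key (hypothetical) to the documented environment variable.
ENV_KEYS = {
    "vcf_base": "ORYZA19K_VCF_BASE",
    "models_dir": "ORYZA19K_MODELS_DIR",
    "data_dir": "ORYZA19K_DATA_DIR",
    "zenodo": "ORYZA19K_ZENODO",
}


def load_config(environ=None):
    """Read settings from the environment; unset or empty values become None."""
    environ = os.environ if environ is None else environ
    return {key: environ.get(var) or None for key, var in ENV_KEYS.items()}


def require(config, key):
    """Return a setting, or fail with a message naming the variable to export."""
    value = config.get(key)
    if value is None:
        raise RuntimeError(
            f"{ENV_KEYS[key]} is not set; export it or pass the value explicitly"
        )
    return value
```

Keeping every value optional at load time means the module still imports when nothing is configured, and an error appears only when a function actually needs the missing setting.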