Example workflows — the 19K-RGP access cookbook
Runnable, cross-platform example workflows for the 19K Rice Genome Project, plus the oryza19k helper that orchestrates the platforms' existing interfaces. This is the example workflows deliverable for Reviewer #1.6.
| File | Workflow |
|---|---|
| oryza19k.py | the access-cookbook helper (Ensembl REST, remote tabix, CLIMtools tables, precomputed summaries, pre-trained models) |
| workflow_A_mads14_heading_date.py | A: from the MADS50/MADS14 locus to a heading-date hypothesis (all platforms) |
| workflow_B_tb1_tiller_number.py | B: a rare TB1/FC1 high-effect variant → tiller number (with effect-size CI) |
| workflow_C_imputation_benchmark.py | C: whole-collection trait imputation & benchmarking vs rrBLUP/GBLUP/BayesA–C |
What oryza19k is
oryza19k (oryza19k.py) is a single-file Python helper module — an access cookbook, not a server or a new API. Each function wraps an interface that already exists, so a workflow reads as a few clean calls instead of boilerplate. import oryza19k as o19 loads it; everything is then o19.<function>(...).
| Function | Wraps | Needs network? |
|---|---|---|
| lookup_gene, region_features, vep_effects | Ensembl REST (GrameneOryza) | yes |
| region_genotypes(source="tabix") | stream a bgzipped VCF with tabix | yes (a public VCF host) |
| region_genotypes(source="local") | slice a local parquet/CSV matrix | no |
| climate_for_gene, accession_climate | read CLIMtools result tables | no (local table) |
| predict_trait | the repo's pre-trained model (joblib) | no (local model files) |
| summary_table | the precomputed summary tables | no (local); yes for Zenodo |
| choose_platform | the "which platform?" decision table | no |
It is import-safe and offline-friendly: it imports with only pandas + numpy, and every networked call degrades gracefully (no hard dependency on requests — it falls back to the standard library).
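
The optional-dependency fallback described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the function names `get_json` and `get_json_or_none` are hypothetical, and the real helper may handle headers, retries, and errors differently.

```python
import json
import urllib.request

try:  # optional dependency: used only when it happens to be installed
    import requests
except ImportError:
    requests = None


def get_json(url, timeout=10.0):
    """Fetch a URL and decode its JSON body, with or without `requests`."""
    headers = {"Accept": "application/json"}
    if requests is not None:
        resp = requests.get(url, headers=headers, timeout=timeout)
        resp.raise_for_status()
        return resp.json()
    # Standard-library path: no third-party package needed.
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def get_json_or_none(url, timeout=10.0):
    """Degrade gracefully: return None instead of raising when the call fails."""
    try:
        return get_json(url, timeout=timeout)
    except (OSError, ValueError):
        # Network and HTTP errors subclass OSError in both backends;
        # a malformed JSON body raises ValueError.
        return None
```

A caller can then treat `None` as the signal to switch to a shipped table rather than aborting the workflow.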
What it needs — libraries, data, environment
Libraries (pip install -r requirements.txt) — only what each feature uses:
| Feature you use | Packages |
|---|---|
| Import the module; local tables/genotypes | pandas, numpy (+ pyarrow for parquet) |
| Ensembl REST / Zenodo fetches | nothing extra (requests optional → falls back to stdlib urllib) |
| predict_trait (pre-trained models) | joblib, scikit-learn, xgboost, lightgbm (the model is an XGBoost+LightGBM ensemble) |
| region_genotypes(source="tabix") | pysam or the tabix/bcftools CLIs (htslib) |
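
Because each feature pulls in its own packages, it can help to check what is importable before running a workflow. The sketch below is an assumption-laden illustration: `FEATURE_PACKAGES`, `missing_packages`, and `feature_report` are hypothetical names that are not part of oryza19k, and the mapping simply restates the table above using import names (`sklearn` is the import name of scikit-learn).

```python
from importlib.util import find_spec

# Import names per feature, restated from the table above (hypothetical mapping).
# For tabix, the table also allows the tabix/bcftools CLIs instead of pysam;
# that alternative is not checked here.
FEATURE_PACKAGES = {
    "local tables / genotypes": ["pandas", "numpy", "pyarrow"],
    "predict_trait": ["joblib", "sklearn", "xgboost", "lightgbm"],
    "region_genotypes (tabix)": ["pysam"],
}


def missing_packages(import_names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in import_names if find_spec(name) is None]


def feature_report(features=FEATURE_PACKAGES):
    """Map each feature to the list of packages it is still missing."""
    return {feature: missing_packages(names) for feature, names in features.items()}


if __name__ == "__main__":
    for feature, missing in feature_report().items():
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{feature}: {status}")
```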
Data (only for the functions you call):
| To use… | You need… | Where it comes from |
|---|---|---|
| summary_table(...) | the precomputed .tsv files | run ../precomputed_tables/generate_tables.py, or the Zenodo DOI |
| predict_trait(...) | the .pkl model + imputer + features CSV | clone github.com/YongZhou2019/19K-RGP; point models_dir at AI-drive Predictive Phenotype Modeling |
| region_genotypes(source="tabix") | a public bgzipped + tabix-indexed VCF URL | set ORYZA19K_VCF_BASE, or pass vcf_url= |
| region_genotypes(source="local") | the genotype matrix .parquet | your local data, or local_parquet= |
| climate_for_gene(...) | a CLIMtools GenoCLIM export | download from the app, or ship under precomputed_tables/ |
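
For the tabix source, a region query against a remote bgzipped VCF can be sketched with the `tabix` CLI. This is one plausible way to do it, not necessarily how `region_genotypes` is implemented. The function names are hypothetical, the URL and chromosome name in the usage note are placeholders, and remote access assumes an htslib build with network support.

```python
import subprocess


def tabix_region_command(vcf_url, chrom, start, end):
    """Build the tabix call for one region (1-based, inclusive coordinates)."""
    region = f"{chrom}:{start}-{end}"
    # -h also prints the VCF header, which carries the sample (accession) names.
    return ["tabix", "-h", vcf_url, region]


def stream_region(vcf_url, chrom, start, end):
    """Run tabix and return the VCF lines for the region (header included)."""
    cmd = tabix_region_command(vcf_url, chrom, start, end)
    proc = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return proc.stdout.splitlines()
```

A call such as `stream_region("https://example.org/panel.vcf.gz", "chr3", 1000000, 1010000)` would transfer only the index and the blocks overlapping that window, not the whole file.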
Environments, simplest → fullest:
- Minimal local: pip install pandas numpy pyarrow. Gets you Ensembl, tabix, and the precomputed tables (tabix streaming also needs pysam or the htslib CLIs, as listed above). predict_trait needs the ML stack + model files.
- Full local: also pip install -r requirements.txt and clone the repo for the models.
- Google Colab: pip install -r requirements.txt, clone the repo, run.
- Docker (recommended for prediction): the project image already bundles scikit-learn / xgboost / lightgbm / RAPIDS / Jupyter; just clone the repo for the model files.
Always runs for a reader
Every networked step is wrapped so that, if a live service is unavailable, it falls back to a shipped precomputed table (see ../precomputed_tables/). The Ensembl REST steps are verified live; the SNP-Seek REST paths are pending confirmation with IRRI and default to the tabix/precomputed path in the meantime.
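
The live-then-precomputed pattern can be sketched as below. The names `read_tsv` and `live_or_precomputed` are hypothetical. The sketch reads the table with the standard-library `csv` module to stay dependency-free, whereas the real helper presumably returns pandas objects.

```python
import csv


def read_tsv(path):
    """Load a tab-separated table as a list of row dictionaries."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def live_or_precomputed(live_call, fallback_path):
    """Try the live service first; on any failure, read the shipped table.

    Returns (rows, source), where source is "live" or "precomputed", so the
    caller can report which path produced the result.
    """
    try:
        return live_call(), "live"
    except Exception as exc:  # broad on purpose: any failure triggers the fallback
        print(f"live call failed ({exc!r}); using {fallback_path}")
        return read_tsv(fallback_path), "precomputed"
```

Returning the source alongside the rows keeps the fallback visible, so a reader can tell whether a result came from the live service or from the shipped table.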
How to run
Google Colab (recommended for reviewers). The workflows are stored as jupytext percent-format .py files, which are valid notebook sources. Convert to .ipynb (one command) and open in Colab:

    pip install jupytext
    jupytext --to notebook workflow_*.py   # -> workflow_*.ipynb (then add an "Open in Colab" badge)

The published repository ships the executed .ipynb with outputs committed.
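
For readers unfamiliar with the percent format, the snippet below shows what such a file looks like. It is an illustrative fragment, not an excerpt from the shipped workflows, and the region values are made up.

```python
# %% [markdown]
# # Workflow sketch
# Markdown cells are comment blocks that follow a `# %% [markdown]` marker.

# %%
# A code cell starts at a bare `# %%` marker and runs as ordinary Python.
region = {"chrom": "chr3", "start": 1_000_000, "end": 1_010_000}  # hypothetical values

# %%
span = region["end"] - region["start"]
print(f"{region['chrom']}: {span} bp")
```

Because the markers are plain comments, the same file runs unchanged with `python` and converts to a multi-cell notebook with jupytext.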
Project Docker image (bundles RAPIDS/cuDF, PyTorch, scikit-learn, Jupyter):

    docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --rm \
      -p 10000:8888 -p 8501:8501 -v ${PWD}:/workspace/mycode \
      abdelghafour1/ngc_tf_rapids_25_01_vscode_torch:2025-v3 jupyter lab --ip=0.0.0.0 --allow-root

Locally (Ensembl steps + precomputed fallbacks; no GPU needed):

    pip install -r requirements.txt   # or minimal: pip install pandas numpy pyarrow
    python workflow_A_mads14_heading_date.py

Trait prediction: predict_trait is the GitHub demo
See it live → Code & Models (the heading-date demo, three ways to run it).
predict_trait does not train or replace anything — it loads and runs the team's own pre-trained .pkl files (the same joblib.load(...) + model.predict(...) as the repo's heading_date_predictor.py), following the Prediction_on_new_samples/hdg_80head2025/ flow (model + imputer + the top-1,000 feature list).
The wrapper only adds convenience so the demo composes with the other steps: it auto-detects the ID column (ID vs IID), selects the 1,000 SNP columns the model expects (from combined_selected_features_top_1000.csv) so you can hand it the full 165,640-SNP matrix, imputes with most_freq_imputer_*.pkl, and returns a tidy {accession, <trait>_pred} frame.

    import oryza19k as o19
    preds = o19.predict_trait(genotypes, trait="hdg_80head",
                              models_dir="path/to/AI-drive Predictive Phenotype Modeling")

models_dir must point at the AI-drive Predictive Phenotype Modeling directory inside a clone of github.com/YongZhou2019/19K-RGP; loading the model needs joblib, scikit-learn, xgboost, and lightgbm. Only the heading-date model is public so far; the other four traits will be added at resubmission.
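
The convenience steps the wrapper adds (ID-column detection, feature selection, imputation, tidy output) can be sketched in plain Python. This is a dependency-free stand-in under stated assumptions, not the wrapper's code: all function names are hypothetical, rows are plain dictionaries rather than a DataFrame, and the mode is computed from the batch itself, whereas the real flow applies the fitted most_freq_imputer_*.pkl.

```python
from collections import Counter


def detect_id_column(columns, candidates=("ID", "IID")):
    """Return the first candidate ID column that is present."""
    for name in candidates:
        if name in columns:
            return name
    raise KeyError(f"none of {candidates} found among the columns")


def select_features(rows, feature_names):
    """Keep only the model's features, in the model's order (absent -> None)."""
    return [[row.get(name) for name in feature_names] for row in rows]


def impute_most_frequent(matrix):
    """Fill None with the column mode (stand-in for the fitted imputer)."""
    filled = [list(row) for row in matrix]
    n_cols = len(matrix[0]) if matrix else 0
    for j in range(n_cols):
        observed = [row[j] for row in matrix if row[j] is not None]
        if not observed:
            continue  # nothing to learn a mode from; leave the gaps in place
        mode = Counter(observed).most_common(1)[0][0]
        for row in filled:
            if row[j] is None:
                row[j] = mode
    return filled


def predict_frame(rows, feature_names, model, trait):
    """Return tidy {accession, <trait>_pred} records for any model with .predict."""
    id_col = detect_id_column(rows[0].keys())
    features = impute_most_frequent(select_features(rows, feature_names))
    preds = model.predict(features)
    return [{"accession": row[id_col], f"{trait}_pred": float(p)}
            for row, p in zip(rows, preds)]
```

Selecting columns by the model's feature list is what lets a caller pass the full SNP matrix: extra columns are ignored and the column order always matches what the model was trained on.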
Configuration (no credentials, ever)
oryza19k.CONFIG reads from environment variables — ORYZA19K_VCF_BASE (public bgzipped VCF host for tabix), ORYZA19K_MODELS_DIR, ORYZA19K_DATA_DIR, ORYZA19K_ZENODO. Review-only credentials are never placed here; the public VCF host is set at publication.
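
An environment-driven configuration of this kind can be sketched as follows. The `Config` dataclass and `load_config` function are hypothetical, and the `None` defaults are an assumption, since the module's actual defaults are not documented here. Only the four variable names come from the text above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Config:
    vcf_base: Optional[str]    # public bgzipped VCF host for tabix
    models_dir: Optional[str]  # directory holding the pre-trained model files
    data_dir: Optional[str]    # local tables and genotype matrices
    zenodo: Optional[str]      # Zenodo record for the precomputed tables


def load_config(environ: Optional[Mapping[str, str]] = None) -> Config:
    """Read settings from the environment; unset variables become None."""
    env = os.environ if environ is None else environ
    return Config(
        vcf_base=env.get("ORYZA19K_VCF_BASE"),
        models_dir=env.get("ORYZA19K_MODELS_DIR"),
        data_dir=env.get("ORYZA19K_DATA_DIR"),
        zenodo=env.get("ORYZA19K_ZENODO"),
    )
```

Passing a mapping instead of relying on `os.environ` makes the loader easy to exercise in tests, and keeping every field free of secrets matches the no-credentials rule stated above.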