Example workflows — the 19K-RGP access cookbook

Runnable, cross-platform example workflows for the 19K Rice Genome Project, plus the oryza19k helper that orchestrates the platforms' existing interfaces. This is the example workflows deliverable for Reviewer #1.6.

| File | Workflow |
| --- | --- |
| oryza19k.py | the access-cookbook helper (Ensembl REST, remote tabix, CLIMtools tables, precomputed summaries, pre-trained models) |
| workflow_A_mads14_heading_date.py | A: from the MADS50/MADS14 locus to a heading-date hypothesis (all platforms) |
| workflow_B_tb1_tiller_number.py | B: a rare TB1/FC1 high-effect variant → tiller number (with effect-size CI) |
| workflow_C_imputation_benchmark.py | C: whole-collection trait imputation & benchmarking vs rrBLUP/GBLUP/BayesA–C |

What oryza19k is

oryza19k (oryza19k.py) is a single-file Python helper module — an access cookbook, not a server or a new API. Each function wraps an interface that already exists, so a workflow reads as a few clean calls instead of boilerplate. import oryza19k as o19 loads it; everything is then o19.<function>(...).

| Function | Wraps | Needs network? |
| --- | --- | --- |
| lookup_gene, region_features, vep_effects | Ensembl REST (GrameneOryza) | yes |
| region_genotypes(source="tabix") | stream a bgzipped VCF with tabix | yes (a public VCF host) |
| region_genotypes(source="local") | slice a local parquet/CSV matrix | no |
| climate_for_gene, accession_climate | read CLIMtools result tables | no (local table) |
| predict_trait | the repo's pre-trained model (joblib) | no (local model files) |
| summary_table | the precomputed summary tables | no (local); yes for Zenodo |
| choose_platform | the "which platform?" decision table | no |
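
To make the first row of the table concrete, here is a minimal sketch of the kind of request URLs that lookup_gene, region_features, and vep_effects could wrap. The path shapes (lookup/id, overlap/region, vep/.../region) are standard Ensembl REST endpoints; the base URL, the function names, and the species string used in the usage note are assumptions for illustration and may not match what oryza19k does internally.

```python
from urllib.parse import quote, urlencode

# Assumption: the public Ensembl REST host. The helper may point at a
# Gramene-hosted instance instead, so the base is always overridable.
ENSEMBL_BASE = "https://rest.ensembl.org"
JSON = {"content-type": "application/json"}


def lookup_url(gene_id, base=ENSEMBL_BASE):
    """URL for GET lookup/id/:id (gene coordinates and description)."""
    return f"{base}/lookup/id/{quote(gene_id)}?{urlencode(JSON)}"


def overlap_url(species, chrom, start, end, feature="gene", base=ENSEMBL_BASE):
    """URL for GET overlap/region/:species/:region (features in a window)."""
    region = f"{chrom}:{start}-{end}"
    query = urlencode({"feature": feature, **JSON})
    return f"{base}/overlap/region/{quote(species)}/{region}?{query}"


def vep_region_url(species, chrom, start, end, allele, strand=1, base=ENSEMBL_BASE):
    """URL for GET vep/:species/region/:region/:allele (variant consequences)."""
    region = f"{chrom}:{start}-{end}:{strand}"
    return f"{base}/vep/{quote(species)}/region/{region}/{quote(allele)}?{urlencode(JSON)}"
```

For example, overlap_url("oryza_sativa", 3, 100, 200) yields a URL ending in /overlap/region/oryza_sativa/3:100-200?feature=gene&content-type=application%2Fjson. Keeping URL construction separate from fetching makes the network-free part easy to inspect and test.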

It is import-safe and offline-friendly: it imports with only pandas + numpy, and every networked call degrades gracefully (no hard dependency on requests — it falls back to the standard library).
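
The optional-dependency pattern described above can be sketched as follows. This is not the module's actual code: get_json and its prefer_requests flag are hypothetical names, and the only point is that the import of requests is allowed to fail and the standard library takes over.

```python
import json
import urllib.request

try:  # optional dependency: the module must still import without it
    import requests
except ImportError:
    requests = None


def get_json(url, timeout=30, prefer_requests=True):
    """Fetch and decode JSON, using requests if present, urllib otherwise."""
    headers = {"Accept": "application/json"}
    if prefer_requests and requests is not None:
        resp = requests.get(url, headers=headers, timeout=timeout)
        resp.raise_for_status()
        return resp.json()
    # Standard-library path: no third-party packages needed.
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req, timeout=timeout) as fh:
        return json.loads(fh.read().decode("utf-8"))
```

Passing prefer_requests=False forces the standard-library path, which is a convenient way to confirm that the fallback behaves the same on a machine where requests happens to be installed.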

What it needs — libraries, data, environment

Libraries (pip install -r requirements.txt) — only what each feature uses:

| Feature you use | Packages |
| --- | --- |
| Import the module; local tables/genotypes | pandas, numpy (+ pyarrow for parquet) |
| Ensembl REST / Zenodo fetches | nothing extra (requests optional → falls back to stdlib urllib) |
| predict_trait (pre-trained models) | joblib, scikit-learn, xgboost, lightgbm (the model is an XGBoost+LightGBM ensemble) |
| region_genotypes(source="tabix") | pysam or the tabix/bcftools CLIs (htslib) |

Data (only for the functions you call):

| To use… | You need… | Where it comes from |
| --- | --- | --- |
| summary_table(...) | the precomputed .tsv files | run ../precomputed_tables/generate_tables.py, or the Zenodo DOI |
| predict_trait(...) | the .pkl model + imputer + features CSV | clone github.com/YongZhou2019/19K-RGP; point models_dir at AI-drive Predictive Phenotype Modeling |
| region_genotypes(source="tabix") | a public bgzipped + tabix-indexed VCF URL | set ORYZA19K_VCF_BASE, or pass vcf_url= |
| region_genotypes(source="local") | the genotype matrix .parquet | your local data, or local_parquet= |
| climate_for_gene(...) | a CLIMtools GenoCLIM export | download from the app, or ship under precomputed_tables/ |
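
As an illustration of the tabix row, the sketch below resolves a VCF URL (an explicit vcf_url wins over ORYZA19K_VCF_BASE) and builds the tabix command for one region. The helper names and the idea of joining the base with a file name are assumptions; the source does not say how the helper maps the base URL to individual files.

```python
import os
import subprocess


def resolve_vcf_url(vcf_url=None, filename=None):
    """Pick the VCF URL: explicit argument first, then the environment."""
    if vcf_url:
        return vcf_url
    base = os.environ.get("ORYZA19K_VCF_BASE")
    if not base or not filename:
        raise ValueError("pass vcf_url=, or set ORYZA19K_VCF_BASE and give a filename")
    # Assumption: files sit directly under the base URL.
    return base.rstrip("/") + "/" + filename


def tabix_command(url, chrom, start, end):
    """Argument list for the tabix CLI; -h also prints the VCF header."""
    return ["tabix", "-h", url, f"{chrom}:{start}-{end}"]


def run_tabix(url, chrom, start, end):
    """Run tabix and return the region as VCF text (needs htslib on PATH)."""
    cmd = tabix_command(url, chrom, start, end)
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

Only the requested region is transferred, because tabix reads the remote index and then fetches the matching compressed blocks. This requires an htslib build with remote-file support.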

Environments, simplest → fullest:

  • Minimal local: pip install pandas numpy pyarrow. Gets you Ensembl, tabix (which also needs pysam or the htslib CLIs), and the precomputed tables. (predict_trait needs the ML stack + model files.)
  • Full local: also pip install -r requirements.txt and clone the repo for the models.
  • Google Colab: pip install -r requirements.txt, clone the repo, run.
  • Docker (recommended for prediction): the project image already bundles scikit-learn / xgboost / lightgbm / RAPIDS / Jupyter; just clone the repo for the model files.

Always runs for a reader

Every networked step is wrapped so that if a live service is unavailable it falls back to a shipped precomputed table (see ../precomputed_tables/). The Ensembl REST steps are verified live; the SNP-Seek REST paths are pending confirmation with IRRI and, in the meantime, default to the tabix/precomputed path.
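
The fallback behaviour can be pictured with a small wrapper like the one below. It is a sketch under assumptions: with_fallback is a hypothetical name, and it reads the table with the standard csv module to stay dependency-free, whereas the real helper presumably returns pandas objects.

```python
import csv


def with_fallback(live_call, fallback_tsv):
    """Try the live service; on any failure, read the shipped TSV instead.

    Returns (rows, source) so the caller can report which path was used.
    """
    try:
        return live_call(), "live"
    except Exception as exc:  # broad on purpose: any outage should degrade, not crash
        with open(fallback_tsv, newline="") as fh:
            rows = list(csv.DictReader(fh, delimiter="\t"))
        return rows, f"precomputed ({type(exc).__name__})"
```

Returning the source label alongside the data keeps a workflow honest about whether a displayed result came from the live platform or from the precomputed table.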

How to run

Google Colab (recommended for reviewers). The workflows are stored as jupytext percent-format .py files, which are valid notebook sources. Convert to .ipynb (one command) and open in Colab:

pip install jupytext
jupytext --to notebook workflow_*.py     # -> workflow_*.ipynb (then add an "Open in Colab" badge)

The published repository ships the executed .ipynb with outputs committed.
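
For readers who have not met the percent format, a file of this kind looks roughly like the following. The cell contents are made up for illustration and are not taken from the shipped workflows; only the "# %%" cell markers are the jupytext convention.

```python
# %% [markdown]
# # Workflow A (illustrative title cell)
# Markdown cells are ordinary comments, so the file stays valid Python.

# %%
# A code cell: everything up to the next "# %%" marker becomes one notebook cell.
locus = "MADS50/MADS14"

# %%
summary = f"Starting from the {locus} locus"
print(summary)
```

Because the markers are plain comments, the same file runs unchanged with python and converts to a notebook with jupytext.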

Project Docker image (bundles RAPIDS/cuDF, PyTorch, scikit-learn, Jupyter):

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --rm \
  -p 10000:8888 -p 8501:8501 -v "${PWD}":/workspace/mycode \
  abdelghafour1/ngc_tf_rapids_25_01_vscode_torch:2025-v3 jupyter lab --ip=0.0.0.0 --allow-root

Locally (Ensembl steps + precomputed fallbacks; no GPU needed):

pip install -r requirements.txt        # or minimal: pip install pandas numpy pyarrow
python workflow_A_mads14_heading_date.py

Trait prediction — predict_trait is the GitHub demo

predict_trait does not train or replace anything — it loads and runs the team's own pre-trained .pkl files (the same joblib.load(...) + model.predict(...) as the repo's heading_date_predictor.py), following the Prediction_on_new_samples/hdg_80head2025/ flow (model + imputer + the top-1,000 feature list).

The wrapper only adds convenience so the demo composes with the other steps: it auto-detects the ID column (ID vs IID), selects the 1,000 SNP columns the model expects (from combined_selected_features_top_1000.csv) so you can hand it the full 165,640-SNP matrix, imputes with most_freq_imputer_*.pkl, and returns a tidy {accession, <trait>_pred} frame.
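
The convenience steps listed above can be sketched in a few lines. This is an illustration, not the wrapper's actual code: predict_from_matrix is a hypothetical name, and the imputer and model are assumed to be objects exposing transform and predict, as scikit-learn style estimators loaded with joblib.load would.

```python
import pandas as pd


def predict_from_matrix(genotypes, feature_names, imputer, model, trait):
    """ID detection -> feature selection -> imputation -> prediction."""
    # 1. Auto-detect the accession ID column (ID vs IID).
    id_col = next((c for c in ("ID", "IID") if c in genotypes.columns), None)
    if id_col is None:
        raise KeyError("expected an 'ID' or 'IID' column")

    # 2. Keep only the SNP columns the model expects, in the listed order.
    missing = [f for f in feature_names if f not in genotypes.columns]
    if missing:
        raise KeyError(f"{len(missing)} expected SNP columns missing, e.g. {missing[:3]}")
    X = genotypes[list(feature_names)]

    # 3. Impute missing calls, then 4. predict with the pre-trained model.
    preds = model.predict(imputer.transform(X))

    # 5. Return a tidy two-column frame.
    return pd.DataFrame({"accession": genotypes[id_col].to_numpy(),
                         f"{trait}_pred": preds})
```

Selecting columns by name rather than by position is what lets a caller pass the full matrix: extra SNPs are ignored, and a missing one fails loudly instead of silently shifting the features.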

import oryza19k as o19
# genotypes: accession x SNP table with an ID or IID column (the full 165,640-SNP matrix is fine)
preds = o19.predict_trait(genotypes, trait="hdg_80head",
                          models_dir="path/to/AI-drive Predictive Phenotype Modeling")

models_dir must point at a clone of github.com/YongZhou2019/19K-RGP; loading the model needs joblib, scikit-learn, xgboost, and lightgbm. Only the heading-date model is public so far; the other four traits will be added at resubmission.

Configuration (no credentials, ever)

oryza19k.CONFIG reads from environment variables — ORYZA19K_VCF_BASE (public bgzipped VCF host for tabix), ORYZA19K_MODELS_DIR, ORYZA19K_DATA_DIR, ORYZA19K_ZENODO. Review-only credentials are never placed here; the public VCF host is set at publication.
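
A minimal sketch of environment-driven configuration along these lines is shown below. The Config class, its field names, and load_config are assumptions; only the four variable names come from the text above, and what ORYZA19K_ZENODO holds (a DOI or a URL) is not specified here.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Config:
    vcf_base: Optional[str]    # public bgzipped VCF host for tabix
    models_dir: Optional[str]  # clone of the model repository
    data_dir: Optional[str]    # local tables and genotype matrices
    zenodo: Optional[str]      # Zenodo record for the precomputed tables


def load_config(env=None):
    """Read settings from the environment; unset variables become None."""
    env = os.environ if env is None else env
    return Config(
        vcf_base=env.get("ORYZA19K_VCF_BASE"),
        models_dir=env.get("ORYZA19K_MODELS_DIR"),
        data_dir=env.get("ORYZA19K_DATA_DIR"),
        zenodo=env.get("ORYZA19K_ZENODO"),
    )
```

Nothing secret is stored: every field is a public location or a local path, and a frozen dataclass keeps a workflow from changing the configuration by accident halfway through a run.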