Example workflows

Three runnable, cross-platform recipes anchored to results in the paper. Each ships as a notebook (Colab / Docker) with committed outputs, and falls back to precomputed tables so it always runs.

Notebooks on GitHub ↗

A — From the MADS50/MADS14 locus to a heading-date hypothesis

Question: does natural variation at MADS14 in Xian/Indica track local temperature, and can I turn that into a testable heading-date prediction?

  1. Resolve the gene (GrameneOryza / Ensembl REST) → MADS14, chr3:31,031,753–31,041,563.
  2. Variants & effects in the locus (overlap + VEP).
  3. Genotypes across XI via remote tabix (or SNP-Seek).
  4. Climate association (Oryza CLIMtools) → top variable: BIO6 (minimum temperature of the coldest month).
  5. Join climate + phenotype, reproduce the haplotype–climate signal.
  6. Predict heading date genome-wide and rank candidates (pre-trained model).

import oryza19k as o19

gene     = o19.lookup_gene("Os03g0752800")                  # step 1: MADS14
variants = o19.region_features("3:31031753-31041563", feature="variation")  # step 2
clim     = o19.climate_for_gene("MADS14", group="XI")       # step 4: top variable is BIO6
# `genotypes` is the XI genotype matrix obtained in step 3 (loading not shown here)
preds    = o19.predict_trait(genotypes, "hdg_80head")       # step 6
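
Steps 1 and 2 can also be reproduced without the package. The sketch below is a minimal illustration, assuming the public Ensembl REST server at rest.ensembl.org, its `lookup/id` and `overlap/region` endpoints, and the species name `oryza_sativa`. Whether `oryza19k` makes these same calls internally is not stated here, and the helper names are hypothetical. The code only parses the region and builds request URLs, so it runs offline.

```python
from urllib.parse import urlencode

# Assumptions (not taken from the package): server, species name, endpoints.
ENSEMBL_REST = "https://rest.ensembl.org"
SPECIES = "oryza_sativa"


def parse_region(region: str) -> tuple[str, int, int]:
    """Split 'chrom:start-end' (1-based, inclusive) into its parts."""
    chrom, span = region.split(":")
    start, end = (int(part.replace(",", "")) for part in span.split("-"))
    if start > end:
        raise ValueError(f"start {start} is greater than end {end}")
    return chrom, start, end


def region_length(region: str) -> int:
    """Length in bp of an inclusive 1-based region."""
    _, start, end = parse_region(region)
    return end - start + 1


def lookup_url(gene_id: str) -> str:
    """URL for a gene lookup by stable ID (hypothetical helper)."""
    query = urlencode({"content-type": "application/json"})
    return f"{ENSEMBL_REST}/lookup/id/{gene_id}?{query}"


def overlap_url(region: str, feature: str = "variation") -> str:
    """URL listing features of one type that overlap a region (hypothetical helper)."""
    chrom, start, end = parse_region(region)
    query = urlencode({"feature": feature, "content-type": "application/json"})
    return f"{ENSEMBL_REST}/overlap/region/{SPECIES}/{chrom}:{start}-{end}?{query}"


if __name__ == "__main__":
    locus = "3:31031753-31041563"
    print(region_length(locus), "bp")
    print(lookup_url("Os03g0752800"))
    print(overlap_url(locus))
```

Fetching either URL with any HTTP client should return JSON if the assumptions above hold. The MADS14 locus as given spans 9,811 bp.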

B — A rare TB1/FC1 high-effect variant and tiller number

Question: find carriers of the rare TB1/FC1 promoter variant that strengthens an OsBZR1 motif, and check the tiller-number association.

  1. Locate TB1/FC1 (Ensembl) — FC1, chr3:28,428,504–28,430,438.
  2. Genotype the site across the panel; get allele frequency (≈38 carriers, 0.95%).
  3. Confirm the regulatory consequence (VEP + the high-effect-variant table).
  4. Pull tiller number + group/geography for carriers.
  5. Effect size with a 95% CI, plus a relatedness/geography confounding check.

With ~38 carriers, always report the confidence interval and confirm the carriers are not a single clade or locale before interpreting the effect.
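
One way to carry out step 5 is a percentile bootstrap for the difference in mean tiller number between carriers and non-carriers, paired with a quick look at how carriers are spread across groups. This is a sketch under stated assumptions, not the notebook's method: the function names, the resampling scheme, and the toy values are illustrative only, and a mixed model with a kinship matrix would be a more rigorous control for relatedness.

```python
import random
import statistics
from collections import Counter


def bootstrap_mean_diff_ci(carriers, others, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(carriers) - mean(others).

    Returns (point_estimate, lower, upper). Each group is resampled
    with replacement independently of the other.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        c = rng.choices(carriers, k=len(carriers))
        o = rng.choices(others, k=len(others))
        diffs.append(statistics.fmean(c) - statistics.fmean(o))
    diffs.sort()
    lower = diffs[round(alpha / 2 * n_boot)]
    upper = diffs[round((1 - alpha / 2) * n_boot) - 1]
    point = statistics.fmean(carriers) - statistics.fmean(others)
    return point, lower, upper


def largest_group_share(groups):
    """Most common group label among carriers and the fraction it holds."""
    label, count = Counter(groups).most_common(1)[0]
    return label, count / len(groups)


if __name__ == "__main__":
    rng = random.Random(1)
    # Toy tiller counts, invented for illustration (not data from the paper).
    carriers = [rng.gauss(14.0, 3.0) for _ in range(38)]
    others = [rng.gauss(12.0, 3.0) for _ in range(500)]
    point, lower, upper = bootstrap_mean_diff_ci(carriers, others)
    print(f"difference = {point:.2f} (95% CI {lower:.2f} to {upper:.2f})")

    # Toy group labels; a share near 1.0 would suggest a single clade.
    carrier_groups = ["XI"] * 30 + ["other"] * 8
    print(largest_group_share(carrier_groups))
```

If one group or locale holds nearly all carriers, the effect estimate cannot be separated from population structure and should be read with that in mind.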

C — Whole-collection trait imputation & benchmarking

Question: I have genotypes for accessions with no measured grain weight. Can I predict it, and how does the method compare with GBLUP?

  1. Get the core-SNP matrix (precomputed table or stream regions).
  2. Load a pre-trained model and predict on the missing set.
  3. Benchmark against rrBLUP / GBLUP / BayesA–C and tabular deep models.
  4. Interpret via the SHAP-top-SNP → gene table.

import oryza19k as o19

bench = o19.summary_table("benchmark")                   # 23 models × 5 traits
# Top five models for heading date, ranked by Spearman correlation
bench[bench["trait"] == "Heading date"].sort_values("spearman", ascending=False).head()
#  CatBoost 0.838 · XGBoost 0.837 · RandomForest 0.834 · LightGBM 0.833 · GBLUP 0.830
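
The benchmark table is sorted on a `spearman` column. As a reference for what that metric measures, the sketch below computes a Spearman rank correlation between observed and predicted values (Pearson correlation on tie-averaged ranks) and then ranks models by score. It is a plain-Python illustration with hypothetical function names; how the package computes or cross-validates the score is not stated here.

```python
import math
from statistics import fmean


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(observed, predicted):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(observed), average_ranks(predicted))


def rank_models(scores):
    """Sort a {model: score} mapping from best to worst score."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Heading-date scores quoted in the benchmark output above.
    heading_date = {
        "GBLUP": 0.830,
        "LightGBM": 0.833,
        "CatBoost": 0.838,
        "RandomForest": 0.834,
        "XGBoost": 0.837,
    }
    for model, score in rank_models(heading_date):
        print(f"{model:<13}{score:.3f}")
```

Because Spearman depends only on rank order, it rewards a model for ordering accessions correctly even when its predictions are on a shifted or compressed scale.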