Example workflows
Three runnable, cross-platform recipes anchored to results in the paper. Each ships as a notebook (Colab / Docker) with committed outputs, and falls back to precomputed tables so it still runs when a live query fails.
A — From the MADS50/MADS14 locus to a heading-date hypothesis
Question: does natural variation at MADS14 in Xian/Indica track local temperature, and can I turn that into a testable heading-date prediction?
- Resolve the gene (GrameneOryza / Ensembl REST) → MADS14, chr3:31,031,753–31,041,563.
- Variants & effects in the locus (overlap + VEP).
- Genotypes across XI via remote tabix (or SNP-Seek).
- Climate association (Oryza CLIMtools) → top variable BIO6 (min temp, coldest month).
- Join climate + phenotype, reproduce the haplotype–climate signal.
- Predict heading date genome-wide and rank candidates (pre-trained model).
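The first two steps can also be sketched directly against the public Ensembl REST API using only the standard library. This is a minimal sketch, not the notebook's code: it assumes rest.ensembl.org serves the same rice annotation the paper uses, and the helper names (`lookup_url`, `overlap_url`, `fetch_json`) are hypothetical. The `lookup/id` and `overlap/region` endpoints themselves are part of the documented Ensembl REST API.

```python
import json
import urllib.request

SERVER = "https://rest.ensembl.org"  # assumption: public Ensembl REST server


def lookup_url(gene_id: str) -> str:
    """URL for the lookup/id endpoint, which returns a gene's coordinates."""
    return f"{SERVER}/lookup/id/{gene_id}"


def overlap_url(region: str, species: str = "oryza_sativa",
                feature: str = "variation") -> str:
    """URL for overlap/region, which lists features of one type in a region."""
    return f"{SERVER}/overlap/region/{species}/{region}?feature={feature}"


def fetch_json(url: str, timeout: float = 30.0):
    """GET a REST URL and decode the JSON body (requires network access)."""
    req = urllib.request.Request(url, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


# Usage (commented out because it needs network access):
# gene = fetch_json(lookup_url("Os03g0752800"))           # MADS14
# print(gene["seq_region_name"], gene["start"], gene["end"])
# variants = fetch_json(overlap_url("3:31031753-31041563"))
```

Keeping URL construction separate from the network call makes the query easy to inspect and lets a notebook substitute a precomputed table when the service is unreachable.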
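import oryza19k as o19
gene = o19.lookup_gene("Os03g0752800")  # MADS14
variants = o19.region_features("3:31031753-31041563", feature="variation")  # renamed from `vars`, which shadows a Python built-in
clim = o19.climate_for_gene("MADS14", group="XI")  # -> BIO6
# `genotypes` is the XI genotype matrix from the remote-tabix / SNP-Seek step (loading not shown)
preds = o19.predict_trait(genotypes, "hdg_80head")
B — A rare TB1/FC1 high-effect variant and tiller number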
Question: find carriers of the rare TB1/FC1 promoter variant that strengthens an OsBZR1 motif, and check the tiller-number association.
- Locate TB1/FC1 (Ensembl) → FC1, chr3:28,428,504–28,430,438.
- Genotype the site across the panel; get allele frequency (≈38 carriers, 0.95%).
- Confirm the regulatory consequence (VEP + the high-effect-variant table).
- Pull tiller number + group/geography for carriers.
- Effect size with a 95% CI, plus a relatedness/geography confounding check.
With only ≈38 carriers, always report the confidence interval, and confirm that the carriers are not drawn from a single clade or locale before interpreting the effect.
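The effect-size and confounding steps can be sketched as follows. This is a minimal sketch with hypothetical helper names and made-up toy numbers, not the notebook's analysis: it uses a normal-approximation interval with an unequal-variance standard error for the difference in mean tiller number, and a simple measure of how concentrated the carriers are in one group as a crude first check. A kinship-aware mixed model would be the more rigorous test of confounding.

```python
from collections import Counter
from math import sqrt
from statistics import mean, variance


def mean_diff_ci(carriers, non_carriers, z=1.96):
    """Difference in means (carriers minus non-carriers) with an approximate
    95% CI, using the unequal-variance standard error and a normal critical value."""
    diff = mean(carriers) - mean(non_carriers)
    se = sqrt(variance(carriers) / len(carriers)
              + variance(non_carriers) / len(non_carriers))
    return diff, diff - z * se, diff + z * se


def largest_group_share(groups):
    """Most common group among carriers and its share (1.0 = a single group)."""
    label, count = Counter(groups).most_common(1)[0]
    return label, count / len(groups)


# Toy values for illustration only; these are not data from the paper.
carrier_tillers = [14, 16, 15, 18, 17, 13]
other_tillers = [11, 12, 10, 13, 12, 11, 14, 12]
carrier_groups = ["A", "A", "B", "C", "A", "D"]  # hypothetical group labels

diff, lo, hi = mean_diff_ci(carrier_tillers, other_tillers)
label, share = largest_group_share(carrier_groups)
print(f"difference = {diff:.2f} tillers, 95% CI [{lo:.2f}, {hi:.2f}]")
print(f"{share:.0%} of carriers fall in group {label}")
```

A share close to 1.0 means the apparent effect cannot be separated from that group's background, whatever the interval looks like.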
C — Whole-collection trait imputation & benchmarking
Question: given genotypes for accessions with no measured grain weight, how do I predict it and show how the method compares with GBLUP?
- Get the core-SNP matrix (precomputed table or stream regions).
- Load a pre-trained model and predict on the missing set.
- Benchmark against rrBLUP / GBLUP / BayesA–C and tabular deep models.
- Interpret via the SHAP-top-SNP → gene table.
import oryza19k as o19
bench = o19.summary_table("benchmark") # 23 models x 5 traits
bench[bench["trait"] == "Heading date"].sort_values("spearman", ascending=False).head()
# CatBoost 0.838 · XGBoost 0.837 · RandomForest 0.834 · LightGBM 0.833 · GBLUP 0.830
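The `spearman` column used for ranking above is a rank correlation between observed and predicted trait values. Below is a minimal standard-library sketch of that metric, assuming predictions are scored against measured values on held-out accessions (the evaluation split is not described here). The function names are hypothetical, and tied values receive average ranks.

```python
def average_ranks(values):
    """1-based ranks of `values`, with ties given the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(observed, predicted):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(observed), average_ranks(predicted)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Toy heading-date values (days) for illustration only; not data from the paper.
observed = [88, 95, 102, 110, 120]
predicted = [90, 93, 105, 104, 118]
rho = spearman(observed, predicted)
```

Because the metric depends only on ranks, it rewards models that order accessions correctly even when their predictions are off by a constant or a scale factor.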