Supplementary Note 10 — Data Access and Visualization

Expanded for the revision in response to Reviewer #1, comment 6 (efficient data access: APIs, precomputed summaries, example workflows). §§10.1–10.2 are retained and expanded; §§10.3–10.6 are new.

The 19K-RGP is distributed across complementary platforms, each suited to a different task. To make the resource easy to use efficiently, we provide, in addition to the platform portals: (i) documented programmatic access (APIs) per platform, (ii) DOI-citable precomputed summary tables so common questions need no large query, and (iii) runnable example workflows that connect the platforms. A companion documentation website and a small Python helper module (oryza19k) accompany this note.

Access policy. Every endpoint listed here is, or will at publication be, anonymous and login-free, consistent with the journal's data-availability policy. Where a platform is in a pre-publication, credentialed staging state, this is stated explicitly and the durable, credential-free path (an archive accession or a Zenodo DOI) is given alongside.

10.0 Choosing the right platform

If you want to…	Use	How
Inspect genotypes / haplotypes / allele frequencies for a small region across all accessions	SNP-Seek v3	Genotype viewer / REST
Browse a genome, gene models, and predicted variant effects	GrameneOryza	Ensembl REST · browser
Slice a locus across all 19,035 genomes without downloading	Gramene FTP	remote `tabix`
Explore climate ↔ genome (G×E) associations	Oryza CLIMtools	downloadable tables
Predict an agronomic trait from genotypes	Code & Models	pre-trained models (Docker/Colab)
Obtain raw reads, the full variant archive, or a citable DOI	Archives (NCBI/EVA/KAUST)	accession download
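
For scripted use, the table above can be held as a small lookup. The sketch below is illustrative only: the task keys and the function name `suggest_platform` are invented here and are not taken from any platform or from the `oryza19k` source.

```python
# Hypothetical lookup mirroring the decision table; the task keys are
# invented for this sketch and are not part of any platform's API.
PLATFORM_TABLE = {
    "genotypes": ("SNP-Seek v3", "Genotype viewer / REST"),
    "browse": ("GrameneOryza", "Ensembl REST · browser"),
    "slice": ("Gramene FTP", "remote tabix"),
    "climate": ("Oryza CLIMtools", "downloadable tables"),
    "predict": ("Code & Models", "pre-trained models (Docker/Colab)"),
    "archive": ("Archives (NCBI/EVA/KAUST)", "accession download"),
}


def suggest_platform(task):
    """Return (platform, access route) for a task key, as listed in the table."""
    try:
        return PLATFORM_TABLE[task.strip().lower()]
    except KeyError:
        options = ", ".join(sorted(PLATFORM_TABLE))
        raise KeyError(f"unknown task {task!r}; choose one of: {options}") from None
```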

The five platinum reference genomes used throughout are Nipponbare IRGSP‑1.0 (GJ; GCA_001433935.1), IR64RS2 (XI; gap‑free, this study), MH63RS3 (XI; CP054676–CP054688), ARC 10497 (cB; GCA_009831255.1), and N22 (cA; GCA_001952365.2).

10.1 Rice SNP-Seek (genotypes, haplotypes, allele frequencies, phenotypes)

See it live → SNP-Seek v3 (overview, tutorial, and examples).

The 19K-RGP variant calls — SNPs and small InDels, plus genomic VCFs (gVCFs) from the new 9K-RGP sequencing and the public 3K-RGP / 7K-Public datasets — are available in the Rice SNP-Seek database (snp-seek.irri.org). The initial 19K-RGP release in SNP-Seek provides variants called against two references, Nipponbare IRGSP‑1.0 (GJ) and MH63RS3 (XI).

Interactive use. SNP-Seek supports inspection of SNPs/InDels within a small genome region (genes, QTLs < 500 kb) across all accessions at once, computation of local haplotypes, the genotype viewer, embedded JBrowse, gene-locus search, and pairwise SNP comparison, alongside accession phenotypes, allele frequencies, and subpopulation grouping.

Programmatic access (REST). SNP-Seek exposes a RESTful interface (described in Mansueto et al., Nucleic Acids Research 2017, "SNP-Seek II"). Typical queries — genotype-by-region, variety/germplasm lists, and SNP-by-position — follow the pattern below.

# Genotype matrix for a region across a chosen accession set (schema per SNP-Seek II).
# NOTE: confirm the exact route/parameters against the live API before relying on it.
curl -s "https://snp-seek.irri.org/<api>/genotype?ref=IRGSP-1.0&chrom=3&start=31031753&end=31041563&varset=XI"

⚠ To confirm before publication. The exact SNP-Seek REST paths/parameters are being verified with the IRRI team (McNally, Mauleon, Chebotarov). Until confirmed, the equivalent query is fully reproducible via remote tabix on the FTP-hosted VCFs (§10.6) or the precomputed allele-frequency table (§10.6), both of which are credential-free and durable.

Bulk downloads. Whole-dataset genotypes are available as VCF, HDF5, and genotype matrices, and through the 3,000 Rice Genomes Project Open Data (Amazon Sustainability Data Initiative).

10.2 GrameneOryza (genome browser, search, FTP, Ensembl REST)

See it live → GrameneOryza · copy-paste recipes at API → Ensembl REST.

GrameneOryza serves the 19K-RGP across all five references: a genome browser, a search interface, and an FTP site holding the extended per-reference variant calls and their predicted effects. Users visualize genetic variation through the location, gene, and variation pages; a video walkthrough using OsMADS50 is available at oryza.gramene.org/videotutorials.

Because GrameneOryza is built on Ensembl, the Ensembl REST API (rest.ensembl.org) provides documented, stable programmatic access to Oryza sativa gene models, regions, variants, effects, and sequence. These calls are verified live:

# Resolve a gene to coordinates (MADS14 used in the paper).
curl -s "https://rest.ensembl.org/lookup/id/Os03g0752800?expand=1;content-type=application/json"
#  -> "MADS14", chr 3:31,031,753-31,041,563, strand -1, protein_coding

# Features overlapping a region (genes and variants).
curl -s "https://rest.ensembl.org/overlap/region/oryza_sativa/3:31031753-31041563?feature=gene;feature=variation;content-type=application/json"

# Predicted consequence of a variant (Variant Effect Predictor); format: region/<chrom>:<start>-<end>/<alt>
curl -s "https://rest.ensembl.org/vep/oryza_sativa/region/3:31037240-31037240/A?content-type=application/json"
#  -> most_severe_consequence "intron_variant" for MADS14 (Os03t0752800-01)

# Genomic sequence of a transcript.
curl -s "https://rest.ensembl.org/sequence/id/Os03t0752800-01?type=genomic;content-type=application/json"

import requests   # or use the bundled `oryza19k` helper, which needs no extra deps
g = requests.get("https://rest.ensembl.org/lookup/id/Os03g0752800",
                 params={"expand": 1}, headers={"Content-Type": "application/json"}).json()
print(g["display_name"], g["seq_region_name"], g["start"], g["end"])
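
The VEP call above can also be issued with the standard library alone. This is a minimal sketch rather than the `oryza19k` implementation: the function names are hypothetical, and the parser assumes the response is a JSON list whose entries carry `input` and `most_severe_consequence` keys, as in the example output shown for MADS14.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

ENSEMBL = "https://rest.ensembl.org"


def vep_region_url(species, chrom, start, end, alt):
    """Build a VEP URL of the form region/<chrom>:<start>-<end>/<alt>."""
    region = f"{chrom}:{start}-{end}"
    return f"{ENSEMBL}/vep/{quote(species)}/region/{region}/{quote(alt)}"


def fetch_json(url, timeout=30):
    """GET a URL and decode the JSON body (this performs a network request)."""
    req = Request(url, headers={"Content-Type": "application/json"})
    with urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def most_severe(vep_response):
    """Return (input, most_severe_consequence) pairs from a VEP response list."""
    return [(r.get("input"), r.get("most_severe_consequence")) for r in vep_response]
```

A call such as `most_severe(fetch_json(vep_region_url("oryza_sativa", 3, 31037240, 31037240, "A")))` would then reproduce the lookup shown above, network permitting.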

BioMart (Ensembl Plants) supports bulk attribute/identifier queries for rice genes and variants without programming. Track hubs and remote tabix (§10.6) let users overlay or stream the 19K-RGP VCFs.

Pre-publication status. During review the GrameneOryza rice resources are served from a staging instance with reviewer credentials provided via the editor. At publication they move to public, login-free production endpoints; the durable variant archive is the European Variation Archive accession PRJEB105137 (SNP set ERZ28769989, InDel set ERZ28769990).

10.3 Oryza CLIMtools (climate ↔ genome / G×E associations) — new

See it live → Oryza CLIMtools.

Oryza CLIMtools (gramene.org/CLIMtools/oryza_19K-RGP) provides three interactive applications for phenotype × genotype × environment analysis: OryzaCLIM (per-accession geo-environmental variables), Oryza GenoCLIM (gene → environment associations), and Oryza CLIMGeno (environment → genotype). The underlying methodology and v1.0 were described in Ferrero-Serrano et al. (Plant Communications, 2024).

Programmatic access. CLIMtools is an R/Shiny resource and does not expose a REST API. The supported programmatic path is its downloadable result tables, read directly in pandas/R, with the application source available under Apache 2.0 at github.com/CLIMtools. We additionally deposit the relevant association tables among the precomputed summaries (§10.6) so they are reproducible without the live application.

import pandas as pd
# A GenoCLIM result table exported (or taken from the precomputed deposit):
genoclim = pd.read_csv("climtools_genoclim.tsv", sep="\t")
mads14 = genoclim[genoclim["gene"].str.contains("MADS14", case=False)]
mads14.sort_values("P").head()            # top climate variable for MADS14 in XI is BIO6

10.4 Code, pre-trained models, and reproducible environments — new

See it live → Code & Models.

All pipelines and the AI trait-prediction models live at github.com/YongZhou2019/19K-RGP (MIT license). The root README links the variant-calling workflow HPC-GVCW (github.com/IBEXCluster/HPC-GVCW; BMC Biology, DOI 10.1186/s12915-024-01820-5). Modules include allele-frequency and diversity analysis, AlphaFold modeling, GWAS (GAPIT), molecular dynamics, rare-variant statistics, variant calling, and local-ancestry inference for cBasmati.

AI trait prediction. The AI-drive Predictive Phenotype Modeling folder ships pre-trained models that predict agronomic traits from genotypes, with a demonstration on heading date (hdg_80head). The deployed model uses the SHAP-selected top-1,000 SNPs (from the 165,640 high-confidence core SNP set; MAF > 0.05). The interface (from heading_date_predictor.py):

import joblib, numpy as np, pandas as pd
model   = joblib.load("best_trained_model_hdg_80head2025.pkl")
imputer = joblib.load("most_freq_imputer_hdg_80head2025.pkl")
df = pd.read_csv("input_top_1000_features.csv")            # cols: IID, Phenotype, + 1000 SNPs
X  = df.drop(columns=["IID", "Phenotype"]).replace(-9, np.nan).values.astype("float64")
y_pred = model.predict(imputer.transform(X))
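
The snippet above recodes `-9` as missing and passes the matrix through a saved imputer. The file name suggests a most-frequent-value strategy, but that is an assumption; the dependency-free sketch below only illustrates what such a per-SNP imputation does on a toy matrix and is not the deployed imputer. The tie-breaking rule (smallest value wins) is a choice made for this sketch.

```python
from collections import Counter

MISSING = -9  # missing-genotype code used in the input CSV above


def fit_most_frequent(rows):
    """For each SNP column, find the most common observed genotype."""
    fills = []
    for col in zip(*rows):
        observed = [v for v in col if v != MISSING]
        if not observed:
            raise ValueError("column has no observed genotypes")
        counts = Counter(observed)
        best = max(counts.values())
        # Ties are broken by taking the smallest value (sketch-level choice).
        fills.append(min(v for v, c in counts.items() if c == best))
    return fills


def transform(rows, fills):
    """Replace each missing genotype with its column's fill value."""
    return [[fills[j] if v == MISSING else v for j, v in enumerate(row)]
            for row in rows]


# Toy matrix: 4 accessions x 3 SNPs, genotypes coded 0/1/2.
toy = [[0, 2, -9], [0, -9, 1], [-9, 2, 1], [2, 1, -9]]
fills = fit_most_frequent(toy)
imputed = transform(toy, fills)
```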

Reproducible environment. A Docker image (recommended) bundles RAPIDS/cuDF, PyTorch, and Jupyter:

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --rm \
  -p 10000:8888 -p 8501:8501 -v "${PWD}":/workspace/mycode \
  abdelghafour1/ngc_tf_rapids_25_01_vscode_torch:2025-v3 \
  jupyter lab --ip=0.0.0.0 --allow-root

The demonstration also runs as a one-click Google Colab notebook. At resubmission the four remaining trait models (grain weight, length, width, and length/width ratio) are added alongside the heading-date model.

10.5 Cross-platform example workflows and the `oryza19k` access cookbook — new

See it live → Example workflows; the cookbook reference is on the API page.

We provide a small Python helper, oryza19k, that orchestrates the platforms' existing interfaces (it is a client-side cookbook, not a new server/API), and three end-to-end example workflows anchored to results in the paper. Each workflow is a runnable notebook (Colab or Docker) that ships with its outputs already computed and falls back to the precomputed tables if a live service is unavailable, so it always runs for a reader.

oryza19k functions: lookup_gene, region_features, vep_effects (Ensembl REST); region_genotypes(source="tabix"|"snpseek"|"local"); climate_for_gene, accession_climate (CLIMtools tables); predict_trait (pre-trained models); summary_table (precomputed deposit); and choose_platform (the decision table above).
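
The fallback behaviour described for the workflows (use the live service, otherwise the precomputed table) can be sketched as a small wrapper. The name `with_fallback` and the two stand-in functions are hypothetical; how `oryza19k` actually implements this is not shown here.

```python
def with_fallback(live, precomputed, errors=(OSError, ValueError)):
    """Call live(); on a listed error, call precomputed() instead.

    Returns (result, source) so the caller can report which path was used.
    OSError covers network failures and timeouts; ValueError covers
    malformed JSON responses.
    """
    try:
        return live(), "live"
    except errors:
        return precomputed(), "precomputed"


def unavailable_service():
    # Stand-in for a live query that fails, e.g. a network timeout.
    raise TimeoutError("service did not respond")


def cached_table():
    # Stand-in for reading a row from the precomputed deposit.
    return [{"gene": "MADS14", "chrom": "3"}]


rows, source = with_fallback(unavailable_service, cached_table)
```

Returning the source label alongside the result lets a notebook state whether a figure was built from live or precomputed data.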

**Workflow A — From the MADS50/MADS14 locus to a heading-date hypothesis.** lookup_gene("Os03g0752800") (MADS14 → chr3:31,031,753–31,041,563) → genotypes across XI via remote tabix or SNP-Seek → vep_effects for predicted consequences → climate_for_gene (top variable: BIO6, minimum temperature of the coldest month) → join per-accession climate and heading date → reproduce the haplotype–climate signal → predict_trait("hdg_80head") to rank candidates.

**Workflow B — Rare TB1/FC1 high-effect variant → tiller number.** Locate TB1/FC1 and the promoter variant (Ensembl) → genotype the single site across the panel (tabix/SNP-Seek) → allele frequency (≈ 38 carriers, 0.95%) → confirm the regulatory consequence (VEP + the high-effect-variant table) → pull tiller number, group, and geography for carriers → report the rare-allele effect with a confidence interval and a relatedness/geography check.
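
Because Workflow B rests on a small number of carriers, it helps to state the uncertainty on the carrier frequency itself. The sketch below computes a Wilson score interval for a proportion; it addresses the frequency only, not the effect-size interval mentioned in the workflow. The denominator of 4,000 is a placeholder chosen so that 38 carriers correspond to 0.95%; substitute the actual number of genotyped accessions.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Placeholder panel size; see the note above.
low, high = wilson_interval(38, 4000)
```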

**Workflow C — Whole-collection trait imputation and benchmarking.** Load the 165,640 core-SNP matrix (or stream regions) → predict_trait on the accessions with missing phenotypes → compare against the benchmark table (gradient-boosted trees vs rrBLUP/GBLUP/BayesA–C and tabular deep/foundation models) → map the top SHAP SNPs to genes.
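
The last step of Workflow C, mapping SNP positions to genes, amounts to an interval lookup. The following is a minimal sketch under stated assumptions: gene coordinates are 1-based and inclusive, and the function names are hypothetical. A linear scan per chromosome is adequate for about 1,000 SNPs; an interval tree would be the usual choice at larger scale.

```python
from collections import defaultdict


def index_genes(genes):
    """Group (gene_id, chrom, start, end) records by chromosome."""
    by_chrom = defaultdict(list)
    for gene_id, chrom, start, end in genes:
        by_chrom[chrom].append((start, end, gene_id))
    return by_chrom


def genes_at(by_chrom, chrom, pos):
    """Return the sorted IDs of all genes whose span contains pos."""
    return sorted(g for s, e, g in by_chrom.get(chrom, []) if s <= pos <= e)


# MADS14 coordinates as given in §10.2; the SNP position is the VEP example.
gene_index = index_genes([("Os03g0752800", "3", 31031753, 31041563)])
hit = genes_at(gene_index, "3", 31037240)
miss = genes_at(gene_index, "3", 31050000)
```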

10.6 Efficient data access at scale — new

See it live → API → Remote tabix and Precomputed summaries.

Remote tabix streaming — query 19,035 genomes without downloading them. The per-reference VCFs are bgzip-compressed and tabix-indexed; with a server that supports HTTP range requests, any locus can be streamed directly:

# Stream just the MADS14 locus from the bgzipped 19K-RGP VCF — no full download.
tabix -h https://<public-host>/19K-RGP/IRGSP-1.0/19K-RGP.IRGSP-1.0.snps.vcf.gz 3:31031753-31041563

# Same, with bcftools, emitting a tidy genotype table:
bcftools view -r 3:31031753-31041563 https://<public-host>/.../19K-RGP.IRGSP-1.0.snps.vcf.gz \
  | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n'

import pysam
vcf = pysam.VariantFile("https://<public-host>/.../19K-RGP.IRGSP-1.0.snps.vcf.gz")
# pysam's fetch uses 0-based, half-open coordinates, so subtract 1 from the 1-based start.
for rec in vcf.fetch("3", 31031753 - 1, 31041563):
    print(rec.chrom, rec.pos, rec.ref, rec.alts)
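
The tab-separated lines emitted by the `bcftools query` format string above can be reduced to an alternate-allele frequency per site. This is a sketch, not part of the published pipeline: the function name is hypothetical, every non-reference allele is counted as alternate (so multiallelic sites are pooled), and missing calls (`.`) are excluded from the denominator.

```python
import re


def alt_allele_frequency(line):
    """Parse one 'CHROM POS REF ALT GT...' line; return the site and ALT frequency."""
    chrom, pos, ref, alt, *gts = line.rstrip("\n").split("\t")
    called = 0
    alt_count = 0
    for gt in gts:
        # Genotypes look like 0/1 (unphased) or 0|1 (phased); '.' is missing.
        for allele in re.split(r"[/|]", gt):
            if allele == ".":
                continue
            called += 1
            if allele != "0":
                alt_count += 1
    freq = alt_count / called if called else None
    return chrom, int(pos), ref, alt, freq


# Toy line with four samples, one of them uncalled.
example = "3\t31037240\tG\tA\t0/0\t0/1\t1/1\t./.\n"
site = alt_allele_frequency(example)
```

In practice the function would be applied to each line read from `sys.stdin` when the `bcftools` pipeline is piped into a script.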

Remote streaming requires the VCFs and their tabix indexes to be served over HTTPS with byte-range support; the one-time bgzip/tabix preparation is coordinated with the Gramene team.
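
Whether a host advertises byte-range support can be checked before attempting a remote query. The sketch below inspects the `Accept-Ranges` response header; the function names are hypothetical, and a server may honour range requests without advertising them, so a negative result is a hint rather than proof.

```python
from urllib.request import Request, urlopen


def supports_ranges(headers):
    """True if the response headers advertise byte-range support."""
    normalized = {k.lower(): v for k, v in headers.items()}
    return normalized.get("accept-ranges", "").strip().lower() == "bytes"


def head_headers(url, timeout=30):
    """Issue a HEAD request and return the response headers as a dict."""
    req = Request(url, method="HEAD")
    with urlopen(req, timeout=timeout) as resp:
        return dict(resp.headers.items())
```

For example, `supports_ranges(head_headers(url))` would be called once per VCF URL, and again for the matching index file.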

Precomputed summary tables (no query needed). A DOI-citable Zenodo deposit (mirrored on the Gramene FTP and the KAUST repository) provides: a master accession/passport table with varietal groups and the five-reference coordinate mappings; per-variety-group allele frequencies for the 165,640 core SNPs; a per-gene variant summary; the high-effect-regulatory-variant table; the genome–environment-association hits per group; the five trait phenotype tables; the genomic-prediction benchmark table; the SHAP-top-SNP→gene table; and the MADS14/MADS50 haplotype tables. See the deposit manifest for columns and formats.

Honest limits. SNP-Seek REST endpoint paths are being confirmed with IRRI; CLIMtools has no REST API (tables only); pre-publication browser/FTP endpoints are credentialed and move to anonymous production at publication, with the European Variation Archive (PRJEB105137), NCBI BioProjects (PRJNA954521, PRJNA597070, PRJNA952097), the KAUST repository (DOI 10.25781/N3AF-NP78), and the Zenodo deposit as the durable, credential-free records.
