Supplementary Note 10 — Data Access and Visualization
Expanded for the revision in response to Reviewer #1, comment 6 (efficient data access: APIs, precomputed summaries, example workflows). §§10.1–10.2 are retained and expanded; §§10.3–10.6 are new.
The 19K-RGP is distributed across complementary platforms, each suited to a different task. To make the resource easy to use efficiently, we provide, in addition to the platform portals: (i) documented programmatic access (APIs) per platform, (ii) DOI-citable precomputed summary tables so common questions need no large query, and (iii) runnable example workflows that connect the platforms. A companion documentation website and a small Python helper module (oryza19k) accompany this note.
10.0 Choosing the right platform
| If you want to… | Use | How |
|---|---|---|
| Inspect genotypes / haplotypes / allele frequencies for a small region across all accessions | SNP-Seek v3 | Genotype viewer / REST |
| Browse a genome, gene models, and predicted variant effects | GrameneOryza | Ensembl REST · browser |
| Slice a locus across all 19,035 genomes without downloading | Gramene FTP | remote tabix |
| Explore climate ↔ genome (G×E) associations | Oryza CLIMtools | downloadable tables |
| Predict an agronomic trait from genotypes | Code & Models | pre-trained models (Docker/Colab) |
| Obtain raw reads, the full variant archive, or a citable DOI | Archives (NCBI/EVA/KAUST) | accession download |
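The decision table can also be held as a small lookup, which is roughly what a helper such as choose_platform could do. The sketch below is illustrative only: the task keys and the function body are assumptions, not the oryza19k implementation, while the platform names and access routes are copied from the table.

```python
# Hypothetical sketch of a decision-table lookup. The task keys are invented
# for illustration; the (platform, route) values are copied from the table.
PLATFORMS = {
    "genotypes": ("SNP-Seek v3", "Genotype viewer / REST"),
    "browse": ("GrameneOryza", "Ensembl REST · browser"),
    "slice": ("Gramene FTP", "remote tabix"),
    "climate": ("Oryza CLIMtools", "downloadable tables"),
    "predict": ("Code & Models", "pre-trained models (Docker/Colab)"),
    "archive": ("Archives (NCBI/EVA/KAUST)", "accession download"),
}


def choose_platform(task: str) -> tuple[str, str]:
    """Return (platform, access route) for a task key, or raise listing valid keys."""
    try:
        return PLATFORMS[task.lower()]
    except KeyError:
        raise ValueError(f"unknown task {task!r}; choose from {sorted(PLATFORMS)}") from None


if __name__ == "__main__":
    print(choose_platform("slice"))
```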
The five platinum reference genomes used throughout are Nipponbare IRGSP‑1.0 (GJ; GCA_001433935.1), IR64RS2 (XI; gap‑free, this study), MH63RS3 (XI; CP054676–CP054688), ARC 10497 (cB; GCA_009831255.1), and N22 (cA; GCA_001952365.2).
10.1 Rice SNP-Seek (genotypes, haplotypes, allele frequencies, phenotypes)
See it live → SNP-Seek v3 (overview, tutorial, and examples).
The 19K-RGP variant calls — SNPs and small InDels, plus genomic VCFs (gVCFs) from the new 9K-RGP sequencing and the public 3K-RGP / 7K-Public datasets — are available in the Rice SNP-Seek database (snp-seek.irri.org). The initial 19K-RGP release in SNP-Seek provides variants called against two references, Nipponbare IRGSP‑1.0 (GJ) and MH63RS3 (XI).
Interactive use. SNP-Seek supports inspection of SNPs/InDels within a small genome region (genes, QTLs < 500 kb) across all accessions at once, computation of local haplotypes, the genotype viewer, embedded JBrowse, gene-locus search, and pairwise SNP comparison, alongside accession phenotypes, allele frequencies, and subpopulation grouping.
Programmatic access (REST). SNP-Seek exposes a RESTful interface (described in Mansueto et al., Nucleic Acids Research 2017, "SNP-Seek II"). Typical queries — genotype-by-region, variety/germplasm lists, and SNP-by-position — follow the pattern below.
# Genotype matrix for a region across a chosen accession set (schema per SNP-Seek II).
# NOTE: confirm the exact route/parameters against the live API before relying on it.
curl -s "https://snp-seek.irri.org/<api>/genotype?ref=IRGSP-1.0&chrom=3&start=31031753&end=31041563&varset=XI"
Alternatively, use remote tabix on the FTP-hosted VCFs (§10.6) or the precomputed allele-frequency table (§10.6), both of which are credential-free and durable.
Bulk downloads. Whole-dataset genotypes are available as VCF, HDF5, and genotype matrices, and through the 3,000 Rice Genomes Project Open Data (Amazon Sustainability Data Initiative).
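The genotype-by-region pattern in this section takes a reference, chromosome, start, end, and variety set. As a minimal sketch, the helper below parses a region string into those query parameters and enforces the small-region guidance (< 500 kb) from the interactive-use paragraph. The function name and its error behaviour are assumptions for illustration; the parameter names are copied from the curl pattern, and the route itself is deliberately left out because it still has to be confirmed against the live API.

```python
import re
from urllib.parse import urlencode

MAX_REGION_BP = 500_000  # small-region guidance quoted in this section (< 500 kb)


def region_params(region: str, ref: str = "IRGSP-1.0", varset: str = "XI") -> dict:
    """Parse 'chrom:start-end' (1-based, inclusive) into query parameters."""
    m = re.fullmatch(r"(\w+):([\d,]+)-([\d,]+)", region.strip())
    if m is None:
        raise ValueError(f"expected 'chrom:start-end', got {region!r}")
    chrom = m.group(1)
    start = int(m.group(2).replace(",", ""))
    end = int(m.group(3).replace(",", ""))
    if start < 1 or end < start:
        raise ValueError("start must be >= 1 and end must be >= start")
    if end - start + 1 >= MAX_REGION_BP:
        raise ValueError("region is 500 kb or larger; use bulk downloads or tabix instead")
    return {"ref": ref, "chrom": chrom, "start": start, "end": end, "varset": varset}


if __name__ == "__main__":
    # Only the query string is built here; the caller supplies the confirmed route.
    print(urlencode(region_params("3:31,031,753-31,041,563")))
```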
10.2 GrameneOryza (genome browser, search, FTP, Ensembl REST)
See it live → GrameneOryza · copy-paste recipes at API → Ensembl REST.
GrameneOryza serves the 19K-RGP across all five references: a genome browser, a search interface, and an FTP site holding the extended per-reference variant calls and their predicted effects. Users visualize genetic variation through the location, gene, and variation pages; a video walkthrough using OsMADS50 is available at oryza.gramene.org/videotutorials.
Because GrameneOryza is built on Ensembl, the Ensembl REST API (rest.ensembl.org) provides documented, stable programmatic access to Oryza sativa gene models, regions, variants, effects, and sequence. These calls are verified live:
# Resolve a gene to coordinates (MADS14 used in the paper).
curl -s "https://rest.ensembl.org/lookup/id/Os03g0752800?expand=1;content-type=application/json"
# -> "MADS14", chr 3:31,031,753-31,041,563, strand -1, protein_coding
# Features overlapping a region (genes and variants).
curl -s "https://rest.ensembl.org/overlap/region/oryza_sativa/3:31031753-31041563?feature=gene;feature=variation;content-type=application/json"
# Predicted consequence of a variant (Variant Effect Predictor); format: region/<chrom>:<start>-<end>/<alt>
curl -s "https://rest.ensembl.org/vep/oryza_sativa/region/3:31037240-31037240/A?content-type=application/json"
# -> most_severe_consequence "intron_variant" for MADS14 (Os03t0752800-01)
# Genomic sequence of a transcript.
curl -s "https://rest.ensembl.org/sequence/id/Os03t0752800?type=genomic;content-type=application/json"
import requests  # or use the bundled `oryza19k` helper, which needs no extra deps
g = requests.get("https://rest.ensembl.org/lookup/id/Os03g0752800",
                 params={"expand": 1}, headers={"Content-Type": "application/json"}).json()
print(g["display_name"], g["seq_region_name"], g["start"], g["end"])
BioMart (Ensembl Plants) supports bulk attribute/identifier queries for rice genes and variants without programming. Track hubs and remote tabix (§10.6) let users overlay or stream the 19K-RGP VCFs.
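The JSON returned by the Ensembl REST calls in this section is easy to post-process. A minimal sketch, assuming the usual response shapes (VEP returns a list of records carrying most_severe_consequence; overlap returns a list of features carrying feature_type and id): the helper names are hypothetical, and the sample payloads are trimmed, hand-written stand-ins rather than real server output.

```python
from collections import defaultdict


def most_severe(vep_records: list[dict]) -> list[str]:
    """Collect the most_severe_consequence field from each VEP record."""
    return [r["most_severe_consequence"] for r in vep_records
            if "most_severe_consequence" in r]


def ids_by_feature_type(features: list[dict]) -> dict[str, list[str]]:
    """Group overlap/region features by their feature_type."""
    grouped = defaultdict(list)
    for f in features:
        grouped[f.get("feature_type", "unknown")].append(f.get("id", ""))
    return dict(grouped)


# Hand-written stand-ins for parsed JSON (placeholders, not real server output).
sample_vep = [{"most_severe_consequence": "intron_variant"}]
sample_overlap = [
    {"feature_type": "gene", "id": "Os03g0752800", "start": 31031753, "end": 31041563},
    {"feature_type": "variation", "id": "placeholder_variant", "start": 31037240, "end": 31037240},
]

if __name__ == "__main__":
    print(most_severe(sample_vep))
    print(ids_by_feature_type(sample_overlap))
```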
10.3 Oryza CLIMtools (climate ↔ genome / G×E associations) — new
See it live → Oryza CLIMtools.
Oryza CLIMtools (gramene.org/CLIMtools/oryza_19K-RGP) provides three interactive applications for phenotype × genotype × environment analysis: OryzaCLIM (per-accession geo-environmental variables), Oryza GenoCLIM (gene → environment associations), and Oryza CLIMGeno (environment → genotype). The underlying methodology and v1.0 were described in Ferrero-Serrano et al. (Plant Communications, 2024).
Programmatic access. CLIMtools is an R/Shiny resource and does not expose a REST API. The supported programmatic path is its downloadable result tables, read directly in pandas/R, with the application source available under Apache 2.0 at github.com/CLIMtools. We additionally deposit the relevant association tables among the precomputed summaries (§10.6) so they are reproducible without the live application.
import pandas as pd
# A GenoCLIM result table exported (or taken from the precomputed deposit):
genoclim = pd.read_csv("climtools_genoclim.tsv", sep="\t")
mads14 = genoclim[genoclim["gene"].str.contains("MADS14", case=False)]
mads14.sort_values("P").head()  # top climate variable for MADS14 in XI is BIO6
10.4 Code, pre-trained models, and reproducible environments — new
See it live → Code & Models.
All pipelines and the AI trait-prediction models live at github.com/YongZhou2019/19K-RGP (MIT license). The root README links the variant-calling workflow HPC-GVCW (github.com/IBEXCluster/HPC-GVCW; BMC Biology, DOI 10.1186/s12915-024-01820-5). Modules include allele-frequency and diversity analysis, AlphaFold modeling, GWAS (GAPIT), molecular dynamics, rare-variant statistics, variant calling, and local-ancestry inference for cBasmati.
AI trait prediction. The AI-drive Predictive Phenotype Modeling folder ships pre-trained models that predict agronomic traits from genotypes, with a demonstration on heading date (hdg_80head). The deployed model uses the SHAP-selected top-1,000 SNPs (from the 165,640 high-confidence core SNP set; MAF > 0.05). The interface (from heading_date_predictor.py):
import joblib, numpy as np, pandas as pd
model = joblib.load("best_trained_model_hdg_80head2025.pkl")
imputer = joblib.load("most_freq_imputer_hdg_80head2025.pkl")
df = pd.read_csv("input_top_1000_features.csv") # cols: IID, Phenotype, + 1000 SNPs
X = df.drop(columns=["IID", "Phenotype"]).replace(-9, np.nan).values.astype("float64")
y_pred = model.predict(imputer.transform(X))
Reproducible environment. A Docker image (recommended) bundles RAPIDS/cuDF, PyTorch, and Jupyter:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --rm \
-p 10000:8888 -p 8501:8501 -v ${PWD}:/workspace/mycode \
abdelghafour1/ngc_tf_rapids_25_01_vscode_torch:2025-v3 \
  jupyter lab --ip=0.0.0.0 --allow-root
The demonstration also runs as a one-click Google Colab notebook. At resubmission the four remaining trait models (grain weight, length, width, and length/width ratio) are added alongside the heading-date model.
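The predictor interface in this section treats -9 as a missing genotype and passes the matrix through a most-frequent imputer before prediction. To make that preprocessing concrete, here is a dependency-free sketch of column-wise most-frequent imputation. It illustrates the idea only and is not the pickled imputer shipped with the repository; the function name, the tie-breaking rule, and the toy matrix are assumptions.

```python
from collections import Counter

MISSING = -9  # missing-genotype code used in the predictor's input file


def impute_most_frequent(rows: list[list[int]]) -> list[list[int]]:
    """Replace MISSING in each column with that column's most common observed value."""
    n_cols = len(rows[0])
    fills = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] != MISSING]
        if not observed:
            raise ValueError(f"column {j} has no observed genotypes")
        counts = Counter(observed)
        top = max(counts.values())
        # Break ties by taking the smallest value so the result is deterministic.
        fills.append(min(v for v, c in counts.items() if c == top))
    return [[fills[j] if v == MISSING else v for j, v in enumerate(row)] for row in rows]


if __name__ == "__main__":
    toy = [[0, 2, -9], [0, -9, 1], [-9, 2, 1], [2, 0, 1]]  # toy genotypes, not real data
    print(impute_most_frequent(toy))
```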
10.5 Cross-platform example workflows and the oryza19k access cookbook — new
See it live → Example workflows; the cookbook reference is on the API page.
We provide a small Python helper, oryza19k, that orchestrates the platforms' existing interfaces (it is a client-side cookbook, not a new server/API), and three end-to-end example workflows anchored to results in the paper. Each workflow is a runnable notebook (Colab or Docker) that ships with its outputs already computed and falls back to the precomputed tables if a live service is unavailable, so it always runs for a reader.
oryza19k functions: lookup_gene, region_features, vep_effects (Ensembl REST); region_genotypes(source="tabix"|"snpseek"|"local"); climate_for_gene, accession_climate (CLIMtools tables); predict_trait (pre-trained models); summary_table (precomputed deposit); and choose_platform (the decision table above).
**Workflow A — From the MADS50/MADS14 locus to a heading-date hypothesis.** lookup_gene("Os03g0752800") (MADS14 → chr3:31,031,753–31,041,563) → genotypes across XI via remote tabix or SNP-Seek → vep_effects for predicted consequences → climate_for_gene (top variable: BIO6, minimum temperature of the coldest month) → join per-accession climate and heading date → reproduce the haplotype–climate signal → predict_trait("hdg_80head") to rank candidates.
**Workflow B — Rare TB1/FC1 high-effect variant → tiller number.** Locate TB1/FC1 and the promoter variant (Ensembl) → genotype the single site across the panel (tabix/SNP-Seek) → allele frequency (≈ 38 carriers, 0.95%) → confirm the regulatory consequence (VEP + the high-effect-variant table) → pull tiller number, group, and geography for carriers → report the rare-allele effect with a confidence interval and a relatedness/geography check.
**Workflow C — Whole-collection trait imputation and benchmarking.** Load the 165,640 core-SNP matrix (or stream regions) → predict_trait on the accessions with missing phenotypes → compare against the benchmark table (gradient-boosted trees vs rrBLUP/GBLUP/BayesA–C and tabular deep/foundation models) → map the top SHAP SNPs to genes.
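Workflow B ends by reporting a confidence interval. For the carrier proportion itself, a Wilson score interval is one reasonable choice for a rare allele; the sketch below is an option, not necessarily the method used in the paper. The denominator of 4,000 is a hypothetical value, chosen only so that 38 carriers corresponds to about 0.95%; substitute the real number of genotyped accessions. An interval on the tiller-number effect would additionally need the phenotype data.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    carriers, genotyped = 38, 4000  # `genotyped` is a hypothetical denominator
    low, high = wilson_interval(carriers, genotyped)
    print(f"carrier proportion {carriers / genotyped:.4f}, 95% CI {low:.4f} to {high:.4f}")
```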
10.6 Efficient data access at scale — new
See it live → API → Remote tabix and Precomputed summaries.
Remote tabix streaming — query 19,035 genomes without downloading them. The per-reference VCFs are bgzip-compressed and tabix-indexed; with a server that supports HTTP range requests, any locus can be streamed directly:
# Stream just the MADS14 locus from the bgzipped 19K-RGP VCF — no full download.
tabix -h https://<public-host>/19K-RGP/IRGSP-1.0/19K-RGP.IRGSP-1.0.snps.vcf.gz 3:31031753-31041563
# Same, with bcftools, emitting a tidy genotype table:
bcftools view -r 3:31031753-31041563 https://<public-host>/.../19K-RGP.IRGSP-1.0.snps.vcf.gz \
  | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n'
import pysam
vcf = pysam.VariantFile("https://<public-host>/.../19K-RGP.IRGSP-1.0.snps.vcf.gz")
# A region string is 1-based and inclusive, matching the tabix/bcftools calls above;
# positional start/stop arguments to fetch() would instead be 0-based and half-open.
for rec in vcf.fetch(region="3:31031753-31041563"):
    print(rec.chrom, rec.pos, rec.ref, rec.alts)
bgzip/tabix preparation is coordinated with the Gramene team.
Precomputed summary tables (no query needed). A DOI-citable Zenodo deposit (mirrored on the Gramene FTP and the KAUST repository) provides: a master accession/passport table with varietal groups and the five-reference coordinate mappings; per-variety-group allele frequencies for the 165,640 core SNPs; a per-gene variant summary; the high-effect-regulatory-variant table; the genome–environment-association hits per group; the five trait phenotype tables; the genomic-prediction benchmark table; the SHAP-top-SNP→gene table; and the MADS14/MADS50 haplotype tables. See the deposit manifest for columns and formats.
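The bcftools query format used in this section yields one line per site with a genotype column per accession, while the precomputed deposit provides allele frequencies per varietal group. As a hedged sketch of how the two relate, the function below computes the alternate-allele frequency from one such line. It assumes diploid GT strings such as 0/0, 0|1, or ./., counts any non-reference allele as alternate, and leaves missing calls out of the denominator; the function name and the toy line are illustrative and are not taken from the 19K-RGP files.

```python
import re


def alt_allele_frequency(line: str) -> float:
    """Alternate-allele frequency from a tab-separated 'CHROM POS REF ALT GT...' line."""
    fields = line.rstrip("\n").split("\t")
    genotypes = fields[4:]  # the first four columns are CHROM, POS, REF, ALT
    alt = total = 0
    for gt in genotypes:
        for allele in re.split(r"[/|]", gt):
            if allele == ".":
                continue  # missing allele: excluded from the denominator
            total += 1
            if allele != "0":
                alt += 1
    if total == 0:
        raise ValueError("no called alleles at this site")
    return alt / total


if __name__ == "__main__":
    toy = "1\t100\tC\tT\t0/0\t0/1\t1/1\t./.\t0|0"  # toy site, not real data
    print(alt_allele_frequency(toy))
```

Restricting the genotype columns to the accessions of one varietal group before calling the function would give a per-group frequency comparable to the deposited table.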
Honest limits. SNP-Seek REST endpoint paths are being confirmed with IRRI; CLIMtools has no REST API (tables only); pre-publication browser/FTP endpoints are credentialed and move to anonymous production at publication, with the European Variation Archive (PRJEB105137), NCBI BioProjects (PRJNA954521, PRJNA597070, PRJNA952097), the KAUST repository (DOI 10.25781/N3AF-NP78), and the Zenodo deposit as the durable, credential-free records.