# Supplementary Note 10 — Data Access and Visualization

*Expanded for the revision in response to Reviewer #1, comment 6 (efficient data access: APIs,
precomputed summaries, example workflows). §§10.1–10.2 are retained and expanded; §§10.3–10.6
are new.*

The 19K-RGP is distributed across complementary platforms, each suited to a different task.
To make the resource easy to use efficiently, we provide, in addition to the platform portals:
(i) documented programmatic access (APIs) per platform, (ii) DOI-citable **precomputed summary
tables** so common questions need no large query, and (iii) runnable **example workflows** that
connect the platforms. A companion documentation website and a small Python helper module
(`oryza19k`) accompany this note.

> **Access policy.** Every endpoint listed here is, or will at publication be, **anonymous and
> login-free**, consistent with the journal's data-availability policy. Where a platform is in a
> pre-publication, credentialed staging state, this is stated explicitly and the durable,
> credential-free path (an archive accession or a Zenodo DOI) is given alongside.

## 10.0 Choosing the right platform

| If you want to… | Use | How |
|---|---|---|
| Inspect genotypes / haplotypes / allele frequencies for a small region across all accessions | **SNP-Seek v3** | Genotype viewer / REST |
| Browse a genome, gene models, and predicted variant effects | **GrameneOryza** | Ensembl REST · browser |
| Slice a locus across all 19,035 genomes **without downloading** | **Gramene FTP** | remote `tabix` |
| Explore climate ↔ genome (G×E) associations | **Oryza CLIMtools** | downloadable tables |
| Predict an agronomic trait from genotypes | **Code & Models** | pre-trained models (Docker/Colab) |
| Obtain raw reads, the full variant archive, or a citable DOI | **Archives** (NCBI/EVA/KAUST) | accession download |

The five platinum reference genomes used throughout are Nipponbare IRGSP‑1.0 (GJ; GCA_001433935.1),
IR64RS2 (XI; gap‑free, this study), MH63RS3 (XI; CP054676–CP054688), ARC 10497 (cB; GCA_009831255.1),
and N22 (cA; GCA_001952365.2).

---

## 10.1 Rice SNP-Seek (genotypes, haplotypes, allele frequencies, phenotypes)

The 19K-RGP variant calls — SNPs and small InDels, plus genomic VCFs (gVCFs) from the new 9K-RGP
sequencing and the public 3K-RGP / 7K-Public datasets — are available in the Rice SNP-Seek
database (`snp-seek.irri.org`). The initial 19K-RGP release in SNP-Seek provides variants called
against two references, **Nipponbare IRGSP‑1.0** (GJ) and **MH63RS3** (XI).

**Interactive use.** SNP-Seek supports inspection of SNPs/InDels within a small genome region
(genes, QTLs < 500 kb) across all accessions at once, computation of local haplotypes, the genotype
viewer, embedded JBrowse, gene-locus search, and pairwise SNP comparison, alongside accession
phenotypes, allele frequencies, and subpopulation grouping.

**Programmatic access (REST).** SNP-Seek exposes a RESTful interface (described in Mansueto *et al.*,
*Nucleic Acids Research* 2017, "SNP-Seek II"). Typical queries — genotype-by-region, variety/germplasm
lists, and SNP-by-position — follow the pattern below.

```bash
# Genotype matrix for a region across a chosen accession set (schema per SNP-Seek II).
# NOTE: confirm the exact route/parameters against the live API before relying on it.
curl -s "https://snp-seek.irri.org/<api>/genotype?ref=IRGSP-1.0&chrom=3&start=31031753&end=31041563&varset=XI"
```

> ⚠ **To confirm before publication.** The exact SNP-Seek REST paths/parameters are being verified
> with the IRRI team (McNally, Mauleon, Chebotarov). Until confirmed, the equivalent query is fully
> reproducible via remote `tabix` on the FTP-hosted VCFs (§10.6) or the precomputed allele-frequency
> table (§10.6), both of which are credential-free and durable.

**Bulk downloads.** Whole-dataset genotypes are available as VCF, HDF5, and genotype matrices, and
through the 3,000 Rice Genomes Project Open Data (Amazon Sustainability Data Initiative).

---

## 10.2 GrameneOryza (genome browser, search, FTP, Ensembl REST)

GrameneOryza serves the 19K-RGP across all five references: a genome browser, a search interface,
and an FTP site holding the extended per-reference variant calls and their predicted effects. Users
visualize genetic variation through the location, gene, and variation pages; a video walkthrough
using *OsMADS50* is available at `oryza.gramene.org/videotutorials`.

Because GrameneOryza is built on Ensembl, the **Ensembl REST API** (`rest.ensembl.org`) provides
documented, stable programmatic access to *Oryza sativa* gene models, regions, variants, effects,
and sequence. These calls are verified live:

```bash
# Resolve a gene to coordinates (MADS14 used in the paper).
curl -s "https://rest.ensembl.org/lookup/id/Os03g0752800?expand=1;content-type=application/json"
#  -> "MADS14", chr 3:31,031,753-31,041,563, strand -1, protein_coding

# Features overlapping a region (genes and variants).
curl -s "https://rest.ensembl.org/overlap/region/oryza_sativa/3:31031753-31041563?feature=gene;feature=variation;content-type=application/json"

# Predicted consequence of a variant (Variant Effect Predictor); format: region/<chrom>:<start>-<end>/<alt>
curl -s "https://rest.ensembl.org/vep/oryza_sativa/region/3:31037240-31037240/A?content-type=application/json"
#  -> most_severe_consequence "intron_variant" for MADS14 (Os03t0752800-01)

# Genomic sequence of a transcript (rice transcript IDs carry a "-NN" suffix, as in the VEP result above).
curl -s "https://rest.ensembl.org/sequence/id/Os03t0752800-01?type=genomic;content-type=application/json"
```

```python
import requests   # or use the bundled `oryza19k` helper, which needs no extra deps
resp = requests.get("https://rest.ensembl.org/lookup/id/Os03g0752800",
                    params={"expand": 1}, headers={"Content-Type": "application/json"},
                    timeout=30)
resp.raise_for_status()   # fail loudly on an HTTP error instead of parsing an error body
g = resp.json()
print(g["display_name"], g["seq_region_name"], g["start"], g["end"])
```
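
The JSON returned by the `vep` call above is a list with one record per input variant. A minimal sketch for flattening it into one row per transcript consequence follows. It assumes the standard Ensembl VEP REST fields `input`, `transcript_consequences`, `transcript_id`, and `consequence_terms`; the function name `summarize_vep` is hypothetical (it is not part of `oryza19k`), and the sample record is hand-written to mirror the result quoted above rather than captured API output.

```python
from typing import Any


def summarize_vep(records: list[dict[str, Any]]) -> list[tuple[str, str, str]]:
    """Flatten VEP REST records into (input, transcript_id, consequence) rows."""
    rows = []
    for rec in records:
        variant = rec.get("input", "")
        # Records without transcript overlap carry no transcript_consequences key.
        for tc in rec.get("transcript_consequences", []):
            for term in tc.get("consequence_terms", []):
                rows.append((variant, tc.get("transcript_id", ""), term))
    return rows


# Hand-written example record (illustrative only; the "input" label is an assumption).
sample = [{
    "input": "3:31037240-31037240/A",
    "most_severe_consequence": "intron_variant",
    "transcript_consequences": [
        {"transcript_id": "Os03t0752800-01",
         "gene_id": "Os03g0752800",
         "consequence_terms": ["intron_variant"]},
    ],
}]

for row in summarize_vep(sample):
    print(*row, sep="\t")
```

In practice `records` would be the parsed response of the `vep/oryza_sativa/region/...` request; keeping the parsing separate from the HTTP call makes it easy to test offline.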

**BioMart** (Ensembl Plants) supports bulk attribute/identifier queries for rice genes and variants
without programming. **Track hubs** and **remote `tabix`** (§10.6) let users overlay or stream the
19K-RGP VCFs.

> **Pre-publication status.** During review the GrameneOryza rice resources are served from a
> staging instance with reviewer credentials provided via the editor. At publication they move to
> public, login-free production endpoints; the durable variant archive is the European Variation
> Archive accession **PRJEB105137** (SNP set ERZ28769989, InDel set ERZ28769990).

---

## 10.3 Oryza CLIMtools (climate ↔ genome / G×E associations)  — *new*

Oryza CLIMtools (`gramene.org/CLIMtools/oryza_19K-RGP`) provides three interactive applications for
phenotype × genotype × environment analysis: **OryzaCLIM** (per-accession geo-environmental
variables), **Oryza GenoCLIM** (gene → environment associations), and **Oryza CLIMGeno**
(environment → genotype). The underlying methodology and v1.0 were described in Ferrero-Serrano
*et al.* (*Plant Communications*, 2024).

**Programmatic access.** CLIMtools is an R/Shiny resource and does **not** expose a REST API. The
supported programmatic path is its **downloadable result tables**, read directly in pandas/R, with
the application source available under Apache 2.0 at `github.com/CLIMtools`. We additionally deposit
the relevant association tables among the precomputed summaries (§10.6) so they are reproducible
without the live application.

```python
import pandas as pd
# A GenoCLIM result table exported (or taken from the precomputed deposit):
genoclim = pd.read_csv("climtools_genoclim.tsv", sep="\t")
mads14 = genoclim[genoclim["gene"].str.contains("MADS14", case=False, na=False)]  # na=False: skip rows with no gene label
mads14.sort_values("P").head()            # top climate variable for MADS14 in XI is BIO6
```

---

## 10.4 Code, pre-trained models, and reproducible environments  — *new*

All pipelines and the AI trait-prediction models live at `github.com/YongZhou2019/19K-RGP`
(MIT license). The root README links the variant-calling workflow **HPC-GVCW**
(`github.com/IBEXCluster/HPC-GVCW`; *BMC Biology*, DOI 10.1186/s12915-024-01820-5). Modules include
allele-frequency and diversity analysis, AlphaFold modeling, GWAS (GAPIT), molecular dynamics,
rare-variant statistics, variant calling, and local-ancestry inference for cBasmati.

**AI trait prediction.** The `AI-drive Predictive Phenotype Modeling` folder ships pre-trained
models that predict agronomic traits from genotypes, with a demonstration on **heading date**
(`hdg_80head`). The deployed model uses the SHAP-selected **top-1,000 SNPs** (from the 165,640
high-confidence core SNP set; MAF > 0.05). The interface (from `heading_date_predictor.py`):

```python
import joblib, numpy as np, pandas as pd
model   = joblib.load("best_trained_model_hdg_80head2025.pkl")
imputer = joblib.load("most_freq_imputer_hdg_80head2025.pkl")
df = pd.read_csv("input_top_1000_features.csv")            # cols: IID, Phenotype, + 1000 SNPs
X  = df.drop(columns=["IID", "Phenotype"]).replace(-9, np.nan).values.astype("float64")
y_pred = model.predict(imputer.transform(X))
```
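
The missing-value handling in the snippet above (`-9` recoded as missing, then imputed) can be illustrated on a toy matrix. This is a sketch under an assumption: the file name `most_freq_imputer_…` suggests per-SNP most-frequent (mode) imputation, but the shipped pickle is the authority. `impute_most_frequent` is a hypothetical pure-Python stand-in, not the repository's code, and the 0/1/2 genotype coding is likewise assumed for illustration.

```python
from collections import Counter

MISSING = -9  # missing-genotype code used in the input CSV above


def impute_most_frequent(rows: list[list[int]]) -> list[list[int]]:
    """Replace MISSING in each column with that column's most frequent observed value."""
    n_cols = len(rows[0])
    fills = []
    for j in range(n_cols):
        counts = Counter(r[j] for r in rows if r[j] != MISSING)
        if not counts:
            raise ValueError(f"column {j} has no observed genotypes")
        top = max(counts.values())
        # Break ties deterministically by taking the smallest value.
        fills.append(min(v for v, c in counts.items() if c == top))
    return [[fills[j] if v == MISSING else v for j, v in enumerate(r)] for r in rows]


# Toy genotypes (assumed 0/1/2 coding); rows are accessions, columns are SNPs.
toy = [
    [0, 2, -9],
    [0, -9, 1],
    [-9, 2, 1],
    [1, 0, 2],
]
print(impute_most_frequent(toy))
```

One difference from the deployed interface is worth keeping in mind: a fitted imputer applies fill values learned from its training data, whereas this toy computes them from the matrix it is given.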

**Reproducible environment.** A Docker image (recommended) bundles RAPIDS/cuDF, PyTorch, and Jupyter:

```bash
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --rm \
  -p 10000:8888 -p 8501:8501 -v "${PWD}":/workspace/mycode \
  abdelghafour1/ngc_tf_rapids_25_01_vscode_torch:2025-v3 \
  jupyter lab --ip=0.0.0.0 --allow-root
```

The demonstration also runs as a one-click Google Colab notebook. *At resubmission, the four
remaining trait models (grain weight, grain length, grain width, and grain length/width ratio) are
added alongside the heading-date model.*

---

## 10.5 Cross-platform example workflows and the `oryza19k` access cookbook  — *new*

We provide a small Python helper, **`oryza19k`**, that orchestrates the platforms' existing
interfaces (it is a client-side cookbook, **not** a new server/API), and three end-to-end example
workflows anchored to results in the paper. Each workflow is a runnable notebook (Colab or Docker)
that ships with its outputs already computed and **falls back to the precomputed tables if a live
service is unavailable**, so it always runs for a reader.
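
The live-then-precomputed fallback described above can be sketched as follows. This illustrates the pattern only and is not the `oryza19k` implementation; `with_fallback`, `live_query`, and `precomputed_lookup` are hypothetical names, and the stand-in functions replace a real REST call and a real table read.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def with_fallback(live: Callable[[], T], precomputed: Callable[[], T]) -> tuple[T, str]:
    """Try the live service first; on a network-type failure use the precomputed table.

    Returns the result together with a label naming the source that answered.
    """
    try:
        return live(), "live"
    except OSError as err:
        # ConnectionError, TimeoutError, and urllib's URLError are all OSError subclasses.
        print(f"live service unavailable ({err}); using precomputed table")
        return precomputed(), "precomputed"


def live_query() -> dict:
    # Stand-in for a REST call; it always fails here to demonstrate the fallback.
    raise ConnectionError("service offline")


def precomputed_lookup() -> dict:
    # Stand-in for reading a row from a deposited summary table.
    return {"gene": "Os03g0752800", "name": "MADS14"}


result, source = with_fallback(live_query, precomputed_lookup)
print(source, result)
```

Returning the source label alongside the result lets a notebook record whether a figure was produced from live or precomputed data.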

`oryza19k` functions: `lookup_gene`, `region_features`, `vep_effects` (Ensembl REST);
`region_genotypes(source="tabix"|"snpseek"|"local")`; `climate_for_gene`, `accession_climate`
(CLIMtools tables); `predict_trait` (pre-trained models); `summary_table` (precomputed deposit);
and `choose_platform` (the decision table in §10.0).

**Workflow A — From the *MADS50*/*MADS14* locus to a heading-date hypothesis.**
`lookup_gene("Os03g0752800")` (MADS14 → chr3:31,031,753–31,041,563) → genotypes across XI via
remote `tabix` or SNP-Seek → `vep_effects` for predicted consequences → `climate_for_gene` (top
variable: BIO6, minimum temperature of the coldest month) → join per-accession climate and heading
date → reproduce the haplotype–climate signal → `predict_trait("hdg_80head")` to rank candidates.

**Workflow B — Rare *TB1/FC1* high-effect variant → tiller number.**
Locate *TB1/FC1* and the promoter variant (Ensembl) → genotype the single site across the panel
(`tabix`/SNP-Seek) → allele frequency (≈ 38 carriers, 0.95%) → confirm the regulatory consequence
(VEP + the high-effect-variant table) → pull tiller number, group, and geography for carriers →
report the rare-allele effect with a confidence interval and a relatedness/geography check.
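
The final step of Workflow B, an effect estimate with a confidence interval, could be approached as in the sketch below. It uses a percentile bootstrap of the difference in mean tiller number between carriers and non-carriers; this is one reasonable choice and not necessarily the method used in the paper. The function name and the tiller counts are hypothetical, and the sketch does not address the relatedness/geography check, which matters because closely related carriers are not independent observations.

```python
import random
import statistics


def bootstrap_mean_diff(carriers, others, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(carriers) - mean(others)."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = []
    for _ in range(n_boot):
        c = rng.choices(carriers, k=len(carriers))  # resample with replacement
        o = rng.choices(others, k=len(others))
        diffs.append(statistics.fmean(c) - statistics.fmean(o))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    point = statistics.fmean(carriers) - statistics.fmean(others)
    return point, lo, hi


# Synthetic tiller counts for illustration only (not values from the paper).
carrier_tillers = [9, 10, 8, 11, 9, 10, 12, 9]
other_tillers = [14, 15, 13, 16, 14, 15, 13, 17, 14, 15, 16, 14]

point, lo, hi = bootstrap_mean_diff(carrier_tillers, other_tillers)
print(f"mean difference = {point:.2f} tillers, 95% CI [{lo:.2f}, {hi:.2f}]")
```

With a rare allele the carrier group is small, so the interval is driven mostly by the carrier resamples; reporting the carrier count next to the interval keeps that visible.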

**Workflow C — Whole-collection trait imputation and benchmarking.**
Load the 165,640 core-SNP matrix (or stream regions) → `predict_trait` on the accessions with
missing phenotypes → compare against the benchmark table (gradient-boosted trees vs
rrBLUP/GBLUP/BayesA–C and tabular deep/foundation models) → map the top SHAP SNPs to genes.

---

## 10.6 Efficient data access at scale  — *new*

**Remote `tabix` streaming — query 19,035 genomes without downloading them.** The per-reference
VCFs are bgzip-compressed and tabix-indexed; with a server that supports HTTP range requests, any
locus can be streamed directly:

```bash
# Stream just the MADS14 locus from the bgzipped 19K-RGP VCF — no full download.
tabix -h https://<public-host>/19K-RGP/IRGSP-1.0/19K-RGP.IRGSP-1.0.snps.vcf.gz 3:31031753-31041563

# Same, with bcftools, emitting a tidy genotype table:
bcftools view -r 3:31031753-31041563 https://<public-host>/.../19K-RGP.IRGSP-1.0.snps.vcf.gz \
  | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n'
```
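
The tab-separated output of the `bcftools query` command above (CHROM, POS, REF, ALT, then one GT per sample) can be reduced to per-site allele frequencies with a few lines of Python. The sketch below is illustrative: `alt_allele_frequency` is a hypothetical helper, the example lines are hand-written rather than real 19K-RGP genotypes, and all non-reference alleles at a multi-allelic site are pooled into a single ALT count.

```python
import re
from typing import Iterable, Iterator


def alt_allele_frequency(lines: Iterable[str]) -> Iterator[tuple[str, int, float, int]]:
    """Yield (chrom, pos, alt_frequency, n_called_alleles) per site."""
    for line in lines:
        chrom, pos, _ref, _alt, *gts = line.rstrip("\n").split("\t")
        called = alt = 0
        for gt in gts:
            for allele in re.split(r"[/|]", gt):  # unphased "/" or phased "|"
                if allele == ".":
                    continue  # missing allele: excluded from the denominator
                called += 1
                alt += allele != "0"  # any non-reference allele counts as ALT
        freq = alt / called if called else float("nan")
        yield chrom, int(pos), freq, called


# Hand-written example lines (illustrative, not real 19K-RGP genotypes).
example = [
    "3\t31031800\tG\tA\t0/0\t0/1\t1/1\t./.\n",
    "3\t31032000\tC\tT\t0|0\t0|0\t./.\t0|1\n",
]
for site in alt_allele_frequency(example):
    print(site)
```

To use it on a stream, pass `sys.stdin` as `lines` and pipe the `bcftools query` output into the script; because the function is a generator, memory use stays flat regardless of region size.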

```python
import pysam
vcf = pysam.VariantFile("https://<public-host>/.../19K-RGP.IRGSP-1.0.snps.vcf.gz")
for rec in vcf.fetch(region="3:31031753-31041563"):  # region strings are 1-based inclusive; start/stop arguments are 0-based half-open
    print(rec.chrom, rec.pos, rec.ref, rec.alts)
```

> Requires the VCFs to be served over HTTPS with range support; the one-time `bgzip`/`tabix`
> preparation is coordinated with the Gramene team.
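
Whether a host meets the range-request requirement can be checked before pointing `tabix` at it. The sketch below inspects the `Accept-Ranges` response header of a `HEAD` request using only the standard library. Both function names are hypothetical, and the demonstration runs on hand-written header sets so that it works offline.

```python
import urllib.request
from typing import Mapping


def advertises_byte_ranges(headers: Mapping[str, str]) -> bool:
    """True if the response headers advertise byte-range support (Accept-Ranges: bytes)."""
    normalized = {k.lower(): v.strip().lower() for k, v in headers.items()}
    return normalized.get("accept-ranges") == "bytes"


def check_remote_vcf(url: str, timeout: float = 10.0) -> bool:
    """Send a HEAD request and report whether the server advertises byte ranges."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return advertises_byte_ranges(dict(resp.headers.items()))


# Offline demonstration with hand-written header sets.
print(advertises_byte_ranges({"Accept-Ranges": "bytes", "Content-Type": "application/gzip"}))
print(advertises_byte_ranges({"Accept-Ranges": "none"}))
```

A server may honor range requests without advertising them, so a negative result is a prompt to test further (for example, a request with `Range: bytes=0-0` that returns status 206), not proof that streaming will fail. Remote `tabix` also expects the `.tbi` index to sit next to the VCF at the same URL plus the `.tbi` suffix.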

**Precomputed summary tables (no query needed).** A DOI-citable Zenodo deposit (mirrored on the
Gramene FTP and the KAUST repository) provides: a master accession/passport table with varietal
groups and the five-reference coordinate mappings; per-variety-group **allele frequencies** for the
165,640 core SNPs; a per-gene variant summary; the high-effect-regulatory-variant table; the
genome–environment-association hits per group; the five trait phenotype tables; the
genomic-prediction **benchmark** table; the SHAP-top-SNP→gene table; and the *MADS14/MADS50*
haplotype tables. See the deposit manifest for columns and formats.

**Honest limits.** SNP-Seek REST endpoint paths are being confirmed with IRRI; CLIMtools has no REST
API (tables only); pre-publication browser/FTP endpoints are credentialed and move to anonymous
production at publication, with the European Variation Archive (PRJEB105137), NCBI BioProjects
(PRJNA954521, PRJNA597070, PRJNA952097), the KAUST repository (DOI 10.25781/N3AF-NP78), and the
Zenodo deposit as the durable, credential-free records.
