Code & Models

All analysis pipelines for the paper, plus pre-trained AI models that predict agronomic traits directly from genotypes.

Repository: github.com/YongZhou2019/19K-RGP (MIT license). Reproducible environment via Docker; one-click Google Colab.

README — what it is & what it hosts

The repository holds the workflows, pipelines, and scripts behind the manuscript. Variant calling uses the HPC-GVCW pipeline (BMC Biology, DOI 10.1186/s12915-024-01820-5). Analysis modules:

Module | What it does
AI-drive Predictive Phenotype Modeling | Pre-trained trait-prediction models + demos (see below)
VariantCalling Scripts | GATK4 variant calling (Phase 1–4), via HPC-GVCW
GWAS scripts | GAPIT GWAS, QC, Manhattan plots
AlleleFrequency / RARE VARIANTS | Allele-frequency, MAF comparison, rare-variant statistics
AlphaFold / Molecular_dynamics | AF2/AF3 structure modeling; MD (gyration, energy) and EVO2 mutation scoring
cBasmati (LAI) | Local-ancestry inference and FST

Use this when you want to predict a trait for new genotypes, reproduce a figure, or run a pipeline end-to-end.

Tutorial — run the heading-date demo three ways

  1. Colab (one click): open the demo notebook and run the cells top to bottom.
  2. Docker (recommended): launch the bundled RAPIDS/PyTorch/Jupyter image (command in Examples), then open Heading_Date_Demo.ipynb.
  3. Python script: run python heading_date_predictor.py, which writes predictions and a distribution plot to output/.

The deployed model consumes the SHAP-selected top-1,000 SNPs (from the 165,640-SNP core set), not the full matrix — that is what makes inference fast.
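The selection step described above can be sketched as follows. This is a minimal illustration, assuming a precomputed SHAP value matrix (samples by SNPs) and a ranking by mean absolute SHAP value, which is a common convention; the source does not state the exact ranking rule used. The function name `top_k_snps`, the SNP IDs, and the toy values are hypothetical.

```python
import numpy as np


def top_k_snps(shap_values, snp_ids, k):
    """Return the k SNP IDs with the largest mean |SHAP| value.

    shap_values: array of shape (n_samples, n_snps), assumed precomputed.
    snp_ids: sequence of n_snps identifiers, in column order.
    """
    shap_values = np.asarray(shap_values, dtype=float)
    if shap_values.ndim != 2 or shap_values.shape[1] != len(snp_ids):
        raise ValueError("shap_values must have shape (n_samples, n_snps)")
    # Global importance per SNP: average magnitude of its SHAP values.
    importance = np.abs(shap_values).mean(axis=0)
    # Stable sort on negated importance: ties keep their original column order.
    order = np.argsort(-importance, kind="stable")[:k]
    return [snp_ids[i] for i in order]


# Toy data: 4 samples x 5 SNPs (hypothetical IDs and values).
toy_shap = np.array([
    [0.1, -2.0, 0.0, 0.5, 0.0],
    [0.2, 1.5, 0.0, -0.4, 0.1],
    [-0.1, 1.0, 0.0, 0.6, 0.0],
    [0.0, -1.5, 0.0, -0.5, 0.1],
])
toy_ids = ["snp_a", "snp_b", "snp_c", "snp_d", "snp_e"]
selected = top_k_snps(toy_shap, toy_ids, k=2)
print(selected)
```

In the setting described here, the same idea would apply with k = 1,000 over the 165,640-SNP core set, so the model only needs a small fraction of the columns at inference time.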

Workflow — score the whole collection for a trait

Goal: predict heading date for the ~5,500 accessions that lack a measured value. Provide a genotype frame with an IID/ID column plus the top-1,000 SNP features; oryza19k.predict_trait() aligns columns, imputes missing values with the shipped imputer, loads the model, and returns predictions. See Workflow C for the full benchmarked version.
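The align, impute, predict sequence above can be sketched roughly as below. This is not the actual implementation of oryza19k.predict_trait(), whose signature and internals are not shown here; `predict_trait_sketch`, `StoredValueImputer`, and `ToyLinearModel` are hypothetical stand-ins, and the sketch assumes only that the shipped imputer and model expose scikit-learn-style `transform` and `predict` methods.

```python
import numpy as np
import pandas as pd


def predict_trait_sketch(genotypes, feature_names, imputer, model, id_col="IID"):
    """Align SNP columns to the model's order, impute, and predict."""
    if id_col not in genotypes.columns:
        raise KeyError(f"missing ID column: {id_col}")
    # Reindex to the expected feature order; SNPs absent from the input
    # become all-NaN columns, and extra columns are dropped.
    X = genotypes.reindex(columns=feature_names).astype(float)
    X_imputed = imputer.transform(X.to_numpy())
    preds = model.predict(X_imputed)
    return pd.DataFrame({id_col: genotypes[id_col].to_numpy(), "prediction": preds})


class StoredValueImputer:
    """Stand-in imputer (hypothetical): fills NaN with stored per-feature values."""

    def __init__(self, fill_values):
        self.fill_values = np.asarray(fill_values, dtype=float)

    def transform(self, X):
        X = np.array(X, dtype=float)  # copy so the caller's data is untouched
        rows, cols = np.where(np.isnan(X))
        X[rows, cols] = self.fill_values[cols]
        return X


class ToyLinearModel:
    """Stand-in model (hypothetical): a fixed linear predictor."""

    def __init__(self, weights, intercept):
        self.weights = np.asarray(weights, dtype=float)
        self.intercept = float(intercept)

    def predict(self, X):
        return X @ self.weights + self.intercept


features = ["snp_1", "snp_2", "snp_3"]  # stands in for the top-1,000 SNP list
frame = pd.DataFrame({
    "IID": ["acc_1", "acc_2"],
    "snp_3": [2, 0],         # columns arrive in a different order
    "snp_1": [0, np.nan],    # one missing genotype call
    "extra": [1, 1],         # column the model does not use
})
result = predict_trait_sketch(
    frame,
    features,
    StoredValueImputer([1.0, 1.0, 1.0]),
    ToyLinearModel([1.0, 2.0, 3.0], intercept=80.0),
)
print(result)
```

Reindexing against the model's feature list, rather than trusting input column order, is what lets the same call accept frames that carry extra SNPs or lack some of the expected ones.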

Examples

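# Jupyter Lab is served on host port 10000 (container port 8888); host port 8501 is also forwarded.
# The current directory is mounted at /workspace/mycode inside the container.
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --rm \
  -p 10000:8888 -p 8501:8501 -v "${PWD}":/workspace/mycode \
  abdelghafour1/ngc_tf_rapids_25_01_vscode_torch:2025-v3 \
  jupyter lab --ip=0.0.0.0 --allow-root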

Currently only the heading-date model is public; the four remaining trait models (grain weight, grain length, grain width, and length-to-width ratio) will be added at resubmission.

Access & cite

The repository is public on GitHub (MIT license). A reproducible environment is available via the Docker image above; pre-trained models load with joblib.
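As a minimal sketch of the joblib round trip, the snippet below saves and reloads a stand-in object. The file name `toy_model.joblib` and the dictionary are hypothetical; the actual model file names in the repository are not listed here, so substitute the path of the shipped model.

```python
import tempfile
from pathlib import Path

import joblib

# Stand-in for a fitted model; any picklable object round-trips the same way.
toy_model = {"trait": "heading_date", "weights": [0.5, -1.2, 3.0]}

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "toy_model.joblib"  # hypothetical file name
    joblib.dump(toy_model, model_path)
    loaded_model = joblib.load(model_path)

print(loaded_model == toy_model)
```

Because joblib files are pickle-based, load model files only from sources you trust, such as the official repository.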

Cite the manuscript (see Cite & about) and the HPC-GVCW pipeline (BMC Biology, 2024).