Code & Models
All analysis pipelines for the paper, plus pre-trained AI models that predict agronomic traits directly from genotypes.
Repository: github.com/YongZhou2019/19K-RGP (MIT license). Reproducible environment via Docker; one-click Google Colab.
README — what it is & what it hosts
The repository holds the workflows, pipelines, and scripts behind the manuscript. Variant calling uses the HPC-GVCW pipeline (BMC Biology, DOI 10.1186/s12915-024-01820-5). Analysis modules:
| Module | What it does |
|---|---|
| AI-drive Predictive Phenotype Modeling | Pre-trained trait-prediction models + demos (see below) |
| VariantCalling Scripts | GATK4 variant calling (Phase 1–4), via HPC-GVCW |
| GWAS scripts | GAPIT GWAS, QC, Manhattan plots |
| AlleleFrequency / RARE VARIANTS | Allele-frequency, MAF comparison, rare-variant statistics |
| AlphaFold / Molecular_dynamics | AF2/AF3 structure modeling; MD (gyration, energy) and EVO2 mutation scoring |
| cBasmati (LAI) | Local-ancestry inference and FST |
Use this when… you want to predict a trait for new genotypes, reproduce a figure, or run a pipeline end-to-end.
Tutorial — run the heading-date demo three ways
- Colab (one click): open the demo notebook and run the cells top to bottom.
- Docker (recommended): launch the bundled RAPIDS/PyTorch/Jupyter image (command in Examples), then open `Heading_Date_Demo.ipynb`.
- Python script: `python heading_date_predictor.py` writes predictions and a distribution plot to `output/`.
The deployed model consumes the SHAP-selected top-1,000 SNPs (from the 165,640-SNP core set), not the full matrix — that is what makes inference fast.
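To make the feature-selection idea concrete, here is a minimal sketch of how a top-k SNP subset could be derived from a matrix of SHAP values. It assumes features are ranked by mean absolute SHAP value across samples, which is a common convention but is not confirmed by the source as the ranking used for the shipped model. The function name `top_k_features_by_shap` and the toy SNP names are hypothetical.

```python
import numpy as np


def top_k_features_by_shap(shap_values, feature_names, k):
    """Return the k feature names with the largest mean |SHAP| value.

    Hypothetical helper: the ranking rule (mean absolute SHAP) is an assumption.
    """
    shap_values = np.asarray(shap_values, dtype=float)
    if shap_values.ndim != 2 or shap_values.shape[1] != len(feature_names):
        raise ValueError("shap_values must be (n_samples, n_features)")
    # Global importance of each SNP: average magnitude of its contributions.
    importance = np.abs(shap_values).mean(axis=0)
    # Sort descending; a stable sort keeps the original order for ties.
    order = np.argsort(-importance, kind="stable")[:k]
    return [feature_names[i] for i in order]


# Toy example: 3 samples x 4 SNPs (the real case would be 165,640 SNPs, k=1000).
toy_shap = [
    [0.1, -2.0, 0.0, 0.5],
    [-0.2, 1.0, 0.0, -0.5],
    [0.3, -3.0, 0.1, 0.5],
]
toy_names = ["snp_a", "snp_b", "snp_c", "snp_d"]
top = top_k_features_by_shap(toy_shap, toy_names, k=2)
print(top)
```

Only the selected columns then need to be read and passed to the model at inference time, which is where the speed-up comes from.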
Workflow — score the whole collection for a trait
Goal: predict heading date for the ~5,500 accessions that lack a measured value. Provide a genotype frame with an IID/ID column plus the top-1,000 SNP features; oryza19k.predict_trait() aligns columns, imputes missing values with the shipped imputer, loads the model, and returns predictions. See Workflow C for the full benchmarked version.
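The steps listed above (align columns, impute, predict) can be sketched as follows. This is not the implementation of `oryza19k.predict_trait()`; it is an illustrative stand-in, and `predict_trait_sketch`, `MostFrequentImputerStub`, and `SumModelStub` are hypothetical names. The stubs exist only so the example runs without the shipped `.pkl` files; in practice the joblib-loaded imputer and model would be passed instead. The `-9` missing-genotype code is taken from the Examples snippet.

```python
import numpy as np
import pandas as pd

MISSING_CODE = -9  # missing-genotype code, as used in the Examples snippet


def predict_trait_sketch(genotypes, feature_names, imputer, model, id_col="IID"):
    """Align SNP columns, impute missing calls, and return per-accession predictions."""
    absent = [f for f in feature_names if f not in genotypes.columns]
    if absent:
        raise KeyError(f"{len(absent)} required SNP columns missing, e.g. {absent[:3]}")
    # Align: keep only the expected features, in the order the model was trained on.
    X = (
        genotypes.loc[:, list(feature_names)]
        .replace(MISSING_CODE, np.nan)
        .to_numpy(dtype="float64")
    )
    y_pred = model.predict(imputer.transform(X))
    return pd.DataFrame(
        {id_col: genotypes[id_col].to_numpy(), "prediction": np.asarray(y_pred)}
    )


class MostFrequentImputerStub:
    """Stand-in for a fitted imputer: fills NaN with a fixed per-column value."""

    def __init__(self, fill_values):
        self.fill_values = np.asarray(fill_values, dtype=float)

    def transform(self, X):
        X = X.copy()
        rows, cols = np.where(np.isnan(X))
        X[rows, cols] = self.fill_values[cols]
        return X


class SumModelStub:
    """Stand-in for a fitted model: predicts the row sum of the features."""

    def predict(self, X):
        return X.sum(axis=1)


# Toy frame: columns are out of order and include one unused extra column.
frame = pd.DataFrame(
    {"IID": ["acc1", "acc2"], "snp_b": [2, -9], "extra": [1, 1], "snp_a": [0, 1]}
)
result = predict_trait_sketch(
    frame, ["snp_a", "snp_b"], MostFrequentImputerStub([0.0, 2.0]), SumModelStub()
)
print(result)
```

Failing early on missing SNP columns is a deliberate choice in this sketch: silently filling a whole absent column with imputed values would produce predictions that look valid but are not.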
Examples
```shell
# Launch the bundled image; container port 8888 is published on host port 10000.
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --rm \
  -p 10000:8888 -p 8501:8501 -v "${PWD}":/workspace/mycode \
  abdelghafour1/ngc_tf_rapids_25_01_vscode_torch:2025-v3 \
  jupyter lab --ip=0.0.0.0 --allow-root
```

```python
import joblib
import numpy as np
import pandas as pd

# Load the shipped heading-date model and its imputer.
model = joblib.load("best_trained_model_hdg_80head2025.pkl")
imputer = joblib.load("most_freq_imputer_hdg_80head2025.pkl")
df = pd.read_csv("input_top_1000_features.csv")  # IID, Phenotype, + 1000 SNPs
# Treat -9 as a missing genotype, then impute and predict.
X = df.drop(columns=["IID", "Phenotype"]).replace(-9, np.nan).values.astype("float64")
y_pred = model.predict(imputer.transform(X))
```

Currently, only the heading-date model is public; the four remaining trait models (grain weight, grain length, grain width, and length-to-width ratio) will be added at resubmission.
Access & cite
Public: open on GitHub under the MIT license. Reproducible environment via the Docker image above; pre-trained models load with joblib.
Cite the manuscript (see Cite & about) and the HPC-GVCW pipeline (BMC Biology, 2024).