# ---
# jupyter:
#   jupytext:
#     formats: py:percent,ipynb
#     text_representation: {extension: .py, format_name: percent}
#   kernelspec: {display_name: Python 3, name: python3}
# ---

# %% [markdown]
# # Workflow C — Whole-collection trait imputation & benchmarking
#
# **19K-RGP cross-platform example workflow.** Predict an agronomic trait for accessions that lack a
# measured phenotype, and show how the gradient-boosted-tree models compare against the classical
# genomic-prediction methods the reviewers asked about (rrBLUP, GBLUP, BayesA–C). This directly
# substantiates the AI-prediction section.
#
# Runs in Colab or the project Docker image. The prediction step uses the pre-trained model from
# `github.com/YongZhou2019/19K-RGP`; the benchmark uses the shipped precomputed table.

# %% [markdown]
# ## What you need — quick setup
#
# This notebook **runs top-to-bottom as-is** — only the prediction step needs the model files
# (it falls back to a note otherwise), so nothing crashes:
#
# | Step | Uses | Runs now? | To enable the full step |
# |------|------|-----------|--------------------------|
# | 1    | Benchmark table (23 models × 5 traits) | yes (bundled) | — |
# | 2    | `predict_trait` — pre-trained model | prints where the output lives | `git clone https://github.com/YongZhou2019/19K-RGP`, then pass `models_dir=".../AI-drive Predictive Phenotype Modeling"` |
# | 3    | Phenotype coverage by group | yes (bundled) | — |
#
# **Dependencies:** `pip install pandas numpy` (already in Colab); add `joblib scikit-learn xgboost
# lightgbm` only for Step 2's model.

# %%
import sys, os
sys.path.insert(0, os.path.dirname(os.path.abspath("oryza19k.py")) or ".")
import numpy as np, pandas as pd
import oryza19k as o19

TRAIT = "hdg_80head"   # heading date (the public demo trait); others added at resubmission

# %% [markdown]
# ## Step 1 — The genomic-prediction benchmark (gradient-boosted trees vs classical & deep models)
# 23 models × 5 traits, by Spearman correlation (10-fold CV). The gradient-boosted-tree ensemble is
# competitive with or better than rrBLUP/GBLUP/BayesA–C, with the largest margins on the harder
# traits (e.g. grain weight).

# %%
bench = o19.summary_table("benchmark")
for trait in ["Heading date", "Grain weight"]:
    top = bench[bench["trait"] == trait].sort_values("spearman", ascending=False)
    print(f"\n=== {trait} (top models by Spearman) ===")
    print(top[["model", "family", "spearman", "r2"]].head(6).to_string(index=False))

# %% [markdown]
# ## Step 2 — Score accessions that have no measured phenotype
# The deployed model consumes the SHAP-selected top-1,000 SNPs. Provide a genotype frame with an
# `IID`/`ID` column and those features; `predict_trait` aligns columns and imputes missing values.

# %%
try:
    # the repo ships the preprocessed top-1000-feature CSV for the heading-date demo:
    to_predict = pd.read_csv("hdg_80head2025_to_predict_preprocessed_combined_top_1000_features.csv")
    preds = o19.predict_trait(to_predict, trait=TRAIT,
                              models_dir="AI-drive Predictive Phenotype Modeling")
    print(f"Predicted {TRAIT} for {len(preds)} accessions:")
    print(preds.head().to_string(index=False))
    preds.to_csv("predicted_heading_date.csv", index=False)
except Exception as e:
    print(f"Model/input not present here ({type(e).__name__}). In the repo, this writes")
    print("  output/heading_date_predictions_on_new_samples_with_ID_using_ultimate_model.csv")
    print("for the ~5,500 accessions lacking a measured heading date.")

# %% [markdown]
# ## Step 3 — How many accessions could be imputed? (passport + phenotype coverage)
# Use the precomputed passport / phenotype tables to quantify the imputation opportunity per group.

# %%
try:
    ph = o19.summary_table("phenotypes")               # accession x 24 traits + group
    have = ph[TRAIT].notna()
    print(f"{TRAIT}: measured for {have.sum():,} accessions; "
          f"{(~have).sum():,} could be imputed by the model.")
    print("\nBy varietal group (measured / total):")
    grp = ph.groupby("Group")[TRAIT].agg(measured="count", total="size")
    print(grp.to_string())
except Exception as e:
    print(f"Run precomputed_tables/generate_tables.py first ({type(e).__name__}).")

# %% [markdown]
# **Result.** A runnable path to (i) see how the AI models compare against standard genomic-
# prediction baselines and (ii) impute a trait across the whole collection at negligible cost —
# exactly the resource utility the reviewers wanted substantiated.