# ---
# jupyter:
#   jupytext:
#     formats: py:percent,ipynb
#     text_representation:
#       extension: .py
#       format_name: percent
#   kernelspec:
#     display_name: Python 3
#     name: python3
# ---

# %% [markdown]
# # Workflow A — From the *MADS50/MADS14* locus to a heading-date hypothesis
#
# **19K Rice Genome Project (19K-RGP) — cross-platform example workflow.**
#
# This notebook walks from a flowering-time gene of interest to a testable, breeder-actionable
# hypothesis, touching every platform of the resource:
#
# 1. **GrameneOryza / Ensembl REST** — resolve the gene and its coordinates, list overlapping
#    variants, and predict their consequences.
# 2. **SNP-Seek / remote `tabix`** — pull genotypes across the *Xian/Indica* (XI) accessions.
# 3. **Oryza CLIMtools** — find the climate variable most associated with the gene.
# 4. **Code & Models** — predict heading date genome-wide with the pre-trained model.
#
# It uses the bundled `oryza19k` access cookbook. Every networked step degrades gracefully:
# if a live service is unavailable, the step falls back to a shipped precomputed table, so the
# notebook always runs. Run it in Google Colab or the project Docker image.

# %% [markdown]
# ## What you need — quick setup
#
# This notebook **runs top-to-bottom as-is** (an internet connection covers the Ensembl steps).
# Anything that needs an extra resource falls back to a bundled table or a short note, so it never
# crashes — enable the full version of a step once you have its resource:
#
# | Step | Uses | Runs now? | To enable the full step |
# |------|------|-----------|--------------------------|
# | 1–2  | Ensembl REST — gene, variants, VEP | yes (needs internet) | — |
# | 3    | `tabix` genotype streaming | falls back to the allele-frequency table | `o19.CONFIG.vcf_base = "https://<public-host>/19K-RGP/<ref>/"` (public at publication), or pass `vcf_url=...` |
# | 4    | CLIMtools (G×E) association | prints the expected result (BIO6) | drop the GenoCLIM export at `precomputed_tables/climtools_genoclim.tsv` |
# | 5    | `predict_trait` — pre-trained model | prints where the output lives | `git clone https://github.com/YongZhou2019/19K-RGP`, then pass `models_dir=".../AI-drive Predictive Phenotype Modeling"` |
# | 6    | Benchmark table | yes (bundled) | — |
#
# **Dependencies:** `pip install pandas numpy requests` (already in Colab). Add `pyarrow` for fast
# local genotype slices, and `joblib scikit-learn xgboost lightgbm` only for Step 5's model.

# %%
# Setup — `oryza19k` lives alongside this notebook. Only stdlib + pandas/numpy are required here.
import sys, os
sys.path.insert(0, os.path.dirname(os.path.abspath("oryza19k.py")) or ".")
import pandas as pd
import oryza19k as o19

pd.set_option("display.width", 140, "display.max_columns", 20)
print("oryza19k loaded. Decision table — where each task lives:")
o19.choose_platform()

# %% [markdown]
# ## Step 1 — Resolve the gene (GrameneOryza / Ensembl REST)
# *MADS14* (`Os03g0752800`) is a MADS-box flowering-time transcription factor highlighted in the
# paper. We resolve it to genome coordinates with the documented, stable Ensembl REST API.

# %%
gene = o19.lookup_gene("Os03g0752800")          # MADS14
chrom, start, end = gene["seq_region_name"], gene["start"], gene["end"]
print(f"{gene['display_name']}  chr{chrom}:{start:,}-{end:,}  strand {gene['strand']}  ({gene['biotype']})")
# The companion flowering-time gene OsMADS50 (Os03g0122600) is handled identically.

# %% [markdown]
# ## Step 2 — Variants and predicted effects in the locus (Ensembl REST + VEP)

# %%
try:
    feats = o19.region_features(f"{chrom}:{start}-{end}", feature="variation")
    print(f"{len(feats)} known variants overlap the locus")
    display_cols = [c for c in ("id", "start", "consequence_type") if c in feats.columns]
    print(feats[display_cols].head(8).to_string(index=False) if display_cols else feats.head())
except Exception as e:
    print("Live overlap unavailable; in the paper the locus carries ~33 haplotype-defining "
          "variants (Supplementary Table 28).", f"[{type(e).__name__}]")

# Predicted consequence of a real group-specific variant in the locus (chrom:pos:alt):
try:
    vep = o19.vep_effects(f"{chrom}:31037240:A")        # Chr03_31037240 T>A (GJ_trop-enriched)
    print(vep.head().to_string(index=False))
except Exception as e:
    print("VEP step is documented in Supplementary Note 10.2.", f"[{type(e).__name__}]")

# %% [markdown]
# ## Step 3 — Genotypes across XI accessions (remote `tabix`, no full download)
# The efficient way to query 19,035 genomes is to **stream just this region** from the bgzipped VCF.
# Set `oryza19k.CONFIG.vcf_base` to the public host (credential-free, at publication), or pass
# `vcf_url=`. Offline, we fall back to a local slice / the precomputed allele-frequency table.

# %%
try:
    gt = o19.region_genotypes(chrom, start, end, ref="IRGSP-1.0", source="tabix")
    print(f"Streamed {len(gt)} variants from the region VCF (no full download).")
except Exception as e:
    print(f"Remote tabix not configured here ({type(e).__name__}). Falling back to precomputed "
          "allele frequencies for the locus:")
    try:
        af = o19.summary_table("allele_freq_by_group")
        chrom_key = f"Chr{int(chrom):02d}" if str(chrom).isdigit() else str(chrom)
        locus = af[(af["chrom"] == chrom_key) & af["pos"].between(start, end)]
        print(locus.head(10).to_string(index=False) if len(locus) else
              "(no core SNPs fall inside this locus; the thinned core set is genome-wide.)")
    except Exception as e2:
        print(f"  precomputed table not generated yet ({type(e2).__name__}); see precomputed_tables/.")

# %% [markdown]
# ## Step 4 — Climate association for the gene (Oryza CLIMtools)
# CLIMtools has no REST API; we read its exported GenoCLIM result table. For *MADS14* in XI the top
# associated variable is **BIO6** (minimum temperature of the coldest month) — consistent with a
# cold-adaptation, flowering-time interpretation.

# %%
try:
    clim = o19.climate_for_gene("MADS14", group="XI")
    print(clim.head().to_string(index=False))
except Exception as e:
    print("Ship the GenoCLIM export as precomputed_tables/climtools_genoclim.tsv to enable this "
          f"step ({type(e).__name__}). Expected top variable: BIO6 (min temp, coldest month).")

# %% [markdown]
# ## Step 5 — Predict heading date with the pre-trained model (Code & Models)
# The deployed model uses the SHAP-selected top-1,000 SNPs. Point `models_dir` at a clone of
# `github.com/YongZhou2019/19K-RGP` (the `AI-drive Predictive Phenotype Modeling` folder).

# %%
# Minimal illustration of the interface (runs where joblib/sklearn + the model files are present):
try:
    # genotype_df: columns IID/ID, Phenotype (optional), + the top-1000 SNP features
    demo = pd.read_csv("hdg_80head2025_to_predict_preprocessed_combined_top_1000_features.csv")
    preds = o19.predict_trait(demo, trait="hdg_80head", models_dir="AI-drive Predictive Phenotype Modeling")
    print(preds.head().to_string(index=False))
except Exception as e:
    print(f"Model/input not present in this environment ({type(e).__name__}).")
    print("In the repo's committed demo output, predicted heading date for new samples is written to")
    print("  output/heading_date_predictions_on_new_samples_with_ID_using_ultimate_model.csv")

# %% [markdown]
# ## Step 6 — Put it together: is the locus' variation consistent with cold-driven flowering?
# Compare the benchmark context for heading date — gradient-boosted trees vs classical genomic
# prediction — using the shipped benchmark table.

# %%
try:
    bench = o19.summary_table("benchmark")
    hd = bench[bench["trait"] == "Heading date"].sort_values("spearman", ascending=False)
    print(hd[["model", "family", "spearman"]].head(8).to_string(index=False))
except Exception as e:
    print(f"Run precomputed_tables/generate_tables.py to build the benchmark table ({type(e).__name__}).")

# %% [markdown]
# **Result.** In a few credential-free steps we resolved *MADS14*, inspected its variants and their
# predicted effects, saw how to stream genotypes across all accessions without downloading them,
# retrieved the gene's top climate association (BIO6), and scored heading date genome-wide — a
# concrete, reproducible path from a locus to a testable adaptation hypothesis.