# Precomputed Summary Tables — Zenodo Deposit Manifest

DOI-citable summary tables for the 19K Rice Genome Project (19K-RGP), so that the most
common questions about the resource can be answered with **no large query**. This directly
addresses Reviewer #1.6's request for *precomputed summaries*.

**Hosting plan (at publication):** Zenodo (mints a versioned DOI) → mirrored on the Gramene
FTP (`…/oryza/19K-RGP/summaries/`) and the KAUST repository (DOI `10.25781/N3AF-NP78`).
Cite the **Zenodo DOI**, not a hostname.

**Provenance:** Tables 1–4 are regenerated from local data by `generate_tables.py`
(reproducible, no credentials required). Tables 5–9 are exported from cluster-scale variant
data by the platform teams and are listed here so that the catalog is complete.

---

## Generated here (`generate_tables.py`)

| # | File | Key columns | Format / size | Answers |
|---|------|-------------|---------------|---------|
| 1 | `genomic_prediction_benchmark.tsv` | model, family, trait, spearman, r2, training_time_sec | TSV, ~10 KB (115 rows = 23 models × 5 traits) | example workflows; head-to-head vs rrBLUP/GBLUP/BayesA-C (Rev #1.4) |
| 2 | `accession_passport.tsv` | accession, varietal_group, n_phenotypes_scored | TSV, ~0.5 MB (16,779 accessions) | precomputed summaries; panel composition (Rev #1.2) |
| 3 | `phenotypes_by_accession.tsv` | accession, 24 phenotypes, varietal_group | TSV, ~3 MB | precomputed summaries; example workflows |
| 4 | `allele_frequency_core_snps.tsv` | snp, chrom, pos, ref, alt, n_called, alt_allele_freq, maf, af_\<group\> | TSV, ~15 MB (165,640 SNPs) | precomputed summaries; rare/group-specific variants (Rev #1.3, #2) |

Varietal groups present in the local labels: XI_indica, GJ_trop, GJ_temp, GJ-adm, cAus,
cBasmati (plus "Unknown" for accessions not yet group-assigned).
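
As an illustration of how table 4 might answer a rare or group-specific variant question without touching the full genotype matrix, the sketch below scans rows for SNPs whose alternate allele is rare across the whole panel but common in one varietal group. It is a minimal example, not part of `generate_tables.py`. The per-group column name `af_cAus` is an assumption inferred from the `af_<group>` pattern, and the function name and both thresholds are hypothetical.

```python
import csv


def group_enriched_snps(rows, group_col, min_group_af=0.2, max_panel_af=0.05):
    """Return ids of SNPs whose alt allele is rare panel-wide but common in one group.

    `rows` is any iterable of dicts with the table 4 columns `snp`,
    `alt_allele_freq`, and a per-group column such as `af_cAus`
    (the exact per-group column names are an assumption).
    """
    hits = []
    for row in rows:
        try:
            panel_af = float(row["alt_allele_freq"])
            group_af = float(row[group_col])
        except ValueError:
            continue  # skip rows with empty or non-numeric frequencies
        if panel_af <= max_panel_af and group_af >= min_group_af:
            hits.append(row["snp"])
    return hits


def read_table(path):
    """Stream a tab-separated table as dicts keyed by its header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


if __name__ == "__main__":
    # Tiny in-memory stand-in for allele_frequency_core_snps.tsv (made-up values).
    demo = [
        {"snp": "s1", "alt_allele_freq": "0.03", "af_cAus": "0.40"},
        {"snp": "s2", "alt_allele_freq": "0.45", "af_cAus": "0.50"},
        {"snp": "s3", "alt_allele_freq": "0.02", "af_cAus": "0.10"},
    ]
    print(group_enriched_snps(demo, "af_cAus"))  # -> ['s1']
```

On the real file, `group_enriched_snps(read_table("allele_frequency_core_snps.tsv"), "af_cAus")` would stream the ~15 MB table row by row, so it never needs to be held in memory at once.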

## Exported by the platform teams (cluster-scale; full ~57 M variants)

| # | File | Key columns | Owner | Answers |
|---|------|-------------|-------|---------|
| 5 | `gene_variant_summary.tsv` | gene, n_variants, n_high_impact, consequence_counts, pangene_class | Gramene | precomputed summaries |
| 6 | `hev_table.tsv` | variant, gene, pred_delta_accessibility, group, motif (e.g. OsBZR1) | KAUST | example workflow B; transparency (Rev #2.2/#2.4) |
| 7 | `gea_hits_by_group.tsv` | variant, climate_variable, effect, P, FDR, group | CLIMtools | lets reviewers re-test FDR (Rev #2.6) |
| 8 | `mads14_mads50_haplotypes.tsv` | gene, group, haplotype, defining_variants, mean_heading_date, mean_BIO6 | KAUST | flagship workflow A reproducibility |
| 9 | `shap_top_snp_to_gene.tsv` | trait, snp, shap_importance, nearest_gene, known_novel | KAUST (Halimi) | interpretability; example workflow C |
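
For readers who want to re-test the FDR values in table 7, one possible approach is to recompute adjusted p-values from the `P` column and compare them with the reported `FDR` column. The sketch below uses the Benjamini-Hochberg procedure applied separately within each `group`. Both choices are assumptions: this manifest does not state which correction the CLIMtools export uses, nor whether it is applied per group or across all tests. The function names are hypothetical.

```python
from collections import defaultdict


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def adjusted_by_group(rows, p_col="P", group_col="group"):
    """Apply BH within each group; returns one adjusted value per input row.

    `rows` is a list of dicts, e.g. rows of gea_hits_by_group.tsv read with
    csv.DictReader. Per-group correction is an assumption, not a documented
    property of the export.
    """
    members = defaultdict(list)
    for idx, row in enumerate(rows):
        members[row[group_col]].append((idx, float(row[p_col])))
    out = [None] * len(rows)
    for pairs in members.values():
        adj = benjamini_hochberg([p for _, p in pairs])
        for (idx, _), value in zip(pairs, adj):
            out[idx] = value
    return out
```

Note that a table containing only significant hits cannot reproduce an FDR computed over all tests, because the number of tests `m` would be undercounted. If that is the case here, the total test count per group would need to come from the exporting team.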

## Large master artifacts (Zenodo, not git)

- `genotypes_with_phenos.parquet` — the 165,640 core-SNP × 16,779-accession genotype+phenotype matrix (the input to the AI models). ~1–2 GB.
- Per-reference bgzipped + tabix-indexed VCFs (for remote `tabix` streaming) — served from the public Gramene FTP at publication.

---

## Reproduce tables 1–4

```bash
cd precomputed_tables
python generate_tables.py            # build all four tables (the allele-frequency pass reads ~1–2 GB)
python generate_tables.py --quick    # build tables 1–3 only; skip the allele-frequency pass
```
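
After regenerating, a quick sanity check is to compare row counts against the figures quoted in this manifest (115 benchmark rows, 16,779 accessions, 165,640 SNPs). The sketch below is a hypothetical helper, not part of `generate_tables.py`. It assumes each TSV has exactly one header row and one record per line, and it takes the output directory as an argument because the location where the script writes its files is not stated here.

```python
import csv
from pathlib import Path

# Row counts quoted in the manifest tables above (data rows, header excluded).
EXPECTED_ROWS = {
    "genomic_prediction_benchmark.tsv": 115,   # 23 models x 5 traits
    "accession_passport.tsv": 16_779,          # one row per accession
    "allele_frequency_core_snps.tsv": 165_640, # one row per core SNP
}


def count_data_rows(path):
    """Count records in a TSV, assuming a single header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the header if the file is non-empty
        return sum(1 for _ in reader)


def check_tables(directory, expected=EXPECTED_ROWS):
    """Return {filename: (expected, found)} for each mismatch.

    `found` is None when the file is missing, e.g. table 4 after a
    `--quick` run.
    """
    problems = {}
    for name, want in expected.items():
        path = Path(directory) / name
        found = count_data_rows(path) if path.exists() else None
        if found != want:
            problems[name] = (want, found)
    return problems
```

Calling `check_tables(".")` from the output directory returns an empty dict when all three counts match. `phenotypes_by_accession.tsv` is left out because the manifest does not quote a row count for it.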
