Precomputed Summary Tables — Zenodo Deposit Manifest

These DOI-citable summary tables for the 19K Rice Genome Project (19K-RGP) answer the most common questions about the resource without requiring a large query. They directly address Reviewer #1.6's request for precomputed summaries.

Hosting plan (at publication): Zenodo (mints a versioned DOI) → mirrored on the Gramene FTP (…/oryza/19K-RGP/summaries/) and the KAUST repository (DOI 10.25781/N3AF-NP78). Cite the Zenodo DOI, not a hostname.

Provenance: Tables 1–4 are regenerated from local data by generate_tables.py (reproducible, no credentials required). Tables 5–9 are exported from cluster-scale variant data by the platform teams and are listed here so that the catalog is complete.


Generated here (generate_tables.py)

| # | File | Key columns | Format / size | Answers |
|---|------|-------------|---------------|---------|
| 1 | genomic_prediction_benchmark.tsv | model, family, trait, spearman, r2, training_time_sec | TSV, ~10 KB (115 rows = 23 models × 5 traits) | example workflows; head-to-head vs rrBLUP/GBLUP/BayesA-C (Rev #1.4) |
| 2 | accession_passport.tsv | accession, varietal_group, n_phenotypes_scored | TSV, ~0.5 MB (16,779 accessions) | precomputed summaries; panel composition (Rev #1.2) |
| 3 | phenotypes_by_accession.tsv | accession, 24 phenotypes, varietal_group | TSV, ~3 MB | precomputed summaries; example workflows |
| 4 | allele_frequency_core_snps.tsv | snp, chrom, pos, ref, alt, n_called, alt_allele_freq, maf, af_\<group\> | TSV, ~15 MB (165,640 SNPs) | precomputed summaries; rare/group-specific variants (Rev #1.3, #2) |

Varietal groups present in the local labels: XI_indica, GJ_trop, GJ_temp, GJ-adm, cAus, cBasmati (plus "Unknown" for accessions not yet group-assigned).
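
As a rough illustration of how the frequency columns in table 4 relate to one another, the sketch below derives n_called, alt_allele_freq, maf, and per-group af_\<group\> values for a single SNP. It assumes diploid genotypes coded as alternate-allele dosages (0, 1, 2) with None for uncalled accessions. That encoding, the function name, and the toy data are illustrative assumptions, not a description of what generate_tables.py actually does.

```python
from collections import defaultdict


def allele_frequencies(dosages, groups):
    """Summarize one SNP.

    dosages: alt-allele counts per accession (0, 1, 2), or None if uncalled
             (assumed encoding; the manifest does not specify it).
    groups:  varietal-group label per accession, in the same order.
    """
    called = [(d, g) for d, g in zip(dosages, groups) if d is not None]
    n_called = len(called)
    if n_called == 0:
        return {"n_called": 0, "alt_allele_freq": None, "maf": None}

    # Each called diploid accession contributes two alleles.
    alt_freq = sum(d for d, _ in called) / (2 * n_called)
    row = {
        "n_called": n_called,
        "alt_allele_freq": alt_freq,
        "maf": min(alt_freq, 1 - alt_freq),
    }

    # Per-group alternate-allele frequency: one af_<group> entry per label.
    per_group = defaultdict(list)
    for d, g in called:
        per_group[g].append(d)
    for g, ds in per_group.items():
        row[f"af_{g}"] = sum(ds) / (2 * len(ds))
    return row


# Toy data: six accessions, one of them uncalled.
example = allele_frequencies(
    dosages=[0, 2, 2, None, 1, 0],
    groups=["XI_indica", "XI_indica", "GJ_temp", "GJ_temp", "cAus", "cAus"],
)
print(example)
```

Under these assumptions, a SNP that is common overall can still be rare or fixed within one group, which is the pattern the af_\<group\> columns are meant to expose.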

Exported by the platform teams (cluster-scale; full ~57 M variants)

| # | File | Key columns | Owner | Answers |
|---|------|-------------|-------|---------|
| 5 | gene_variant_summary.tsv | gene, n_variants, n_high_impact, consequence_counts, pangene_class | Gramene | precomputed summaries |
| 6 | hev_table.tsv | variant, gene, pred_delta_accessibility, group, motif (e.g. OsBZR1) | KAUST | example workflow B; transparency (Rev #2.2/#2.4) |
| 7 | gea_hits_by_group.tsv | variant, climate_variable, effect, P, FDR, group | CLIMtools | lets reviewers re-test FDR (Rev #2.6) |
| 8 | mads14_mads50_haplotypes.tsv | gene, group, haplotype, defining_variants, mean_heading_date, mean_BIO6 | KAUST | flagship workflow A reproducibility |
| 9 | shap_top_snp_to_gene.tsv | trait, snp, shap_importance, nearest_gene, known_novel | KAUST (Halimi) | interpretability; example workflow C |
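
Table 7 carries both P and FDR so that the correction can be re-tested. One way to do that is sketched below, assuming the FDR column holds Benjamini-Hochberg adjusted p-values. The manifest does not state which method was used, so treat that as an assumption; the p-values here are made up for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


toy_p = [0.01, 0.04, 0.03, 0.20]  # illustrative values, not from the dataset
toy_fdr = benjamini_hochberg(toy_p)
print(toy_fdr)
```

If the original correction was applied within each group or climate variable rather than across the whole file, the same function would need to be run on each subset separately before comparing against the FDR column.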

Large master artifacts (Zenodo, not git)

  • genotypes_with_phenos.parquet — the 165,640 core-SNP × 16,779-accession genotype+phenotype matrix (the input to the AI models). ~1–2 GB.
  • Per-reference bgzipped + tabix-indexed VCFs (for remote tabix streaming) — served from the public Gramene FTP at publication.

Reproduce tables 1–4

cd precomputed_tables
python generate_tables.py            # all four (allele-freq pass reads ~1–2 GB)
python generate_tables.py --quick    # skip the allele-frequency pass
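
After regenerating, a quick sanity check is to compare row counts against the sizes quoted in this manifest (115 benchmark rows, 16,779 accessions, 165,640 SNPs). The sketch below assumes each TSV has exactly one header line and sits in the current directory. The helper names are hypothetical and not part of generate_tables.py.

```python
import csv
from pathlib import Path

# Row counts quoted in the manifest. Table 3 has no stated count, so it is omitted.
EXPECTED_ROWS = {
    "genomic_prediction_benchmark.tsv": 115,  # 23 models x 5 traits
    "accession_passport.tsv": 16_779,
    "allele_frequency_core_snps.tsv": 165_640,
}


def count_data_rows(path):
    """Count rows after the header line (assumes exactly one header row)."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the header
        return sum(1 for _ in reader)


def check_tables(directory=".", expected=EXPECTED_ROWS):
    """Return {filename: (found, expected)} for missing or mismatched files."""
    mismatches = {}
    for name, want in expected.items():
        path = Path(directory) / name
        found = count_data_rows(path) if path.exists() else None
        if found != want:
            mismatches[name] = (found, want)
    return mismatches


if __name__ == "__main__":
    for name, (found, want) in check_tables().items():
        print(f"{name}: found {found}, expected {want}")
```

When run after `--quick`, the allele-frequency file would be reported as missing (found None), which is expected because that pass is skipped.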