
Plant materials and sample preparation
For genome sequencing, we selected the doubled haploid (DH) A. cepa line DHW30006. It was derived from the female gametophyte of ‘Wonye 30006’, which was a short-day yellow onion9. The onion plants were grown in a greenhouse for 4 months in pots. Leaves from one young onion plant were harvested for DNA extraction (Fig. 2a). For IsoSeq and mRNAseq for gene prediction, DHW30006 onions were sown in September, overwintered, and grown in the field at the Allium Vegetable Research Center (National Institute of Horticultural and Herbal Science, NIHHS), Muan, South Korea (34°58′02.7″N, 126°27′06.8″E). Different tissues were harvested individually: a root, a bulb, and a leaf from a young plant, as well as two developmental stages of roots, bulbs, and flowers from March to May (Fig. 2b). Harvested samples for DNA and RNA extractions were frozen immediately in liquid nitrogen and stored −70 °C deep freezer until extraction. For isolation of high molecular weight genomic DNA, it was extracted from eight grams of young leaves from one young onion plant as previously described AquaPhenol (MPbio, USA) extraction protocol10. Total RNAs were extracted using Trizol reagent (ambion, USA) according to the manufactural instruction. To eliminate residual DNA fragments, it was treated with a DNaseI (Qiagen, Germany) and purified with one more Trizol and Chloroform treatment.
DHW30006 onion plants used for DNA and RNA sequencing. (a) For genome sequencing, DHW30006 onion plants were grown in a greenhouse for four months and young onion leaves were harvested. (b) For IsoSeq and mRNAseq, DHW30006 onions were sown in September, overwintered, and grown in the farm of the Allium Vegetable Research Center. And then it was harvested separately by tissues; a root, a bulb and a leaf from a young plant, two developmental stages of roots, bulbs and flowers.
PacBio, Illumina, Hi-C and RNA sequencing
To generate high-quality genome sequences, both long-read sequencing using the PacBio Sequel II system and short-read sequencing using the Illumina NovaSeq 6000 platform were performed. High-molecular-weight genomic DNA was sheared into ~20 kb fragments using a G-tube (Covaris, Woburn, MA, USA), following the manufacturer’s recommended protocol for long-read genome sequencing. The SMRTbell™ library was constructed using the SMRTbell™ Template Prep Kit (Pacific Biosciences, Menlo Park, CA, USA). After annealing the sequencing primer to the SMRTbell template, DNA polymerase was bound to the complex using the Sequel™ Binding Kit (Pacific Biosciences). The prepared library was then loaded onto the PacBio Sequel II system for sequencing. A total of one hundred Sequel™ 1 M SMRT cells were used, and each SMRT cell was subjected to a 600-minute movie runtime to capture sequencing data. For short-read sequencing, genomic DNA was randomly fragmented into ~350 bp and ~550 bp fragments using the Covaris S2 system (Covaris Inc., Woburn, MA, USA). The fragmented DNA was end-repaired, and Illumina sequencing adapters were ligated to the processed DNA fragments. The final short-read library was sequenced on an Illumina NovaSeq 6000 system (Illumina Inc., San Diego, CA, USA) using a paired-end (2 × 150 bp) sequencing strategy to generate high-coverage short-read data. PacBio long-read data were produced a total of 1,006,477,019,394 bp (63-fold coverage) with N50 read length 26,289 bp, and Illumina short-read data were produced a total of 855,328,332,552 bp (53.5-fold coverage). For scaffolding, seven CHiCAGO libraries and five Dovetail Hi-C libraries were prepared in a similar manner as described previously11,12, sequenced on an Illumina Novaseq 6000 by 2 × 100 sequencing, and produced a total of 483,185,565,306 bp and 177,737,984,042 bp, respectively.
Iso-Seq sequencing and analysis were done by the PacBio Sequel sequencing platform and IsoSeq3 pipeline, and produced a total of 3,407,596,089 bp with mean read length 1,679 bp. For RNA sequencing, the libraries were prepared according to the manufacturer’s instructions (Illumina Truseq stranded mRNA library prep kit). RNA sequencing was performed using an Illumina NovaSeq 6000 system following provided protocols for 2 × 100 PE sequencing and produced a total of 46,525,891,188 bp. RNAseq reads were preprocessed using cutadapt v.2.813 and mapped using the aligner STAR v.2.7.1a14.
Genome size estimation and k-mer analysis
To estimate genome size, we used whole genome sequencing data, k-mer counting by Jellyfish version 2.1.315 with the k-mer size set to 17, 19 and 25 were used, and the genome size can be estimated using the following formula: Genome size = total number of nucleotides/peak depth of k-mer frequency distribution. Additionally, we were analyzed using the GenomeScope16 website to obtain estimates for genome sizes (k = 25, 12.19 Gb), heterozygosity (0.0257%) and duplication levels (0.556%) (Fig. 3).

Onion genome size estimation. (a) K-mer histogram plot using Jellyfish by 17 (red), 19 (green) and 25 (blue) mer (b) GenomeScope results with total genome length (len), percent of the genome that is unique (uniq), overall rate of hetrozygosity (het), mean k-mer coverage for heterozygous bases (kcov), average rate of duplication (dup) and k-mer size (k).
Genome assembly and scaffolding
de novo assembly was conducted using FALCON-Unzip assembler17 with filtered subreads sequences (Fig. 4). The length cut-off option was specified based on the subreads N50 value 26,289 bp. We got primary assembly by unzip for the phased diploid assembly and polished using Arrow consensus algorithm. We performed error correction with 350-bp and 550-bp Illumina short reads using Pilon18 with haplotig-merged primary contigs by default parameters to improve the base quality of genome assembly. After error correction, the primary assembly was extracted.

Flowchart of genome assembly and annotation of A. cepa DHW30006.
For scaffolding, the input de novo assembly, shotgun reads, CHiCAGO library reads, and Dovetail Hi-C library reads were used as input data for HiRise, a software pipeline designed specifically for using proximity ligation data to scaffold genome assemblies. An iterative analysis was conducted. First, Shotgun and CHiCAGO library sequences were aligned to the draft input assembly using a modified SNAP read mapper (http://snap.cs.berkeley.edu). The separations of CHiCAGO read pairs mapped within draft scaffolds were analyzed by HiRise to produce a likelihood model for genomic distance between read pairs, and the model was used to identify and break putative mis-joins, to score prospective joins, and make joins above a threshold. After aligning and scaffolding CHiCAGO data, Dovetail Hi-C library sequences were aligned and scaffolded following the same method. After scaffolding, shotgun sequences were used to close gaps between contigs. To remove the ambiguity by genomic duplications that interfere with scaffold continuity, we performed purge_dups19 with whole genome sequencing data by default parameters. The analysis of the scaffolding was performed with HiRise iteratively. The chromosome sequences for the onion were conducted with corrected contigs and scaffolds using RagTag20. We finally assembled 12.7 Gb of the onion genome, 94.53% of which is anchored to 8 chromosomes (Table 1a).
Genome annotation
Before annotating the structural and functional features, repeat sequences were modeled with RepeatModeler v2.0.3 and masked using RepeatMasker v4.1.1 (Fig. 4). Further, the repeats classified into their subclasses upon the Repbase v20.08 database (Table 3). A total of 9.8 Gb (76.9%) of sequences were annotated and LTR elements were predominantly abundant, accounting for 5.7 Gb (44.9%), especially those of the Gypsy family (4.5 Gb). Genome was scanned for non-coding RNAs through tRNAscan-SE 2.021, RNAmmer v1.222 and Infernal v1.1.223. The evaluation of rRNAs and tRNAs revealed 1,416 rRNAs and 7,389 tRNAs, totaling 851 Kb.
The structural annotations for the repeat-masked genome were conducted with five genomic gene models from the Asparagales taxonomical order, proteins from eleven traits of Allium cepa, plant transcription factor24 and R gene25 datasets from respective databases (Table S1, https://doi.org/10.6084/m9.figshare.28079846.v1)26. The transcriptome generated in this study, consisting of de novo assembled 46.5 Gb of short read RNA-Seq sequences from Illumina and 3.4 Gb of long read full-length transcripts from PacBio IsoSeq method, were also included. The in–house pipeline used three modules, as explained in Shin et al.27, including an evidence-based gene modeler (EGM), an ab–initio gene modeler called predicted gene model (PGM), and a consensus gene modeler (CGM). For EGM, PASA v2.4.128, NCBI-BLAST + v2.2.28+, and exonerate v2.4.0 methods were used to map the listed transcripts and proteins to the masked genome. For PGM, the mapped protein information was used to train ab-initio gene prediction methods using AUGUSTUS v3.1.0 and Geneid v1.429. Finally, the consensus model was developed from both models generated from EGM and PGM. A final set of 65,730 genes were obtained; 8,827 bp of average gene length with 5.48 exons per gene. The functional annotation of genes and proteins was performed using various tools, including GO, KEGG pathway, conserved domains, and assigned transcription factors through Trinotate v3.0.130, InterProScan v.5.331, and PlanTFDB v5.024 (Table 4). Additionally, R genes/proteins were manually curated from The Plant Resistance Genes database25 and RGAugury v1.032. All analyses were performed using default parameters unless stated otherwise. In summary, the genome annotation identified a predominantly repetitive genome, characterized by abundant LTRs, and yielded 65,730 high-confidence genes with detailed functional annotations, contributing valuable resources for further research. We found 65,730 genes, with an average gene length of 8.8 Kb and an average number of exons per gene of 5.48 (Table 1b, Fig. 1b).
Synteny analysis of onion and garlic
We performed the synteny analysis between this onion and the garlic genome5 with protein sequences and related gene annotation information using GENESPACE33, and high collinearity was observed (Fig. 5).

Synteny map of A. cepa and A. sativum. The vertical lines indicate the regions of orthologous genes, and the order of A. cepa chromosomes is colour-coded. The chromosome (rounded rectangle) is scaled by gene rank orders.
Long terminal repeat retrotransposons insertion time estimation
Intact Long terminal repeat retrotransposons (LTR-RT) repeats were retrieved from the garlic5 and onion (DHW30006) genome assemblies and the time estimation was calculated with LTR-retriver v. 2.9.034 following the formula T = K/2μ, where K is the divergence rate and μ is the neutral mutation rate35. Frequency graphs of the insertion time were rendered using R software. Based on LTR insertion time analysis, the accumulation of LTRs over the last 2 million years might have been responsible for the genome size increase (Fig. 1c).
Fluorescence in situ hybridization
The bulbs DHW30006 onion were obtained from the Allium Vegetable Research Center and the seeds of wild watermelon Citrullus amarus PI 299379 were obtained from SPRING SEED CO., LTD. FISH was performed with fresh root tips using the procedures described by Lim et al.36. Pre-labeled oligoprobes (PLOPs) for 5S rDNA and 45S rDNA sequences were labeled following the procedures by Waminal et al.37. Homologous chromosomes were paired based on their rDNA signals, chromosomal size, centromeric position, and arranged in a decreasing order based on chromosomal lengths (Fig. 1a).