Comprehensive re-assembly and annotation dataset for the argan tree (Argania spinosa L., Sapotaceae) genome – Scientific Data
A new genome resource for the argan tree, known for its drought resistance and high-value oil, is now available. It provides a re-assembled, scaffold-level nuclear genome and a multi-layer annotation set covering repeats, gene models, and gene functions. Released in an unedited, pre-final form to speed access, the dataset may still contain errors and remains subject to standard legal disclaimers. Even so, it provides a timely foundation for conservation genomics, trait discovery, and sustainable agriculture in one of North Africa’s most iconic species.
What’s new in this release
Using previously generated Illumina whole-genome shotgun reads from the “Argan Amghar” individual (BioProject PRJNA294096) and the corresponding GenBank assembly (GCA_003260245.2), the team re-assembled and curated a substantially improved draft genome:
- Assembly size: 690 Mbp (scaffold-level).
- Contiguity: scaffold N50 of 25 Mbp; L50 of 11 macro-scaffolds. In other words, the 11 longest scaffolds, each at least 25 Mbp long, together hold at least half of the assembly.
- Repeat landscape: 53.0% of the assembly is annotated as repetitive, consistent with many woody perennials.
- Gene models: 51,078 predicted protein-coding genes and 2,081 non-coding RNA genes.
- Functional coverage: curated functions, domains, and Gene Ontology terms assigned to 32,785 genes.
- Protein evidence: 25,484 proteins supported by UniProtKB/Swiss-Prot matches.
- Completeness: BUSCO analysis of the assembly indicates a highly complete gene space; the predicted proteome reaches 74.6% completeness.
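For readers less familiar with the contiguity metrics quoted above, the following is a minimal Python sketch of how N50 and L50 are conventionally computed from a list of scaffold lengths. The function name `n50_l50` and the toy lengths are illustrative assumptions; they are not taken from the dataset or from the authors' pipeline.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of scaffold lengths in base pairs.

    N50 is the length of the scaffold at which the running total of
    lengths, taken from longest to shortest, first reaches half of the
    assembly size. L50 is the number of scaffolds needed to get there.
    """
    if not lengths:
        raise ValueError("need at least one scaffold length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count


# Toy lengths (hypothetical, NOT the argan scaffolds).
toy_lengths = [40, 30, 20, 10, 5, 5]
toy_n50, toy_l50 = n50_l50(toy_lengths)
print(toy_n50, toy_l50)
```

Applied to the real scaffold lengths, the same calculation would be expected to reproduce the reported N50 and L50, provided the same scaffold set is used.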
How the genome and annotations were built
The re-assembly integrates Illumina shotgun data from the Argan Amghar individual with curation of the existing GenBank reference. Gene prediction relied on two ab initio tools, AUGUSTUS and GeneMark-ES, whose outputs were combined with EVidenceModeler, which weighs the competing predictions to produce a single consensus set of gene models.
Functional annotation combined orthology- and domain-centric pipelines to maximize reliability:
- eggNOG-mapper to infer orthologs and functional terms from large-scale orthology databases.
- InterProScan to identify protein domains, motifs, and signatures.
- BLASTp searches against UniProtKB/Swiss-Prot to assign high-confidence functional labels supported by expert curation.
Combining these independent lines of evidence is intended to reduce spurious functional assignments and to make downstream interpretation of pathways, gene families, and trait-linked loci more reliable.
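As a rough illustration of how evidence from several annotation tools can be combined per gene, here is a minimal Python sketch. It assumes each tool's output has already been reduced to a dictionary mapping gene IDs to sets of labels. The gene IDs, labels, and function names are hypothetical, and this is not the authors' actual integration procedure.

```python
def merge_annotations(*sources):
    """Union the labels assigned to each gene across all sources."""
    merged = {}
    for source in sources:
        for gene_id, labels in source.items():
            merged.setdefault(gene_id, set()).update(labels)
    return merged


def supported_by_at_least(sources, min_sources):
    """Return gene IDs that carry labels in at least `min_sources` sources."""
    support = {}
    for source in sources:
        for gene_id, labels in source.items():
            if labels:  # ignore empty label sets
                support[gene_id] = support.get(gene_id, 0) + 1
    return {gene_id for gene_id, n in support.items() if n >= min_sources}


# Hypothetical per-tool results (placeholders, not real argan annotations).
eggnog = {"gene1": {"GO:0006629"}, "gene2": {"GO:0009414"}}
interpro = {"gene1": {"IPR000001"}, "gene3": {"IPR000002"}}
swissprot = {"gene1": {"P00001"}, "gene2": set()}

merged = merge_annotations(eggnog, interpro, swissprot)
well_supported = supported_by_at_least([eggnog, interpro, swissprot], 2)
print(sorted(merged), sorted(well_supported))
```

Requiring agreement from two or more sources is one simple way to flag higher-confidence assignments; the threshold is a design choice rather than something specified in the release.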
Why it matters
Argania spinosa underpins unique dryland ecosystems and local economies across Morocco and neighboring regions. By refining the genome’s contiguity and expanding its functional map, the dataset can accelerate:
- Gene discovery for oil biosynthesis, drought and heat tolerance, pest resistance, and wood formation.
- Marker development for breeding and domestication efforts aimed at yield, resilience, and quality.
- Conservation genomics to trace population structure, genetic diversity, and adaptation under climate stress.
- Comparative genomics across Sapotaceae and other eudicots to illuminate genome evolution and specialized metabolism.
For developers building genomic tools and pipelines, the availability of a unified GFF3 and a predicted proteome FASTA also streamlines integration into analysis stacks, from variant annotation and pan-genome construction to expression and network studies.
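As a starting point for such integration, the sketch below counts feature types in a GFF3 file using only the Python standard library. It relies on the standard GFF3 layout (nine tab-separated columns, feature type in the third column, `#` comment lines, optional `##FASTA` section). The toy records and the function name `count_feature_types` are illustrative assumptions, and the feature types actually present in the released file may differ.

```python
import io
from collections import Counter


def count_feature_types(handle):
    """Count GFF3 features by type (column 3) from an open text handle."""
    counts = Counter()
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequences follow; no more feature lines
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # skip malformed rows rather than guessing
        counts[fields[2]] += 1
    return counts


# Toy GFF3 content (hypothetical IDs and coordinates).
toy_rows = [
    ("scaffold_1", ".", "gene", "100", "900", ".", "+", ".", "ID=gene1"),
    ("scaffold_1", ".", "mRNA", "100", "900", ".", "+", ".", "ID=mrna1;Parent=gene1"),
    ("scaffold_1", ".", "CDS", "100", "900", ".", "+", "0", "ID=cds1;Parent=mrna1"),
    ("scaffold_2", ".", "gene", "50", "400", ".", "-", ".", "ID=gene2"),
]
toy_gff3 = "##gff-version 3\n" + "\n".join("\t".join(r) for r in toy_rows) + "\n"

feature_counts = count_feature_types(io.StringIO(toy_gff3))
print(dict(feature_counts))
```

For the real file, the same function can be called on `open(path)`, where `path` is wherever the downloaded GFF3 has been saved.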
Key technical highlights at a glance
- Source data: Illumina WGS reads from “Argan Amghar” (PRJNA294096); reference assembly GCA_003260245.2.
- Assembly quality: 690 Mbp; scaffold N50 25 Mbp; L50 11 macro-scaffolds.
- Repeat content: 53.0% masked and classified.
- Gene prediction: 51,078 protein-coding; 2,081 non-coding RNA genes via AUGUSTUS + GeneMark-ES, integrated by EVidenceModeler.
- Functional annotation: 32,785 genes assigned curated functions, domains, and GO terms via eggNOG-mapper, InterProScan, and BLASTp.
- Protein support: 25,484 proteins with UniProtKB/Swiss-Prot evidence.
- BUSCO: high gene-space completeness; predicted proteome completeness 74.6%.
- Primary outputs: unified GFF3 and predicted proteome FASTA.
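To put the annotation counts in proportion, the short Python calculation below expresses them as percentages of the predicted protein-coding gene set. It assumes that the 51,078 protein-coding genes are the appropriate denominator for both figures (that is, roughly one representative protein per gene); the source does not state these percentages itself.

```python
# Counts as reported in this release.
PROTEIN_CODING_GENES = 51_078
FUNCTIONALLY_ANNOTATED = 32_785
SWISSPROT_SUPPORTED = 25_484


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


annotated_pct = percent(FUNCTIONALLY_ANNOTATED, PROTEIN_CODING_GENES)
swissprot_pct = percent(SWISSPROT_SUPPORTED, PROTEIN_CODING_GENES)
print(annotated_pct, swissprot_pct)  # 64.2 49.9
```

Under that assumption, roughly two thirds of the predicted genes carry a functional label and about half have a Swiss-Prot match.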
Accessing the data
All primary data products are openly available via NCBI and Zenodo. The Zenodo record, which aggregates the unified annotations and sequences, can be found at: https://doi.org/10.5281/zenodo.17901083.
Researchers can retrieve the source reads under BioProject PRJNA294096 and explore the underlying assembly GCA_003260245.2 at GenBank. The harmonized GFF3 and proteome FASTA use standard formats that most annotation-aware bioinformatics tools can read directly.
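Once the proteome FASTA has been downloaded, a quick sanity check is to count the sequences and summarize their lengths. The following is a minimal standard-library Python sketch; the toy records and the function names are illustrative assumptions, and the header format in the released file may differ.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def protein_lengths(handle):
    """Map each sequence ID (first word of header) to its residue count."""
    lengths = {}
    for header, seq in read_fasta(handle):
        seq_id = header.split()[0]
        lengths[seq_id] = len(seq.rstrip("*"))  # drop a trailing stop symbol
    return lengths


# Toy proteome (hypothetical IDs and sequences).
toy_fasta = ">prot1 hypothetical protein\nMKT\nLLV\n>prot2\nMA*\n"
toy_lengths_by_id = protein_lengths(io.StringIO(toy_fasta))
print(toy_lengths_by_id)
```

If the file holds one protein per predicted gene, the number of records should be close to the reported protein-coding gene count; a large mismatch would suggest isoforms or a different filtering step.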
Caveats and next steps
As an unedited, pre-final release, this resource may contain errors and is subject to updates. Users should expect new versions as the authors finalize editing and refine annotation thresholds. While BUSCO indicates strong coverage of the assembly’s gene space, the proteome completeness figure (74.6%) leaves room for improvement in the gene models, especially for lineage-specific or highly repetitive gene families common in trees.
Continued improvements could include long-read sequencing or Hi-C scaffolding for chromosome-level resolution, transcriptome-guided refinement to capture alternative isoforms, and expanded functional benchmarking using curated plant-specific databases.
Bottom line
This re-assembled and comprehensively annotated argan genome offers an open foundation for both basic and applied research. With improved contiguity, broad functional labeling, and standard file formats, it is well positioned to serve as a working reference for scientists, breeders, and conservationists seeking to understand and safeguard Argania spinosa.