Published on: July 2024
We present a COX1 phylogeny for the marine snail family Terebridae, based on 760 publicly available sequences representing 112 nominal species (July 2024). All entries were retrieved from GenBank, cross‑validated against MolluscaBase/WoRMS, and aligned with MAFFT v7.505. After trimming ambiguously aligned sites, maximum‑likelihood and Bayesian trees were inferred using IQ‑TREE v2.4.0 and BEAST v1.10, respectively, with branch support assessed via 1 000 ultrafast bootstraps and 1 000 SH‑aLRT replicates. Of the 112 species, 84 (75 %) formed exclusive, monophyletic clades, while 28 (25 %) exhibited paraphyly or polyphyly — consistent with rates reported in large‐scale COI surveys. We identify key sources of discordance (introgression, NUMTs, taxonomic error, analytical artefacts) and apply transparent criteria to remove six problematic haplotypes, yielding a cleaned dataset of 754 ingroup sequences. This COX1 “backbone” phylogeny highlights resolved clades, flags taxa in need of further scrutiny, and establishes the molecular scaffold for a forthcoming total‑evidence analysis that will integrate additional ribosomal loci (12S, 16S, 28S) and CNN‑extracted shell‑morphology characters.
Cytochrome-c-oxidase subunit I (COX1) is the single most abundant locus in public repositories, driven by its use as the standard DNA-barcode for Metazoa. Its high amplification success, well-curated reference libraries, and relatively rapid substitution rate make COX1 an ideal first pass for assessing lineage diversity. At the same time, reliance on a single mitochondrial fragment can mask introgression, incomplete lineage sorting, and nuclear mitochondrial pseudogenes (NUMTs). These limitations motivate a total-evidence framework in which COX1 trees are interpreted alongside additional loci and independent phenotypic features.
This technical report provides an updated COX1 phylogeny of Terebridae that nearly doubles the sampling of previous studies (760 sequences, 112 nominal species) by mining GenBank records through July 2024 and cross-validating every taxon name against MolluscaBase/WoRMS. The work has two immediate purposes. First, it offers a reproducible baseline tree that highlights which clades are already well supported, which species remain non-monophyletic, and where additional sequencing effort is most urgent. Second, it sets the molecular scaffold for a forthcoming total-evidence analysis that will integrate ribosomal genes (12S, 16S, 28S) and shell-morphology characters extracted automatically with convolutional neural networks (CNNs).
By combining exhaustive public-sequence harvesting, rigorous alignment and model selection, and transparent criteria for retaining or excluding problematic sequences, the present COX1 tree establishes the molecular backbone on which the multi-locus and image-informed species tree will be built.
All sequences were retrieved from https://www.ncbi.nlm.nih.gov/gene/ using the search expression "Terebridae"[Organism] OR Terebridae[All Fields]. A total of 2608 entries were retrieved and stored locally in a BioSQL database. These data were cross-checked against MolluscaBase/WoRMS. A CSV is available
The sequences selected for this report are the mitochondrial Cox1 (cytochrome oxidase subunit 1) sequences that carry a valid MolluscaBase name. A total of 760 sequences fulfilled these criteria, representing 112 species. Most of these sequences were published (27 publications). The publications contributing the most sequences are Holford et al. (2009) [4] and Modica et al. (2020) [5].
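The search step can be sketched against the NCBI E-utilities. This is a minimal sketch, not the pipeline actually used for the report: the `nuccore` database, the `retmax` value, and the helper names `build_esearch_url` and `fetch_ids` are assumptions made for illustration. Only the search expression is taken from the text.

```python
import json
import urllib.parse
import urllib.request

# Search expression quoted in the text.
QUERY = '"Terebridae"[Organism] OR Terebridae[All Fields]'
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="nuccore", retmax=5000):
    """Build an ESearch URL (db and retmax are assumed values)."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH + "?" + urllib.parse.urlencode(params)


def fetch_ids(term, **kwargs):
    """Run the search and return the list of matching record identifiers."""
    with urllib.request.urlopen(build_esearch_url(term, **kwargs)) as response:
        payload = json.load(response)
    return payload["esearchresult"]["idlist"]
```

Calling `fetch_ids(QUERY)` requires network access. The returned identifiers would then be downloaded and their taxon names checked against MolluscaBase/WoRMS.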
An outgroup sequence (Conasprella sp., KJ551368) was included to root the tree, and both unrooted and rooted phylogenies were inferred. First, the nucleotide sequences were aligned with MAFFT v7.505 using the “--auto” option, which lets MAFFT choose the alignment algorithm (L-INS-i, FFT-NS-i, or FFT-NS-2) based on dataset size, and the alignment was exported in FASTA format. Columns containing gaps in more than 80% of taxa were then removed, reducing the Cox1 alignment from 709 to 616 positions. The trimmed alignment was inspected manually to confirm positional homology and the absence of misalignments. Finally, phylogenetic reconstruction was performed in IQ-TREE multicore v2.4.0 (Linux x86_64, built 12 February 2025) under the best-fit substitution model selected by ModelFinder, with branch support assessed via 1 000 ultrafast bootstrap replicates (-B 1000) and 1 000 SH-aLRT replicates (-alrt 1000). Both unrooted and outgroup-rooted trees were exported in Newick format.
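The gap-trimming rule can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions: the alignment is held as a dictionary of equal-length strings, "-" is the only gap character, and the function name `trim_gappy_columns` is hypothetical. The tool actually used for trimming is not stated in this report.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.8):
    """Drop columns whose gap fraction exceeds max_gap_fraction.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    A column at exactly the threshold is kept ("more than 80%" is removed).
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all aligned sequences must have the same length")

    keep = []
    for col in range(length):
        gaps = sum(1 for name in names if alignment[name][col] == "-")
        if gaps / len(names) <= max_gap_fraction:
            keep.append(col)

    return {name: "".join(alignment[name][c] for c in keep) for name in names}


# Toy alignment: column 1 is all gaps (removed), column 2 is 4/5 gaps (kept).
toy = {"s1": "A--G", "s2": "A--G", "s3": "A--G", "s4": "A--G", "s5": "A-CG"}
trimmed = trim_gappy_columns(toy)
```

On the toy alignment the fully gapped column is removed and the column at exactly 80% gaps is retained, which mirrors the "more than 80%" wording of the criterion.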
The same sequences were also analyzed with BEAST v10.5.0 [6]. IQ-TREE ModelFinder recommended the GTR substitution model with Gamma (equal weights) and invariant sites. A birth-death speciation process was selected as the tree prior. All other settings (Priors, States, Operators, MCMC) were left at their defaults. TreeAnnotator v10.5.0 was used to build a maximum clade credibility (MCC) tree from 10 000 trees generated with BEAST.
Except for the BEAST analysis, all data manipulation and analysis were performed in JupyterLab using Biopython [7].
The COX1 phylogeny of Terebridae (Fig. 1) contains 112 nominal species represented by 760 sequences. Of these, 84 species (75 %) form exclusive, monophyletic clades, whereas 28 species (25 %) are either paraphyletic or polyphyletic. A quarter of the fauna therefore fails the “barcode = species” expectation at this single locus.
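The monophyly count can be made concrete with a small sketch. It assumes a rooted tree written as nested Python tuples and tip labels of the form "Genus_species|accession". Both conventions, and the function names, are hypothetical and chosen only for illustration. A species is scored as monophyletic when some clade contains all of its tips and nothing else.

```python
from collections import Counter


def species_of(label):
    """Return the species part of a 'Genus_species|accession' tip label."""
    return label.split("|")[0]


def collect_clades(node, out):
    """Append the leaf list of every node (tips included) to out."""
    if isinstance(node, str):
        leaves = [node]
    else:
        leaves = [leaf for child in node for leaf in collect_clades(child, out)]
    out.append(leaves)
    return leaves


def monophyletic_species(tree):
    """Map each species to True if its tips form an exclusive clade."""
    clades = []
    all_leaves = collect_clades(tree, clades)
    totals = Counter(species_of(leaf) for leaf in all_leaves)
    result = {species: False for species in totals}
    for clade in clades:
        species_in_clade = {species_of(leaf) for leaf in clade}
        if len(species_in_clade) == 1:
            species = next(iter(species_in_clade))
            if len(clade) == totals[species]:
                result[species] = True
    return result


# Toy rooted tree with three species; only A_a is monophyletic.
toy_tree = ((("A_a|1", "A_a|2"), "A_b|3"), (("B_c|4", "A_b|5"), "B_c|6"))
status = monophyletic_species(toy_tree)
share = sum(status.values()) / len(status)
```

Applied to the full tree, the same ratio corresponds to the reported 84 of 112 species (75 %). The result depends on the rooting, so the outgroup-rooted tree is the appropriate input.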
That proportion lies within the range reported for large animal surveys. The meta-analysis of Funk & Omland (2003) [1] collated 2 319 species from 564 mtDNA studies and found that 23 % showed species-level non-monophyly, concluding the phenomenon is taxonomically widespread and statistically common. A decade later, Ross (2014) [2] re-estimated the figure using 7 368 publicly available COI barcodes and recovered 19 % paraphyletic species — slightly lower, yet broadly confirming the earlier result. Focusing on a single, well-sampled insect order, Mutanen et al. (2016) [3] analysed >4 000 European Lepidoptera and, after stringent curation, recorded just 12 % non-monophyly, demonstrating how rigorous voucher checks and denser sampling can reduce the apparent rate.
Although the mitochondrial COX1 gene has become the workhorse of DNA barcoding, it frequently fails to recover species as tidy monophyletic units. Four partly overlapping sources of discordance recur across empirical datasets: introgression, NUMTs, taxonomic error, and analytical artefacts.
For the non-monophyletic haplotype EU685521.1, Figure 2 (nucleotides) and Figure 3 (translated amino acids) compare it with its two closest COX1 neighbours (Myurella dedonderi). All three sequences are identical at the amino-acid level and differ by only two silent substitutions. Because COX1 therefore contains virtually no phylogenetically informative sites for this trio, further statistical resampling of this locus alone cannot restore monophyly. Resolving the discordance will require additional, independent markers or a re-examination of voucher identity.
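The distinction between silent and amino-acid-changing substitutions can be checked with a short sketch. It assumes that the sequences are aligned without gaps, that they start in reading frame 0, and that the invertebrate mitochondrial genetic code (NCBI translation table 5) applies. The function names and the toy sequences are illustrative and are not the actual haplotypes.

```python
BASES = "TCAG"
# NCBI translation table 5 (invertebrate mitochondrial), codons in TCAG order.
TABLE5 = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSSSSVVVVAAAADDEEGGGG"
CODONS = {
    a + b + c: TABLE5[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq):
    """Translate in frame 0; unknown or incomplete codons become 'X'."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return "".join(CODONS.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def classify_substitutions(seq_a, seq_b):
    """Return (silent, replacement) counts of differing nucleotide sites."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    silent = replacement = 0
    for pos, (x, y) in enumerate(zip(seq_a.upper(), seq_b.upper())):
        if x == y:
            continue
        start = pos - pos % 3
        aa_a = CODONS.get(seq_a[start:start + 3].upper(), "X")
        aa_b = CODONS.get(seq_b[start:start + 3].upper(), "X")
        if aa_a == aa_b:
            silent += 1
        else:
            replacement += 1
    return silent, replacement


# Toy example: two third/first-position changes that leave the protein intact.
ref = "ATGTTAGGA"
alt = "ATGCTAGGG"
counts = classify_substitutions(ref, alt)
```

When two differences fall in the same codon, each site is scored against the whole-codon comparison, which is a simplification that is acceptable for a handful of differences.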
The history of this sequence (and of several others) is confusing. It was published as "Hastulopsis sp." [4]. The GenBank entry carries the name "Strioterebrum fuscotaeniatum" in its description but is tagged with the organism "Punctoterebra fuscotaeniata", which WoRMS lists as the accepted name (https://www.marinespecies.org/aphia.php?p=taxdetails&id=1417524). The sibling sequences were all published as Hastulopsis species (see [4]); their current genus name in WoRMS is Myurella.
Our objective is to present a total-evidence phylogeny that integrates molecular data with CNN-derived image characters, not to carry out a full taxonomic revision of the Terebridae. We therefore remove all COX1 haplotypes whose species falls inside the subtree of another genus. In practice, we do not investigate the causes of the non-monophyly.
The following sequences were removed because the species falls inside the subtree of another genus. A BLAST search shows that their best matches are not sequences of the same or a closely related species. All of these sequences also had low support values (<0.5):
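One way to operationalise this removal rule is sketched here. The definition used is an assumption, not the procedure applied in this report: a tip is flagged when the smallest enclosing clade with at least `min_context` other tips consists, apart from that tip, of a single different genus. The nested-tuple tree, the "Genus_species|accession" labels, and all function names are hypothetical.

```python
def genus_of(label):
    """Return the genus part of a 'Genus_species|accession' tip label."""
    return label.split("_")[0]


def leaves_of(node):
    """Return all tip labels below a node of a nested-tuple tree."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves_of(child)]


def flag_misplaced(tree, min_context=2):
    """Flag lone tips nested inside a clade made up of another genus."""
    flagged = []

    def visit(node, ancestors):
        if isinstance(node, str):
            # Walk from the nearest ancestor clade towards the root.
            for clade_leaves in reversed(ancestors):
                others = [leaf for leaf in clade_leaves if leaf != node]
                if len(others) < min_context:
                    continue
                genera = {genus_of(leaf) for leaf in others}
                if len(genera) == 1 and genus_of(node) not in genera:
                    flagged.append(node)
                break  # only the smallest sufficiently large clade is tested
            return
        here = ancestors + [leaves_of(node)]
        for child in node:
            visit(child, here)

    visit(tree, [])
    return flagged


# Toy tree: one Punctoterebra tip sits inside a Myurella clade.
toy_tree = (
    (("Myurella_a|1", "Punctoterebra_x|2"), "Myurella_b|3"),
    ("Punctoterebra_x|4", "Punctoterebra_y|5"),
)
suspects = flag_misplaced(toy_tree)
```

A rule of this kind only proposes candidates. Low branch support and BLAST best matches remain separate checks.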
The cleaned COX1 phylogeny of Terebridae (Fig. 4) contains 111 nominal species represented by 754 sequences (plus one outgroup). Of these, 87 species (78 %) form exclusive, monophyletic clades, whereas 24 species (22 %) are either paraphyletic or polyphyletic. After removal of the sequences listed above, two genera remain non-monophyletic: Neoterebra and Maculauger. Before cleaning, 10 of the 12 genera were non-monophyletic. Because in these two cases there is no "lone" species inside the subtree of another genus, the data are kept and used in the next step, the creation of the species tree.
The same 754 sequences were used as input for BEAST, which produced a similar topology (Fig. 5). Of the 111 species, 89 are monophyletic. The original set of 760 sequences was also analyzed with BEAST, and the same six sequences fell inside the subtree of another genus (data not shown).
Figure 1. Maximum-Likelihood COX1 Phylogeny of Terebridae. A rooted maximum-likelihood tree inferred in IQ-TREE from 760 COX1 sequences representing 112 Terebridae species. Branch support is indicated by ultrafast bootstrap (≥ 95%) and SH-aLRT (≥ 80%) values at key nodes. This tree identifies instances of species-level non-monophyly (25% of taxa).
Figure 2. Nucleotide Alignment of a Non-Monophyletic COX1 Haplotype. Cox1 gene alignment for the non-monophyletic nucleotide sequence EU685521.1 (Punctoterebra fuscotaeniata); the sibling sequences belong to Myurella dedonderi.
Figure 3. Amino Acid Alignment of a Non-Monophyletic COX1 Haplotype. Cox1 gene alignment for the translated amino-acid sequence of the non-monophyletic haplotype EU685521.1 (Punctoterebra fuscotaeniata); the sibling sequences belong to Myurella dedonderi.
Figure 4. Cleaned COX1 Phylogeny After Removal of Misplaced Haplotypes. Maximum-likelihood tree of 754 COX1 sequences (111 nominal species plus one outgroup) after excluding six haplotypes that nested within other genera with low support. This cleaned tree shows an improved monophyly rate (78% of species), reduces artificial polytomies, and will serve as the refined molecular backbone for species-tree inference.
Figure 5. Ultrametric COX1 MCC Tree from BEAST Analysis. A time-calibrated maximum-clade-credibility (MCC) tree generated in BEAST under a birth–death speciation prior, based on the same 754-sequence alignment used in Figure 4. The ultrametric format and consensus filtering collapse poorly supported branches, producing a tree shape consistent with birth–death diversification expectations.