Data drops // Short research & technical findings

Technical Report: Gene phylogenetic tree of Terebridae based on Cox1

Published on: July 2024

Abstract

We present a COX1 phylogeny for the marine snail family Terebridae, based on 760 publicly available sequences representing 112 nominal species (July 2024). All entries were retrieved from GenBank, cross‑validated against MolluscaBase/WoRMS, and aligned with MAFFT v7.505. After trimming ambiguously aligned sites, maximum‑likelihood and Bayesian trees were inferred using IQ‑TREE v2.4.0 and BEAST v1.10, respectively, with branch support assessed via 1 000 ultrafast bootstraps and 1 000 SH‑aLRT replicates. Of the 112 species, 84 (75 %) formed exclusive, monophyletic clades, while 28 (25 %) exhibited paraphyly or polyphyly — consistent with rates reported in large‐scale COI surveys. We identify key sources of discordance (introgression, NUMTs, taxonomic error, analytical artefacts) and apply transparent criteria to remove six problematic haplotypes, yielding a cleaned dataset of 754 ingroup sequences. This COX1 “backbone” phylogeny highlights resolved clades, flags taxa in need of further scrutiny, and establishes the molecular scaffold for a forthcoming total‑evidence analysis that will integrate additional ribosomal loci (12S, 16S, 28S) and CNN‑extracted shell‑morphology characters.

Introduction

Cytochrome-c-oxidase subunit I (COX1) is the single most abundant locus in public repositories, driven by its use as the standard DNA-barcode for Metazoa. Its high amplification success, well-curated reference libraries, and relatively rapid substitution rate make COX1 an ideal first pass for assessing lineage diversity. At the same time, reliance on a single mitochondrial fragment can mask introgression, incomplete lineage sorting, and nuclear mitochondrial pseudogenes (NUMTs). These limitations motivate a total-evidence framework in which COX1 trees are interpreted alongside additional loci and independent phenotypic features.

This technical report provides an updated COX1 phylogeny of Terebridae that nearly doubles the sampling of previous studies (760 sequences, 112 nominal species) by mining GenBank records through July 2024 and cross-validating every taxon name against MolluscaBase/WoRMS. The work has two immediate purposes. First, it offers a reproducible baseline tree that highlights which clades are already well supported, which species remain non-monophyletic, and where additional sequencing effort is most urgent. Second, it sets the molecular scaffold for a forthcoming total-evidence analysis that will integrate ribosomal genes (12S, 16S, 28S) and shell-morphology characters extracted automatically with convolutional neural networks (CNNs).

By combining exhaustive public-sequence harvesting, rigorous alignment and model selection, and transparent criteria for retaining or excluding problematic sequences, the present COX1 tree establishes the molecular backbone on which the multi-locus and image-informed species tree will be built.

Methods

Data Collection

All sequences were retrieved from https://www.ncbi.nlm.nih.gov/gene/ using the search expression "Terebridae"[Organism] OR Terebridae[All Fields]. A total of 2608 entries were retrieved and stored locally in a BioSQL database. These records were cross-checked against MolluscaBase/WoRMS. A CSV is available.

The sequences selected for this report are mitochondrial Cox1 (cytochrome oxidase subunit 1) sequences that carry a valid MolluscaBase name. A total of 760 sequences fulfilled this criterion, representing 112 species. Most of these sequences were published (27 publications). The publications contributing the most sequences are Holford et al. (2009) [4] and Modica et al. (2020) [5].
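
The selection step described above can be sketched as a simple filter. This is a minimal illustration, not the code used for the report: the record keys (`accession`, `gene`, `organism`), the list of gene-name spellings treated as COX1, and the toy records are all assumptions made for the example.

```python
import re

# Gene-name spellings treated as COX1 here; this synonym list is an assumption.
COX1_PATTERN = re.compile(r"^(cox1|coxi|co1|coi)$", re.IGNORECASE)


def select_cox1(records, accepted_names):
    """Keep records annotated as COX1 whose organism name is accepted.

    records: iterable of dicts with hypothetical keys
             'accession', 'gene' and 'organism'.
    accepted_names: set of names regarded as valid, for example taken
                    from a MolluscaBase/WoRMS export.
    """
    selected = []
    for rec in records:
        gene = (rec.get("gene") or "").strip()
        if COX1_PATTERN.match(gene) and rec.get("organism") in accepted_names:
            selected.append(rec)
    return selected


def count_species(records):
    """Number of distinct organism names among the given records."""
    return len({rec["organism"] for rec in records})


# Toy records with placeholder accessions (not real GenBank entries).
toy_records = [
    {"accession": "ACC1", "gene": "COI", "organism": "Myurella affinis"},
    {"accession": "ACC2", "gene": "16S", "organism": "Myurella affinis"},
    {"accession": "ACC3", "gene": "cox1", "organism": "Unaccepted name"},
    {"accession": "ACC4", "gene": "COX1", "organism": "Terebra fenestrata"},
]
kept = select_cox1(toy_records, {"Myurella affinis", "Terebra fenestrata"})
print(len(kept), "sequences,", count_species(kept), "species")
```

In a real run the accepted-name set would come from the cross-check against MolluscaBase/WoRMS, and the gene annotation would be read from the stored GenBank features.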

Phylogenetic analysis

An outgroup sequence (Conasprella sp. KJ551368) was included to root the tree, and both unrooted and rooted phylogenies were inferred. First, the nucleotide sequences were aligned with MAFFT v7.505 using the “--auto” option, which lets MAFFT choose the alignment strategy (L-INS-i, FFT-NS-i, or FFT-NS-2) based on dataset size, and the alignment was exported in FASTA format. Columns containing gaps in more than 80% of taxa were then removed, reducing the Cox1 alignment from 709 to 616 positions. The trimmed alignment was inspected manually to ensure homology and the absence of misalignments. Finally, phylogenetic reconstruction was performed in IQ-TREE multicore v2.4.0 (Linux x86_64, built 12 February 2025) under the best-fit substitution model selected by ModelFinder, with branch support assessed via 1 000 ultrafast bootstrap replicates (-B 1000) and 1 000 SH-aLRT replicates (-alrt 1000); both unrooted and outgroup-rooted trees were exported in Newick format.

The same sequences were also analyzed using BEAST version 10.5.0 [6]. Following the IQ-TREE ModelFinder recommendation, a GTR substitution model with Gamma (equal weights) and invariant sites was used. A birth-death speciation process was selected as the tree prior. All other settings (Priors, States, Operators, MCMC) were left at their defaults. TreeAnnotator v10.5.0 was used to create a maximum clade credibility (MCC) tree from the 10 000 trees generated with BEAST.

Except for the BEAST analysis, all data manipulation and analysis was performed in Jupyter Lab, using Biopython [7].
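
The gap-trimming rule described above (drop columns with gaps in more than 80% of taxa) can be sketched in plain Python. This is an illustration under stated assumptions rather than the report's actual code: the alignment is held as a dict of equal-length strings, "-" is the only gap character, and the function name and toy data are hypothetical.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.8, gap_chars="-"):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction.

    alignment: dict mapping sequence id -> aligned sequence (equal lengths).
    Returns a new dict with the same ids and the trimmed sequences.
    """
    seqs = list(alignment.values())
    if not seqs:
        return {}
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")

    n = len(seqs)
    # A column is kept when at most max_gap_fraction of its cells are gaps.
    keep = [
        col for col in range(length)
        if sum(s[col] in gap_chars for s in seqs) / n <= max_gap_fraction
    ]
    return {
        name: "".join(seq[col] for col in keep)
        for name, seq in alignment.items()
    }


# Toy alignment: column 3 is all gaps (dropped); column 5 has gaps in
# exactly 80% of sequences, which is not "more than 80%", so it is kept.
toy = {
    "s1": "AC-G-",
    "s2": "AC-G-",
    "s3": "AC-GT",
    "s4": "A--G-",
    "s5": "AC-G-",
}
trimmed = trim_gappy_columns(toy)
print(trimmed["s3"])
```

The threshold is exposed as a parameter so that the effect of stricter or looser trimming on alignment length can be compared.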

Results

The COX1 phylogeny of Terebridae (Fig. 1) contains 112 nominal species represented by 760 sequences. Of these, 84 species (75 %) form exclusive, monophyletic clades, whereas 28 species (25 %) are either paraphyletic or polyphyletic. A quarter of the fauna therefore fails the “barcode = species” expectation at this single locus.
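
A tally of this kind can be sketched as follows. This is a minimal example, not the procedure used for the report: the rooted tree is represented as nested tuples of tip labels, the tip-to-species mapping and all names are hypothetical, and a species counts as monophyletic when some clade contains exactly its tips (so a species with a single sequence is trivially monophyletic).

```python
from collections import defaultdict


def clade_leaf_sets(tree):
    """Return the leaf set of every clade (including single tips).

    tree: nested tuples/lists whose leaves are tip labels (strings).
    """
    found = set()

    def walk(node):
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    walk(tree)
    return found


def monophyly_by_species(tree, species_of):
    """Map each species to True if its tips form an exclusive clade."""
    clades = clade_leaf_sets(tree)
    tips_by_species = defaultdict(set)
    for tip, species in species_of.items():
        tips_by_species[species].add(tip)
    return {
        species: frozenset(tips) in clades
        for species, tips in tips_by_species.items()
    }


# Toy rooted tree: species A and C form clades, species B is split.
toy_tree = ((("a1", "a2"), "b1"), ("b2", ("c1", "c2")))
toy_species = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C", "c2": "C"}
result = monophyly_by_species(toy_tree, toy_species)
share = sum(result.values()) / len(result)
print(result, f"{share:.0%} monophyletic")
```

The check depends on the rooting, so the outgroup-rooted tree is the natural input. The recursive walk is adequate for a sketch; a very unbalanced tree with hundreds of tips may need an iterative traversal.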

That proportion lies within the range reported for large animal surveys. The meta-analysis of Funk & Omland (2003) [1] collated 2 319 species from 564 mtDNA studies and found that 23 % showed species-level non-monophyly, concluding the phenomenon is taxonomically widespread and statistically common. A decade later, Ross (2014) [2] re-estimated the figure using 7 368 publicly available COI barcodes and recovered 19 % paraphyletic species — slightly lower, yet broadly confirming the earlier result. Focusing on a single, well-sampled insect order, Mutanen et al. (2016) [3] analysed >4 000 European Lepidoptera and, after stringent curation, recorded just 12 % non-monophyly, demonstrating how rigorous voucher checks and denser sampling can reduce the apparent rate.

Although the mitochondrial COX1 gene has become the work-horse of DNA-barcoding, it frequently fails to render species as tidy monophyletic units. Four, partly overlapping, sources of discordance recur across empirical datasets.

True biological discordance. Incomplete lineage sorting (ILS) is inevitable when speciation was rapid and effective population sizes were large. Recently diverged species may still share ancestral mitochondrial polymorphisms, so their haplotypes intermix on a gene tree, obscuring species boundaries. In addition, mitochondria can move sideways: hybridisation followed by repeated back-crossing may lead to mitochondrial introgression or outright “mito-capture”. When that happens, the captured COX1 lineage will place its host species inside the donor’s clade, producing a characteristic pattern that is repeated across many mitochondrial loci but contradicted by the nuclear genome.
Sequence artefacts. A deceptively common pitfall is the amplification of nuclear mitochondrial pseudogenes (NUMTs). These truncated, non-functional copies of COX1 reside in the nucleus; they accumulate substitutions, frameshifts or stop codons, and often yield unusually long branches or odd base-composition. PCR chimeras, low-quality reads and cross-sample contamination create similar rogue sequences that inflate paraphyly.
Taxonomic problems. Even perfect sequences cannot rescue mis-named vouchers. Mis-identifications shuffle genuine haplotypes into the wrong species, while cryptic species complexes hide multiple biological entities under a single name. Both scenarios split the nominal species into two or more well-supported clades and can only be fixed by re-examining morphology, geography and additional loci.
Analytical issues. Alignment artefacts—untrimmed primer tails, poorly aligned indels—stretch or compress terminal branches, whereas low-support rearrangements can scatter conspecific samples. Simply collapsing edges with bootstrap support below ~70 % often restores monophyly in otherwise problematic trees.
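
Because NUMTs often carry stop codons or frameshifts, a quick first screen is to look for in-frame stop codons. The sketch below assumes NCBI translation table 5 (invertebrate mitochondrial), in which only TAA and TAG are stops; the function names and toy sequences are hypothetical, only the given strand is examined, and a sequence without stops is not thereby proven to be a genuine mitochondrial copy.

```python
# Stop codons under NCBI translation table 5 (invertebrate mitochondrial).
# In this table TGA codes for Trp and AGA/AGG for Ser, so they are not stops.
STOP_CODONS = {"TAA", "TAG"}


def stop_positions(seq, frame):
    """0-based start positions of stop codons in one reading frame.

    Alignment gaps are removed first, so pass the unaligned sequence
    (or accept that gap removal defines the frame).
    """
    seq = seq.upper().replace("-", "")
    return [
        i for i in range(frame, len(seq) - 2, 3)
        if seq[i:i + 3] in STOP_CODONS
    ]


def flag_possible_numt(seq):
    """True if all three forward reading frames contain a stop codon.

    A protein-coding COX1 fragment should have at least one open frame.
    """
    return all(stop_positions(seq, frame) for frame in range(3))


# Toy sequences (not real COX1 data).
open_frame = "ATGTGAAGATTT"  # frame 0 reads ATG TGA AGA TTT: no stop in table 5
interrupted = "TAATTAATTAA"  # a TAA occurs in every forward frame
print(flag_possible_numt(open_frame), flag_possible_numt(interrupted))
```

Flagged sequences still need manual inspection, since sequencing errors or a reverse-complemented record can produce the same signal.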

Because mitochondrial genomes have an effective population size only one-quarter that of nuclear genes, genuine ILS is actually less frequent in COX1 than in many nuclear markers. Consequently, when COX1 trees show species-level paraphyly or polyphyly, the leading culprits tend to be introgression, NUMTs, taxonomic error or analytical noise rather than classical ILS. Robust workflows therefore pair COX1 barcodes with nuclear loci, rigorous voucher curation and alignment hygiene to discriminate biological signal from technical artefact.
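
The remedy mentioned under analytical issues, collapsing edges with support below about 70 %, can be sketched on a toy tree. The representation is an assumption made for illustration: an internal node is a `(support, children)` pair, a tip is a string, and the root carries `None` as support. This is not tied to any particular tree library.

```python
def collapse_low_support(node, threshold=70.0):
    """Collapse internal edges whose support is below the threshold.

    node: a tip label (str) or a (support, children) pair, where support
          is a number or None (used for the root) and children is a list.
    The children of a collapsed node are attached directly to its
    parent, which turns the weak resolution into a polytomy.
    """
    if isinstance(node, str):
        return node
    support, children = node
    new_children = []
    for child in children:
        collapsed = collapse_low_support(child, threshold)
        weak = (
            not isinstance(collapsed, str)
            and collapsed[0] is not None
            and collapsed[0] < threshold
        )
        if weak:
            new_children.extend(collapsed[1])
        else:
            new_children.append(collapsed)
    return (support, new_children)


# Toy tree with two weakly supported nodes (50 and 60) and one strong node (95).
toy_tree = (None, [(50, ["a", "b"]), (95, ["c", (60, ["d", "e"])])])
print(collapse_low_support(toy_tree))
```

After collapsing, a species whose tips sit together in one polytomy is no longer contradicted by the tree, which is the sense in which low-support rearrangements can create apparent non-monophyly.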

For the non-monophyletic haplotype EU685521.1, Figure 2 (nucleotides) and Figure 3 (translated amino acids) compare it with its two closest COX1 neighbours (Myurella dedonderi). All three sequences are identical at the amino-acid level and differ by only two silent substitutions. Because COX1 therefore contains virtually no phylogenetically informative sites for this trio, further statistical resampling of this locus alone cannot restore monophyly. Resolving the discordance will require additional, independent markers or a re-examination of voucher identity.
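
A comparison of this kind (counting nucleotide differences and checking whether they change the protein) can be sketched as follows. The example assumes NCBI translation table 5 (invertebrate mitochondrial), aligned gap-free sequences that start in frame, and uses short toy sequences rather than the actual haplotypes; the function name is hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of NCBI translation table 5 (invertebrate mitochondrial),
# listed in the standard TCAG codon order; '*' marks a stop codon.
TABLE5 = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSSSSVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), TABLE5)
}


def classify_differences(seq_a, seq_b, frame=0):
    """Compare two aligned coding sequences codon by codon.

    Returns a dict with the number of differing nucleotides, the number
    of differing codons that are silent (same amino acid), the number
    that change the amino acid, and the number that could not be
    translated (for example because of ambiguity codes or gaps).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    a, b = seq_a.upper(), seq_b.upper()
    counts = {"nucleotide": 0, "silent": 0, "replacement": 0, "unresolved": 0}
    counts["nucleotide"] = sum(x != y for x, y in zip(a, b))
    for i in range(frame, len(a) - 2, 3):
        codon_a, codon_b = a[i:i + 3], b[i:i + 3]
        if codon_a == codon_b:
            continue
        aa_a, aa_b = CODON_TO_AA.get(codon_a), CODON_TO_AA.get(codon_b)
        if aa_a is None or aa_b is None:
            counts["unresolved"] += 1
        elif aa_a == aa_b:
            counts["silent"] += 1
        else:
            counts["replacement"] += 1
    return counts


# Toy sequences: TTA/TTG both code for Leu and GGA/GGC both for Gly.
print(classify_differences("ATGTTAGGA", "ATGTTGGGC"))
```

When every difference is silent, as reported for this trio, the amino-acid alignment is identical and only the few synonymous sites carry any signal.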

The history of this sequence (and of several others) is rather confusing. The sequence was published as "Hastulopsis sp." [4]. The GenBank entry has the name "Strioterebrum fuscotaeniatum" in its description, but is tagged with the organism "Punctoterebra fuscotaeniata". The accepted name is Punctoterebra fuscotaeniata (WoRMS: https://www.marinespecies.org/aphia.php?p=taxdetails&id=1417524). The sibling sequences were all published as Hastulopsis species (see [4]); their current genus name in WoRMS is Myurella.

Our objective is to present a total-evidence phylogeny that integrates molecular data with CNN-derived image characters, not to carry out a full taxonomic revision of the Terebridae. We will therefore remove all COX1 haplotypes whose species is nested in the subtree of another genus. In practice, this means we will not investigate the causes of the non-monophyly.

The following sequences were removed because the species is nested inside the subtree of another genus. A BLAST search shows that their best matches are not sequences of the same or closely related species. All of these sequences also had low support values (<0.5):

EU685533: Terebra fenestrata in Pellifronia genus
EU685532: Pellifronia jungi in Terebra subtree
EU685583: Myurella affinis in Terebra subtree
EU685517: Punctoterebra fuscotaeniata in Myurella subtree
EU685521: Punctoterebra fuscotaeniata in Myurella subtree
MK852047: Profunditerebra poppei in Punctoterebra subtree
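
Removing the six accessions listed above from a set of records can be sketched like this. The assumptions are that each record is an `(identifier, sequence)` pair and that the identifier begins with the accession, optionally followed by a version suffix such as ".1"; the function names and the toy records are hypothetical.

```python
# Accessions excluded in this report (taken from the list above).
EXCLUDED = {
    "EU685533", "EU685532", "EU685583",
    "EU685517", "EU685521", "MK852047",
}


def base_accession(identifier):
    """Strip a version suffix, e.g. 'EU685521.1' -> 'EU685521'."""
    return identifier.split(".", 1)[0]


def drop_accessions(records, excluded=EXCLUDED):
    """Return the (identifier, sequence) pairs whose accession is not excluded."""
    return [
        (ident, seq) for ident, seq in records
        if base_accession(ident) not in excluded
    ]


# Toy records: placeholder sequences, one excluded accession with a version
# suffix, and one placeholder identifier that is kept.
toy_records = [
    ("EU685521.1", "ACGT"),
    ("KEEP0001.1", "ACGT"),
    ("MK852047", "ACGT"),
]
print(drop_accessions(toy_records))
```

Comparing on the unversioned accession avoids missing a record simply because the stored identifier carries a version number.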

The cleaned COX1 phylogeny of Terebridae (Fig. 4) contains 111 nominal species represented by 754 sequences (plus 1 outgroup). Of these, 87 species (78 %) form exclusive, monophyletic clades, whereas 24 species (22 %) are either paraphyletic or polyphyletic. After removing the sequences listed above, 2 genera remain non-monophyletic: Neoterebra and Maculauger. Before cleaning, 10 of the 12 genera were non-monophyletic. Because these two cases do not involve a "lone" species inside the subtree of another genus, the data are kept and used in the next step, the creation of the species tree.

The same 754 sequences were analyzed with BEAST, which produced a similar topology (Fig. 5). Of the 111 species, 89 are monophyletic. The original set of 760 sequences was also analyzed with BEAST, and the same 6 sequences were located inside the subtree of another genus (data not shown).

Next steps

Creation of gene trees for 12S, 16S and (nuclear) 28S ribosomal RNA.
Creation of a multilocus species tree.
Creation of all gene trees and the species tree using the Bayesian Markov chain Monte Carlo method.

References

[1] Funk, D & Omland, K. Species-Level Paraphyly and Polyphyly: Frequency, Causes, and Consequences, with Insights from Animal Mitochondrial DNA. Annual Review of Ecology, Evolution, and Systematics. 34. 397-423. (2003)
[2] Ross HA. The incidence of species-level paraphyly in animals: a re-assessment. Mol Phylogenet Evol. 76:10-7. (2014)
[3] Mutanen M et al. Species-Level Para- and Polyphyly in DNA Barcode Gene Trees: Strong Operational Bias in European Lepidoptera. Syst Biol. 65(6):1024-1040. (2016)
[4] M Holford et al. Evolution of the Toxoglossa Venom Apparatus as Inferred by Molecular Phylogeny of the Terebridae. Mol. Biol. Evol. 26(1):15–25. (2009)
[5] Modica, M. V., Gorson, J., Fedosov, A. E., et al. Macroevolutionary analyses suggest that environmental factors, not venom apparatus, play key role in Terebridae marine snail diversification. Systematic Biology 69 (3): 413–430 (2020)
[6] Suchard MA et al. Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evolution 4 (2018)
[7] Cock PA et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25, 1422-1423 (2009)

Figure 1. Maximum-Likelihood COX1 Phylogeny of Terebridae. A rooted maximum-likelihood tree inferred in IQ-TREE from 760 COX1 sequences representing 112 Terebridae species. Branch support is indicated by ultrafast bootstrap (≥ 95%) and SH-aLRT (≥ 80%) values at key nodes. This tree highlights the resolved clades and identifies instances of species-level non-monophyly (25% of taxa).

Figure 2. Nucleotide Alignment of a Non-Monophyletic COX1 Haplotype. Cox1 nucleotide alignment of the non-monophyletic sequence EU685521.1 (Punctoterebra fuscotaeniata) with its sibling sequences, which belong to Myurella dedonderi.

Figure 3. Amino Acid Alignment of a Non-Monophyletic COX1 Haplotype. Cox1 amino-acid alignment of the non-monophyletic sequence EU685521.1 (Punctoterebra fuscotaeniata) with its sibling sequences, which belong to Myurella dedonderi.

Figure 4. Cleaned COX1 Phylogeny After Removal of Misplaced Haplotypes. Maximum-likelihood tree of 754 COX1 sequences (111 nominal species plus one outgroup) after excluding six haplotypes that nested within other genera with low support. This cleaned tree shows an improved monophyly rate (78% of species), reduces artificial polytomies, and will serve as the refined molecular backbone for species-tree inference.

Figure 5. Ultrametric COX1 MCC Tree from BEAST Analysis. A time-calibrated maximum-clade-credibility (MCC) tree generated in BEAST under a birth–death speciation prior, based on the same 754-sequence alignment used in Figure 4. The ultrametric format and consensus filtering collapse poorly supported branches, producing a tree shape consistent with birth–death diversification expectations.
