Data drops // Short research & technical findings

Deep Learning Meets Phylogeny: Evaluating CNN‑derived Morphological Signal in Auger Snail Genus Oxymeris (Terebridae)

Published on: August 2025

Abstract

Shell characters in the hyperdiverse Terebridae have long challenged taxonomy because many diagnostic traits are subject to convergent evolution and environmental plasticity [1]. Manual shell identification is also labour‑intensive, relying on features such as radial rib counts that are difficult to score consistently [2]. To explore whether deep‑learning‑derived morphometrics can aid evolutionary inference, we assembled 2,256 images of 13 species in the auger snail genus Oxymeris and trained an EfficientNetV2B2 convolutional neural network to classify species. Feature vectors from the penultimate layer were averaged by species, converted to cosine distance matrices, and used to build neighbor‑joining and UPGMA trees. A multilocus phylogeny was also reconstructed from 12S, 16S, 28S and COI sequences, and patristic distances were calculated. Topologies were compared with Robinson–Foulds distances, and matrix correspondence was tested using Mantel’s test.

The CNN achieved high classification performance (weighted F1 ≈ 0.96) and revealed clusters of visually similar species. However, the morphology‑based UPGMA and NJ trees shared only half their bipartitions (normalized RF = 0.5), and neither tree shared any splits with the molecular phylogeny (RF = 1.0). The Mantel correlation between the CNN and molecular distance matrices was negligible (r = 0.006, p = 0.966). These results suggest that while CNN‑extracted features are powerful for identification, they capture phenetic similarity rather than phylogenetic signal — an outcome consistent with prior warnings that shell morphology alone can misrepresent terebrid relationships [1]. We discuss how sampling limitations, image variability and ecological convergence may contribute to this decoupling, and we advocate integrating machine‑vision data with molecular and traditional morphological characters in a total‑evidence framework.

Introduction

The family Terebridae (auger snails) has long presented a taxonomic puzzle. While their shells are morphologically diverse, this variation is often evolutionarily misleading due to rampant convergent evolution and phenotypic plasticity [1]. Seminal molecular work revealed that many traditional shell-based genera are not natural groups; for instance, species morphologically classified as Terebra fall into at least three distinct clades [1]. This unreliability extends to manual species identification, a labor-intensive process that often relies on subjective characters that are difficult to score consistently [2]. In response to similar challenges in other molluscan groups, researchers have successfully employed Convolutional Neural Networks (CNNs) to automate species identification from images, achieving high accuracy and objectivity where traditional methods fall short [2].

This technological advance raises a critical evolutionary question: What exactly do these powerful models learn from shell images? While CNNs can expertly classify species, it's unclear if their learned features merely capture superficial phenetic similarity or if they contain a deeper phylogenetic signal. Given that shell morphology in terebrids can deceive human experts by falsely grouping unrelated species, it is crucial to test whether a machine's interpretation of that same morphology suffers from the same limitations. Answering this is a key step toward understanding whether machine vision can provide a new, scalable source of character data for evolutionary studies.

Here, we test the hypothesis that deep-learning-derived shell morphometrics capture phylogenetic signal in the auger snail genus Oxymeris. We assembled an image dataset of 2,256 photographs across 13 species, trained a CNN for classification, and extracted feature vectors from the network to build morphological trees using neighbor-joining (NJ) and UPGMA methods. These trees were compared to a new multilocus molecular phylogeny (12S, 16S, 28S, and COI) using Robinson–Foulds distances for topology and a Mantel test for matrix correlation. This study directly addresses calls to integrate multiple data types in terebrid systematics [4, 5] by quantitatively evaluating whether machine vision can generate phylogenetically informative characters, paving the way for its potential inclusion in a total-evidence framework.
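As a rough illustration of the Mantel comparison mentioned above, the sketch below correlates the upper triangles of two distance matrices and estimates a p-value by permuting the rows and columns of one matrix together. The function name mantel_test, the toy matrices and the permutation count are assumptions made for illustration; the software actually used for the test is not stated here.

```python
import numpy as np

def mantel_test(d1, d2, n_perm=999, seed=0):
    """Two-sided permutation Mantel test on two symmetric distance matrices."""
    d1 = np.asarray(d1, dtype=float)
    d2 = np.asarray(d2, dtype=float)
    n = d1.shape[0]
    iu = np.triu_indices(n, k=1)              # upper triangle, diagonal excluded
    r_obs = np.corrcoef(d1[iu], d2[iu])[0, 1]

    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_perm):
        p = rng.permutation(n)
        d1_perm = d1[np.ix_(p, p)]            # permute rows and columns together
        r_perm = np.corrcoef(d1_perm[iu], d2[iu])[0, 1]
        if abs(r_perm) >= abs(r_obs):
            hits += 1
    p_value = (hits + 1) / (n_perm + 1)       # observed value counts as one permutation
    return r_obs, p_value

# Toy data (hypothetical): distances among 8 random points, and a rescaled copy
rng = np.random.default_rng(1)
points = rng.random((8, 2))
d_a = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
d_b = 2.0 * d_a
r, p = mantel_test(d_a, d_b)
```

Because d_b is a rescaled copy of d_a, the correlation here is close to 1 with a small p-value. Two unrelated matrices would instead give a correlation near zero and a large p-value, which is the pattern reported in the abstract.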

Methods

Data Acquisition

Shell images were collected from many online resources, ranging from specialized shell-collecting websites to institutes and universities. One of the largest collections of shell images is available on GBIF. Online marketplaces such as eBay also contain large collections of images. Other large shell image collections include Malacopics, Femorale and Thelsica. A shell dataset created for AI is also available [8].

Some online resources provide facilities to download images, but most websites require a specialized web scraper. Scrapy, an open-source and collaborative framework for extracting data from websites, was used to create a custom web scraper that extracts images and their scientific names. All data were stored in a MySQL database before further processing.

The dataset for the Oxymeris CNN model comprises 2,256 shell images representing 13 Oxymeris species (see Table II). The genus Oxymeris contains 22 species (WoRMS or MolluscaBase), but not enough images were found for 9 of them. Species with fewer than 25 images were removed (see Minimum number of images needed for each species).
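The minimum-image rule can be sketched as a simple count-and-filter step. The record layout (species name, image path), the file names and the placeholder species "Oxymeris sp. A" are hypothetical; only the threshold of 25 images comes from the text.

```python
from collections import Counter

MIN_IMAGES = 25  # threshold stated in the text


def filter_by_min_images(records, min_images=MIN_IMAGES):
    """Keep only records whose species has at least `min_images` images."""
    counts = Counter(species for species, _ in records)
    kept = [(sp, path) for sp, path in records if counts[sp] >= min_images]
    return kept, counts


# Hypothetical records: (species name, image path)
records = (
    [("Oxymeris strigata", f"strigata_{i}.jpg") for i in range(46)]
    + [("Oxymeris sp. A", f"sp_a_{i}.jpg") for i in range(12)]
)
kept, counts = filter_by_min_images(records)
print(sorted({sp for sp, _ in kept}))  # species that pass the threshold
```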

All sequences were retrieved from https://www.ncbi.nlm.nih.gov/gene/ using the search expression "Terebridae"[Organism] OR Terebridae[All Fields]. A total of 3398 entries were retrieved and stored locally in a BioSQL database. These data were cross-checked against MolluscaBase/WoRMS.

The sequences selected for this report are the ribosomal genes (mitochondrial 12S and 16S, nuclear 28S) and the mitochondrial COI (cox1) gene, restricted to entries with a valid MolluscaBase name. Most of these sequences originate from Holford et al. (2009) [6] and Modica et al. (2020) [4].

Image Pre-processing

All names were checked against WoRMS or MolluscaBase for validity. Names not found in WoRMS/MolluscaBase were excluded from further processing. While a large part of this data-quality step was automated, a time-consuming manual verification step was also included. In addition to text-based quality control, both automated and manual preprocessing steps were applied to the images. When an image contained multiple shells, we applied thresholding to binarize the background and then used contour detection to locate each shell’s outline, cropping out each detected contour as an individual image. The background was replaced with a uniform black background, and each image was padded with black to make it square. All shells were then resized to 400 x 400 px. A final visual selection was made before producing the final image dataset. Overall, 10-20% of the images were removed for various reasons, for example when other objects such as hands, habitat or text were visible in the picture.
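The square-padding step described above can be sketched with NumPy alone. The function name pad_to_square and the dummy image are assumptions for illustration; the image library actually used for thresholding, contour detection and resizing is not named in the text, so those steps are left out.

```python
import numpy as np


def pad_to_square(image, fill=0):
    """Centre an H x W (x C) image on a square canvas filled with `fill` (0 = black)."""
    h, w = image.shape[:2]
    side = max(h, w)
    canvas = np.full((side, side) + image.shape[2:], fill, dtype=image.dtype)
    top = (side - h) // 2
    left = (side - w) // 2
    canvas[top:top + h, left:left + w] = image
    return canvas


# Dummy tall, narrow "shell" crop: 300 x 100 px, all white
shell = np.full((300, 100, 3), 255, dtype=np.uint8)
square = pad_to_square(shell)
print(square.shape)  # (300, 300, 3)
```

Padding before resizing keeps the aspect ratio of the shell intact, which matters for elongated shells such as augers. The resize to 400 x 400 px would follow as a separate step.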

Hardware

An HP Omen 30L GT13 was used for training the model. It contains an Intel(R) Core(TM) i9-10850K CPU @ 3.60GHz, 64 GB RAM and an Nvidia GeForce RTX 3080 GPU (10 GB).

Model Training

For this study, Python (version 3.10.12) was used. The pre-trained EfficientNetV2B2 model was used (see Identifying Shells using Convolutional Neural Networks: Data Collection and Model Selection). Table I lists the hyperparameters. The model was trained using a batch size of 64 samples, and the number of epochs used was 50. Training started with an initial learning rate of 0.0005, and the Adam optimiser was used for weight updates. Two callbacks were used: one monitored the validation loss and decreased the learning rate when it stopped improving, and the other applied early stopping. Both callbacks were used to prevent the model from over-fitting. Fine-tuning was performed as described before, with the top 3 layers of the model unfrozen.

Table I. Hyperparameters

Hyperparameter	Value	Comments
Batch Size	64
Epochs	100	The number of epochs determines how many times the entire training dataset is passed through the model. Because early stopping is used, fewer than 100 epochs were often needed. The current model ran for 24 epochs.
Optimizer	Adam	The optimizer determines the algorithm used to update model weights during training.
Learning Rate	0.0005	Initial value. The validation loss was monitored and the learning rate was reduced when it stopped improving: reduce_lr = keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=3, min_lr=1e-6)
Loss	Categorical Cross-entropy
Regularization	0.001
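To make the effect of the learning-rate settings in Table I concrete, the sketch below simulates a simplified reduce-on-plateau rule in plain Python using the listed values (factor 0.1, patience 3, minimum 1e-6, initial rate 0.0005). It is not the Keras implementation: the real callback also has min_delta and cooldown settings, which are ignored here, and the validation losses are made up.

```python
def simulate_reduce_on_plateau(val_losses, lr=0.0005, factor=0.1,
                               patience=3, min_lr=1e-6):
    """Return the learning rate in effect after each epoch (simplified rule)."""
    best = float("inf")
    wait = 0
    schedule = []
    for loss in val_losses:
        if loss < best:              # improvement: remember it, reset the counter
            best = loss
            wait = 0
        else:                        # no improvement this epoch
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)   # never drop below min_lr
                wait = 0
        schedule.append(lr)
    return schedule


# Hypothetical validation losses, not taken from the study
val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50, 0.51, 0.52, 0.53]
schedule = simulate_reduce_on_plateau(val_losses)
for epoch, rate in enumerate(schedule, start=1):
    print(epoch, rate)
```

In this toy run the rate drops tenfold after the third consecutive epoch without improvement (epoch 6) and again at epoch 10.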

Evaluation Metrics

The performance of the CNN model was evaluated using standard classification metrics: accuracy, precision, recall, and F1 score. Following [7], these are defined in terms of the numbers of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FN + FP}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$F_{1}\text{-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

The Python module sklearn.metrics (scikit-learn) was used to calculate these metrics.
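The study computed these metrics with sklearn.metrics. The sketch below instead applies the formulas directly in plain Python so that the arithmetic is visible. The function names and the counts (TP = 10, FP = 2, FN = 1, TN = 87) are hypothetical and are not the study's confusion matrix.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts for one species in a 100-image test set
precision, recall, f1 = precision_recall_f1(tp=10, fp=2, fn=1)
acc = accuracy(tp=10, tn=87, fp=2, fn=1)
print(f"{precision:.3f} {recall:.3f} {f1:.3f} {acc:.3f}")  # 0.833 0.909 0.870 0.970
```

The F1 value equals 2·TP / (2·TP + FP + FN) = 20/23, so it always lies between precision and recall and is pulled towards the lower of the two.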

Feature vectors from the penultimate layer of the trained Oxymeris CNN model

To analyze the internal representations learned by the Oxymeris CNN model, we extracted high-dimensional feature vectors from the penultimate layer of the trained network. These embeddings capture rich semantic information about each image while abstracting away from pixel-level details. The model was implemented and trained using TensorFlow, and feature vectors were obtained using Keras’ Model subclassing, where a truncated version of the network outputs activations from the final convolutional or dense layer prior to classification. Each image in the dataset was passed through the network in inference mode, and the resulting feature vector (1408 dimensions) was stored for further analysis.
To quantify similarity between images, we computed pairwise cosine similarity between feature vectors.

For each image $i$, the similarity to every other image $j$ in the same class was computed as:

$$\text{sim}(i,j) = \cos(\theta) = \frac{\vec{f_i} \cdot \vec{f_j}}{\|\vec{f_i}\| \, \|\vec{f_j}\|}$$
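A minimal NumPy sketch of this computation follows. It covers both the within-class image similarities defined above and the species-level comparison based on averaged feature vectors. The function names are assumptions, and the random arrays only stand in for real 1408-dimensional embeddings.

```python
import numpy as np


def cosine_similarity_matrix(features):
    """Pairwise cosine similarity between the rows of `features` (n x d)."""
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)   # guard against zero vectors
    return unit @ unit.T


def species_mean_vectors(features, labels):
    """Average the feature vectors of each species; returns (names, means)."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    names = sorted(set(labels.tolist()))
    means = np.stack([features[labels == name].mean(axis=0) for name in names])
    return names, means


# Random stand-ins for embeddings of six images from two species
rng = np.random.default_rng(0)
features = rng.random((6, 1408))
labels = ["areolata"] * 3 + ["cerithina"] * 3

within = cosine_similarity_matrix(features[:3])   # image-to-image, one class
names, means = species_mean_vectors(features, labels)
between = cosine_similarity_matrix(means)         # species-to-species
```

Normalising each vector to unit length first turns the whole similarity matrix into a single matrix product, which is convenient when there are thousands of images.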

Table II. Per-species image counts and classification performance

Species	# images	Recall	Precision	F1
Oxymeris areolata (Link, 1807)	339	1.000	1.000	1.000
Oxymeris cerithina (Lamarck, 1822)	221	1.000	0.973	0.986
Oxymeris chlorata (Lamarck, 1822)	163	0.946	0.946	0.946
Oxymeris crenulata (Linnaeus, 1758)	396	0.905	0.927	0.916
Oxymeris dillwynii (Deshayes, 1857)	67	0.909	0.833	0.870
Oxymeris dimidiata (Linnaeus, 1758)	263	1.000	1.000	1.000
Oxymeris fatua (Hinds, 1844)	66	1.000	0.923	0.960
Oxymeris felina (Dillwyn, 1817)	155	0.966	0.933	0.949
Oxymeris gouldi (Deshayes, 1857)	56	0.938	1.000	0.968
Oxymeris maculata (Linnaeus, 1758)	237	0.875	0.913	0.894
Oxymeris senegalensis (Lamarck, 1822)	186	1.000	0.974	0.987
Oxymeris strigata (G. B. Sowerby I, 1825)	46	0.875	1.000	0.933
Oxymeris trochlea (Deshayes, 1857)	61	1.000	1.000	1.000
This table provides a detailed breakdown of the model's classification performance for each of the 13 Oxymeris species included in the study. # images indicates the total number of images used for each species. Recall (Sensitivity) measures the model's ability to correctly identify all images of a given species. Precision measures the proportion of correct identifications among all images assigned to a species. The F1-score is the harmonic mean of precision and recall, providing a single metric for overall accuracy per species. Values approaching 1.0 indicate high performance.

Species	areolata	cerithina	chlorata	crenulata	dillwynii	dimidiata	fatua	felina	gouldi	maculata	senegalensis	strigata	trochlea
areolata	1.00
cerithina	0.31	1.00
chlorata	0.62	0.69	1.00
crenulata	0.55	0.66	0.76	1.00
dillwynii	0.36	0.77	0.68	0.77	1.00
dimidiata	0.50	0.60	0.63	0.56	0.60	1.00
fatua	0.33	0.77	0.61	0.67	0.82	0.69	1.00
felina	0.61	0.55	0.76	0.76	0.67	0.55	0.67	1.00
gouldi	0.28	0.78	0.60	0.72	0.79	0.60	0.75	0.53	1.00
maculata	0.53	0.59	0.79	0.81	0.60	0.43	0.59	0.68	0.57	1.00
senegalensis	0.54	0.70	0.76	0.74	0.77	0.59	0.66	0.55	0.78	0.71	1.00
strigata	0.60	0.62	0.67	0.65	0.66	0.58	0.61	0.60	0.56	0.57	0.73	1.00
trochlea	0.30	0.63	0.50	0.66	0.65	0.70	0.63	0.46	0.82	0.41	0.60	0.50	1.00
This matrix quantifies the morphological similarity between species as learned by the CNN. Each cell contains the cosine similarity value between the averaged feature vectors of two species. Values range from 0 to 1, where 1 (along the diagonal) represents perfect self-similarity. Higher off-diagonal values (e.g., > 0.80) indicate that the model perceives two species as very similar in shell morphology, while lower values (e.g., < 0.40) indicate high distinctiveness in the learned feature space.
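As an illustration of how such similarities feed into tree building, the sketch below converts a four-species subset of the matrix above to cosine distances (1 minus similarity) and clusters it with a minimal UPGMA routine. The helper names are assumptions, branch lengths are omitted, and the four-taxon result is only a toy example, not the study's tree. The software used to build the published NJ and UPGMA trees is not stated here.

```python
from itertools import combinations


def similarity_to_distance(sim):
    """Cosine distance = 1 - cosine similarity, for a dict keyed by species pairs."""
    return {pair: 1.0 - s for pair, s in sim.items()}


def upgma(names, dist):
    """Minimal UPGMA; returns a nested-tuple topology (no branch lengths)."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (topology, size)
    d = {frozenset((i, j)): dist[frozenset((names[i], names[j]))]
         for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        pair = min(d, key=d.get)                 # closest pair of clusters
        i, j = sorted(pair)
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        for k in clusters:                       # size-weighted average distance
            dik = d.pop(frozenset((i, k)))
            djk = d.pop(frozenset((j, k)))
            d[frozenset((next_id, k))] = (ni * dik + nj * djk) / (ni + nj)
        del d[pair]
        clusters[next_id] = ((ti, tj), ni + nj)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Values copied from the similarity matrix above for four species
names = ["areolata", "cerithina", "gouldi", "trochlea"]
sim = {
    frozenset(("areolata", "cerithina")): 0.31,
    frozenset(("areolata", "gouldi")): 0.28,
    frozenset(("areolata", "trochlea")): 0.30,
    frozenset(("cerithina", "gouldi")): 0.78,
    frozenset(("cerithina", "trochlea")): 0.63,
    frozenset(("gouldi", "trochlea")): 0.82,
}
dist = similarity_to_distance(sim)
closest = min(dist, key=dist.get)
tree = upgma(names, dist)
print(sorted(closest), tree)
```

In this subset gouldi and trochlea are joined first (distance 0.18), cerithina is added next, and areolata joins last, which mirrors its low similarity to the other three species in the matrix.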

References

[1] Puillandre N, Holford M. The Terebridae and teretoxins: Combining phylogeny and anatomy for concerted discovery of bioactive compounds. BMC Chem Biol. 10:7 (2010)
[2] Eiseul K et al. Deep learning-based phenotype classification of three ark shells: Anadara kagoshimensis, Tegillarca granosa, and Anadara broughtonii. Frontiers in Marine Science, Volume 11 (2024)
[3] Kerremans, Ph. Technical Reports for a Total-Evidence Phylogenetics of Terebridae. Identifyshell.org (2025)
[4] Modica, M. V., Gorson, J., Fedosov, A. E., et al. Macroevolutionary analyses suggest that environmental factors, not venom apparatus, play key role in Terebridae marine snail diversification. Systematic Biology 69 (3): 413–430 (2020)
[5] Fedosov, A. E., Malcolm, G., Terryn, Y., et al. Phylogenetic classification of the family Terebridae (Neogastropoda: Conoidea). Journal of Molluscan Studies 86: 1–29 (2020)
[6] Holford, M. et al. Evolution of the Toxoglossa Venom Apparatus as Inferred by Molecular Phylogeny of the Terebridae. Mol. Biol. Evol. 26(1):15–25 (2009)
[7] Powers, D. M. W. Evaluation: From Precision, Recall and F-measure to ROC, Informedness, Markedness & Correlation. Journal of Machine Learning Technologies, 2(1), 37–63 (2011)
[8] Zhang, Q., Zhou, J., He, J. et al. A shell dataset, for shell features extraction and recognition. Sci Data 6, 226 (2019)