Please use this identifier to cite or link to this item: https://doi.org/10.21256/zhaw-30407
Publication type: Article in scientific journal
Type of review: Peer review (publication)
Title: Deep neural networks for automatic speaker recognition do not learn supra-segmental temporal features
Authors: Neururer, Daniel
Dellwo, Volker
Stadelmann, Thilo
et al.: No
DOI: 10.1016/j.patrec.2024.03.016
10.21256/zhaw-30407
Published in: Pattern Recognition Letters
Issue Date: 26-Mar-2024
Publisher / Ed. Institution: Elsevier
ISSN: 0167-8655
1872-7344
Language: English
Subjects: Speaker verification; Speaker clustering; Dynamic feature; Prosodic features; Deep learning; Explainable AI (XAI)
Subject (DDC): 006: Special computer methods
Abstract: While deep neural networks have shown impressive results in automatic speaker recognition and related tasks, it is dissatisfactory how little is understood about what exactly is responsible for these results. Part of the success has been attributed in prior work to their capability to model supra-segmental temporal information (SST), i.e., learn rhythmic-prosodic characteristics of speech in addition to spectral features. In this paper, we (i) present and apply a novel test to quantify to what extent the performance of state-of-the-art neural networks for speaker recognition can be explained by modeling SST; and (ii) present several means to force respective nets to focus more on SST and evaluate their merits. We find that a variety of CNN- and RNN-based neural network architectures for speaker recognition do not model SST to any sufficient degree, even when forced. The results provide a highly relevant basis for impactful future research into better exploitation of the full speech signal and give insights into the inner workings of such networks, enhancing explainability of deep learning for speech technologies.
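The "novel test" summarized in the abstract is not spelled out in this record; as a rough, hedged illustration of how one might probe whether a speaker-embedding network uses supra-segmental temporal (SST) structure, the Python sketch below shuffles the time axis of an input feature matrix and compares embeddings before and after. This is not the paper's protocol: placeholder_embed, the synthetic input, and the cosine comparison are assumptions made for this example only.

# Illustrative probe (assumption, not the paper's method): permute the time axis
# of frame-level features and check whether the speaker embedding changes. A
# cosine near 1.0 means the embedding ignores temporal ordering, i.e. it
# captures little or no SST information.
import numpy as np

def shuffle_time_axis(features: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    # Permute frames along time: destroys rhythmic-prosodic (SST) structure
    # while leaving frame-level spectral content untouched.
    return features[rng.permutation(features.shape[0])]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def placeholder_embed(features: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in for a pretrained speaker-embedding network.
    # Mean-pooling over time makes it order-invariant by construction, so the
    # probe reports a cosine of ~1.0 for this toy model.
    rng = np.random.default_rng(0)
    projection = rng.standard_normal((features.shape[1], 64))
    return features.mean(axis=0) @ projection

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    utterance = rng.standard_normal((300, 40))  # 300 frames x 40 mel bins, synthetic
    emb_original = placeholder_embed(utterance)
    emb_shuffled = placeholder_embed(shuffle_time_axis(utterance, rng))
    print(f"cosine(original, time-shuffled) = {cosine(emb_original, emb_shuffled):.3f}")

In this toy setup the score is 1.0 because the stand-in model mean-pools over time; applying the same comparison to a real CNN or RNN speaker encoder would indicate how sensitive its embeddings are to temporal ordering.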
URI: https://digitalcollection.zhaw.ch/handle/11475/30407
Fulltext version: Accepted version
License (according to publishing contract): Licence according to publishing contract
Departement: School of Engineering
Organisational Unit: Centre for Artificial Intelligence (CAI)
Appears in collections: Publikationen School of Engineering

Files in This Item:
File: 2024_Neururer_Deep-learning-networks-for-automatic-speaker-recognition.pdf (Accepted Version, 6.76 MB, Adobe PDF)
Neururer, D., Dellwo, V., & Stadelmann, T. (2024). Deep neural networks for automatic speaker recognition do not learn supra-segmental temporal features. Pattern Recognition Letters. https://doi.org/10.1016/j.patrec.2024.03.016
Neururer, D., Dellwo, V. and Stadelmann, T. (2024) ‘Deep neural networks for automatic speaker recognition do not learn supra-segmental temporal features’, Pattern Recognition Letters [Preprint]. Available at: https://doi.org/10.1016/j.patrec.2024.03.016.
D. Neururer, V. Dellwo, and T. Stadelmann, “Deep neural networks for automatic speaker recognition do not learn supra-segmental temporal features,” Pattern Recognition Letters, Mar. 2024, doi: 10.1016/j.patrec.2024.03.016.
NEURURER, Daniel, Volker DELLWO and Thilo STADELMANN, 2024. Deep neural networks for automatic speaker recognition do not learn supra-segmental temporal features. Pattern Recognition Letters. 26 March 2024. DOI 10.1016/j.patrec.2024.03.016
Neururer, Daniel, Volker Dellwo, and Thilo Stadelmann. 2024. “Deep Neural Networks for Automatic Speaker Recognition Do Not Learn Supra-Segmental Temporal Features.” Pattern Recognition Letters, March. https://doi.org/10.1016/j.patrec.2024.03.016.
Neururer, Daniel, et al. “Deep Neural Networks for Automatic Speaker Recognition Do Not Learn Supra-Segmental Temporal Features.” Pattern Recognition Letters, Mar. 2024, https://doi.org/10.1016/j.patrec.2024.03.016.

