Please use this identifier to cite or link to this item: https://doi.org/10.21256/zhaw-30407
Full metadata record
DC Field | Value | Language
dc.contributor.author | Neururer, Daniel | -
dc.contributor.author | Dellwo, Volker | -
dc.contributor.author | Stadelmann, Thilo | -
dc.date.accessioned | 2024-03-27T15:49:48Z | -
dc.date.available | 2024-03-27T15:49:48Z | -
dc.date.issued | 2024-03-26 | -
dc.identifier.issn | 0167-8655 | de_CH
dc.identifier.issn | 1872-7344 | de_CH
dc.identifier.uri | https://digitalcollection.zhaw.ch/handle/11475/30407 | -
dc.description.abstract | While deep neural networks have shown impressive results in automatic speaker recognition and related tasks, it is dissatisfactory how little is understood about what exactly is responsible for these results. Part of the success has been attributed in prior work to their capability to model supra-segmental temporal information (SST), i.e., learn rhythmic-prosodic characteristics of speech in addition to spectral features. In this paper, we (i) present and apply a novel test to quantify to what extent the performance of state-of-the-art neural networks for speaker recognition can be explained by modeling SST; and (ii) present several means to force respective nets to focus more on SST and evaluate their merits. We find that a variety of CNN- and RNN-based neural network architectures for speaker recognition do not model SST to any sufficient degree, even when forced. The results provide a highly relevant basis for impactful future research into better exploitation of the full speech signal and give insights into the inner workings of such networks, enhancing explainability of deep learning for speech technologies. | de_CH
dc.language.iso | en | de_CH
dc.publisher | Elsevier | de_CH
dc.relation.ispartof | Pattern Recognition Letters | de_CH
dc.rights | https://creativecommons.org/licenses/by/4.0/ | de_CH
dc.subject | Speaker verification | de_CH
dc.subject | Speaker clustering | de_CH
dc.subject | Dynamic feature | de_CH
dc.subject | Prosodic features | de_CH
dc.subject | Deep learning | de_CH
dc.subject | Explainable AI (XAI) | de_CH
dc.subject.ddc | 006: Spezielle Computerverfahren | de_CH
dc.title | Deep neural networks for automatic speaker recognition do not learn supra-segmental temporal features | de_CH
dc.type | Beitrag in wissenschaftlicher Zeitschrift | de_CH
dcterms.type | Text | de_CH
zhaw.departement | School of Engineering | de_CH
zhaw.organisationalunit | Centre for Artificial Intelligence (CAI) | de_CH
dc.identifier.doi | 10.1016/j.patrec.2024.03.016 | de_CH
dc.identifier.doi | 10.21256/zhaw-30407 | -
zhaw.funding.eu | No | de_CH
zhaw.originated.zhaw | Yes | de_CH
zhaw.pages.end | 69 | de_CH
zhaw.pages.start | 64 | de_CH
zhaw.publication.status | publishedVersion | de_CH
zhaw.volume | 181 | de_CH
zhaw.publication.review | Peer review (Publikation) | de_CH
zhaw.webfeed | Datalab | de_CH
zhaw.webfeed | Machine Perception and Cognition | de_CH
zhaw.webfeed | ZHAW digital | de_CH
zhaw.author.additional | No | de_CH
zhaw.display.portrait | Yes | de_CH
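The abstract above describes a test that quantifies how much of a speaker-recognition network's performance can be explained by modeling supra-segmental temporal information (SST). As a purely illustrative sketch, not the authors' actual test, one generic way to probe SST reliance is to compare verification scores for an utterance before and after shuffling its frame order, which destroys rhythmic-prosodic structure while preserving spectral content. The names `embed`, `spectrogram`, and `reference_embedding` below are hypothetical placeholders, not identifiers from the paper.

```python
# Illustrative sketch only: a generic probe for SST reliance, assuming an
# arbitrary pretrained speaker-embedding model passed in as `embed`.
import numpy as np


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def sst_sensitivity(embed, spectrogram, reference_embedding, seed=0):
    """Return verification scores with and without temporal structure.

    embed: callable mapping a (frames, bins) array to a fixed-size embedding.
    spectrogram: (frames, bins) array for the test utterance.
    reference_embedding: enrolled embedding of the claimed speaker.
    """
    rng = np.random.default_rng(seed)
    # Shuffle the frame order: spectral content is kept, SST is destroyed.
    shuffled = spectrogram[rng.permutation(spectrogram.shape[0])]
    score_original = cosine(embed(spectrogram), reference_embedding)
    score_shuffled = cosine(embed(shuffled), reference_embedding)
    # If the two scores are nearly identical across a test set, the model's
    # decisions do not depend on supra-segmental temporal information.
    return score_original, score_shuffled
```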
Appears in collections: Publikationen School of Engineering

Files in This Item:
File | Description | Size | Format
2024_Neururer_Deep-learning-networks-for-automatic-speaker-recognition_VoR.pdf | Published Version | 1.28 MB | Adobe PDF
2024_Neururer_Deep-learning-networks-for-automatic-speaker-recognition.pdf | Accepted Version | 6.76 MB | Adobe PDF
APA: Neururer, D., Dellwo, V., & Stadelmann, T. (2024). Deep neural networks for automatic speaker recognition do not learn supra-segmental temporal features. Pattern Recognition Letters, 181, 64–69. https://doi.org/10.1016/j.patrec.2024.03.016
Harvard: Neururer, D., Dellwo, V. and Stadelmann, T. (2024) ‘Deep neural networks for automatic speaker recognition do not learn supra-segmental temporal features’, Pattern Recognition Letters, 181, pp. 64–69. Available at: https://doi.org/10.1016/j.patrec.2024.03.016.
IEEE: D. Neururer, V. Dellwo, and T. Stadelmann, “Deep neural networks for automatic speaker recognition do not learn supra-segmental temporal features,” Pattern Recognition Letters, vol. 181, pp. 64–69, Mar. 2024, doi: 10.1016/j.patrec.2024.03.016.
ISO 690: NEURURER, Daniel, Volker DELLWO and Thilo STADELMANN, 2024. Deep neural networks for automatic speaker recognition do not learn supra-segmental temporal features. Pattern Recognition Letters. 26 March 2024. Vol. 181, pp. 64–69. DOI 10.1016/j.patrec.2024.03.016
Chicago: Neururer, Daniel, Volker Dellwo, and Thilo Stadelmann. 2024. “Deep Neural Networks for Automatic Speaker Recognition Do Not Learn Supra-Segmental Temporal Features.” Pattern Recognition Letters 181 (March): 64–69. https://doi.org/10.1016/j.patrec.2024.03.016.
MLA: Neururer, Daniel, et al. “Deep Neural Networks for Automatic Speaker Recognition Do Not Learn Supra-Segmental Temporal Features.” Pattern Recognition Letters, vol. 181, Mar. 2024, pp. 64–69, https://doi.org/10.1016/j.patrec.2024.03.016.

