Learning embeddings for speaker clustering based on voice equality

Lukic, Yanick X.; Vogt, Carlo; Dürr, Oliver; Stadelmann, Thilo

doi:10.21256/zhaw-3762

Please use this identifier to cite or link to this item: https://doi.org/10.21256/zhaw-3762

Full metadata record

DC Field	Value	Language
dc.contributor.author	Lukic, Yanick X.	-
dc.contributor.author	Vogt, Carlo	-
dc.contributor.author	Dürr, Oliver	-
dc.contributor.author	Stadelmann, Thilo	-
dc.date.accessioned	2018-06-19T12:33:06Z	-
dc.date.available	2018-06-19T12:33:06Z	-
dc.date.issued	2017	-
dc.identifier.isbn	978-1-5090-6341-3	de_CH
dc.identifier.other	INSPEC Accession Number: 17416144	de_CH
dc.identifier.uri	https://digitalcollection.zhaw.ch/handle/11475/7088	-
dc.description.abstract	Recent work has shown that convolutional neural networks (CNNs) trained in a supervised fashion for speaker identification can extract features from spectrograms that are useful for speaker clustering. These features are the activations of a certain hidden layer and are called embeddings. However, previous approaches require plenty of additional speaker data to learn the embedding, and although the resulting clustering performance is on par with more traditional approaches using MFCC features and the like, there is room for improvement: these embeddings are trained on a surrogate task, namely identifying a few specific speakers, that is rather far removed from separating unknown voices. We address both problems by training a CNN, using weakly labeled data, to extract embeddings that are similar for utterances of the same speaker regardless of that speaker's specific identity. We demonstrate our approach on the well-known TIMIT dataset, which has often been used for speaker clustering experiments in the past. We exceed the clustering performance of all previous approaches while requiring just 100 instead of 590 unrelated speakers to learn an embedding suited for clustering.	de_CH
dc.language.iso	en	de_CH
dc.publisher	IEEE	de_CH
dc.rights	Licence according to publishing contract	de_CH
dc.subject	Datalab	de_CH
dc.subject	Speaker recognition	de_CH
dc.subject	Deep learning	de_CH
dc.subject	Speaker clustering	de_CH
dc.subject.ddc	006: Special computer methods	de_CH
dc.title	Learning embeddings for speaker clustering based on voice equality	de_CH
dc.type	Conference: Paper	de_CH
dcterms.type	Text	de_CH
zhaw.departement	School of Engineering	de_CH
zhaw.organisationalunit	Institut für Informatik (InIT)	de_CH
zhaw.organisationalunit	Institut für Datenanalyse und Prozessdesign (IDP)	de_CH
dc.identifier.doi	10.21256/zhaw-3762	-
dc.identifier.doi	10.1109/MLSP.2017.8168166	de_CH
zhaw.conference.details	27th IEEE International Workshop on Machine Learning for Signal Processing (MLSP 2017), Tokyo, 25-28 September 2017	de_CH
zhaw.funding.eu	No	de_CH
zhaw.originated.zhaw	Yes	de_CH
zhaw.publication.status	acceptedVersion	de_CH
zhaw.publication.review	Peer review (publication)	de_CH
zhaw.title.proceedings	2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP)	de_CH
zhaw.webfeed	Datalab	de_CH
zhaw.webfeed	Information Engineering	de_CH
zhaw.webfeed	Machine Perception and Cognition	de_CH
Appears in collections:	Publikationen School of Engineering
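
The abstract's central idea is to train on same/different-speaker information instead of on a fixed set of speaker identities. As an illustration only, the sketch below uses the classic margin-based contrastive loss as a stand-in pairwise objective; the abstract does not state which loss the authors use, and the function names, the Euclidean distance, the margin of 1.0 and the toy 2-D embeddings are all assumptions. In a real setup the embeddings would be hidden-layer activations of a CNN applied to spectrograms, and the loss would be minimized with a deep learning framework.

```python
import itertools
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def make_pairs(labels: Sequence[object]) -> List[Tuple[int, int, int]]:
    """Turn weak speaker labels into (i, j, same) training pairs.

    Only label equality is used, so the labels can be arbitrary
    per-recording IDs rather than a fixed set of identities.
    """
    return [
        (i, j, 1 if labels[i] == labels[j] else 0)
        for i, j in itertools.combinations(range(len(labels)), 2)
    ]


def contrastive_loss(a: Vector, b: Vector, same: int, margin: float = 1.0) -> float:
    """Margin-based contrastive loss for one pair (stand-in objective).

    Same-speaker pairs are penalized by their squared distance;
    different-speaker pairs only while they are closer than `margin`.
    """
    d = euclidean(a, b)
    if same:
        return d ** 2
    return max(0.0, margin - d) ** 2


def batch_loss(embeddings: Sequence[Vector], labels: Sequence[object],
               margin: float = 1.0) -> float:
    """Mean contrastive loss over all pairs in a batch."""
    pairs = make_pairs(labels)
    if not pairs:
        return 0.0
    total = sum(
        contrastive_loss(embeddings[i], embeddings[j], same, margin)
        for i, j, same in pairs
    )
    return total / len(pairs)


if __name__ == "__main__":
    labels = ["spk_a", "spk_a", "spk_b", "spk_b"]  # hypothetical weak labels
    good = [[0.0, 0.0], [0.0, 0.0], [3.0, 0.0], [3.0, 0.0]]  # grouped by speaker
    bad = [[0.0, 0.0], [3.0, 0.0], [0.0, 0.0], [3.0, 0.0]]   # speakers mixed up
    print(batch_loss(good, labels))  # 0.0
    print(batch_loss(bad, labels))   # 20 / 6, about 3.33
```

Because `make_pairs` only compares labels for equality, renaming the speakers leaves the loss unchanged, which mirrors the abstract's point that the embedding is learned regardless of each speaker's specific identity.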

Files in This Item:

File	Description	Size	Format
MLSP_2017.pdf		1.37 MB	Adobe PDF

Lukic, Y. X., Vogt, C., Dürr, O., & Stadelmann, T. (2017). Learning embeddings for speaker clustering based on voice equality. 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP). https://doi.org/10.21256/zhaw-3762