CEASR : a corpus for evaluating automatic speech recognition

Ulasik, Malgorzata Anna; Hürlimann, Manuela; Germann, Fabian; Gedik, Esin; Benites de Azevedo e Souza, Fernando; Cieliebak, Mark

doi:10.21256/zhaw-20125

Please use this identifier to cite or link to this item: https://doi.org/10.21256/zhaw-20125

Full metadata record

DC Field	Value	Language
dc.contributor.author	Ulasik, Malgorzata Anna	-
dc.contributor.author	Hürlimann, Manuela	-
dc.contributor.author	Germann, Fabian	-
dc.contributor.author	Gedik, Esin	-
dc.contributor.author	Benites de Azevedo e Souza, Fernando	-
dc.contributor.author	Cieliebak, Mark	-
dc.date.accessioned	2020-06-08T08:08:17Z	-
dc.date.available	2020-06-08T08:08:17Z	-
dc.date.issued	2020	-
dc.identifier.isbn	979-10-95546-34-4	de_CH
dc.identifier.uri	https://www.aclweb.org/anthology/2020.lrec-1.798	de_CH
dc.identifier.uri	https://digitalcollection.zhaw.ch/handle/11475/20125	-
dc.description.abstract	In this paper, we present CEASR, a Corpus for Evaluating ASR quality. It is a data set derived from public speech corpora, containing manual transcripts enriched with metadata along with transcripts generated by several modern state-of-the-art ASR systems. CEASR provides this data in a unified structure, consistent across all corpora and systems, with normalised transcript texts and metadata. We then use CEASR to evaluate the quality of ASR systems on the basis of their Word Error Rate (WER). Our experiments show, among other results, a substantial difference in quality between commercial and open-source ASR tools, and differences of up to a factor of ten for single systems on different corpora. Using CEASR, we obtained these results efficiently and easily. This shows that our corpus enables researchers to perform ASR-related evaluations and various in-depth analyses with noticeably reduced effort, without needing to collect, process and transcribe the speech data themselves.	de_CH
dc.language.iso	en	de_CH
dc.publisher	European Language Resources Association	de_CH
dc.rights	http://creativecommons.org/licenses/by-nc/4.0/	de_CH
dc.subject	Automatic speech recognition	de_CH
dc.subject	Evaluation	de_CH
dc.subject	Speech corpus	de_CH
dc.subject	ASR system	de_CH
dc.subject.ddc	006: Special computer methods	de_CH
dc.title	CEASR : a corpus for evaluating automatic speech recognition	de_CH
dc.type	Conference: Paper	de_CH
dcterms.type	Text	de_CH
zhaw.departement	School of Engineering	de_CH
zhaw.organisationalunit	Institut für Informatik (InIT)	de_CH
dc.identifier.doi	10.21256/zhaw-20125	-
zhaw.conference.details	12th Language Resources and Evaluation Conference (LREC), Marseille, France, 11-16 May 2020	de_CH
zhaw.funding.eu	No	de_CH
zhaw.originated.zhaw	Yes	de_CH
zhaw.pages.end	6485	de_CH
zhaw.pages.start	6477	de_CH
zhaw.parentwork.editor	Calzolari, Nicoletta	-
zhaw.parentwork.editor	Béchet, Frédéric	-
zhaw.parentwork.editor	Blache, Philippe	-
zhaw.parentwork.editor	Choukri, Khalid	-
zhaw.parentwork.editor	Cieri, Christopher	-
zhaw.parentwork.editor	Declerck, Thierry	-
zhaw.parentwork.editor	Goggi, Sara	-
zhaw.parentwork.editor	Isahara, Hitoshi	-
zhaw.parentwork.editor	Maegaard, Bente	-
zhaw.parentwork.editor	Mariani, Joseph	-
zhaw.parentwork.editor	Mazo, Hélène	-
zhaw.parentwork.editor	Moreno, Asuncion	-
zhaw.parentwork.editor	Odijk, Jan	-
zhaw.parentwork.editor	Piperidis, Stelios	-
zhaw.publication.status	publishedVersion	de_CH
zhaw.publication.review	Peer review (publication)	de_CH
zhaw.title.proceedings	Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)	de_CH
zhaw.webfeed	Software Systems	de_CH
zhaw.webfeed	Natural Language Processing	de_CH
zhaw.author.additional	No	de_CH
zhaw.display.portrait	Yes	de_CH
Appears in collections:	Publications School of Engineering

Files in This Item:

File	Description	Size	Format
2020_Ulasik-etal_CEASR_LREC.pdf		733.4 kB	Adobe PDF

Ulasik, M. A., Hürlimann, M., Germann, F., Gedik, E., Benites de Azevedo e Souza, F., & Cieliebak, M. (2020). CEASR : a corpus for evaluating automatic speech recognition [Conference paper]. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020) (pp. 6477–6485). European Language Resources Association. https://doi.org/10.21256/zhaw-20125

Ulasik, M.A. et al. (2020) ‘CEASR : a corpus for evaluating automatic speech recognition’, in N. Calzolari et al. (eds) Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020). European Language Resources Association, pp. 6477–6485. Available at: https://doi.org/10.21256/zhaw-20125.

M. A. Ulasik, M. Hürlimann, F. Germann, E. Gedik, F. Benites de Azevedo e Souza, and M. Cieliebak, “CEASR : a corpus for evaluating automatic speech recognition,” in Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), 2020, pp. 6477–6485. doi: 10.21256/zhaw-20125.

ULASIK, Malgorzata Anna, Manuela HÜRLIMANN, Fabian GERMANN, Esin GEDIK, Fernando BENITES DE AZEVEDO E SOUZA and Mark CIELIEBAK, 2020. CEASR : a corpus for evaluating automatic speech recognition. In: Nicoletta CALZOLARI, Frédéric BÉCHET, Philippe BLACHE, Khalid CHOUKRI, Christopher CIERI, Thierry DECLERCK, Sara GOGGI, Hitoshi ISAHARA, Bente MAEGAARD, Joseph MARIANI, Hélène MAZO, Asuncion MORENO, Jan ODIJK and Stelios PIPERIDIS (eds.), Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020) [online]. Conference paper. European Language Resources Association. 2020. pp. 6477–6485. ISBN 979-10-95546-34-4. Available at: https://www.aclweb.org/anthology/2020.lrec-1.798

Ulasik, Malgorzata Anna, Manuela Hürlimann, Fabian Germann, Esin Gedik, Fernando Benites de Azevedo e Souza, and Mark Cieliebak. 2020. “CEASR : A Corpus for Evaluating Automatic Speech Recognition.” Conference paper. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), edited by Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, et al., 6477–85. European Language Resources Association. https://doi.org/10.21256/zhaw-20125.

Ulasik, Malgorzata Anna, et al. “CEASR : A Corpus for Evaluating Automatic Speech Recognition.” Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), edited by Nicoletta Calzolari et al., European Language Resources Association, 2020, pp. 6477–85, https://doi.org/10.21256/zhaw-20125.