Publication type: Master thesis
Title: Speech recognition component for search-oriented conversational artificial intelligence
Authors: Büchi, Matthias
Advisors / Reviewers: Hutter, Hans-Peter; Cieliebak, Mark
Extent: 62
Issue Date: 2020
Publisher / Ed. Institution: ZHAW Zürcher Hochschule für Angewandte Wissenschaften
Publisher / Ed. Institution: Winterthur
Language: English
Subjects: Automatic speech recognition; Conversational agent
Subject (DDC): 006: Special computer methods
Abstract: User experience is key to making a computer program successful. If operating a program requires a lot of expertise, people will not use it. Ideally, the user does not need to learn new procedures to control a new application. Conversational agents try to achieve this by providing a natural-language user interface, and spoken natural language can simplify the interaction even further. Building a conversational agent with spoken natural language requires a reliable speech recognition system. This work explores different aspects of automatic speech recognition (ASR) for use with a conversational agent whose goal is to support people in legal research by finding the correct information based on the user’s input. Training a speech recognition system requires data. In a first step, two ways of collecting text data are explored; the text serves as the basis for recording speech data. With a grammar-based approach, manually crafted rules are used to generate sentences. Since grammars are restricted in variation, neural question generation was evaluated as a way to produce open questions from specific input texts. In a next step, the performance of ASR systems was tested on task- and domain-specific data, using speech recorded from the generated text. Due to limited time and resources, data was recorded from only one speaker. Since there was not enough data for further experiments on task-specific scenarios, open source German datasets were used to implement and improve acoustic models for generic speech recognition. Several aspects influence the final result when building a speech recognition component for a conversational agent. Text generation for training language models or collecting speech data still needs grammar-based approaches for reliable results, because neural question generation produces too many invalid samples.
Nevertheless, text generated with grammars can be employed to record speech and train language models. When adapted using specific language models, open source ASR systems achieve results similar to, or even better than, those of commercial systems. For data with a very specific structure, open source systems can outperform commercial systems by about 30 percentage points in absolute word error rate. For the acoustic model, different approaches are feasible. Hybrid and end-to-end systems achieve similar results, with the hybrid system still slightly better. End-to-end systems make adaptation to domain-specific use cases easier, since no phonetic transcriptions are needed. Going a step further, an end-to-end system can be trained on character n-grams instead of single characters only. Models trained to predict tokens generated with byte pair encoding perform similarly to models based on single characters. Once complex decoding strategies and language models are integrated, however, character-based models still perform better.
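The abstract reports an absolute difference in word error rate (WER). As a reading aid, the following is a minimal sketch of how WER is commonly computed (word-level edit distance divided by the number of reference words) and how an absolute gap differs from a relative one. The function names, the example sentences, and the resulting numbers are illustrative assumptions; they are not taken from the thesis and do not reproduce its results.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def absolute_gap(wer_a: float, wer_b: float) -> float:
    """Absolute WER difference, expressed in percentage points."""
    return (wer_a - wer_b) * 100


def relative_gap(wer_a: float, wer_b: float) -> float:
    """WER reduction of system B relative to system A, in percent."""
    return (wer_a - wer_b) / wer_a * 100


# Hypothetical transcripts, invented for illustration only.
reference = "zeige mir urteile zum mietrecht"
system_a = "zeige mir urteil zum recht"       # two substituted words
system_b = "zeige mir urteile zum mietrecht"  # exact match

wer_a = word_error_rate(reference, system_a)
wer_b = word_error_rate(reference, system_b)
print(f"WER A: {wer_a:.2f}, WER B: {wer_b:.2f}")
print(f"absolute gap: {absolute_gap(wer_a, wer_b):.0f} percentage points")
print(f"relative gap: {relative_gap(wer_a, wer_b):.0f} %")
```

An absolute gap subtracts the two error rates directly, whereas a relative gap divides that difference by the baseline error rate, so the same pair of systems can yield very different headline numbers depending on which convention is used.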
License (according to publishing contract): Licence according to publishing contract
Department: School of Engineering
Organisational Unit: Institute of Applied Information Technology (InIT)
Appears in collections: Publikationen School of Engineering

