Please use this identifier to cite or link to this item:
https://doi.org/10.21256/zhaw-20319
Publication type: | Conference paper |
Type of review: | Peer review (publication) |
Title: | A methodology for creating question answering corpora using inverse data annotation |
Authors: | Deriu, Jan Milan; Mlynchyk, Katsiaryna; Schläpfer, Philippe; Rodrigo, Alvaro; von Grünigen, Dirk; Kaiser, Nicolas; Stockinger, Kurt; Agirre, Eneko; Cieliebak, Mark |
et al.: | No |
DOI: | 10.18653/v1/2020.acl-main.84; 10.21256/zhaw-20319 |
Proceedings: | Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics |
Pages: | 897 |
Pages to: | 911 |
Conference details: | ACL 2020, Virtual, 5-10 July 2020 |
Issue Date: | Jul-2020 |
Publisher / Ed. Institution: | Association for Computational Linguistics |
Language: | English |
Subjects: | Natural language interface to database; Artificial intelligence; Deep learning; Semantic parsing |
Subject (DDC): | 004: Computer science; 400: Language, linguistics |
Abstract: | In this paper, we introduce a novel methodology to efficiently construct a corpus for question answering over structured data. For this, we introduce an intermediate representation that is based on the logical query plan in a database, called Operation Trees (OT). This representation allows us to invert the annotation process without losing flexibility in the types of queries that we generate. Furthermore, it allows for fine-grained alignment of the tokens to the operations. Thus, we randomly generate OTs from a context-free grammar and annotators just have to write the appropriate question and assign the tokens. We compare our corpus OTTA (Operation Trees and Token Assignment), a large semantic parsing corpus for evaluating natural language interfaces to databases, to Spider and LC-QuaD 2.0 and show that our methodology more than triples the annotation speed while maintaining the complexity of the queries. Finally, we train a state-of-the-art semantic parsing model on our data and show that our dataset is a challenging dataset and that the token alignment can be leveraged to significantly increase the performance. |
URI: | https://digitalcollection.zhaw.ch/handle/11475/20319 |
Fulltext version: | Published version |
License (according to publishing contract): | CC BY 4.0: Attribution 4.0 International |
Department: | School of Engineering |
Organisational Unit: | Institute of Applied Information Technology (InIT) |
Published as part of the ZHAW project: | LIHLITH - Learning to Interact with Humans by Lifelong Interaction with Humans; EU Horizon 2020: INODE - Intelligent Open Data Exploration |
Appears in collections: | Publikationen School of Engineering |
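
The abstract describes randomly generating Operation Trees from a context-free grammar, which annotators then verbalise as questions. A minimal Python sketch of that general idea follows. The grammar, the operation names (`project`, `count`, `filter`, `table`), and the functions `generate` and `leaves` are hypothetical illustrations and are not taken from the paper or from the OTTA corpus.

```python
import random

# Hypothetical toy grammar (an assumption, NOT the grammar from the paper).
# Keys are nonterminals; lowercase symbols that are not keys are terminals.
# The last rule of each nonterminal is chosen so that it leads to termination.
GRAMMAR = {
    "Query": [["Projection"], ["Aggregate"]],
    "Projection": [["project", "Source"]],
    "Aggregate": [["count", "Source"]],
    "Source": [["Filter"], ["Table"]],
    "Filter": [["filter", "Source"]],
    "Table": [["table"]],
}


def generate(symbol, rng, depth=0, max_depth=6):
    """Expand `symbol` into a random derivation tree.

    Terminals are returned as strings; nonterminals as (symbol, children).
    Once `max_depth` is reached, the last rule is always taken so that the
    recursion through Source -> Filter -> Source cannot continue forever.
    """
    if symbol not in GRAMMAR:
        return symbol
    rules = GRAMMAR[symbol]
    rule = rules[-1] if depth >= max_depth else rng.choice(rules)
    return (symbol, [generate(s, rng, depth + 1, max_depth) for s in rule])


def leaves(tree):
    """Return the terminal operations of a tree in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree[1] for leaf in leaves(child)]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so that repeated runs give the same tree
    print(leaves(generate("Query", rng)))
```

In this sketch an annotator would be shown the generated tree and asked to write a matching question. Aligning question tokens to individual operations, as the abstract describes, is not modelled here.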
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
2020_Deriu-etal_Question-answering-corpora-inverse-data-annotation.pdf | | 556.6 kB | Adobe PDF |