FormulaNet : a benchmark dataset for mathematical formula detection

Schmitt-Koopmann, Felix M.; Huang, Elaine M.; Hutter, Hans-Peter; Stadelmann, Thilo; Darvishy, Alireza

doi:10.1109/ACCESS.2022.3202639

Please use this identifier to cite or link to this item: https://doi.org/10.21256/zhaw-25554

Publication type:	Article in scientific journal
Type of review:	Peer review (publication)
Title:	FormulaNet : a benchmark dataset for mathematical formula detection
Authors:	Schmitt-Koopmann, Felix M.; Huang, Elaine M.; Hutter, Hans-Peter; Stadelmann, Thilo; Darvishy, Alireza
et al.:	No
DOI:	10.1109/ACCESS.2022.3202639; 10.21256/zhaw-25554
Published in:	IEEE Access
Volume(Issue):	10
Page(s):	91588
Pages to:	91596
Issue Date:	2022
Publisher / Ed. Institution:	IEEE
ISSN:	2169-3536
Language:	English
Subjects:	Automatic annotation; Dataset; Document analysis; Deep learning; Mathematical formula detection; Page object detection
Subject (DDC):	005: Computer programming, programs and data
Abstract:	One unsolved sub-task of document analysis is mathematical formula detection (MFD). Research by ourselves and others has shown that existing MFD datasets with inline and display formula labels are small and have insufficient labeling quality. There is therefore an urgent need for datasets with better quality labeling for future research in the MFD field, as they have a high impact on the performance of the models trained on them. We present an advanced labeling pipeline and a new dataset called FormulaNet in this paper. At over 45k pages, we believe that FormulaNet is the largest MFD dataset with inline formula labels. Our experiments demonstrate substantially improved labeling quality for inline and display formulae detection over existing datasets. Additionally, we provide a math formula detection baseline for FormulaNet with an mAP of 0.754. Our dataset is intended to help address the MFD task and may enable the development of new applications, such as making mathematical formulae accessible in PDFs for visually impaired screen reader users.
URI:	https://digitalcollection.zhaw.ch/handle/11475/25554
Fulltext version:	Published version
License (according to publishing contract):	CC BY 4.0: Attribution 4.0 International
Department:	School of Engineering
Organisational Unit:	Centre for Artificial Intelligence (CAI); Institute of Computer Science (InIT)
Appears in collections:	Publikationen School of Engineering

Files in This Item:

File	Description	Size	Format
2022_SchmittKoopmann-etal_FormulaNet-Benchmark-Dataset-Mathematical-Formula-Detection.pdf		1.35 MB	Adobe PDF

Schmitt-Koopmann, F. M., Huang, E. M., Hutter, H.-P., Stadelmann, T., & Darvishy, A. (2022). FormulaNet : a benchmark dataset for mathematical formula detection. IEEE Access, 10, 91588–91596. https://doi.org/10.1109/ACCESS.2022.3202639

Schmitt-Koopmann, F.M. et al. (2022) ‘FormulaNet : a benchmark dataset for mathematical formula detection’, IEEE Access, 10, pp. 91588–91596. Available at: https://doi.org/10.1109/ACCESS.2022.3202639.

F. M. Schmitt-Koopmann, E. M. Huang, H.-P. Hutter, T. Stadelmann, and A. Darvishy, “FormulaNet : a benchmark dataset for mathematical formula detection,” IEEE Access, vol. 10, pp. 91588–91596, 2022, doi: 10.1109/ACCESS.2022.3202639.

SCHMITT-KOOPMANN, Felix M., Elaine M. HUANG, Hans-Peter HUTTER, Thilo STADELMANN and Alireza DARVISHY, 2022. FormulaNet : a benchmark dataset for mathematical formula detection. IEEE Access. 2022. Vol. 10, pp. 91588–91596. DOI 10.1109/ACCESS.2022.3202639

Schmitt-Koopmann, Felix M., Elaine M. Huang, Hans-Peter Hutter, Thilo Stadelmann, and Alireza Darvishy. 2022. “FormulaNet : A Benchmark Dataset for Mathematical Formula Detection.” IEEE Access 10: 91588–96. https://doi.org/10.1109/ACCESS.2022.3202639.

Schmitt-Koopmann, Felix M., et al. “FormulaNet : A Benchmark Dataset for Mathematical Formula Detection.” IEEE Access, vol. 10, 2022, pp. 91588–96, https://doi.org/10.1109/ACCESS.2022.3202639.