Please use this identifier to cite or link to this item: https://doi.org/10.21256/zhaw-3197
Title: Entity matching on unstructured data : an active learning approach
Authors: Brunner, Ursin; Stockinger, Kurt
Proceedings: 2019 6th Swiss Conference on Data Science (SDS)
Conference details: 6th Swiss Conference on Data Science, Bern, 14 June 2019
Publisher / Ed. Institution: IEEE
Issue Date: 14-Jun-2019
License (according to publishing contract): Not specified
Type of review: Peer review (publication)
Language: English
Subjects: Entity matching; Active learning; Data integration; Unstructured data
Subject (DDC): 005: Computer programming, programs and data
Abstract: With the growing number of data sources in enterprises, entity matching becomes a crucial part of every data integration project. To reduce the human effort involved in identifying matching entities between different database tables, machine learning algorithms are typically applied. Moreover, active learning is often combined with supervised machine learning methods to further reduce the effort of labeling entities as true or false matches. However, while state-of-the-art active learning algorithms have proven to work well on structured data sets, unstructured data still poses a challenge in entity matching. This paper proposes an end-to-end entity matching pipeline to minimize the human labeling effort for entity matching on unstructured data sets. We use several natural language processing techniques such as soft tf-idf to pre-process the record pairs before we classify them using a novel Active Learning with Uncertainty Sampling (ALWUS) algorithm. We designed our algorithm as a plugin system to work with any state-of-the-art classifier such as support vector machines, random forests or deep neural networks. Detailed experimental results demonstrate that our end-to-end entity matching pipeline clearly outperforms comparable entity matching approaches on an unstructured real-world data set. Our approach achieves significantly better F1-scores while requiring one to two orders of magnitude less human labeling effort than existing state-of-the-art algorithms.
Department: School of Engineering
Organisational Unit: Institute of Applied Information Technology (InIT)
Publication type: Conference paper
DOI: 10.21256/zhaw-3197
ISBN: 978-1-7281-3105-4
URI: https://digitalcollection.zhaw.ch/handle/11475/17388
Appears in Collections: Publikationen School of Engineering
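
The abstract describes active learning with uncertainty sampling around a pluggable classifier. As a rough illustration of that general idea only, and not the paper's ALWUS algorithm, the following sketch repeatedly asks for labels on the record pairs the current model is least sure about. All names here (TinyLogistic, most_uncertain, active_learn) are hypothetical, the single feature per pair is a made-up similarity score standing in for something like soft tf-idf, and the labeling oracle is simulated.

```python
import math
import random


class TinyLogistic:
    """Stand-in classifier; any model with fit() and predict_proba() could be plugged in."""

    def __init__(self, lr=0.5, epochs=200):
        self.lr, self.epochs = lr, epochs
        self.w, self.b = [], 0.0

    def predict_proba(self, x):
        z = sum(w * v for w, v in zip(self.w, x)) + self.b
        z = max(-30.0, min(30.0, z))  # clamp so exp() stays well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def fit(self, X, y):
        self.w, self.b = [0.0] * len(X[0]), 0.0
        for _ in range(self.epochs):
            for xi, yi in zip(X, y):
                err = self.predict_proba(xi) - yi  # log-loss gradient w.r.t. z
                self.w = [w - self.lr * err * v for w, v in zip(self.w, xi)]
                self.b -= self.lr * err
        return self


def most_uncertain(model, pool, k):
    """Indices of the k pool items whose match probability is closest to 0.5."""
    ranked = sorted(range(len(pool)),
                    key=lambda i: abs(model.predict_proba(pool[i]) - 0.5))
    return ranked[:k]


def active_learn(model, features, oracle, seed_idx, rounds, batch_size):
    """Train, query labels for the most uncertain pairs, and repeat."""
    labeled = {i: oracle(i) for i in seed_idx}
    for _ in range(rounds):
        idx = list(labeled)
        model.fit([features[i] for i in idx], [labeled[i] for i in idx])
        unlabeled = [i for i in range(len(features)) if i not in labeled]
        if not unlabeled:
            break
        picks = most_uncertain(model, [features[i] for i in unlabeled], batch_size)
        for p in picks:
            labeled[unlabeled[p]] = oracle(unlabeled[p])  # simulated human label
    idx = list(labeled)
    model.fit([features[i] for i in idx], [labeled[i] for i in idx])
    return model, labeled


# Synthetic demo: one assumed similarity score per record pair; a pair counts
# as a true match when the score exceeds 0.6 (an arbitrary illustrative rule).
rng = random.Random(0)
pairs = [[rng.random()] for _ in range(200)]
truth = [1 if p[0] > 0.6 else 0 for p in pairs]
seed = [truth.index(0), truth.index(1)]  # one non-match and one match to start
model, labeled = active_learn(TinyLogistic(), pairs, lambda i: truth[i],
                              seed, rounds=5, batch_size=2)
print(len(labeled), "of", len(pairs), "pairs were labeled")
```

Because the loop only touches the model through fit() and predict_proba(), swapping in a different classifier would mean passing another object with those two methods; that mirrors the plugin idea from the abstract only in spirit.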

Files in This Item:
File: ActiveLearning_Brunner_Stockinger_SDS_2019.pdf
Description: Entity Matching on Unstructured Data: An Active Learning Approach
Size: 221.35 kB
Format: Adobe PDF

