Document Type

Article

Publication Date

4-1-2025

Journal / Book Title

Machine Learning

Abstract

Synthetic data has been actively used for various machine learning-based tasks due to benefits such as massive reproducibility and privacy enhancement compared to using the original data. The quality of a generated synthetic dataset crucially depends on the quality of the original data, and the latter is often corrupted by label noise. While there have been studies on feature noise, how label noise affects synthetic data generation is under-explored. In this paper, we evaluate the impact of noisy labels on synthetic data generation, with a focus on tabular data. One challenge is how to evaluate the quality of synthetic data under label noise. To this end, we design comprehensive experiments to measure the impact of label noise on synthetic data generation in different aspects: synthetic data quality, data utility, and convergence of training for both synthesizers and machine learning models for downstream tasks. The empirical results cover a wide range of aspects of synthetic data generation under label noise. They show that quality and utility degrade with higher noise levels, while no significant effect on synthesizer convergence is observed.
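To make the notion of a "noise level" concrete, the following is a minimal sketch of one common way to corrupt labels before training a synthesizer: symmetric (uniform) label flipping. The abstract does not state which noise model the study uses, so the noise model, the function name `inject_symmetric_label_noise`, and its parameters are assumptions made for illustration only.

```python
import random


def inject_symmetric_label_noise(labels, noise_rate, classes=None, seed=0):
    """Return a copy of `labels` with a fraction `noise_rate` flipped.

    Hypothetical helper: each selected label is replaced by a different
    class chosen uniformly at random (symmetric noise). This is an
    assumed noise model, not necessarily the one used in the article.
    """
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must be in [0, 1]")
    classes = sorted(set(labels)) if classes is None else list(classes)
    if len(classes) < 2:
        raise ValueError("need at least two classes to flip labels")

    rng = random.Random(seed)
    n_flip = round(noise_rate * len(labels))
    # Pick which rows to corrupt, without replacement.
    flip_idx = rng.sample(range(len(labels)), n_flip)

    noisy = list(labels)
    for i in flip_idx:
        # Choose uniformly among the classes other than the true one.
        alternatives = [c for c in classes if c != labels[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy


# Toy binary label column with 100 rows.
clean = [0, 1] * 50
noisy = inject_symmetric_label_noise(clean, noise_rate=0.2)
observed_rate = sum(a != b for a, b in zip(clean, noisy)) / len(clean)
```

In an experiment of the kind the abstract describes, such a function would be applied to the label column of the original table at several noise rates, and a synthesizer would then be trained on each corrupted copy so that quality and utility can be compared across noise levels.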

Comments

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

DOI

10.1007/s10994-024-06629-5

Journal ISSN / Book ISBN

85218410360 (Scopus)

Published Citation

Kim, J., Huang, C. & Liu, X. An empirical study on impact of label noise on synthetic tabular data generation. Mach Learn 114, 90 (2025). https://doi.org/10.1007/s10994-024-06629-5
