Machine learning data practices through a data curation lens: An evaluation framework

Published in the ACM Conference on Fairness, Accountability, and Transparency (FAccT), 2024

This paper examines data practices in machine learning dataset development through the lens of data curation. We develop a rubric-based framework for evaluating machine learning datasets against data curation concepts and principles. Through a mixed-methods analysis of evaluation results for 25 ML datasets, we study how feasible it is to adopt data curation principles in machine learning data work in practice, and we explore how data curation is currently performed.