About Me
My doctoral research examines data practices in machine learning (ML), particularly the processes by which data is collected, processed, managed, and designed for use in ML models. I argue that by examining these practices as data curation, ML research can benefit from more effective dataset development that supports transparent, fair, and accountable practices and outcomes.
ML research has surfaced concerns about biased results [1-5], the obfuscation of the crowdwork and labour that enable AI [6-9], a lack of accountability and transparency in how models are developed and deployed as well as in their outcomes [10-14], and environmental costs [15-17]. Many of these widely recognized challenges of ML systems can be traced back to how data becomes AI. Studies call for ethical data curation [18] and for methods from archival studies, as these fields have long dealt with large volumes of data and with concerns of representativeness, ethics, and integrity [19-21]. Data curation, rooted in librarianship and archives, is an information science discipline that can inform how ML research more actively documents and stewards its datasets.
In my research, I apply concepts, principles, and methods from the field of data curation to make ML data practices more rigorous, reflexive, and responsible. The lines of inquiry I pursue build on anti-racist and intersectional feminist scholarship and on critical data studies [18,22]. On this basis, I have developed a framework for systematically evaluating ML dataset development processes through an explicit data curation lens.
My research, published at FAccT [23] and NeurIPS [24], critically examines the norms and standards that underpin the state of the art in ML. Building on recent scholarship [22,25,26] and drawing on data justice [27] and environmental justice [28,29], I investigate how the field's inherent values and politics predispose researchers to prioritize certain model work over data work [30,31], and how that prioritization affects equity in data practices and outcomes [32].
I have led a team of researchers in assessing how the NeurIPS Datasets and Benchmarks track currently performs data curation activities, applying my evaluation framework to summarize its strengths and areas for improvement. Through 1) the operationalization of data curation concepts for ML and 2) the analysis of current curation practices at NeurIPS, my research provides targeted approaches for improving documentation and reflexivity in the ML dataset development process. To me, the most impactful of these contributions is a set of practical strategies and resources that help dataset creators consider the proportionality principle when creating their datasets, write positionality statements, and take a more reflexive approach to broader impact statements, for example by considering how field epistemologies may shape their framing and methods.
My work assesses and critiques the cutting edge of data science through the lens of a reference discipline, but in the spirit of 'critical friendship' [28]: it aims to advance the state of the art through constructive criticism rather than condemn it. Bringing in perspectives unique to data curation offers novel guidance for more equitable dataset creation methods in ML. While this connection has previously been discussed in theory, my work translates the concepts into actionable strategies and practical resources. The outcome of this critical friendship between data curation and machine learning will be a higher standard of quality, ethicality, and human oversight for new datasets, fostering greater scientific integrity, alignment, and advancement.
References
1. Z. Ahmed, B. Vidgen, S. A. Hale, EPJ Data Sci. 11, 8 (2022).
2. T. Bolukbasi, K.-W. Chang, J. Y. Zou, V. Saligrama, A. T. Kalai, in Advances in Neural Information Processing Systems (2016), vol. 29.
3. L. A. Hendricks, K. Burns, K. Saenko, T. Darrell, A. Rohrbach, in Computer Vision – ECCV 2018, pp. 793–811.
4. H. Ledford, Nature. 574, 608–609 (2019).
5. M. Tomalin, B. Byrne, S. Concannon, D. Saunders, S. Ullmann, Ethics Inf. Technol. 23, 419–433 (2021).
6. A. A. Casilli, J. Posada, in Society and the Internet: How Networks of Information and Communication are Changing Our Lives, M. Graham, W. H. Dutton, Eds. (Oxford University Press, 2019).
7. R. Gorwa, Inf. Commun. Soc. 22, 854–871 (2019).
8. M. Miceli, J. Posada, Proc. ACM Hum.-Comput. Interact. 6, 1–37 (2022).
9. P. Tubaro, A. A. Casilli, M. Coville, Big Data Soc. 7, 2053951720919776 (2020).
10. N. Diakopoulos, Commun. ACM. 59, 56–62 (2016).
11. B. Hutchinson et al., in Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 560–575.
12. M. Khan, A. Hanna, forthcoming, 19 Ohio St. Tech. L.J. (2023), doi:10.2139/ssrn.4217148.
13. I. D. Raji et al., in Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 33–44.
14. M. Veale, M. Van Kleek, R. Binns, in Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 1–14.
15. E. M. Bender, T. Gebru, A. McMillan-Major, S. Shmitchell, in Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 610–623.
16. S. Luccioni, B. Trevelin, M. Mitchell, The Environmental Impacts of AI – Primer. Hugging Face Blog (2024), available at https://huggingface.co/blog/sasha/ai-environment-primer.
17. G. Varoquaux, A. S. Luccioni, M. Whittaker, in Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency.
18. S. Leavy, E. Siapera, B. O'Sullivan, in Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pp. 695–703.
19. G. Colavizza, T. Blanke, C. Jeurgens, J. Noordegraaf, J. Comput. Cult. Herit. 15, 1–15 (2022).
20. E. S. Jo, T. Gebru, in Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 306–316.
21. N. B. Thylstrup, Media Cult. Soc. 44, 655–671 (2022).
22. A. Paullada, I. D. Raji, E. M. Bender, E. Denton, A. Hanna, Patterns. 2, 100336 (2021).
23. E. Bhardwaj et al., in 2024 ACM Conference on Fairness, Accountability, and Transparency, pp. 1055–1067.
24. E. Bhardwaj et al., in Advances in Neural Information Processing Systems (NeurIPS) (2024), vol. 37.
25. M. K. Scheuerman, A. Hanna, E. Denton, Proc. ACM Hum.-Comput. Interact. 5, 1–37 (2021).
26. N. B. Thylstrup, Media Cult. Soc. 44, 655–671 (2022).
27. L. Taylor, Big Data Soc. 4, 2053951717736335 (2017).
28. C. Becker, Insolvent: How to Reorient Computing for Just Sustainability (MIT Press, 2023).
29. B. Rakova, R. Dobbe, in Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, p. 491.
30. B. Green, "Data Science as Political Action: Grounding Data Science in a Politics of Justice" (SSRN Scholarly Paper ID 3658431, Social Science Research Network, Rochester, NY, 2020).
31. N. Sambasivan et al., in Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–15.
32. M. Miceli, J. Posada, T. Yang, Proc. ACM Hum.-Comput. Interact. 6, 1–14 (2022).