What Is the Depth of Expertise of an AI Training Dataset?
I use “depth of expertise” as a data quality dimension for AI training datasets. It describes how much of the expertise in a knowledge domain a dataset reflects. This is not a common data quality dimension in other contexts, and I haven’t seen it treated as one in discussions of, say, the quality of data used for Large Language Model training – see, for example, [1], [2], [3].
Suppose that we have the following datasets:
- Dataset A includes only the Merriam-Webster definition of the term “logic”.
- Dataset B includes only the Wikipedia page for the term “logic”.
- Dataset C includes only the Stanford Encyclopedia of Philosophy page for the term “classical logic”.
- Dataset D includes all peer-reviewed books and research publications categorized as being about “logic” and published by the three largest publishers of academic research.
The four datasets are listed in order of increasing depth of expertise: B has more than A about “logic”, C more than B, and D more than C. Stated otherwise, if I wanted to learn about logic, I would know more after learning from dataset C than from B, and more from B than from A.
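To make the ordering concrete, here is a minimal Python sketch. The SourceTier scale, its values, and the Dataset structure are illustrative assumptions of mine, not an established metric:

```python
from dataclasses import dataclass
from enum import IntEnum


class SourceTier(IntEnum):
    """Hypothetical ordinal scale for depth of expertise, from a
    general-audience dictionary up to the peer-reviewed literature."""
    DICTIONARY = 1        # e.g. a Merriam-Webster definition
    ENCYCLOPEDIA = 2      # e.g. a Wikipedia page
    EXPERT_REFERENCE = 3  # e.g. a Stanford Encyclopedia of Philosophy entry
    PEER_REVIEWED = 4     # e.g. academic books and research publications


@dataclass
class Dataset:
    name: str
    domain: str
    tier: SourceTier


datasets = [
    Dataset("A", "logic", SourceTier.DICTIONARY),
    Dataset("B", "logic", SourceTier.ENCYCLOPEDIA),
    Dataset("C", "logic", SourceTier.EXPERT_REFERENCE),
    Dataset("D", "logic", SourceTier.PEER_REVIEWED),
]

# Sorting by tier reproduces the ordering A < B < C < D.
for ds in sorted(datasets, key=lambda d: d.tier):
    print(ds.name, ds.tier.name)
```

The point of the sketch is only that depth of expertise induces an ordering over datasets within a domain; how to assign the tiers in practice is exactly the open problem discussed below.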
Depth of expertise is interesting because it inevitably raises the question of how competent an AI can be, given the depth of expertise of the datasets used to train it.
If I want to ask an AI about logic, I would likely prefer a more competent one, which in turn means – among other things – that it was trained on datasets with the depth of expertise I’m looking for. In some cases, it’s fine that an AI was not trained on the corpus of academic publications about logic, but it would be worth knowing that.
Depth of expertise is hard to assess, at least for now. Doing so requires much richer metadata than what is available about the training data used to build foundation models – see [4]. However, as I wrote here and here, there are incentives for expert authors to improve the quality of training data.
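To illustrate what such metadata might look like, here is a hypothetical per-document schema with one crude aggregate proxy. The field names and the share_peer_reviewed function are assumptions for illustration, not an existing standard:

```python
from dataclasses import dataclass, field


@dataclass
class DocumentProvenance:
    """Hypothetical per-document metadata that could support estimating
    the depth of expertise of a training corpus. These fields are not
    part of any existing standard; they illustrate what richer dataset
    documentation might record."""
    source: str                    # e.g. "Stanford Encyclopedia of Philosophy"
    domain: str                    # knowledge domain, e.g. "logic"
    peer_reviewed: bool            # did the text pass peer review?
    intended_audience: str         # e.g. "general public", "researchers"
    author_credentials: list[str] = field(default_factory=list)


def share_peer_reviewed(corpus: list[DocumentProvenance], domain: str) -> float:
    """One crude proxy for depth of expertise: the fraction of documents
    in a given domain that are peer reviewed."""
    docs = [d for d in corpus if d.domain == domain]
    if not docs:
        return 0.0
    return sum(d.peer_reviewed for d in docs) / len(docs)
```

Even this toy proxy makes the gap visible: a corpus made only of encyclopedia pages would score 0.0 for “logic”, while dataset D above would score 1.0.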
References
- [1] Groeneveld, Dirk, et al. “OLMo: Accelerating the Science of Language Models.” arXiv preprint arXiv:2402.00838 (2024). https://arxiv.org/abs/2402.00838
- [2] Soldaini, Luca, et al. “Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research.” arXiv preprint arXiv:2402.00159 (2024). https://arxiv.org/abs/2402.00159
- [3] Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier.” Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016).
- [4] Baack, Stefan. “Training Data for the Price of a Sandwich: Common Crawl’s Impact on Generative AI.” Mozilla Foundation (2024). https://foundation.mozilla.org/en/research/library/generative-ai-training-data/common-crawl/