
What Is the Depth of Expertise of an AI Training Dataset?

I use “depth of expertise” as a data quality dimension of AI training datasets. It describes how much of the expertise in a knowledge domain a dataset reflects. It is not a data quality dimension commonly used in other contexts, and I haven’t seen it treated as such in discussions of, say, the quality of data used for Large Language Model training (see, for example, [1], [2], [3]).

Suppose that we have the following datasets:

  • Dataset A includes only the Merriam-Webster definition of the term “logic”.
  • Dataset B includes only the Wikipedia page for the term “logic”.
  • Dataset C includes only the Stanford Encyclopedia of Philosophy page for the term “classical logic”.
  • Dataset D includes all peer-reviewed books and research publications categorized as being about “logic” and published by the three largest publishers of academic research.

The four datasets are sorted in increasing depth of expertise: B has more than A about “logic”, C more than B, and D more than C. Stated otherwise, if I wanted to learn about logic, I would end up knowing more by learning from dataset D than from C, more from C than from B, and more from B than from A. The sketch below illustrates this ordering.
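To make the ordering concrete, here is a minimal Python sketch that tags each dataset with an ordinal depth label and sorts by it. The scale, its labels, and the label assigned to each dataset are my assumptions for illustration, not an established measure of depth of expertise.

```python
from dataclasses import dataclass

# Hypothetical ordinal scale for depth of expertise; the labels and their
# ordering are assumptions for illustration, not an established standard.
DEPTH_SCALE = [
    "dictionary definition",
    "encyclopedia article",
    "specialist encyclopedia article",
    "peer-reviewed corpus",
]

@dataclass
class Dataset:
    name: str
    source: str
    depth: str  # one of DEPTH_SCALE

datasets = [
    Dataset("A", "Merriam-Webster definition of 'logic'", "dictionary definition"),
    Dataset("B", "Wikipedia page for 'logic'", "encyclopedia article"),
    Dataset("C", "Stanford Encyclopedia of Philosophy page for 'classical logic'",
            "specialist encyclopedia article"),
    Dataset("D", "Peer-reviewed books and papers about 'logic'", "peer-reviewed corpus"),
]

# Sort datasets by increasing depth of expertise and print the ordering.
for ds in sorted(datasets, key=lambda d: DEPTH_SCALE.index(d.depth)):
    print(f"Dataset {ds.name}: {ds.depth} ({ds.source})")
```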

Depth of expertise is interesting because it inevitably raises the question of how competent an AI can be, given the depth of expertise of the datasets used to train it.

If I want to ask an AI about logic, I would likely prefer a more competent one, which in turn means, among other things, that it was trained on datasets with the depth of expertise I’m looking for. In some cases it’s fine that an AI was not trained on the corpus of academic publications related to logic, but it would be worth knowing that.

Depth of expertise is hard to assess, at least at present. Doing so requires much more elaborate metadata than what is available about the training data used to build foundation models (see [4]). However, there are, as I wrote here and here, incentives for expert authors to improve the quality of training data.
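As a rough illustration of what such “more elaborate metadata” could look like, here is a minimal Python sketch of a per-document metadata record and a crude corpus summary. Every field name (source, domain, document_type, peer_reviewed, author_credentials) is an assumption of mine for illustration; no current foundation-model training corpus ships metadata at this level of detail.

```python
from dataclasses import dataclass, field

# Hypothetical per-document metadata that could help assess depth of expertise.
# All field names are illustrative assumptions, not an existing schema.
@dataclass
class DocumentMetadata:
    source: str                 # e.g. publisher or website
    domain: str                 # knowledge domain, e.g. "logic"
    document_type: str          # e.g. "dictionary entry", "research article"
    peer_reviewed: bool
    author_credentials: list[str] = field(default_factory=list)

def corpus_summary(docs: list[DocumentMetadata]) -> dict[str, int]:
    """Count documents per document type, a crude proxy for depth of expertise."""
    counts: dict[str, int] = {}
    for doc in docs:
        counts[doc.document_type] = counts.get(doc.document_type, 0) + 1
    return counts

docs = [
    DocumentMetadata("Merriam-Webster", "logic", "dictionary entry", False),
    DocumentMetadata("Academic publisher", "logic", "research article", True,
                     ["PhD, Philosophy"]),
]
print(corpus_summary(docs))
```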

References

  1. Groeneveld, Dirk, et al. “OLMo: Accelerating the Science of Language Models.” arXiv preprint arXiv:2402.00838 (2024). https://arxiv.org/abs/2402.00838
  2. Soldaini, Luca, et al. “Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research.” arXiv preprint arXiv:2402.00159 (2024). https://arxiv.org/abs/2402.00159
  3. Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier.” Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016.
  4. https://foundation.mozilla.org/en/research/library/generative-ai-training-data/common-crawl/
