Valuation of an AI Training Dataset

March 2024 / September 2024

If there is a market for AI training datasets, then price will be determined by supply and demand. How does the supplier set the price, and how does the buyer evaluate whether the price is right? The question behind both is this: how do we estimate the value of a training dataset?

On the supply side, we can look at the parameters that drive the cost of making and maintaining the dataset/stream. The obvious ones are the costs to make, manage, and sell it:

Cost of the infrastructure and tools used to collect data: if the data came from sensors, these had a cost to make or buy, set up, and operate; if the data is collected through a survey, there are the people and/or tools used to gather responses; if the data was elicited by observing something people do, there is the time spent performing observations, among other things;
Cost of managing the data, that is, of the effort invested and tools used to store, clean, describe, and govern the data;
Cost of distributing, marketing, and selling the data, and of providing support to its buyers.
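To make the cost side concrete, here is a minimal sketch that adds up the three cost categories above and derives a break-even price per buyer. The class and function names, the figures, and the idea of spreading cost evenly over an expected number of buyers are all my assumptions for illustration, not a pricing method.

```python
from dataclasses import dataclass


@dataclass
class DatasetCosts:
    """Hypothetical cost categories, mirroring the three items above."""
    collection: float    # infrastructure and tools used to collect data
    management: float    # storing, cleaning, describing, governing
    go_to_market: float  # distributing, marketing, selling, supporting

    def total(self) -> float:
        return self.collection + self.management + self.go_to_market


def break_even_price(costs: DatasetCosts, expected_buyers: int) -> float:
    """Price per buyer at which total cost is just recovered.

    Assumes every buyer pays the same price, which is a simplification.
    """
    if expected_buyers <= 0:
        raise ValueError("expected_buyers must be positive")
    return costs.total() / expected_buyers


# Illustrative figures only; none of these come from real data.
costs = DatasetCosts(collection=120_000.0, management=60_000.0, go_to_market=20_000.0)
price_floor = break_even_price(costs, expected_buyers=40)
print(f"Total cost: {costs.total():,.0f}; break-even price per buyer: {price_floor:,.0f}")
```

A price floor like this only says what the supplier needs to recover costs; it says nothing about what the dataset is worth to buyers.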

What are the parameters impacting the value of an AI dataset to buyers?

Estimated willingness to pay of the addressable market
Price elasticity
Potential impact of high-quality outputs of the AI system on mitigating the risks that the audience faces
Value to that audience of the decisions that the AI system's recommendations inform
Cost to acquire the same recommendation from human experts
Exclusivity of the dataset, and of substitute datasets
Variety and probability of erroneous outputs, and the impact of these on the user
Data quality
Scope and depth of data governance applied to the dataset

A business case to develop an AI training dataset/stream needs to cover at least the above.
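As a rough illustration of how a buyer might combine a few of the parameters listed above, the sketch below takes the cost of acquiring the same recommendations from human experts as the savings, and subtracts the expected cost of erroneous outputs. The function names, the formula, and all numbers are assumptions of mine; the other parameters (exclusivity, data quality, governance, elasticity) are deliberately left out.

```python
def expected_error_cost(error_modes: list[tuple[float, float]]) -> float:
    """Expected cost of errors per recommendation.

    Each error mode is a (probability, impact) pair; both are hypothetical
    estimates the buyer would have to supply.
    """
    return sum(probability * impact for probability, impact in error_modes)


def buyer_value_ceiling(
    n_recommendations: int,
    expert_cost_per_recommendation: float,
    error_modes: list[tuple[float, float]],
) -> float:
    """Upper bound on what the buyer might pay, under the assumptions above.

    Savings: what the same recommendations would cost from human experts.
    Losses: the expected cost of erroneous outputs over all recommendations.
    """
    savings = n_recommendations * expert_cost_per_recommendation
    losses = n_recommendations * expected_error_cost(error_modes)
    return savings - losses


# Illustrative figures only: a frequent minor error and a rare severe one.
error_modes = [(0.02, 500.0), (0.001, 20_000.0)]
ceiling = buyer_value_ceiling(
    n_recommendations=1_000,
    expert_cost_per_recommendation=80.0,
    error_modes=error_modes,
)
print(f"Buyer's value ceiling: {ceiling:,.0f}")
```

If a number like this falls below the supplier's break-even price, the business case does not hold under these assumptions, whatever the quality of the data.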
