Valuation of an AI Training Dataset

If there is a market for AI training datasets, then the price will be determined by supply and demand. How does the supplier set the price, and how does the buyer judge whether the price is right? Both questions reduce to one: how do we estimate the value of a training dataset?

On the supply side, we can look at the parameters that drive the cost of making and maintaining the dataset or stream. The obvious ones are the costs to make, manage, and sell it:

  • Cost of the infrastructure and tools used to collect the data: for example, if the data came from sensors, these had a cost to make or buy, set up, and operate; if the data is collected through a survey of people, there are other people and/or tools needed to gather the responses; if the data was elicited by observing something people do, there is the time spent performing the observations, among other things;
  • Cost of managing the data, that is, of the effort invested and tools used to store, clean, describe, and govern the data;
  • Costs of distributing, marketing, and selling the data, and of providing support to its buyers.
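
The three cost groups above can be rolled up into a rough price floor for the supplier. The following is a minimal sketch, not a pricing method: the class name `DatasetCosts`, the function `break_even_price`, the assumption that costs are spread evenly over an expected number of buyers, and all the figures are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class DatasetCosts:
    """Hypothetical cost breakdown mirroring the three bullets above."""
    collection: float    # infrastructure and tools used to collect the data
    management: float    # storing, cleaning, describing, governing the data
    go_to_market: float  # distributing, marketing, selling, supporting buyers

    def total(self) -> float:
        return self.collection + self.management + self.go_to_market


def break_even_price(costs: DatasetCosts, expected_buyers: int) -> float:
    """Price per buyer at which total costs are just recovered.

    Assumes (for illustration only) that every buyer pays the same price.
    """
    if expected_buyers <= 0:
        raise ValueError("expected_buyers must be positive")
    return costs.total() / expected_buyers


# Illustrative figures only, not real data.
costs = DatasetCosts(collection=120_000.0, management=45_000.0, go_to_market=35_000.0)
floor = break_even_price(costs, expected_buyers=20)
print(f"Total cost: {costs.total():,.0f}; break-even price per buyer: {floor:,.0f}")
```

A price below this floor loses money under the stated assumptions; whether the market will bear a price above it depends on the buyer-side parameters.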

What are the parameters impacting the value of an AI dataset to buyers?

  • Estimated willingness to pay of the addressable market
  • Price elasticity
  • Potential impact of high-quality outputs of the AI system on mitigating the risks that the audience faces
  • Value to that audience of the decisions that the AI system's recommendations inform
  • Cost of acquiring the same recommendations from human experts
  • Exclusivity of the dataset, and of substitute datasets
  • Variety and probability of erroneous outputs, and their impact on the user
  • Data quality
  • Scope and depth of data governance applied to the dataset
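
Several of these parameters can be combined into a rough upper bound on what a single buyer might pay. The sketch below is one possible simplification, not an established valuation model: it assumes the buyer will pay no more than the smaller of (a) the value of a decision net of the expected cost of erroneous outputs and (b) the cost of getting the same recommendation from a human expert. The function name `buyer_value_ceiling`, its parameters, and the figures are all hypothetical.

```python
def buyer_value_ceiling(
    decision_value: float,            # value to the buyer of one well-informed decision
    decisions_per_year: int,          # how many such decisions the buyer makes
    error_probability: float,         # chance that an output is erroneous
    error_cost: float,                # impact on the buyer of one erroneous output
    expert_cost_per_decision: float,  # cost of the same recommendation from a human expert
) -> float:
    """Rough yearly ceiling on a buyer's willingness to pay (illustrative only)."""
    if not 0.0 <= error_probability <= 1.0:
        raise ValueError("error_probability must be between 0 and 1")
    # Value of one decision after subtracting the expected cost of errors.
    net_value = decision_value - error_probability * error_cost
    # The buyer would not pay more than the human-expert alternative costs,
    # and would pay nothing if expected errors outweigh the benefit.
    per_decision = max(0.0, min(net_value, expert_cost_per_decision))
    return per_decision * decisions_per_year


# Illustrative figures only, not real data.
ceiling = buyer_value_ceiling(
    decision_value=500.0,
    decisions_per_year=1000,
    error_probability=0.05,
    error_cost=2000.0,
    expert_cost_per_decision=150.0,
)
print(f"Yearly value ceiling for this buyer: {ceiling:,.0f}")
```

Exclusivity, data quality, and governance are left out of this sketch; in practice they would shift the error probability and the availability of substitutes rather than enter as separate terms.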

A business case for developing an AI training dataset or stream needs to cover at least the parameters above.