Data Authenticity, Accuracy, Objectivity, and Diversity Requirements in Generative AI

In April 2023, the Cyberspace Administration of China released a draft Regulation for Generative Artificial Intelligence Services. This note continues the previous one on the same regulation.

One of the requirements on Generative AI is that the authenticity, accuracy, objectivity, and diversity of the data can be guaranteed.

My intent below is to comment on what this requirement may mean, and what implications there may be if it needs to be satisfied. To do so, we’ll need definitions for each of the four terms. Below, I’m using definitions from the NIST Computer Security Resource Center, when they exist.

Data authenticity:
- The property that data originated from its purported source.
- The property of being genuine and being able to be verified and trusted; confidence in the validity of a transmission, a message, or message originator.
Data accuracy:
- The degree of conformity of a measured or calculated value to the true value, typically based on a global reference system.
Data objectivity:
- Not defined in the NIST resources. I have never encountered “objectivity” as a property of data; I’ll discuss this further below.
Data diversity:
- As with objectivity, there is no definition from NIST. Again, to be discussed below.

How can you demonstrate that data used by a generative AI has the above properties?

Authenticity requires knowing the source of data. Does knowing the source equate with knowing the dataset or stream that it is from? The dataset may have been some kind of aggregate of another dataset, or the result of various transformations of an original dataset.

The point is that you cannot demonstrate the authenticity of data if you do not understand the process that generated it.

For example, demonstrating authenticity of heartbeat data requires knowing how the sensor generates it, not simply pointing to the dataset which the data came from. To claim authenticity, you need to be able to claim that you understand all transformations to data between the sensor or instrument that generated it, and the dataset you got it from. You need to know the source that generates input data, and then all algorithms applied to it up to the point of use, where you need to demonstrate authenticity.
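One way to picture "knowing all transformations between the sensor and the dataset" is a provenance log that fingerprints the data before and after each step. The sketch below is only an illustration under assumptions of mine: the class name `ProvenanceLog`, the two transformations, the sensor description, and the heart-rate readings are all hypothetical, and hashing is just one possible way to link steps.

```python
import hashlib
import json
from dataclasses import dataclass, field


def fingerprint(records):
    """Return a SHA-256 hex digest of a JSON-serializable list of records."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass
class ProvenanceLog:
    """Hypothetical log of every transformation applied to data from one source."""
    source: str
    steps: list = field(default_factory=list)

    def apply(self, name, func, records):
        """Run one transformation and record fingerprints of its input and output."""
        result = func(records)
        self.steps.append({
            "step": name,
            "input_sha256": fingerprint(records),
            "output_sha256": fingerprint(result),
        })
        return result

    def is_unbroken(self):
        """True if each step's input matches the previous step's output."""
        return all(
            prev["output_sha256"] == curr["input_sha256"]
            for prev, curr in zip(self.steps, self.steps[1:])
        )


# Made-up heart-rate readings in beats per minute, including two sensor glitches.
raw = [61, 62, 0, 64, 250, 66]
log = ProvenanceLog(source="hypothetical chest-strap sensor")

plausible = log.apply(
    "drop_implausible", lambda xs: [x for x in xs if 30 <= x <= 220], raw
)
smoothed = log.apply(
    "pairwise_mean", lambda xs: [(a + b) / 2 for a, b in zip(xs, xs[1:])], plausible
)

print(smoothed)
print(log.is_unbroken())
```

Note what this does and does not establish: an unbroken chain shows that no undocumented step sits between the recorded transformations, but it says nothing about whether the sensor itself is what it claims to be. That part of authenticity still has to be argued separately.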

Accuracy is the assessment of the distance between what is referred to as the true value of a measurement and the known value, that is, the value you have in a dataset. The challenge is knowing the true value, and knowing how to recognize whether a value you have is true. This, again, requires knowing how the measurement is made, which makes accuracy and authenticity related.

Let’s say you have a dataset of your heart rate from a sports training session. How do you assess its accuracy? You need to know all steps from the sensor to the dataset, and determine whether there is any reason to believe that something in those steps caused or involved errors that would make the value in the dataset different from the value that should have been recorded.
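If accuracy is a distance between the value in the dataset and the true value, the simplest numerical version is an average absolute difference against a reference. The sketch below assumes, for illustration only, that readings from a second device are available and can stand in for the true value; the numbers and the variable names are made up, and the choice of mean absolute error is mine, not something the regulation or NIST prescribes.

```python
def mean_absolute_error(measured, reference):
    """Average absolute difference between measured values and reference values."""
    if len(measured) != len(reference):
        raise ValueError("measured and reference must have the same length")
    if not measured:
        raise ValueError("at least one pair of values is required")
    return sum(abs(m - r) for m, r in zip(measured, reference)) / len(measured)


# Made-up heart-rate readings (beats per minute) at the same five moments.
dataset_values = [118, 124, 131, 140, 152]    # the values stored in the dataset
reference_values = [120, 125, 130, 142, 150]  # assumed stand-in for the true values

error_bpm = mean_absolute_error(dataset_values, reference_values)
worst_bpm = max(abs(m - r) for m, r in zip(dataset_values, reference_values))

print(round(error_bpm, 2), worst_bpm)
```

The arithmetic is trivial; the hard part is the second list. Any accuracy figure is only as credible as the claim that the reference is closer to the truth than the data being assessed.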

In many cases, you don’t know in detail the process that generated the data. For example, you have a dataset of restaurant evaluations by visitors. What does it mean for that data to be accurate? One possibility is to claim that it is accurate as long as there is no difference between the evaluations people gave and what is in the dataset, so nothing was changed relative to the original dataset. Another way to question accuracy is to ask whether there are evaluations by people who never visited the restaurant. Yet another way is to question the ability of the measurement tool to accurately represent the level of satisfaction.

Notice how difficult it becomes, depending on how high the threshold for evidence for authenticity or accuracy is, to demonstrate that data is authentic and/or accurate. When data is from many sources, and is from very large datasets, claims of accuracy and authenticity are easy to challenge.

Now, data objectivity is an odd one to request in a regulatory text. I haven’t encountered the term objectivity outside scientific work, which leads me to the “Scientific Objectivity” entry in the Stanford Encyclopedia of Philosophy:

“Scientific objectivity is a property of various aspects of science. It expresses the idea that scientific claims, methods, results—and scientists themselves—are not, or should not be, influenced by particular perspectives, value judgments, community bias or personal interests, to name a few relevant factors. Objectivity is often considered to be an ideal for scientific inquiry, a good reason for valuing scientific knowledge, and the basis of the authority of science in society.”

What could data objectivity mean in the context of Generative AI? One meaning could be that data is faithful to facts. We can throw data objectivity out the window for datasets describing casual impressions, such as, for example, restaurant evaluations. We can do the same for a large part of data used for large language model learning, such as works of fiction.

Faithfulness to facts is obviously a critical consideration when building AI that should support medical diagnosis, for example. It is an open question whether such an AI and another one made to support graphic design, for instance, should be subject to the same “data objectivity” requirements. The conditions that, say, a medical device needs to meet before it can go to market are very different from those that need to be met to sell a new kind of screwdriver.

Another conception of objectivity, in science, is freedom or independence from value. While there are more nuances, this is mostly about being able to identify and remove the influence of values related to specific ethical, moral, social, or political views, among others.

A third understanding of scientific objectivity is freedom from personal bias. Roughly speaking, this is the idea that what is offered as a scientific theory, for example, should be independent of the individual offering it and of the specific context in which it is offered. There is more to it, but reproducibility, for example, is a condition that, if met, would be good grounds to believe that the theory is free from personal bias. Evidently, demonstrating reproducibility is non-trivial; the idiot who believes the Earth is flat seemingly reproduces their own experience of it every time they look at the horizon.

Given the above, what could data objectivity refer to? All of it – faithfulness to facts, as well as freedom from values and personal bias. This sets high standards when designing Generative AI, perhaps too high for many of the applications it lends itself to. If we are making Generative AI that is meant to write fiction, it is not clear what objectivity means – perhaps it is as simple as ensuring that, if we say the AI was trained on Hemingway’s works among others, it was in fact trained on Hemingway’s works and not, for example, on someone’s edits thereof. It is a very different and much more costly problem, though, to claim objectivity of data used in AI that supports medical diagnosis.

Finally, how to satisfy the data diversity requirement? It is not clear what exactly data diversity may be, but the likely interpretation is that it involves demonstrating that if Generative AI produces outputs that involve claims about some underlying population, the input data used to train Generative AI should be representative of that population; or at least we should be able to understand biases of the dataset/sample relative to the population. Again, this is hard to demonstrate, because it requires that you are able to explain what the population is, if the dataset or stream you are using is about a sample thereof. In some cases, the meaning of data diversity is obvious – for example, if Generative AI is designed to generate human portraits, data diversity could be about the training data being representative of human faces across genders, geographies, and so on. It’s not as clear in other application areas: if we are making Generative AI for literary fiction, what is the population to train it on? Perhaps we want to bias it relative to the entire body of published works of fiction, because we want a particular style to be generated.
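If data diversity is read as representativeness, one rough check is to compare how categories are distributed in the training data with how they are distributed in the population the outputs make claims about. The sketch below uses total variation distance for that comparison; this is one measure among many, chosen by me for illustration. The category names, the sample, and the population shares are all made up, and the sketch assumes the population shares are known, which is the hard part in practice.

```python
from collections import Counter


def category_shares(labels):
    """Return the share of each category in a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(sample_shares, population_shares):
    """Half the sum of absolute share differences: 0 means identical, 1 means disjoint."""
    categories = set(sample_shares) | set(population_shares)
    return 0.5 * sum(
        abs(sample_shares.get(c, 0.0) - population_shares.get(c, 0.0))
        for c in categories
    )


# Hypothetical category labels attached to ten training items.
training_labels = ["region_a"] * 6 + ["region_b"] * 3 + ["region_c"] * 1

# Assumed (made-up) shares of the same categories in the target population.
population_shares = {
    "region_a": 0.25,
    "region_b": 0.25,
    "region_c": 0.25,
    "region_d": 0.25,
}

sample_shares = category_shares(training_labels)
gap = total_variation_distance(sample_shares, population_shares)

print(round(gap, 2))
```

A gap of zero is not always the goal. If the aim is a particular literary style, a deliberate gap relative to all published fiction may be exactly what is wanted; the number only makes the bias visible so that it can be explained.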