
Can an Artificial Intelligence Trained on Large-Scale Crawled Web Data Comply with the Algorithmic Accountability Act?

If an artificial intelligence system is trained on large-scale crawled web/Internet data, can it comply with the Algorithmic Accountability Act? 

For the sake of discussion, I assume below that (1) the Act is passed, which it is not at the time of writing, and (2) the Act applies to the system (for more on applicability, see my notes on Section 2 of the Act, here).

Below is a sample of requirements from Section 4(a)(7) of the Act (here) that are very hard to satisfy if the artificial intelligence system was trained on crawled web data, such as data from Common Crawl.

Section 4(a)(7) Maintain and keep updated documentation of any data or other input information used to develop, test, maintain, or update the automated decision system or augmented critical decision process, including—

(A) how and when such data or other input information was sourced and, if applicable, licensed, including information such as—

(i) metadata and information about the structure and type of data or other input information, such as the file type, the date of the file creation or modification, and a description of data fields;

(ii) an explanation of the methodology by which the covered entity collected, inferred, or obtained the data or other input information and, if applicable, labeled, categorized, sorted, or clustered such data or other input information, including whether such data or other input information was labeled, categorized, sorted, or clustered prior to being collected, inferred, or obtained by the covered entity; and

(iii) whether and how consumers provided informed consent for the inclusion and further use of data or other input information about themselves and any limitations stipulated on such inclusion or further use;

(B) why such data or other input information was used and what alternatives were explored; and

(C) other information about the data or other input information, such as—

(i) the representativeness of the dataset and how this factor was measured, including any assumption about the distribution of the population on which the augmented critical decision process is deployed; and

(ii) the quality of the data, how the quality was evaluated, and any measure taken to normalize, correct, or clean the data.

One way to satisfy Section 4(a)(7) is to refer to, or maintain, documentation of the crawled data. The difficult question is how extensive that documentation needs to be.

For example, part of the documentation may be the list of resources crawled, including information about resources that were not crawled (e.g., Common Crawl includes the robots.txt files of the resources it crawls).
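
To make this concrete, below is a minimal sketch of how such a list of crawled resources for one domain could be assembled from the Common Crawl CDX index (the index server at index.commoncrawl.org). The crawl ID, field names, and domain are illustrative assumptions, not something the Act prescribes.

```python
import json
import requests

# Illustrative crawl ID; available crawls are listed at https://index.commoncrawl.org/.
CRAWL_ID = "CC-MAIN-2024-10"
INDEX_URL = f"https://index.commoncrawl.org/{CRAWL_ID}-index"

def crawled_records(domain: str):
    """Yield CDX index records (one JSON object per line) for a domain."""
    resp = requests.get(
        INDEX_URL,
        params={"url": f"{domain}/*", "output": "json"},
        timeout=60,
    )
    resp.raise_for_status()
    for line in resp.text.splitlines():
        yield json.loads(line)

if __name__ == "__main__":
    for record in crawled_records("example.com"):
        # Typical CDX fields include the URL, fetch timestamp, MIME type,
        # HTTP status, and a content digest, which are useful raw material
        # for the kind of metadata Section 4(a)(7)(A)(i) asks for.
        print(record.get("url"), record.get("timestamp"),
              record.get("mime"), record.get("status"))
```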

Does such a list of crawled resources satisfy Section 4(a)(7)(A)(iii), “whether and how consumers provided informed consent for the inclusion and further use of data…”? To satisfy this requirement, it is necessary to be able to provide evidence that the data was obtained through informed consent. The scale of crawled web data makes this difficult and leads to a dilemma: use large-scale crawled data to improve performance in the eyes of the customer without being able to demonstrate informed consent across all of it, or train the system only on data for which such evidence is available. How this dilemma is resolved will depend on how the system’s designers perceive (1) the potential impact of failing to comply with the Act (in case evidence of informed consent is absent) and (2) the importance of controlling the inputs to training, which in turn depends on the perceived importance of the decision the system makes recommendations about. Because of this dilemma, there likely is, or will be, a market for crawled datasets accompanied by well-documented evidence of consumers’ informed consent.
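
As an aside, here is a minimal sketch of what per-record evidence of informed consent could look like in such a documented dataset. The field names and the filtering rule are my own assumptions for illustration; nothing in the Act prescribes a particular schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ConsentEvidence:
    """Hypothetical per-record evidence of informed consent (fields are illustrative)."""
    source_url: str                 # where the data came from
    data_subject_id: Optional[str]  # pseudonymous identifier of the consumer, if any
    consent_obtained: bool          # whether informed consent was recorded
    consent_mechanism: str          # e.g. "signed release", "opt-in form"
    consent_date: Optional[str]     # ISO 8601 date the consent was given
    usage_limitations: list[str] = field(default_factory=list)  # limits stipulated by the consumer

def can_use_for_training(record: ConsentEvidence) -> bool:
    """Keep only records with recorded consent and no limitation that rules out model training."""
    return record.consent_obtained and "no-model-training" not in record.usage_limitations
```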

Artist’s Shit by Piero Manzoni, 1961, photo by Jens Cederskjold https://en.wikipedia.org/wiki/Body_fluids_in_art

A more difficult requirement to satisfy is Section 4(a)(7)(C)(i), “the representativeness of the dataset and how this factor was measured…”. Large-scale crawled web data includes anything and everything, which is why it is appealing in the first place (besides being low-cost to acquire and use, and easily available). I don’t see how it is possible to argue that a dataset such as Common Crawl, i.e., a dataset created by a free-ranging web crawl, is representative of anything.
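
One could at least try to quantify what “representativeness” might mean by comparing the distribution of some attribute in the crawled data against a reference distribution for the population the decision process is deployed on. The sketch below uses total variation distance over language shares; the choice of attribute and all numbers are made up for illustration.

```python
def total_variation_distance(dataset_counts: dict[str, int],
                             population_share: dict[str, float]) -> float:
    """0.0 means the dataset matches the reference distribution exactly; 1.0 is maximal mismatch."""
    total = sum(dataset_counts.values())
    dataset_share = {k: v / total for k, v in dataset_counts.items()}
    categories = set(dataset_share) | set(population_share)
    return 0.5 * sum(abs(dataset_share.get(c, 0.0) - population_share.get(c, 0.0))
                     for c in categories)

# Illustrative numbers only: documents per language in the crawl vs. the
# language mix of the population the decision process is deployed on.
crawl_language_counts = {"en": 9_200_000, "es": 310_000, "fr": 280_000, "other": 210_000}
deployment_population = {"en": 0.55, "es": 0.30, "fr": 0.05, "other": 0.10}

print(total_variation_distance(crawl_language_counts, deployment_population))
```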

The more general and important point is that compliance requires the ability to establish data lineage, and that can be very expensive for large-scale crawled data: the data was created over a long period of time (think of data crawled from a messaging forum) and there is a lot of it.
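
Below is a minimal sketch of what a per-document lineage record might contain; the fields are assumptions for illustration, and a real system would also need to track how each record flowed into training.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class LineageRecord:
    """Hypothetical lineage entry for one training document (fields are illustrative)."""
    source_url: str          # where the document was crawled from
    crawl_timestamp: str     # when the crawler fetched it
    content_sha256: str      # hash of the raw bytes, tying training data back to the crawl
    transformations: list[str] = field(default_factory=list)  # cleaning/normalization steps applied

def record_lineage(source_url: str, crawl_timestamp: str, raw_bytes: bytes) -> LineageRecord:
    return LineageRecord(
        source_url=source_url,
        crawl_timestamp=crawl_timestamp,
        content_sha256=hashlib.sha256(raw_bytes).hexdigest(),
    )

entry = record_lineage("https://example.com/post/123", "2024-03-01T12:00:00Z", b"<html>...</html>")
entry.transformations.append("html-stripped")
entry.transformations.append("deduplicated")
```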

It will be interesting to see how much focus is given to the ability to demonstrate data lineage when artificial intelligence systems are audited.
