Protecting Copyright Holders In The Age Of AI

In the rapidly evolving landscape of artificial intelligence (AI), the training data that fuels AI models is of paramount importance. However, there are questions as to whether existing copyright law will give adequate control to copyright holders over whether and how their data is used to train AI models. In this article, I outline a potential solution to this problem in the form of a new right for copyright holders, referred to here as a “training data” right.

Training Data and Existing Copyright Law

AI models typically rely on vast amounts of data to learn and make accurate predictions. This data comes from a variety of sources, including text, images, and audio, often gathered from the internet. However, the use of such data raises significant legal and ethical issues, particularly when copyrighted material is used against the wishes of the copyright holder. Although case law is beginning to develop in this area [1], there are early signs that existing copyright law will ultimately fall short in providing protection for copyright holders in this context. For example, copyright infringement currently requires copying of a "substantial part" of a work, and whilst copying may take place during training of an AI model, it seems unlikely that the AI model itself (typically a neural network that is, in essence, a mathematical construct) could be viewed as containing a "substantial part" of the material on which it has been trained. Similarly, there are jurisdictional issues associated with existing copyright law that could interfere with a copyright infringement claim when, for example, the AI model has been trained in a different jurisdiction from the one in which it is ultimately used.

A New “Training Data” Right?

It is possible to envisage a right, referred to here as a “training data” right, that would enable copyright holders to decide whether, when and how their data is used to train AI models.

Here is an outline of how a “training data” right might be formulated:

  • The “training data” right would grant the copyright holder the right to stop a party from providing/offering an AI model trained using the copyright holder’s data for use in a relevant jurisdiction. This formulation of the right would allow a copyright holder to prevent an AI developer from bypassing their “training data” right simply by training their AI model in an obscure jurisdiction, then offering the trained AI model in the relevant jurisdiction.
  • In order to prove infringement of the “training data” right, the copyright holder would need to prove that the AI model has actually been trained using a “substantial part” of their data. It might seem challenging to ask copyright holders to prove this, but arguments of this kind are already being advanced in copyright cases brought by copyright holders against AI model developers; see, e.g., Getty Images Inc. v Stability AI Ltd. [2023] EWHC 3090 (Ch) and New York Times v. OpenAI (US).
  • Whilst the new “training data” right would seek to limit the activity of AI model developers, it would not necessarily extend to end users of the AI model (who may not have any knowledge of the potential infringement).

This new “training data” right would be intended to create a new layer of protection that would supplement, rather than replace, existing copyright law. 

A Need to Balance Protection and Innovation

Whilst a new “training data” right as outlined above seems feasible, it is reasonable to ask whether protecting copyright holders in this way would be in the interest of society as a whole. For example, it could be argued that providing copyright holders with a new “training data” right could stymie the development of AI models, by making it much more difficult for AI model developers to provide their models with the large volumes of training data they need to work well. On the other hand, is it fair for AI developers to build an AI model for profit, without any of that profit going to the copyright holders whose material was used to build the AI model?

One approach that might provide a reasonable balance between the needs of copyright holders and AI model developers would be for the “training data” right to be implemented on an “opt-out” basis. That is, AI model developers would be allowed to use copyrighted data for their models except where the copyright holder has taken some step to “opt out” their data from being used in this way. This would allow AI model developers to use the vast majority of available data, whilst giving copyright holders the ability to prevent their data from being used in AI models where they do not want this.

Clearly, there are logistical and technical issues associated with an “opt-out” approach:

  • There would need to be some way of making it easy for copyright holders to record their “opt-out” and clearly demarcate which data is opted out, so that AI model developers could comply with the “opt-out” without undue burden.
  • AI model developers would need to implement an “opt-out” within a reasonable timeframe, once the “opt-out” has been recorded.
  • Some thought would need to go into what happens if an “opt-out” is recorded after data has already been used to train an AI model. In particular, would the AI model need to be retrained so as to avoid using that data? 
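To make the first of these issues concrete, one can imagine a machine-readable opt-out signal published by a website, which an AI developer's data-collection pipeline would check before ingesting content. The sketch below is purely illustrative and assumes a hypothetical opt-out file format and location (the file name "ai-optout.json", the "reserved-paths" field, and the function names are all invented for this example; they are loosely inspired by robots.txt-style exclusion signals, not any adopted standard):

```python
# Illustrative sketch only: assumes a hypothetical machine-readable
# opt-out file that a site could publish, listing path prefixes whose
# content is reserved from AI training. No real standard is implied.

import json
from urllib.parse import urlparse

# Hypothetical well-known location for the opt-out file.
OPTOUT_PATH = "/.well-known/ai-optout.json"

def load_optout_rules(raw_json: str) -> list[str]:
    """Parse the hypothetical opt-out file into a list of reserved path prefixes."""
    data = json.loads(raw_json)
    return data.get("reserved-paths", [])

def is_opted_out(url: str, rules: list[str]) -> bool:
    """Return True if the URL's path falls under any reserved prefix."""
    path = urlparse(url).path
    return any(path.startswith(prefix) for prefix in rules)

# Example: a site reserving its photos and premium articles from AI training.
raw = '{"reserved-paths": ["/photos/", "/articles/premium/"]}'
rules = load_optout_rules(raw)

print(is_opted_out("https://example.com/photos/cat.jpg", rules))    # True
print(is_opted_out("https://example.com/blog/post-1.html", rules))  # False
```

Even a simple scheme like this illustrates the remaining policy questions: how quickly a crawler must re-check the file, and what happens to models already trained on data that is later opted out.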

Early Regulatory Developments

Interestingly, it appears that the EU is moving towards the “opt-out” approach outlined above via its AI Act, the enforcement of which is due to commence from 2 August 2026 [2]. Since enforcement of the EU’s AI Act has not yet begun, it is not clear whether the EU has dealt with all of the logistical and technical issues associated with such an approach, as noted above.

In contrast to the EU, Japan appears to be moving towards a much more permissive regime, whereby AI developers are free to use copyrighted material whether they have permission or not [3].

Clearly, regulation in this area is at an early stage, so we will need to wait and see if a consistent approach is adopted between different jurisdictions. 


[1] Getty Images Inc. v Stability AI Ltd. [2023] EWHC 3090 (Ch); New York Times v. OpenAI (US); Tremblay v. OpenAI, Inc. (US); Millette v. OpenAI (US)

[2] https://www.whitecase.com/insight-alert/long-awaited-eu-ai-act-becomes-law-after-publication-eus-official-journal

[3] https://www.privacyworld.blog/2024/03/japans-new-draft-guidelines-on-ai-and-copyright-is-it-really-ok-to-train-ai-using-pirated-materials/