Large language models (LLMs) are trained on vast, nearly unfathomable amounts of data, data that is now reshaping the very fields from which it was sourced, including literature, journalism, music, and photography. As a result, these models have sparked high-stakes litigation and raised novel legal questions about ownership and intellectual property, in both the AI training process and the output the models produce. In this conversation, we explore the intersection of AI training and copyright law with Professor Shyamkrishna (Shyam) Balganesh of Columbia Law School, a prominent legal scholar who has been closely examining these emerging issues.
At the core of the debate is how these models are trained: on vast datasets that combine copyrighted and public domain material. LLMs ingest this data to absorb the patterns that power their ability to generate intelligent responses, yet their reliance on copyrighted works raises concerns about unauthorized use. Professor Balganesh walks us through the technical aspects of how these models are built, explaining the intricacies of data ingestion and why the training process involves copying datasets onto local servers, a step that may itself give rise to copyright violations.
The fair use doctrine has emerged as the central defense for using copyrighted material in AI training, but that defense has its limits. Professor Balganesh details how the courts are grappling with how to balance innovation against intellectual property rights. While AI companies claim their use of copyrighted works falls under fair use, critics argue that fair use cannot “scale” with the models and that the models reproduce creative works in ways that violate authors' rights. Shyam examines the boundaries of this argument and where the law may be heading.
These legal questions are playing out in real time, with high-profile cases capturing national attention. Professor Balganesh shares his insights on key lawsuits, including the New York Times’ challenge to OpenAI, the Suno AI music case brought by Universal Music Group, and Getty Images' case against Stability AI, the maker of Stable Diffusion. While these cases remained pending at the time of the interview, Shyam predicts a shift toward licensing regimes, in which AI developers secure permission to use copyrighted material for training their models.
🎧 Listen and Subscribe to the AI Lawyer Podcast