Commentary

New York’s stalled AI bill would have blurred the line between disclosure and restriction

While pitched as a transparency measure, Assembly Bill 8595 would have set a new, unusually high bar for compliance.

Earlier this year, New York state lawmakers advanced a proposal that would have required artificial intelligence developers to reveal the exact sources behind their models. Assembly Bill 8595, the Artificial Intelligence Transparency for Journalism Act, would have mandated a detailed, publication-level accounting of every uniform resource locator (the formal name for a website address, typically shortened to URL) and data source accessed in every phase of model development. While pitched as a transparency measure, AB 8595 would have set a new, unusually high bar for compliance, raising the question of when transparency stops looking like a demand for openness and starts looking like a deliberate barrier to entry. The bill’s progress appears to have stalled, but it is worth examining, because the legislative approach it embodies is likely to shape future legislation.

State Sen. Kristen Gonzalez (D-59) introduced the bill earlier this year. A key portion of the legislative text reads:

A developer of a generative artificial intelligence system or service … shall post and maintain on its website, with a link to such posting included on its homepage, the following information for each generative artificial intelligence system or service that utilizes covered publication content:

(i) the uniform resource locators or uniform resource identifiers accessed by crawlers deployed by the developer or by third parties on their behalf or from whom they have obtained video, audio, text or data … .

Despite the title, the bill defined a “journalism provider” broadly as a “covered publication”: any print, broadcast, or digital outlet that “performs a public-information function” and “invests substantial expenditure of labor, skill, and money.” The bill would have granted covered publications the right to “bring an action in the supreme court for statutory damages or injunctive relief.”

Ultimately, the bill did not define what counts as a copyright violation. Instead, it would have handed publishers readier evidence to prove that a violation took place. And, importantly, courts have already begun outlining the contours of how AI training may be treated under the fair use doctrine.

In a landmark case last September, AI developer Anthropic agreed to a $1.5 billion settlement with authors whose works it had used without purchasing them. Large language models (LLMs) are trained on vast amounts of data, some of which may include pirated copies of books. Notably, the case set a precedent that AI models can be trained on works that are legally obtained: if a developer purchases a book, it can train a model on the content without compensating the author beyond the cost of the book itself. In Thomson Reuters v. Ross Intelligence, U.S. Circuit Judge Stephanos Bibas held that Ross’s use of proprietary Westlaw headnotes to train its AI engine was not fair use, emphasizing the originality of the content and the commercial nature of Ross’s competing product. (Ross is a now-defunct legal-research AI company.)

In June, a U.S. district judge ruled that Meta did not cause substantial harm to the market for the works it used to train its AI model, siding against a number of high-profile authors. A California court also sided with Anthropic in a similar case brought by book authors.

Complying with New York’s proposal would have posed significant technical hurdles. LLMs are built on datasets containing billions of documents collected via automated web crawlers. Tracking and publishing every individual URL or identifier accessed during each stage of development is not standard practice. While engineers may spot-check a model’s citations or investigate suspected “hallucinations,” they rarely maintain exhaustive logs of every crawler request or data pull.
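To see the scale of what the bill seems to contemplate, consider a minimal sketch of per-URL access logging bolted onto a web crawler. This is purely illustrative: the requests-based fetch, the fetch_and_log helper, and the CSV disclosure log are hypothetical, not anything AB 8595 or any developer actually prescribes.

# Hypothetical sketch of the exhaustive access logging AB 8595 appears to
# contemplate -- not how production crawlers are typically instrumented.
import csv
import datetime
import requests

ACCESS_LOG = "crawl_access_log.csv"  # assumed disclosure artifact, one row per URL

def fetch_and_log(url: str, phase: str) -> str | None:
    """Fetch a page and record the URL, development phase, and timestamp."""
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
    except requests.RequestException:
        return None
    with open(ACCESS_LOG, "a", newline="") as f:
        csv.writer(f).writerow(
            [url, phase, datetime.datetime.utcnow().isoformat()]
        )
    return resp.text

# Every phase -- pretraining, fine-tuning, evaluation, even manual spot checks --
# would need rows like these, across billions of documents.
fetch_and_log("https://example.com/article", phase="pretraining")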

Under the hood, LLMs learn by adjusting weights—numerical values that encode the statistical strength of connections between words—rather than storing or indexing URLs directly. Once training is completed, a model’s weights reflect aggregated patterns from the entire dataset, not discrete source pointers.
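A toy illustration makes the point: the “model” below is nothing more than word-pair weights, and although two hypothetical documents shape the same weights, nothing in the trained result records which URL contributed what.

# Toy illustration (not a real LLM): training adjusts numerical weights that
# aggregate patterns across all text seen; the source URL is never stored.
from collections import defaultdict

weights = defaultdict(float)  # one "weight" per (word, next_word) pair

def train_on(text: str, learning_rate: float = 0.1) -> None:
    words = text.lower().split()
    for a, b in zip(words, words[1:]):
        weights[(a, b)] += learning_rate  # strengthen the association

# Two documents from different (hypothetical) URLs feed the same weights.
train_on("the court ruled on fair use")          # e.g. from a news article
train_on("the court ruled against the company")  # e.g. from a blog post

# After training, the weights reflect aggregated statistics,
# not which URL taught the model what.
print(weights[("the", "court")])    # 0.2 -- blended from both sources
print(weights[("court", "ruled")])  # 0.2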

Even after training, engineers often conduct manual verification. For instance, one study describes clinicians checking whether an LLM’s medical citations matched real articles and assessing their accuracy. Had AB 8595 passed and been interpreted broadly, companies might have been required to document every URL opened during such checks, in addition to all sources ingested into model weights.
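Even a simple citation check of the kind that study describes involves opening additional URLs. The sketch below, using an assumed example URL and the requests library, shows the sort of verification step that a broad reading of the bill might sweep into the disclosure requirement.

# Hypothetical sketch of a manual citation check: verifying that a model's
# cited source exists means opening yet more URLs, which a broad reading of
# AB 8595 might also require developers to disclose.
import requests

def citation_resolves(url: str) -> bool:
    """Return True if the cited URL responds successfully."""
    try:
        resp = requests.head(url, allow_redirects=True, timeout=10)
        return resp.status_code < 400
    except requests.RequestException:
        return False

cited = "https://example.com/journal/article-123"  # URL returned by the model
print("citation found" if citation_resolves(cited) else "citation may be fabricated")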

“If a URL pointed to an uploaded PDF of one of my novels, that’s not proof that the model’s understanding of that came from that link. It could be from hundreds of discussions, promotional materials, or the Amazon page,” Andrew Mayne told Reason Foundation in an email. Mayne is a novelist and AI consultant who served as a technical consultant on GPT-4, the OpenAI model behind a popular version of ChatGPT.

Mayne’s observation highlights a fundamental ambiguity: Even with perfect logs of every URL a crawler hit, developers couldn’t trace how indirect discussions or metadata influenced model outputs. Must they disclose URLs opened for manual fact-checks? Or every ancillary page that informed a bot’s interpretation?

Such questions underscore how AB 8595 would have blurred the line between disclosure and restriction. Overly complex reporting requirements can impede innovation even from the large developers responsible for the most popular AI applications.

The last action on the bill came in June 2025, when the legislation was referred to the rules committee. No further action is indicated on the New York State Legislature’s website, and it is not immediately clear why the bill did not advance further. However, tensions between developers and publishers are far from settled, and a version of this bill could well return to New York or another state next session.