Microsoft CEO of AI says the content you post online is ‘freeware’ for AI training

The CEO of Microsoft’s AI division addressed a sensitive subject in an interview: where the data used to train popular AI tools such as ChatGPT and Microsoft’s Copilot comes from.

Up until now, companies such as OpenAI have offered little transparency about the datasets used to train the neural networks that power their popular AI tools. The ambiguity around where AI companies acquire these large swaths of data has led to several lawsuits, with owners of online content claiming OpenAI and Microsoft stole copyrighted content to train their AI models, which are then used commercially.

Two authors have already sued Microsoft and OpenAI for using their work to train AI models without permission, while eight newspapers, along with the New York Times, have filed lawsuits against OpenAI and Microsoft. The ambiguity around copyrighted content can be traced back to a gray area in current law, which AI companies appear to be relying on to take data from wherever on the internet they can.

Mustafa Suleyman, the CEO of Microsoft AI, appeared to allude to this gap in the law in a recent interview with CNBC, where he drew a distinction between content that people publish on the open web and content that copyright holders have explicitly sought to protect.

“I think that with respect to content that is already on the open web, the social contract of that content since the 1990s has been it is fair use,” he opined. “Anyone can copy it, recreate with it, reproduce with it. That has been freeware, if you like. That’s been the understanding.”

“There’s a separate category where a website or publisher or news organization had explicitly said, ‘do not scrape or crawl me for any other reason than indexing me,’ so that other people can find that content,” he explained. “But that’s the gray area. And I think that’s going to work its way through the courts.”