Reddit will start charging companies that train AI models on its extremely human archives

If you’re in the business of training a large language model (LLM) and want it to learn from the u/420NarutoConspiracy subreddit, you’ll soon have to pay for it.

Steve Huffman, co-founder and CEO of social news and discussion aggregator Reddit, recently told The New York Times that he plans to charge companies that use Reddit’s API to extract 18 years of mostly human-created content. Details of the new terms and conditions are available in an announcement on Reddit.

The API will continue to be free for developers working on bots and other Reddit tools, and for researchers working on academic or non-profit projects. But simply using Reddit discussions for AI training purposes will come at a cost, the exact amount of which should emerge in the coming weeks.

“The Reddit data set is really valuable,” Huffman said in an interview with the Times. “But we don’t have to give all that value away to some of the biggest companies in the world for free.

“Crawling Reddit, creating value, and not returning that value to our users is something we have problems with. Now is the time for us to make things right.”

The comments and conversations on Reddit have become a rich resource for training LLMs. ChatGPT and Google Bard cite Reddit data as one of their sources. In their analysis of just one subset (12 million images) of the 2.3 billion-image dataset used to train Stable Diffusion, Andy Baio and Simon Willison noted that “user-generated content platforms have been a huge source of image data.” A study of common data sources for many AIs published today by The Washington Post found that “compiling text from links highly rated by Reddit users” was part of GPT-3’s training data.

While Reddit intends to restrict access for AI training, it also plans to give developers and moderators better tools for working within their communities. The Reddit apps for iOS and Android will offer ways to quickly view a user’s history, update community rules, and better handle multiple mod queues.

Reddit’s change to API access comes as the company is set to go public in the second half of 2023, according to The Information. The company confidentially filed for an initial public offering in December 2021. According to Reuters, it had hoped for a $15 billion valuation, but held off on going public until market conditions, especially around tech companies, improved.
