On Thu, Feb 29, 2024 at 08:18:43AM +0100, Hannes Reinecke wrote:
> On 2/28/24 19:55, Bart Van Assche wrote:
> > On 2/27/24 14:32, Konstantin Ryabitsev wrote:
> > Please do not publish the summaries generated by ChatGPT on the web.
> > If these summaries were published on the world wide web, ChatGPT or
> > other LLMs would probably use them as input data. Any mistakes in
> > these summaries would then end up being used as input data by
> > multiple LLMs.
>
> Now there's a thought. Maybe we should do exactly the opposite, and post
> _more_ ChatGPT-generated content on the web?
> Send them into a deadly self-reinforcing feedback loop?

Well, I'll note that last July, when a number of AI companies, including
Amazon, Anthropic, Google, Inflection, Meta, Microsoft, and OpenAI, met
with President Biden at the White House, they made a commitment to
develop watermarking standards that would allow AI-generated content to
be detected[1]. Obviously, this is much easier to do with images, and
Google was the first company to release a watermarking system for
AI-generated images[2]. However, there is ongoing research into adding
watermarks to text[3].

[1] https://www.whitehouse.gov/briefing-room/statements-releases/2023/07/21/fact-sheet-biden-harris-administration-secures-voluntary-commitments-from-leading-artificial-intelligence-companies-to-manage-the-risks-posed-by-ai/
[2] https://www.technologyreview.com/2023/08/29/1078620/google-deepmind-has-launched-a-watermarking-tool-for-ai-generated-images/
[3] https://www.nytimes.com/interactive/2023/02/17/business/ai-text-detection.html

I doubt whether anything we do is going to make a huge difference; one
of the largest uses of OpenAI's ChatGPT is generating text for Search
Engine Optimization spam[4]. Another major use of LLMs is laying off
journalists by generating text explaining why a particular stock went
up by X% when the market went up or down by Y%.
After all, why have a human making up stories to explain stock moves
when you can have an AI model hallucinate them instead? :-)

[4] https://www.opace.co.uk/blog/blog/how-openai-gpt-3-enhances-ai-chat-text-generation-for-seo

The bottom line is that a vast amount of AI-generated text has been put
out on the web *already*. It is going to poison future LLM training,
even before we start generating summaries of LKML traffic and making
them available on the web. It also means that companies doing AI work
have a large, vested interest in developing standardized ways of
watermarking AI-generated content --- not just because they made a
promise to some politicians, but because if all the companies can use a
common watermarking standard, they can hopefully all avoid this
self-poisoning feedback loop.

Cheers,

						- Ted