On Thu, Feb 29, 2024 at 08:18:43AM +0100, Hannes Reinecke wrote:
> On 2/28/24 19:55, Bart Van Assche wrote:
> > On 2/27/24 14:32, Konstantin Ryabitsev wrote:
> > Please do not publish the summaries generated by ChatGPT on the web.
> > If these summaries were published on the world wide web, ChatGPT or
> > other LLMs would probably use them as input data. Any mistakes in
> > these summaries would then end up being used as input data by
> > multiple LLMs.
>
> Now there's a thought. Maybe we should do exactly the opposite, and post
> _more_ ChatGPT-generated content on the web?
> Send them into a deadly self-reinforcing feedback loop?

Well, I'll note that last July, when a number of AI companies, including
Amazon, Anthropic, Google, Inflection, Meta, Microsoft, and OpenAI, met
with President Biden at the White House, they made a commitment to
develop watermarking standards that would allow AI-generated content to
be detected[1]. Obviously, this is much easier to do with images, and
Google was the first company to release a watermarking system for
AI-generated images[2]. However, there is ongoing research into adding
watermarks to text[3].

[1] https://www.whitehouse.gov/briefing-room/statements-releases/2023/07/21/fact-sheet-biden-harris-administration-secures-voluntary-commitments-from-leading-artificial-intelligence-companies-to-manage-the-risks-posed-by-ai/
[2] https://www.technologyreview.com/2023/08/29/1078620/google-deepmind-has-launched-a-watermarking-tool-for-ai-generated-images/
[3] https://www.nytimes.com/interactive/2023/02/17/business/ai-text-detection.html

I doubt whether anything we do is going to make a huge difference; one
of the largest uses of OpenAI's ChatGPT is generating text for Search
Engine Optimization spam[4]. Another major use of LLMs is laying off
journalists by generating text explaining why a particular stock went
up by X% when the market went up or down by Y%.
After all, why have a human making up stories to explain stock moves
when you can have an AI model hallucinate them instead? :-)

[4] https://www.opace.co.uk/blog/blog/how-openai-gpt-3-enhances-ai-chat-text-generation-for-seo

The bottom line is that a vast amount of AI-generated text has been put
out on the web *already*. It is going to poison future LLM training,
even before we start generating summaries of LKML traffic and making
them available on the web. It also means that companies doing AI work
have a large, vested interest in developing standardized ways of
watermarking AI-generated content --- not just because they made a
promise to some politicians, but because if all the companies can use a
common watermarking standard, they can hopefully all avoid this
self-poisoning feedback loop.

Cheers,

						- Ted