Re: [PATCH -v2] memcg: fix a crash in wb_workfn when a device disappears

"Theodore Y. Ts'o" <tytso@xxxxxxx> · Tue, 7 Jan 2020 21:12:38 -0500

On Tue, Jan 07, 2020 at 03:33:48PM -0800, Andrew Morton wrote:
> On Fri, 27 Dec 2019 19:52:11 -0500 "Theodore Ts'o" <tytso@xxxxxxx> wrote:
> 
> > Unfortunately, del_gendisk() in block/genhd.c never got the memo
> > about the Brave New memcg World, and calls bdi_unregister() directly.
> > It does this without informing the file system, or the memcg code, or
> > anything else.  This causes the root wb associated with the bdi to be
> > unregistered, but none of the memcg-specific wb's are shut down.  So
> > when one of these wb's is woken up to do delayed work, it tries to
> > dereference its wb->bdi->dev to fetch the device name, but
> > unfortunately bdi->dev is now NULL, thanks to the bdi_unregister()
> > called by del_gendisk().  As a result, *boom*.
> > 
> > Fortunately, it looks like the rest of the writeback path is perfectly
> > happy with bdi->dev and bdi->owner being NULL, so the simplest fix is
> > to create a bdi_dev_name() function which can handle bdi->dev being
> > NULL.  This also allows us to bulletproof the writeback tracepoints to
> > prevent them from dereferencing a NULL pointer and crashing the kernel
> > if one is tracing with memcg's enabled, and an iSCSI device dies or a
> > USB storage stick is pulled.
> 
> Is hotremoval of a device while tracing writeback the only known way of
> triggering this?

No, tracing does not need to be active.  The most common way of
triggering this will be hot removal of a device while writeback with
memcg enabled is in progress.  It was triggering several times a day
in a heavily loaded production environment.
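
For readers following along, the NULL-tolerant helper described in the
quoted patch can be sketched in user space roughly as below.  This is
only an illustration: the struct layouts are simplified stand-ins
rather than the kernel's definitions, the "(unknown)" fallback string
is an assumption, and format_worker_desc() is a hypothetical helper
that mimics how a flush worker might build its description from the
device name.

```c
#include <stddef.h>
#include <stdio.h>

/* Simplified stand-ins for the kernel structures; not the real layouts. */
struct device {
	const char *name;
};

struct backing_dev_info {
	struct device *dev;	/* NULL once the bdi has been unregistered */
};

/* Assumed fallback string for a bdi whose device has gone away. */
static const char *const bdi_unknown_name = "(unknown)";

/* Return a printable device name even if bdi or bdi->dev is NULL. */
static const char *bdi_dev_name(const struct backing_dev_info *bdi)
{
	if (!bdi || !bdi->dev)
		return bdi_unknown_name;
	return bdi->dev->name;
}

/*
 * Hypothetical caller: build a "flush-<dev>" description without
 * dereferencing bdi->dev directly.  Returns what snprintf() returns.
 */
static int format_worker_desc(char *buf, size_t len,
			      const struct backing_dev_info *bdi)
{
	return snprintf(buf, len, "flush-%s", bdi_dev_name(bdi));
}
```

With a live device the description carries the device name; after the
device pointer has been cleared, the same call yields "flush-(unknown)"
instead of faulting on a NULL dereference.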

> Is it worth a cc:stable?

Yes, I think so.

						- Ted