Re: [PATCH v4.4.y] drivers: core: Remove glue dirs from sysfs earlier

Guenter Roeck <groeck@xxxxxxxxxx> · Tue, 29 Jan 2019 06:35:48 -0800

On Mon, Jan 28, 2019 at 10:58 PM Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Mon, Jan 28, 2019 at 09:31:30AM -0800, Zubin Mithra wrote:
> > From: Benjamin Herrenschmidt <benh@xxxxxxxxxxxxxxxxxxx>
> >
> > commit 726e41097920a73e4c7c33385dcc0debb1281e18 upstream
> >
> > For devices with a class, we create a "glue" directory between
> > the parent device and the new device with the class name.
> >
> > This directory is never "explicitely" removed when empty however,
> > this is left to the implicit sysfs removal done by kobject_release()
> > when the object loses its last reference via kobject_put().
> >
> > This is problematic because as long as it's not been removed from
> > sysfs, it is still present in the class kset and in sysfs directory
> > structure.
> >
> > The presence in the class kset exposes a use after free bug fixed
> > by the previous patch, but the presence in sysfs means that until
> > the kobject is released, which can take a while (especially with
> > kobject debugging), any attempt at re-creating such as binding a
> > new device for that class/parent pair, will result in a sysfs
> > duplicate file name error.
> >
> > This fixes it by instead doing an explicit kobject_del() when
> > the glue dir is empty, by keeping track of the number of
> > child devices of the gluedir.
> >
> > This is made easy by the fact that all glue dir operations are
> > done with a global mutex, and there's already a function
> > (cleanup_glue_dir) called in all the right places taking that
> > mutex that can be enhanced for this. It appears that this was
> > in fact the intent of the function, but the implementation was
> > wrong.
> >
> > Backport Note: kref_read() is not present in 4.4. Hence,
> > use atomic_read(&kref.refcount) instead of kref_read(&kref).
> >
> > Signed-off-by: Benjamin Herrenschmidt <benh@xxxxxxxxxxxxxxxxxxx>
> > Acked-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> > Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
> > Signed-off-by: Zubin Mithra <zsm@xxxxxxxxxxxx>
> > ---
> >  drivers/base/core.c     |  2 ++
> >  include/linux/kobject.h | 17 +++++++++++++++++
> >  2 files changed, 19 insertions(+)
>
> Wait, why is this needed?
>

We have syzcaller/syzbot hits, and there is a reproducer. Is this sufficient ?

> And why only for 4.4?  What about 4.9 and 4.14?  Do you want to upgrade
> and suddenly hit the same "bug" that you fixed before?
>

Good point. Sorry, that got lost. We are, after all, oly human. I'd
ask Zubin to provide backports, or do it myself, but we'll have to
resolve the issue you bring up below first.

> There was a reason that I did not backport this to the stable tree when
> it was submitted, and that was because this was an odd race to ever hit.
> Are you hitting this in the real world without kobject deferred
> release enabled?  And if so, are you hitting the WARN_ON that is added
> here?
>
I think we may need updated rules for stable. Many bug fixes are
backported to stable releases without having been seen in the "real
world" (whatever that means). At the same time we do see many races in
the real world, many of them not fully understood. Our policy so far
is to fix as many problems as possible if they are understood, in the
hope that it fixes at least some of those problems seen in the field.

If there is a new rule that problems have to have been observed in the
real world (ie without debugging options enabled, and without specific
reproducer) before a patch is applied to stable, can we have that as
generic rule that is not only selectively applied ?

We'll definitely continue fixing race conditions (and security issues,
for that matter) in our branches. So far our policy was to request
backport to -stable releases to limit deviation from stable releases
as much as possible, to reduce our maintenance burden going forward,
and to have others benefit from our work. We'll definitely continue
doing that for ChromeOS branches. I would find it regrettable to have
to revise our policy and bypass -stable, but we'll do that given no
choice. We will not spend additional effort trying to prove if a
problem is seen "in the real world"; that would be futile and a waste
of time. If that is the new criteria, we'll have to keep our patches
local.

Thanks,
Guenter