On Wed, 2023-01-11 at 12:19 +0100, Mike Galbraith wrote:
> On Wed, 2023-01-11 at 05:55 -0500, Jeff Layton wrote:
> > > >
> > > > crash> delayed_work ffff8881601fab48
> > > > struct delayed_work {
> > > >   work = {
> > > >     data = {
> > > >       counter = 1
> > > >     },
> > > >     entry = {
> > > >       next = 0x0,
> > > >       prev = 0x0
> > > >     },
> > > >     func = 0x0
> > > >   },
> > > >   timer = {
> > > >     entry = {
> > > >       next = 0x0,
> > > >       pprev = 0x0
> > > >     },
> > > >     expires = 0,
> > > >     function = 0x0,
> > > >     flags = 0
> > > >   },
> > > >   wq = 0x0,
> > > >   cpu = 0
> > > > }
> > >
> > > That looks more like a memory scribble or UAF. Merely having multiple
> > > tasks calling queue_work at the same time wouldn't be enough to trigger
> > > this, IMO. It's more likely that the extra locking is changing the
> > > timing of your reproducer somehow.
> > >
> > > It might be interesting to turn up KASAN if you're able.
>
> I can try that.
>
> > If you still have this vmcore, it might be interesting to do the pointer
> > math and find the nfsd_net structure that contains the above
> > delayed_work. Does the rest of it also seem to be corrupt? My guess is
> > that the corrupted structure extends beyond just the delayed_work above.
> >
> > Also, it might be helpful to do this:
> >
> >     kmem -s ffff8881601fab48
> >
> > ...which should tell us whether and what part of the slab this object is
> > now a part of. That said, net-namespace object allocations are somewhat
> > weird, and I'm not 100% sure they come out of the slab.
>
> I tossed the vmcore, but can generate another. I had done kmem sans -s
> previously, still have that.
>
> crash> kmem ffff8881601fab48
> CACHE             OBJSIZE  ALLOCATED  TOTAL  SLABS  SSIZE  NAME
> kmem: kmalloc-1k: partial list slab: ffffea0005b20c08 invalid page.inuse: -1
> ffff888100041840     1024       2329   2432     76    32k  kmalloc-1k
>   SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
>   ffffea0005807e00  ffff8881601f8000     0     32         32     0
>   FREE / [ALLOCATED]
>   [ffff8881601fa800]
>
>       PAGE       PHYSICAL      MAPPING         INDEX  CNT  FLAGS
>   ffffea0005807e80  1601fa000  dead000000000400     0    0  200000000000000
> crash>

Thanks. The pernet allocations do come out of the slab. The allocation is
done in ops_init in net/core/net_namespace.c. This one is a 1k allocation,
which jives with the size of nfsd_net (which is 976 bytes). So, this seems
to be consistent with where an nfsd_net would have come from.

Maybe not a UAF, but I do think we have some sort of mem corruption going
on.
--
Jeff Layton <jlayton@xxxxxxxxxx>