Re: kmemleak handling of kfree_rcu

Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> · Wed, 6 Sep 2023 14:35:29 +0000

Hi Catalin,

On Tue, Sep 05, 2023 at 03:41:32PM +0100, Catalin Marinas wrote:
> On Tue, Sep 05, 2023 at 11:17:25AM +0000, Joel Fernandes wrote:
> > On Mon, Sep 04, 2023 at 10:22:56PM +0100, Catalin Marinas wrote:
> > > On Wed, Aug 30, 2023 at 09:37:23AM -0700, Christoph Paasch wrote:
> > > > for the MPTCP upstream project, we are running syzkaller [1] continuously to
> > > > qualify our kernel changes.
> > > > 
> > > > We found one issue with kmemleak and its handling of kfree_rcu.
> > > > 
> > > > Specifically, it looks like kmemleak falsely reports memory-leaks when the
> > > > object is being freed by kfree_rcu after a certain grace-period.
> > > > 
> > > > For example, https://github.com/multipath-tcp/mptcp_net-next/issues/398#
> > > > issuecomment-1584819482 shows how the syzkaller program reliably produces a
> > > > kmemleak report, although the object is not leaked (we confirmed that by simply
> > > > increasing MSECS_MIN_AGE to something larger than the grace-period).
> > > 
> > > Not sure which RCU variant you are using but most likely the false
> > > positives are caused by the original reference to the object being lost
> > > and the pointer added to a new location that kmemleak does not track
> > > (e.g. bnode->records[] in the tree-based variant).
> > > 
> > > A quick attempt (untested, not even compiled):
> > > 
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > 
> > I am not sure if that works. Correct me if I'm wrong but the issue is that
> > the allocated pointer being RCU-freed is no longer reachable by kmemleak (it
> > is scanned but not reachable), however it might still be reachable via an
> > RCU reader. In such a situation, it is a false-positive.
> 
> Yes, with a small correction that the object being RCU-freed is not
> scanned if it cannot be reached by kmemleak (unless explicitly marked as
> not a leak).

Yes I used wrong wording. I meant it is added to the kmemleak rbtree as
something that needs to be reachable, but cannot be reached. Hopefully that
makes sense.

> > The correct fix then should probably be to mark the object as
> > kmemleak_not_leak() until a grace period elapses. This will cause the object
> > to not be reported but still be scanned until eventually the lower layers
> > will remove the object from kmemleak-tracking after the grace period. Per the
> > docs also, that API is used to prevent false-positives.
> 
> This should work as well but I'd use kmemleak_ignore() instead of
> kmemleak_not_leak(). The former, apart from masking the false positive,
> also tells kmemleak not to scan the object. After a kvfree_rcu(), the
> object shouldn't have any valid references to other objects, so not
> worth scanning.

Yes I am also OK with that, however to me I consider the object as alive as
long as the grace period does not end. But I agree with you and it may not be
worth tracking them or scanning them.

> > Instead what the below diff appears to do is to mark the bnode cache as a
> > kmemleak object itself, which smells a bit wrong. The bnode is not an
> > allocated object in the traditional sense, it is simple an internal data
> > structure.
> 
> We do this in other places for objects containing pointers and which
> aren't tracked by kmemleak (it doesn't track page allocations as there
> would be too many leading to lots of false positives or overlapping
> objects - the slab itself uses the page allocator). An example where we
> do something similar is alloc_large_system_hash(). We could add a
> wrapper API if kmemleak_alloc() feels wrong but underlying in kmemleak
> it would use the same mechanism for recording and scanning the bnode.

Ah I see what you did, you basically made the bnode as a kmemleak object in
its own right, and perhaps the kmemleak detector can reach the object you
added. That actually (though sounding like a little of a hack) would work too.
I say hack because the object you added is not an allocated object in the
traditional sense :-D. But as you said, there is precendent for that.

> > That may not solve the issue as the false-positive unreachable
> > object is still unreachable.
> 
> It should solve the issue as long as the bnode is reachable from the
> root objects (e.g. the data/bss sections; I think it is via the
> per_cpu_ptr(&krc, cpu)->bkvcache llist_head). When the bnode is scanned,
> it will find the pointer to the RCU-freed object and mark it as
> referenced (not a leak).

Right! That's what I was missing.

> Anyway, as we trust the RCU freeing implementation not to leak data, we
> can go with your simpler suggestion of a kmemleak_ignore() call on the
> kvfree_rcu() path. Probably even better as the object pending to be
> freed will no longer be scanned.

Sounds good and thanks a lot Catalin! Unless you beat me to it, I'll send a
patch out along those lines by the weekend and CC you with your suggested-by
and the reported-by from the reporter ;-). Let me know if you have a preference.

thanks,

 - Joel