On Tue, Sep 05, 2023 at 11:17:25AM +0000, Joel Fernandes wrote: > On Mon, Sep 04, 2023 at 10:22:56PM +0100, Catalin Marinas wrote: > > On Wed, Aug 30, 2023 at 09:37:23AM -0700, Christoph Paasch wrote: > > > for the MPTCP upstream project, we are running syzkaller [1] continuously to > > > qualify our kernel changes. > > > > > > We found one issue with kmemleak and its handling of kfree_rcu. > > > > > > Specifically, it looks like kmemleak falsely reports memory-leaks when the > > > object is being freed by kfree_rcu after a certain grace-period. > > > > > > For example, https://github.com/multipath-tcp/mptcp_net-next/issues/398# > > > issuecomment-1584819482 shows how the syzkaller program reliably produces a > > > kmemleak report, although the object is not leaked (we confirmed that by simply > > > increasing MSECS_MIN_AGE to something larger than the grace-period). > > > > Not sure which RCU variant you are using but most likely the false > > positives are caused by the original reference to the object being lost > > and the pointer added to a new location that kmemleak does not track > > (e.g. bnode->records[] in the tree-based variant). > > > > A quick attempt (untested, not even compiled): > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > I am not sure if that works. Correct me if I'm wrong but the issue is that > the allocated pointer being RCU-freed is no longer reachable by kmemleak (it > is scanned but not reachable), however it might still be reachable via an > RCU reader. In such a situation, it is a false-positive. Yes, with a small correction that the object being RCU-freed is not scanned if it cannot be reached by kmemleak (unless explicitly marked as not a leak). > The correct fix then should probably be to mark the object as > kmemleak_not_leak() until a grace period elapses. This will cause the object > to not be reported but still be scanned until eventually the lower layers > will remove the object from kmemleak-tracking after the grace period. Per the > docs also, that API is used to prevent false-positives. This should work as well but I'd use kmemleak_ignore() instead of kmemleak_not_leak(). The former, apart from masking the false positive, also tells kmemleak not to scan the object. After a kvfree_rcu(), the object shouldn't have any valid references to other objects, so not worth scanning. > Instead what the below diff appears to do is to mark the bnode cache as a > kmemleak object itself, which smells a bit wrong. The bnode is not an > allocated object in the traditional sense, it is simple an internal data > structure. We do this in other places for objects containing pointers and which aren't tracked by kmemleak (it doesn't track page allocations as there would be too many leading to lots of false positives or overlapping objects - the slab itself uses the page allocator). An example where we do something similar is alloc_large_system_hash(). We could add a wrapper API if kmemleak_alloc() feels wrong but underlying in kmemleak it would use the same mechanism for recording and scanning the bnode. > That may not solve the issue as the false-positive unreachable > object is still unreachable. It should solve the issue as long as the bnode is reachable from the root objects (e.g. the data/bss sections; I think it is via the per_cpu_ptr(&krc, cpu)->bkvcache llist_head). When the bnode is scanned, it will find the pointer to the RCU-freed object and mark it as referenced (not a leak). Anyway, as we trust the RCU freeing implementation not to leak data, we can go with your simpler suggestion of a kmemleak_ignore() call on the kvfree_rcu() path. Probably even better as the object pending to be freed will no longer be scanned. -- Catalin