Re: [ovs-discuss] Double free in recent kernels after memleak fix

Johan Knöös <jknoos@xxxxxxxxxx> · Fri, 7 Aug 2020 16:05:36 -0700

On Fri, Aug 7, 2020 at 3:20 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>
> On Fri, Aug 07, 2020 at 04:47:56PM -0400, Joel Fernandes wrote:
> > Hi,
> > Adding more of us working on RCU as well. Johan from another team at
> > Google discovered a likely issue in openvswitch, details below:
> >
> > On Fri, Aug 7, 2020 at 11:32 AM Johan Knöös <jknoos@xxxxxxxxxx> wrote:
> > >
> > > On Tue, Aug 4, 2020 at 8:52 AM Gregory Rose <gvrose8192@xxxxxxxxx> wrote:
> > > >
> > > >
> > > >
> > > > On 8/3/2020 12:01 PM, Johan Knöös via discuss wrote:
> > > > > Hi Open vSwitch contributors,
> > > > >
> > > > > We have found openvswitch is causing double-freeing of memory. The
> > > > > issue was not present in kernel version 5.5.17 but is present in
> > > > > 5.6.14 and newer kernels.
> > > > >
> > > > > After reverting the RCU commits below for debugging, enabling
> > > > > slub_debug, lockdep, and KASAN, we see the warnings at the end of this
> > > > > email in the kernel log (the last one shows the double-free). When I
> > > > > revert 50b0e61b32ee890a75b4377d5fbe770a86d6a4c1 ("net: openvswitch:
> > > > > fix possible memleak on destroy flow-table"), the symptoms disappear.
> > > > > While I have a reliable way to reproduce the issue, I unfortunately
> > > > > don't yet have a process that's amenable to sharing. Please take a
> > > > > look.
> > > > >
> > > > > 189a6883dcf7 rcu: Remove kfree_call_rcu_nobatch()
> > > > > 77a40f97030b rcu: Remove kfree_rcu() special casing and lazy-callback handling
> > > > > e99637becb2e rcu: Add support for debug_objects debugging for kfree_rcu()
> > > > > 0392bebebf26 rcu: Add multiple in-flight batches of kfree_rcu() work
> > > > > 569d767087ef rcu: Make kfree_rcu() use a non-atomic ->monitor_todo
> > > > > a35d16905efc rcu: Add basic support for kfree_rcu() batching
> >
> > Note that these reverts were only for testing the same code, because
> > he was testing 2 different kernel versions. One of them did not have
> > this set. So I asked him to revert. There's no known bug in the
> > reverted code itself. But somehow these patches do make it harder for
> > him to reproduce the issue.

I'm not certain that the frequency of the issue changes with and without
these commits on 5.6.14, but at least the symptoms of the issue change.
To clarify, this is what I've observed with different kernels:
* 5.6.14: "kernel BUG at mm/slub.c:304!". Easily reproducible.
* 5.6.14 with the above RCU commits reverted: the warnings reported in
  my original email. Easily reproducible.
* 5.6.14 with the above RCU commits reverted and
  50b0e61b32ee890a75b4377d5fbe770a86d6a4c1 reverted: no warnings
  observed (the frequency might be the same as on 5.5.17).
* 5.5.17: warning at kernel/rcu/tree.c#L2239. Difficult to reproduce.
  Maybe a different root cause.

> Perhaps they adjust timing?