Re: BUG: Crash in __free_slab() using SLAB_TYPESAFE_BY_RCU

Hello, Roman!

On 03/10/2019 05:35, Roman Gushchin wrote:
> On Wed, Oct 02, 2019 at 12:41:29PM -0700, Roman Gushchin wrote:
>> Hello, Karsten!
>>
>> Thank you for the report!
>>
>> On Wed, Oct 02, 2019 at 04:50:53PM +0200, Karsten Graul wrote:
>>>
>>> net/smc is calling proto_register(&smc_proto, 1) with smc_proto.slab_flags = SLAB_TYPESAFE_BY_RCU.
>>> Right after the last SMC socket is destroyed, proto_unregister(&smc_proto) is called, which 
>>> calls kmem_cache_destroy(prot->slab). This results in a kernel crash in __free_slab().
>>> Platform is s390x, reproduced on kernel 5.4-rc1. The problem was introduced by commit
>>> fb2f2b0adb98 ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
>>>
>>> I added a 'call graph', below of that is the crash log and a (simple) patch that works for me,
>>> but I don't know if this is the correct way to fix it.
>>>
>>> (Please keep me on CC of this thread because I do not follow the mm mailing list, thank you)
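
(To make the trigger concrete, the sequence boils down to something like the sketch
below. The struct and flag names are the real ones from net/smc; the usage around them
is simplified pseudo-code, not the actual smc module code:)

static struct proto smc_proto = {
	.name		= "SMC",
	.owner		= THIS_MODULE,
	.slab_flags	= SLAB_TYPESAFE_BY_RCU,
	...
};

/* module init: proto_register() creates prot->slab with SLAB_TYPESAFE_BY_RCU */
proto_register(&smc_proto, 1);

/* ... sockets come and go, until the last one is destroyed ... */

/* module exit, right after the last socket went away:
 * proto_unregister() calls kmem_cache_destroy(prot->slab),
 * which later crashes in the rcu callback */
proto_unregister(&smc_proto);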
>>>
>>>
>>> kmem_cache_destroy() 
>>>   -> shutdown_memcg_caches()
>>>     -> shutdown_cache()
>>>       -> __kmem_cache_shutdown()  (slub.c)
>>>         -> free_partial()
>>>           -> discard_slab()
>>> 	    -> free_slab()                                      -- call to __free_slab() is delayed
>>> 	      -> call_rcu(rcu_free_slab)
>>>     -> memcg_unlink_cache()
>>>       -> WRITE_ONCE(s->memcg_params.memcg, NULL);               -- !!!
>>>     -> list_add_tail(&s->list, &slab_caches_to_rcu_destroy);
>>>     -> schedule_work(&slab_caches_to_rcu_destroy_work);  -> work_fn uses rcu_barrier() to wait for rcu_batch, 
>>>                                                             so work_fn is not further involved here...
>>> ... rcu grace period ...
>>> rcu_batch()
>>>   ...
>>>   -> rcu_free_slab()   (slub.c)
>>>     -> __free_slab()
>>>       -> uncharge_slab_page()
>>>         -> memcg_uncharge_slab()
>>> 	  -> memcg = READ_ONCE(s->memcg_params.memcg);          -- !!! memcg == NULL
>>> 	  -> mem_cgroup_lruvec(memcg)
>>> 	    -> mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id); -- mz == NULL
>>> 	    -> lruvec = &mz->lruvec;                            -- lruvec == NULL
>>> 	    -> lruvec->pgdat = pgdat;                           -- *crash*
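
(For reference, the relevant lines of memcg_uncharge_slab() in mm/slab.h, abridged;
the annotations are mine, and the full context is in the RFC diff at the end of this
mail:)

static __always_inline void memcg_uncharge_slab(struct page *page, int order,
						struct kmem_cache *s)
{
	struct mem_cgroup *memcg;
	struct lruvec *lruvec;

	rcu_read_lock();
	/* NULL by now, memcg_unlink_cache() already ran */
	memcg = READ_ONCE(s->memcg_params.memcg);
	/* NULL is not the root memcg, so the branch is taken */
	if (likely(!mem_cgroup_is_root(memcg))) {
		/* dereferences the NULL memcg -> *crash* */
		lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
		mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
		memcg_kmem_uncharge_memcg(page, order, memcg);
	}
	...
}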
>>>
>>> The crash log:
>>
>> Hm, I might be wrong, but it seems that the problem is deeper: __free_slab()
>> called from the rcu path races with kmem_cache_destroy(), which is supposed
>> to be called when there are no outstanding allocations (and corresponding pages).
>> Any charged slab page actually holds a reference to the kmem_cache, which prevents
>> its destruction (look at s->memcg_params.refcnt), but kmem_cache_destroy() ignores
>> it.

I don't see a race between kmem_cache_destroy() and __free_slab(): kmem_cache_destroy()
is already done by the time __free_slab() is called. Maybe the trace below shows you the
order of calls on my system: kmem_cache_destroy() unlinks the memcg caches, sets up the
rcu callbacks, starts the slab_caches_to_rcu_destroy_workfn() worker, and then returns.
You can see that the smc code gets control back after that.
The worker starts in between (before the smc_exit trace), but keeps waiting on the
rcu_barrier(). Ages later (see the time difference) the rcu grace period ends,
rcu_free_slab() is called, and it crashes.
I hope that helps!

[  145.539917] kmem_cache_destroy before shutdown_memcg_caches() for 0000000068106f00
[  145.539929] free_slab call_rcu() for 00000000392c2e00, page is 000003d080e45000
[  145.539961] memcg_unlink_cache clearing memcg for 00000000392c2e00
[  145.539970] shutdown_cache adding to slab_caches_to_rcu_destroy queue for work: 00000000392c2e00

[  145.540001] free_slab call_rcu() for 00000000392c2900, page is 000003d080e4a200
[  145.540031] memcg_unlink_cache clearing memcg for 00000000392c2900
[  145.540041] shutdown_cache adding to slab_caches_to_rcu_destroy queue for work: 00000000392c2900

[  145.540066] kmem_cache_destroy after shutdown_memcg_caches() for 0000000068106f00

[  145.540075] kmem_cache_destroy before final shutdown_cache() for 0000000068106f00
[  145.540086] free_slab call_rcu() for 0000000068106f00, page is 000003d080e0a800
[  145.540189] shutdown_cache adding to slab_caches_to_rcu_destroy queue for work: 0000000068106f00

[  145.540548] kmem_cache_destroy after final shutdown_cache() for 0000000068106f00
   kmem_cache_destroy is done
[  145.540573] slab_caches_to_rcu_destroy_workfn before rcu_barrier() in workfunc
   slab_caches_to_rcu_destroy_workfn started and waits in rcu_barrier() now
[  145.540619] smc.0698ae: smc_exit before smc_pnet_exit
   smc module exit code gets back control ...
[  145.540699] smc.616283: smc_exit before unregister_pernet_subsys
[  145.619747] rcu_free_slab called for 00000000392c2e00, page is 000003d080e45000
   much later the rcu callbacks are invoked, and will crash

>>
>> If my thoughts are correct, the commit you've mentioned didn't introduce this
>> issue, it just made it easier to reproduce.
>>
>> The proposed fix looks dubious to me: the problem isn't in the memcg pointer
>> (it's just luck that it crashes on it), and it seems incorrect to not decrease
>> the slab statistics of the original memory cgroup.

I was quite sure that my approach was way too simple; it's better if the mm experts
work on this.

>>
>> What we probably need to do instead is to extend flush_memcg_workqueue() to
>> wait for all outstanding rcu free callbacks. I have to think a bit about the best
>> way to fix it. How easy is it to reproduce the problem?

I can reproduce this at will and I am happy to test any fixes you provide.

> 
> After a second thought, flush_memcg_workqueue() already contains
> a rcu_barrier() call, so now my first suspicion is that the last free() call
> occurs after the kmem_cache_destroy() call. Can you please check whether
> that's the case?
> 

In kmem_cache_destroy(), the flush_memcg_workqueue() call is the first one; only after
that is shutdown_memcg_caches() called, which sets up the rcu callbacks.
So flush_memcg_workqueue() cannot wait for them. If you follow the 'call graph' above
along the RCU path in slub.c, you can see when the callbacks are set up and why no
warning is printed.
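
Abridged, the order in kmem_cache_destroy() (mm/slab_common.c, 5.4-rc1, shortened to
the calls that matter here; treat it as a sketch):

void kmem_cache_destroy(struct kmem_cache *s)
{
	...
	flush_memcg_workqueue(s);	/* the rcu_barrier() happens in here ... */
	...
	mutex_lock(&slab_mutex);
	...
	shutdown_memcg_caches(s);	/* ... but the rcu_free_slab() callbacks are
					 * only queued from here on, so the barrier
					 * above cannot have waited for them */
	shutdown_cache(s);
	...
}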


Second thought after I wrote all of the above: if flush_memcg_workqueue() already
contains an rcu_barrier(), what's the point of delaying the slab freeing in the rcu
case? All rcu readers should be done by then, so the rcu callbacks and the worker
would not be needed? What am I missing here (I am sure I am missing something, I am
completely new to the mm area)?

> Thanks!
> 
>>
>>>
>>> [  349.361168] Unable to handle kernel pointer dereference in virtual kernel address space
>>
>> Btw, haven't you noticed anything suspicious in dmesg before this line?

There is no error or warning line in dmesg before this line. Actually, I think that
no pages are still in use at that point, so no warning is printed. Anyway, the slab
freeing is delayed in any case when RCU is in use, right?
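
(This is the delaying I mean; the RCU path in mm/slub.c looks roughly like this,
quoted from my reading of the source, so take it as a sketch:)

static void rcu_free_slab(struct rcu_head *h)
{
	struct page *page = container_of(h, struct page, rcu_head);

	__free_slab(page->slab_cache, page);
}

static void free_slab(struct kmem_cache *s, struct page *page)
{
	if (unlikely(s->flags & SLAB_TYPESAFE_BY_RCU))
		call_rcu(&page->rcu_head, rcu_free_slab);	/* delayed free */
	else
		__free_slab(s, page);				/* immediate free */
}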


Karsten

>>
>> Thank you!
>>
>> Roman
>>
>>> [  349.361210] Failing address: 0000000000000000 TEID: 0000000000000483
>>> [  349.361223] Fault in home space mode while using kernel ASCE.
>>> [  349.361240] AS:00000000017d4007 R3:000000007fbd0007 S:000000007fbff000 P:000000000000003d
>>> [  349.361340] Oops: 0004 ilc:3 [#1] PREEMPT SMP
>>> [  349.361349] Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_de
>>> [  349.361436] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-05872-g6133e3e4bada-dirty #14
>>> [  349.361445] Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
>>> [  349.361450] Krnl PSW : 0704d00180000000 00000000003cadb6 (__free_slab+0x686/0x6b0)
>>> [  349.361464]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
>>> [  349.361470] Krnl GPRS: 00000000f3a32928 0000000000000000 000000007fbf5d00 000000000117c4b8
>>> [  349.361475]            0000000000000000 000000009e3291c1 0000000000000000 0000000000000000
>>> [  349.361481]            0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
>>> [  349.361486]            000000000117ba00 000003e000057db0 00000000003cabcc 000003e000057c78
>>> [  349.361500] Krnl Code: 00000000003cada6: e310a1400004        lg      %r1,320(%r10)
>>> [  349.361500]            00000000003cadac: c0e50046c286        brasl   %r14,ca32b8
>>> [  349.361500]           #00000000003cadb2: a7f4fe36            brc     15,3caa1e
>>> [  349.361500]           >00000000003cadb6: e32060800024        stg     %r2,128(%r6)
>>> [  349.361500]            00000000003cadbc: a7f4fd9e            brc     15,3ca8f8
>>> [  349.361500]            00000000003cadc0: c0e50046790c        brasl   %r14,c99fd8
>>> [  349.361500]            00000000003cadc6: a7f4fe2c            brc     15,3caa1e
>>> [  349.361500]            00000000003cadca: ecb1ffff00d9        aghik   %r11,%r1,-1
>>> [  349.361619] Call Trace:
>>> [  349.361627] ([<00000000003cabcc>] __free_slab+0x49c/0x6b0)
>>> [  349.361634]  [<00000000001f5886>] rcu_core+0x5a6/0x7e0
>>> [  349.361643]  [<0000000000ca2dea>] __do_softirq+0xf2/0x5c0
>>> [  349.361652]  [<0000000000152644>] irq_exit+0x104/0x130
>>> [  349.361659]  [<000000000010d222>] do_IRQ+0x9a/0xf0
>>> [  349.361667]  [<0000000000ca2344>] ext_int_handler+0x130/0x134
>>> [  349.361674]  [<0000000000103648>] enabled_wait+0x58/0x128
>>> [  349.361681] ([<0000000000103634>] enabled_wait+0x44/0x128)
>>> [  349.361688]  [<0000000000103b00>] arch_cpu_idle+0x40/0x58
>>> [  349.361695]  [<0000000000ca0544>] default_idle_call+0x3c/0x68
>>> [  349.361704]  [<000000000018eaa4>] do_idle+0xec/0x1c0
>>> [  349.361748]  [<000000000018ee0e>] cpu_startup_entry+0x36/0x40
>>> [  349.361756]  [<000000000122df34>] arch_call_rest_init+0x5c/0x88
>>> [  349.361761]  [<0000000000000000>] 0x0
>>> [  349.361765] INFO: lockdep is turned off.
>>> [  349.361769] Last Breaking-Event-Address:
>>> [  349.361774]  [<00000000003ca8f4>] __free_slab+0x1c4/0x6b0
>>> [  349.361781] Kernel panic - not syncing: Fatal exception in interrupt
>>>
>>>
>>> A fix that works for me (RFC):
>>>
>>> diff --git a/mm/slab.h b/mm/slab.h
>>> index a62372d0f271..b19a3f940338 100644
>>> --- a/mm/slab.h
>>> +++ b/mm/slab.h
>>> @@ -328,7 +328,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
>>>
>>>         rcu_read_lock();
>>>         memcg = READ_ONCE(s->memcg_params.memcg);
>>> -       if (likely(!mem_cgroup_is_root(memcg))) {
>>> +       if (likely(memcg && !mem_cgroup_is_root(memcg))) {
>>>                 lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
>>>                 mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
>>>                 memcg_kmem_uncharge_memcg(page, order, memcg);
>>>
>>> -- 
>>> Karsten
>>>
>>> (I'm a dude)
>>>
>>>