>> Let me show the 'call graph' again, call_rcu() is called by free_slab()
>> as part of kmem_cache_destroy(), and just before memcg_unlink_cache() clears
>> the memcg reference.
>>
>> kmem_cache_destroy()
>> -> shutdown_memcg_caches()
>>    -> shutdown_cache()
>>       -> __kmem_cache_shutdown() (slub.c)
>>          -> free_partial()
>>             -> discard_slab()
>>                -> free_slab()                 -- call to __free_slab() is delayed
>>                   -> call_rcu(rcu_free_slab)
>> -> memcg_unlink_cache()
>>    -> WRITE_ONCE(s->memcg_params.memcg, NULL);   -- !!!
>
> Ah, got it, thank you!
>
> Then something like this should work. Can you, please, confirm that it solves
> the problem?
>
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 807490fe217a..d916e986f094 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -180,8 +180,11 @@ static void destroy_memcg_params(struct kmem_cache *s)
>  {
>         if (is_root_cache(s))
>                 kvfree(rcu_access_pointer(s->memcg_params.memcg_caches));
> -       else
> +       else {
> +               mem_cgroup_put(s->memcg_params.memcg);
> +               WRITE_ONCE(s->memcg_params.memcg, NULL);
>                 percpu_ref_exit(&s->memcg_params.refcnt);
> +       }
>  }
>
>  static void free_memcg_params(struct rcu_head *rcu)
> @@ -253,8 +256,6 @@ static void memcg_unlink_cache(struct kmem_cache *s)
>         } else {
>                 list_del(&s->memcg_params.children_node);
>                 list_del(&s->memcg_params.kmem_caches_node);
> -               mem_cgroup_put(s->memcg_params.memcg);
> -               WRITE_ONCE(s->memcg_params.memcg, NULL);
>         }
>  }
>  #else

I tested this fix and can confirm that it solved the problem!

> --
>
> Thank you!
>
> Roman
>
>>
>>
>>> I'd add an atomic flag to the root kmem_cache, set it at the beginning of
>>> kmem_cache_destroy() and check it in free_slab(). If set, dump the stacktrace.
>>> Just please make sure you're looking at the root kmem_cache flag, not the memcg
>>> one.
>>>
>>> Thank you!
>>>
>>> Roman
>>>
>>>>
>>>> [ 145.540001] free_slab call_rcu() for 00000000392c2900, page is 000003d080e4a200
>>>> [ 145.540031] memcg_unlink_cache clearing memcg for 00000000392c2900
>>>> [ 145.540041] shutdown_cache adding to slab_caches_to_rcu_destroy queue for work: 00000000392c2900
>>>>
>>>> [ 145.540066] kmem_cache_destroy after shutdown_memcg_caches() for 0000000068106f00
>>>>
>>>> [ 145.540075] kmem_cache_destroy before final shutdown_cache() for 0000000068106f00
>>>> [ 145.540086] free_slab call_rcu() for 0000000068106f00, page is 000003d080e0a800
>>>> [ 145.540189] shutdown_cache adding to slab_caches_to_rcu_destroy queue for work: 0000000068106f00
>>>>
>>>> [ 145.540548] kmem_cache_destroy after final shutdown_cache() for 0000000068106f00
>>>>                kmem_cache_destroy is done
>>>> [ 145.540573] slab_caches_to_rcu_destroy_workfn before rcu_barrier() in workfunc
>>>>                slab_caches_to_rcu_destroy_workfn started and waits in rcu_barrier() now
>>>> [ 145.540619] smc.0698ae: smc_exit before smc_pnet_exit
>>>>                smc module exit code gets back control ...
>>>> [ 145.540699] smc.616283: smc_exit before unregister_pernet_subsys
>>>> [ 145.619747] rcu_free_slab called for 00000000392c2e00, page is 000003d080e45000
>>>>                much later the rcu callbacks are invoked, and will crash
>>>>
>>>>>>
>>>>>> If my thoughts are correct, the commit you've mentioned didn't introduce this
>>>>>> issue, it just made it easier to reproduce.
>>>>>>
>>>>>> The proposed fix looks dubious to me: the problem isn't in the memcg pointer
>>>>>> (it's just luck that it crashes on it), and it seems incorrect to not decrease
>>>>>> the slab statistics of the original memory cgroup.
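
To make that ordering concrete, here is a tiny userspace sketch of the bug
pattern -- not kernel code; struct fake_cache, pending and rcu_free_slab_sim()
are made-up stand-ins for the kmem_cache, the queued rcu_head and
rcu_free_slab():

#include <stdio.h>
#include <stdlib.h>

/* stand-in for struct kmem_cache with its memcg_params.memcg pointer */
struct fake_cache {
        int *memcg;
};

/* stand-in for the rcu_head queued by call_rcu() in free_slab() */
static struct fake_cache *pending;

/* stand-in for rcu_free_slab() -> __free_slab() -> memcg_uncharge_slab() */
static void rcu_free_slab_sim(void)
{
        if (!pending->memcg) {
                printf("BUG: callback ran after memcg was cleared\n");
                return;
        }
        printf("uncharging memcg %d\n", *pending->memcg);
}

int main(void)
{
        struct fake_cache s = { .memcg = malloc(sizeof(*s.memcg)) };

        *s.memcg = 42;
        pending = &s;           /* free_slab(): __free_slab() is only queued */

        free(s.memcg);          /* memcg_unlink_cache(): mem_cgroup_put() and */
        s.memcg = NULL;         /* WRITE_ONCE(..., NULL) while cb is pending  */

        rcu_free_slab_sim();    /* "much later" the callback fires -> BUG     */
        return 0;
}

As I understand Roman's fix above, moving the put/clear from
memcg_unlink_cache() into destroy_memcg_params() means the memcg pointer now
lives until the kmem_cache itself is released, which happens only after the
rcu_barrier() in slab_caches_to_rcu_destroy_workfn() has waited for the
pending rcu_free_slab() callbacks.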
>>>>
>>>> I was quite sure that my approach is way too simple; it's better when the mm
>>>> experts work on that.
>>>>
>>>>>>
>>>>>> What we probably need to do instead is to extend flush_memcg_workqueue() to
>>>>>> wait for all outstanding rcu free callbacks. I have to think a bit what's the
>>>>>> best way to fix it. How easy is it to reproduce the problem?
>>>>
>>>> I can reproduce this at will and I am happy to test any fixes you provide.
>>>>
>>>>>
>>>>> After a second thought: flush_memcg_workqueue() already contains
>>>>> a rcu_barrier() call, so now my first suspicion is that the last free() call
>>>>> occurs after the kmem_cache_destroy() call. Can you, please, check if that's
>>>>> not the case?
>>>>>
>>>>
>>>> In kmem_cache_destroy(), the flush_memcg_workqueue() call is the first one, and
>>>> after that shutdown_memcg_caches() is called, which sets up the rcu callbacks.
>>>
>>> These are callbacks to destroy kmem_caches, not pages.
>>>
>>>> So flush_memcg_workqueue() cannot wait for them. If you follow the 'call graph'
>>>> above using the RCU path in slub.c you can see when the callbacks are set up
>>>> and why no warning is printed.
>>>>
>>>>
>>>> Second thought after I wrote all of the above: when flush_memcg_workqueue()
>>>> already contains an rcu_barrier(), what's the point of delaying the slab
>>>> freeing in the rcu case? All rcu readers should be done by then, so the rcu
>>>> callbacks and the worker are not needed? What am I missing here (I am sure I
>>>> miss something, I am completely new in the mm area)?

(A userspace sketch of this barrier-vs-queueing ordering is at the very end of
this mail.)

>>>>> Thanks!
>>>>>
>>>>>>
>>>>>>>
>>>>>>> [  349.361168] Unable to handle kernel pointer dereference in virtual kernel address space
>>>>>>
>>>>>> Btw, haven't you noticed anything suspicious in dmesg before this line?
>>>>
>>>> There is no error or warning line in dmesg before this line. Actually, I think
>>>> that all pages are no longer in use, so no warning is printed. Anyway, the slab
>>>> freeing is delayed in any case when RCU is in use, right?
>>>>
>>>>
>>>> Karsten
>>>>
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> Roman
>>>>>>
>>>>>>> [  349.361210] Failing address: 0000000000000000 TEID: 0000000000000483
>>>>>>> [  349.361223] Fault in home space mode while using kernel ASCE.
>>>>>>> [  349.361240] AS:00000000017d4007 R3:000000007fbd0007 S:000000007fbff000 P:000000000000003d
>>>>>>> [  349.361340] Oops: 0004 ilc:3 [#1] PREEMPT SMP
>>>>>>> [  349.361349] Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_at nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_de
>>>>>>> [  349.361436] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-05872-g6133e3e4bada-dirty #14
>>>>>>> [  349.361445] Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
>>>>>>> [  349.361450] Krnl PSW : 0704d00180000000 00000000003cadb6 (__free_slab+0x686/0x6b0)
>>>>>>> [  349.361464]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
>>>>>>> [  349.361470] Krnl GPRS: 00000000f3a32928 0000000000000000 000000007fbf5d00 000000000117c4b8
>>>>>>> [  349.361475]            0000000000000000 000000009e3291c1 0000000000000000 0000000000000000
>>>>>>> [  349.361481]            0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
>>>>>>> [  349.361486]            000000000117ba00 000003e000057db0 00000000003cabcc 000003e000057c78
>>>>>>> [  349.361500] Krnl Code: 00000000003cada6: e310a1400004   lg      %r1,320(%r10)
>>>>>>> [  349.361500]            00000000003cadac: c0e50046c286   brasl   %r14,ca32b8
>>>>>>> [  349.361500]           #00000000003cadb2: a7f4fe36       brc     15,3caa1e
>>>>>>> [  349.361500]           >00000000003cadb6: e32060800024   stg     %r2,128(%r6)
>>>>>>> [  349.361500]            00000000003cadbc: a7f4fd9e       brc     15,3ca8f8
>>>>>>> [  349.361500]            00000000003cadc0: c0e50046790c   brasl   %r14,c99fd8
>>>>>>> [  349.361500]            00000000003cadc6: a7f4fe2c       brc     15,3caa1e
>>>>>>> [  349.361500]            00000000003cadca: ecb1ffff00d9   aghik   %r11,%r1,-1
>>>>>>> [  349.361619] Call Trace:
>>>>>>> [  349.361627] ([<00000000003cabcc>] __free_slab+0x49c/0x6b0)
>>>>>>> [  349.361634]  [<00000000001f5886>] rcu_core+0x5a6/0x7e0
>>>>>>> [  349.361643]  [<0000000000ca2dea>] __do_softirq+0xf2/0x5c0
>>>>>>> [  349.361652]  [<0000000000152644>] irq_exit+0x104/0x130
>>>>>>> [  349.361659]  [<000000000010d222>] do_IRQ+0x9a/0xf0
>>>>>>> [  349.361667]  [<0000000000ca2344>] ext_int_handler+0x130/0x134
>>>>>>> [  349.361674]  [<0000000000103648>] enabled_wait+0x58/0x128
>>>>>>> [  349.361681] ([<0000000000103634>] enabled_wait+0x44/0x128)
>>>>>>> [  349.361688]  [<0000000000103b00>] arch_cpu_idle+0x40/0x58
>>>>>>> [  349.361695]  [<0000000000ca0544>] default_idle_call+0x3c/0x68
>>>>>>> [  349.361704]  [<000000000018eaa4>] do_idle+0xec/0x1c0
>>>>>>> [  349.361748]  [<000000000018ee0e>] cpu_startup_entry+0x36/0x40
>>>>>>> [  349.361756]  [<000000000122df34>] arch_call_rest_init+0x5c/0x88
>>>>>>> [  349.361761]  [<0000000000000000>] 0x0
>>>>>>> [  349.361765] INFO: lockdep is turned off.
>>>>>>> [  349.361769] Last Breaking-Event-Address:
>>>>>>> [  349.361774]  [<00000000003ca8f4>] __free_slab+0x1c4/0x6b0
>>>>>>> [  349.361781] Kernel panic - not syncing: Fatal exception in interrupt
>>>>>>>
>>>>>>>
>>>>>>> A fix that works for me (RFC):
>>>>>>>
>>>>>>> diff --git a/mm/slab.h b/mm/slab.h
>>>>>>> index a62372d0f271..b19a3f940338 100644
>>>>>>> --- a/mm/slab.h
>>>>>>> +++ b/mm/slab.h
>>>>>>> @@ -328,7 +328,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
>>>>>>>
>>>>>>>         rcu_read_lock();
>>>>>>>         memcg = READ_ONCE(s->memcg_params.memcg);
>>>>>>> -       if (likely(!mem_cgroup_is_root(memcg))) {
>>>>>>> +       if (likely(memcg && !mem_cgroup_is_root(memcg))) {
>>>>>>>                 lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
>>>>>>>                 mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
>>>>>>>                 memcg_kmem_uncharge_memcg(page, order, memcg);
>>>>>>>
>>>>>>> --
>>>>>>> Karsten
>>>>>>>
>>>>>>> (I'm a dude)
>>>>>>>
>>>>>>>
>>>>
>>
>> --
>> Karsten
>>
>> (I'm a dude)
>>

--
Karsten

(I'm a dude)
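
PS: Coming back to my question further up about the rcu_barrier() in
flush_memcg_workqueue(): as I understand it, a barrier only waits for
callbacks that were already queued when it started, and in kmem_cache_destroy()
the flush runs before shutdown_memcg_caches() queues the rcu_free_slab()
callbacks, so that barrier cannot cover them. A minimal userspace sketch of
that ordering -- fake_call_rcu() and fake_rcu_barrier() are made-up stand-ins,
not the kernel API:

#include <stdio.h>

/* toy callback queue standing in for the RCU callback lists */
#define MAX_CBS 8
static void (*queued[MAX_CBS])(void);
static int nr_queued;

/* stand-in for call_rcu(): only queues the callback */
static void fake_call_rcu(void (*cb)(void))
{
        queued[nr_queued++] = cb;
}

/* stand-in for rcu_barrier(): waits for (here: runs) everything
 * queued so far -- and only that */
static void fake_rcu_barrier(void)
{
        int i;

        for (i = 0; i < nr_queued; i++)
                queued[i]();
        nr_queued = 0;
}

static void rcu_free_slab_sim(void)
{
        printf("rcu_free_slab() runs\n");
}

int main(void)
{
        /* kmem_cache_destroy(), simplified: */
        fake_rcu_barrier();               /* flush_memcg_workqueue(): queue empty */
        fake_call_rcu(rcu_free_slab_sim); /* shutdown_memcg_caches() -> free_slab() */

        printf("kmem_cache_destroy() returns, callback still pending\n");

        fake_rcu_barrier();               /* only a later barrier catches it */
        return 0;
}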