Re: BUG: Crash in __free_slab() using SLAB_TYPESAFE_BY_RCU

Hello, Karsten!

Thank you for the report!

On Wed, Oct 02, 2019 at 04:50:53PM +0200, Karsten Graul wrote:
> 
> net/smc is calling proto_register(&smc_proto, 1) with smc_proto.slab_flags = SLAB_TYPESAFE_BY_RCU.
> Right after the last SMC socket is destroyed, proto_unregister(&smc_proto) is called, which 
> calls kmem_cache_destroy(prot->slab). This results in a kernel crash in __free_slab().
> Platform is s390x, reproduced on kernel 5.4-rc1. The problem was introduced by commit
> fb2f2b0adb98 ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
> 
> I added a 'call graph' below, followed by the crash log and a (simple) patch that works for me,
> but I don't know if this is the correct way to fix it.
> 
> (Please keep me on CC of this thread because I do not follow the mm mailing list, thank you)
> 
> 
> kmem_cache_destroy() 
>   -> shutdown_memcg_caches()
>     -> shutdown_cache()
>       -> __kmem_cache_shutdown()  (slub.c)
>         -> free_partial()
>           -> discard_slab()
>             -> free_slab()                                      -- call to __free_slab() is delayed
>               -> call_rcu(rcu_free_slab)
>     -> memcg_unlink_cache()
>       -> WRITE_ONCE(s->memcg_params.memcg, NULL);               -- !!!
>     -> list_add_tail(&s->list, &slab_caches_to_rcu_destroy);
>     -> schedule_work(&slab_caches_to_rcu_destroy_work);  -> work_fn uses rcu_barrier() to wait for rcu_batch, 
>                                                             so work_fn is not further involved here...
> ... rcu grace period ...
> rcu_batch()
>   ...
>   -> rcu_free_slab()   (slub.c)
>     -> __free_slab()
>       -> uncharge_slab_page()
>         -> memcg_uncharge_slab()
>           -> memcg = READ_ONCE(s->memcg_params.memcg);          -- !!! memcg == NULL
>           -> mem_cgroup_lruvec(memcg)
>             -> mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id); -- mz == NULL
>             -> lruvec = &mz->lruvec;                            -- lruvec == NULL
>             -> lruvec->pgdat = pgdat;                           -- *crash*
> 
> The crash log:

Hm, I might be wrong, but it seems that the problem is deeper: __free_slab()
called from the rcu path races with kmem_cache_destroy(), which is supposed
to be called when there are no outstanding allocations (and corresponding pages).
Any charged slab page actually holds a reference to the kmem_cache, which prevents
its destruction (look at s->memcg_params.refcnt), but kmem_cache_destroy() ignores
it.
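
To make the ordering concrete, here is a minimal user-space sketch of the
sequence from the call graph. It is not kernel code: every toy_* name is
hypothetical, the deferred free is a plain struct instead of an RCU callback,
and the -1 return only marks the point where the real code would crash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical user-space stand-ins; none of these are kernel types. */
struct toy_memcg { long slab_pages; };

struct toy_cache {
	struct toy_memcg *memcg;	/* models s->memcg_params.memcg */
	int refcnt;			/* models s->memcg_params.refcnt */
};

/* One pending deferred free, standing in for call_rcu(rcu_free_slab). */
struct toy_deferred {
	struct toy_cache *cache;
	int pending;
};

/* Queue the free instead of running it right away. */
static void toy_defer_free(struct toy_deferred *d, struct toy_cache *c)
{
	d->cache = c;
	d->pending = 1;
}

/* Models memcg_unlink_cache(): clears the pointer without looking at refcnt. */
static void toy_unlink(struct toy_cache *c)
{
	c->memcg = NULL;
}

/*
 * Models the uncharge step of the deferred free. Returns 0 on success and
 * -1 where the real code would go on to dereference a NULL-derived pointer.
 */
static int toy_run_deferred(struct toy_deferred *d)
{
	struct toy_cache *c = d->cache;

	assert(d->pending);
	d->pending = 0;
	if (c->memcg == NULL)
		return -1;
	c->memcg->slab_pages--;
	c->refcnt--;
	return 0;
}

/* Replays the reported order: defer the free, unlink, then run the free. */
static int toy_replay_report(void)
{
	struct toy_memcg memcg = { .slab_pages = 1 };
	struct toy_cache cache = { .memcg = &memcg, .refcnt = 1 };
	struct toy_deferred d = { .cache = NULL, .pending = 0 };

	toy_defer_free(&d, &cache);
	/* The page still holds its reference here, yet the unlink proceeds. */
	printf("refcnt at unlink: %d\n", cache.refcnt);
	toy_unlink(&cache);
	return toy_run_deferred(&d);
}
```

Calling toy_replay_report() returns -1: the deferred free runs after the
unlink and finds no memcg, while the reference count was still nonzero when
the unlink happened.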

If my thoughts are correct, the commit you've mentioned didn't introduce this
issue, it just made it easier to reproduce.

The proposed fix looks dubious to me: the problem isn't in the memcg pointer
(it's just luck that it crashes on it), and it seems incorrect not to decrease
the slab statistics of the original memory cgroup.

What we probably need to do instead is to extend flush_memcg_workqueue() to
wait for all outstanding rcu free callbacks. I have to think a bit about what's
the best way to fix it. How easy is it to reproduce the problem?
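
As a rough illustration of the "wait first, then tear down" idea, here is a
user-space sketch. It is only a model under stated assumptions: the toy_*
names are hypothetical, toy_drain() merely stands in for waiting on the
outstanding callbacks, and nothing here is the actual flush_memcg_workqueue()
change.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PENDING 8

/* Hypothetical user-space stand-ins; none of these are kernel types. */
struct toy_memcg { long slab_pages; };
struct toy_cache { struct toy_memcg *memcg; };

/* Pending deferred frees, standing in for queued RCU callbacks. */
struct toy_queue {
	struct toy_cache *pending[TOY_MAX_PENDING];
	size_t count;
};

static void toy_defer_free(struct toy_queue *q, struct toy_cache *c)
{
	assert(q->count < TOY_MAX_PENDING);
	q->pending[q->count++] = c;
}

/*
 * Runs every queued free. Returns how many of them found the memcg pointer
 * already cleared, i.e. how many would have crashed in the real code.
 */
static int toy_drain(struct toy_queue *q)
{
	int failures = 0;

	for (size_t i = 0; i < q->count; i++) {
		struct toy_cache *c = q->pending[i];

		if (c->memcg == NULL)
			failures++;
		else
			c->memcg->slab_pages--;
	}
	q->count = 0;
	return failures;
}

/* Destroy path that drains the outstanding frees before unlinking. */
static int toy_destroy_drain_first(long *pages_left)
{
	struct toy_memcg memcg = { .slab_pages = 2 };
	struct toy_cache cache = { .memcg = &memcg };
	struct toy_queue q = { .count = 0 };
	int failures;

	toy_defer_free(&q, &cache);
	toy_defer_free(&q, &cache);

	failures = toy_drain(&q);	/* wait for outstanding frees first */
	cache.memcg = NULL;		/* only then unlink the cache */
	failures += toy_drain(&q);	/* nothing is left to run */

	*pages_left = memcg.slab_pages;
	return failures;
}
```

With this order both deferred frees see a valid memcg, so the function
returns 0 and the statistics of the original cgroup are decremented rather
than skipped.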

> 
> [  349.361168] Unable to handle kernel pointer dereference in virtual kernel address space

Btw, did you notice anything suspicious in dmesg before this line?

Thank you!

Roman

> [  349.361210] Failing address: 0000000000000000 TEID: 0000000000000483
> [  349.361223] Fault in home space mode while using kernel ASCE.
> [  349.361240] AS:00000000017d4007 R3:000000007fbd0007 S:000000007fbff000 P:000000000000003d
> [  349.361340] Oops: 0004 ilc:3 [#1] PREEMPT SMP
> [  349.361349] Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_at nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_de
> [  349.361436] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-05872-g6133e3e4bada-dirty #14
> [  349.361445] Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
> [  349.361450] Krnl PSW : 0704d00180000000 00000000003cadb6 (__free_slab+0x686/0x6b0)
> [  349.361464]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
> [  349.361470] Krnl GPRS: 00000000f3a32928 0000000000000000 000000007fbf5d00 000000000117c4b8
> [  349.361475]            0000000000000000 000000009e3291c1 0000000000000000 0000000000000000
> [  349.361481]            0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
> [  349.361486]            000000000117ba00 000003e000057db0 00000000003cabcc 000003e000057c78
> [  349.361500] Krnl Code: 00000000003cada6: e310a1400004        lg      %r1,320(%r10)
> [  349.361500]            00000000003cadac: c0e50046c286        brasl   %r14,ca32b8
> [  349.361500]           #00000000003cadb2: a7f4fe36            brc     15,3caa1e
> [  349.361500]           >00000000003cadb6: e32060800024        stg     %r2,128(%r6)
> [  349.361500]            00000000003cadbc: a7f4fd9e            brc     15,3ca8f8
> [  349.361500]            00000000003cadc0: c0e50046790c        brasl   %r14,c99fd8
> [  349.361500]            00000000003cadc6: a7f4fe2c            brc     15,3caa1e
> [  349.361500]            00000000003cadca: ecb1ffff00d9        aghik   %r11,%r1,-1
> [  349.361619] Call Trace:
> [  349.361627] ([<00000000003cabcc>] __free_slab+0x49c/0x6b0)
> [  349.361634]  [<00000000001f5886>] rcu_core+0x5a6/0x7e0
> [  349.361643]  [<0000000000ca2dea>] __do_softirq+0xf2/0x5c0
> [  349.361652]  [<0000000000152644>] irq_exit+0x104/0x130
> [  349.361659]  [<000000000010d222>] do_IRQ+0x9a/0xf0
> [  349.361667]  [<0000000000ca2344>] ext_int_handler+0x130/0x134
> [  349.361674]  [<0000000000103648>] enabled_wait+0x58/0x128
> [  349.361681] ([<0000000000103634>] enabled_wait+0x44/0x128)
> [  349.361688]  [<0000000000103b00>] arch_cpu_idle+0x40/0x58
> [  349.361695]  [<0000000000ca0544>] default_idle_call+0x3c/0x68
> [  349.361704]  [<000000000018eaa4>] do_idle+0xec/0x1c0
> [  349.361748]  [<000000000018ee0e>] cpu_startup_entry+0x36/0x40
> [  349.361756]  [<000000000122df34>] arch_call_rest_init+0x5c/0x88
> [  349.361761]  [<0000000000000000>] 0x0
> [  349.361765] INFO: lockdep is turned off.
> [  349.361769] Last Breaking-Event-Address:
> [  349.361774]  [<00000000003ca8f4>] __free_slab+0x1c4/0x6b0
> [  349.361781] Kernel panic - not syncing: Fatal exception in interrupt
> 
> 
> A fix that works for me (RFC):
> 
> diff --git a/mm/slab.h b/mm/slab.h
> index a62372d0f271..b19a3f940338 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -328,7 +328,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
> 
>         rcu_read_lock();
>         memcg = READ_ONCE(s->memcg_params.memcg);
> -       if (likely(!mem_cgroup_is_root(memcg))) {
> +       if (likely(memcg && !mem_cgroup_is_root(memcg))) {
>                 lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
>                 mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
>                 memcg_kmem_uncharge_memcg(page, order, memcg);
> 
> -- 
> Karsten
> 
> (I'm a dude)
> 
> 




