On Wed, Oct 02, 2019 at 12:41:29PM -0700, Roman Gushchin wrote:
> Hello, Karsten!
>
> Thank you for the report!
>
> On Wed, Oct 02, 2019 at 04:50:53PM +0200, Karsten Graul wrote:
> >
> > net/smc is calling proto_register(&smc_proto, 1) with smc_proto.slab_flags = SLAB_TYPESAFE_BY_RCU.
> > Right after the last SMC socket is destroyed, proto_unregister(&smc_proto) is called, which
> > calls kmem_cache_destroy(prot->slab). This results in a kernel crash in __free_slab().
> > Platform is s390x, reproduced on kernel 5.4-rc1. The problem was introduced by commit
> > fb2f2b0adb98 ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
> >
> > I added a 'call graph', below of that is the crash log and a (simple) patch that works for me,
> > but I don't know if this is the correct way to fix it.
> >
> > (Please keep me on CC of this thread because I do not follow the mm mailing list, thank you)
> >
> >
> > kmem_cache_destroy()
> >  -> shutdown_memcg_caches()
> >    -> shutdown_cache()
> >      -> __kmem_cache_shutdown()  (slub.c)
> >        -> free_partial()
> >          -> discard_slab()
> >            -> free_slab()  -- call to __free_slab() is delayed
> >              -> call_rcu(rcu_free_slab)
> >      -> memcg_unlink_cache()
> >        -> WRITE_ONCE(s->memcg_params.memcg, NULL);  -- !!!
> >      -> list_add_tail(&s->list, &slab_caches_to_rcu_destroy);
> >      -> schedule_work(&slab_caches_to_rcu_destroy_work);  -> work_fn uses rcu_barrier() to wait for rcu_batch,
> >                                                              so work_fn is not further involved here...
> > ... rcu grace period ...
> > rcu_batch()
> >  ...
> >  -> rcu_free_slab()  (slub.c)
> >    -> __free_slab()
> >      -> uncharge_slab_page()
> >        -> memcg_uncharge_slab()
> >          -> memcg = READ_ONCE(s->memcg_params.memcg);  -- !!! memcg == NULL
> >          -> mem_cgroup_lruvec(memcg)
> >            -> mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id);  -- mz == NULL
> >            -> lruvec = &mz->lruvec;  -- lruvec == NULL
> >            -> lruvec->pgdat = pgdat;  -- *crash*
> >
> > The crash log:
>
> Hm, I might be wrong, but it seems that the problem is deeper: __free_slab()
> called from the rcu path races with kmem_cache_destroy(), which is supposed
> to be called when there are no outstanding allocations (and corresponding pages).
> Any charged slab page actually holds a reference to the kmem_cache, which prevents
> its destruction (look at s->memcg_params.refcnt), but kmem_cache_destroy() ignores
> it.
>
> If my thoughts are correct, the commit you've mentioned didn't introduced this
> issue, it just made it easier to reproduce.
>
> The proposed fix looks dubious to me: the problem isn't in the memcg pointer
> (it's just a luck that it crashes on it), and it seems incorrect to not decrease
> the slab statistics of the original memory cgroup.
>
> What we probably need to do instead is to extend flush_memcg_workqueue() to
> wait for all outstanding rcu free callbacks. I have to think a bit what's the best
> way to fix it. How easy is to reproduce the problem?

On second thought, flush_memcg_workqueue() already contains an rcu_barrier()
call, so now my first suspicion is that the last free() call occurs after
the kmem_cache_destroy() call. Can you please check whether that is the case?

Thanks!

>
> > [  349.361168] Unable to handle kernel pointer dereference in virtual kernel address space
>
> Btw, haven't you noticed anything suspicious in dmesg before this line?
>
> Thank you!
>
> Roman
>
> > [  349.361210] Failing address: 0000000000000000 TEID: 0000000000000483
> > [  349.361223] Fault in home space mode while using kernel ASCE.
> > [  349.361240] AS:00000000017d4007 R3:000000007fbd0007 S:000000007fbff000 P:000000000000003d
> > [  349.361340] Oops: 0004 ilc:3 [#1] PREEMPT SMP
> > [  349.361349] Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_at nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_de
> > [  349.361436] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-05872-g6133e3e4bada-dirty #14
> > [  349.361445] Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
> > [  349.361450] Krnl PSW : 0704d00180000000 00000000003cadb6 (__free_slab+0x686/0x6b0)
> > [  349.361464]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
> > [  349.361470] Krnl GPRS: 00000000f3a32928 0000000000000000 000000007fbf5d00 000000000117c4b8
> > [  349.361475]            0000000000000000 000000009e3291c1 0000000000000000 0000000000000000
> > [  349.361481]            0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
> > [  349.361486]            000000000117ba00 000003e000057db0 00000000003cabcc 000003e000057c78
> > [  349.361500] Krnl Code: 00000000003cada6: e310a1400004  lg     %r1,320(%r10)
> > [  349.361500]            00000000003cadac: c0e50046c286  brasl  %r14,ca32b8
> > [  349.361500]           #00000000003cadb2: a7f4fe36      brc    15,3caa1e
> > [  349.361500]           >00000000003cadb6: e32060800024  stg    %r2,128(%r6)
> > [  349.361500]            00000000003cadbc: a7f4fd9e      brc    15,3ca8f8
> > [  349.361500]            00000000003cadc0: c0e50046790c  brasl  %r14,c99fd8
> > [  349.361500]            00000000003cadc6: a7f4fe2c      brc    15,3caa1e
> > [  349.361500]            00000000003cadca: ecb1ffff00d9  aghik  %r11,%r1,-1
> > [  349.361619] Call Trace:
> > [  349.361627] ([<00000000003cabcc>] __free_slab+0x49c/0x6b0)
> > [  349.361634]  [<00000000001f5886>] rcu_core+0x5a6/0x7e0
> > [  349.361643]  [<0000000000ca2dea>] __do_softirq+0xf2/0x5c0
> > [  349.361652]  [<0000000000152644>] irq_exit+0x104/0x130
> > [  349.361659]  [<000000000010d222>] do_IRQ+0x9a/0xf0
> > [  349.361667]  [<0000000000ca2344>] ext_int_handler+0x130/0x134
> > [  349.361674]  [<0000000000103648>] enabled_wait+0x58/0x128
> > [  349.361681] ([<0000000000103634>] enabled_wait+0x44/0x128)
> > [  349.361688]  [<0000000000103b00>] arch_cpu_idle+0x40/0x58
> > [  349.361695]  [<0000000000ca0544>] default_idle_call+0x3c/0x68
> > [  349.361704]  [<000000000018eaa4>] do_idle+0xec/0x1c0
> > [  349.361748]  [<000000000018ee0e>] cpu_startup_entry+0x36/0x40
> > [  349.361756]  [<000000000122df34>] arch_call_rest_init+0x5c/0x88
> > [  349.361761]  [<0000000000000000>] 0x0
> > [  349.361765] INFO: lockdep is turned off.
> > [  349.361769] Last Breaking-Event-Address:
> > [  349.361774]  [<00000000003ca8f4>] __free_slab+0x1c4/0x6b0
> > [  349.361781] Kernel panic - not syncing: Fatal exception in interrupt
> >
> >
> > A fix that works for me (RFC):
> >
> > diff --git a/mm/slab.h b/mm/slab.h
> > index a62372d0f271..b19a3f940338 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -328,7 +328,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
> >
> >      rcu_read_lock();
> >      memcg = READ_ONCE(s->memcg_params.memcg);
> > -    if (likely(!mem_cgroup_is_root(memcg))) {
> > +    if (likely(memcg && !mem_cgroup_is_root(memcg))) {
> >          lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
> >          mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
> >          memcg_kmem_uncharge_memcg(page, order, memcg);
> >
> > --
> > Karsten
> >
> > (I'm a dude)
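
To make the ordering in the quoted call graph easier to follow, here is a minimal user-space sketch of it. Everything in it is a hypothetical stand-in (struct memcg_model, struct cache_model, unlink_cache(), deferred_uncharge(), replay()); none of it is kernel code, and it models neither RCU nor the refcounting. It only replays the two possible orders of "clear the memcg pointer" and "run the delayed uncharge", with the NULL check from the RFC patch in place.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the kernel's types. */
struct memcg_model {
    long slab_pages;            /* models the per-memcg slab statistics */
};

struct cache_model {
    struct memcg_model *memcg;  /* models s->memcg_params.memcg */
};

/* Models memcg_unlink_cache(): the owner pointer is cleared on shutdown. */
static void unlink_cache(struct cache_model *s)
{
    s->memcg = NULL;
}

/*
 * Models the delayed uncharge with the RFC's extra NULL check.
 * Returns 1 if the statistics were updated, 0 if the update was skipped.
 */
static int deferred_uncharge(struct cache_model *s, int order)
{
    assert(order >= 0);

    struct memcg_model *memcg = s->memcg;
    if (memcg == NULL)
        return 0;               /* without this check: NULL dereference below */

    memcg->slab_pages -= 1L << order;
    return 1;
}

/*
 * Replays one of the two orderings for a single order-0 page and
 * returns the number of pages still accounted afterwards.
 */
static long replay(int unlink_first)
{
    struct memcg_model cg = { .slab_pages = 1 };
    struct cache_model s = { .memcg = &cg };

    if (unlink_first) {
        unlink_cache(&s);           /* shutdown path runs first */
        deferred_uncharge(&s, 0);   /* delayed callback runs afterwards */
    } else {
        deferred_uncharge(&s, 0);
        unlink_cache(&s);
    }
    return cg.slab_pages;
}
```

In this model replay(1) leaves one page accounted even though it was freed, which is the objection raised above about not decreasing the slab statistics of the original memory cgroup; replay(0) brings the counter back to zero.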