On 3/4/25 11:26, Vlastimil Babka wrote:
> +Cc NETWORKING [TLS] maintainers and netdev for input, thanks.
>
> The full error is here:
> https://lore.kernel.org/all/fcfa11c6-2738-4a2e-baa8-09fa8f79cbf3@xxxxxxx/
>
> On 3/4/25 11:20, Hannes Reinecke wrote:
>> On 3/4/25 09:18, Vlastimil Babka wrote:
>>> On 3/4/25 08:58, Hannes Reinecke wrote:
>>>> On 3/3/25 23:02, Vlastimil Babka wrote:
>>>>> On 3/3/25 17:15, Vlastimil Babka wrote:
>>>>>> On 3/3/25 16:48, Matthew Wilcox wrote:
>>>>>>> You need to turn on the debugging options Vlastimil mentioned and try to
>>>>>>> figure out what nvme is doing wrong.
>>>>>>
>>>>>> Agree, looks like some error path going wrong?
>>>>>> Since there seems to be actual non-large kmalloc usage involved, another
>>>>>> debug parameter that could help: CONFIG_SLUB_DEBUG=y, and boot with
>>>>>> "slab_debug=FZPU,kmalloc-*"
>>>>>
>>>>> Also make sure you have CONFIG_DEBUG_VM please.
>>>>>
>>>> Here you go:
>>>>
>>>> [ 134.506802] page: refcount:0 mapcount:0 mapping:0000000000000000
>>>> index:0x0 pfn:0x101ef8
>>>> [ 134.509253] head: order:3 mapcount:0 entire_mapcount:0
>>>> nr_pages_mapped:0 pincount:0
>>>> [ 134.511594] flags:
>>>> 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff)
>>>> [ 134.513556] page_type: f5(slab)
>>>> [ 134.513563] raw: 0017ffffc0000040 ffff888100041b00 ffffea0004a90810
>>>> ffff8881000402f0
>>>> [ 134.513568] raw: 0000000000000000 00000000000a000a 00000000f5000000
>>>> 0000000000000000
>>>> [ 134.513572] head: 0017ffffc0000040 ffff888100041b00 ffffea0004a90810
>>>> ffff8881000402f0
>>>> [ 134.513575] head: 0000000000000000 00000000000a000a 00000000f5000000
>>>> 0000000000000000
>>>> [ 134.513579] head: 0017ffffc0000003 ffffea000407be01 ffffffffffffffff
>>>> 0000000000000000
>>>> [ 134.513583] head: 0000000000000008 0000000000000000 00000000ffffffff
>>>> 0000000000000000
>>>> [ 134.513585] page dumped because: VM_BUG_ON_FOLIO(((unsigned int)
>>>> folio_ref_count(folio) + 127u <= 127u))
>>>> [ 134.513615] ------------[ cut here ]------------
>>>> [ 134.529822] kernel BUG at ./include/linux/mm.h:1455!
>>>
>>> Yeah, just as I suspected, folio_get() says the refcount is 0.
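
A side note on that check, since it looks a bit cryptic: the condition in
VM_BUG_ON_FOLIO() is the usual unsigned-wraparound trick that catches a
refcount of zero (or one that has already gone slightly negative) in a
single comparison, so it firing really does mean folio_ref_count() was 0
by the time folio_get() ran. A tiny userspace sketch of the arithmetic,
purely for illustration:

/*
 * Shows the range caught by "(unsigned int)refcount + 127u <= 127u":
 * it is true exactly for refcounts in [-127, 0], i.e. folios whose
 * refcount has already dropped to zero (or wrapped below it).
 */
#include <stdio.h>

int main(void)
{
	int samples[] = { -128, -127, -1, 0, 1, 2 };

	for (int i = 0; i < 6; i++) {
		int ref = samples[i];

		printf("refcount %4d -> check fires: %s\n", ref,
		       ((unsigned int)ref + 127u <= 127u) ? "yes" : "no");
	}
	return 0;
}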
>>>
>>>> [ 134.529835] Oops: invalid opcode: 0000 [#1] PREEMPT SMP
>>>> DEBUG_PAGEALLOC NOPTI
>>>> [ 134.529843] CPU: 0 UID: 0 PID: 274 Comm: kworker/0:1H Kdump: loaded
>>>> Tainted: G E 6.14.0-rc4-default+ #309
>>>> 03b131f1ef70944969b40df9d90a283ed638556f
>>>> [ 134.536577] Tainted: [E]=UNSIGNED_MODULE
>>>> [ 134.536580] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
>>>> 0.0.0 02/06/2015
>>>> [ 134.536583] Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
>>>> [ 134.536595] RIP: 0010:__iov_iter_get_pages_alloc+0x676/0x710
>>>> [ 134.542810] Code: e8 4c 39 e0 49 0f 47 c4 48 01 45 08 48 29 45 18 e9
>>>> 90 fa ff ff 48 83 ef 01 e9 7f fe ff ff 48 c7 c6 40 57 4f 82 e8 6a e2 ce
>>>> ff <0f> 0b e8 43 b8 b1 ff eb c5 f7 c1 ff 0f 00 00 48 89 cf 0f 85 4f ff
>>>> [ 134.542816] RSP: 0018:ffffc900004579d8 EFLAGS: 00010282
>>>> [ 134.542821] RAX: 000000000000005c RBX: ffffc90000457a90 RCX:
>>>> 0000000000000027
>>>> [ 134.542825] RDX: 0000000000000000 RSI: 0000000000000002 RDI:
>>>> ffff88817f423748
>>>> [ 134.542828] RBP: ffffc90000457d60 R08: 0000000000000000 R09:
>>>> 0000000000000001
>>>> [ 134.554485] R10: ffffc900004579c0 R11: ffffc90000457720 R12:
>>>> 0000000000000000
>>>> [ 134.554488] R13: ffffea000407be40 R14: ffffc90000457a70 R15:
>>>> ffffc90000457d60
>>>> [ 134.554495] FS: 0000000000000000(0000) GS:ffff88817f400000(0000)
>>>> knlGS:0000000000000000
>>>> [ 134.554499] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [ 134.554502] CR2: 0000556b0675b600 CR3: 0000000106bd8000 CR4:
>>>> 0000000000350ef0
>>>> [ 134.554509] Call Trace:
>>>> [ 134.554512] <TASK>
>>>> [ 134.554516] ? __die_body+0x1a/0x60
>>>> [ 134.554525] ? die+0x38/0x60
>>>> [ 134.554531] ? do_trap+0x10f/0x120
>>>> [ 134.554538] ? __iov_iter_get_pages_alloc+0x676/0x710
>>>> [ 134.568839] ? do_error_trap+0x64/0xa0
>>>> [ 134.568847] ? __iov_iter_get_pages_alloc+0x676/0x710
>>>> [ 134.568855] ? exc_invalid_op+0x53/0x60
>>>> [ 134.572489] ? __iov_iter_get_pages_alloc+0x676/0x710
>>>> [ 134.572496] ? asm_exc_invalid_op+0x16/0x20
>>>> [ 134.572512] ? __iov_iter_get_pages_alloc+0x676/0x710
>>>> [ 134.576726] ? __iov_iter_get_pages_alloc+0x676/0x710
>>>> [ 134.576733] ? srso_return_thunk+0x5/0x5f
>>>> [ 134.576740] ? ___slab_alloc+0x924/0xb60
>>>> [ 134.580253] ? mempool_alloc_noprof+0x41/0x190
>>>> [ 134.580262] ? tls_get_rec+0x3d/0x1b0 [tls
>>>> 47f199c97f69357468c91efdbba24395e9dbfa77]
>>>> [ 134.580282] iov_iter_get_pages2+0x19/0x30
>>>
>>> Presumably that's __iov_iter_get_pages_alloc() doing get_page() either in
>>> the " if (iov_iter_is_bvec(i)) " branch or via iter_folioq_get_pages()?
>>>
>> Looks like it.
>>
>>> Which doesn't work for a sub-size kmalloc() from a slab folio, which after
>>> the frozen refcount conversion no longer supports get_page().
>>>
>>> The question is if this is a mistake specific for this path that's easy to
>>> fix or there are more paths that do this. At the very least the pinning of
>>> a page through a kmalloc() allocation from it is useless - the object itself
>>> has to be kfree()'d and that would never happen through a put_page()
>>> reaching zero.
>>>
>> Looks like a specific mistake.
>> tls_sw is the only user of sk_msg_zerocopy_from_iter()
>> (which is calling into __iov_iter_get_pages_alloc()).

That's from tls_sw_sendmsg_locked(), right? But that's under:

	if (!is_kvec && (full_record || eor) && !async_capable) {

Shouldn't is_kvec be true if we're dealing with a kernel buffer (kmalloc())
there?
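
I may be misreading how the buffer gets down there, but note that
iov_iter_is_kvec() is only true for iterators built with iov_iter_kvec();
a kmalloc() buffer that arrives wrapped in a bvec (the
"if (iov_iter_is_bvec(i))" branch mentioned above) sails straight past a
!is_kvec check. Hypothetical toy module, not taken from any of the code
paths discussed, just to illustrate the distinction:

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/uio.h>
#include <linux/bvec.h>
#include <linux/mm.h>

static int __init iter_kind_demo_init(void)
{
	void *buf = kmalloc(512, GFP_KERNEL);
	struct kvec kv;
	struct bio_vec bv;
	struct iov_iter kvec_iter, bvec_iter;

	if (!buf)
		return -ENOMEM;

	/* Same kmalloc() object, handed to the iov_iter layer two ways. */
	kv.iov_base = buf;
	kv.iov_len = 512;
	iov_iter_kvec(&kvec_iter, ITER_SOURCE, &kv, 1, 512);

	bvec_set_page(&bv, virt_to_page(buf), 512, offset_in_page(buf));
	iov_iter_bvec(&bvec_iter, ITER_SOURCE, &bv, 1, 512);

	pr_info("kvec-wrapped: is_kvec=%d\n", iov_iter_is_kvec(&kvec_iter));
	pr_info("bvec-wrapped: is_kvec=%d\n", iov_iter_is_kvec(&bvec_iter));

	kfree(buf);
	return 0;
}

static void __exit iter_kind_demo_exit(void)
{
}

module_init(iter_kind_demo_init);
module_exit(iter_kind_demo_exit);
MODULE_LICENSE("GPL");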
>> And, more to the point, tls_sw messes up iov pacing coming in from
>> the upper layers.
>> So even if the upper layers send individual iovs (where each iov might
>> contain different allocation types), tls_sw is packing them together
>> into full records. So it might end up with iovs having _different_
>> allocations.
>> Which would explain why we only see it with TLS, but not with normal
>> connections.
>>
>> Or so my reasoning goes. Not sure if that's correct.
>>
>> So I'd be happy with an 'easy' fix for now. Obviously :-)
>>
>> Cheers,
>>
>> Hannes
>
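
For what it's worth, one possible shape for such an 'easy' fix: the net
core already has sendpage_ok() for exactly this kind of situation (it
refuses slab pages and pages with a zero refcount), so the zerocopy path
could fall back to copying whenever a page backing the iterator fails that
test. Completely untested sketch, helper name made up, only to show the
idea:

/*
 * Hypothetical helper (untested): report whether every page backing a
 * bvec array can safely have references taken and dropped on it;
 * sendpage_ok() rejects slab pages and pages with refcount 0.
 */
#include <linux/net.h>
#include <linux/bvec.h>

static bool bvec_ok_for_zerocopy(const struct bio_vec *bvec,
				 unsigned int nr_segs)
{
	unsigned int i;

	for (i = 0; i < nr_segs; i++)
		if (!sendpage_ok(bvec[i].bv_page))
			return false;
	return true;
}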