On 3/4/25 17:14, Matthew Wilcox wrote:
On Tue, Mar 04, 2025 at 11:26:07AM +0100, Vlastimil Babka wrote:
+Cc NETWORKING [TLS] maintainers and netdev for input, thanks.
The full error is here:
https://lore.kernel.org/all/fcfa11c6-2738-4a2e-baa8-09fa8f79cbf3@xxxxxxx/
On 3/4/25 11:20, Hannes Reinecke wrote:
On 3/4/25 09:18, Vlastimil Babka wrote:
On 3/4/25 08:58, Hannes Reinecke wrote:
On 3/3/25 23:02, Vlastimil Babka wrote:
Also make sure you have CONFIG_DEBUG_VM please.
Here you go:
[ 134.506802] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x101ef8
[ 134.509253] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[ 134.511594] flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff)
[ 134.513556] page_type: f5(slab)
[ 134.513563] raw: 0017ffffc0000040 ffff888100041b00 ffffea0004a90810 ffff8881000402f0
[ 134.513568] raw: 0000000000000000 00000000000a000a 00000000f5000000 0000000000000000
[ 134.513572] head: 0017ffffc0000040 ffff888100041b00 ffffea0004a90810 ffff8881000402f0
[ 134.513575] head: 0000000000000000 00000000000a000a 00000000f5000000 0000000000000000
[ 134.513579] head: 0017ffffc0000003 ffffea000407be01 ffffffffffffffff 0000000000000000
[ 134.513583] head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
[ 134.513585] page dumped because: VM_BUG_ON_FOLIO(((unsigned int) folio_ref_count(folio) + 127u <= 127u))
[ 134.513615] ------------[ cut here ]------------
[ 134.529822] kernel BUG at ./include/linux/mm.h:1455!
Yeah, just as I suspected, folio_get() says the refcount is 0.
... and it has a page_type of f5 (slab)
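For readers unfamiliar with the expression in the BUG message, here is
a small userspace sketch of why it fires for a refcount of 0. It is
not kernel code: the helper name ref_zero_or_near_overflow() and the
self-check function are made up for illustration, and only the
arithmetic is copied from the log. The unsigned addition wraps, so the
test is true for any count from -127 through 0.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper name; the expression itself is taken from the
 * VM_BUG_ON_FOLIO() message in the log, applied to a plain int.
 */
static bool ref_zero_or_near_overflow(int refcount)
{
	/* As unsigned, -127..-1 wrap around to 0..126 after adding 127. */
	return (unsigned int)refcount + 127u <= 127u;
}

/* Self-check covering both edges of the window that trips the test. */
static void check_refcount_window(void)
{
	assert(ref_zero_or_near_overflow(0));     /* the case reported here */
	assert(ref_zero_or_near_overflow(-1));
	assert(ref_zero_or_near_overflow(-127));
	assert(!ref_zero_or_near_overflow(-128)); /* just outside the window */
	assert(!ref_zero_or_near_overflow(1));    /* a normally held page */
}
```

So a single comparison catches both "refcount is zero" and "refcount
has wrapped slightly negative", which matches refcount:0 in the dump.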
[ 134.554509] Call Trace:
[ 134.580282] iov_iter_get_pages2+0x19/0x30
Presumably that's __iov_iter_get_pages_alloc() doing get_page() either in
the " if (iov_iter_is_bvec(i)) " branch or via iter_folioq_get_pages()?
It's the bvec path:
iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
Which doesn't work for a sub-size kmalloc() from a slab folio, which after
the frozen refcount conversion no longer supports get_page().
The question is whether this is a mistake specific to this path that's
easy to fix, or whether there are more paths that do this. At the very
least, pinning a page through a kmalloc() allocation from it is useless:
the object itself has to be kfree()'d, and that would never happen
through a put_page() reaching zero.
Looks like a specific mistake.
tls_sw is the only user of sk_msg_zerocopy_from_iter()
(which is calling into __iov_iter_get_pages_alloc()).
And, more to the point, tls_sw messes up iov pacing coming in from
the upper layers.
So even if the upper layers send individual iovs (where each iov might
contain different allocation types), tls_sw is packing them together
into full records. So it might end up with iovs having _different_
allocations.
Which would explain why we only see it with TLS, but not with normal
connections.
I thought we'd done all the work needed to get rid of these pointless
refcount bumps. Turns out that's only on the block side (eg commit
e4cc64657bec). So what does networking need in order to understand
that some iovecs do not need to mess with the refcount?
The network stack needs to get hold of the page while transmission is
ongoing, as there is potentially rather deep queueing involved,
requiring several calls to sendmsg() and friends before the page is
finally transmitted. And maybe some post-processing (checksums,
digests, you name it), too, all of which require the page to be there.
It's all so jumbled up ... personally, I would _love_ to do away with
__iov_iter_get_pages_alloc(). Allocating a page array? Seriously?
And the problem with that is that it always takes a page(!) reference,
completely oblivious to the fact whether you even _can_ take a page
reference (eg for tail pages); we've hit this problem several times now
(check for sendpage_ok() ...).
But that's not the real issue; the real issue is that the page reference is
taken down in the very bowels of __iov_iter_get_pages_alloc(), but needs
to be undone by the _caller_. Who might (or might not) have an idea
that he needs to drop the reference here.
That's why there is no straightforward conversion; you need to audit
each and every caller and try to find out where the page reference (if
any) is dropped.
Bah.
Can't we (at the very least) leave it to the caller of
__iov_iter_get_pages() to get a page reference (he has access to the
page array, after all ...)? That would make the interface slightly
better, and it'll be far more obvious to the caller what needs
to be done.
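
To make the suggestion concrete, here is a rough userspace model of
such a split. Everything in it is hypothetical: struct toy_page and
the toy_*() functions are invented for illustration and are not
existing kernel interfaces, and toy_page_ref_ok() is only loosely
modelled on the idea behind sendpage_ok(). The extraction step just
fills the page array; the caller then takes references where that is
allowed and remembers which ones it has to drop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page; not the kernel structure. */
struct toy_page {
	int refcount;	/* 0 models a frozen page, e.g. one backing a slab */
	bool is_slab;
};

/* Hypothetical predicate: may the caller take a reference on this page? */
static bool toy_page_ref_ok(const struct toy_page *p)
{
	return !p->is_slab && p->refcount >= 1;
}

/* Extraction only fills the array and never touches a refcount. */
static size_t toy_extract_pages(struct toy_page *src, size_t n,
				struct toy_page **pages, size_t max)
{
	size_t i, count = n < max ? n : max;

	for (i = 0; i < count; i++)
		pages[i] = &src[i];
	return count;
}

/* Caller-side pin: take references where allowed, record them in pinned[]. */
static size_t toy_pin_pages(struct toy_page **pages, size_t n, bool *pinned)
{
	size_t i, taken = 0;

	for (i = 0; i < n; i++) {
		pinned[i] = toy_page_ref_ok(pages[i]);
		if (pinned[i]) {
			pages[i]->refcount++;
			taken++;
		}
	}
	return taken;
}

/* Caller-side unpin: drop exactly the references recorded in pinned[]. */
static void toy_unpin_pages(struct toy_page **pages, size_t n,
			    const bool *pinned)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (pinned[i])
			pages[i]->refcount--;
}
```

With the get and the put sitting next to each other in the caller, it
is at least visible who owns which reference. The model says nothing
about how a slab-backed buffer is kept alive while it is queued; that
would still have to come from whoever owns the kmalloc() object.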
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@xxxxxxxx +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich