On 06.12.24 17:28, Matthew Wilcox wrote:
Sorry for the late reply, interesting topic.
Today we have a very useful helper, remap_vmalloc_range() (and _partial())
which lets drivers call vmalloc(), then map that memory to userspace.
It does so using vm_insert_page() which ends up calling folio_get() and
folio_add_file_rmap_pte(), so jiggling both the refcount and the mapcount.
As you all know by now, we're looking to eliminate both mapcount and
refcount from struct page. I have four options for consideration, some
of which I like more than others.
1. We could introduce a vmalloc memdesc that has a per-page mapcount and
refcount. This seems like unnecessarily high overhead for a precision
of tracking that is, perhaps, not warranted.
Especially the mapcount is probably of no use at all here. As discussed
with Lorenzo recently, I assume we only perform this in vm_insert_page()
because there is (was) no easy way to distinguish these pages on the zap
path to *not* decrement the refcounts.
With memdescs that would be easy (later: no folio -> no mapcount changes).
2. We could do no tracking at all of vmalloc pages. Insert the PFNs
of the allocated pages and rely on the driver to track everything
correctly, not freeing the vmalloc allocation until the mmap has been
torn down. This implies not supporting GUP. This option feels risky to
me; we're depending on device driver writers to get this right, and if
they get it wrong, it's quite the UAF hole; letting an attacker get
access to pages which could be allocated to any purpose.
Fully agreed.
3. Embed a refcount into struct vm_struct. We can support GUP if we want.
Calling GUP bumps the refcount on the entire struct. When the refcount
hits zero, we free the entire allocation. There's no need for a mapcount
or pincount because we don't need to distinguish between temporary and
longterm gups.
The pincount+mapcount should be specific to folios, agreed.
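To make #3 concrete, here is a rough userspace model of a single refcount covering the whole allocation. This is only a sketch: vm_area_model stands in for struct vm_struct, and area_alloc(), area_get() and area_put() are made-up names, not existing kernel helpers. The assumption is that mmap and GUP each take one reference on the whole area, and that the driver's free only drops the driver's own reference.

```c
/* Userspace model only. "vm_area_model" stands in for struct vm_struct;
 * none of these helpers exist in the kernel. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096

struct vm_area_model {
	atomic_int refcount;	/* one count for the entire allocation */
	size_t nr_pages;
	void *mem;
};

static int areas_freed;	/* lets a caller observe when the free happens */

/* Allocate an area; the caller (the "driver") holds the initial reference. */
static struct vm_area_model *area_alloc(size_t nr_pages)
{
	struct vm_area_model *area = malloc(sizeof(*area));

	if (!area)
		return NULL;
	area->mem = calloc(nr_pages, MODEL_PAGE_SIZE);
	if (!area->mem) {
		free(area);
		return NULL;
	}
	area->nr_pages = nr_pages;
	atomic_init(&area->refcount, 1);
	return area;
}

/* Taken by anything that keeps the memory reachable (a mapping, a GUP). */
static void area_get(struct vm_area_model *area)
{
	int old = atomic_fetch_add(&area->refcount, 1);

	assert(old > 0);	/* must not revive an already freed area */
	(void)old;
}

/* Drop one reference; returns true if this call freed the allocation. */
static bool area_put(struct vm_area_model *area)
{
	if (atomic_fetch_sub(&area->refcount, 1) != 1)
		return false;
	free(area->mem);
	free(area);
	areas_freed++;
	return true;
}
```

In this model a driver that frees too early just drops its own reference, and the memory stays around until the last mapping or GUP user calls area_put().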
4. Introduce an indirection structure between the page and vm_struct which
contains the refcount.
I'm most in favour of #3, but there's probably ramifications I haven't
considered.
I wonder if #1, with only the refcount, would be doable. Maybe not a
vmalloc memdesc, but a more generic kmem memdesc.
Because for "ordinary" pages that a driver allocated I suspect we might
want to do the same.
But #3 sounds interesting as well. In any case, we'll have to teach
vm_normal_page() users that blindly assume that they get a folio, that
they could get something different instead. Using memdescs for that
sounds reasonable.
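A minimal sketch of what I mean, as a userspace model: callers check the descriptor type before treating something as a folio, and only folios get mapcount updates. All names here (memdesc_model, MEMDESC_KMEM, desc_as_folio(), model_map(), model_zap()) are hypothetical and only illustrate the idea; they are not the actual memdesc design.

```c
/* Userspace model only; every type and helper here is hypothetical. */
#include <assert.h>
#include <stddef.h>

enum memdesc_kind {
	MEMDESC_FOLIO,	/* regular folio: refcount and mapcount */
	MEMDESC_KMEM,	/* driver/vmalloc memory: refcount only */
};

struct memdesc_model {
	enum memdesc_kind kind;
	int refcount;
	int mapcount;	/* only meaningful for MEMDESC_FOLIO */
};

/* Return the descriptor if it is a folio, NULL otherwise, so callers
 * cannot blindly assume they got a folio. */
static struct memdesc_model *desc_as_folio(struct memdesc_model *desc)
{
	return desc->kind == MEMDESC_FOLIO ? desc : NULL;
}

/* Insert path: always take a reference, touch mapcount only for folios. */
static void model_map(struct memdesc_model *desc)
{
	desc->refcount++;
	if (desc_as_folio(desc))
		desc->mapcount++;
}

/* Zap path: mirror of model_map(); no folio means no mapcount change. */
static void model_zap(struct memdesc_model *desc)
{
	if (desc_as_folio(desc))
		desc->mapcount--;
	desc->refcount--;
}
```

With something like this the zap path can tell the two cases apart from the descriptor alone, so the insert path no longer needs to bump a mapcount just to keep the zap path balanced.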
--
Cheers,
David / dhildenb