On 28.11.24 14:20, Lorenzo Stoakes wrote:
On Thu, Nov 28, 2024 at 02:08:27PM +0100, David Hildenbrand wrote:
On 28.11.24 12:37, Lorenzo Stoakes wrote:
We are current refactoring struct page to make it smaller, removing
unneeded fields that correctly belong to struct folio.
Two of those fields are page->index and page->mapping. Perf is currently
making use of both of these, so this patch removes this usage as it turns
out it is unnecessary.
Perf establishes its own internally controlled memory-mapped pages using
vm_ops hooks. The first page in the mapping is the read/write user control
page, and the rest of the mapping consists of read-only pages.
The VMA is backed by kernel memory either from the buddy allocator or
vmalloc depending on configuration. It is intended to be mapped read/write,
but because it has a page_mkwrite() hook, vma_wants_writenotify() indicaets
that it should be mapped read-only.
When a write fault occurs, the provided page_mkwrite() hook,
perf_mmap_fault() (doing double duty handing faults as well) uses the
vmf->pgoff field to determine if this is the first page, allowing for the
desired read/write first page, read-only rest mapping.
For this to work the implementation has to carefully work around faulting
logic. When a page is write-faulted, the fault() hook is called first, then
its page_mkwrite() hook is called (to allow for dirty tracking in file
systems).
On fault we set the folio's mapping in perf_mmap_fault(), this is because
when do_page_mkwrite() is subsequently invoked, it treats a missing mapping
as an indicator that the fault should be retried.
We also set the folio's index so, given the folio is being treated as faux
user memory, it correctly references its offset within the VMA.
This explains why the mapping and index fields are used - but it's not
necessary.
We preallocate pages when perf_mmap() is called for the first time via
rb_alloc(), and further allocate auxiliary pages via rb_aux_alloc() as
needed if the mapping requires it.
This allocation is done in the f_ops->mmap() hook provided in perf_mmap(),
and so we can instead simply map all the memory right away here - there's
no point in handling (read) page faults when we don't demand page nor need
to be notified about them (perf does not).
This patch therefore changes this logic to map everything when the mmap()
hook is called, establishing a PFN map. It implements vm_ops->pfn_mkwrite()
to provide the required read/write vs. read-only behaviour, which does not
require the previously implemented workarounds.
It makes sense semantically to establish a PFN map too - we are managing
the pages internally and so it is appropriate to mark this as a special
mapping.
It's rather sad seeing more PFNMAP users where PFNMAP is not really required
(-> this is struct page backed).
Especially having to perform several independent remap_pfn_range() calls
rather looks like yet another workaround ...
Would we be able to achieve something comparable with vm_insert_pages(), to
just map them in advance?
Well, that's the thing, we can't use VM_MIXEDMAP as vm_insert_pages() and
friends all refer vma->vm_page_prot which is not yet _correctly_ established at
the point of the f_op->mmap() hook being invoked :)
So all you want is a vm_insert_pages() variant where we can pass in the
vm_page_prot?
Or a way detect internally that it is not setup yet and fallback to
vm_get_page_prot(vma->vm_flags & ~VM_SHARED)?
Or a way to just remove write permissions?
We set the field in __mmap_new_vma(), _but_ importantly, we defer the
writenotify check to __mmap_complete() (set in vma_set_page_prot()) - so if we
were to try to map using VM_MIXEDMAP in the f_op->mmap() hook, we'd get
read/write mappings, which is emphatically not what we want - we want them
read-only mapped, and for vm_ops->pfn_mkwrite() to be called so we can make the
first page read/write and the rest read-only.
It's this requirement that means this is really the only way to do this as far
as I can tell.
It is appropriate and correct that this is either a VM_PFNMAP or VM_MIXEDMAP
mapping, as the pages reference kernel-allocated memory and are managed by perf,
not put on any LRU, etc.
It sucks to have to loop like this and it feels like a workaround, which makes
me wonder if we need a new interface to better allow this stuff on mmap...
In any case I think this is the most sensible solution currently available that
avoids the pre-existing situation of pretending the pages are folios but
somewhat abusing the interface to allow page_mkwrite() to work correctly by
setting page->index, mapping.
Yes, that page->index stuff is nasty.
The alternative to this would be to folio-fy, but these are emphatically _not_
folios, that is userland memory managed as userland memory, it's a mapping onto
kernel memory exposed to userspace.
Yes, we should even move away from folios completely in the future for
vm_insert_page().
It feels like probably VM_MIXEDMAP is a better way of doing it, but you'd need
to expose an interface that doesn't assume the VMA is already fully set
up... but I think one for a future series perhaps.
If the solution to your problem is as easy as making vm_insert_pages()
pass something else than vma->vm_page_prot to insert_pages(), then I
think we should go for that. Like ... vm_insert_pages_prot().
Observe how we already have vmf_insert_pfn() vs. vmf_insert_pfn_prot().
But yes, in an ideal world we'd avoid having temporarily messed up
vma->vm_page_prot. So we'd then document clearly how
vm_insert_pages_prot() may be used.
--
Cheers,
David / dhildenb