Re: [PATCH 2/8] mm: memory: extend finish_fault() to support large folio

On 2024/5/8 18:47, Ryan Roberts wrote:
On 08/05/2024 10:31, Baolin Wang wrote:


On 2024/5/8 16:53, Ryan Roberts wrote:
On 08/05/2024 04:44, Baolin Wang wrote:


On 2024/5/7 18:37, Ryan Roberts wrote:
On 06/05/2024 09:46, Baolin Wang wrote:
Add large folio mapping establishment support for finish_fault() as
preparation for supporting multi-size THP allocation of anonymous shmem
pages in the following patches.

Signed-off-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
---
 mm/memory.c | 43 +++++++++++++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 10 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index eea6e4984eae..936377220b77 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4747,9 +4747,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct page *page;
+	struct folio *folio;
 	vm_fault_t ret;
 	bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
 		      !(vma->vm_flags & VM_SHARED);
+	int type, nr_pages, i;
+	unsigned long addr = vmf->address;
 
 	/* Did we COW the page? */
 	if (is_cow)
@@ -4780,24 +4783,44 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 			return VM_FAULT_OOM;
 	}
 
+	folio = page_folio(page);
+	nr_pages = folio_nr_pages(folio);
+
+	if (unlikely(userfaultfd_armed(vma))) {
+		nr_pages = 1;
+	} else if (nr_pages > 1) {
+		unsigned long start = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
+		unsigned long end = start + nr_pages * PAGE_SIZE;
+
+		/* In case the folio size in the page cache exceeds the VMA limits. */
+		addr = max(start, vma->vm_start);
+		nr_pages = (min(end, vma->vm_end) - addr) >> PAGE_SHIFT;
+
+		page = folio_page(folio, (addr - start) >> PAGE_SHIFT);

I still don't really follow the logic in this else if block. Isn't it possible
that finish_fault() gets called with a page from a folio that isn't aligned
with
vmf->address?

For example, let's say we have a file whose size is 64K and which is cached in a
single large folio in the page cache. But the file is mapped into a process at
VA 16K to 80K. Let's say we fault on the first page (VA=16K). You will
calculate

For shmem, this doesn't happen because the VA is aligned with the hugepage size
in the shmem_get_unmapped_area() function. See patch 7.

Certainly agree that shmem can always make sure that it packs a vma in a way
such that its folios are naturally aligned in VA when faulting in memory. If you
mremap it, that alignment will be lost; I don't think that would be a problem

When mremap() is called, it will also go through shmem_get_unmapped_area() to
align the VA, but for mremap() with the MREMAP_FIXED flag, as David pointed
out, yes, this patch may not work perfectly.

Assuming this works similarly to anon mTHP, remapping to an arbitrary address
shouldn't be a problem within a single process; the previously allocated folios
will now be unaligned, but they will be correctly mapped so it doesn't break
anything. And new faults will allocate folios so that they are as large as
allowed by the sysfs interface AND which do not overlap with any non-none pte
AND which are naturally aligned. It's when you start sharing with other
processes that the fun and games start...


for a single process; mremap will take care of moving the ptes correctly and
this path is not involved.

But what about the case when a process mmaps a shmem region, then forks, then
the child mremaps the shmem region. Then the parent faults in a THP into the
region (nicely aligned). Then the child faults in the same offset in the region
and gets the THP that the parent allocated; that THP will be aligned in the
parent's VM space but not in the child's.

Sorry, I did not get your point here. IIUC, the child's VA will also be aligned
if the child's mremap does not set MREMAP_FIXED, since the child's mremap will
still call shmem_get_unmapped_area() to find an aligned new VA.

In general, you shouldn't be relying on the vma bounds being aligned to a THP
boundary.

Please correct me if I missed your point.

(I'm not 100% sure this is definitely how it works, but seems the only sane way
to me):

Let's imagine we have a process that maps 4 pages of shared anon memory at VA=64K:

   mmap(64K, 16K, PROT_X, MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, ...)

Then it forks a child, and the child moves the mapping to VA=68K:

   mremap(64K, 16K, 16K, MREMAP_FIXED | MREMAP_MAYMOVE, 68K)

Then the parent writes to address 64K (offset 0 in the shared region); this will
fault and cause a 16K mTHP to be allocated and mapped, covering the whole region
at 64K-80K in the parent.

Then the child reads address 68K (offset 0 in the shared region); this will
fault and cause the previously allocated 16K folio to be looked up and it must
be mapped in the child between 68K-84K. This is not naturally aligned in the child.

For the child, your code will incorrectly calculate start/end as 64K-80K.

OK, so you set the MREMAP_FIXED flag, just as David pointed out. Yes, it will
not be aligned in the child in this case. Thanks for the explanation.



