Re: [PATCH v6 5/9] mm: thp: Extend THP to allocate anonymous large folios

John Hubbard <jhubbard@xxxxxxxxxx> · Mon, 30 Oct 2023 16:25:38 -0700

On 10/30/23 04:43, Ryan Roberts wrote:
On 28/10/2023 00:04, John Hubbard wrote:
On 9/29/23 04:44, Ryan Roberts wrote:
...
   +static bool vmf_pte_range_changed(struct vm_fault *vmf, int nr_pages)
+{
+    int i;
+
+    if (nr_pages == 1)
+        return vmf_pte_changed(vmf);
+
+    for (i = 0; i < nr_pages; i++) {
+        if (!pte_none(ptep_get_lockless(vmf->pte + i)))
+            return true;

This seems like something different than the function name implies.
It's really confusing: for a single page case, return true if the
pte in the page tables has changed, yes that is very clear.

But then for multiple page cases, which is really the main
focus here--for that, claim that the range has changed if any
pte is present (!pte_none). Can you please help me understand
what this means?

Yes I understand your confusion. Although I'm confident that the code is
correct, its a bad name - I'll make the excuse that this has evolved through
rebasing to cope with additions to UFFD. Perhaps something like
vmf_is_large_folio_suitable() is a better name.

It used to be that we would only take the do_anonymous_page() path if the pte
was none; i.e. this is the first time we are faulting on an address covered by
an anon VMA and we need to allocate some memory. But more recently we also end
up here if the pte is a uffd_wp marker. So for a single pte, instead of checking
none, we can check if the pte has changed from our original check (where we
determined it was a uffd_wp marker or none). But for multiple ptes, we don't
have storage to store all the original ptes from the first check.

Fortunately, if uffd is in use for a vma, then we don't want to use a large
folio anyway (this would break uffd semantics because we would no longer get a
fault for every page). So we only care about the "same but not none" case for
nr_pages=1.

Would changing the name to vmf_is_large_folio_suitable() help here?

Yes it would! And adding in a sentence or two from above about the uffd, as
a function-level comment might be just the right of demystification for
the code.

...
pte_offset_map() can only fail due to:

     a) Wrong pmd type. These include:
         pmd_none
         pmd_bad
         pmd migration entry
         pmd_trans_huge
         pmd_devmap

     b) __pte_map() failure

For (a), why is it that -EAGAIN is used here? I see that that
will lead to a re-fault, I got that far, but am missing something
still.

For (b), same question, actually. I'm not completely sure why
why a retry is going to fix a __pte_map() failure?

I'm not going to claim to understand all the details of this. But this is due to
a change that Hugh introduced and we concluded at [1] that its always correct to
return EAGAIN here to rerun the fault. In fact, with the current implementation
pte_offset_map() should never fail for anon IIUC, but the view was that EAGAIN
makes it safe for tomorrow, and because this would only fail due to a race,
retrying is correct.

[1] https://lore.kernel.org/linux-mm/8bdfd8d8-5662-4615-86dc-d60259bd16d@xxxxxxxxxx/

OK, got it.

...
And finally: is it accurate to say that there are *no* special
page flags being set, for PTE-mapped THPs? I don't see any here,
but want to confirm.

The page flags are coming from 'gfp = vma_thp_gfp_mask(vma)', which pulls in the
correct flags based on transparent_hugepage/defrag file.

OK that all is pretty clear now, thanks for the answers!

thanks,
--
John Hubbard
NVIDIA