Re: [PATCH v2 2/5] mm: add PTE_MARKER_GUARD PTE marker

Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> · Mon, 21 Oct 2024 16:33:19 +0100

On Mon, Oct 21, 2024 at 04:54:06PM +0200, Vlastimil Babka wrote:
> On 10/21/24 16:33, Lorenzo Stoakes wrote:
> > On Mon, Oct 21, 2024 at 04:13:34PM +0200, Vlastimil Babka wrote:
> >> On 10/20/24 18:20, Lorenzo Stoakes wrote:
> >> > Add a new PTE marker that results in any access causing the accessing
> >> > process to segfault.
> >> >
> >> > This is preferable to PTE_MARKER_POISONED, which results in the same
> >> > handling as hardware poisoned memory, and is thus undesirable for cases
> >> > where we simply wish to 'soft' poison a range.
> >> >
> >> > This is in preparation for implementing the ability to specify guard pages
> >> > at the page table level, i.e. ranges that, when accessed, should cause
> >> > process termination.
> >> >
> >> > Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the
> >> > function checks the ZAP_FLAG_DROP_MARKER flag so naming it for this single
> >> > purpose was simply incorrect.
> >> >
> >> > We then reuse the same logic to determine whether a zap should clear a
> >> > guard entry - this should only be performed on teardown and never on
> >> > MADV_DONTNEED or the like.
> >>
> >> Since I would have personally put MADV_FREE among "or the like" here, it's
> >> surprising to me that it in fact it's tearing down the guard entries now. Is
> >> that intentional? It should be at least mentioned very explicitly. But I'd
> >> really argue against it, as MADV_FREE is to me a weaker form of
> >> MADV_DONTNEED - the existing pages are not zapped immediately but
> >> prioritized for reclaim. If MADV_DONTNEED leaves guard PTEs in place, why
> >> shouldn't MADV_FREE too?
> >
> > That is not, as I understand it, what MADV_FREE is, semantically. From the
> > man pages:
> >
> >        MADV_FREE (since Linux 4.5)
> >
> >               The application no longer requires the pages in the range
> >               specified by addr and len.  The kernel can thus free these
> >               pages, but the freeing could be delayed until memory pressure
> >               occurs.
> >
> >        MADV_DONTNEED
> >
> >               Do not expect access in the near future.  (For the time
> >               being, the application is finished with the given range, so
> >               the kernel can free resources associated with it.)
> >
> > MADV_FREE is 'we are completely done with this range'. MADV_DONTNEED is 'we
> > don't expect to use it in the near future'.
>
> I think the description gives a wrong impression. What I think matters it
> what happens (limited to anon private case as MADV_FREE doesn't support any
> other)
>
> MADV_DONTNEED - pages discarded immediately, further access gives new
> zero-filled pages
>
> MADV_FREE - pages prioritized for discarding, if that happens before next
> write, it gets zero-filled page on next access, but a write done soon enough
>  can cancel the upcoming discard.
>
> In that sense, MADV_FREE is a weaker form of DONTNEED, no?
>
> >>
> >> Seems to me rather currently an artifact of MADV_FREE implementation - if it
> >> encounters hwpoison entries it will tear them down because why not, we have
> >> detected a hw memory error and are lucky the program wants to discard the
> >> pages and not access them, so best use the opportunity and get rid of the
> >> PTE entries immediately (if MADV_DONTNEED doesn't do that too, it certainly
> >> could).
> >
> > Right, but we explicitly do not tear them down in the case of MADV_DONTNEED
> > which matches the description in the manpages that the user _might_ come
> > back to the range, whereas MADV_FREE means they are truly done but just
> > don't want the overhead of actually unmapping at this point.
>
> But it's also defined what happens if user comes back to the range after a
> MADV_FREE. I think the overhead saved happens in the case of actually coming
> back soon enough to prevent the discard. With MADV_DONTNEED its immediate
> and unconditional.
>
> > Seems to be this is moreso that MADV_FREE is a not-really-as-efficient
> > version of what Rik wants to do with his MADV_LAZYFREE thing.
>
> I think that further optimizes MADV_FREE, which is already more optimized
> than MADV_DONTNEED.
>
> >>
> >> But to extend this to guard PTEs which are result of an explicit userspace
> >> action feels wrong, unless the semantics is the same for MADV_DONTEED. The
> >> semantics chosen for MADV_DONTNEED makes sense, so MADV_FREE should behave
> >> the same?
> >
> > My understanding from the above is that MADV_FREE is a softer version of
> > munmap(), i.e. 'totally done with this range', whereas MADV_DONTNEED is a
> > 'revert state to when I first mapped this stuff because I'm done with it
> > for now but might use it later'.
>
> From the implementation I get the opposite understanding. Neither tears down
> the vma like a proper unmap(). MADV_DONTNEED zaps page tables immediately,
> MADV_FREE effectively too but with a delay depending on memory pressure.
>

OK so based on IRC chat I think the conclusion here is TL;DR yes we have to
change this, you're right :)

To summarise for on-list:

* MADV_FREE, while ostensibly being a 'lazy free' mechanism, has the
  ability to be 'cancelled' if you write to the memory. Also, after the
  freeing is complete, you can write to the memory to reuse it, the mapping
  is still there.

* For hardware poison markers it makes sense to drop them as you're
  effectively saying 'I am done with this range that is now unbacked and
  expect to get an empty page should I use it now'. UFFD WP I am not sure
  about but presumably also fine.

* However, guard pages are different - if you 'cancel' and you are left
  with a block of memory allocated to you by a pthread or userland
  allocator implementation, you don't want to then no longer be protected
  from overrunning into other thread memory.

So, while I implemented this based on a. the semantics as apparently
expressed in the man page and b. existing PTE marker behaviour, it seems
that it would actually be problematic to do so.

So yeah, I'll change that in v3!