Re: [PATCH v2 2/5] mm: add PTE_MARKER_GUARD PTE marker

David Hildenbrand <david@xxxxxxxxxx> · Mon, 21 Oct 2024 18:00:20 +0200

On 21.10.24 17:33, Lorenzo Stoakes wrote:
On Mon, Oct 21, 2024 at 04:54:06PM +0200, Vlastimil Babka wrote:
On 10/21/24 16:33, Lorenzo Stoakes wrote:
On Mon, Oct 21, 2024 at 04:13:34PM +0200, Vlastimil Babka wrote:
On 10/20/24 18:20, Lorenzo Stoakes wrote:
Add a new PTE marker that results in any access causing the accessing
process to segfault.

This is preferable to PTE_MARKER_POISONED, which results in the same
handling as hardware poisoned memory, and is thus undesirable for cases
where we simply wish to 'soft' poison a range.

This is in preparation for implementing the ability to specify guard pages
at the page table level, i.e. ranges that, when accessed, should cause
process termination.

Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the
function checks the ZAP_FLAG_DROP_MARKER flag so naming it for this single
purpose was simply incorrect.

We then reuse the same logic to determine whether a zap should clear a
guard entry - this should only be performed on teardown and never on
MADV_DONTNEED or the like.

Since I would have personally put MADV_FREE among "or the like" here, it's
surprising to me that it in fact it's tearing down the guard entries now. Is
that intentional? It should be at least mentioned very explicitly. But I'd
really argue against it, as MADV_FREE is to me a weaker form of
MADV_DONTNEED - the existing pages are not zapped immediately but
prioritized for reclaim. If MADV_DONTNEED leaves guard PTEs in place, why
shouldn't MADV_FREE too?

That is not, as I understand it, what MADV_FREE is, semantically. From the
man pages:

        MADV_FREE (since Linux 4.5)

               The application no longer requires the pages in the range
               specified by addr and len.  The kernel can thus free these
               pages, but the freeing could be delayed until memory pressure
               occurs.

        MADV_DONTNEED

               Do not expect access in the near future.  (For the time
               being, the application is finished with the given range, so
               the kernel can free resources associated with it.)

MADV_FREE is 'we are completely done with this range'. MADV_DONTNEED is 'we
don't expect to use it in the near future'.

I think the description gives a wrong impression. What I think matters it
what happens (limited to anon private case as MADV_FREE doesn't support any
other)

MADV_DONTNEED - pages discarded immediately, further access gives new
zero-filled pages

MADV_FREE - pages prioritized for discarding, if that happens before next
write, it gets zero-filled page on next access, but a write done soon enough
  can cancel the upcoming discard.

In that sense, MADV_FREE is a weaker form of DONTNEED, no?

Seems to me rather currently an artifact of MADV_FREE implementation - if it
encounters hwpoison entries it will tear them down because why not, we have
detected a hw memory error and are lucky the program wants to discard the
pages and not access them, so best use the opportunity and get rid of the
PTE entries immediately (if MADV_DONTNEED doesn't do that too, it certainly
could).

Right, but we explicitly do not tear them down in the case of MADV_DONTNEED
which matches the description in the manpages that the user _might_ come
back to the range, whereas MADV_FREE means they are truly done but just
don't want the overhead of actually unmapping at this point.

But it's also defined what happens if user comes back to the range after a
MADV_FREE. I think the overhead saved happens in the case of actually coming
back soon enough to prevent the discard. With MADV_DONTNEED its immediate
and unconditional.

Seems to be this is moreso that MADV_FREE is a not-really-as-efficient
version of what Rik wants to do with his MADV_LAZYFREE thing.

I think that further optimizes MADV_FREE, which is already more optimized
than MADV_DONTNEED.

But to extend this to guard PTEs which are result of an explicit userspace
action feels wrong, unless the semantics is the same for MADV_DONTEED. The
semantics chosen for MADV_DONTNEED makes sense, so MADV_FREE should behave
the same?

My understanding from the above is that MADV_FREE is a softer version of
munmap(), i.e. 'totally done with this range', whereas MADV_DONTNEED is a
'revert state to when I first mapped this stuff because I'm done with it
for now but might use it later'.

 From the implementation I get the opposite understanding. Neither tears down
the vma like a proper unmap(). MADV_DONTNEED zaps page tables immediately,
MADV_FREE effectively too but with a delay depending on memory pressure.

OK so based on IRC chat I think the conclusion here is TL;DR yes we have to
change this, you're right :)

To summarise for on-list:

* MADV_FREE, while ostensibly being a 'lazy free' mechanism, has the
   ability to be 'cancelled' if you write to the memory. Also, after the
   freeing is complete, you can write to the memory to reuse it, the mapping
   is still there.

* For hardware poison markers it makes sense to drop them as you're
   effectively saying 'I am done with this range that is now unbacked and
   expect to get an empty page should I use it now'. UFFD WP I am not sure
   about but presumably also fine.

* However, guard pages are different - if you 'cancel' and you are left
   with a block of memory allocated to you by a pthread or userland
   allocator implementation, you don't want to then no longer be protected
   from overrunning into other thread memory.

Agreed. What happens on MADV_DONTNEED/MADV_FREE on guard pages? Ignored 
or error? It sounds like a usage "error" to me (in contrast to munmap()).

--
Cheers,

David / dhildenb