Re: [PATCH v2 2/5] mm: add PTE_MARKER_GUARD PTE marker

Vlastimil Babka <vbabka@xxxxxxx> · Mon, 21 Oct 2024 16:54:06 +0200

On 10/21/24 16:33, Lorenzo Stoakes wrote:
> On Mon, Oct 21, 2024 at 04:13:34PM +0200, Vlastimil Babka wrote:
>> On 10/20/24 18:20, Lorenzo Stoakes wrote:
>> > Add a new PTE marker that results in any access causing the accessing
>> > process to segfault.
>> >
>> > This is preferable to PTE_MARKER_POISONED, which results in the same
>> > handling as hardware poisoned memory, and is thus undesirable for cases
>> > where we simply wish to 'soft' poison a range.
>> >
>> > This is in preparation for implementing the ability to specify guard pages
>> > at the page table level, i.e. ranges that, when accessed, should cause
>> > process termination.
>> >
>> > Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the
>> > function checks the ZAP_FLAG_DROP_MARKER flag so naming it for this single
>> > purpose was simply incorrect.
>> >
>> > We then reuse the same logic to determine whether a zap should clear a
>> > guard entry - this should only be performed on teardown and never on
>> > MADV_DONTNEED or the like.
>>
>> Since I would have personally put MADV_FREE among "or the like" here, it's
>> surprising to me that it in fact it's tearing down the guard entries now. Is
>> that intentional? It should be at least mentioned very explicitly. But I'd
>> really argue against it, as MADV_FREE is to me a weaker form of
>> MADV_DONTNEED - the existing pages are not zapped immediately but
>> prioritized for reclaim. If MADV_DONTNEED leaves guard PTEs in place, why
>> shouldn't MADV_FREE too?
> 
> That is not, as I understand it, what MADV_FREE is, semantically. From the
> man pages:
> 
>        MADV_FREE (since Linux 4.5)
> 
>               The application no longer requires the pages in the range
>               specified by addr and len.  The kernel can thus free these
>               pages, but the freeing could be delayed until memory pressure
>               occurs.
> 
>        MADV_DONTNEED
> 
>               Do not expect access in the near future.  (For the time
>               being, the application is finished with the given range, so
>               the kernel can free resources associated with it.)
> 
> MADV_FREE is 'we are completely done with this range'. MADV_DONTNEED is 'we
> don't expect to use it in the near future'.

I think the description gives a wrong impression. What I think matters it
what happens (limited to anon private case as MADV_FREE doesn't support any
other)

MADV_DONTNEED - pages discarded immediately, further access gives new
zero-filled pages

MADV_FREE - pages prioritized for discarding, if that happens before next
write, it gets zero-filled page on next access, but a write done soon enough
 can cancel the upcoming discard.

In that sense, MADV_FREE is a weaker form of DONTNEED, no?

>>
>> Seems to me rather currently an artifact of MADV_FREE implementation - if it
>> encounters hwpoison entries it will tear them down because why not, we have
>> detected a hw memory error and are lucky the program wants to discard the
>> pages and not access them, so best use the opportunity and get rid of the
>> PTE entries immediately (if MADV_DONTNEED doesn't do that too, it certainly
>> could).
> 
> Right, but we explicitly do not tear them down in the case of MADV_DONTNEED
> which matches the description in the manpages that the user _might_ come
> back to the range, whereas MADV_FREE means they are truly done but just
> don't want the overhead of actually unmapping at this point.

But it's also defined what happens if user comes back to the range after a
MADV_FREE. I think the overhead saved happens in the case of actually coming
back soon enough to prevent the discard. With MADV_DONTNEED its immediate
and unconditional.

> Seems to be this is moreso that MADV_FREE is a not-really-as-efficient
> version of what Rik wants to do with his MADV_LAZYFREE thing.

I think that further optimizes MADV_FREE, which is already more optimized
than MADV_DONTNEED.

>>
>> But to extend this to guard PTEs which are result of an explicit userspace
>> action feels wrong, unless the semantics is the same for MADV_DONTEED. The
>> semantics chosen for MADV_DONTNEED makes sense, so MADV_FREE should behave
>> the same?
> 
> My understanding from the above is that MADV_FREE is a softer version of
> munmap(), i.e. 'totally done with this range', whereas MADV_DONTNEED is a
> 'revert state to when I first mapped this stuff because I'm done with it
> for now but might use it later'.