On 10/21/24 16:33, Lorenzo Stoakes wrote: > On Mon, Oct 21, 2024 at 04:13:34PM +0200, Vlastimil Babka wrote: >> On 10/20/24 18:20, Lorenzo Stoakes wrote: >> > Add a new PTE marker that results in any access causing the accessing >> > process to segfault. >> > >> > This is preferable to PTE_MARKER_POISONED, which results in the same >> > handling as hardware poisoned memory, and is thus undesirable for cases >> > where we simply wish to 'soft' poison a range. >> > >> > This is in preparation for implementing the ability to specify guard pages >> > at the page table level, i.e. ranges that, when accessed, should cause >> > process termination. >> > >> > Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the >> > function checks the ZAP_FLAG_DROP_MARKER flag so naming it for this single >> > purpose was simply incorrect. >> > >> > We then reuse the same logic to determine whether a zap should clear a >> > guard entry - this should only be performed on teardown and never on >> > MADV_DONTNEED or the like. >> >> Since I would have personally put MADV_FREE among "or the like" here, it's >> surprising to me that it in fact it's tearing down the guard entries now. Is >> that intentional? It should be at least mentioned very explicitly. But I'd >> really argue against it, as MADV_FREE is to me a weaker form of >> MADV_DONTNEED - the existing pages are not zapped immediately but >> prioritized for reclaim. If MADV_DONTNEED leaves guard PTEs in place, why >> shouldn't MADV_FREE too? > > That is not, as I understand it, what MADV_FREE is, semantically. From the > man pages: > > MADV_FREE (since Linux 4.5) > > The application no longer requires the pages in the range > specified by addr and len. The kernel can thus free these > pages, but the freeing could be delayed until memory pressure > occurs. > > MADV_DONTNEED > > Do not expect access in the near future. (For the time > being, the application is finished with the given range, so > the kernel can free resources associated with it.) > > MADV_FREE is 'we are completely done with this range'. MADV_DONTNEED is 'we > don't expect to use it in the near future'. I think the description gives a wrong impression. What I think matters it what happens (limited to anon private case as MADV_FREE doesn't support any other) MADV_DONTNEED - pages discarded immediately, further access gives new zero-filled pages MADV_FREE - pages prioritized for discarding, if that happens before next write, it gets zero-filled page on next access, but a write done soon enough can cancel the upcoming discard. In that sense, MADV_FREE is a weaker form of DONTNEED, no? >> >> Seems to me rather currently an artifact of MADV_FREE implementation - if it >> encounters hwpoison entries it will tear them down because why not, we have >> detected a hw memory error and are lucky the program wants to discard the >> pages and not access them, so best use the opportunity and get rid of the >> PTE entries immediately (if MADV_DONTNEED doesn't do that too, it certainly >> could). > > Right, but we explicitly do not tear them down in the case of MADV_DONTNEED > which matches the description in the manpages that the user _might_ come > back to the range, whereas MADV_FREE means they are truly done but just > don't want the overhead of actually unmapping at this point. But it's also defined what happens if user comes back to the range after a MADV_FREE. I think the overhead saved happens in the case of actually coming back soon enough to prevent the discard. With MADV_DONTNEED its immediate and unconditional. > Seems to be this is moreso that MADV_FREE is a not-really-as-efficient > version of what Rik wants to do with his MADV_LAZYFREE thing. I think that further optimizes MADV_FREE, which is already more optimized than MADV_DONTNEED. >> >> But to extend this to guard PTEs which are result of an explicit userspace >> action feels wrong, unless the semantics is the same for MADV_DONTEED. The >> semantics chosen for MADV_DONTNEED makes sense, so MADV_FREE should behave >> the same? > > My understanding from the above is that MADV_FREE is a softer version of > munmap(), i.e. 'totally done with this range', whereas MADV_DONTNEED is a > 'revert state to when I first mapped this stuff because I'm done with it > for now but might use it later'.