On Fri, Oct 25, 2024 at 11:44:56PM +0200, Vlastimil Babka wrote:
> On 10/23/24 18:24, Lorenzo Stoakes wrote:
> > Implement a new lightweight guard page feature, that is regions of userland
> > virtual memory that, when accessed, cause a fatal signal to arise.
> >
> > Currently users must establish PROT_NONE ranges to achieve this.
> >
> > However this is very costly memory-wise - we need a VMA for each and every
> > one of these regions AND they become unmergeable with surrounding VMAs.
> >
> > In addition repeated mmap() calls require repeated kernel context switches
> > and contention of the mmap lock to install these ranges, potentially also
> > having to unmap memory if installed over existing ranges.
> >
> > The lightweight guard approach eliminates the VMA cost altogether - rather
> > than establishing a PROT_NONE VMA, it operates at the level of page table
> > entries - establishing PTE markers such that accesses to them cause a fault
> > followed by a SIGSEGV signal being raised.
> >
> > This is achieved through the PTE marker mechanism, which we have already
> > extended to provide PTE_MARKER_GUARD, which we install via the generic
> > page walking logic which we have extended for this purpose.
> >
> > These guard ranges are established with MADV_GUARD_INSTALL. If the range in
> > which they are installed contains any existing mappings, they will be
> > zapped, i.e. the range freed and the memory unmapped (thus mimicking the
> > behaviour of MADV_DONTNEED in this respect).
> >
> > Any existing guard entries will be left untouched. There is therefore no
> > nesting of guarded pages.
> >
> > Guarded ranges are NOT cleared by MADV_DONTNEED nor MADV_FREE (in both
> > instances the memory range may be reused at which point a user would expect
> > guards to still be in place), but they are cleared via MADV_GUARD_REMOVE,
> > process teardown or unmapping of memory ranges.
> >
> > The guard property can be removed from ranges via MADV_GUARD_REMOVE. The
> > ranges over which this is applied, should they contain non-guard entries,
> > will be untouched, with only guard entries being cleared.
> >
> > We permit this operation on anonymous memory only, and only VMAs which are
> > non-special, non-huge and not mlock()'d (if we permitted this we'd have to
> > drop locked pages which would be rather counterintuitive).
> >
> > Racing page faults can cause repeated attempts to install guard pages that
> > are interrupted, resulting in a zap, and this process can end up being
> > repeated. If this happens more often than would be expected in normal
> > operation, we rescind locks and retry the whole thing, which avoids lock
> > contention in this scenario.
> >
> > Suggested-by: Vlastimil Babka <vbabka@xxxxxxx>
> > Suggested-by: Jann Horn <jannh@xxxxxxxxxx>
> > Suggested-by: David Hildenbrand <david@xxxxxxxxxx>
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
>
> Reviewed-by: Vlastimil Babka <vbabka@xxxxxxx>

Thanks!

> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -423,6 +423,12 @@ extern unsigned long highest_memmap_pfn;
> >   */
> >  #define MAX_RECLAIM_RETRIES 16
> >
> > +/*
> > + * Maximum number of attempts we make to install guard pages before we give up
> > + * and return -ERESTARTNOINTR to have userspace try again.
> > + */
> > +#define MAX_MADVISE_GUARD_RETRIES 3
>
> Can't we simply put this in mm/madvise.c ? Didn't find usage elsewhere.

Sure, will move if respin/can send a quick fixpatch next week if otherwise settled. Just felt vaguely 'neater' here for... spurious subjective squishy brained reasons :)