On 10/23/24 18:24, Lorenzo Stoakes wrote:
> Implement a new lightweight guard page feature, that is, regions of userland
> virtual memory that, when accessed, cause a fatal signal to arise.
>
> Currently users must establish PROT_NONE ranges to achieve this.
>
> However this is very costly memory-wise - we need a VMA for each and every
> one of these regions AND they become unmergeable with surrounding VMAs.
>
> In addition repeated mmap() calls require repeated kernel context switches
> and contention of the mmap lock to install these ranges, potentially also
> having to unmap memory if installed over existing ranges.
>
> The lightweight guard approach eliminates the VMA cost altogether - rather
> than establishing a PROT_NONE VMA, it operates at the level of page table
> entries - establishing PTE markers such that accesses to them cause a fault
> followed by a SIGSEGV signal being raised.
>
> This is achieved through the PTE marker mechanism, which we have already
> extended to provide PTE_MARKER_GUARD, and which we install via the generic
> page walking logic, which we have extended for this purpose.
>
> These guard ranges are established with MADV_GUARD_INSTALL. If the range in
> which they are installed contains any existing mappings, they will be
> zapped, i.e. the range is freed and the memory unmapped (thus mimicking the
> behaviour of MADV_DONTNEED in this respect).
>
> Any existing guard entries will be left untouched. There is therefore no
> nesting of guarded pages.
>
> Guarded ranges are NOT cleared by MADV_DONTNEED nor MADV_FREE (in both
> instances the memory range may be reused, at which point a user would expect
> guards to still be in place), but they are cleared via MADV_GUARD_REMOVE,
> process teardown or unmapping of memory ranges.
>
> The guard property can be removed from ranges via MADV_GUARD_REMOVE. The
> ranges over which this is applied, should they contain non-guard entries,
> will be untouched, with only guard entries being cleared.
>
> We permit this operation on anonymous memory only, and only on VMAs which
> are non-special, non-huge and not mlock()'d (if we permitted this we'd have
> to drop locked pages, which would be rather counterintuitive).
>
> Racing page faults can cause repeated attempts to install guard pages that
> are interrupted, resulting in a zap, and this process can end up being
> repeated. If this happens more often than would be expected in normal
> operation, we rescind locks and retry the whole thing, which avoids lock
> contention in this scenario.
>
> Suggested-by: Vlastimil Babka <vbabka@xxxxxxx>
> Suggested-by: Jann Horn <jannh@xxxxxxxxxx>
> Suggested-by: David Hildenbrand <david@xxxxxxxxxx>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>

Reviewed-by: Vlastimil Babka <vbabka@xxxxxxx>

> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -423,6 +423,12 @@ extern unsigned long highest_memmap_pfn;
>   */
>  #define MAX_RECLAIM_RETRIES 16
>
> +/*
> + * Maximum number of attempts we make to install guard pages before we give up
> + * and return -ERESTARTNOINTR to have userspace try again.
> + */
> +#define MAX_MADVISE_GUARD_RETRIES 3

Can't we simply put this in mm/madvise.c? Didn't find usage elsewhere.
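
For anyone following along, here is a minimal userspace sketch of the
intended usage as I read the series. The MADV_GUARD_INSTALL/MADV_GUARD_REMOVE
constants are the values proposed in this series and may not be in libc
headers yet, hence the fallback defines; this is illustrative only, not
part of the patch:

#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

/* Values as proposed in this series; remove once uapi headers land. */
#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102
#endif
#ifndef MADV_GUARD_REMOVE
#define MADV_GUARD_REMOVE 103
#endif

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	/* A single anonymous VMA; the guard costs no additional VMA. */
	char *buf = mmap(NULL, 4 * page, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* Install a guard page covering the last page of the buffer. */
	if (madvise(buf + 3 * page, page, MADV_GUARD_INSTALL))
		return 1;

	memset(buf, 0, 3 * page);	/* the first three pages work as usual */
	/* buf[3 * page] = 0;		   would now raise SIGSEGV */

	/* Clear the guard; the page becomes usable (zero-filled) again. */
	if (madvise(buf + 3 * page, page, MADV_GUARD_REMOVE))
		return 1;
	buf[3 * page] = 0;		/* no longer faults */

	munmap(buf, 4 * page);
	return 0;
}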