Re: [RFC PATCH 3/4] mm: madvise: implement lightweight guard page mechanism

Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> · Mon, 14 Oct 2024 12:08:56 +0100

On Fri, Oct 11, 2024 at 08:11:36PM +0200, Jann Horn wrote:
> On Fri, Sep 27, 2024 at 2:51 PM Lorenzo Stoakes
> <lorenzo.stoakes@xxxxxxxxxx> wrote:
> > Implement a new lightweight guard page feature, that is regions of userland
> > virtual memory that, when accessed, cause a fatal signal to arise.
> [...]
> > ---
> >  arch/alpha/include/uapi/asm/mman.h     |   3 +
> >  arch/mips/include/uapi/asm/mman.h      |   3 +
> >  arch/parisc/include/uapi/asm/mman.h    |   3 +
> >  arch/xtensa/include/uapi/asm/mman.h    |   3 +
> >  include/uapi/asm-generic/mman-common.h |   3 +
>
> I kinda wonder if we could start moving the parts of those headers
> that are the same for all architectures to include/uapi/linux/mman.h
> instead... but that's maybe out of scope for this series.

Arnd already had a look at this in a recent series. I had the same feeling doing
this...

>
> [...]
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index e871a72a6c32..7216e10723ae 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior)
> >         case MADV_POPULATE_READ:
> >         case MADV_POPULATE_WRITE:
> >         case MADV_COLLAPSE:
> > +       case MADV_GUARD_UNPOISON: /* Only poisoning needs a write lock. */
>
> What does poisoning need a write lock for? anon_vma_prepare() doesn't
> need it (it only needs mmap_lock held for reading),
> zap_page_range_single() doesn't need it, and pagewalk also doesn't
> need it as long as the range being walked is covered by a VMA, which
> it is...
>
> I see you set PGWALK_WRLOCK in guard_poison_walk_ops with a comment
> saying "We might need to install an anon_vma" - is that referring to
> an older version of the patch where the anon_vma_prepare() call was
> inside the pagewalk callback or something like that? Either way,
> anon_vma_prepare() doesn't need write locks (it can't, it has to work
> from the page fault handling path).

OK this was a misunderstanding. Actually there have been more than one, at first
I thought a write lock would protect us against racing faults (nope, due to RCU
vma locking now :) and then I had assumed literally changing a vma field
_surely_ must require a write lock, also it appears no as __anon_vma_prepare(),
amusingly, uses the mm->page_table_lock to protect against accesses to
vma->anon_vma.

And yes you're right it is triggered on the fault path so has to work that way.

TL;DR will change to read lock.

>
> >                 return 0;
> >         default:
> >                 /* be safe, default to 1. list exceptions explicitly */
> [...]
> > +static long madvise_guard_poison(struct vm_area_struct *vma,
> > +                                struct vm_area_struct **prev,
> > +                                unsigned long start, unsigned long end)
> > +{
> > +       long err;
> > +       bool retried = false;
> > +
> > +       *prev = vma;
> > +       if (!is_valid_guard_vma(vma, /* allow_locked = */false))
> > +               return -EINVAL;
> > +
> > +       /*
> > +        * Optimistically try to install the guard poison pages first. If any
> > +        * non-guard pages are encountered, give up and zap the range before
> > +        * trying again.
> > +        */
> > +       while (true) {
> > +               unsigned long num_installed = 0;
> > +
> > +               /* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
> > +               err = walk_page_range_mm(vma->vm_mm, start, end,
> > +                                        &guard_poison_walk_ops,
> > +                                        &num_installed);
> > +               /*
> > +                * If we install poison markers, then the range is no longer
> > +                * empty from a page table perspective and therefore it's
> > +                * appropriate to have an anon_vma.
> > +                *
> > +                * This ensures that on fork, we copy page tables correctly.
> > +                */
> > +               if (err >= 0 && num_installed > 0) {
> > +                       int err_anon = anon_vma_prepare(vma);
>
> I'd move this up, to before we create poison PTEs. There's no harm in
> attaching an anon_vma to the VMA even if the rest of the operation
> fails; and I think it would be weird to have error paths that don't
> attach an anon_vma even though they .

I think you didn't finish this sentence :)

I disagree, we might have absolutely no need to do it, and I'd rather only
do so _if_ we have to.

It feels like the logical spot to do it and, while the cases where it
wouldn't happen are ones where pages are already poisoned (the
vma->anon_vma == NULL test will fail so basically a no-op) or error on page
walk.

>
> > +                       if (err_anon)
> > +                               err = err_anon;
> > +               }
> > +
> > +               if (err <= 0)
> > +                       return err;
> > +
> > +               if (!retried)
> > +                       /*
> > +                        * OK some of the range have non-guard pages mapped, zap
> > +                        * them. This leaves existing guard pages in place.
> > +                        */
> > +                       zap_page_range_single(vma, start, end - start, NULL);
> > +               else
> > +                       /*
> > +                        * If we reach here, then there is a racing fault that
> > +                        * has populated the PTE after we zapped. Give up and
> > +                        * let the user know to try again.
> > +                        */
> > +                       return -EAGAIN;
>
> Hmm, yeah, it would be nice if we could avoid telling userspace to
> loop on -EAGAIN but I guess we don't have any particularly good
> options here? Well, we could bail out with -EINTR if a (fatal?) signal
> is pending and otherwise keep looping... if we'd tell userspace "try
> again on -EAGAIN", we might as well do that in the kernel...

The problem is you could conceivably go on for quite some time, while
holding and contending a HIGHLY contended lock (mm->mmap_lock) so I'd
really rather let userspace take care of it.

You could avoid this by having the walker be a _replace_ operation, that is
- if we encounter an existing mapping, replace in-place with a poison
marker rather than install marker/zap.

However doing that would involve either completely abstracting such logic
from scratch (a significant task in itself) to avoid duplication which be
hugely off-topic for the patch set or worse, duplicating a whole bunch of
page walking logic once again.

By being optimistic and simply having the user having to handle looping
which seems reasonable (again, it's weird if you're installing poison
markers and another thread could be racing you) we avoid all of that.

>
> (Personally I would put curly braces around these branches because
> they occupy multiple lines, though the coding style doesn't explicitly
> say that, so I guess maybe it's a matter of personal preference...
> adding curly braces here would match what is done, for example, in
> relocate_vma_down().)

Hey I wrote that too! ;) Sure I can change that.

>
> > +               retried = true;
> > +       }
> > +}
> > +
> > +static int guard_unpoison_pte_entry(pte_t *pte, unsigned long addr,
> > +                                   unsigned long next, struct mm_walk *walk)
> > +{
> > +       pte_t ptent = ptep_get(pte);
> > +
> > +       if (is_guard_pte_marker(ptent)) {
> > +               /* Simply clear the PTE marker. */
> > +               pte_clear_not_present_full(walk->mm, addr, pte, true);
>
> I think that last parameter probably should be "false"? The sparc code
> calls it "fullmm", which is a term the MM code uses when talking about
> operations that remove all mappings in the entire mm_struct because
> the process has died, which allows using some faster special-case
> version of TLB shootdown or something along those lines.

Yeah I think you're right. Will change.

>
> > +               update_mmu_cache(walk->vma, addr, pte);
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +static const struct mm_walk_ops guard_unpoison_walk_ops = {
> > +       .pte_entry              = guard_unpoison_pte_entry,
> > +       .walk_lock              = PGWALK_RDLOCK,
> > +};
>
> It is a _little_ weird that unpoisoning creates page tables when they
> don't already exist, which will also prevent creating THP entries on
> fault in such areas afterwards... but I guess it doesn't really matter
> given that poisoning has that effect, too, and you probably usually
> won't call MADV_GUARD_UNPOISON on an area that hasn't been poisoned
> before... so I guess this is not an actionable comment.

It doesn't? There's no .install_pte so if an entries are non-present we
ignore.

HOWEVER, we do split THP. I don't think there's any way around it unless we
extended the page walker to handle this more gracefully (pmd level being
able to hint that we shouldn't do that or something), but that's really out
of scope here.

The idea is that a caller can lazily call MADV_GUARD_UNPOISON on a range
knowing things stay as they were, I guess we can add to the manpage a note
that this will split THP?