Re: [RFC v2] Support volatile range for anon vma

Paul Turner <pjt@xxxxxxxxxx> · Wed, 31 Oct 2012 14:59:07 -0700

On Wed, Oct 31, 2012 at 2:35 PM, Andrew Morton
<akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Tue, 30 Oct 2012 10:29:54 +0900
> Minchan Kim <minchan@xxxxxxxxxx> wrote:
>
> > This patch introudces new madvise behavior MADV_VOLATILE and
> > MADV_NOVOLATILE for anonymous pages. It's different with
> > John Stultz's version which considers only tmpfs while this patch
> > considers only anonymous pages so this cannot cover John's one.
> > If below idea is proved as reasonable, I hope we can unify both
> > concepts by madvise/fadvise.
> >
> > Rationale is following as.
> > Many allocators call munmap(2) when user call free(3) if ptr is
> > in mmaped area. But munmap isn't cheap because it have to clean up
> > all pte entries and unlinking a vma so overhead would be increased
> > linearly by mmaped area's size.
>
> Presumably the userspace allocator will internally manage memory in
> large chunks, so the munmap() call frequency will be much lower than
> the free() call frequency.  So the performance gains from this change
> might be very small.

I don't think I strictly understand the motivation from a
malloc-standpoint here.

These days we (tcmalloc) use madvise(..., MADV_DONTNEED) when we want
to perform discards on Linux.    For any reasonable allocator (short
of binding malloc --> mmap, free --> unmap) this seems a better
choice.

Note also from a performance stand-point I doubt any allocator (which
case about performance) is going to want to pay the cost of even a
null syscall about typical malloc/free usage (consider: a tcmalloc
malloc/free pairis currently <20ns).  Given then that this cost is
amortized once you start doing discards on larger blocks MADV_DONTNEED
seems a preferable interface:
- You don't need to reconstruct an arena when you do want to allocate
since there's no munmap/mmap for the region to change about
- There are no syscalls involved in later reallocating the block.

The only real additional cost is address-space.  Are you strongly
concerned about the 32-bit case?

>
> The whole point of the patch is to improve performance, but we have no
> evidence that it was successful in doing that!  I do think we'll need
> good quantitative testing results before proceeding with such a patch,
> please.
>
> Also, it is very desirable that we involve the relevant userspace
> (glibc, etc) developers in this.  And I understand that the google
> tcmalloc project will probably have interest in this - I've cc'ed
> various people@google in the hope that they can provide input (please).
>
> Also, it is a userspace API change.  Please cc mtk.manpages@xxxxxxxxx.
>
> Also, I assume that you have userspace test code.  At some stage,
> please consider adding a case to tools/testing/selftests.  Such a test
> would require to creation of memory pressure, which is rather contrary
> to the selftests' current philosopy of being a bunch of short-running
> little tests.  Perhaps you can come up with something.  But I suggest
> that such work be done later, once it becomes clearer that this code is
> actually headed into the kernel.
>
> > Allocator should call madvise(MADV_NOVOLATILE) before reusing for
> > allocating that area to user. Otherwise, accessing of volatile range
> > will meet SIGBUS error.
>
> Well, why?  It would be easy enough for the fault handler to give
> userspace a new, zeroed page at that address.

Note: MADV_DONTNEED already has this (nice) property.

>
> Or we could simply leave the old page in place at that address.  If the
> page gets touched, we clear MADV_NOVOLATILE on its VMA and give the
> page (or all the not-yet-reclaimed pages) back to userspace at their
> old addresses.
>
> Various options suggest themselves here.  You've chosen one of them but
> I would like to see a pretty exhaustive description of the reasoning
> behind that decision.
>
> Also, I wonder about the interaction with other vma manipulation
> operations.  For example, can a VMA get split when in the MADV_VOLATILE
> state?  If so, what happens?
>
> Also, I see no reason why the code shouldn't work OK with nonlinear VMAs,
> but I bet this wasn't tested ;)
>
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -86,6 +86,22 @@ static long madvise_behavior(struct vm_area_struct * vma,
> >               if (error)
> >                       goto out;
> >               break;
> > +     case MADV_VOLATILE:
> > +             if (vma->vm_flags & VM_LOCKED) {
> > +                     error = -EINVAL;
> > +                     goto out;
> > +             }
> > +             new_flags |= VM_VOLATILE;
> > +             vma->purged = false;
> > +             break;
> > +     case MADV_NOVOLATILE:
> > +             if (!(vma->vm_flags & VM_VOLATILE)) {
> > +                     error = -EINVAL;
> > +                     goto out;
>
> I wonder if this really should return an error.  Other madvise()
> options don't do this, and running MADV_NOVOLATILE against a
> not-volatile area seems pretty benign and has clearly defined before-
> and after- states.
>
> > +             }
> > +
> > +             new_flags &= ~VM_VOLATILE;
> > +             break;
> >       }
> >
> >       if (new_flags == vma->vm_flags) {
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>