On Wed, Oct 31, 2012 at 2:35 PM, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote: > > On Tue, 30 Oct 2012 10:29:54 +0900 > Minchan Kim <minchan@xxxxxxxxxx> wrote: > > > This patch introudces new madvise behavior MADV_VOLATILE and > > MADV_NOVOLATILE for anonymous pages. It's different with > > John Stultz's version which considers only tmpfs while this patch > > considers only anonymous pages so this cannot cover John's one. > > If below idea is proved as reasonable, I hope we can unify both > > concepts by madvise/fadvise. > > > > Rationale is following as. > > Many allocators call munmap(2) when user call free(3) if ptr is > > in mmaped area. But munmap isn't cheap because it have to clean up > > all pte entries and unlinking a vma so overhead would be increased > > linearly by mmaped area's size. > > Presumably the userspace allocator will internally manage memory in > large chunks, so the munmap() call frequency will be much lower than > the free() call frequency. So the performance gains from this change > might be very small. I don't think I strictly understand the motivation from a malloc-standpoint here. These days we (tcmalloc) use madvise(..., MADV_DONTNEED) when we want to perform discards on Linux. For any reasonable allocator (short of binding malloc --> mmap, free --> unmap) this seems a better choice. Note also from a performance stand-point I doubt any allocator (which case about performance) is going to want to pay the cost of even a null syscall about typical malloc/free usage (consider: a tcmalloc malloc/free pairis currently <20ns). Given then that this cost is amortized once you start doing discards on larger blocks MADV_DONTNEED seems a preferable interface: - You don't need to reconstruct an arena when you do want to allocate since there's no munmap/mmap for the region to change about - There are no syscalls involved in later reallocating the block. The only real additional cost is address-space. Are you strongly concerned about the 32-bit case? > > The whole point of the patch is to improve performance, but we have no > evidence that it was successful in doing that! I do think we'll need > good quantitative testing results before proceeding with such a patch, > please. > > Also, it is very desirable that we involve the relevant userspace > (glibc, etc) developers in this. And I understand that the google > tcmalloc project will probably have interest in this - I've cc'ed > various people@google in the hope that they can provide input (please). > > Also, it is a userspace API change. Please cc mtk.manpages@xxxxxxxxx. > > Also, I assume that you have userspace test code. At some stage, > please consider adding a case to tools/testing/selftests. Such a test > would require to creation of memory pressure, which is rather contrary > to the selftests' current philosopy of being a bunch of short-running > little tests. Perhaps you can come up with something. But I suggest > that such work be done later, once it becomes clearer that this code is > actually headed into the kernel. > > > Allocator should call madvise(MADV_NOVOLATILE) before reusing for > > allocating that area to user. Otherwise, accessing of volatile range > > will meet SIGBUS error. > > Well, why? It would be easy enough for the fault handler to give > userspace a new, zeroed page at that address. Note: MADV_DONTNEED already has this (nice) property. > > Or we could simply leave the old page in place at that address. If the > page gets touched, we clear MADV_NOVOLATILE on its VMA and give the > page (or all the not-yet-reclaimed pages) back to userspace at their > old addresses. > > Various options suggest themselves here. You've chosen one of them but > I would like to see a pretty exhaustive description of the reasoning > behind that decision. > > Also, I wonder about the interaction with other vma manipulation > operations. For example, can a VMA get split when in the MADV_VOLATILE > state? If so, what happens? > > Also, I see no reason why the code shouldn't work OK with nonlinear VMAs, > but I bet this wasn't tested ;) > > > --- a/mm/madvise.c > > +++ b/mm/madvise.c > > @@ -86,6 +86,22 @@ static long madvise_behavior(struct vm_area_struct * vma, > > if (error) > > goto out; > > break; > > + case MADV_VOLATILE: > > + if (vma->vm_flags & VM_LOCKED) { > > + error = -EINVAL; > > + goto out; > > + } > > + new_flags |= VM_VOLATILE; > > + vma->purged = false; > > + break; > > + case MADV_NOVOLATILE: > > + if (!(vma->vm_flags & VM_VOLATILE)) { > > + error = -EINVAL; > > + goto out; > > I wonder if this really should return an error. Other madvise() > options don't do this, and running MADV_NOVOLATILE against a > not-volatile area seems pretty benign and has clearly defined before- > and after- states. > > > + } > > + > > + new_flags &= ~VM_VOLATILE; > > + break; > > } > > > > if (new_flags == vma->vm_flags) { > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>