On Tue, Mar 18, 2014 at 05:23:37PM -0700, Andy Lutomirski wrote:
> On Tue, Mar 18, 2014 at 5:18 PM, Minchan Kim <minchan@xxxxxxxxxx> wrote:
> > Hello,
> >
> > On Tue, Mar 18, 2014 at 10:55:24AM -0700, Andy Lutomirski wrote:
> >> On 03/13/2014 11:37 PM, Minchan Kim wrote:
> >> > This patch is an attempt to support MADV_FREE for Linux.
> >> >
> >> > Rationale is following as.
> >> >
> >> > Allocators call munmap(2) when user call free(3) if ptr is
> >> > in mmaped area. But munmap isn't cheap because it have to clean up
> >> > all pte entries, unlinking a vma and returns free pages to buddy
> >> > so overhead would be increased linearly by mmaped area's size.
> >> > So they like madvise_dontneed rather than munmap.
> >> >
> >> > "dontneed" holds read-side lock of mmap_sem so other threads
> >> > of the process could go with concurrent page faults so it is
> >> > better than munmap if it's not lack of address space.
> >> > But the problem is that most of allocator reuses that address
> >> > space soonish so applications see page fault, page allocation,
> >> > page zeroing if allocator already called madvise_dontneed
> >> > on the address space.
> >> >
> >> > For avoidng that overheads, other OS have supported MADV_FREE.
> >> > The idea is just mark pages as lazyfree when madvise called
> >> > and purge them if memory pressure happens. Otherwise, VM doesn't
> >> > detach pages on the address space so application could use
> >> > that memory space without above overheads.
> >>
> >> I must be missing something.
> >>
> >> If the application issues MADV_FREE and then writes to the MADV_FREEd
> >> range, the kernel needs to know that the pages are no longer safe to
> >> lazily free. This would presumably happen via a page fault on write.
> >> For that to happen reliably, the kernel has to write protect the pages
> >> when MADV_FREE is called, which in turn requires flushing the TLBs.
> >
> > It could be done by pte_dirty bit check.
> > Of course, if some architectures
> > don't support it by H/W, pte_mkdirty would make it CoW as you said.
>
> If the page already has dirty PTEs, then you need to clear the dirty
> bits and flush TLBs so that other CPUs notice that the PTEs are clean,
> I think.

True. I didn't mean that we don't need a TLB flush. Please look at the
code, although there are lots of bugs in RFC v1.

>
> Also, this has very odd semantics wrt reading the page after MADV_FREE
> -- is reading the page guaranteed to un-free it?

Yes, I thought about that oddness but didn't reach a conclusion, because
other OSes seem to work like that.
http://www.freebsd.org/cgi/man.cgi?query=madvise&sektion=2

But we could fix it easily by checking the accessed bit instead of the
dirty bit.

>
> >>
> >> How does this end up being faster than munmap?
> >
> > MADV_FREE doesn't need to return back the pages into page allocator
> > compared to MADV_DONTNEED and the overhead is not small when I measured
> > that on my machine.(Roughly, MADV_FREE's cost is half of DONTNEED through
> > avoiding involving page allocator.)
> >
> > But I'd like to clarify that it's not MADV_FREE's goal that syscall
> > itself should be faster than MADV_DONTNEED but major goal is to
> > avoid unnecessary page fault + page allocation + page zeroing +
> > garbage swapout.
>
> This sounds like it might be better solved by trying to make munmap or
> MADV_DONTNEED faster. Maybe those functions should lazily give pages
> back to the buddy allocator.

About munmap: it needs mmap_sem held for write, and that heavily hurts
allocator performance in multi-threaded workloads.

About MADV_DONTNEED: Rik van Riel tried to replace MADV_DONTNEED with
MADV_FREE in 2007 (http://lwn.net/Articles/230799/), but I don't know
why it was dropped. One thing I can imagine is that it could cause a
regression, because users of MADV_DONTNEED expect RSS to decrease when
the syscall is called.

>
> --Andy
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@xxxxxxxxx.
> For more info on Linux MM, see: http://www.linux-mm.org/ .

--
Kind regards,
Minchan Kim
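
For readers following the thread, here is a minimal userspace sketch of
the two release strategies being compared. It is not taken from the
patch. It assumes Linux with glibc, and every function name in it
(page_alloc, release_eager, release_lazy and the two demo_* helpers) is
hypothetical. MADV_FREE is still an RFC in this thread, so the lazy path
is guarded by #ifdef and falls back to MADV_DONTNEED when the flag is
missing or rejected.

```c
#define _DEFAULT_SOURCE /* glibc: expose madvise() and MAP_ANONYMOUS */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map one page of private anonymous memory. */
char *page_alloc(size_t *len_out)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return NULL;
    void *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    *len_out = (size_t)page;
    return p;
}

/* Eager release: pages are dropped now; the mapping stays valid. */
int release_eager(void *p, size_t len)
{
    return madvise(p, len, MADV_DONTNEED);
}

/* Lazy release: try MADV_FREE where the headers define it and the
 * kernel accepts it; otherwise fall back to MADV_DONTNEED. */
int release_lazy(void *p, size_t len)
{
#ifdef MADV_FREE
    if (madvise(p, len, MADV_FREE) == 0)
        return 0;
#endif
    return madvise(p, len, MADV_DONTNEED);
}

/* Returns 1 if the page reads back as zeros after MADV_DONTNEED,
 * i.e. reuse paid for a fault, an allocation and zeroing. */
int demo_eager_rezeroes(void)
{
    size_t len;
    char *p = page_alloc(&len);
    if (p == NULL)
        return -1;
    memset(p, 0xAB, len);
    int ok = (release_eager(p, len) == 0);
    for (size_t i = 0; ok && i < len; i++)
        ok = (p[i] == 0);
    munmap(p, len);
    return ok;
}

/* Returns 1 if data written after a lazy release is preserved.
 * The old contents are deliberately not inspected: after MADV_FREE
 * they may be either the stale bytes or zeros. */
int demo_lazy_reuse(void)
{
    size_t len;
    char *p = page_alloc(&len);
    if (p == NULL)
        return -1;
    memset(p, 0xAB, len);
    int ok = (release_lazy(p, len) == 0);
    if (ok) {
        memset(p, 0x5A, len);
        ok = (p[0] == 0x5A && p[len - 1] == 0x5A);
    }
    munmap(p, len);
    return ok;
}
```

An allocator would call something like release_lazy() from its free
path and simply write to the range again on reuse. The sketch only
checks behaviour that holds on either path; whether the lazy path
actually avoids the fault and zeroing depends on the kernel and on
memory pressure, which this code does not measure.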