On Tue, Oct 22, 2013 at 4:48 PM, <walken@xxxxxxxxxx> wrote:
>
> Generally the problems I see with mmap_sem are related to long latency
> operations. Specifically, the mmap_sem write side is currently held
> during the entire munmap operation, which iterates over user pages to
> free them, and can take hundreds of milliseconds for large VMAs.

So this would be the *perfect* place to just downgrade the semaphore
from a write to a read. Do the vma ops under the write semaphore, then
downgrade it to a read-sem, and do the page teardown with just
mmap_sem held for reading..

Comments? Anybody want to try that? It should be fairly
straightforward, and we had a somewhat similar issue when it came to
mmap() having to populate the mapping for mlock. For that case, it was
sufficient to just move the "populate" phase outside the lock entirely
(for that case, we actually drop the write lock and then take the
read-lock and re-lookup the vma; for unmap we'd have to do a proper
downgrade so that there is no window where the virtual address area
could be re-allocated).

The big issue is that we'd have to split up do_munmap() into those two
phases, since right now callers take the write semaphore before
calling it, and drop it afterwards. And some callers do it in a loop.
But we should be fairly easily able to make the *common* case (ie
normal "munmap()") do something like

    down_write(&mm->mmap_sem);
    phase1_munmap(..);
    downgrade_write(&mm->mmap_sem);
    phase2_munmap(..);
    up_read(&mm->mmap_sem);

instead of what it does now (which is to just do
down_write()/up_write() around do_munmap()).

I don't see any fundamental problems, but maybe there's some really
annoying detail that makes this nasty: right now we do
"remove_vma_list() -> remove_vma()" *after* tearing down the page
tables, and since that calls the ->close function, I think it has to
be done that way. I'm wondering if any of that code relies on
mmap_sem being held exclusively for writing.
I don't see why it possibly could, but..

So maybe I'm being overly optimistic and it's not as easy as just
splitting do_munmap() into two phases, but it really *looks* like it
might be just a ten-liner or so.. And if a real munmap() is the common
case (as opposed to a do_munmap() that gets triggered by somebody
doing a "mmap()" on top of an old mapping), then we'd at least allow
page faults from other threads to be done concurrently with tearing
down the page tables for the unmapped vma..

             Linus
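
For readers who want to poke at the locking pattern outside the kernel,
here is a minimal userspace sketch of the write -> downgrade -> read
sequence described in this message. It is not kernel code and not the
proposed patch: struct toy_rwsem, the toy_*() helpers, struct probe and
two_phase_demo() are all hypothetical names made up for illustration,
and the lock is a bare C11 atomic counter with try-lock operations only
(no sleeping, no fairness). The two "phase" comments stand in for
phase1_munmap() and phase2_munmap(). What it shows is that the
downgrade goes straight from "one writer" to "one reader", so readers
get in during phase 2 while a writer never sees the lock free in
between.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy lock word (assumption): -1 = held for write, 0 = free, n > 0 = n readers. */
struct toy_rwsem { atomic_int state; };

/* Take the write side only if the lock is completely free. */
static bool toy_write_trylock(struct toy_rwsem *s)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&s->state, &expected, -1);
}

/* Take a read reference unless a writer currently holds the lock. */
static bool toy_read_trylock(struct toy_rwsem *s)
{
    int cur = atomic_load(&s->state);
    while (cur >= 0) {
        /* On failure, cur is reloaded with the current value and we retry. */
        if (atomic_compare_exchange_weak(&s->state, &cur, cur + 1))
            return true;
    }
    return false;
}

/* Caller must hold the write side: become the sole reader in one step,
 * so the lock word never passes through "free" (0). */
static void toy_downgrade_write(struct toy_rwsem *s)
{
    atomic_store(&s->state, 1);
}

/* Drop one read reference. */
static void toy_read_unlock(struct toy_rwsem *s)
{
    atomic_fetch_sub(&s->state, 1);
}

struct probe {
    bool reader_in_phase1;  /* could a "page fault" get in during phase 1? */
    bool reader_in_phase2;  /* ... during phase 2? */
    bool writer_in_phase2;  /* could another "mmap" get in during phase 2? */
    bool free_at_end;       /* is the lock free once we are done? */
};

static struct probe two_phase_demo(void)
{
    struct toy_rwsem sem;
    struct probe p = { false, false, false, false };

    atomic_init(&sem.state, 0);
    if (!toy_write_trylock(&sem))
        return p;

    /* Phase 1 (placeholder): detach the vmas, exclusive access. */
    p.reader_in_phase1 = toy_read_trylock(&sem);

    toy_downgrade_write(&sem);

    /* Phase 2 (placeholder): tear down pages, shared access. */
    p.reader_in_phase2 = toy_read_trylock(&sem);
    if (p.reader_in_phase2)
        toy_read_unlock(&sem);
    p.writer_in_phase2 = toy_write_trylock(&sem);

    toy_read_unlock(&sem);  /* our own read reference from the downgrade */
    p.free_at_end = toy_write_trylock(&sem);
    return p;
}
```

Calling two_phase_demo() from a test program and inspecting the
returned struct probe is enough to see the behaviour. Dropping the
write lock and re-taking it for read would instead leave a moment
where the lock word is 0, which is exactly the window the message says
unmap cannot afford.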