On Wed, Mar 21, 2018 at 09:31:22AM -0700, Yang Shi wrote:
> On 3/21/18 6:08 AM, Michal Hocko wrote:
> > Yes, this definitely sucks. One way to work that around is to split the
> > unmap to two phases. One to drop all the pages. That would only need
> > mmap_sem for read and then tear down the mapping with the mmap_sem for
> > write. This wouldn't help for parallel mmap_sem writers but those really
> > need a different approach (e.g. the range locking).
> 
> page fault might sneak in to map a page which has been unmapped before?
> 
> range locking should help a lot on manipulating small sections of a large
> mapping in parallel or multiple small mappings. It may not achieve too much
> for single large mapping.

I don't think we need range locking.  What if we do munmap this way:

	Take the mmap_sem for write
	Find the VMA
	If the VMA is large(*)
		Mark the VMA as deleted
		Drop the mmap_sem
		zap all of the entries
		Take the mmap_sem
	Else
		zap all of the entries
	Continue finding VMAs
	Drop the mmap_sem

Now we need to change everywhere which looks up a VMA to see if it needs
to care that the VMA is deleted (page faults, e.g., will need to SIGBUS;
mmap does not care; munmap will need to wait for the existing munmap
operation to complete), but it gives us the atomicity, at least on a
per-VMA basis.

We could also do:

	Take the mmap_sem for write
	Mark all VMAs in the range as deleted & modify any partial VMAs
	Drop mmap_sem
	zap pages from deleted VMAs

That would give us the same atomicity that we have today.

Deleted VMAs would need a pointer to a completion, so operations that
need to wait can queue themselves up.  I'd recommend we use the low bit
of vm_file and treat it as a pointer to a struct completion if set.