On Thu, 24 Apr 2014, Linus Torvalds wrote: > On Thu, Apr 24, 2014 at 7:50 PM, H. Peter Anvin <hpa@xxxxxxxxx> wrote: > > > > The cases where they occur the mappings tend to be highly stable, i.e. > > map once *specifically* to be able to do a whole bunch of things without > > system calls, and then unmap when done. > > Yes. But even that tends to be unusual. mmap() really is bad at > writing, since you inevitably get read-modify-write patterns etc. So > it's only useful for fixing up things after-the-fact, which in itself > is a horrible pattern. > > Don't get me wrong - it exists, but it's really quite rare because it > has so many problems. Even people who do "fixup" kind of stuff tend to > map things privately, change things, and then write out the end > result. That way you can get atomicity by then doing a single > "rename()" at the end, for example. > > The traditional case for it used to be the nntp index, and these days > I know some imap indexer (dovecot?) uses it. Every other example of it > I have ever seen has been a VM stress tester.. Your patch looks good to me (nice use of force_flush), and runs fine here in normal usage; but I've not actually tried Dave's racewrite.c. However, I have had a couple of contrarian half-thoughts, that ordinarily I'd prefer to mull over more before blurting out, but in the circumstances better say sooner than later. One, regarding dirty shared mappings: you're thinking above of mmap()'ing proper filesystem files, but this case also includes shared memory - I expect there are uses of giant amounts of shared memory, for which we really would prefer not to slow the teardown. And confusingly, those are not subject to the special page_mkclean() constraints, but still need to be handled in a correct manner: your patch is fine, but might be overkill for them - I'm not yet sure. Two, Ben said earlier that he's more worried about users of unmap_mapping_range() than concurrent munmap(); and you said earlier that you would almost prefer to have some special lock to serialize with page_mkclean(). Er, i_mmap_mutex. That's what unmap_mapping_range(), and page_mkclean()'s rmap_walk, take to iterate over the file vmas. So perhaps there's no race at all in the unmap_mapping_range() case. And easy (I imagine) to fix the race in Dave's racewrite.c use of MADV_DONTNEED: untested patch below. But exit and munmap() don't take i_mmap_mutex: perhaps they should when encountering a VM_SHARED vma (I believe VM_SHARED should be peculiar to having vm_file set, but test both below because I don't want to oops in some odd corner where a special vma is set up). Hugh --- 3.15-rc2/mm/madvise.c 2013-11-03 15:41:51.000000000 -0800 +++ linux/mm/madvise.c 2014-04-25 04:10:40.124514427 -0700 @@ -274,10 +274,16 @@ static long madvise_dontneed(struct vm_a struct vm_area_struct **prev, unsigned long start, unsigned long end) { + struct address_space *mapping = NULL; + *prev = vma; if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP)) return -EINVAL; + if (vma->vm_file && (vma->vm_flags & VM_SHARED)) { + mapping = vma->vm_file->f_mapping; + mutex_lock(&mapping->i_mmap_mutex); + } if (unlikely(vma->vm_flags & VM_NONLINEAR)) { struct zap_details details = { .nonlinear_vma = vma, @@ -286,6 +292,8 @@ static long madvise_dontneed(struct vm_a zap_page_range(vma, start, end - start, &details); } else zap_page_range(vma, start, end - start, NULL); + if (mapping) + mutex_unlock(&mapping->i_mmap_mutex); return 0; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>