On Fri, Nov 18, 2011 at 09:42:05AM +0800, Nai Xia wrote: > First of all, I believe that at the POSIX level, it's ok for > truncate_inode_page() > not scanning COWed pages, since basically we does not provide any guarantee > for privately mapped file pages for this behavior. But missing a file > mapped pte after its > cache page is already removed from the the page cache is a I also exclude there is a case that would break, but it's safer to keep things as is, in case somebody depends on segfault trapping. > fundermental malfuntion for > a shared mapping when some threads see the file cache page is gone > while some thread > is still r/w from/to it! No matter how short the gap between > truncate_inode_page() and > the second loop, this is wrong. Truncate will destroy the info on disk too... so if somebody is writing to a mapping which points beyond the end of the i_size concurrently with truncate, the result is undefined. The write may well reach the page but then the page is discared. Or you may get SIGBUS before the write. > Second, even if the we don't care about this POSIX flaw that may > introduce, a pte can still > missed by the second loop. mremap can happen serveral times during > these non-atomic > firstpass-trunc-secondpass operations, a proper events can happily > make the wrong order > for every scan, and miss them all -- That's just what in Hugh's mind > in the post you just > replied. Without lock and proper ordering( which patial mremap cannot provide), > this *will* happen. There won't be more than one mremap running concurrently from the same process (we must enforce it by making sure anon_vma lock and i_mmap_lock are both taken at least once in copy_vma, they're already both taken in fork, they should already be taken in all common cases in copy_vma so for all cases it's going to be a L1 exclusive cacheline already). I don't exclude there may be some case that won't take the locks in vma_adjust though, we should check it, if we decide to relay on the double loop, but it'd be a simple addition if needed. I'm more concerned about the pte pointing to the orphaned pagecache that would materialize for a little while because of unmap+truncate+unmap instead of unmap+unmap+truncate (but the latter order is needed for the COWs). > You may disagree with me and have that locking removed, and I am > already have that > one line patch prepared waiting fora bug bumpping up again, what a > cheap patch submission! Well I'm not yet sure it's good idea to remove the i_mmap_mutex, or if we should just add the anon_vma lock in mremap and add the i_mmap_lock in fork (to avoid the orphaned pagecache left mapped in the child which already may happen unless there's some i_mmap_lock belonging to the same inode taken after copy_page_range returns until we return to userland and child can run, and I don't think we can relay on the order of the prio tree in fork. Fork is safe for anon pages because there we can relay on the order of the same_anon_vma list. I think clearing up if this orphaned pagecache is dangerous would be a good start. If too complex we just add the i_mmap_lock around copy_page_range in fork if vma->vm_file is set. If you instead think we can deal with the orphaned pagecache we can add a dummy lock/unlock of i_mmap_mutex in copy_vma vma_merge succeeding case (short critical section and not common common case) and remove the i_mmap_mutex around move_page_tables (common case) overall speeding up mremap and not degrading fork. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>