On Tue 05-06-12 15:51:50, Dave Chinner wrote:
> On Thu, May 24, 2012 at 02:35:38PM +0200, Jan Kara wrote:
> > > To me the issue at hand is that we have no method of serialising
> > > multi-page operations on the mapping tree between the filesystem and
> > > the VM, and that seems to be the fundamental problem we face in this
> > > whole area of mmap/buffered/direct IO/truncate/holepunch coherency.
> > > Hence it might be better to try to work out how to fix this entire
> > > class of problems rather than just adding a complex kludge that just
> > > papers over the current "hot" symptom....
> >   Yes, looking at the above table, the amount of different synchronization
> > mechanisms is really striking. So probably we should look at some
> > possibility of unifying at least some cases.
>
> It seems to me that we need something in between the fine grained
> page lock and the entire-file IO exclusion lock. We need to maintain
> fine grained locking for mmap scalability, but we also need to be
> able to atomically lock ranges of pages.
  Yes, we also need to keep things fine grained to keep scalability of
direct IO and buffered reads...

> I guess if we were to nest a fine grained multi-state lock
> inside both the IO exclusion lock and the mmap_sem, we might be able
> to kill all problems in one go.
>
> Exclusive access on a range needs to be granted to:
>
> - direct IO
> - truncate
> - hole punch
>
> so they can be serialised against mmap based page faults, writeback
> and concurrent buffered IO. Serialisation against themselves is an
> IO/fs exclusion problem.
>
> Shared access for traversal or modification needs to be granted to:
>
> - buffered IO
> - mmap page faults
> - writeback
>
> Each of these cases can rely on the existing page locks or IO
> exclusion locks to provide safety for concurrent access to the same
> ranges. This means that once we have access granted to a range we
> can check truncate races once and ignore the problem until we drop
> the access.
> And the case of taking a page fault within a buffered
> IO won't deadlock because both take a shared lock....
  You cannot just use a lock (not even a shared one) both above and under
mmap_sem. That can deadlock in the presence of other requests for exclusive
locking...

Luckily, with buffered writes the situation isn't that bad. You need
mmap_sem only before each page is processed (in
iov_iter_fault_in_readable()). Later on in the loop we use
iov_iter_copy_from_user_atomic() which doesn't need mmap_sem. So we can
just get our shared lock after iov_iter_fault_in_readable() (or simply
leave it for ->write_begin() if we want to give control over the locking to
filesystems).

> We'd need some kind of efficient shared/exclusive range lock for
> this sort of exclusion, and it's entirely possible that it would
> have too much overhead to be acceptable in the page fault path. It's
> the best I can think of right now.....
>
> As it is, a range lock of this kind would be very handy for other
> things, too (like the IO exclusion locks so we can do concurrent
> buffered writes in XFS ;).
  Yes, that's what I thought as well. In particular it should be pretty
efficient in locking a single page range because that's going to be the
majority of calls. I'll try to write something and see how fast it can be...

                                                                Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
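
To make the shared/exclusive semantics discussed above concrete, here is a
minimal userspace sketch of just the conflict bookkeeping. It is not the
implementation proposed in this thread. The names range_lock_tree,
range_trylock(), range_unlock() and MAX_HELD are hypothetical, the table is
a plain array rather than an interval tree, and there is no blocking,
fairness or internal serialisation (a real version would need a spinlock
around the table and a way to sleep on conflict). It only shows which
requests may coexist: shared holders (buffered IO, page faults, writeback)
can overlap each other, while an exclusive holder (direct IO, truncate,
hole punch) excludes everything that overlaps its range.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_HELD 16	/* arbitrary capacity for this sketch */

/* One granted range; start and end are inclusive page indices. */
struct held_range {
	unsigned long start;
	unsigned long end;
	bool exclusive;
	bool in_use;
};

/* Hypothetical per-mapping lock state: a flat table of granted ranges. */
struct range_lock_tree {
	struct held_range held[MAX_HELD];
};

static bool ranges_overlap(const struct held_range *r,
			   unsigned long start, unsigned long end)
{
	return r->start <= end && start <= r->end;
}

/*
 * Try to take [start, end] in shared or exclusive mode.
 * Returns a handle (>= 0) on success, or -1 if the request conflicts
 * with a granted range or the table is full. Never blocks.
 */
static int range_trylock(struct range_lock_tree *t, unsigned long start,
			 unsigned long end, bool exclusive)
{
	int free_slot = -1;

	assert(start <= end);
	for (int i = 0; i < MAX_HELD; i++) {
		const struct held_range *r = &t->held[i];

		if (!r->in_use) {
			if (free_slot < 0)
				free_slot = i;
			continue;
		}
		/* Only shared-vs-shared overlap is allowed. */
		if (ranges_overlap(r, start, end) &&
		    (exclusive || r->exclusive))
			return -1;
	}
	if (free_slot < 0)
		return -1;

	t->held[free_slot] = (struct held_range){
		.start = start, .end = end,
		.exclusive = exclusive, .in_use = true,
	};
	return free_slot;
}

/* Release a range previously granted by range_trylock(). */
static void range_unlock(struct range_lock_tree *t, int handle)
{
	assert(handle >= 0 && handle < MAX_HELD && t->held[handle].in_use);
	t->held[handle].in_use = false;
}
```

With a zero-initialised struct range_lock_tree, a page fault taking page 10
shared and writeback taking pages 8-12 shared both succeed, while a truncate
asking for pages 0-100 exclusively is refused until both are released. The
single-page case is the one that would have to be cheap in practice; the
linear scan here makes no attempt at that.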