Re: Hole punching and mmap races

Jan Kara <jack@xxxxxxx> · Fri, 8 Jun 2012 23:36:29 +0200

On Fri 08-06-12 10:57:00, Dave Chinner wrote:
> On Thu, Jun 07, 2012 at 11:58:35PM +0200, Jan Kara wrote:
> > On Wed 06-06-12 23:36:16, Dave Chinner wrote:
> > > On Wed, Jun 06, 2012 at 11:58:27AM +0200, Jan Kara wrote:
> > > > On Wed 06-06-12 10:06:36, Dave Chinner wrote:
> > > > > On Wed, Jun 06, 2012 at 01:15:30AM +0200, Jan Kara wrote:
> > > > > > On Tue 05-06-12 15:51:50, Dave Chinner wrote:
> > > > > > > On Thu, May 24, 2012 at 02:35:38PM +0200, Jan Kara wrote:
> > > > > > > > > To me the issue at hand is that we have no method of serialising
> > > > > > > > > multi-page operations on the mapping tree between the filesystem and
> > > > > > > > > the VM, and that seems to be the fundamental problem we face in this
> > > > > > > > > whole area of mmap/buffered/direct IO/truncate/holepunch coherency.
> > > > > > > > > Hence it might be better to try to work out how to fix this entire
> > > > > > > > > class of problems rather than just adding a complex kuldge that just
> > > > > > > > > papers over the current "hot" symptom....
> > > > > > > >   Yes, looking at the above table, the amount of different synchronization
> > > > > > > > mechanisms is really striking. So probably we should look at some
> > > > > > > > possibility of unifying at least some cases.
> > > > > > > 
> > > > > > > It seems to me that we need some thing in between the fine grained
> > > > > > > page lock and the entire-file IO exclusion lock. We need to maintain
> > > > > > > fine grained locking for mmap scalability, but we also need to be
> > > > > > > able to atomically lock ranges of pages.
> > > > > >   Yes, we also need to keep things fine grained to keep scalability of
> > > > > > direct IO and buffered reads...
> > > > > > 
> > > > > > > I guess if we were to nest a fine grained multi-state lock
> > > > > > > inside both the IO exclusion lock and the mmap_sem, we might be able
> > > > > > > to kill all problems in one go.
> > > > > > > 
> > > > > > > Exclusive access on a range needs to be granted to:
> > > > > > > 
> > > > > > > 	- direct IO
> > > > > > > 	- truncate
> > > > > > > 	- hole punch
> > > > > > > 
> > > > > > > so they can be serialised against mmap based page faults, writeback
> > > > > > > and concurrent buffered IO. Serialisation against themselves is an
> > > > > > > IO/fs exclusion problem.
> > > > > > > 
> > > > > > > Shared access for traversal or modification needs to be granted to:
> > > > > > > 
> > > > > > > 	- buffered IO
> > > > > > > 	- mmap page faults
> > > > > > > 	- writeback
> > > > > > > 
> > > > > > > Each of these cases can rely on the existing page locks or IO
> > > > > > > exclusion locks to provide safety for concurrent access to the same
> > > > > > > ranges. This means that once we have access granted to a range we
> > > > > > > can check truncate races once and ignore the problem until we drop
> > > > > > > the access.  And the case of taking a page fault within a buffered
> > > > > > > IO won't deadlock because both take a shared lock....
> > > > > >   You cannot just use a lock (not even a shared one) both above and under
> > > > > > mmap_sem. That is deadlockable in presence of other requests for exclusive
> > > > > > locking...
> > > > > 
> > > > > Well, that's assuming that exclusive lock requests form a barrier to
> > > > > new shared requests. Remember that I'm talking about a range lock
> > > > > here, which we can make up whatever semantics we'd need, including
> > > > > having "shared lock if already locked shared" nested locking
> > > > > semantics which avoids this page-fault-in-buffered-IO-copy-in/out
> > > > > problem....
> > > >   That's true. But if you have semantics like this, constant writing to
> > > > or reading from a file could starve e.g. truncate. So I'd prefer not to
> > > > open this can of worms and keep semantics of rw semaphores if possible.
> > > 
> > > Except truncate uses the i_mutex/i_iolock for exclusion, so it would
> > > never get held off any more than it already does by buffered IO in
> > > this case. i.e. the mapping tree range lock is inside the locks used
> > > for truncate serialisation, so we don't have a situation where other
> > > operations woul dbe held off by such an IO pattern...
> >   True. I was just hoping that i_mutex won't be needed if we get our new
> > lock right.
> > 
> > > > Furthermore, with direct IO you have to set in stone the ordering of
> > > > mmap_sem and range lock anyway because there we need an exclusive lock.
> > > 
> > > Yes, mmap basically requires exclusive mmap_sem->shared range lock ordering. For
> > > direct IO, we only need the mmap_sem for the get_user_pages() call
> > > IIRC, so that requires exclusive range lock-> shared mmap_sem
> > > ordering. Unless we can lift the range lock in the mmap path outside
> > > the mmap_sem, we're still in the same boat....
> >   Yes. Just as I said before it is not an unsolvable situation since for
> > direct IO we can grab our lock after get_user_pages() call. I was thinking
> > about it for a while and I realized, that if we have range lock that we
> > take during page fault, buffered IO, direct IO, truncate, punch hole, we
> > actually don't need a shared version of it. Just the fact it would be range
> > lock is enough to avoid serialization.
> 
> Well, we currently allow overlapping direct IO to the same range in
> XFS, The order of completion is underfined, just like concurrent IOs
> to the same sector of a disk, but it basically results in direct IO
> on an XFS file to behave exactly the same way as concurrent IO to a raw
> device would....
> 
> Now, that does cause some problems for naive users of direct IO
> (just like those same naive users mix mmap, buffered and direct IO to
> the same file), but if we are going to make everything else coherent
> then I have no problems with dropping this functionality and ony
> allowing concurrency for non-overlapping requests.
  Yeah, overlapping direct IO is asking for trouble (except possibly two
direct IO reads).

> The other thing is that concurrent overlapping buffered reads to the
> same range will only serialise on page locks if the range can be
> lock shared, so there would be no change in behaviour. Making the
> range lock exclusive would serialise the overlapping reads at a much
> higher level and will result in a change of cached read behaviour
> and potentially a significant reduction in performance. This may be
> a corner case no-one cares about, but exclusive range locking will
> definitely impact such a workload....
  For buffered reads we would lock page-by-page anyway (again due to
locking constraints with mmap_sem when copying what we've read) so there
shouldn't be any difference to current level of concurrency.

> > Also I was thinking that since lock ordering forces our new lock to be
> > relatively close to page lock and page lock is serializing quite some of
> > IO operations anyway,
> 
> That's the observation that has led me to call it a "mapping
> tree" lock.
  I've internally called the locking function lock_mapping_range() ;).

> > it might be workable (and actually reasonably easy
> > to grasp for developers) if the range lock is coupled with page lock - i.e.
> > locking a range will be equivalent to locking each page in a range, except
> > that this way you can "lock" pages which are not present and thus forbid
> > their creation.
> 
> Nice idea, but I think that is introducing requirements and
> potential complexity way beyond what we need right now. It's already
> a complex problem, so lets not make it any harder to validate than
> it already will be... :/
> 
> As it is, we can't "forbid" the creation of pages - we control
> access to the mapping tree, which allows us to prevent insertion or
> removal of pages from a given range.  That's exactly how it will
> serialise DIO against buffered/mmap IO, hole punch/truncate against
> mmap, and so on. It's a mapping tree access control mechanism, not a
> page serialisation mechanism...
  I agree, I was imprecise here and it's good to realize exactly what we
guard.

> > Also we could implement the common case of locking a range
> > containing single page by just taking page lock so we save modification of
> > interval tree in the common case and generally make the tree smaller (of
> > course, at the cost of somewhat slowing down cases where we want to lock
> > larger ranges).
> 
> That seems like premature optimistion to me, and all the cases I
> think we need to care about are locking large ranges of the tree.
> Let's measure what the overhead of tracking everything in a single
> tree is first so we can then see what needs optimising...
  Umm, I agree that initially we probably want just to have the mapping
range lock ability, stick it somewhere to IO path and make things work.
Then we can look into making it faster / merging with page lock.

However I disagree we care most about locking large ranges. For all
buffered IO and all page faults we need to lock a range containing just a
single page. We cannot lock more due to locking constraints with mmap_sem.
So the places that will lock larger ranges are: direct IO, truncate, punch
hole. Writeback actually doesn't seem to need any additional protection at
least as I've sketched out things so far.

So single-page ranges matter at least as much as longer ranges. That's why
I came up with that page lock optimisation and merging...

> Also, I don't think it can replace the page lock entirely because
> the page lock as that also serialises against other parts of the VM
> (e.g. memory reclaim). I'd prefer not to complicate the issue by
> trying to be fancy - it's going to be hard enough to validate that
> it works correctly without having to validate that it does not
> introduce races that were previously prevented by the page lock.
  So I never meant to completely replace page lock. I agree that will be
almost impossible. I rather meant to add a capability of locking a range of
pages. But lets leave that for now.

									Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>