Re: [fuse-devel] fuse_lowlevel_notify_inval_inode deadlock

On 04/18/2013 12:12 PM, Sage Weil wrote:
On Thu, 18 Apr 2013, Anand Avati wrote:
On Apr 18, 2013, at 10:05 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
On Wed, 17 Apr 2013, Anand Avati wrote:
On Wed, Apr 17, 2013 at 9:45 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:

On Wed, 17 Apr 2013, Anand Avati wrote:
On Wed, Apr 17, 2013 at 5:43 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
      We've hit a new deadlock with fuse_lowlevel_notify_inval_inode, this
      time on the read side:

      - ceph-fuse queues an invalidate (in a separate thread)
      - kernel initiates a read
      - invalidate blocks in kernel, waiting on a page lock
      - the read blocks in ceph-fuse

      Now, assuming we're reading the stack traces properly, this is more or
      less what we see with writes, except with reads, and the obvious "don't
      block the read" would resolve it.

      But!  If that is the only way to avoid deadlock, I'm afraid it is
      difficult to implement reliable cache invalidation at all.  The reason
      we are invalidating is because the server told us to: we are no longer
      allowed to do reads and cached data is invalid.  The obvious approach
      is to
      1- stop processing new reads
      2- let in-progress reads complete
      3- invalidate the cache
      4- ack to server

      ...but that will deadlock as above, as any new read will lock pages
      before blocking.  If we don't block, then the read may repopulate pages
      we just invalidated.  We could

      1- invalidate
      2- if any reads happened while we were invalidating, goto 1
      3- ack

      but then we risk starvation and livelock.
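
      For illustration, that retry loop might look roughly like the sketch
      below for a libfuse 2.x lowlevel filesystem.  The per-inode read
      counter is hypothetical bookkeeping (and would need proper locking),
      not actual ceph-fuse code; only fuse_lowlevel_notify_inval_inode() is
      the real libfuse call.

      #include <fuse_lowlevel.h>
      #include <stdint.h>

      /* Hypothetical per-inode state; ceph-fuse's real structures differ. */
      struct inval_state {
          uint64_t reads_started;   /* bumped by the cfuse read handler */
      };

      static void invalidate_until_quiet(struct fuse_chan *ch,
                                         fuse_ino_t ino,
                                         struct inval_state *st)
      {
          for (;;) {
              uint64_t before = st->reads_started;

              /* Walks the kernel address_space for this inode, locking and
               * dropping pages; off=0, len=0 means all cached data. */
              fuse_lowlevel_notify_inval_inode(ch, ino, 0, 0);

              /* No read slipped in while we were invalidating: done. */
              if (st->reads_started == before)
                  break;

              /* Otherwise a read may have repopulated pages we just
               * dropped, so go around again -- this retry is where the
               * starvation/livelock risk comes from. */
          }
          /* ...only now ack the revocation to the server... */
      }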

      How do other people solve this problem?  It seems like another upcall
      that would let you block new reads (and/or writes) from starting while
      the invalidate is in progress would do the trick, but I'm not convinced
      I'm not missing something much simpler.


Do you really need to call fuse_lowlevel_notify_inval_inode() while still
holding the mutex in cfuse? It should be sufficient if you -

0 - Receive inval request from server
1 - mutex_lock() in cfuse
2 - invalidate cfuse cache
3 - mutex_unlock() in cfuse
4 - fuse_lowlevel_notify_inval_inode()
5 - ack to server

The only necessary ordering seems to be 0->[2,4]->5. Placing 4 within the
mutex boundaries looks unnecessary and self-imposed. In-progress reads which
took the page lock before fuse_lowlevel_notify_inval_inode() would either
read data cached in cfuse (in case they reached the cache before 1), or get
sent over to the server as though data was never cached. There wouldn't be a
livelock either. Did I miss something?
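
As a concrete sketch of that ordering (the cfuse_* helpers, the mutex, and
ack_to_server() are placeholders for whatever ceph-fuse actually uses; only
fuse_lowlevel_notify_inval_inode() is real libfuse):

#include <fuse_lowlevel.h>
#include <pthread.h>

/* Placeholder per-inode state; not actual ceph-fuse structures. */
struct cfuse_inode {
    pthread_mutex_t lock;
    /* ... cfuse-side cached data ... */
};

void cfuse_cache_drop(struct cfuse_inode *in);   /* hypothetical */
void ack_to_server(fuse_ino_t ino);              /* hypothetical */

/* Step 0: an invalidation request arrives from the server. */
static void handle_server_inval(struct fuse_chan *ch, fuse_ino_t ino,
                                struct cfuse_inode *in)
{
    pthread_mutex_lock(&in->lock);        /* 1 */
    cfuse_cache_drop(in);                 /* 2: drop cfuse's own cache */
    pthread_mutex_unlock(&in->lock);      /* 3 */

    /* 4: drop the kernel page cache, outside the mutex, so an in-progress
     * read that already holds page locks can still be served (from cfuse's
     * old cache or from the server) and the invalidate is merely delayed. */
    fuse_lowlevel_notify_inval_inode(ch, ino, 0, 0);

    ack_to_server(ino);                   /* 5 */
}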

It's the concurrent reads I'm concerned about:

3.5 - read(2) is called, locks some pages, and sends a message through the
fuse connection

3.9 or 4.1 - ceph-fuse gets the read request.  It can either handle it,
repopulating a region of the page cache it possibly just partially
invalidated (rendering the invalidate a failure),

I think the problem lies here: handling the read before 5 is returning old
data (which effectively renders the invalidation a failure). Step 0 needs to
guarantee that the new data about to be written is already staged on the
server and made available for reads. However, the write request itself needs
to be blocked from completing until step 5 completes on all other clients.

It sounds like you're thinking of a weaker consistency model.  The MDS is
telling us to stop reads and invalidate our cache *before* anybody is
allowed to write. At the end of this process, when we ack, we should have an
empty page cache for this file, and any reads should be blocked until
further notice.

Yes, the consistency model I was talking about is weaker than "block new reads, purge all cache, wait until further notified". If you are striping data and if the read request spans multiple stripes, then I guess you do need a stricter version.

or block, possibly preventing the invalidate from ever completing.


Hmm, where would it block? Didn't we mutex_unlock() in 3 already? And the
server should be prepared to serve staged data even before step 0.
Invalidation might be *delayed* until the in-progress reads finish, but that
only delays the completion of write(); there are no deadlocks anywhere. Hope
I didn't miss something :-)

You can ignore the mutex_lock stuff; I don't think it's necessary to see
the issue.  Simply consider a racing read (that has pages locked) and
invalidate (that is walking through the address_space mapping, locking and
discarding pages).  We can't block the read without risking deadlock with
the invalidate, and we can't continue with the read without making the
invalidate unsuccessful/unreliable.  We can actually do reads at this
point from the ceph client vs. server perspective, since we haven't acked
the revocation yet... but with the fuse vs. ceph-fuse interaction we are
choosing between deadlock or potential livelock (if we do the read and
then try the invalidate a second time).

For the consistency model you are looking at (halt all new reads, clear
all page cache, perform the write), this sure seems to be a deadlock vs.
livelock situation.  Why aren't your consistency requirements a problem
for the ceph in-kernel client? What extra machinery does it have which
FUSE does not? I suppose you implement the "halt new reads" in
f_op->[aio_]read even before taking page locks?

Exactly.
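
In rough pseudocode, the idea being confirmed here is that the in-kernel
client's "may I read?" check sits in its read entry point, before any
page-cache page is locked.  The function names below are made up for
illustration and are not real kernel or ceph.ko APIs:

#include <stddef.h>
#include <sys/types.h>

struct file;   /* opaque here */

/* Blocks while the server has revoked our read capability (hypothetical). */
void wait_for_read_capability(struct file *f);

/* Ordinary page-cache read path; locks pages (hypothetical stand-in). */
ssize_t page_cache_read(struct file *f, char *buf, size_t len, off_t off);

static ssize_t client_read(struct file *f, char *buf, size_t len, off_t off)
{
    /* No page locks are held yet, so a concurrent invalidate (which walks
     * the address_space locking pages) can always make progress while we
     * wait here. */
    wait_for_read_capability(f);

    /* Only now touch -- and lock -- page-cache pages. */
    return page_cache_read(f, buf, len, off);
}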

I think to get the proper semantics in the fuse case (and also to solve
the striping problem you describe below), there would need to be an upcall
to quiesce reads (and probably something similar for writes, depending on
how the writeback stuff works out) before doing the invalidate.
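
If such an upcall existed, the userspace side might look like the sketch
below.  fuse_lowlevel_notify_quiesce_reads() and
fuse_lowlevel_notify_resume_reads() are entirely hypothetical -- no such
notifications exist in libfuse -- and are only here to show the ordering
they would enable:

#include <fuse_lowlevel.h>

/* Hypothetical notifications; NOT part of libfuse. */
int fuse_lowlevel_notify_quiesce_reads(struct fuse_chan *ch, fuse_ino_t ino);
int fuse_lowlevel_notify_resume_reads(struct fuse_chan *ch, fuse_ino_t ino);

void ack_to_server(fuse_ino_t ino);   /* placeholder */

static void revoke_and_invalidate(struct fuse_chan *ch, fuse_ino_t ino)
{
    /* 1: kernel stops admitting new reads before they lock any pages,
     *    and lets in-progress reads drain. */
    fuse_lowlevel_notify_quiesce_reads(ch, ino);

    /* 2: the invalidate can no longer race a read holding page locks. */
    fuse_lowlevel_notify_inval_inode(ch, ino, 0, 0);

    /* 3: safe to ack the revocation; the page cache is empty and stays
     *    empty because reads are still quiesced. */
    ack_to_server(ino);

    /* 4: later, when the server grants the read capability back:
     *    fuse_lowlevel_notify_resume_reads(ch, ino); */
}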

What does gluster do in this case?  Are they using direct_io mode?  Or a
different consistency model?


Currently we offer only close-to-open consistency, like NFSv3. For tighter requirements we honor POSIX locks and uncache the locked regions, in combination with direct_io. A concurrent reader and writer of the same file/record is just not the use case gluster is focusing on.

Avati




