On Wed, 17 Apr 2013, Anand Avati wrote:
> On Wed, Apr 17, 2013 at 5:43 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
>
>     We've hit a new deadlock with fuse_lowlevel_notify_inval_inode, this
>     time on the read side:
>
>     - ceph-fuse queues an invalidate (in a separate thread)
>     - kernel initiates a read
>     - invalidate blocks in kernel, waiting on a page lock
>     - the read blocks in ceph-fuse
>
>     Now, assuming we're reading the stack traces properly, this is more
>     or less what we see with writes, except with reads, and the obvious
>     "don't block the read" would resolve it.
>
>     But! If that is the only way to avoid deadlock, I'm afraid it is
>     difficult to implement reliable cache invalidation at all.  The
>     reason we are invalidating is that the server told us to: we are no
>     longer allowed to do reads and the cached data is invalid.  The
>     obvious approach is to
>
>     1- stop processing new reads
>     2- let in-progress reads complete
>     3- invalidate the cache
>     4- ack to server
>
>     ...but that will deadlock as above, as any new read will lock pages
>     before blocking.  If we don't block, then the read may repopulate
>     pages we just invalidated.  We could
>
>     1- invalidate
>     2- if any reads happened while we were invalidating, goto 1
>     3- ack
>
>     but then we risk starvation and livelock.
>
>     How do other people solve this problem?  It seems like another
>     upcall that would let you block new reads (and/or writes) from
>     starting while the invalidate is in progress would do the trick,
>     but I'm not convinced I'm not missing something much simpler.
>
> Do you really need to call fuse_lowlevel_notify_inval_inode() while
> still holding the mutex in cfuse? It should be sufficient if you -
>
> 0 - Receive inval request from server
> 1 - mutex_lock() in cfuse
> 2 - invalidate cfuse cache
> 3 - mutex_unlock() in cfuse
> 4 - fuse_lowlevel_notify_inval_inode()
> 5 - ack to server
>
> The only necessary ordering seems to be 0->[2,4]->5. Placing 4 within
> the mutex boundaries looks unnecessary and self-imposed. In-progress
> reads which took the page lock before fuse_lowlevel_notify_inval_inode()
> would either read data cached in cfuse (in case they reached the cache
> before 1), or get sent over to the server as though the data was never
> cached. There wouldn't be a livelock either. Did I miss something?

It's the concurrent reads I'm concerned about:

3.5 - read(2) is called, locks some pages, and sends a message through
      the fuse connection
3.9 or 4.1 - ceph-fuse gets the read request.  It can either handle it,
      repopulating a region of the page cache it possibly just partially
      invalidated (rendering the invalidate a failure), or block,
      possibly preventing the invalidate from ever completing.
4.2 - invalidate either completes (having possibly missed some just-read
      pages), or deadlocks on a locked page (depending on whether we
      blocked the read above)

sage
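
For reference, a minimal sketch of the invalidation-thread ordering proposed
above (steps 0-5), written against the libfuse lowlevel API.  The Client
struct, drop_cached_range(), and ack_invalidate_to_server() are hypothetical
names for illustration, not the actual ceph-fuse code; the call uses the
libfuse 3 signature (libfuse 2.x passes a struct fuse_chan* instead of a
session).  The comments mark where the concurrent-read window described
above opens up.

    // sketch.cpp -- illustrative only, assumed names and structure
    #include <fuse_lowlevel.h>
    #include <mutex>

    struct Client {
      std::mutex cache_lock;        // protects the userspace object cache
      struct fuse_session *se;      // FUSE session handle

      // hypothetical helpers
      void drop_cached_range(fuse_ino_t ino, off_t off, off_t len);
      void ack_invalidate_to_server(fuse_ino_t ino);
    };

    // Runs in the invalidation thread when the server revokes caching.
    void handle_server_invalidate(Client *c, fuse_ino_t ino,
                                  off_t off, off_t len)
    {
      {
        // 1-3: invalidate the userspace cache under the mutex, then
        // release it before talking to the kernel.
        std::lock_guard<std::mutex> l(c->cache_lock);
        c->drop_cached_range(ino, off, len);
      }

      // 4: ask the kernel to drop its page cache for the range with the
      // mutex released, so an incoming read can still be serviced instead
      // of blocking.  Note the window here: a read that locked its pages
      // before this call completes can repopulate part of the range that
      // was just dropped above, which is the race discussed in the reply.
      fuse_lowlevel_notify_inval_inode(c->se, ino, off, len);

      // 5: acknowledge the invalidate back to the server
      c->ack_invalidate_to_server(ino);
    }

Releasing the mutex before step 4 avoids the write-side deadlock, but as
noted above it trades atomicity for progress: the ack in step 5 no longer
guarantees that no stale pages were re-read in the meantime.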