On Wed, 17 Apr 2013, Anand Avati wrote:
> On Wed, Apr 17, 2013 at 5:43 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
>
>     We've hit a new deadlock with fuse_lowlevel_notify_inval_inode, this
>     time on the read side:
>
>     - ceph-fuse queues an invalidate (in a separate thread)
>     - kernel initiates a read
>     - invalidate blocks in kernel, waiting on a page lock
>     - the read blocks in ceph-fuse
>
>     Now, assuming we're reading the stack traces properly, this is more
>     or less what we see with writes, except with reads, and the obvious
>     "don't block the read" would resolve it.
>
>     But! If that is the only way to avoid deadlock, I'm afraid it is
>     difficult to implement reliable cache invalidation at all.  The
>     reason we are invalidating is that the server told us to: we are no
>     longer allowed to do reads and the cached data is invalid.  The
>     obvious approach is to
>
>     1- stop processing new reads
>     2- let in-progress reads complete
>     3- invalidate the cache
>     4- ack to server
>
>     ...but that will deadlock as above, as any new read will lock pages
>     before blocking.  If we don't block, then the read may repopulate
>     pages we just invalidated.  We could
>
>     1- invalidate
>     2- if any reads happened while we were invalidating, goto 1
>     3- ack
>
>     but then we risk starvation and livelock.
>
>     How do other people solve this problem?  It seems like another
>     upcall that would let you block new reads (and/or writes) from
>     starting while the invalidate is in progress would do the trick,
>     but I'm not convinced I'm not missing something much simpler.
>
> Do you really need to call fuse_lowlevel_notify_inval_inode() while
> still holding the mutex in cfuse? It should be sufficient if you -
>
> 0 - Receive inval request from server
> 1 - mutex_lock() in cfuse
> 2 - invalidate cfuse cache
> 3 - mutex_unlock() in cfuse
> 4 - fuse_lowlevel_notify_inval_inode()
> 5 - ack to server
>
> The only necessary ordering seems to be 0->[2,4]->5. Placing 4 within
> the mutex boundaries looks unnecessary and self-imposed. In-progress
> reads which took the page lock before fuse_lowlevel_notify_inval_inode()
> would either read data cached in cfuse (in case they reached the cache
> before 1), or get sent over to the server as though the data was never
> cached. There wouldn't be a livelock either. Did I miss something?

It's the concurrent reads I'm concerned about:

3.5 - read(2) is called, locks some pages, and sends a message through
      the fuse connection
3.9 or 4.1 - ceph-fuse gets the read request.  It can either handle it,
      repopulating a region of the page cache it possibly just partially
      invalidated (rendering the invalidate a failure), or block,
      possibly preventing the invalidate from ever completing.
4.2 - invalidate either completes (having possibly missed some just-read
      pages), or deadlocks on a locked page (depending on whether we
      blocked the read above)

sage
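
For reference, a minimal sketch of the invalidation-thread ordering proposed
above (steps 0-5), written against the libfuse lowlevel API.  The Client
struct, drop_cached_range(), and ack_invalidate_to_server() are hypothetical
names for illustration, not the actual ceph-fuse code; the call uses the
libfuse 3 signature (libfuse 2.x passes a struct fuse_chan* instead of a
session).  The comments mark where the concurrent-read window described
above opens up.

    // sketch.cpp -- illustrative only, assumed names and structure
    #include <fuse_lowlevel.h>
    #include <mutex>

    struct Client {
      std::mutex cache_lock;        // protects the userspace object cache
      struct fuse_session *se;      // FUSE session handle

      // hypothetical helpers
      void drop_cached_range(fuse_ino_t ino, off_t off, off_t len);
      void ack_invalidate_to_server(fuse_ino_t ino);
    };

    // Runs in the invalidation thread when the server revokes caching.
    void handle_server_invalidate(Client *c, fuse_ino_t ino,
                                  off_t off, off_t len)
    {
      {
        // 1-3: invalidate the userspace cache under the mutex, then
        // release it before talking to the kernel.
        std::lock_guard<std::mutex> l(c->cache_lock);
        c->drop_cached_range(ino, off, len);
      }

      // 4: ask the kernel to drop its page cache for the range with the
      // mutex released, so an incoming read can still be serviced instead
      // of blocking.  Note the window here: a read that locked its pages
      // before this call completes can repopulate part of the range that
      // was just dropped above, which is the race discussed in the reply.
      fuse_lowlevel_notify_inval_inode(c->se, ino, off, len);

      // 5: acknowledge the invalidate back to the server
      c->ack_invalidate_to_server(ino);
    }

Releasing the mutex before step 4 avoids the write-side deadlock, but as
noted above it trades atomicity for progress: the ack in step 5 no longer
guarantees that no stale pages were re-read in the meantime.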