I added the list_del...

Everything is very resilient, I killed
the client-core over and over while dbench was running
at the same time as ls -R was running, and the client-core
always restarted... until finally, it didn't. I guess
related to the state of just what was going on at the time...
Hit the WARN_ON in service_operation, and then oopsed on
the orangefs_bufmap_put down at the end of wait_for_direct_io...

http://myweb.clemson.edu/~hubcap/after.list_del

-Mike

On Sat, Feb 13, 2016 at 9:56 PM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> On Sat, Feb 13, 2016 at 05:47:38PM +0000, Al Viro wrote:
>> On Sat, Feb 13, 2016 at 12:18:12PM -0500, Mike Marshall wrote:
>> > I added the patches, and ran a bunch of tests.
>> >
>> > Stuff works fine when left unbothered, and also
>> > when wrenches are thrown into the works.
>> >
>> > I had multiple userspace things going on at the
>> > same time, dbench, ls -R, find... kill -9 or control-C on
>> > any of them is handled well. When I killed both
>> > the client-core and its restarter, the kernel
>> > dealt with swarm of ops that had nowhere
>> > to go... the WARN_ON in service_operation
>> > was hit.
>> >
>> > Feb 12 16:19:12 be1 kernel: [ 3658.167544] orangefs: please confirm
>> > that pvfs2-client daemon is running.
>> > Feb 12 16:19:12 be1 kernel: [ 3658.167547] fs/orangefs/dir.c line 264:
>> > orangefs_readdir: orangefs_readdir_index_get() failure (-5)
>>
>> I.e. bufmap is gone.
>>
>> > Feb 12 16:19:12 be1 kernel: [ 3658.170741] ------------[ cut here ]------------
>> > Feb 12 16:19:12 be1 kernel: [ 3658.170746] WARNING: CPU: 0 PID: 1667
>> > at fs/orangefs/waitqueue.c:203 service_operation+0x4f6/0x7f0()
>>
>> ... and we are in wait_for_direct_io(), holding an r/w slot and finding
>> ourselves with bufmap already gone, despite not having freed that slot
>> yet. Bloody wonderful - we still have bufmap refcounting buggered somewhere.
>>
>> Which tree had that been?
>> Could you push that tree (having checked that
>> you don't have any uncommitted changes) in some branch?
>
> OK, at the very least there's this; should be folded into "orangefs: delay
> freeing slot until cancel completes"
>
> diff --git a/fs/orangefs/orangefs-kernel.h b/fs/orangefs/orangefs-kernel.h
> index 41f8bb1f..1e28555 100644
> --- a/fs/orangefs/orangefs-kernel.h
> +++ b/fs/orangefs/orangefs-kernel.h
> @@ -261,6 +261,7 @@ static inline void set_op_state_purged(struct orangefs_kernel_op_s *op)
>  {
>  	spin_lock(&op->lock);
>  	if (unlikely(op_is_cancel(op))) {
> +		list_del(&op->list);
>  		spin_unlock(&op->lock);
>  		put_cancel(op);
>  	} else {
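
For anyone following the thread without the tree handy, here is a rough userspace sketch of the ordering in the hunk above: take the op off whatever list it is on first, and only then drop the reference. Everything in it (struct node, struct toy_op, toy_purge and the helpers) is made up for illustration and is not the orangefs code. The unlink helper re-initialises the entry, which is closer to the kernel's list_del_init() than to list_del(), and what put_cancel() really does is only assumed to be "drop a reference".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a circular doubly linked list entry;
 * NOT the kernel's struct list_head. */
struct node {
	struct node *prev, *next;
};

/* A node pointing at itself is an empty list head or an unlinked entry. */
static void node_init(struct node *n)
{
	n->prev = n;
	n->next = n;
}

/* Append n just before head, i.e. at the tail of the list. */
static void node_add_tail(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Unlink n from its neighbours and leave it self-pointing. */
static void node_del(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	node_init(n);
}

/* Walk the list and report whether n is still reachable from head. */
static bool list_contains(const struct node *head, const struct node *n)
{
	for (const struct node *p = head->next; p != head; p = p->next)
		if (p == n)
			return true;
	return false;
}

/* Hypothetical op: a list entry plus a reference count. */
struct toy_op {
	struct node list;
	int refs;
};

/* Unlink first, then drop the reference (assumed role of put_cancel()).
 * Returns the remaining reference count. */
static int toy_purge(struct toy_op *op)
{
	node_del(&op->list);
	return --op->refs;
}
```

If the unlink were skipped, the list would still point at the op after its last reference was dropped, so any later walk of that list would touch memory that may already have been released.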