Hey Al...

I studied the relevant parts of the code in the context of your several
mail messages from this weekend so I could get the most benefit from
them... thanks...

Then I applied the patches you suggested and ran some tests; things are
much better now... I can't make the kernel crash or get the WARN_ON
to trigger.

The way I run my test (dbench), there's a warmup phase, which involves
file and directory creation, and then an execute phase, which also does
some reading of the created files. My impression is that dbench is more
likely to fail ungracefully if I signal the client-core to abort during
the execute phase, and more likely to complete normally if I signal the
client-core to abort during warmup (or cleanup, which removes the
directory tree built during warmup).

I'll do more tests tomorrow with more debug turned on, and see if I can
get some idea of what makes dbench so ill... the most important thing is
that the kernel doesn't crash, but it would be gravy if user processes
could better withstand a client-core recycle.

Here's the bufmap debug output; I didn't want to send 700k of mostly
"orangefs_bufmap_copy_from_iovec" to the list:

http://myweb.clemson.edu/~hubcap/out

Grepping for "finalize" in all that noise is a good way to see where
client-core restarts happened. I ran dbench numerous times, and managed
to signal the client-core to restart during the same run several times...

-Mike

On Sat, Feb 6, 2016 at 10:53 PM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> On Sun, Feb 07, 2016 at 01:38:35AM +0000, Al Viro wrote:
>
>> > > As for the WARN_ONs, the waitqueue one is easy to hit when the
>> > > client-core stops and restarts, you can see here where precopy_buffers
>> > > started whining about the client-core, you can see that the client
>> > > core restarted when the debug mask got sent back over, and then
>> > > the WARN_ON in waitqueue gets hit:
>> > >
>> > > [ 1239.198976] precopy_buffers: Failed to copy-in buffers. Please make
>> > > sure that the pvfs2-client is running. -14
>>
>> Very interesting...
>>
>> Looks like there's another bug in restart handling. Namely, restart happening
>> on write() tries to fetch more data from iter, without bothering to rewind to
>> where it used to be. That's where those -EFAULT are coming from. Easy to fix,
>> fortunately - on top of the double-free fix, apply the following:
>
>
> BTW, could you try to reproduce that WARN_ON with these two patches added
> and with bufmap debugging turned on? Both double-free and lack of rewinding
> are real; I can see scenarios where they would trigger, and I'm pretty sure
> that the latter is triggering in your reproducer. Moreover, I'm absolutely
> sure that spurious dropping of bufmap references is happening there; what I'm
> not sure is whether it was on this double-free or on something else...
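
To illustrate the rewind problem Al describes above (a retried write()
pulling more data from an iterator that was never reset), here is a minimal
userspace sketch. It is not the orangefs code and not Al's patch:
`struct cursor`, `cursor_copy` and `write_with_retry` are hypothetical names
standing in for an iov_iter and the write path, and -14 stands in for the
-EFAULT seen in the log.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an iov_iter: a cursor over a source buffer. */
struct cursor {
    const char *base;  /* start of the source data */
    size_t pos;        /* how much has been consumed so far */
    size_t len;        /* total length of the source data */
};

/* Copy up to n bytes from the cursor into dst and advance the cursor. */
static size_t cursor_copy(struct cursor *c, char *dst, size_t n)
{
    assert(c->pos <= c->len);
    size_t left = c->len - c->pos;
    if (n > left)
        n = left;
    memcpy(dst, c->base + c->pos, n);
    c->pos += n;
    return n;
}

/*
 * Hypothetical write path. The first `failures` attempts are treated as
 * interrupted by a (simulated) client-core restart and are retried.
 * With rewind != 0 the cursor is restored to its snapshot before each
 * retry; with rewind == 0 the retry starts from wherever the previous
 * attempt left the cursor, so it comes up short.
 */
static int write_with_retry(struct cursor *c, char *dst, size_t n,
                            int failures, int rewind)
{
    struct cursor saved = *c;  /* snapshot taken before the first attempt */

    for (;;) {
        size_t got = cursor_copy(c, dst, n);
        if (got != n)
            return -14;        /* short copy: analogue of -EFAULT */
        if (failures-- <= 0)
            return 0;          /* attempt was not interrupted: done */
        if (rewind)
            *c = saved;        /* rewind before retrying */
    }
}
```

With an 8-byte source and one simulated restart, the rewinding variant
copies the same 8 bytes again and succeeds, while the non-rewinding variant
finds the cursor already exhausted on the retry and returns -14.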