Re: Orangefs ABI documentation

Mike Marshall <hubcap@xxxxxxxxxxxx> · Mon, 8 Feb 2016 17:26:53 -0500

Hey Al...

I studied the relevant parts of the code in the context of your
several mail messages from this weekend so I could get the
most benefit from them... thanks...

Then I applied the patches you suggested and ran some tests;
things are much better now...

I can't make the kernel crash, or get the WARN_ON to trigger.

The way I run my test (dbench), there's a warmup phase that
involves file and directory creation, and then an execute phase
that also does some reading of the created files.

My impression is that dbench is more likely to fail ungracefully
if I signal the client-core to abort during the execute phase, and
more likely to complete normally if I signal the client-core to abort
during warmup (or cleanup, which removes the directory tree
built during warmup).

I'll do more tests tomorrow with more debug turned on, and see if
I can get some idea of what makes dbench so ill... the most important
thing is that the kernel doesn't crash, but it would be gravy if user
processes could better withstand a client-core recycle.

Here's the bufmap debug output; I didn't want to send 700k
of mostly "orangefs_bufmap_copy_from_iovec" to the list:

http://myweb.clemson.edu/~hubcap/out

grepping for "finalize" in all that noise is a good way to see
where client-core restarts happened. I ran dbench numerous
times, and managed to signal the client-core to restart during
the same run several times...

-Mike

On Sat, Feb 6, 2016 at 10:53 PM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> On Sun, Feb 07, 2016 at 01:38:35AM +0000, Al Viro wrote:
>> > > As for the WARN_ONs, the waitqueue one is easy to hit when the
>> > > client-core stops and restarts, you can see here where precopy_buffers
>> > > started whining about the client-core, you can see that the client
>> > > core restarted when the debug mask got sent back over, and then
>> > > the WARN_ON in waitqueue gets hit:
>>
>> > > [ 1239.198976] precopy_buffers: Failed to copy-in buffers. Please make
>> > > sure that  the pvfs2-client is running. -14
>>
>> Very interesting...
>>
>> Looks like there's another bug in restart handling.  Namely, restart happening
>> on write() tries to fetch more data from iter, without bothering to rewind to
>> where it used to be.  That's where those -EFAULT are coming from.  Easy to fix,
>> fortunately - on top of the double-free fix, apply the following:
>
>
> BTW, could you try to reproduce that WARN_ON with these two patches added
> and with bufmap debugging turned on?  Both double-free and lack of rewinding
> are real; I can see scenarios where they would trigger, and I'm pretty sure
> that the latter is triggering in your reproducer.  Moreover, I'm absolutely
> sure that spurious dropping of bufmap references is happening there; what I'm
> not sure is whether it was on this double-free or on something else...
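
For anyone following the thread, here is a rough user-space sketch of the
rewind problem described in the quoted text above. It is not the orangefs
code: toy_iter, toy_transport, toy_copy_from_iter, toy_submit and toy_write
are all made-up names standing in for the kernel's iov_iter and the
client-core round trip, and the 64-byte staging buffer is arbitrary. The
only point is that a retry which copies from the iterator again without
restoring the saved position comes up short, which the sketch reports as
-14 (-EFAULT), the same value seen in the precopy_buffers message.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for iov_iter: a flat buffer plus a cursor. */
struct toy_iter {
    const char *base;
    size_t len;
    size_t pos;
};

/* Hypothetical transport that fails the first `failures` submissions,
 * as if the client-core had restarted underneath the request. */
struct toy_transport {
    int failures;
    char received[64];
    size_t received_len;
};

/* Copy up to n bytes out of the iterator, advancing its cursor. */
size_t toy_copy_from_iter(char *dst, size_t n, struct toy_iter *it)
{
    size_t avail = it->len - it->pos;

    if (n > avail)
        n = avail;
    memcpy(dst, it->base + it->pos, n);
    it->pos += n;
    return n;
}

/* Hand the staged data to the transport; nonzero means "retry". */
int toy_submit(struct toy_transport *t, const char *buf, size_t n)
{
    if (t->failures > 0) {
        t->failures--;
        return -1;
    }
    memcpy(t->received, buf, n);
    t->received_len = n;
    return 0;
}

/* Write n bytes, retrying on transport failure.  With rewind == 0 the
 * retry keeps consuming from wherever the cursor was left (the bug);
 * with rewind != 0 the cursor is restored before every attempt. */
long toy_write(struct toy_transport *t, struct toy_iter *it, size_t n,
               int rewind)
{
    char staging[64];
    size_t start = it->pos;   /* remember where this write began */

    if (n > sizeof(staging))
        return -1;

    for (;;) {
        if (rewind)
            it->pos = start;  /* undo the advance from a failed attempt */
        if (toy_copy_from_iter(staging, n, it) != n)
            return -14;       /* short copy, reported like -EFAULT */
        if (toy_submit(t, staging, n) == 0)
            return (long)n;
    }
}
```

With a 5-byte source and one simulated failure, toy_write() without the
rewind returns -14 on the second attempt because the cursor is already at
the end; with the rewind the second attempt copies the same 5 bytes again
and succeeds.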