Re: Orangefs ABI documentation

Mike Marshall <hubcap@xxxxxxxxxxxx> · Tue, 16 Feb 2016 18:15:56 -0500

This thing is invulnerable now!

Nothing hangs when I kill the client-core, and the client-core
always restarts.

Sometimes, if you hit it right with a kill while dbench is running,
a file create will fail.

I've been trying to trace down why all day, in case there's
something that can be done...

Here's what I see:

  orangefs_create
    service_operation
      wait_for_matching_downcall purges op and returns -EAGAIN
      orangefs_clean_up_interrupted_operation
      if (EAGAIN)
        ...
        goto retry_servicing
      wait_for_matching_downcall returns 0
    service_operation returns 0
  orangefs_create has good return value from service_operation

   op->khandle: 00000000-0000-0000-0000-000000000000
   op->fs_id: 0

   subsequent getattr on bogus object fails orangefs_create on EINVAL.

   seems like the second time around, wait_for_matching_downcall
   must have seen op_state_serviced, but I don't see how yet...

I pushed the new patches out to

gitolite.kernel.org:pub/scm/linux/kernel/git/hubcap/linux
for-next

I made a couple of additional patches that make it easier to read
the flow of gossip statements, and also removed a few lines of vestigial
ASYNC code.

-Mike

On Mon, Feb 15, 2016 at 6:04 PM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> On Mon, Feb 15, 2016 at 05:32:54PM -0500, Martin Brandenburg wrote:
>
>> Something that used a slot, such as reader, would call
>> service_operation while holding a bufmap. Then the client-core would
>> crash, and the kernel would get run_down waiting on the slots to be
>> given up. But the slots are not given up until someone wakes all the
>> processes waiting in service_operation up, which happens after all the
>> slots are given up. Then client-core hangs until someone sends a
>> deadly signal to all the processes waiting in service_operation or
>> presumably the timeout expires.
>>
>> This splits finalize and run_down so that orangefs_devreq_release can
>> mark the slot map as killed, then purge waiting ops, then wait for all
>> the slots to be released. Meanwhile, processes which were waiting will
>> get into orangefs_bufmap_get which will see that the slot map is
>> shutting down and wait for the client-core to come back.
>
> D'oh.  Yes, that was exactly the point of separating mark_dead and run_down -
> the latter should've been done after purging all requests.  Fixes folded,
> branch force-pushed.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html