Re: Orangefs ABI documentation

Mike Marshall <hubcap@xxxxxxxxxxxx> · Thu, 11 Feb 2016 18:36:52 -0500

> what should the daemon see in such a situation?

I agree that it looks like the client-core doesn't notice that a
process doing IO was cancelled. I don't think the client-core
keeps track of which slots are in use; it just trusts that the
buffer-index in any IO upcall is safe to use. I believe that
wait_for_cancellation_downcall has its roots in the old
AIO code and at some point ended up being used
outside of that context.
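
To make the missing bookkeeping concrete, here is a minimal sketch,
assuming a hypothetical fixed pool of NUM_SLOTS buffer slots, of a
client-core table that refuses an IO upcall whose buffer index is
still marked busy instead of simply trusting it. struct slot_table,
slot_claim() and slot_release() are illustrative names only, not
existing client-core code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical slot count; NOT taken from the OrangeFS sources. */
#define NUM_SLOTS 8

/* Hypothetical per-daemon record of which buffer slots are busy. */
struct slot_table {
	bool busy[NUM_SLOTS];
};

/*
 * Claim the slot named by an IO upcall's buffer index.
 * Returns false (rather than trusting the index) when the index is
 * out of range or the slot is still held by an unfinished operation.
 */
static bool slot_claim(struct slot_table *t, int buffer_index)
{
	if (buffer_index < 0 || buffer_index >= NUM_SLOTS)
		return false;
	if (t->busy[buffer_index])
		return false;
	t->busy[buffer_index] = true;
	return true;
}

/* Release a slot once its IO has been fully answered. */
static void slot_release(struct slot_table *t, int buffer_index)
{
	assert(buffer_index >= 0 && buffer_index < NUM_SLOTS);
	assert(t->busy[buffer_index]);
	t->busy[buffer_index] = false;
}
```

With a check like this, reuse of a slot that the daemon still
considers busy could surface as a rejected upcall rather than as
silent data corruption.
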

-Mike

On Tue, Feb 9, 2016 at 4:06 PM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> On Tue, Feb 09, 2016 at 05:40:49PM +0000, Al Viro wrote:
>
>> Could you try, on top of those fixes, commenting the entire
>>         if (op->downcall.type == ORANGEFS_VFS_OP_FILE_IO) {
>>                 long n = wait_for_completion_interruptible_timeout(&op->done,
>>                                                         op_timeout_secs * HZ);
>>                 if (unlikely(n < 0)) {
>>                         gossip_debug(GOSSIP_DEV_DEBUG,
>>                                 "%s: signal on I/O wait, aborting\n",
>>                                 __func__);
>>                 } else if (unlikely(n == 0)) {
>>                         gossip_debug(GOSSIP_DEV_DEBUG,
>>                                 "%s: timed out.\n",
>>                                 __func__);
>>                 }
>>         }
>> in orangefs_devreq_write_iter() out and see if the corruption happens?
>
> Another thing: what are the protocol rules regarding cancels?  The current
> code looks very odd: if we get hit by a signal after the daemon has
> picked up, e.g., a read request but before it has replied, we will call
> orangefs_cancel_op_in_progress(), which will call service_operation() with
> ORANGEFS_OP_CANCELLATION.  That will insert the cancel request
> into the list, practically immediately notice that we have a pending signal,
> remove the cancel request from the list and bugger off, with the daemon almost
> certainly *not* getting to see it at all.
>
> I've asked that before; if anybody has explained it, I've missed the reply.
> How the fuck is that supposed to work?  Forget the kernel-side implementation
> details: what should the daemon see in such a situation?
>
> I would expect something like "you can't reuse a slot until the operation has
> either completed, been purged, or had a cancel sent and ACKed by
> the daemon".  Is that what is intended?  If so, the handling of cancels might
> be better off asynchronous: let the slot freeing be done after the cancel
> has been ACKed and _not_ in the context of the original syscall...
>
> There are some traces of AIO support in that thing; could this be a victim of
> trimming async parts for submission into the mainline?
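
The slot rule proposed in the quoted mail ("you can't reuse a slot
until the operation has either completed, been purged, or had a
cancel sent and ACKed by the daemon") can be sketched as a small
per-slot state machine. This is a minimal sketch under stated
assumptions: all state and event names are hypothetical, it models a
single slot, and it ignores locking and late or duplicate replies.

```c
#include <stdbool.h>

/* Hypothetical slot states; none of these names are in OrangeFS. */
enum slot_state {
	SLOT_FREE,
	SLOT_IN_FLIGHT,   /* upcall handed to the daemon, no reply yet */
	SLOT_CANCEL_SENT, /* caller gave up; cancel queued, awaiting ACK */
};

/* Hypothetical events that can affect one slot. */
enum slot_event {
	EV_SUBMIT,      /* IO upcall using this slot is issued */
	EV_COMPLETED,   /* daemon replied to the original request */
	EV_PURGED,      /* daemon went away; pending ops purged */
	EV_INTERRUPTED, /* caller hit by a signal or timeout */
	EV_CANCEL_ACK,  /* daemon acknowledged the cancel */
};

/*
 * Returns the next state, or -1 if the event is illegal in state s.
 * EV_INTERRUPTED never frees the slot by itself: only a completion,
 * a purge or the daemon's ACK of the cancel does, so the final free
 * can happen later, outside the interrupted syscall.
 */
static int slot_step(enum slot_state s, enum slot_event ev)
{
	switch (s) {
	case SLOT_FREE:
		return ev == EV_SUBMIT ? SLOT_IN_FLIGHT : -1;
	case SLOT_IN_FLIGHT:
		if (ev == EV_COMPLETED || ev == EV_PURGED)
			return SLOT_FREE;
		if (ev == EV_INTERRUPTED)
			return SLOT_CANCEL_SENT; /* queue a cancel, keep slot */
		return -1;
	case SLOT_CANCEL_SENT:
		if (ev == EV_COMPLETED || ev == EV_PURGED ||
		    ev == EV_CANCEL_ACK)
			return SLOT_FREE;
		return -1;
	}
	return -1;
}

/* A slot may be handed to a new IO only once it is back to FREE. */
static bool slot_reusable(enum slot_state s)
{
	return s == SLOT_FREE;
}
```

The interesting part is SLOT_CANCEL_SENT: the interrupted syscall can
return immediately, and whoever processes the daemon's ACK (or the
purge) performs the free.
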