Re: ESXi + LIO + Ceph RBD problem

Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> · Fri, 21 Aug 2015 12:20:37 -0400

>> > Based on these and earlier comments, I think there is still some
>> > misconception about misbehaving backend devices, and what needs to
>> > happen in order for LIO to make forward progress during iscsi session
>> > reinstatement.
>> >
>> > Allowing a new session login to proceed and submit new WRITEs when the
>> > failed session can't get I/O completion with exception status to happen from
>> > a backend driver is bad.  Because, unless previous I/Os are able to be
>> > (eventually) completed or aborted within target-core before new backend
>> > driver I/O submission happens, there is no guarantee the stale WRITEs won't
>> > be completed after subsequent new WRITEs from a different session with a
>> > new command sequence number.
>> >
>> > Which means there is potential for new writes to be lost, and is the reason
>> > why 'violating the spec' in this context is not allowed.
>> >
>>

Apologies for the heavy snipping above and below, and thank you for
this patient clarification. I believe that, in the context of Ceph,
these issues would be directly addressed by Mike Christie's proposal:

> 2. In the block layer add callouts/cmds so that we can abort
> requests/bios at the LLD level.
> 3. For rbd, we will implement support for #2. In ceph then we would need
> to add code to be able to track down commands and kill them if we can or
> at least figure out what is going on and log a message so we do not have
> these mysterious hung commands.

That looks like a very good fit for eliminating the ESXi ABORT_TASK
freefall issues that we (a number of ESXi admins here) have been seeing.
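
To make sure I have the ordering hazard right, here is a toy Python
sketch of it. Everything in it (the Backend class, submit/complete/abort,
reinstate) is a hypothetical name of mine for illustration, not LIO,
target-core or rbd code. It only models a backend that holds a WRITE
from the failed session indefinitely while a new session writes the
same LBA.

```python
class Backend:
    """Toy block device whose completions can be delayed arbitrarily."""

    def __init__(self):
        self.blocks = {}     # lba -> data that actually reached the media
        self.in_flight = []  # (session, lba, data) submitted but not completed

    def submit(self, session, lba, data):
        # The WRITE is queued; nothing reaches the media until complete().
        self.in_flight.append((session, lba, data))

    def complete(self, session):
        """Complete every pending WRITE from `session`, in submission order."""
        remaining = []
        for owner, lba, data in self.in_flight:
            if owner == session:
                self.blocks[lba] = data
            else:
                remaining.append((owner, lba, data))
        self.in_flight = remaining

    def abort(self, session):
        """Drop pending WRITEs from `session`; return how many were dropped."""
        before = len(self.in_flight)
        self.in_flight = [e for e in self.in_flight if e[0] != session]
        return before - len(self.in_flight)


def reinstate(abort_stale_first):
    """Return what ends up at LBA 0 after a session reinstatement."""
    backend = Backend()
    backend.submit("old", lba=0, data="stale")  # hangs in the backend
    if abort_stale_first:
        backend.abort("old")                    # stale I/O killed before relogin
    backend.submit("new", lba=0, data="fresh")  # WRITE from the new session
    backend.complete("new")
    backend.complete("old")                     # late completion, if still pending
    return backend.blocks[0]


unsafe = reinstate(abort_stale_first=False)
safe = reinstate(abort_stale_first=True)
print("login allowed early:", unsafe)
print("stale I/O aborted first:", safe)
```

In the first case the late completion from the old session overwrites
the new data, which is the lost-write scenario described above. In the
second case the stale request is gone before the new session submits
anything, which is how I understand the point of being able to abort
requests at the LLD level.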

Thanks,
Alex
--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html