(Btw, please don't top post on kernel mailing lists folks, it's annoying)

On Wed, 2015-08-19 at 12:16 -0400, Alex Gorbachev wrote:
> I have to say that changing default_cmdsn_depth did not help us with
> the abnormal timeouts, i.e. an OSD failing or some other abrupt
> event. When that happens we detect the event via ABORT_TASK, and if
> the event is transient usually nothing happens. Anything more than a
> few seconds will usually result in Ceph recovery, but ESXi gets stuck
> and never comes out of APD. It looks like it tries to establish
> another session by bombarding the target with retries and resets, and
> ultimately gives up and goes to the PDL state. Then the only option
> is a reboot.
>
> So to be clear, we have moved on from a discussion about slow storage
> to a discussion about what happens during unexpected and abnormal
> timeouts. Anecdotal evidence suggests that SCST-based systems will
> allow ESXi to recover from this condition, while ESXi does not play
> as well with LIO-based systems in those situations.
>
> What is the difference, and is there willingness to allow LIO to be
> modified to work with this ESXi behavior? Or should we ask VMware to
> do something for ESXi to play better with LIO? I cannot fix the code,
> but would be happy to be the voice of the issue via any available
> channels.
>

Based on these and earlier comments, I think there is still some
misconception about misbehaving backend devices, and about what needs
to happen in order for LIO to make forward progress during iscsi
session reinstatement. Allowing a new session login to proceed and
submit new WRITEs while the failed session's outstanding I/O has not
been completed back with exception status by the backend driver is
dangerous.
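To make the hazard concrete, here is a toy simulation in plain Python
(not LIO code; all names are invented for illustration). It models a
stale WRITE from a failed session that is still in flight in the
backend when the reinstated session writes the same LBA:

```python
# Toy model of the ordering hazard: a stale WRITE from a failed
# session completes in the backend *after* a new session's WRITE to
# the same LBA, silently clobbering the new data.

disk = {}  # LBA -> data; last completion wins on the medium

def complete_write(lba, data):
    """Backend completion handler: writes land in completion order."""
    disk[lba] = data

# Session 1 submits a WRITE to LBA 0, then the session fails before
# the backend completes it; the I/O is neither completed nor aborted.
stale_io = (0, b"old-session-data")

# Session reinstatement is allowed to proceed anyway, and the new
# session's WRITE to the same LBA completes first...
complete_write(0, b"new-session-data")

# ...then the stale WRITE finally completes, overwriting it.
complete_write(*stale_io)

assert disk[0] == b"old-session-data"  # the new WRITE was lost
```

This is exactly the reordering that quiescing the failed session's
I/O before new backend submission is meant to rule out.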
Unless previous I/Os can (eventually) be completed or aborted within
target-core before new backend driver I/O submission happens, there is
no guarantee that the stale WRITEs won't complete after subsequent new
WRITEs from a different session carrying a new command sequence
number. That means new writes can be silently lost, and it is the
reason why 'violating the spec' in this context is not allowed.

If a backend driver is not able to complete I/O before ESX's timeout
for giving up on outstanding I/O is reached, then the backend driver
needs to:

* Have a lower internal I/O timeout, so it can complete back to
  target-core with exception status before ESX gives up on iscsi
  session login attempts and the associated session I/O.

Also note that SCSI LLDs and raw block drivers work very differently
with respect to I/O timeout and reset. For underlying SCSI LLDs,
scsi_eh will attempt to reset the device in order to complete failed
I/O. Setting the scsi_eh timeout lower than the timeout at which ESX
gives up on iscsi login and fails I/O is one simple option to
consider.

However, if your LLD or the LLD's firmware doesn't *ever* complete I/O
back to scsi-core even after a reset occurs, resulting in LIO blocking
indefinitely on session reinstatement, then that is an LLD-specific
bug and it really should be fixed.

--nab

--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html