RE: ESXi + LIO + Ceph RBD problem

> -----Original Message-----
> From: target-devel-owner@xxxxxxxxxxxxxxx [mailto:target-devel-
> owner@xxxxxxxxxxxxxxx] On Behalf Of Nicholas A. Bellinger
> Sent: August 19, 2015 11:15 PM
> To: Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx>
> Cc: Steve Beaudry <Steve.Beaudry@xxxxxxxxxxxxx>; Martin Svec
> <martin.svec@xxxxxxxx>; target-devel@xxxxxxxxxxxxxxx
> Subject: Re: ESXi + LIO + Ceph RBD problem
>
> (Btw, please don't top post on kernel mailing lists folks, it's
> annoying)

Sorry 'bout that, chief...

>
> On Wed, 2015-08-19 at 12:16 -0400, Alex Gorbachev wrote:
> > I have to say that changing default_cmdsn_depth did not help us with
> > the abnormal timeouts, e.g. an OSD failing or some other abrupt event.
> > When that happens we detect the event via ABORT_TASK, and if the event
> > is transient usually nothing happens.  Anything more than a few
> > seconds will usually result in Ceph recovery but ESXi gets stuck and
> > never comes out of APD.  Looks like it tries to establish another
> > session by bombarding the target with retries and resets, and
> > ultimately gives up and goes to PDL state.  Then the only option is
> > reboot.
> >
> > So to be clear, we have moved on from a discussion about slow storage
> > to a discussion about what happens during unexpected and abnormal
> > timeouts.  Anecdotal evidence suggests that SCST-based systems will
> > allow ESXi to recover from this condition, while ESXi does not play as
> > well with LIO-based systems in those situations.
> >
> > What is the difference, and is there willingness to allow LIO to be
> > modified to work with this ESXi behavior?  Or should we ask VMware to
> > do something for ESXi to play better with LIO?  I cannot fix the code,
> > but would be happy to be the voice of the issue via any available
> > channels.

I believe this is the same issue that has come up a few times previously (all 
of which describe the same problem, correct?):

http://www.spinics.net/lists/target-devel/msg09266.html
http://www.spinics.net/lists/ceph-users/msg15547.html
http://www.spinics.net/lists/target-devel/msg05444.html

I should mention that we are currently running kernel 3.14 on these systems.  
I bring this up because I've read your comment about the "Fix ABORT_TASK 
response + session reset hang" patch 
(http://www.spinics.net/lists/target-devel/msg05444.html) and wonder if it is 
related to what's occurring for us.  Have you had any feedback about that 
patch?


>
>
> Based on these and earlier comments, I think there is still some
> misconception about misbehaving backend devices, and what needs to
> happen in order for LIO to make forward progress during iscsi session
> reinstatement.
>
> Allowing a new session login to proceed and submit new WRITEs while the
> failed session's I/Os cannot be completed with exception status by the
> backend driver is bad.  Unless the previous I/Os can (eventually) be
> completed or aborted within target-core before new backend driver I/O
> submission happens, there is no guarantee the stale WRITEs won't be
> completed after subsequent new WRITEs from a different session with a new
> command sequence number.
>
> That means there is potential for new writes to be lost, which is the
> reason why 'violating the spec' in this context is not allowed.
>

Understood.  I can't figure out why the issue reportedly doesn't affect SCST, 
or how they're handling it differently, but I can certainly understand your 
reluctance (or outright refusal) to allow anything that could result in a 
write that was reportedly ABORTED actually landing on the disk.  There's 
something wiggling in my memory about another topic a while back, about a 
difference between SCST and LIO in how they track commands, but I can't put 
my finger on it exactly... I'm wondering if it's related to this at all.

> If a backend driver is not able to complete I/O before ESX's timeout for
> giving up on outstanding I/Os is reached, then the backend driver needs to:
>
> * Have a lower internal I/O timeout to complete back to
>   target-core with exception status before ESX gives up on iscsi session
>   login attempts, and associated session I/O.
>
> Also, SCSI LLDs and raw block drivers work very differently with respect
> to I/O timeout and reset.
>
> For underlying SCSI LLDs, scsi_eh will attempt to reset the device in
> order to complete failed I/O.  Setting the scsi_eh timeout lower than the
> ESX iSCSI login timeout, so that failed I/O is completed back before ESX
> gives up, is one simple option to consider.

This makes a lot of sense... I'm going to investigate this area heavily.  I 
think that scsi_eh (which I'll admit I was unaware of previously) is playing 
a large role here.  We're seeing evidence in the logs of the LSI RAID card 
resetting individual physical devices (Seagate hard disks), and sometimes of 
the entire card rebooting (although we haven't seen the card reboot for some 
time now, quite possibly since a firmware update).  Previously, I believed 
these resets were actions taken by the LSI firmware on the card, but I now 
understand they are likely the result of scsi_eh sending requests to do so:

MR_MONITOR[1355]: <MRMON268> Controller ID:  0  PD Reset:   PD
MR_MONITOR[1355]: <MRMON267> Controller ID:  0  Command timeout on PD:   PD
MR_MONITOR[1355]: <MRMON113> Controller ID:  0   Unexpected sense:   PD
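
To confirm that, I'm going to try correlating those MR_MONITOR events with 
scsi_eh activity in the kernel log; something along these lines (the grep 
patterns are just my first guess for a megaraid_sas setup):

    # look for scsi_eh / megaraid_sas abort and reset activity around the
    # times of the MR_MONITOR events above
    dmesg | egrep -i 'megasas|megaraid|abort|reset' | less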

I've searched a fair bit, including reading the scsi_eh kernel documentation, 
and cannot find any way to modify the scsi_eh timeout value... Is this 
something that is configurable from userspace, or is it a hard-coded 
compile-time value somewhere?  Seems like it SHOULD be a tunable somewhere, 
but I can't put my finger on it.
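
The closest thing I've found so far is the per-device SCSI command timer in 
sysfs, which, as I understand it, is what kicks off scsi_eh escalation when a 
command expires.  I'm not certain it bounds the whole error-handling window, 
so take this as a guess on my part (device names are just examples from our 
setup):

    # per-command timeout in seconds; scsi_eh takes over when this expires
    cat /sys/block/sdb/device/timeout

    # shorten it so commands fail back to LIO before ESXi loses patience
    echo 10 > /sys/block/sdb/device/timeout

    # to make it persistent, something like this in
    # /etc/udev/rules.d/99-scsi-timeout.rules (type 0 = disk):
    ACTION=="add", SUBSYSTEM=="scsi", ATTR{type}=="0", ATTR{timeout}="10"

If anyone knows whether that timer also bounds the abort / device reset / bus 
reset / host reset escalation itself, I'd appreciate a pointer.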

I'm also going to be reviewing any timeouts below the LIO layer on our 
systems, which, as far as I can think of, are DRBD, the LSI MegaRAID driver, 
and any MegaRAID firmware settings.  It seems logical to me that the lower 
the level of the code, the shorter its timeout should be; since the top level 
in this case is ESXi, with a hardcoded value of 5 seconds, everything needs 
to decrease from there.
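
On the DRBD side, the knobs I'm planning to look at first are the network 
timeout and disk-timeout settings.  The values below are purely illustrative, 
not tested recommendations (and I've seen warnings that disk-timeout in 
particular can be risky), just to show where they would go:

    # drbd.conf excerpt; DRBD timeouts are in tenths of a second
    resource r0 {
        net {
            timeout      20;    # 2s: give up on an unresponsive peer sooner
        }
        disk {
            disk-timeout 30;    # 3s: complete I/O with an error if the
                                # backing device stops answering
        }
    }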

>
> However, if your LLD or its firmware doesn't *ever* complete I/O back to
> scsi-core even after a reset occurs, resulting in LIO blocking indefinitely
> on session reinstatement, then it's an LLD-specific bug and really should
> be fixed.

I don't know if this is true of the LSI MegaRAID driver/firmware, but I agree 
that if it is, it should be fixed.  I may attempt to prove this one way or 
another.
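
The rough test I have in mind (just a sketch, not yet validated on our 
hardware) is to keep direct I/O in flight against a scratch volume, force a 
reset underneath it, and see whether the outstanding commands complete back 
with an error or hang forever:

    # keep synchronous direct writes in flight (destructive; scratch LUN only)
    dd if=/dev/zero of=/dev/sdb bs=1M oflag=direct &

    # force resets underneath it (sg_reset ships with sg3_utils)
    sg_reset --device /dev/sdb    # LUN reset
    sg_reset --host /dev/sdb      # whole-HBA reset

    # the dd should either continue or die with an I/O error fairly quickly;
    # if it hangs in D state indefinitely, that points at the LLD/firmware

If the dd never comes back even after a host reset, that would seem to match 
the "LLD never completes I/O back to scsi-core" case you describe.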

>
> --nab

Thanks again,

...Steve...
