Re: ESXi + LIO + Ceph RBD problem

On Thu, 2015-08-20 at 21:44 +0000, Steve Beaudry wrote:
> > On Wed, 2015-08-19 at 12:16 -0400, Alex Gorbachev wrote:
> > > I have to say that changing default_cmdsn_depth did not help us with
> > > the abnormal timeouts, i.e. OSD failing or some other abrupt event.
> > > When that happens we detect the event via ABORT_TASK and if the event
> > > is transient usually nothing happens.  Anything more than a few
> > > seconds will usually result in Ceph recovery but ESXi gets stuck and
> > > never comes out of APD.  Looks like it tries to establish another
> > > session by bombarding the target with retries and resets, and
> > > ultimately gives up and goes to PDL state.  Then the only option is
> > > reboot.
> > >
> > > So to be clear, we have moved on from a discussion about slow storage
> > > to a discussion about what happens during unexpected and abnormal
> > > timeouts.  Anecdotal evidence suggests that SCST-based systems will
> > > allow ESXi to recover from this condition, while ESXi does not play as
> > > well with LIO-based systems in those situations.
> > >
> > > What is the difference, and is there willingness to allow LIO to be
> > > modified to work with this ESXi behavior?  Or should we ask VMware to
> > > do something for ESXi to play better with LIO?  I cannot fix the code,
> > > but would be happy to be the voice of the issue via any available
> > > channels.
> 
> I believe this is the same issue that's come up a few times previously (all of
> which were fixed, correct?):

No, it's not.

> 
> http://www.spinics.net/lists/target-devel/msg09266.html

Read through the entire thread:

http://www.spinics.net/lists/target-devel/msg09268.html

> http://www.spinics.net/lists/ceph-users/msg15547.html

Yes, this is an RCU-related issue in the ceph rbd client code, which I
assume has been fixed by now?

> http://www.spinics.net/lists/target-devel/msg05444.html
> 
> I should mention that we are running kernel 3.14 on these systems currently... 
> I mention this, as I've read your comment about the "Fix ABORT_TASK response + 
> session reset hang" http://www.spinics.net/lists/target-devel/msg05444.html 
> and wonder if it is related to what's occurring for us.  Have you had any 
> feedback about that patch?
> 

Yes, this patch and another related one were included in >= v3.14.10
stable code, and have been back-ported to earlier stable versions.

Make sure you've got a recent enough v3.14.y kernel.
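
E.g., a quick sanity check (assuming a stock stable kernel; distro
kernels may carry the fixes under a different version string):

  # the ABORT_TASK + session reset fixes landed in v3.14.10 stable,
  # so this needs to report at least that:
  uname -r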

> >
> >
> > Based on these and earlier comments, I think there is still some
> > misconception about misbehaving backend devices, and what needs to
> > happen in order for LIO to make forward progress during iscsi session
> > reinstatement.
> >
> > Allowing a new session login to proceed and submit new WRITEs while the
> > failed session's I/Os still can't be completed with exception status by
> > the backend driver is bad.  Unless those previous I/Os are (eventually)
> > completed or aborted within target-core before new backend driver I/O
> > submission happens, there is no guarantee the stale WRITEs won't be
> > completed after subsequent new WRITEs from a different session with a
> > new command sequence number.
> >
> > Which means there is potential for new writes to be lost, which is the
> > reason why 'violating the spec' in this context is not allowed.
> >
> 
> Understood.  I can't figure out why the issue reportedly doesn't affect SCST
> or how they're handling it differently, but I can certainly understand your
> reluctance (or outright refusal) to allow anything that could result in a
> write that was reportedly ABORTED actually landing on the disk.  There's
> something wiggling in my memory about another topic a while back, about a
> difference between SCST and LIO and how they handled the tracking of commands,
> but I can't put my finger on it exactly... I'm wondering if it's related to
> this at all.

You are mixing up list threads without enough technical context.

Different backends behave very differently during timeouts.  If your backend
can't complete I/Os back to the target in a timely fashion, then you
need to figure out why that is happening.

Trying to hack LIO to do what $SOME_TARGET does is not going to help you
here.
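
E.g., as a first step (an untested sketch; sdb is a placeholder for
your backend device), watch backend latency and the kernel log while
reproducing one of these events:

  # per-device latency and queue depth, refreshed every second:
  iostat -x sdb 1

  # and watch for target-core aborts, resets and hung tasks
  # (dmesg -w needs a reasonably recent util-linux):
  dmesg -w | grep -iE 'abort_task|reset|hung task'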

> 
> > If a backend driver is not able to complete I/O before ESX's timeout for
> > giving up on outstanding I/Os is reached, then the backend driver needs to:
> >
> > * Have a lower internal I/O timeout, so it completes back to
> >   target-core with exception status before ESX gives up on iscsi session
> >   login attempts, and associated session I/O.
> >
> > Also, SCSI LLDs and raw block drivers work very differently wrt I/O
> > timeout and reset.
> >
> > For underlying SCSI LLDs, scsi_eh will attempt to reset the device to
> > complete failed I/O.  Setting the scsi_eh timeout lower than ESX's iscsi
> > login timeout, so scsi_eh gives up and fails I/O first, is one simple
> > option to consider.
> 
> This makes a lot of sense... I'm going to investigate this area heavily.  I 
> think that scsi_eh (which I'll admit to being unaware of previously) is 
> playing a large role here... We're seeing evidence in the log of the LSI RAID 
> card resetting both the individual physical devices (Seagate hard disks), and 
> sometimes the entire card rebooting (although we haven't seen the card reboot 
> for some time now, quite possibly since a firmware update).  Previously, I had 
> believed that these resets were being initiated by the LSI firmware on the
> card, but I now understand that they are likely the result of scsi_eh
> sending requests to do so.
> 
> MR_MONITOR[1355]: <MRMON268> Controller ID:  0  PD Reset:   PD
> MR_MONITOR[1355]: <MRMON267> Controller ID:  0  Command timeout on PD:   PD
> MR_MONITOR[1355]: <MRMON113> Controller ID:  0   Unexpected sense:   PD
> 
> I've searched a fair bit, including reading the scsi_eh kernel documentation, 
> and cannot find any way to modify the scsi_eh timeout value... Is this 
> something that is configurable from userspace, or is it a hard-coded compile-
> time value somewhere?  Seems like it SHOULD be a tunable somewhere, but I 
> can't put my finger on it.
> 
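
The per-command timer that kicks off scsi_eh is exposed per device in
sysfs, in seconds (default 30).  A minimal sketch, assuming your RAID
LUN shows up as /dev/sdb:

  # current SCSI command timeout for the device, in seconds:
  cat /sys/block/sdb/device/timeout

  # lower it so the command timer fires, and failed I/O gets completed
  # back to LIO, well before ESX gives up; note that scsi_eh's reset
  # escalation adds time on top of this, so test what your megaraid
  # firmware can actually live with:
  echo 5 > /sys/block/sdb/device/timeout

Set it via a udev rule if you want it to survive reboots and hotplug.
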
> I'm also going to be reviewing any timeouts below the LIO layer on our 
> systems, which, as far as I can think of, are DRBD, LSI MegaRAID driver, and 
> any MegaRAID firmware settings.  It seems logical to me that the lower in the
> stack the code sits, the shorter its timeout should be, and since the top
> level in this case is ESXi, with a hardcoded value of 5 seconds, everything
> needs to decrease from there.
> 
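
On the DRBD side: if memory serves, 8.4 grew a disk-timeout option that
forces I/O to be completed with an error when the backing device stops
answering.  A hypothetical drbd.conf sketch (values are in tenths of a
second; verify against drbd.conf(5) for your version):

  resource r0 {
    disk {
      # complete I/O with an error if the backing device doesn't
      # answer within 4 seconds; the default of 0 disables this:
      disk-timeout 40;
    }
    net {
      # peer response timeout, also in tenths of a second:
      timeout 50;
    }
  }
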
> >
> > However, if your LLD or LLD's firmware doesn't *ever* complete I/O back to
> > scsi-core even after a reset occurs, resulting in LIO blocking indefinitely
> > on session reinstatement, then it's an LLD-specific bug and really should
> > be fixed.
> 
> I don't know if this is true with the LSI MegaRAID driver/firmware, but I 
> agree that if it's the case, it should be fixed.  I may attempt to prove this 
> one way or another..

Actually, the megaraid LLD does have problems.

There have been issues with megaraid_sas handling device failure, and
unless you're running recent upstream code, or LSI's LLD driver from their
website on top of vanilla v3.14, you'll almost certainly run into I/O
completions never happening when the firmware is in a bad state.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/scsi/megaraid?id=b09e66da3f5d9c47336dfe63f1e76696931fbdb0

The full list of megaraid patches is here:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/drivers/scsi/megaraid

compared to the last commit in v3.14.y, which is 2 1/2 years old:

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/log/drivers/scsi/megaraid?h=linux-3.14.y

Which means you'll need to find a working megaraid LLD to even have a
chance of handling HDD failures correctly on vanilla v3.14 code.
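
E.g., to see what you're actually running before hunting for a newer one:

  # version of the megaraid_sas LLD known to modinfo:
  modinfo megaraid_sas | grep -i '^version'

  # and what the loaded driver logged when it bound the controller:
  dmesg | grep -i megasas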

--nab
