On Thu, 2015-08-20 at 21:44 +0000, Steve Beaudry wrote:
> > On Wed, 2015-08-19 at 12:16 -0400, Alex Gorbachev wrote:
> > > I have to say that changing default_cmdsn_depth did not help us with
> > > the abnormal timeouts, i.e. OSD failing or some other abrupt event.
> > > When that happens we detect the event via ABORT_TASK and if the event
> > > is transient usually nothing happens.  Anything more than a few
> > > seconds will usually result in Ceph recovery but ESXi gets stuck and
> > > never comes out of APD.  Looks like it tries to establish another
> > > session by bombarding the target with retries and resets, and
> > > ultimately gives up and goes to PDL state.  Then the only option is
> > > reboot.
> > >
> > > So to be clear, we have moved on from a discussion about slow storage
> > > to a discussion about what happens during unexpected and abnormal
> > > timeouts.  Anecdotal evidence suggests that SCST based systems will
> > > allow ESXi to recover from this condition, while ESXi does not play
> > > as well with LIO based systems in those situations.
> > >
> > > What is the difference, and is there willingness to allow LIO to be
> > > modified to work with this ESXi behavior?  Or should we ask VMware to
> > > do something for ESXi to play better with LIO?  I cannot fix the code,
> > > but would be happy to be the voice of the issue via any available
> > > channels.
>
> I believe this is the same issue that's come up a few times previously (All of
> which

correct?  No, it's not.

>   http://www.spinics.net/lists/target-devel/msg09266.html

Read through the entire thread:

http://www.spinics.net/lists/target-devel/msg09268.html

>   http://www.spinics.net/lists/ceph-users/msg15547.html

Yes, this is an RCU related issue with the ceph rbd client code, which I
assume has been fixed..?

>   http://www.spinics.net/lists/target-devel/msg05444.html
>
> I should mention that we are running kernel 3.14 on these systems currently...
> I mention this, as I've read your comment about the "Fix ABORT_TASK response +
> session reset hang" http://www.spinics.net/lists/target-devel/msg05444.html
> and wonder if it is related to what's occurring for us.  Have you had any
> feedback about that patch?
>

Yes, this patch and another related one were included in >= v3.14.10
stable code, and have been back-ported to earlier stable versions.

Make sure you've got a recent enough v3.14.y kernel.
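If you'd rather script that check than eyeball uname -r, a rough sketch
(illustrative only; the 3.14.10 floor is just the stable release mentioned
above, everything else is an assumption) would be:

#!/usr/bin/env python
# Illustrative only: check that the running kernel is at least v3.14.10,
# the first v3.14.y stable release said above to carry the ABORT_TASK /
# session reset fixes.
import platform

def kernel_at_least(required=(3, 14, 10)):
    # platform.release() returns e.g. "3.14.51-custom"; keep the numeric part.
    numeric = platform.release().split("-")[0]
    parts = tuple(int(p) for p in numeric.split(".")[:3])
    return parts >= required

if __name__ == "__main__":
    status = "looks recent enough" if kernel_at_least() else "older than v3.14.10"
    print(platform.release(), "-", status)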
> >
> > Based on these and earlier comments, I think there is still some
> > misconception about misbehaving backend devices, and what needs to
> > happen in order for LIO to make forward progress during iscsi session
> > reinstatement.
> >
> > Allowing a new session login to proceed and submit new WRITEs when the
> > failed session can't get I/O completion with exception status to happen
> > from a backend driver is bad.  Because, unless previous I/Os are able to
> > be (eventually) completed or aborted within target-core before new
> > backend driver I/O submission happens, there is no guarantee the stale
> > WRITEs won't be completed after subsequent new WRITEs from a different
> > session with a new command sequence number.
> >
> > Which means there is potential for new writes to be lost, and is the
> > reason why 'violating the spec' in this context is not allowed.
> >
>
> Understood.  I can't figure out why the issue doesn't reportedly affect SCST,
> or how they're handling it differently, but I can certainly understand your
> reluctance (or outright refusal) to allow anything that could result in a
> write that was reportedly ABORTED from actually landing on the disk.  There's
> something wiggling in my memory about another topic a while back, about a
> difference between SCST and LIO and how they handled the tracking of commands,
> but I can't put my finger on it exactly... I'm wondering if it's related to
> this at all.

You are mixing up list threads without enough technical context.

Different backends act very differently during timeouts.  If your backend
can't complete I/Os back to the target in a timely fashion, then you need
to figure out why that is happening.

Trying to hack LIO to do what $SOME_TARGET does is not going to help you
here.

> >
> > If a backend driver is not able to complete I/O before the ESX timeout
> > for giving up on outstanding I/Os is reached, then the backend driver
> > needs to:
> >
> >   * Have a lower internal I/O timeout to complete back to
> >     target-core with exception status before ESX gives up on iscsi
> >     session login attempts, and associated session I/O.
> >
> > Also, SCSI LLDs and raw block drivers work very differently wrt I/O
> > timeout and reset.
> >
> > For underlying SCSI LLDs, scsi_eh will attempt to reset the device to
> > complete failed I/O.  Setting the scsi_eh timeout lower than ESX's iscsi
> > login timeout to give up and fail I/O is one simple option to consider.
>
> This makes a lot of sense... I'm going to investigate this area heavily.  I
> think that scsi_eh (which I'll admit to being unaware of previously) is
> playing a large role here...  We're seeing evidence in the log of the LSI RAID
> card resetting both the individual physical devices (Seagate hard disks), and
> sometimes the entire card rebooting (although we haven't seen the card reboot
> for some time now, quite possibly since a firmware update).  Previously, I had
> believed that this was action (resets) being taken by the LSI firmware on the
> card, but I'm now understanding that it is likely the result of scsi_eh
> sending requests to do so.
>
>   MR_MONITOR[1355]: <MRMON268> Controller ID: 0  PD Reset: PD
>   MR_MONITOR[1355]: <MRMON267> Controller ID: 0  Command timeout on PD: PD
>   MR_MONITOR[1355]: <MRMON113> Controller ID: 0  Unexpected sense: PD
>
> I've searched a fair bit, including reading the scsi_eh kernel documentation,
> and cannot find any way to modify the scsi_eh timeout value...  Is this
> something that is configurable from userspace, or is it a hard-coded compile
> time value somewhere?  Seems like it SHOULD be a tunable somewhere, but I
> can't put my finger on it.
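For what it's worth, the per-device command timeout that decides when a
command is handed to scsi_eh is a userspace tunable, not a compile-time
constant: it is exposed per SCSI device in sysfs.  A minimal sketch of
inspecting (and optionally lowering) it; the sd* glob and the 20-second
example value are my own assumptions, not anything from this thread:

#!/usr/bin/env python
# Hedged sketch: read the per-device SCSI command timeout (seconds) from
# sysfs; a command that exceeds it is failed and handed to scsi_eh.
import glob

NEW_TIMEOUT = "20"  # example value only; pick something below the
                    # initiator's give-up time

for path in glob.glob("/sys/block/sd*/device/timeout"):
    with open(path) as f:
        print(path, "current:", f.read().strip())
    # Uncomment to actually lower the timeout (needs root):
    # with open(path, "w") as f:
    #     f.write(NEW_TIMEOUT)

# Newer kernels also expose an 'eh_timeout' attribute alongside 'timeout'
# (used for the commands scsi_eh itself issues during recovery); if present,
# it can be inspected the same way:
for path in glob.glob("/sys/block/sd*/device/eh_timeout"):
    with open(path) as f:
        print(path, "eh_timeout:", f.read().strip())

The general idea stands either way: each layer below the initiator should
time out and fail I/O sooner than the layer above it gives up.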
> I'm also going to be reviewing any timeouts below the LIO layer on our
> systems, which, as far as I can think of, are DRBD, the LSI MegaRAID driver,
> and any MegaRAID firmware settings..  It seems logical to me that the lower
> level the code, the shorter the timeout should be, and being as the top level
> in this case is ESXi, with a hardcoded value of 5 seconds, everything needs
> to decrease from there.
>
> >
> > However, if your LLD or LLD's firmware doesn't *ever* complete I/O back to
> > scsi-core even after a reset occurs, resulting in LIO blocking indefinitely
> > on session reinstatement, then it's an LLD specific bug and really should
> > be fixed.
>
> I don't know if this is true with the LSI MegaRAID driver/firmware, but I
> agree that if it's the case, it should be fixed.  I may attempt to prove this
> one way or another..

Actually, the megaraid LLD does have problems.

There have been problems with megaraid_sas handling of device failures,
and unless you have recent upstream code, or LSI's LLD from their website
on top of vanilla v3.14, you'll most certainly run into I/O completions
that never happen once the firmware is in a bad state.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/scsi/megaraid?id=b09e66da3f5d9c47336dfe63f1e76696931fbdb0

The full list of megaraid patches is here:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/drivers/scsi/megaraid

compared to the last commit in v3.14.y, which is 2 1/2 years old:

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/log/drivers/scsi/megaraid?h=linux-3.14.y

Which means you'll need to find a working megaraid LLD to even have a
chance of handling HDD failures correctly on vanilla v3.14 code.
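As a quick sanity check before chasing anything else, a rough sketch
(assumes megaraid_sas is the LLD in use and that it exports a module
version attribute) for confirming which driver version the running kernel
actually loaded, so it can be compared against the upstream log above:

#!/usr/bin/env python
# Rough sketch: print the loaded megaraid_sas driver version and the
# running kernel, for comparison against the upstream megaraid log.
import os

version_path = "/sys/module/megaraid_sas/version"

if os.path.exists(version_path):
    with open(version_path) as f:
        print("megaraid_sas driver version:", f.read().strip())
else:
    print("megaraid_sas not loaded, or no version attribute exported")

with open("/proc/version") as f:
    print("running kernel:", f.read().strip())

--nab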