Re: Unable to recover from DataOut timeout while in ERL=0

Nick Couchman <nick.e.couchman@xxxxxxxxx> · Wed, 13 Jul 2022 21:47:40 -0400

On Wed, Jul 13, 2022 at 4:40 PM Dmitry Bogdanov <d.bogdanov@xxxxxxxxx> wrote:
>
> Hi Nick,
>
> On Wed, Jul 13, 2022 at 03:04:12PM -0400, Nick Couchman wrote:
> >
> > (Apologies if this ends up as a double-post, re-sending in Plain Text Mode)
> >
> > Hello, everyone,
> > Hopefully this is the correct place to ask a general
> > usage/troubleshooting question regarding the Linux Target iSCSI
> > system.
> >
> > I'm using the Linux iSCSI target on a pair of CentOS 8 Stream VMs that
> > are configured with DRBD to synchronize data between two ESXi hosts,
> > and then present that disk back to the ESXi hosts via iSCSI. Basically
> > I'm attempting to achieve a vSAN-like configuration, where I have
> > "shared storage" backed by the underlying physical storage of the
> > individual hosts.
> >
> > It's worth noting that, at present, I'm not using an Active/Active
> > configuration (DRBD dual-primary), but each of the VMs has the DRBD
> > configuration and iSCSI configuration, and I can fail the primary and
> > iSCSI service back and forth between the nodes.
> >
> > I'm running into a situation where, once I get the system under
> > moderate I/O load (installing Linux in another VM, for example), I
> > start seeing the following errors in dmesg and/or journalctl on the
> > active node:
> >
> > Unable to recover from DataOut timeout while in ERL=0, closing iSCSI
> > connection for I_T Nexus
> > iqn.1998-01.com.vmware:esx01-18f91cf9,i,0x00023d000001,iqn.1902-01.com.example.site:drbd1,t,0x01
> >
> > This gets repeated a couple of dozen or so times, and then I/O to the
> > iSCSI LUN from the ESXi host halts, the path to the LUN shows as
> > "Dead", and I have to reboot the active node and fail over to the
> > other node, at which point VMware picks back up and continues.
> >
> > I've searched around the web to try to find assistance with this
> > error, but it doesn't seem all that common - in one case it appears to
> > be a bug from several years ago that was patched, and beyond that not
> > much relevant has turned up. Based on the error message, it almost
> > seems as if the target system is trying to say that it couldn't write
> > its data out to the disk in a timely fashion (which might be because
> > DRBD can't sync as quickly as is expected?), but it isn't all that
> > clear from the error.
> We have been encountering the same issue with ESXi. For some reason it
> sometimes does not send the I/O data (the iSCSI DataOut PDUs) for a SCSI
> WRITE command it has already sent. Instead, it sends an ABORT for that
> command. Linux Target Core does not abort a SCSI command while that
> command's data has not yet been fully collected. The iSCSI DataOut timer
> then expires and triggers connection reinstatement.
> But during that reinstatement iSCSI hangs waiting for the aborted WRITE
> command to complete. The unfinished logout prevents a new login from
> the same initiator.
> Only a target reboot clears that condition.

Is this a bug that needs to be raised with VMware? Or is patching the
Linux Target driver really the way to go? I'm happy to put in a case
with VMware if that's desirable.
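
If a case does get opened, it may help to pull the initiator and session
details out of the log lines. The I_T Nexus string in the error quoted
above follows the iSCSI port-name convention (initiator name, ",i,", ISID,
then target name, ",t,", portal group tag). A minimal Python sketch for
splitting it apart; the function name and the returned keys are my own
choices, not anything the target provides:

```python
import re

# Matches "<initiator>,i,<ISID>,<target>,t,<TPGT>" as printed in the log line.
NEXUS_RE = re.compile(
    r"(?P<initiator>[^,\s]+),i,(?P<isid>0x[0-9a-fA-F]+),"
    r"(?P<target>[^,\s]+),t,(?P<tpgt>0x[0-9a-fA-F]+)"
)


def parse_nexus(line):
    """Return the I_T nexus fields found in a log line, or None if absent."""
    m = NEXUS_RE.search(line)
    if m is None:
        return None
    return {
        "initiator": m.group("initiator"),
        "isid": m.group("isid"),           # kept as the hex string from the log
        "target": m.group("target"),
        "tpgt": int(m.group("tpgt"), 16),  # portal group tag as an integer
    }


# The message from this thread, joined back onto one line.
LINE = (
    "Unable to recover from DataOut timeout while in ERL=0, closing iSCSI "
    "connection for I_T Nexus "
    "iqn.1998-01.com.vmware:esx01-18f91cf9,i,0x00023d000001,"
    "iqn.1902-01.com.example.site:drbd1,t,0x01"
)

if __name__ == "__main__":
    print(parse_nexus(LINE))
```

Feeding it the journalctl output line by line and counting results per
initiator would show whether one ESXi host or both are hitting the timeout.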

>
> >
> > I'm wondering if anyone can provide tips as to how to best mitigate
> > this - any tuning that can be done to change the time out, or throttle
> > the iSCSI traffic, or is it indicative of a lack of available RAM for
> > buffering (I'm not seeing a lot of RAM pressure, but it's possible I'm just
> > not catching it)?
> >
> I can just send you a patch for the target that fixes the hang. With it,
> ESXi will reconnect to the target and continue working with it without a
> reboot.
>

I got the patch - I had to tweak it a bit for the CentOS Stream 8
kernel I'm running against, but I've added it to the RPM and am
rebuilding the packages now. Hopefully I will get it tested in the next
couple of days.
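
On the tuning question quoted above: as far as I can tell, the Linux
target exposes the DataOut timer per initiator ACL in configfs, as
dataout_timeout and dataout_timeout_retries under the ACL's attrib
directory. The Python sketch below only reads those values. It assumes an
explicit ACL exists for the initiator (dynamically generated ACLs may not
appear in configfs) and the usual configfs mount point; the helper names
are hypothetical.

```python
from pathlib import Path

# Usual configfs location of the iSCSI fabric; an assumption about the mount point.
CONFIGFS_ISCSI = Path("/sys/kernel/config/target/iscsi")


def acl_attrib_dir(target_iqn, initiator_iqn, tpgt=1, root=CONFIGFS_ISCSI):
    """Build the path to the attrib directory of one initiator ACL."""
    return Path(root) / target_iqn / f"tpgt_{tpgt}" / "acls" / initiator_iqn / "attrib"


def read_attribs(attrib_dir, names=("dataout_timeout", "dataout_timeout_retries")):
    """Read the named attribute files; attributes that are missing are skipped."""
    values = {}
    for name in names:
        path = Path(attrib_dir) / name
        if path.is_file():
            values[name] = path.read_text().strip()
    return values


if __name__ == "__main__":
    # IQNs taken from the log line earlier in this thread.
    d = acl_attrib_dir(
        "iqn.1902-01.com.example.site:drbd1",
        "iqn.1998-01.com.vmware:esx01-18f91cf9",
    )
    print(d)
    print(read_attribs(d))
```

Writing a new value to one of those files as root should change the
setting, but a longer timeout would only delay the reinstatement; it would
not address the hang during reinstatement described above.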

Thanks - Nick