Re: Unable to recover from DataOut timeout while in ERL=0

Dmitry Bogdanov <d.bogdanov@xxxxxxxxx> · Wed, 13 Jul 2022 23:40:05 +0300

Hi Nick,

On Wed, Jul 13, 2022 at 03:04:12PM -0400, Nick Couchman wrote:
> 
> (Apologies if this ends up as a double-post, re-sending in Plain Text Mode)
> 
> Hello, everyone,
> Hopefully this is the correct place to ask a general
> usage/troubleshooting question regarding the Linux Target iSCSI
> system.
> 
> I'm using the Linux iSCSI target on a pair of CentOS 8 Stream VMs that
> are configured with DRBD to synchronize data between two ESXi hosts,
> and then present that disk back to the ESXi hosts via iSCSI. Basically
> I'm attempting to achieve a vSAN-like configuration, where I have
> "shared storage" backed by the underlying physical storage of the
> individual hosts.
> 
> It's worth noting that, at present, I'm not using an Active/Active
> configuration (DRBD dual-primary), but each of the VMs has the DRBD
> configuration and iSCSI configuration, and I can fail the primary and
> iSCSI service back and forth between the nodes.
> 
> I'm running into a situation where, once I get the system under
> moderate I/O load (installing Linux in another VM, for example), I
> start seeing the following errors in dmesg and/or journalctl on the
> active node:
> 
> Unable to recover from DataOut timeout while in ERL=0, closing iSCSI
> connection for I_T Nexus
> iqn.1998-01.com.vmware:esx01-18f91cf9,i,0x00023d000001,iqn.1902-01.com.example.site:drbd1,t,0x01
> 
> This gets repeated a couple of dozen or so times, and then I/O to the
> iSCSI LUN from the ESXi host halts, the path to the LUN shows as
> "Dead", and I have to reboot the active node and fail over to the
> other node, at which point VMware picks back up and continues.
> 
> I've searched around the web to try to find assistance with this
> error, but it doesn't seem all that common - in one case it appears to
> be a bug from several years ago that was patched, and beyond that not
> much relevant has turned up. Based on the error message, it almost
> seems as if the target system is trying to say that it couldn't write
> its data out to the disk in a timely fashion (which might be because
> DRBD can't sync as quickly as is expected?), but it isn't all that
> clear from the error.
We have been encountering the same issue with ESXi. For some reason it
sometimes does not send the data (the iSCSI DataOut PDUs) for a SCSI
WRITE command it has already sent. Instead, it sends an ABORT for that
command. The Linux target core does not abort a SCSI command until it
has collected all of the data for it. The iSCSI DataOut timer then
times out and triggers connection reinstatement.
But during that reinstatement the iSCSI target hangs, waiting for the
aborted WRITE command to complete. The unfinished logout prevents a new
login from the same initiator.
The only way to clear that condition is to reboot the target.
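
To make the sequence above easier to follow, here is a toy model of it
in Python. It is only a sketch of the ordering of events as described
in this mail, not the kernel code: the class and function names
(WriteCommand, handle_abort, on_dataout_timeout, reinstate_connection)
and the returned strings are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class WriteCommand:
    """A SCSI WRITE whose data arrives later in DataOut PDUs (toy model)."""
    expected_bytes: int
    received_bytes: int = 0
    abort_requested: bool = False
    completed: bool = False

    def data_complete(self) -> bool:
        return self.received_bytes >= self.expected_bytes


def handle_abort(cmd: WriteCommand) -> bool:
    """Record the ABORT; it only takes effect once all data is collected."""
    cmd.abort_requested = True
    if cmd.data_complete():
        cmd.completed = True
    return cmd.completed


def on_dataout_timeout(cmd: WriteCommand, erl: int = 0) -> str:
    """Decide what the DataOut timer does when it fires."""
    if cmd.data_complete():
        return "nothing-to-do"
    if erl == 0:
        # Matches the logged message: no recovery at ERL=0.
        return "reinstate-connection"
    # Higher error recovery levels are not modeled here.
    return "not-modeled"


def reinstate_connection(pending: list) -> str:
    """Reinstatement waits for every outstanding command to complete."""
    stuck = [c for c in pending if not c.completed]
    return "hung" if stuck else "logged-out"


# ESXi sends WRITE, then ABORT, but never the DataOut PDUs.
cmd = WriteCommand(expected_bytes=4096)
aborted = handle_abort(cmd)
action = on_dataout_timeout(cmd, erl=0)
result = reinstate_connection([cmd])
print(aborted, action, result)
```

In this model the WRITE never completes because its data never
arrives, so the reinstatement never finishes and the old session
blocks a new login.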

> 
> I'm wondering if anyone can provide tips as to how to best mitigate
> this - any tuning that can be done to change the time out, or throttle
> the iSCSI traffic, or is it indicative of a lack of available RAM for
> buffering (I'm not seeing a lot of RAM pressure, but possibly I'm just
> not catching it)?
> 
I can just send you a patch for the target that fixes the hang. With
it, ESXi will reconnect to the target and continue working with it
without a reboot.

> Environment:
> * CentOS 8 Stream
> * Kernel: 4.18.0-394.el8.x86_64
> * DRBD: 9.1.7
> * 2 CPU, 4GB of RAM per VM
> * Shared block device is 1 TB
> 
> Thanks - Nick