Unable to recover from DataOut timeout while in ERL=0

Nick Couchman <nick.e.couchman@xxxxxxxxx> · Wed, 13 Jul 2022 15:04:12 -0400

(Apologies if this ends up as a double-post; I'm re-sending in plain-text mode.)

Hello, everyone,
Hopefully this is the right place to ask a general
usage/troubleshooting question about the Linux iSCSI target.

I'm using the Linux iSCSI target on a pair of CentOS 8 Stream VMs that
are configured with DRBD to synchronize data between two ESXi hosts,
and then present that disk back to the ESXi hosts via iSCSI. Basically
I'm attempting to achieve a vSAN-like configuration, where I have
"shared storage" backed by the underlying physical storage of the
individual hosts.

It's worth noting that, at present, I'm not using an Active/Active
configuration (DRBD dual-primary), but each of the VMs has the DRBD
configuration and iSCSI configuration, and I can fail the primary and
iSCSI service back and forth between the nodes.

I'm running into a situation where, once I get the system under
moderate I/O load (installing Linux in another VM, for example), I
start seeing the following errors in dmesg and/or journalctl on the
active node:

Unable to recover from DataOut timeout while in ERL=0, closing iSCSI connection for I_T Nexus iqn.1998-01.com.vmware:esx01-18f91cf9,i,0x00023d000001,iqn.1902-01.com.example.site:drbd1,t,0x01

This gets repeated a couple dozen times, and then I/O to the
iSCSI LUN from the ESXi host halts, the path to the LUN shows as
"Dead", and I have to reboot the active node and fail over to the
other node, at which point VMware picks back up and continues.
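
To put a number on how often the message fires, and for which initiator,
the kernel log can be parsed. This is a minimal sketch, assuming the
I_T Nexus string has the layout seen in the message quoted earlier
(initiator IQN, "i", ISID, target IQN, "t", TPGT). The names parse_nexus
and count_by_initiator are hypothetical helpers, not part of any tool.

```python
import re
from collections import Counter

# Assumed layout, taken from the message quoted in this post:
#   <initiator IQN>,i,0x<ISID>,<target IQN>,t,0x<TPGT>
# DOTALL and \s+ let the pattern match even when the message has been
# wrapped across several lines (as mail clients tend to do).
NEXUS_RE = re.compile(
    r"DataOut timeout while in ERL=0.*?I_T Nexus\s+"
    r"(?P<initiator>[^,\s]+),i,0x(?P<isid>[0-9a-fA-F]+),"
    r"(?P<target>[^,\s]+),t,0x(?P<tpgt>[0-9a-fA-F]+)",
    re.DOTALL,
)


def parse_nexus(text):
    """Return one dict per DataOut-timeout message found in text."""
    return [m.groupdict() for m in NEXUS_RE.finditer(text)]


def count_by_initiator(text):
    """Count DataOut-timeout messages per initiator IQN."""
    return Counter(hit["initiator"] for hit in parse_nexus(text))
```

The log text could come from something like `journalctl -k` saved to a
file and passed to count_by_initiator() as a string.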

I've searched around the web to try to find assistance with this
error, but it doesn't seem all that common - in one case it appears to
be a bug from several years ago that was patched, and beyond that not
much relevant has turned up. Based on the error message, it almost
seems as if the target system is trying to say that it couldn't write
its data out to the disk in a timely fashion (which might be because
DRBD can't sync as quickly as is expected?), but it isn't all that
clear from the error.

I'm wondering if anyone can provide tips on how best to mitigate
this. Is there any tuning that can change the timeout or throttle
the iSCSI traffic? Or is this indicative of a lack of available RAM
for buffering? (I'm not seeing a lot of RAM pressure, but it's
possible I'm just not catching it.)
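
As a starting point for the timeout question, the current values can at
least be inspected. This is a read-only sketch, assuming the target
exposes per-initiator attributes named dataout_timeout and
dataout_timeout_retries in "attrib" directories under the configfs path
below. The path and the helper name read_dataout_attribs are
assumptions to verify on the running system, and whether changing these
values would actually help here is untested.

```python
from pathlib import Path

# Assumed configfs location of the iSCSI target fabric; verify locally.
CONFIGFS_ISCSI = Path("/sys/kernel/config/target/iscsi")


def read_dataout_attribs(root=CONFIGFS_ISCSI):
    """Map each attrib file starting with 'dataout_timeout' to its value.

    Keys are paths relative to root; a value is None if the file
    could not be read (for example, due to permissions).
    """
    root = Path(root)
    values = {}
    for path in sorted(root.glob("**/attrib/dataout_timeout*")):
        key = str(path.relative_to(root))
        try:
            values[key] = path.read_text().strip()
        except OSError:
            values[key] = None
    return values
```

Calling read_dataout_attribs() with no arguments on the active node
would return a dict of whatever matching attribute files exist; an empty
dict means nothing matched under the assumed path.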

Environment:
* CentOS 8 Stream
* Kernel: 4.18.0-394.el8.x86_64
* DRBD: 9.1.7
* 2 CPUs, 4 GB of RAM per VM
* Shared block device: 1 TB

Thanks - Nick