I have a scripted process that uses fio, and after a few tests I start seeing a lot of errors:

<sr630-5> Starting 16 processes
<sr630-6> Starting 16 processes
client <sr630-6>: timeout on SEND_ETA
client <sr630-5>: timeout on SEND_ETA
client <sr630-6>: timeout on SEND_ETA
client <sr630-5>: timeout on SEND_ETA
client <sr630-6>: timeout on SEND_ETA
client <sr630-5>: timeout on SEND_ETA
fio: client: unable to find matching tag (1e278e0)
fio: client: unable to find matching tag (1e274b0)
client <sr630-6>: timeout on SEND_ETA
client <sr630-5>: timeout on SEND_ETA
client <sr630-6>: timeout on SEND_ETA
client <sr630-5>: timeout on SEND_ETA
fio: client: unable to find matching tag (1e278e0)
fio: client: unable to find matching tag (1e274b0)
client <sr630-6>: timeout on SEND_ETA
client <sr630-5>: timeout on SEND_ETA
client <sr630-6>: timeout on SEND_ETA
fio: client sr630-6, timeout on cmd SEND_ETA
fio: client sr630-6 timed out
client <sr630-5>: timeout on SEND_ETA
fio: client sr630-5, timeout on cmd SEND_ETA
fio: client sr630-5 timed out

The jobs are intended to drive the client systems and storage as hard as possible, so perhaps I am pushing past some kind of limit. The issue doesn't occur with 5-minute job runs, but it does with 1-hour runs, which makes me think it is tied to the job duration in some way.

fio 3.10
kernel: SUSE 4.4.171-94.76-default
network is 100G; nodes have Xeon Silver 4110 CPUs with Spectre/Meltdown mitigations disabled

The jobs are intense: QD=32+, numjobs=16. I see many more failures with small I/Os, especially random ones. This is true on both spinning and SSD-based storage.

Is there anything that can be tweaked in the jobfile definition to lengthen these timeouts? Other thoughts?

David Byte
Sr. Technology Strategist
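
P.S. For reference, the jobs look roughly like the sketch below; treat it as an illustration rather than the exact file. The ioengine, block size, and target device are placeholders (the real values vary per test), but the iodepth/numjobs/runtime shape matches what I described above.

; illustrative jobfile sketch, not the exact one in use
[global]
; ioengine is an assumption; the real engine varies per test
ioengine=libaio
direct=1
time_based
; the 1-hour runs are the ones that fail, 5-minute runs are fine
runtime=3600
group_reporting

[small-randread]
; small random I/Os show the most failures
rw=randread
bs=4k
iodepth=32
numjobs=16
; placeholder target, the real devices vary
filename=/dev/nvme0n1

It is launched in client/server mode against both nodes, along the lines of:

fio --client=sr630-5 jobs.fio --client=sr630-6 jobs.fio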