Re: timeout on SEND ETA

David Byte <dbyte@xxxxxxxx> · Mon, 11 Feb 2019 22:38:45 +0000

Thanks for the response, Jeff.  We changed to --eta=never and have stopped losing the data associated with those jobs.  The analysis from one of our guys is that when the ETA request times out too many times, fio aborts the connection to that client, and the data from that instance is lost.
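
For anyone scripting this the same way, here is a minimal sketch of how the client command line can be assembled with ETA reporting turned off. The helper name build_fio_client_cmd and the job file name randread.fio are made up for illustration; the host names are the two from the log in my original message. It only uses the --eta, --eta-interval and --client options already discussed in this thread, and I have only confirmed the --eta=never behaviour on our setup with fio 3.10.

```python
import shlex


def build_fio_client_cmd(hosts, job_file, eta="never", eta_interval=None):
    """Assemble an fio client-mode command line as a list of arguments.

    hosts        -- names of the machines running the fio server
    job_file     -- job file passed to those servers (hypothetical name below)
    eta          -- value for fio's --eta option ("never", "auto" or "always")
    eta_interval -- optional value for --eta-interval, e.g. "5s"
    """
    cmd = ["fio", f"--eta={eta}"]
    if eta_interval is not None:
        # Only useful when ETA reporting is left enabled.
        cmd.append(f"--eta-interval={eta_interval}")
    # One --client option per server, then the job file they should run.
    cmd.extend(f"--client={host}" for host in hosts)
    cmd.append(job_file)
    return cmd


if __name__ == "__main__":
    # Host names from the log in this thread; the job file name is an assumption.
    cmd = build_fio_client_cmd(["sr630-5", "sr630-6"], "randread.fio")
    print(shlex.join(cmd))
```

Passing eta="auto" with eta_interval="5s" gives the other variant Jeff suggested, which keeps the ETA output but asks for it less often.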

Admittedly, I drive the fio nodes VERY hard.
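
To get a feel for how many ETA timeouts each client accumulates before fio drops it, the client output can be tallied with something like the sketch below. The function name summarize_eta_timeouts is hypothetical, and the two patterns are written against the message formats in the log quoted below, so they may need adjusting for other fio versions.

```python
import re
from collections import Counter

# Patterns based on the messages in the quoted log (fio 3.10).
ETA_TIMEOUT = re.compile(r"client <([^>]+)>: timeout on SEND_ETA")
DROPPED = re.compile(r"fio: client (\S+) timed out")


def summarize_eta_timeouts(log_text):
    """Return (per-client SEND_ETA timeout counts, list of dropped clients)."""
    # findall also copes with several messages fused onto one line.
    timeouts = Counter(ETA_TIMEOUT.findall(log_text))
    dropped = DROPPED.findall(log_text)
    return timeouts, dropped


if __name__ == "__main__":
    # A short excerpt in the same format as the quoted log.
    sample = (
        "client <sr630-6>: timeout on SEND_ETA\n"
        "client <sr630-5>: timeout on SEND_ETA\n"
        "fio: client: unable to find matching tag (1e278e0)\n"
        "client <sr630-6>: timeout on SEND_ETA\n"
        "fio: client sr630-6, timeout on cmd SEND_ETA\n"
        "fio: client sr630-6 timed out\n"
    )
    timeouts, dropped = summarize_eta_timeouts(sample)
    for host, count in sorted(timeouts.items()):
        print(f"{host}: {count} SEND_ETA timeouts")
    print("dropped clients:", ", ".join(dropped))
```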

David Byte
Sr. Technology Strategist

On 2/11/19, 2:04 PM, "Jeff Furlong" <jeff.furlong@xxxxxxx> wrote:

    Can you reproduce this problem on the latest version of fio?  Do you have a sample job file?  If I recall correctly from seeing this error before, the server has stopped while the clients are still finishing up at the end of a job.  It may be worse if your clients are running at high CPU utilization.  You might try --eta=never or --eta-interval=5s to reduce how many warnings you see.

    Regards,
    Jeff

    -----Original Message-----
    From: fio-owner@xxxxxxxxxxxxxxx [mailto:fio-owner@xxxxxxxxxxxxxxx] On Behalf Of David Byte
    Sent: Friday, February 1, 2019 4:57 PM
    To: fio@xxxxxxxxxxxxxxx
    Subject: timeout on SEND ETA

    I have a scripted process that uses fio and after a few tests I start seeing a lot of errors:

    <sr630-5> Starting 16 processes
    <sr630-6> Starting 16 processes
    client <sr630-6>: timeout on SEND_ETA
    client <sr630-5>: timeout on SEND_ETA
    client <sr630-6>: timeout on SEND_ETA
    client <sr630-5>: timeout on SEND_ETA
    client <sr630-6>: timeout on SEND_ETA
    client <sr630-5>: timeout on SEND_ETA
    fio: client: unable to find matching tag (1e278e0)
    fio: client: unable to find matching tag (1e274b0)
    client <sr630-6>: timeout on SEND_ETA
    client <sr630-5>: timeout on SEND_ETA
    client <sr630-6>: timeout on SEND_ETA
    client <sr630-5>: timeout on SEND_ETA
    fio: client: unable to find matching tag (1e278e0)
    fio: client: unable to find matching tag (1e274b0)
    client <sr630-6>: timeout on SEND_ETA
    client <sr630-5>: timeout on SEND_ETA
    client <sr630-6>: timeout on SEND_ETA
    fio: client sr630-6, timeout on cmd SEND_ETA
    fio: client sr630-6 timed out
    client <sr630-5>: timeout on SEND_ETA
    fio: client sr630-5, timeout on cmd SEND_ETA
    fio: client sr630-5 timed out

    The jobs are intended to drive the client system and storage as hard as possible, so perhaps I am pushing past some kind of limit.  The issue doesn’t occur with 5-minute job runs, but it does with 1-hour job runs, which makes me think it is tied to the job duration in some way.

    fio 3.10
    kernel suse-4.4.171-94.76-default

    Network is 100G; nodes have Xeon Silver 4110 CPUs with Spectre/Meltdown mitigations disabled.

    The jobs are intense, QD=32+, numjobs=16.  

    I see a lot more failures with small I/Os, especially random ones.  This is true on both spinning and SSD-based storage.

    Is there anything that can be tweaked in the job file to lengthen the timeouts?

    Other thoughts?

    David Byte
    Sr. Technology Strategist