RE: fio server errors

Using fio 2.19 I can reproduce the SEND_ETA issue. With --debug=all there is indeed a ton of output, but it finishes with roughly the following:

<server1> diskutil 30748 update io ticks
<server1> diskutil 30748 open stat file: /sys/block/nvme0n1/stat
<server1> diskutil 30748 /sys/block/nvme0n1/stat: 1299576802        0 54165895270 1333065028 863486453        0 63615892884 3819260095        0 57718711 859646548
<server1> diskutil 30748 /sys/block/nvme0n1/stat: stat read ok? 0
<server1> helperthread 30748 since_ss: 0, next_ss: 1000, next_log: 250, msec_to_next_event: 250
<server1> diskutil 30748 update io ticks
<server1> diskutil 30748 open stat file: /sys/block/nvme0n1/stat
<server1> diskutil 30748 /sys/block/nvme0n1/stat: 1299576802        0 54165895270 1333065028 863486453        0 63615892884 3819260095        0 57718711 859646548
<server1> diskutil 30748 /sys/block/nvme0n1/stat: stat read ok? 0
<server1> helperthread 30748 since_ss: 0, next_ss: 1000, next_log: 250, msec_to_next_event: 250
<server1> fio: pid=30750, got signal=11
<server1> process  30748 pid=30750: runstate RUNNING -> REAPED
<server1> process  30748 terminate group_id=-1
<server1> process  30748 setting terminate on test_job/30750
<server1> diskutil 30748 update io ticks
<server1> diskutil 30748 open stat file: /sys/block/nvme0n1/stat
<server1> diskutil 30748 /sys/block/nvme0n1/stat: 1299576802        0 54165895270 1333065028 863486453        0 63615892884 3819260095        0 57718711 859646548
<server1> diskutil 30748 /sys/block/nvme0n1/stat: stat read ok? 0
<server1> helperthread 30748 since_ss: 0, next_ss: 1000, next_log: 500, msec_to_next_event: 157
<server1> process  30748 pid=30750: runstate REAPED -> FINISHING
<server1> process  30748 pid=30750: runstate FINISHING -> REAPED
<server1> net      30748 server sending end stats
<server1> net      30748 ts->ss_state = 0
<server1> net      30748 server sending group run stats
<server1> net      30748 server: sending disk_util 1
<server1>
<server1> parse    30748 free options
<server1> parse    30748 free options
<server1> net      30707 pid 30748 exited, sig=0, exitval=0
<server1> net      30707 server: sending stop (0, 0)
<server1> net      30707 server: sending quit
<server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dceae0
<server1> net      30707 server sending status
<server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dcea50
<server1> net      30707 server sending status
<server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dcea50
<server1> net      30707 server sending status
<server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dceae0
<server1> net      30707 server sending status
<server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dceae0
<server1> net      30707 server sending status
<server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dceae0
<server1> net      30707 server sending status
<server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dcea50
<server1> net      30707 server sending status
parse    30706 free options

If I keep size=100% and runtime at 1m, I can reproduce the issue. If I change size to 4G (and don't use time_based), it runs without any issue. I suspect the problem is related to the size of the saved log files, which are roughly 400MB each.
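Some rough arithmetic on that hypothesis (the ~20 bytes per line is just my estimate for fio's "time, value, direction, bs" text log lines):

  400 MB per log / ~20 bytes per sample line ≈ 20M samples
  20M samples / 60 s of runtime ≈ ~330K logged IOs per second

so at these rates each log holds tens of millions of samples that the server has to compress and the client has to inflate.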

Regards,
Jeff 


-----Original Message-----
From: Sitsofe Wheeler [mailto:sitsofe@xxxxxxxxx] 
Sent: Tuesday, June 6, 2017 11:19 PM
To: Jeff Furlong <jeff.furlong@xxxxxxx>
Cc: fio@xxxxxxxxxxxxxxx
Subject: Re: fio server errors

On 6 June 2017 at 18:39, Jeff Furlong <jeff.furlong@xxxxxxx> wrote:
>
> When using the client/server mode, I'm having two issues.  First, I get a seemingly random inflate error; it happens after several hours and is challenging to reproduce:
>
> fio: inflate error -5
> fio: failed decompressing log
> fio: failed converting IO log
>
> I have --write_iops_log turned on, so I'm guessing the IO log transferred across the network is not working smoothly.  I don't have a reliable job file that reproduces the issue, unfortunately.

Hmm, -5 is going to be Z_BUF_ERROR, which suggests we didn't have enough space to store the decompressed data. Do you get this problem with older versions of fio (e.g. 2.19) too? You'll probably need to add debugging to trace nr_samples and, where possible, total (e.g.
around https://github.com/axboe/fio/blob/master/client.c#L1467 and
https://github.com/axboe/fio/blob/9af5a2450a555a725dd18e6845967cd7cf3aad64/server.c#L1927
).
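As a side note, -5 is just zlib's Z_BUF_ERROR. Here's a small standalone sketch (plain zlib, not fio code; the buffer sizes are made up for the demo) showing inflate() returning it when the output buffer can't hold the decompressed data:

/* Compress 4k of data, then inflate into a deliberately undersized
 * buffer. With Z_FINISH and too little room, inflate() returns
 * Z_BUF_ERROR (-5) -- the same code as "fio: inflate error -5". */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    unsigned char src[4096], comp[8192], out[64]; /* out is too small */
    uLongf clen = sizeof(comp);
    z_stream s;
    int ret;

    memset(src, 'A', sizeof(src));
    if (compress(comp, &clen, src, sizeof(src)) != Z_OK)
        return 1;

    memset(&s, 0, sizeof(s));
    if (inflateInit(&s) != Z_OK)
        return 1;

    s.next_in = comp;
    s.avail_in = (uInt) clen;
    s.next_out = out;
    s.avail_out = sizeof(out);

    ret = inflate(&s, Z_FINISH);
    printf("inflate() = %d (Z_BUF_ERROR = %d)\n", ret, Z_BUF_ERROR);

    inflateEnd(&s);
    return 0;
}

Compile with "gcc demo.c -lz"; both numbers should come out as -5.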

> Second, I also get (less randomly) a timeout on SEND_ETA.  Here are the details:
>
> # fio -version
> fio-2.20-39-gc12d
>
> # fio --client=server1 test.job --client=server2 test.job
> client <server2>: timeout on SEND_ETA
> client <server1>: timeout on SEND_ETA
> fio: client: unable to find matching tag (76f680)
>
> # cat test.job
> [test_job]
> ioengine=libaio
> direct=1
> rw=randread
> norandommap
> randrepeat=0
> iodepth=256
> size=100%
> numjobs=1
> bs=4k
> filename=/dev/nvme0n1
> group_reporting
> write_lat_log=test_job
> runtime=3m

Again, can you reproduce this one with the previous fio release? It would also help if you cut the job down to the fewest lines that still reproduce the problem, and reduce values (such as iodepth, runtime and size) as far as possible while the problem still occurs (e.g. if size=4G still makes it happen, that's useful to know). If it happens relatively quickly you may be able to use --debug=all to narrow down the last thing the client did, but be aware this generates a lot of output...
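For example, a cut-down starting point might look like the following (just a sketch; the smaller iodepth/size/runtime values are guesses to bisect with, not known-good settings):

[test_job]
ioengine=libaio
direct=1
rw=randread
iodepth=32
size=4G
bs=4k
filename=/dev/nvme0n1
write_lat_log=test_job
runtime=30s

and then add options and values back one at a time until the timeout reappears.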

--
Sitsofe | http://sucs.org/~sits/