On issue one (inflate error), I've gone through some of the code and I think nr_samples should be declared as uint64_t, not uint32_t, at https://github.com/axboe/fio/blob/master/client.c#L1451 I've changed it on my machine and reran the prior test but after 25 hours the same issue occurred again. It's still difficult to say what type of job is causing the issue. On issue two (SEND_ETA), I can reproduce with a slightly smaller job file: #fio --client=server1 test.job --client=server2 test.job # cat test.job [test_job] ioengine=libaio direct=1 rw=randread iodepth=256 size=100% numjobs=1 bs=4k filename=/dev/nvme0n1 group_reporting write_lat_log=test_job runtime=1m If I remove write_lat_log=test_job then there is no SEND_ETA issue. Only when logging (the latency) does this issue occur. And, if I enable write_lat_log=test_job but reduce runtime=30s, then the issue does not occur. I must have write_lat_log=test_job and runtime=1m to produce the issue. I imagine some count must be exceeded or the pause in moving the lat data from memory to file at the end of the job is causing a timing conflict (SEND_ETA msg vs "I've quit" msg). Regards, Jeff -----Original Message----- From: fio-owner@xxxxxxxxxxxxxxx [mailto:fio-owner@xxxxxxxxxxxxxxx] On Behalf Of Sitsofe Wheeler Sent: Tuesday, June 13, 2017 11:13 PM To: Jeff Furlong <jeff.furlong@xxxxxxx> Cc: fio@xxxxxxxxxxxxxxx Subject: Re: fio server errors On 8 June 2017 at 22:53, Jeff Furlong <jeff.furlong@xxxxxxx> wrote: > Using fio 2.19 I can reproduce the SEND_ETA issue. With debug=all, indeed there is a ton of output, but it finishes with roughly: > > <server1> diskutil 30748 update io ticks <server1> diskutil 30748 open > stat file: /sys/block/nvme0n1/stat > <server1> diskutil 30748 /sys/block/nvme0n1/stat: 1299576802 0 54165895270 1333065028 863486453 0 63615892884 3819260095 0 57718711 859646548 > <server1> diskutil 30748 /sys/block/nvme0n1/stat: stat read ok? 0 > <server1> helperthread 30748 since_ss: 0, next_ss: 1000, next_log: > 250, msec_to_next_event: 250 <server1> diskutil 30748 update io ticks > <server1> diskutil 30748 open stat file: /sys/block/nvme0n1/stat > <server1> diskutil 30748 /sys/block/nvme0n1/stat: 1299576802 0 54165895270 1333065028 863486453 0 63615892884 3819260095 0 57718711 859646548 > <server1> diskutil 30748 /sys/block/nvme0n1/stat: stat read ok? 0 > <server1> helperthread 30748 since_ss: 0, next_ss: 1000, next_log: > 250, msec_to_next_event: 250 <server1> fio: pid=30750, got signal=11 > <server1> process 30748 pid=30750: runstate RUNNING -> REAPED > <server1> process 30748 terminate group_id=-1 <server1> process > 30748 setting terminate on test_job/30750 <server1> diskutil 30748 > update io ticks <server1> diskutil 30748 open stat file: > /sys/block/nvme0n1/stat > <server1> diskutil 30748 /sys/block/nvme0n1/stat: 1299576802 0 54165895270 1333065028 863486453 0 63615892884 3819260095 0 57718711 859646548 > <server1> diskutil 30748 /sys/block/nvme0n1/stat: stat read ok? 0 > <server1> helperthread 30748 since_ss: 0, next_ss: 1000, next_log: > 500, msec_to_next_event: 157 <server1> process 30748 pid=30750: > runstate REAPED -> FINISHING <server1> process 30748 pid=30750: runstate FINISHING -> REAPED > <server1> net 30748 server sending end stats > <server1> net 30748 ts->ss_state = 0 > <server1> net 30748 server sending group run stats > <server1> net 30748 server: sending disk_util 1 > <server1> > <server1> parse 30748 free options > <server1> parse 30748 free options > <server1> net 30707 pid 30748 exited, sig=0, exitval=0 > <server1> net 30707 server: sending stop (0, 0) > <server1> net 30707 server: sending quit > <server1> net 30707 server: got op [SEND_ETA], pdu=0, tag=dceae0 > <server1> net 30707 server sending status > <server1> net 30707 server: got op [SEND_ETA], pdu=0, tag=dcea50 > <server1> net 30707 server sending status > <server1> net 30707 server: got op [SEND_ETA], pdu=0, tag=dcea50 > <server1> net 30707 server sending status > <server1> net 30707 server: got op [SEND_ETA], pdu=0, tag=dceae0 > <server1> net 30707 server sending status > <server1> net 30707 server: got op [SEND_ETA], pdu=0, tag=dceae0 > <server1> net 30707 server sending status > <server1> net 30707 server: got op [SEND_ETA], pdu=0, tag=dceae0 > <server1> net 30707 server sending status > <server1> net 30707 server: got op [SEND_ETA], pdu=0, tag=dcea50 > <server1> net 30707 server sending status > parse 30706 free options OK so it's not a newly introduced issue. This looks like the client is continuing to send SEND_ETA messages even though the server has sent back a "I've quit" message... If you whittle the job file as much as possible (e.g. no write_lat_log, smaller iodepth etc) do the symptoms change? What's the smallest job file that still reproduces this problem? -- Sitsofe | http://sucs.org/~sits/ -- To unsubscribe from this list: send the line "unsubscribe fio" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html ��.n��������+%������w��{.n�������^n�r������&��z�ޗ�zf���h���~����������_��+v���)ߣ�