On 14 June 2017 at 23:20, Jeff Furlong <jeff.furlong@xxxxxxx> wrote: > On issue one (inflate error), I've gone through some of the code and I think nr_samples should be declared as uint64_t, not uint32_t, at https://github.com/axboe/fio/blob/master/client.c#L1451 I've changed it on my machine and reran the prior test but after 25 hours the same issue occurred again. It's still difficult to say what type of job is causing the issue. I see there is another uint32_t nr_samples over in https://github.com/axboe/fio/blob/master/iolog.c#L801 - do you think that would need changing too? Is nr_samples slowly growing over time and have you managed to print its value at the point of the inflate error? > On issue two (SEND_ETA), I can reproduce with a slightly smaller job file: > > #fio --client=server1 test.job --client=server2 test.job > > # cat test.job > [test_job] > ioengine=libaio > direct=1 > rw=randread > iodepth=256 > size=100% > numjobs=1 > bs=4k > filename=/dev/nvme0n1 > group_reporting > write_lat_log=test_job > runtime=1m > > If I remove write_lat_log=test_job then there is no SEND_ETA issue. Only when logging (the latency) does this issue occur. And, if I enable write_lat_log=test_job but reduce runtime=30s, then the issue does not occur. I must have write_lat_log=test_job and runtime=1m to produce the issue. I imagine some count must be exceeded or the pause in moving the lat data from memory to file at the end of the job is causing a timing conflict (SEND_ETA msg vs "I've quit" msg). I can reproduce the problem here using this job: [sendeta] ioengine=libaio direct=1 rw=read bs=512 filename=/dev/nullb0 write_lat_log=test_job disable_clat=1 disable_slat=1 iodepth=512 number_ios=16000000 and running fio --debug=net --client=localhost sendeta.fio --client=127.0.0.1 sendeta.fio Using number_ios seems to make the job run for as long as needed to show the problem regardless of what the iodepth is in my case. From what I can see the client is forced to spend time on the IOLOG op which in turn makes it unresponsive to other actions: net 1709 client: got cmd op ETA from 127.0.0.1 (pdu=161) net 1709 client: got eta tag 0x2442d90, 2 net 1709 client: handle localhost net 1709 client: got cmd op ETA from localhost (pdu=161) net 1709 client: got eta tag 0x2442d90, 1 net 1709 client: request eta (2)k][r=423k,w=0 IOPS][eta 27m:09s] net 1709 client: requested eta tag 0x2442d90 net 1709 client: handle localhost net 1709 client: got cmd op ETA from localhost (pdu=161) net 1709 client: got eta tag 0x2442d90, 2 net 1709 client: handle 127.0.0.1 net 1709 client: got cmd op IOLOG from 127.0.0.1 (pdu=290) net 1709 client: handle 127.0.0.1 net 1709 client: got cmd op IOLOG from 127.0.0.1 (pdu=290) net 1709 client: handle 127.0.0.1 net 1709 client: got cmd op ETA from 127.0.0.1 (pdu=161) net 1709 client: got eta tag 0x2442d90, 1 net 1709 client: handle localhost0k][r=387k,w=0 IOPS][eta 00m:00s] net 1709 client: got cmd op IOLOG from localhost (pdu=11297502) net 1709 client: request eta (2) net 1709 client: requested eta tag 0x2442d90 net 1709 client: handle 127.0.0.1 net 1709 client: got cmd op IOLOG from 127.0.0.1 (pdu=11297330) net 1709 client: handle localhost net 1709 client: got cmd op IOLOG from localhost (pdu=290) net 1709 client: request eta (2) net 1709 client: requested eta tag 0x2449df0 client <127.0.0.1>: timeout on SEND_ETA client <localhost>: timeout on SEND_ETA net 1709 client: handle 127.0.0.1 net 1709 client: got cmd op ETA from 127.0.0.1 (pdu=161) fio: client: unable to find matching tag (2442c80) -- Sitsofe | http://sucs.org/~sits/ -- To unsubscribe from this list: send the line "unsubscribe fio" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html