RE: fio server errors

Good catch.  I've fixed the second declaration of uint32_t nr_samples in iolog.c.  The test is running so far, but it seems to take a while to fail even in the bad case.  I have printed the value of nr_samples in various tests, but unfortunately not at the exact point where the inflate error caused fio to exit.  I'm running it again, so it should only be a matter of time to determine whether changing to uint64_t solves it.

Glad you can reproduce the SEND_ETA issue.  I'm not sure how to deal with the client's unresponsiveness during IOLOG.

Regards,
Jeff

-----Original Message-----
From: Sitsofe Wheeler [mailto:sitsofe@xxxxxxxxx] 
Sent: Wednesday, June 14, 2017 11:51 PM
To: Jeff Furlong <jeff.furlong@xxxxxxx>
Cc: fio@xxxxxxxxxxxxxxx
Subject: Re: fio server errors

On 14 June 2017 at 23:20, Jeff Furlong <jeff.furlong@xxxxxxx> wrote:
> On issue one (inflate error), I've gone through some of the code and I think nr_samples should be declared as uint64_t, not uint32_t, at https://github.com/axboe/fio/blob/master/client.c#L1451  I've changed it on my machine and reran the prior test but after 25 hours the same issue occurred again.  It's still difficult to say what type of job is causing the issue.

I see there is another uint32_t nr_samples over in
https://github.com/axboe/fio/blob/master/iolog.c#L801 - do you think that would need changing too? Is nr_samples slowly growing over time and have you managed to print its value at the point of the inflate error?
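
To illustrate why the width of nr_samples could matter, here is a minimal sketch of what happens when a 64-bit sample count is stored in a 32-bit variable. It assumes the count can grow past what 32 bits hold, which has not been confirmed in this thread. The helper names store_count_32 and count_fits_32 are hypothetical and are not fio code.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from fio): models assigning a 64-bit sample
 * count to a uint32_t. The conversion keeps only the low 32 bits, so
 * any count above UINT32_MAX silently wraps.
 */
static uint32_t store_count_32(uint64_t nr_samples)
{
    return (uint32_t)nr_samples;
}

/*
 * Hypothetical check: returns 1 if the count survives a round trip
 * through a 32-bit variable, 0 if it would be truncated.
 */
static int count_fits_32(uint64_t nr_samples)
{
    return nr_samples <= UINT32_MAX;
}
```

If a truncated count were later used to size or walk a buffer of samples, the sender and receiver could disagree about the data length. Whether that is what triggers the inflate error here is only a guess until nr_samples is printed at the point of failure.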

> On issue two (SEND_ETA), I can reproduce with a slightly smaller job file:
>
> #fio --client=server1 test.job --client=server2 test.job
>
> # cat test.job
> [test_job]
> ioengine=libaio
> direct=1
> rw=randread
> iodepth=256
> size=100%
> numjobs=1
> bs=4k
> filename=/dev/nvme0n1
> group_reporting
> write_lat_log=test_job
> runtime=1m
>
> If I remove write_lat_log=test_job then there is no SEND_ETA issue.  Only when logging (the latency) does this issue occur.  And, if I enable write_lat_log=test_job but reduce runtime=30s, then the issue does not occur.  I must have write_lat_log=test_job and runtime=1m to produce the issue.  I imagine some count must be exceeded or the pause in moving the lat data from memory to file at the end of the job is causing a timing conflict (SEND_ETA msg vs "I've quit" msg).

I can reproduce the problem here using this job:
[sendeta]
ioengine=libaio
direct=1
rw=read
bs=512
filename=/dev/nullb0
write_lat_log=test_job
disable_clat=1
disable_slat=1
iodepth=512
number_ios=16000000

and running
fio --debug=net --client=localhost sendeta.fio --client=127.0.0.1 sendeta.fio

Using number_ios seems to make the job run for as long as needed to show the problem regardless of what the iodepth is in my case. From what I can see the client is forced to spend time on the IOLOG op which in turn makes it unresponsive to other actions:

net      1709  client: got cmd op ETA from 127.0.0.1 (pdu=161)
net      1709  client: got eta tag 0x2442d90, 2
net      1709  client: handle localhost
net      1709  client: got cmd op ETA from localhost (pdu=161)
net      1709  client: got eta tag 0x2442d90, 1
net      1709  client: request eta (2)k][r=423k,w=0 IOPS][eta 27m:09s]
net      1709  client: requested eta tag 0x2442d90
net      1709  client: handle localhost
net      1709  client: got cmd op ETA from localhost (pdu=161)
net      1709  client: got eta tag 0x2442d90, 2
net      1709  client: handle 127.0.0.1
net      1709  client: got cmd op IOLOG from 127.0.0.1 (pdu=290)
net      1709  client: handle 127.0.0.1
net      1709  client: got cmd op IOLOG from 127.0.0.1 (pdu=290)
net      1709  client: handle 127.0.0.1
net      1709  client: got cmd op ETA from 127.0.0.1 (pdu=161)
net      1709  client: got eta tag 0x2442d90, 1
net      1709  client: handle localhost0k][r=387k,w=0 IOPS][eta 00m:00s]
net      1709  client: got cmd op IOLOG from localhost (pdu=11297502)
net      1709  client: request eta (2)
net      1709  client: requested eta tag 0x2442d90
net      1709  client: handle 127.0.0.1
net      1709  client: got cmd op IOLOG from 127.0.0.1 (pdu=11297330)
net      1709  client: handle localhost
net      1709  client: got cmd op IOLOG from localhost (pdu=290)
net      1709  client: request eta (2)
net      1709  client: requested eta tag 0x2449df0
client <127.0.0.1>: timeout on SEND_ETA
client <localhost>: timeout on SEND_ETA
net      1709  client: handle 127.0.0.1
net      1709  client: got cmd op ETA from 127.0.0.1 (pdu=161)
fio: client: unable to find matching tag (2442c80)
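
The log above is consistent with a single handling loop in which a large IOLOG message is processed before the ETA reply queued behind it is read. The following is a minimal sketch of that idea on a virtual clock. It is not fio's implementation: the struct, the enum, the function eta_reply_in_time, and all costs and timeouts are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message kinds, loosely named after the ops in the log. */
enum op { OP_ETA, OP_IOLOG };

struct msg {
    enum op op;
    uint64_t cost_ms;   /* assumed time the client spends handling it */
};

/*
 * Walk the queue in order on a virtual clock. An ETA request is assumed
 * to be sent at time 0 and must be answered within timeout_ms.
 * Returns 1 if the ETA reply is reached before the deadline, 0 if the
 * messages ahead of it take too long or no reply is queued at all.
 */
static int eta_reply_in_time(const struct msg *q, size_t n,
                             uint64_t timeout_ms)
{
    uint64_t now = 0;

    for (size_t i = 0; i < n; i++) {
        if (q[i].op == OP_ETA)
            return now <= timeout_ms;   /* reply is read at time "now" */
        now += q[i].cost_ms;            /* busy handling another message */
    }
    return 0;
}
```

With a small IOLOG message ahead of the ETA reply the function returns 1. With one whose assumed handling cost exceeds the timeout it returns 0, which corresponds to the "timeout on SEND_ETA" lines. The later "unable to find matching tag" message would then fit a reply arriving after its request had already been given up on.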

--
Sitsofe | http://sucs.org/~sits/
