RE: fio server errors

On issue one (inflate error), I've gone through some of the code and I believe nr_samples should be declared as uint64_t, not uint32_t, at https://github.com/axboe/fio/blob/master/client.c#L1451.  I made that change on my machine and reran the prior test, but after 25 hours the same issue occurred again.  It's still difficult to say what type of job is causing the issue.

On issue two (SEND_ETA), I can reproduce with a slightly smaller job file:

# fio --client=server1 test.job --client=server2 test.job

# cat test.job
[test_job]
ioengine=libaio
direct=1
rw=randread
iodepth=256
size=100%
numjobs=1
bs=4k
filename=/dev/nvme0n1
group_reporting
write_lat_log=test_job
runtime=1m

If I remove write_lat_log=test_job, there is no SEND_ETA issue; it occurs only when latency logging is enabled.  Likewise, if I keep write_lat_log=test_job but reduce runtime=30s, the issue does not occur.  Both write_lat_log=test_job and runtime=1m are needed to reproduce it.  I suspect either some count is being exceeded, or the pause while the latency data is flushed from memory to file at the end of the job creates a timing conflict (SEND_ETA msg vs "I've quit" msg).

Regards,
Jeff

-----Original Message-----
From: fio-owner@xxxxxxxxxxxxxxx [mailto:fio-owner@xxxxxxxxxxxxxxx] On Behalf Of Sitsofe Wheeler
Sent: Tuesday, June 13, 2017 11:13 PM
To: Jeff Furlong <jeff.furlong@xxxxxxx>
Cc: fio@xxxxxxxxxxxxxxx
Subject: Re: fio server errors

On 8 June 2017 at 22:53, Jeff Furlong <jeff.furlong@xxxxxxx> wrote:
> Using fio 2.19 I can reproduce the SEND_ETA issue.  With debug=all, indeed there is a ton of output, but it finishes with roughly:
>
> <server1> diskutil 30748 update io ticks
> <server1> diskutil 30748 open stat file: /sys/block/nvme0n1/stat
> <server1> diskutil 30748 /sys/block/nvme0n1/stat: 1299576802        0 54165895270 1333065028 863486453        0 63615892884 3819260095        0 57718711 859646548
> <server1> diskutil 30748 /sys/block/nvme0n1/stat: stat read ok? 0
> <server1> helperthread 30748 since_ss: 0, next_ss: 1000, next_log: 250, msec_to_next_event: 250
> <server1> diskutil 30748 update io ticks
> <server1> diskutil 30748 open stat file: /sys/block/nvme0n1/stat
> <server1> diskutil 30748 /sys/block/nvme0n1/stat: 1299576802        0 54165895270 1333065028 863486453        0 63615892884 3819260095        0 57718711 859646548
> <server1> diskutil 30748 /sys/block/nvme0n1/stat: stat read ok? 0
> <server1> helperthread 30748 since_ss: 0, next_ss: 1000, next_log: 250, msec_to_next_event: 250
> <server1> fio: pid=30750, got signal=11
> <server1> process  30748 pid=30750: runstate RUNNING -> REAPED
> <server1> process  30748 terminate group_id=-1
> <server1> process  30748 setting terminate on test_job/30750
> <server1> diskutil 30748 update io ticks
> <server1> diskutil 30748 open stat file: /sys/block/nvme0n1/stat
> <server1> diskutil 30748 /sys/block/nvme0n1/stat: 1299576802        0 54165895270 1333065028 863486453        0 63615892884 3819260095        0 57718711 859646548
> <server1> diskutil 30748 /sys/block/nvme0n1/stat: stat read ok? 0
> <server1> helperthread 30748 since_ss: 0, next_ss: 1000, next_log: 500, msec_to_next_event: 157
> <server1> process  30748 pid=30750: runstate REAPED -> FINISHING
> <server1> process  30748 pid=30750: runstate FINISHING -> REAPED
> <server1> net      30748 server sending end stats
> <server1> net      30748 ts->ss_state = 0
> <server1> net      30748 server sending group run stats
> <server1> net      30748 server: sending disk_util 1
> <server1>
> <server1> parse    30748 free options
> <server1> parse    30748 free options
> <server1> net      30707 pid 30748 exited, sig=0, exitval=0
> <server1> net      30707 server: sending stop (0, 0)
> <server1> net      30707 server: sending quit
> <server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dceae0
> <server1> net      30707 server sending status
> <server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dcea50
> <server1> net      30707 server sending status
> <server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dcea50
> <server1> net      30707 server sending status
> <server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dceae0
> <server1> net      30707 server sending status
> <server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dceae0
> <server1> net      30707 server sending status
> <server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dceae0
> <server1> net      30707 server sending status
> <server1> net      30707 server: got op [SEND_ETA], pdu=0, tag=dcea50
> <server1> net      30707 server sending status
> parse    30706 free options

OK so it's not a newly introduced issue. This looks like the client is continuing to send SEND_ETA messages even though the server has sent back an "I've quit" message...

If you whittle the job file as much as possible (e.g. no write_lat_log, smaller iodepth etc) do the symptoms change? What's the smallest job file that still reproduces this problem?

--
Sitsofe | http://sucs.org/~sits/
--
To unsubscribe from this list: send the line "unsubscribe fio" in the body of a message to majordomo@xxxxxxxxxxxxxxx. More majordomo info at http://vger.kernel.org/majordomo-info.html
