Re: [PATCH v5 0/2] fix inaccurate io_ticks

Weiping Zhang <zwp10758@xxxxxxxxx> · Wed, 18 Nov 2020 13:55:08 +0800

On Tue, Nov 17, 2020 at 3:40 PM Ming Lei <ming.lei@xxxxxxxxxx> wrote:
>
> On Tue, Nov 17, 2020 at 12:59:46PM +0800, Weiping Zhang wrote:
> > On Tue, Nov 17, 2020 at 11:28 AM Ming Lei <ming.lei@xxxxxxxxxx> wrote:
> > >
> > > On Tue, Nov 17, 2020 at 11:01:49AM +0800, Weiping Zhang wrote:
> > > > Hi Jens,
> > > >
> > > > Ping
> > >
> > > Hello Weiping,
> > >
> > > Not sure we have to fix this issue, and adding blk_mq_queue_inflight()
> > > back to IO path brings cost which turns out to be visible, and I did
> > > get soft lockup report on Azure NVMe because of this kind of cost.
> > >
> > Have you test v5, this patch is different from v1, the v1 gets
> > inflight for each IO,
> > v5 has changed to get inflight every jiffer.
>
> I meant the issue can be reproduced on kernel before 5b18b5a73760("block:
> delete part_round_stats and switch to less precise counting").
>
> Also do we really need to fix this issue? I understand device
> utilization becomes not accurate at very small load, is it really
> worth of adding runtime load in fast path for fixing this issue?
>
Hello Ming,

The problem is user hard to know how busy disk is,
for small load, it shows high utilization, for heavy load it also shows
high utilization, that makes %util meaningless.

The following test case shows a big gap with same workload:

modprobe null_blk submit_queues=8 queue_mode=2 irqmode=2 completion_nsec=100000
fio -name=test -ioengine=sync -bs=4K -rw=write -filename=/dev/nullb0
-size=100M -time_based=1 -direct=1 -runtime=300 -rate=4m &

                        w/s   w_await  %util
-----------------------------------------------
before patch 1024         0.15  100
after   patch  1024         0.15  14.5

I know for hyper speed disk, add such accounting in fast path is harmful,
maybe we add an interface to enable/disable io_ticks accounting, like
what /sys/block/<disk>/queue/iostat does.

eg: /sys/block/<disk>/queue/iostat_io_ticks
when write 0 to it, just disable io_ticks totally.

Or any other good idea ?

> >
> > If for v5, can we reproduce it on null_blk ?
>
> No, I just saw report on Azure NVMe.
>
> >
> > > BTW, suppose the io accounting issue needs to be fixed, just wondering
> > > why not simply revert 5b18b5a73760 ("block: delete part_round_stats and
> > > switch to less precise counting"), and the original way had been worked
> > > for decades.
> > >
> > This patch is more better than before, it will break early when find there is
> > inflight io on any cpu, for the worst case(the io in running on the last cpu),
> > it iterates all cpus.
>
Yes, it's the worst case.
Actually v5 has two improvements compare to before 5b18b5a73760:
1. for io end, v5 do not get inflight count
2. for io start, v5 just find the first inflight io in any cpu, for
the worst case it does same as before.

> Please see the following case:
>
> 1) one device has 256 hw queues, and the system has 256 cpu cores, and
> each hw queue's depth is 1k.
>
> 2) there isn't any io load on CPUs(0 ~ 254)
>
> 3) heavy io load is run on CPU 255
>
> So with your trick the code still need to iterate hw queues from 0 to 254, and
> the load isn't something which can be ignored. Especially it is just for
> io accounting.
>
>
> Thanks,
> Ming
>
Thanks