hi,
> On 11/18/20 6:59 AM, Pavel Begunkov wrote:
>> On 18/11/2020 01:42, Jens Axboe wrote:
>>> On 11/17/20 9:58 AM, Pavel Begunkov wrote:
>>>> On 17/11/2020 16:30, Jens Axboe wrote:
>>>>> On 11/17/20 3:43 AM, Pavel Begunkov wrote:
>>>>>> On 17/11/2020 06:17, Xiaoguang Wang wrote:
>>>>>>> In io_file_get() and io_put_file(), we currently use percpu_ref_get() and
>>>>>>> percpu_ref_put() for registered files, but it's hard to say they're very
>>>>>>> light-weight synchronization primitives. On one of our x86 machines, I get
>>>>>>> the perf data below (registered files enabled):
>>>>>>> Samples: 480K of event 'cycles', Event count (approx.): 298552867297
>>>>>>>   Overhead  Command  Shared Object     Symbol
>>>>>>>      0.45%  :53243   [kernel.vmlinux]  [k] io_file_get
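
For anyone skimming the thread, a quick recap of what those get/put cost: in
percpu mode it is a mode check plus a per-CPU increment, and only after the
ref has been switched to atomic mode does it fall back to a shared atomic.
A simplified sketch of the fast path, not the exact linux/percpu-refcount.h
source:

static inline void sketch_percpu_ref_get(struct percpu_ref *ref)
{
	unsigned long __percpu *percpu_count;

	rcu_read_lock();
	if (__ref_is_percpu(ref, &percpu_count))	/* still in percpu mode? */
		this_cpu_inc(*percpu_count);		/* cheap, but not free */
	else
		atomic_long_inc(&ref->data->count);	/* &ref->count before 2b0d3d3e4fcf */
	rcu_read_unlock();
}

So it is cheap, but every io_file_get()/io_put_file() pair still pays for the
branch, the rcu_read_lock()/unlock() markers and two this_cpu accesses.
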
>>>>>> Do you have throughput/latency numbers? In my experience with polling, for
>>>>>> such small overheads all the CPU cycles you win earlier in the stack will
>>>>>> just be burned on polling, because it would still wait the same fixed*
>>>>>> time for the next response from the device. fixed* here means post-factum,
>>>>>> but still mostly independent of how your host machine behaves.
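
To make the discussion concrete, the workload in question is a polled read on
a registered file, along the lines of the liburing sketch below. The device
path, sizes and loop count are made up for illustration; this is not the
benchmark behind the numbers above, and error handling is omitted:

#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_params p = { .flags = IORING_SETUP_IOPOLL };
	static char buf[4096] __attribute__((aligned(4096)));
	int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);	/* example device */

	io_uring_queue_init_params(64, &ring, &p);
	io_uring_register_files(&ring, &fd, 1);	/* the table the percpu_ref guards */

	for (int i = 0; i < 1000000; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
		struct io_uring_cqe *cqe;

		/* 0 is an index into the registered file table, not an fd */
		io_uring_prep_read(sqe, 0, buf, sizeof(buf), 0);
		sqe->flags |= IOSQE_FIXED_FILE;
		io_uring_submit(&ring);
		io_uring_wait_cqe(&ring, &cqe);	/* IOPOLL: kernel polls, no IRQ wait */
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	return 0;
}

Each pass through that loop does one io_file_get()/io_put_file() pair in the
kernel, which is where the io_file_get overhead above comes from.
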
>>>>> That's only true if you can max out the device with a single core.
>>>>> Freeing any cycles directly translates into a performance win otherwise,
>>>>> if your device isn't the bottleneck. For the high performance testing [...]
>>>> Agree, that's what happens if a host can't keep up with a device, or e.g.
>>> Right, and it's a direct measure of the efficiency. Moving cycles _to_
>>> polling is a good thing! It means that the rest of the stack got more
>> Absolutely, but the patch makes the code a bit more complex and adds some
>> overhead to the non-iopoll path; definitely not huge, but the shown overhead
>> reduction (i.e. 0.20%, from 0.45% down to 0.25%) doesn't do much either.
>> Compared with the remaining 0.25%, it costs just a couple of instructions.
>> And that's why I wanted to see if there is any real visible impact.
> Definitely, it's always a tradeoff between the size of the win and
> complexity and other factors. Especially adding to io_kiocb is a big
> negative in my book.
>>> efficient. And if the device is fast enough, then that'll directly
>>> result in higher peak IOPS and lower latencies.
>>>> in case 2. of my other reply. Why don't you mention throwing many cores
>>>> at a single SSD with many (poll) queues?
>>> Not really relevant imho; you can obviously always increase performance
>>> if you are core limited by utilizing multiple cores.
>>> I haven't tested these patches yet, will try and see if I get some time
>>> to do so tomorrow.
>> Great
> Ran it through the polled testing which is core limited, and I didn't
> see any changes...
Jens and Pavel, sorry for the noise...
I also ran some tests today. On an upstream kernel I don't see any changes
either, but on our internal 4.19 kernel I get a steady IOPS improvement of
about 1%; note that our kernel doesn't include Ming Lei's patch
"2b0d3d3e4fcf percpu_ref: reduce memory footprint of percpu_ref in fast path".
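
If I read that commit correctly, it shrinks the part of struct percpu_ref that
every user embeds by moving the slow-path-only fields behind a pointer, roughly
as in the sketch below (simplified from my reading of the commit, not the
verbatim upstream definitions):

/* cold: only needed when killing/switching/releasing the ref */
struct percpu_ref_data {
	atomic_long_t		count;		/* atomic-mode refcount */
	percpu_ref_func_t	*release;
	percpu_ref_func_t	*confirm_switch;
	bool			force_atomic:1;
	bool			allow_reinit:1;
	struct rcu_head		rcu;
	struct percpu_ref	*ref;		/* back-pointer */
};

/* hot: the two words the get/put fast path actually touches */
struct percpu_ref {
	unsigned long		percpu_count_ptr;	/* percpu counter | mode flags */
	struct percpu_ref_data	*data;
};

That may explain part of why the fast path behaves differently on our 4.19
kernel than on upstream.
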
Regards,
Xiaoguang Wang