Re: [PATCH 5.11 2/2] io_uring: don't take percpu_ref operations for registered files in IOPOLL mode

On 18/11/2020 01:42, Jens Axboe wrote:
> On 11/17/20 9:58 AM, Pavel Begunkov wrote:
>> On 17/11/2020 16:30, Jens Axboe wrote:
>>> On 11/17/20 3:43 AM, Pavel Begunkov wrote:
>>>> On 17/11/2020 06:17, Xiaoguang Wang wrote:
>>>>> In io_file_get() and io_put_file(), currently we use percpu_ref_get() and
>>>>> percpu_ref_put() for registered files, but it's hard to say they're very
>>>>> light-weight synchronization primitives. On one of our x86 machines, I got
>>>>> the following perf data (registered files enabled):
>>>>> Samples: 480K of event 'cycles', Event count (approx.): 298552867297
>>>>> Overhead  Command  Shared Object     Symbol
>>>>>    0.45%  :53243  [kernel.vmlinux]  [k] io_file_get
>>>>
>>>> Do you have throughput/latency numbers? In my experience with polling,
>>>> for such small overheads all the CPU cycles you win earlier in the stack
>>>> are just burned on polling, because you would still wait the same fixed*
>>>> time for the next response from the device. fixed* here means post-factum
>>>> but still mostly independent of how your host machine behaves.
>>>
>>> That's only true if you can max out the device with a single core.
>>> Freeing any cycles directly translates into a performance win otherwise,
>>> if your device isn't the bottleneck. For the high performance testing
>>
>> Agree, that's what happens if a host can't keep up with a device, or e.g.
> 
> Right, and it's a direct measure of efficiency. Moving cycles _to_
> polling is a good thing! It means that the rest of the stack got more

Absolutely, but the patch makes the code a bit more complex and adds some
overhead to the non-IOPOLL path. Definitely not huge, but the shown overhead
reduction (i.e. 0.20%) doesn't do much either; compared with the remaining
0.25%, that's just a couple of instructions.

And that's why I wanted to see if there is any real visible impact.
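
For reference, a minimal sketch of the idea being discussed, i.e. skipping
the per-file percpu_ref get for fixed files under IOPOLL. This is not the
actual patch; the function shape loosely follows the 5.11-era io_file_get(),
and the helper and field names (io_file_from_index(), ctx->file_data->refs)
are assumptions for illustration:

static struct file *io_file_get(struct io_ring_ctx *ctx,
				struct io_kiocb *req, int fd, bool fixed)
{
	struct file *file;

	if (fixed) {
		/* hypothetical helper for the fixed file table lookup */
		file = io_file_from_index(ctx, fd);
		/*
		 * In IOPOLL mode, submission and completion reaping are
		 * serialized under ctx->uring_lock, so the fixed file
		 * table cannot be unregistered while this request is in
		 * flight and the percpu_ref get/put pair can be elided.
		 */
		if (file && !(ctx->flags & IORING_SETUP_IOPOLL))
			percpu_ref_get(&ctx->file_data->refs);
	} else {
		file = fget(fd);
	}
	return file;
}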

> efficient. And if the device is fast enough, then that'll directly
> result in higher peak IOPS and lower latencies.
> 
>> in case 2. of my other reply. Why don't you mention throwing many cores
>> at a single SSD with many poll queues?
> 
> Not really relevant imho, if you are core limited you can obviously always
> increase performance by utilizing multiple cores.
> 
> I haven't tested these patches yet, will try and see if I get some time
> to do so tomorrow.

Great

-- 
Pavel Begunkov


