On 17/11/2020 16:30, Jens Axboe wrote:
> On 11/17/20 3:43 AM, Pavel Begunkov wrote:
>> On 17/11/2020 06:17, Xiaoguang Wang wrote:
>>> In io_file_get() and io_put_file(), currently we use percpu_ref_get() and
>>> percpu_ref_put() for registered files, but it's hard to say they're very
>>> light-weight synchronization primitives. On one of our x86 machines, I get
>>> the perf data below (registered files enabled):
>>>
>>>   Samples: 480K of event 'cycles', Event count (approx.): 298552867297
>>>   Overhead  Command  Shared Object     Symbol
>>>     0.45%   :53243   [kernel.vmlinux]  [k] io_file_get
>>
>> Do you have throughput/latency numbers? In my experience with polling,
>> for such small overheads all the CPU cycles you win earlier in the stack
>> are just burned on polling, because it still waits the same fixed* time
>> for the next response from the device. fixed* here means post factum,
>> but still mostly independent of how your host machine behaves.
>
> That's only true if you can max out the device with a single core.
> Otherwise, freeing any cycles translates directly into a performance
> win, if your device isn't the bottleneck.

Agreed, that's what happens when the host can't keep up with the device,
or e.g. in case 2 of my other reply. Why not also mention throwing many
cores at a single SSD with many poll queues?

> For the high performance testing I've done, the actual polling isn't
> the bottleneck, it's the rest of the stack.

-- 
Pavel Begunkov
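
For reference, below is a minimal sketch of the pattern Xiaoguang describes:
one percpu_ref_get() per fixed-file lookup at submission, and a matching
percpu_ref_put() on completion. This is not the actual io_uring code; the
struct layout and the my_file_get()/my_file_put() helpers are hypothetical,
simplified stand-ins.

/*
 * Sketch only, assuming a simplified registered-file table pinned by
 * a single percpu_ref; names here are made up for illustration.
 */
#include <linux/percpu-refcount.h>
#include <linux/nospec.h>
#include <linux/fs.h>

struct my_file_table {
	struct percpu_ref	refs;		/* pins the whole table */
	unsigned int		nr_files;
	struct file		**files;
};

static struct file *my_file_get(struct my_file_table *table, unsigned int fd)
{
	if (unlikely(fd >= table->nr_files))
		return NULL;
	fd = array_index_nospec(fd, table->nr_files);
	/* the per-I/O cost being measured: one get per lookup... */
	percpu_ref_get(&table->refs);
	return table->files[fd];
}

static void my_file_put(struct my_file_table *table)
{
	/* ...and one put when the request completes */
	percpu_ref_put(&table->refs);
}

While the ref stays in percpu mode, percpu_ref_get()/percpu_ref_put() boil
down to this_cpu_inc()/this_cpu_dec(), so each call is cheap in isolation;
the cost is that it is paid on every single I/O, which is presumably what
the 0.45% on io_file_get in the profile above reflects.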