On 18/11/2020 01:42, Jens Axboe wrote: > On 11/17/20 9:58 AM, Pavel Begunkov wrote: >> On 17/11/2020 16:30, Jens Axboe wrote: >>> On 11/17/20 3:43 AM, Pavel Begunkov wrote: >>>> On 17/11/2020 06:17, Xiaoguang Wang wrote: >>>>> In io_file_get() and io_put_file(), currently we use percpu_ref_get() and >>>>> percpu_ref_put() for registered files, but it's hard to say they're very >>>>> light-weight synchronization primitives. In one our x86 machine, I get below >>>>> perf data(registered files enabled): >>>>> Samples: 480K of event 'cycles', Event count (approx.): 298552867297 >>>>> Overhead Comman Shared Object Symbol >>>>> 0.45% :53243 [kernel.vmlinux] [k] io_file_get >>>> >>>> Do you have throughput/latency numbers? In my experience for polling for >>>> such small overheads all CPU cycles you win earlier in the stack will be >>>> just burned on polling, because it would still wait for the same fixed* >>>> time for the next response by device. fixed* here means post-factum but >>>> still mostly independent of how your host machine behaves. >>> >>> That's only true if you can max out the device with a single core. >>> Freeing any cycles directly translate into a performance win otherwise, >>> if your device isn't the bottleneck. For the high performance testing >> >> Agree, that's what happens if a host can't keep up with a device, or e.g. > > Right, and it's a direct measure of the efficiency. Moving cycles _to_ > polling is a good thing! It means that the rest of the stack got more Absolutely, but the patch makes code a bit more complex and adds some overhead for non-iopoll path, definitely not huge, but the showed overhead reduction (i.e. 0.20%) doesn't do much either. Comparing with left 0.25% it costs just a couple of instructions. And that's why I wanted to see if there is any real visible impact. > efficient. And if the device is fast enough, then that'll directly > result in higher peak IOPS and lower latencies. > >> in case 2. of my other reply. Why don't you mention throwing many-cores >> into a single many (poll) queue SSD? > > Not really relevant imho, you can obviously always increase performance > if you are core limited by utilizing multiple cores. > > I haven't tested these patches yet, will try and see if I get some time > to do so tomorrow. Great -- Pavel Begunkov