On 11/18/20 6:59 AM, Pavel Begunkov wrote:
> On 18/11/2020 01:42, Jens Axboe wrote:
>> On 11/17/20 9:58 AM, Pavel Begunkov wrote:
>>> On 17/11/2020 16:30, Jens Axboe wrote:
>>>> On 11/17/20 3:43 AM, Pavel Begunkov wrote:
>>>>> On 17/11/2020 06:17, Xiaoguang Wang wrote:
>>>>>> In io_file_get() and io_put_file(), we currently use percpu_ref_get() and
>>>>>> percpu_ref_put() for registered files, but it's hard to say they're very
>>>>>> light-weight synchronization primitives. On one of our x86 machines, I get
>>>>>> the perf data below (registered files enabled):
>>>>>>
>>>>>>   Samples: 480K of event 'cycles', Event count (approx.): 298552867297
>>>>>>   Overhead  Command  Shared Object      Symbol
>>>>>>     0.45%   :53243   [kernel.vmlinux]   [k] io_file_get
>>>>>
>>>>> Do you have throughput/latency numbers? In my experience with polling, for
>>>>> such small overheads all the CPU cycles you win earlier in the stack will
>>>>> just be burned on polling, because it would still wait the same fixed*
>>>>> time for the next response from the device. fixed* here means measured
>>>>> post-factum, but still mostly independent of how your host machine behaves.
>>>>
>>>> That's only true if you can max out the device with a single core.
>>>> Otherwise, freeing any cycles translates directly into a performance win,
>>>> if your device isn't the bottleneck. For the high performance testing
>>>
>>> Agree, that's what happens if a host can't keep up with a device, or e.g.
>>
>> Right, and it's a direct measure of the efficiency. Moving cycles _to_
>> polling is a good thing! It means that the rest of the stack got more
>
> Absolutely, but the patch makes the code a bit more complex and adds some
> overhead for the non-iopoll path; definitely not huge, but the shown
> overhead reduction (i.e. 0.20%) doesn't do much either. Compared with the
> remaining 0.25%, it costs just a couple of instructions.
>
> And that's why I wanted to see if there is any real, visible impact.

Definitely, it's always a tradeoff between the size of the win, the
complexity, and other factors. Adding to io_kiocb in particular is a big
negative in my book.

>> efficient. And if the device is fast enough, then that'll directly
>> result in higher peak IOPS and lower latencies.
>>
>>> in case 2. of my other reply. Why don't you mention throwing many cores
>>> at a single many-(poll-)queue SSD?
>>
>> Not really relevant imho, you can obviously always increase performance
>> if you are core-limited by utilizing multiple cores.
>>
>> I haven't tested these patches yet, will try and see if I get some time
>> to do so tomorrow.
>
> Great

Ran it through the polled testing, which is core limited, and I didn't see
any changes...

-- 
Jens Axboe
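
For context, a minimal sketch of the reference pattern the perf data above is
measuring: taking and dropping a percpu ref around a registered (fixed) file
lookup. The struct and function names below are hypothetical and simplified,
this is not the actual fs/io_uring.c code, and it only builds in a kernel
context:

/*
 * Simplified sketch of the registered-file fast path being profiled above.
 * Hypothetical names; not the real fs/io_uring.c implementation.
 */
#include <linux/fs.h>
#include <linux/percpu-refcount.h>

struct fixed_table_sketch {
	struct file **files;		/* filled at file registration time */
	struct percpu_ref refs;		/* pins the table while requests are in flight */
};

/* Look up a registered file; the percpu_ref_get() is the cost seen in perf. */
static struct file *sketch_file_get(struct fixed_table_sketch *table, int index)
{
	struct file *file = table->files[index];

	if (file)
		percpu_ref_get(&table->refs);
	return file;
}

/* Matching put when the request completes. */
static void sketch_file_put(struct fixed_table_sketch *table)
{
	percpu_ref_put(&table->refs);
}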
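
And, assuming liburing is installed, a minimal userspace sketch of the kind
of setup the polled testing implies: a file registered with
io_uring_register_files() and read through a ring created with
IORING_SETUP_IOPOLL. The device path, buffer size, and trimmed error handling
are placeholders for illustration:

/* Minimal polled-read sketch: registered file + IORING_SETUP_IOPOLL.
 * Illustrative only; device path and sizes are placeholders.
 * Build (with liburing installed): gcc -o iopoll_sketch iopoll_sketch.c -luring
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	void *buf;
	int fd;

	/* IOPOLL requires O_DIRECT and a device/filesystem that supports polled I/O. */
	fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (io_uring_queue_init(8, &ring, IORING_SETUP_IOPOLL) < 0)
		return 1;

	/* Register the fd; requests then refer to it by index via IOSQE_FIXED_FILE. */
	if (io_uring_register_files(&ring, &fd, 1) < 0)
		return 1;

	if (posix_memalign(&buf, 4096, 4096))
		return 1;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, 0, buf, 4096, 0);	/* 0 = index into registered files */
	sqe->flags |= IOSQE_FIXED_FILE;

	io_uring_submit(&ring);

	/* With IOPOLL, waiting for the CQE actively polls the device for completion. */
	if (io_uring_wait_cqe(&ring, &cqe) == 0) {
		printf("read returned %d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	free(buf);
	close(fd);
	return 0;
}

With IOPOLL the completion wait actively polls the device rather than sleeping
on an interrupt, which is why cycles freed elsewhere in the stack only show up
as extra IOPS when a single core isn't already saturating the device.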