On 17/11/2020 16:30, Jens Axboe wrote:
> On 11/17/20 3:43 AM, Pavel Begunkov wrote:
>> On 17/11/2020 06:17, Xiaoguang Wang wrote:
>>> In io_file_get() and io_put_file(), currently we use percpu_ref_get() and
>>> percpu_ref_put() for registered files, but it's hard to say they're very
>>> light-weight synchronization primitives. On one of our x86 machines, I get
>>> the perf data below (registered files enabled):
>>>
>>>   Samples: 480K of event 'cycles', Event count (approx.): 298552867297
>>>   Overhead  Command  Shared Object     Symbol
>>>     0.45%   :53243   [kernel.vmlinux]  [k] io_file_get
>>
>> Do you have throughput/latency numbers? In my experience with polling,
>> for such small overheads all the CPU cycles you win earlier in the stack
>> are just burned on polling, because it still waits the same fixed* time
>> for the next response from the device. fixed* here means post factum,
>> but still mostly independent of how your host machine behaves.
>
> That's only true if you can max out the device with a single core.
> Otherwise, freeing any cycles translates directly into a performance
> win, if your device isn't the bottleneck.

Agreed, that's what happens when the host can't keep up with the device,
or e.g. in case 2 of my other reply. Why not also mention throwing many
cores at a single SSD with many poll queues?

> For the high performance testing I've done, the actual polling isn't
> the bottleneck, it's the rest of the stack.

-- 
Pavel Begunkov
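
For reference, below is a minimal sketch of the pattern Xiaoguang describes:
one percpu_ref_get() per fixed-file lookup at submission, and a matching
percpu_ref_put() on completion. This is not the actual io_uring code; the
struct layout and the my_file_get()/my_file_put() helpers are hypothetical,
simplified stand-ins.

/*
 * Sketch only, assuming a simplified registered-file table pinned by
 * a single percpu_ref; names here are made up for illustration.
 */
#include <linux/percpu-refcount.h>
#include <linux/nospec.h>
#include <linux/fs.h>

struct my_file_table {
	struct percpu_ref	refs;		/* pins the whole table */
	unsigned int		nr_files;
	struct file		**files;
};

static struct file *my_file_get(struct my_file_table *table, unsigned int fd)
{
	if (unlikely(fd >= table->nr_files))
		return NULL;
	fd = array_index_nospec(fd, table->nr_files);
	/* the per-I/O cost being measured: one get per lookup... */
	percpu_ref_get(&table->refs);
	return table->files[fd];
}

static void my_file_put(struct my_file_table *table)
{
	/* ...and one put when the request completes */
	percpu_ref_put(&table->refs);
}

While the ref stays in percpu mode, percpu_ref_get()/percpu_ref_put() boil
down to this_cpu_inc()/this_cpu_dec(), so each call is cheap in isolation;
the cost is that it is paid on every single I/O, which is presumably what
the 0.45% on io_file_get in the profile above reflects.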