On 11/17/20 3:43 AM, Pavel Begunkov wrote:
> On 17/11/2020 06:17, Xiaoguang Wang wrote:
>> In io_file_get() and io_put_file(), currently we use percpu_ref_get() and
>> percpu_ref_put() for registered files, but it's hard to say they're very
>> light-weight synchronization primitives. In one our x86 machine, I get below
>> perf data(registered files enabled):
>> Samples: 480K of event 'cycles', Event count (approx.): 298552867297
>> Overhead  Comman  Shared Object     Symbol
>>    0.45%  :53243  [kernel.vmlinux]  [k] io_file_get
>
> Do you have throughput/latency numbers? In my experience for polling for
> such small overheads all CPU cycles you win earlier in the stack will be
> just burned on polling, because it would still wait for the same fixed*
> time for the next response by device. fixed* here means post-factum but
> still mostly independent of how your host machine behaves.

That's only true if you can max out the device with a single core.
Freeing any cycles directly translates into a performance win otherwise,
if your device isn't the bottleneck. For the high performance testing
I've done, the actual polling isn't the bottleneck, it's the rest of
the stack.

-- 
Jens Axboe
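
As a rough illustration of the device-bound versus CPU-bound argument in
this exchange, here is a minimal sketch of a toy throughput model. It is
not from the thread: the function name achievable_iops() and every number
in it (device limits, clock rate, cycles per I/O) are hypothetical and
chosen only to make the arithmetic visible. Only the 0.45% figure echoes
the perf sample quoted above.

```python
def achievable_iops(device_iops, cores, cpu_hz, cycles_per_io):
    """Toy model: throughput is capped by the slower of device and CPU.

    All parameters are illustrative assumptions, not measurements.
    """
    cpu_limit = cores * cpu_hz / cycles_per_io  # I/Os the CPU side can drive
    return min(device_iops, cpu_limit)


if __name__ == "__main__":
    cpu_hz = 3e9                    # hypothetical 3 GHz core
    before = 2000                   # hypothetical cycles spent per I/O
    after = before * (1 - 0.0045)   # shave the 0.45% seen for io_file_get

    # Case 1: a single core can saturate the device. Saved cycles are
    # spent polling, so throughput stays pinned at the device limit.
    slow_dev = 500_000
    print(achievable_iops(slow_dev, 1, cpu_hz, before),
          achievable_iops(slow_dev, 1, cpu_hz, after))

    # Case 2: the device is faster than one core can drive. Every cycle
    # saved per I/O raises the CPU-side limit and hence the throughput.
    fast_dev = 5_000_000
    print(achievable_iops(fast_dev, 1, cpu_hz, before),
          achievable_iops(fast_dev, 1, cpu_hz, after))
```

In this model the first case matches the point that saved cycles are
burned on polling, and the second matches the reply that freed cycles
translate into a win when the device isn't the bottleneck. Real stacks
are of course not this linear.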