On 11/4/20 8:20 PM, Xiaoguang Wang wrote:
> hi,
>
>> In io_file_get() and io_put_file(), we currently use percpu_ref_get() and
>> percpu_ref_put() for registered files, but it's hard to say they're very
>> light-weight synchronization primitives, especially on the ARM platform.
>> On one of our ARM machines, I get the perf data below (registered files
>> enabled):
>>   Samples: 98K of event 'cycles:ppp', Event count (approx.): 63789396810
>>   Overhead  Command      Shared Object     Symbol
>>   ...
>>      0.78%  io_uring-sq  [kernel.vmlinux]  [k] io_file_get
>> There is an obvious overhead that can not be ignored.
>>
>> Currently I don't see any good and generic solution for this issue, but
>> in IOPOLL mode, given that we can always ensure get/put of registered
>> files happens under uring_lock, we can use a simple and plain u64 counter
>> to synchronize with registered files update operations in
>> __io_sqe_files_update().
>>
>> With this patch, perf data shows:
>>   Samples: 104K of event 'cycles:ppp', Event count (approx.): 67478249890
>>   Overhead  Command      Shared Object     Symbol
>>   ...
>>      0.27%  io_uring-sq  [kernel.vmlinux]  [k] io_file_get
> The above 0.78% => 0.27% improvement was observed on an ARM machine with
> a 4.19 kernel. In the upstream mainline code, since commit 2b0d3d3e4fcf
> ("percpu_ref: reduce memory footprint of percpu_ref in fast path"), I
> believe io_file_get's overhead would be even smaller. I ran the same
> tests on the same machine, and with my patch on top of the upstream
> code, io_file_get's overhead is now 0.44%.
>
> This patch's idea is simple, and it now seems to give only a minor
> performance improvement. Do you have any comments about this patch?
> Should I continue to re-send it?

Can you resend it against for-5.11/io_uring? Looks simple enough to me,
and it's a nice little win.

-- 
Jens Axboe
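
For readers following the thread, here is a minimal userspace sketch of
the counter idea under discussion. This is not the actual patch; the
struct and helper names are hypothetical, and the mutex merely stands in
for uring_lock. It only illustrates why a plain u64 suffices once every
get/put is guaranteed to happen under a single lock, as in IOPOLL mode:

/*
 * Userspace analogue (hypothetical names throughout): when every
 * get/put of a shared table already happens under one lock, a plain
 * u64 in-flight counter can replace a percpu_ref.
 */
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

struct file_table {
	pthread_mutex_t lock;	/* stands in for uring_lock */
	uint64_t inflight;	/* plain counter, no atomics needed */
};

/* Callers must hold table->lock, mirroring the IOPOLL-mode invariant. */
static void table_get(struct file_table *t)
{
	t->inflight++;
}

static void table_put(struct file_table *t)
{
	assert(t->inflight > 0);
	t->inflight--;
}

/* An update is only safe once no request references the old files. */
static int table_update_ok(struct file_table *t)
{
	return t->inflight == 0;
}

int main(void)
{
	struct file_table t = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.inflight = 0,
	};

	pthread_mutex_lock(&t.lock);
	table_get(&t);			/* request takes a file reference */
	printf("update ok? %d\n", table_update_ok(&t));	/* 0: must wait */
	table_put(&t);			/* request completes */
	printf("update ok? %d\n", table_update_ok(&t));	/* 1: safe to swap */
	pthread_mutex_unlock(&t.lock);
	return 0;
}

The point of the trade-off is that increment/decrement of a u64 under an
already-held lock is essentially free, whereas percpu_ref_get()/put()
carry extra indirection and RCU bookkeeping on every fast-path file
access, which is where the 0.78% => 0.27% difference in the quoted perf
data comes from.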