I was looking at the overhead of drmIoctl() in a microbenchmark that
repeatedly did a copy_from_user(.size=8) followed by a
copy_to_user(.size=8) as part of DRM_IOCTL_I915_GEM_BUSY. I found that
if I force-inlined the get_user/put_user instead, the walltime of the
ioctl improved by about 20%. Using copy_user_generic_unrolled instead
of copy_user_enhanced_fast_string improved the microbenchmark by 10%.

Benchmarking on a few machines:

(Broadwell)
benchmark_copy_user(hot):
  size  unrolled  string  fast-string
     1       158      77           79
     2       306     154          158
     4       614     308          317
     6       926     462          476
     8      1344     298          635
    12      1773     482          952
    16      2797     602         1269
    24      4020     903         1906
    32      5055    1204         2540
    48      6150    1806         3810
    64      9564    2409         5082
    96     13583    3612         6483
   128     18108    4815         8434

(Broxton)
benchmark_copy_user(hot):
  size  unrolled  string  fast-string
     1       270      52           53
     2       364     106          109
     4       460     213          218
     6       486     305          312
     8      1250     253          437
    12      1009     332          625
    16      2059     514          897
    24      2624     672         1071
    32      3043    1014         1750
    48      3620    1499         2561
    64      7777    1971         3333
    96      7499    2876         4772
   128      9999    3733         6088

which says that for this cache-hot case the rep mov microcode
noticeably underperforms. Though once we pass a few cachelines, and
definitely after exceeding L1 cache, rep mov is the clear winner. From
cold, there is no difference in timings.

I can improve the microbenchmark by either force-inlining the
raw_copy_*_user switches, or by switching to
copy_user_generic_unrolled. Both leave a sour taste: the switch is too
big to be inlined, and if called out of line the function call
overhead negates its benefits; switching between fast-string and
unrolled makes a presumption on behaviour. In the end, I limited this
series to just adding a few extra translations for statically known
copy_*_user() (illustrative sketches of the benchmark loop and of the
idea are appended below).
-Chris
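
For reference, a minimal sketch of the userspace side of such a
microbenchmark, assuming a timed loop around DRM_IOCTL_I915_GEM_BUSY;
the device path, loop count and buffer setup here are illustrative
assumptions, not the exact harness:

/* Time a tight loop of DRM_IOCTL_I915_GEM_BUSY, whose argument is an
 * 8-byte struct that drm_ioctl() copies in and back out.
 * Build against the libdrm headers, e.g.
 *   gcc -I/usr/include/libdrm busy.c -o busy
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <time.h>
#include <drm/i915_drm.h>

#define LOOPS 1000000

int main(void)
{
	struct drm_i915_gem_create create;
	struct drm_i915_gem_busy busy;
	struct timespec start, end;
	double elapsed;
	int fd, i;

	fd = open("/dev/dri/card0", O_RDWR); /* assumed device node */
	if (fd < 0)
		return 1;

	/* Create a small bo so we have a valid handle to query. */
	memset(&create, 0, sizeof(create));
	create.size = 4096;
	if (ioctl(fd, DRM_IOCTL_I915_GEM_CREATE, &create))
		return 1;

	memset(&busy, 0, sizeof(busy));
	busy.handle = create.handle;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < LOOPS; i++)
		ioctl(fd, DRM_IOCTL_I915_GEM_BUSY, &busy);
	clock_gettime(CLOCK_MONOTONIC, &end);

	elapsed = (end.tv_sec - start.tv_sec) * 1e9 +
		  (end.tv_nsec - start.tv_nsec);
	printf("%.1f ns per DRM_IOCTL_I915_GEM_BUSY\n", elapsed / LOOPS);
	return 0;
}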
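
And a sketch of the idea behind translating statically known
copy_*_user() sizes; this is illustrative only, not the actual
patches, and copy_from_user_const() is a hypothetical helper name:

/* Illustrative sketch: when the compiler can see the size is a small
 * constant, the copy can be expanded inline as a get_user()-style
 * move rather than calling out to the copy_user_generic()
 * alternatives. */
static __always_inline unsigned long
copy_from_user_const(void *dst, const void __user *src, unsigned long size)
{
	if (__builtin_constant_p(size) && size == 8) {
		u64 v;

		/* Mirror copy_from_user()'s return convention:
		 * report the number of bytes not copied. */
		if (get_user(v, (const u64 __user *)src))
			return size;
		*(u64 *)dst = v;
		return 0;
	}

	/* Fall back to the usual (potentially out-of-line) path. */
	return copy_from_user(dst, src, size);
}

With a constant size visible at the call site, as in drm_ioctl()'s
fixed-size argument copy, the 8-byte case should then compile down to
a couple of movs (plus the SMAP bookkeeping) instead of a call.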