On Tue, May 30, 2017 at 01:55:20PM +0100, Chris Wilson wrote: > When looking at simple ioctls coupled with conveniently small user > parameters, the overhead of the syscall and drm_ioctl() present large > low hanging fruit. Profiling trivial microbenchmarks around > i915_gem_busy_ioctl, the low hanging fruit comprises of the call to > copy_user(). Those calls are only inlined by the macro where the > constant is known at compile-time, but the ioctl argument size depends > on the ioctl number. To help the compiler, explicitly add switches for > the small sizes that expand to simple moves to/from user. Doing the > multiple inlines does add significant code bloat, so it is very > debatable as to its value. Back to the trivial, but frequently used, > example of i915_gem_busy_ioctl() on a Broadwell avoiding the call gives > us a 15-25% improvement: > > before after > single 100.173ns 84.496ns > parallel (x4) 204.275ns 152.957ns > > On a baby Broxton nuc: > > before after > single 245.355ns 199.477ns > parallel (x2) 280.892ns 232.726ns > > Looking at the cost distribution by moving an equivalent switch into > arch/x86/lib/usercopy, the overhead to the copy user is split almost > equally between the function call and the actual copy itself. It seems > copy_user_enhanced_fast_string simply is not that good at small (single > register) copies. Something as simple as > > @@ -28,6 +28,9 @@ copy_user_generic(void *to, const void *from, unsigned len) > { > unsigned ret; > > + if (len <= 16) > + return copy_user_generic_unrolled(to, from, len); > > is enough to speed up i915_gem_busy_ioctl() by 10% :| > > Note that this overhead may entirely be x86 specific. > > Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> > Cc: Joonas Lahtinen <joonas.lahtinen@xxxxxxxxxxxxxxx> > --- > + if (in_size) { > + if (unlikely(!access_ok(VERIFY_READ, arg, in_size))) > + goto err_invalid_user; > + > + switch (in_size) { > + case 4: > + if (unlikely(__copy_from_user(kdata, (void __user *)arg, > + 4))) > + goto err_invalid_user; > + break; > + case 8: > + if (unlikely(__copy_from_user(kdata, (void __user *)arg, > + 8))) > + goto err_invalid_user; > + break; > + case 16: > + if (unlikely(__copy_from_user(kdata, (void __user *)arg, > + 16))) > + goto err_invalid_user; > + break; For example, currently x86-32 only converts case 4 above. It could trivially do case 8 as well, but by case 16 it may as well use the function call to its loop in assembly. And currently x86-32 has no optimisations for fixed sized puts. -Chris -- Chris Wilson, Intel Open Source Technology Centre _______________________________________________ Intel-gfx mailing list Intel-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/intel-gfx