Re: [RFC] drm: Optimise drm_ioctl() for small user args

Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> · Tue, 30 May 2017 16:24:14 +0100



On Tue, May 30, 2017 at 01:55:20PM +0100, Chris Wilson wrote:
> When looking at simple ioctls coupled with conveniently small user
> parameters, the overhead of the syscall and drm_ioctl() present large
> low hanging fruit. Profiling trivial microbenchmarks around
> i915_gem_busy_ioctl, the low hanging fruit comprises of the call to
> copy_user(). Those calls are only inlined by the macro where the
> constant is known at compile-time, but the ioctl argument size depends
> on the ioctl number. To help the compiler, explicitly add switches for
> the small sizes that expand to simple moves to/from user. Doing the
> multiple inlines does add significant code bloat, so it is very
> debatable as to its value. Back to the trivial, but frequently used,
> example of i915_gem_busy_ioctl() on a Broadwell avoiding the call gives
> us a 15-25% improvement:
> 
> 			 before		  after
> 	single		100.173ns	 84.496ns
> 	parallel (x4)	204.275ns	152.957ns
> 
> On a baby Broxton nuc:
> 
> 			before           after
> 	single		245.355ns	199.477ns
> 	parallel (x2)	280.892ns	232.726ns
> 
> Looking at the cost distribution by moving an equivalent switch into
> arch/x86/lib/usercopy, the overhead to the copy user is split almost
> equally between the function call and the actual copy itself. It seems
> copy_user_enhanced_fast_string simply is not that good at small (single
> register) copies. Something as simple as
> 
> @@ -28,6 +28,9 @@ copy_user_generic(void *to, const void *from, unsigned len)
>  {
>         unsigned ret;
> 
> +       if (len <= 16)
> +               return copy_user_generic_unrolled(to, from, len);
> 
> is enough to speed up i915_gem_busy_ioctl() by 10% :|
> 
> Note that this overhead may entirely be x86 specific.
> 
> Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> Cc: Joonas Lahtinen <joonas.lahtinen@xxxxxxxxxxxxxxx>
> ---
> +	if (in_size) {
> +		if (unlikely(!access_ok(VERIFY_READ, arg, in_size)))
> +			goto err_invalid_user;
> +
> +		switch (in_size) {
> +		case 4:
> +			if (unlikely(__copy_from_user(kdata, (void __user *)arg,
> +						      4)))
> +				goto err_invalid_user;
> +			break;
> +		case 8:
> +			if (unlikely(__copy_from_user(kdata, (void __user *)arg,
> +						      8)))
> +				goto err_invalid_user;
> +			break;
> +		case 16:
> +			if (unlikely(__copy_from_user(kdata, (void __user *)arg,
> +						      16)))
> +				goto err_invalid_user;
> +			break;

For example, currently x86-32 only converts case 4 above. It could
trivially do case 8 as well, but by case 16 it may as well use the
function call to its loop in assembly. And currently x86-32 has no
optimisations for fixed sized puts.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx