Re: [PATCH 2/2] crypto: xor - use ktime for template benchmarking

Doug Anderson <dianders@xxxxxxxxxxxx> · Wed, 23 Sep 2020 17:36:40 -0700

Hi,

On Wed, Sep 23, 2020 at 11:22 AM Ard Biesheuvel <ardb@xxxxxxxxxx> wrote:
>
> Currently, we use the jiffies counter as a time source, by staring at
> it until a HZ period elapses, and then staring at it again and perform
> as many XOR operations as we can at the same time until another HZ
> period elapses, so that we can calculate the throughput. This takes
> longer than necessary, and depends on HZ, which is undesirable, since
> HZ is system dependent.
>
> Let's use the ktime interface instead, and use it to time a fixed
> number of XOR operations, which can be done much faster, and makes
> the time spent depend on the performance level of the system itself,
> which is much more reasonable.
>
> On ThunderX2, I get the following results:
>
> Before:
>
>   [72625.956765] xor: measuring software checksum speed
>   [72625.993104]    8regs     : 10169.000 MB/sec
>   [72626.033099]    32regs    : 12050.000 MB/sec
>   [72626.073095]    arm64_neon: 11100.000 MB/sec
>   [72626.073097] xor: using function: 32regs (12050.000 MB/sec)
>
> After:
>
>   [ 2503.189696] xor: measuring software checksum speed
>   [ 2503.189896]    8regs           : 10556 MB/sec
>   [ 2503.190061]    32regs          : 12538 MB/sec
>   [ 2503.190250]    arm64_neon      : 11470 MB/sec
>   [ 2503.190252] xor: using function: 32regs (12538 MB/sec)
>
> Signed-off-by: Ard Biesheuvel <ardb@xxxxxxxxxx>
> ---
>  crypto/xor.c | 36 ++++++++------------
>  1 file changed, 15 insertions(+), 21 deletions(-)
>
> diff --git a/crypto/xor.c b/crypto/xor.c
> index b42c38343733..23f98b451b69 100644
> --- a/crypto/xor.c
> +++ b/crypto/xor.c
> @@ -76,49 +76,43 @@ static int __init register_xor_blocks(void)
>  }
>  #endif
>
> -#define BENCH_SIZE (PAGE_SIZE)
> +#define BENCH_SIZE     4096

I'm curious why the change away from PAGE_SIZE to using 4096.
Everything should work OK w/ using PAGE_SIZE, right?

> +#define REPS           100

Is this sufficient?  I'm not sure what the lower bound on what's
expected of ktime.  If I'm doing the math right, on your system
running 100 loops took 38802 ns in one case, since:

(4096 * 1000 * 100) / 10556 = 38802

If you happen to have your timer backed by a 32 kHz clock, one tick of
ktime could be as much as 31250 ns, right?  Maybe on systems backed
with a 32kHz clock they'll take longer, but it still seems moderately
iffy?  I dunno, maybe I'm just being paranoid.

>  static void __init
>  do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2)
>  {
>         int speed;
> -       unsigned long now, j;
> -       int i, count, max;
> +       int i, j, count;
> +       ktime_t min, start, diff;
>
>         tmpl->next = template_list;
>         template_list = tmpl;
>
>         preempt_disable();
>
> -       /*
> -        * Count the number of XORs done during a whole jiffy, and use
> -        * this to calculate the speed of checksumming.  We use a 2-page
> -        * allocation to have guaranteed color L1-cache layout.
> -        */
> -       max = 0;
> +       min = (ktime_t)S64_MAX;
>         for (i = 0; i < 5; i++) {
> -               j = jiffies;
> -               count = 0;
> -               while ((now = jiffies) == j)
> -                       cpu_relax();
> -               while (time_before(jiffies, now + 1)) {
> +               start = ktime_get();
> +               for (j = 0; j < REPS; j++) {
>                         mb(); /* prevent loop optimzation */
>                         tmpl->do_2(BENCH_SIZE, b1, b2);
>                         mb();
>                         count++;
>                         mb();
>                 }
> -               if (count > max)
> -                       max = count;
> +               diff = ktime_sub(ktime_get(), start);
> +               if (diff < min)
> +                       min = diff;
>         }
>
>         preempt_enable();
>
> -       speed = max * (HZ * BENCH_SIZE / 1024);
> +       // bytes/ns == GB/s, multiply by 1000 to get MB/s [not MiB/s]

Comment is super helpful, thanks!  ...but are folks really OK with
"//" comments these days?

> +       speed = (1000 * REPS * BENCH_SIZE) / (u32)min;

nit: Just for prettiness, maybe call ktime_to_ns(min)?

optional nit: I always think of u32 as something for accessing
hardware.  Maybe "unsigned int"?

>         tmpl->speed = speed;
>
> -       printk(KERN_INFO "   %-10s: %5d.%03d MB/sec\n", tmpl->name,
> -              speed / 1000, speed % 1000);
> +       printk(KERN_INFO "   %-16s: %5d MB/sec\n", tmpl->name, speed);

Since you're touching, switch to pr_info()?

>  }
>
>  static int __init
> @@ -158,8 +152,8 @@ calibrate_xor_blocks(void)
>                 if (f->speed > fastest->speed)
>                         fastest = f;
>
> -       printk(KERN_INFO "xor: using function: %s (%d.%03d MB/sec)\n",
> -              fastest->name, fastest->speed / 1000, fastest->speed % 1000);
> +       printk(KERN_INFO "xor: using function: %s (%d MB/sec)\n",
> +              fastest->name, fastest->speed);

Since you're touching, switch to pr_info()?

-Doug