Re: [PATCH 2/2] crypto: xor - use ktime for template benchmarking

Doug Anderson <dianders@xxxxxxxxxxxx> · Thu, 24 Sep 2020 08:28:19 -0700

Hi,

On Thu, Sep 24, 2020 at 1:32 AM Ard Biesheuvel <ardb@xxxxxxxxxx> wrote:
>
> On Thu, 24 Sep 2020 at 02:36, Doug Anderson <dianders@xxxxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > On Wed, Sep 23, 2020 at 11:22 AM Ard Biesheuvel <ardb@xxxxxxxxxx> wrote:
> > >
> > > Currently, we use the jiffies counter as a time source, by staring at
> > > it until a HZ period elapses, and then staring at it again and perform
> > > as many XOR operations as we can at the same time until another HZ
> > > period elapses, so that we can calculate the throughput. This takes
> > > longer than necessary, and depends on HZ, which is undesirable, since
> > > HZ is system dependent.
> > >
> > > Let's use the ktime interface instead, and use it to time a fixed
> > > number of XOR operations, which can be done much faster, and makes
> > > the time spent depend on the performance level of the system itself,
> > > which is much more reasonable.
> > >
> > > On ThunderX2, I get the following results:
> > >
> > > Before:
> > >
> > >   [72625.956765] xor: measuring software checksum speed
> > >   [72625.993104]    8regs     : 10169.000 MB/sec
> > >   [72626.033099]    32regs    : 12050.000 MB/sec
> > >   [72626.073095]    arm64_neon: 11100.000 MB/sec
> > >   [72626.073097] xor: using function: 32regs (12050.000 MB/sec)
> > >
> > > After:
> > >
> > >   [ 2503.189696] xor: measuring software checksum speed
> > >   [ 2503.189896]    8regs           : 10556 MB/sec
> > >   [ 2503.190061]    32regs          : 12538 MB/sec
> > >   [ 2503.190250]    arm64_neon      : 11470 MB/sec
> > >   [ 2503.190252] xor: using function: 32regs (12538 MB/sec)
> > >
> > > Signed-off-by: Ard Biesheuvel <ardb@xxxxxxxxxx>
> > > ---
> > >  crypto/xor.c | 36 ++++++++------------
> > >  1 file changed, 15 insertions(+), 21 deletions(-)
> > >
> > > diff --git a/crypto/xor.c b/crypto/xor.c
> > > index b42c38343733..23f98b451b69 100644
> > > --- a/crypto/xor.c
> > > +++ b/crypto/xor.c
> > > @@ -76,49 +76,43 @@ static int __init register_xor_blocks(void)
> > >  }
> > >  #endif
> > >
> > > -#define BENCH_SIZE (PAGE_SIZE)
> > > +#define BENCH_SIZE     4096
> >
> > I'm curious why the change away from PAGE_SIZE to using 4096.
> > Everything should work OK w/ using PAGE_SIZE, right?
> >
>
> Yes, but then the test will take 16x more time on a 64k page size
> system for no reason whatsoever.

Ah.  I wasn't sure if using PAGE_SIZE as trying to help avoid
measurement inaccuracies or make it work more like the real thing
(maybe the code that calls XOR often does it PAGE_SIZE at a time?)  I
definitely haven't played with it, but I could sorta imagine it making
some small differences.  Maybe the "prefetch" versions of the XOR ops
have some overhead but the overhead is mitigated with larger
operations?  Though both 4K and 64K are probably large enough and
probably it wouldn't affect the outcome of which algorithm is best...

> > > +#define REPS           100
> >
> > Is this sufficient?  I'm not sure what the lower bound on what's
> > expected of ktime.  If I'm doing the math right, on your system
> > running 100 loops took 38802 ns in one case, since:
> >
> > (4096 * 1000 * 100) / 10556 = 38802
> >
> > If you happen to have your timer backed by a 32 kHz clock, one tick of
> > ktime could be as much as 31250 ns, right?  Maybe on systems backed
> > with a 32kHz clock they'll take longer, but it still seems moderately
> > iffy?  I dunno, maybe I'm just being paranoid.
> >
>
> No, that is a good point - I didn't really consider that ktime could
> be that coarse.
>
> OTOH, we don't really need the full 5 digits of precision either, as
> long as we don't misidentify the fastest algorithm.
>
> So I think it should be sufficient to bump this to 800. If my
> calculations are correct, this would limit any potential
> misidentification of algorithms performing below 10 GB/s to ones that
> only deviate in performance up to 10%.
>
> 800 * 1000 * 4096 / (10 * 31250) = 10485
> 800 * 1000 * 4096 / (11 * 31250) = 9532
>
> (10485/9532) / 10485 = 10%

Seems OK to me.  Seems unlikely that super fast machine are going to
have a 32 kHz backed k_time and the worst case is that we'll pick a
slightly sub-optimal xor, I guess.  I assume your goal is to keep
things fitting in a 32-bit unsigned integer?  Looks like if your use
1000 it also fits...

-Doug