On Thu, 24 Sep 2020 at 02:36, Doug Anderson <dianders@xxxxxxxxxxxx> wrote: > > Hi, > > On Wed, Sep 23, 2020 at 11:22 AM Ard Biesheuvel <ardb@xxxxxxxxxx> wrote: > > > > Currently, we use the jiffies counter as a time source, by staring at > > it until a HZ period elapses, and then staring at it again and perform > > as many XOR operations as we can at the same time until another HZ > > period elapses, so that we can calculate the throughput. This takes > > longer than necessary, and depends on HZ, which is undesirable, since > > HZ is system dependent. > > > > Let's use the ktime interface instead, and use it to time a fixed > > number of XOR operations, which can be done much faster, and makes > > the time spent depend on the performance level of the system itself, > > which is much more reasonable. > > > > On ThunderX2, I get the following results: > > > > Before: > > > > [72625.956765] xor: measuring software checksum speed > > [72625.993104] 8regs : 10169.000 MB/sec > > [72626.033099] 32regs : 12050.000 MB/sec > > [72626.073095] arm64_neon: 11100.000 MB/sec > > [72626.073097] xor: using function: 32regs (12050.000 MB/sec) > > > > After: > > > > [ 2503.189696] xor: measuring software checksum speed > > [ 2503.189896] 8regs : 10556 MB/sec > > [ 2503.190061] 32regs : 12538 MB/sec > > [ 2503.190250] arm64_neon : 11470 MB/sec > > [ 2503.190252] xor: using function: 32regs (12538 MB/sec) > > > > Signed-off-by: Ard Biesheuvel <ardb@xxxxxxxxxx> > > --- > > crypto/xor.c | 36 ++++++++------------ > > 1 file changed, 15 insertions(+), 21 deletions(-) > > > > diff --git a/crypto/xor.c b/crypto/xor.c > > index b42c38343733..23f98b451b69 100644 > > --- a/crypto/xor.c > > +++ b/crypto/xor.c > > @@ -76,49 +76,43 @@ static int __init register_xor_blocks(void) > > } > > #endif > > > > -#define BENCH_SIZE (PAGE_SIZE) > > +#define BENCH_SIZE 4096 > > I'm curious why the change away from PAGE_SIZE to using 4096. > Everything should work OK w/ using PAGE_SIZE, right? > Yes, but then the test will take 16x more time on a 64k page size system for no reason whatsoever. > > > +#define REPS 100 > > Is this sufficient? I'm not sure what the lower bound on what's > expected of ktime. If I'm doing the math right, on your system > running 100 loops took 38802 ns in one case, since: > > (4096 * 1000 * 100) / 10556 = 38802 > > If you happen to have your timer backed by a 32 kHz clock, one tick of > ktime could be as much as 31250 ns, right? Maybe on systems backed > with a 32kHz clock they'll take longer, but it still seems moderately > iffy? I dunno, maybe I'm just being paranoid. > No, that is a good point - I didn't really consider that ktime could be that coarse. OTOH, we don't really need the full 5 digits of precision either, as long as we don't misidentify the fastest algorithm. So I think it should be sufficient to bump this to 800. If my calculations are correct, this would limit any potential misidentification of algorithms performing below 10 GB/s to ones that only deviate in performance up to 10%. 800 * 1000 * 4096 / (10 * 31250) = 10485 800 * 1000 * 4096 / (11 * 31250) = 9532 (10485/9532) / 10485 = 10% > > > static void __init > > do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2) > > { > > int speed; > > - unsigned long now, j; > > - int i, count, max; > > + int i, j, count; > > + ktime_t min, start, diff; > > > > tmpl->next = template_list; > > template_list = tmpl; > > > > preempt_disable(); > > > > - /* > > - * Count the number of XORs done during a whole jiffy, and use > > - * this to calculate the speed of checksumming. We use a 2-page > > - * allocation to have guaranteed color L1-cache layout. > > - */ > > - max = 0; > > + min = (ktime_t)S64_MAX; > > for (i = 0; i < 5; i++) { > > - j = jiffies; > > - count = 0; > > - while ((now = jiffies) == j) > > - cpu_relax(); > > - while (time_before(jiffies, now + 1)) { > > + start = ktime_get(); > > + for (j = 0; j < REPS; j++) { > > mb(); /* prevent loop optimzation */ > > tmpl->do_2(BENCH_SIZE, b1, b2); > > mb(); > > count++; > > mb(); > > } > > - if (count > max) > > - max = count; > > + diff = ktime_sub(ktime_get(), start); > > + if (diff < min) > > + min = diff; > > } > > > > preempt_enable(); > > > > - speed = max * (HZ * BENCH_SIZE / 1024); > > + // bytes/ns == GB/s, multiply by 1000 to get MB/s [not MiB/s] > > Comment is super helpful, thanks! ...but are folks really OK with > "//" comments these days? > Linus said he is fine with it, and even prefers it for single line comments, so I don't think it's a problem > > > + speed = (1000 * REPS * BENCH_SIZE) / (u32)min; > > nit: Just for prettiness, maybe call ktime_to_ns(min)? > > optional nit: I always think of u32 as something for accessing > hardware. Maybe "unsigned int"? > Ack > > > tmpl->speed = speed; > > > > - printk(KERN_INFO " %-10s: %5d.%03d MB/sec\n", tmpl->name, > > - speed / 1000, speed % 1000); > > + printk(KERN_INFO " %-16s: %5d MB/sec\n", tmpl->name, speed); > > Since you're touching, switch to pr_info()? > Ack (x2) > > > } > > > > static int __init > > @@ -158,8 +152,8 @@ calibrate_xor_blocks(void) > > if (f->speed > fastest->speed) > > fastest = f; > > > > - printk(KERN_INFO "xor: using function: %s (%d.%03d MB/sec)\n", > > - fastest->name, fastest->speed / 1000, fastest->speed % 1000); > > + printk(KERN_INFO "xor: using function: %s (%d MB/sec)\n", > > + fastest->name, fastest->speed); > > Since you're touching, switch to pr_info()? > > > -Doug