Hi, everyone,

I did some benchmarks of Kyber on 4.12 that I wanted to share. If anyone
else has done any testing, I'd love to see the results.

== Latency

Kyber's basic function is controlling latency, so the first benchmark I did
was to measure the latency of a mixed workload. When idle, the NVMe device
I tested on has a p99.99 of 150 microseconds for 4k reads and 30
microseconds for 4k writes. I ran the following fio job, where
/dev/nvme0n1p{1,2} are 16 GB partitions completely overwritten before
running the test:

[global]
direct=1
runtime=10s
time_based

[writers]
filename=/dev/nvme0n1p1
rw=randwrite
numjobs=100
group_reporting=1

[reader]
filename=/dev/nvme0n1p2
ioengine=sync
rw=randread
io_submit_mode=offload

This test simulates a single latency-sensitive reader contending with many
writers, so I tweaked the scheduler settings to favor reads over writes:
for Kyber, I used a 1 ms read target latency instead of the default 2 ms,
and for deadline I used a 1 ms read expiry.

read latency percentiles (usec) |   1 |   5 |  10 |  20 |  30 |  40 |  50 |   60 |   70 |   80 |   90 |   95 |   99 | 99.5 | 99.9 | 99.95 | 99.99
--------------------------------+-----+-----+-----+-----+-----+-----+-----+------+------+------+------+------+------+------+------+-------+------
none                            |  99 | 161 | 223 | 338 | 438 | 572 | 764 | 1012 | 1384 | 1992 | 2640 | 2960 | 3344 | 3504 | 4960 |  5792 |  6816
kyber                           |  75 |  83 |  85 |  87 |  92 | 101 | 103 |  107 |  181 |  270 |  948 | 1928 | 2800 | 2928 | 3120 |  3216 |  5664
mq-deadline                     | 169 | 215 | 266 | 358 | 446 | 596 | 796 | 1048 | 1448 | 2024 | 2704 | 3024 | 3376 | 3504 | 5472 |  6496 |  7712

As you can see, Kyber is more effective at managing read latencies. With
this configuration, of course, Kyber optimizes reads at the expense of
writes: the write p99 goes from around 3 ms to 6 ms, since we're using the
default 10 ms write latency target here. The highest percentiles still
don't look great, but we are at the mercy of flash here. To iron these out,
we'll need help from the hardware, like the NVMe read determinism work.

== Scalability

On CPU scalability, Kyber is a clear win over mq-deadline. To test that, I
used my blk_scale.py script [1], which basically runs the following fio job
with an increasing numjobs and measures total IOPS (a rough sketch of the
loop is included after the tables below):

[scale]
filename=$DEV
direct=1
numjobs=$N
cpus_allowed_policy=split
runtime=10
time_based
ioengine=libaio
iodepth=64
rw=randread
unified_rw_reporting=1

I ran this with iostats disabled on an NVMe drive after running blkdiscard
on the whole thing. Kyber easily hits the limit of the device, whereas
mq-deadline falls over with just 2 jobs.

NVMe numjobs vs. IOPS |      1 |      2 |      4 |      8 |     16 |     32 |     56
----------------------+--------+--------+--------+--------+--------+--------+-------
none                  | 329986 | 642121 | 807191 | 807105 | 806531 | 806875 | 803813
kyber                 | 314097 | 588791 | 807213 | 807057 | 806551 | 807753 | 803833
mq-deadline           | 326959 | 375369 | 352587 | 347723 | 350743 | 336972 | 313795

With null-blk (submit_queues=56 queue_mode=2 hw_queue_depth=1024, iostats
disabled), we can see that Kyber does have some overhead, but it can still
easily keep up with real hardware.

null-blk numjobs vs. IOPS |      1 |      2 |       4 |       8 |      16 |       32 |       56
--------------------------+--------+--------+---------+---------+---------+----------+---------
none                      | 496817 | 965295 | 1946658 | 3847158 | 7698758 | 13424482 | 15151692
kyber                     | 441907 | 832978 | 1598153 | 3202248 | 6137827 |  8931286 | 10725823
mq-deadline               | 462503 | 524586 |  378026 |  372034 |  380879 |   360153 |   337560
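
For anyone who doesn't want to grab the script, the measurement loop is
conceptually something like the sketch below. This isn't the actual
blk_scale.py [1], just a minimal standalone approximation: JOB_TEMPLATE and
total_iops are made-up names, the job template drops unified_rw_reporting=1
(the workload is read-only, so it just sums read IOPS from fio's JSON
output), and it hard-codes the same job counts as the tables above.

#!/usr/bin/env python3
# Minimal approximation of the numjobs-scaling loop: run the [scale] job
# with an increasing numjobs and print the total IOPS for each job count.
import json
import subprocess
import sys
import tempfile

# Same job as above, minus unified_rw_reporting: since the workload is
# read-only, we just sum read IOPS across jobs instead.
JOB_TEMPLATE = """\
[scale]
filename={dev}
direct=1
numjobs={numjobs}
cpus_allowed_policy=split
runtime=10
time_based
ioengine=libaio
iodepth=64
rw=randread
"""

def total_iops(dev, numjobs):
    with tempfile.NamedTemporaryFile('w', suffix='.fio') as f:
        f.write(JOB_TEMPLATE.format(dev=dev, numjobs=numjobs))
        f.flush()
        output = subprocess.check_output(['fio', '--output-format=json',
                                          f.name])
    # fio reports one entry per job in the JSON output; add up their read
    # IOPS to get the total.
    return sum(job['read']['iops']
               for job in json.loads(output.decode())['jobs'])

def main():
    dev = sys.argv[1]
    for numjobs in (1, 2, 4, 8, 16, 32, 56):
        print('{}\t{}'.format(numjobs, int(total_iops(dev, numjobs))))

if __name__ == '__main__':
    main()

Generating the job file for each run is just to keep the sketch
self-contained and avoid depending on how $DEV and $N get substituted when
the real script runs the job above.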

1: https://github.com/osandov/osandov-linux/blob/master/scripts/blk_scale.py

== Future Work

The results here are promising, but one thing I haven't tested yet is how
well Kyber reacts to changing workloads. The code hard-codes the window
over which it gathers statistics, which for shorter latency targets might
mean we miss our target for a while before Kyber throttles requests. I'm
happy with the scalability results, because they mean we still have some
headroom to add fancier features.

Thanks!