Good call -- it turns out that the cache issue is resolved in 5.17. I tried a number of kernels and narrowed it down to a problem that started somewhere after 4.9 and by 4.15, and was fixed somewhere after 5.13 and by 5.17. Namely, 4.9 is good, 4.15 is bad, 5.13 is bad, and 5.17 is good. I did not bisect it all the way down to the specific versions where the behavior changed.

Device    r/s      w/s      rkB/s     wkB/s     rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
nvme1n1   2758.00  2783.00  11032.00  11132.00  0.00    0.00    0.00   0.00   0.10     0.03     0.36    4.00      4.00      0.18   100.00
nvme0n1   2830.00  2875.00  11320.00  11500.00  0.00    0.00    0.00   0.00   0.10     0.03     0.39    4.00      4.00      0.18   100.00

With regard to the performance of 4.4.0 versus 5.17: for a single thread, 4.4.0 still performed better than 5.17. However, the 5.17 kernel was significantly better with multiple threads. In fact, it is so much better that I don't believe the results (10x improvement!). Is it to be expected that a single thread would be slower in 5.17, but that recent improvements make it possible to run many of them in parallel more efficiently?

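As a quick sanity check on the ratios, here is a throwaway Python sketch using the READ bandwidths copied from the 16-job fio runs below. The dictionary name, its keys, and the `speedup` helper are made-up labels for illustration only, not anything fio produces.

```python
# READ bandwidth in MiB/s, copied from the fio "Run status" lines
# (16 jobs, 4k randrw). Names and keys are illustrative labels only.
read_bw_mib_s = {
    "4.4.0": 54.5,
    "5.4.0": 23.5,
    "5.17": 244.0,
}


def speedup(new: str, old: str) -> float:
    """Return how many times higher the READ bandwidth of `new` is than `old`."""
    return read_bw_mib_s[new] / read_bw_mib_s[old]


if __name__ == "__main__":
    print(f"5.17 vs 5.4.0:  {speedup('5.17', '5.4.0'):.1f}x")
    print(f"5.17 vs 4.4.0:  {speedup('5.17', '4.4.0'):.1f}x")
    print(f"5.4.0 vs 4.4.0: {speedup('5.4.0', '4.4.0'):.2f}x")
```

By these numbers, 5.17 comes out at roughly 10.4x the bandwidth of 5.4.0 and about 4.5x that of 4.4.0, while 5.4.0 is at about 0.43x of 4.4.0.
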
# /usr/local/bin/fio -name=randrw -filename=/opt/foo -direct=1 -iodepth=1 -thread -rw=randrw -ioengine=psync -bs=4k -size=10G -numjobs=16 -group_reporting=1 -runtime=120

// Ubuntu 16.04 / Linux 4.4.0:
Run status group 0 (all jobs):
   READ: bw=54.5MiB/s (57.1MB/s), 54.5MiB/s-54.5MiB/s (57.1MB/s-57.1MB/s), io=6537MiB (6854MB), run=120002-120002msec
  WRITE: bw=54.5MiB/s (57.2MB/s), 54.5MiB/s-54.5MiB/s (57.2MB/s-57.2MB/s), io=6544MiB (6862MB), run=120002-120002msec

// Ubuntu 18.04 / Linux 5.4.0:
Run status group 0 (all jobs):
   READ: bw=23.5MiB/s (24.7MB/s), 23.5MiB/s-23.5MiB/s (24.7MB/s-24.7MB/s), io=2821MiB (2959MB), run=120002-120002msec
  WRITE: bw=23.5MiB/s (24.6MB/s), 23.5MiB/s-23.5MiB/s (24.6MB/s-24.6MB/s), io=2819MiB (2955MB), run=120002-120002msec

// Ubuntu 18.04 / Linux 5.17:
Run status group 0 (all jobs):
   READ: bw=244MiB/s (255MB/s), 244MiB/s-244MiB/s (255MB/s-255MB/s), io=28.6GiB (30.7GB), run=120001-120001msec
  WRITE: bw=244MiB/s (256MB/s), 244MiB/s-244MiB/s (256MB/s-256MB/s), io=28.6GiB (30.7GB), run=120001-120001msec

Thanks,
Michael