On 5/18/22 9:22 AM, Eric Wheeler wrote:
On Tue, 10 May 2022, Adriano Silva wrote:
I'm trying to set up an NVMe flash disk as a cache for two or three
separate spinning disks (I will use 2TB disks, but in these tests I used
a 1TB one) that I have on a Linux 5.4.174 system (a Proxmox node).
Coly has been adding quite a few optimizations over the years. You might
try a new kernel and see if that helps. More below.
Yes, the latest stable kernel is preferred. A Linux 5.4 based kernel is
stable enough for bcache, but it is still better to use the latest
stable kernel.
I'm using an NVMe device (a 960GB datacenter drive with tantalum
capacitors) as the cache.
[...]
But when I run the same test on bcache in writeback mode, performance
drops a lot. Of course, it's still better than the spinning disks, but
much worse than when the NVMe device is accessed directly.
[...]
As we can see, the same test done on the bcache0 device only reached
1548 IOPS, which yielded only 6.3 MB/s.
Well done on the benchmarking! I always thought our new NVMes performed
slower than expected but hadn't gotten around to investigating.
I've noticed in several tests, varying the number of jobs or increasing
the block size, that the larger the block size, the closer the bcache
device's performance gets to that of the physical device.
You said "blocks" but did you mean bucket size (make-bcache -b) or block
size (make-bcache -w) ?
If larger buckets makes it slower than that actually surprises me: bigger
buckets means less metadata and better sequential writeback to the
spinning disks (though you hadn't yet hit writeback to spinning disks in
your stats). Maybe you already tried, but varying the bucket size might
help. Try graphing bucket size (powers of 2) against IOPS, maybe there is
a "sweet spot"?
Be aware that a 4k sector size (so-called "4Kn") is unsafe for the cache
device, unless Coly has patched that. Make sure `blockdev --getss` reports
512 for your NVMe!
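A quick way to check (the device name is a placeholder):

# Logical sector size of the cache device; 512 is what you want,
# 4096 would indicate a 4Kn-formatted namespace.
blockdev --getss /dev/nvme0n1
# nvme-cli can also list the LBA formats the namespace supports:
nvme id-ns -H /dev/nvme0n1 | grep "LBA Format"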
Hi Coly,
Some time ago you ordered an SSD to test the 4k cache issue; has that
been fixed? I've kept an eye out for the patch but I'm not sure if it
was released.
Yes, I got the Intel P3700 PCIe SSD to fix the 4Kn unaligned I/O issue
(borrowed from a hardware vendor). The situation now is that current
kernels do the sector-size alignment check much earlier, in the bio
layer: if an LBA is not sector-size aligned, the bio is rejected there
and the underlying driver never gets to see it. So for now, unaligned
LBAs for a 4Kn device cannot reach the bcache code; that is to say, the
originally reported condition won't happen any more.
After this observation I stopped my investigation into unaligned
sector-size I/O on 4Kn devices and returned the P3700 PCIe SSD to the
hardware vendor.
You have a really great test rig setup with NVMes for stress
testing bcache. Can you replicate Adriano's `ioping` numbers below?
I tried a similar operation. Yes, it should be a bit slower than raw
device access, but it should not be as slow as that...
Here are my fio single-thread fsync performance numbers:
job0: (groupid=0, jobs=1): err= 0: pid=3370: Mon May 23 16:17:05 2022
  write: IOPS=20.9k, BW=81.8MiB/s (85.8MB/s)(17.3GiB/216718msec); 0 zone resets
   bw (  KiB/s): min=75904, max=86872, per=100.00%, avg=83814.21, stdev=1321.04, samples=433
   iops        : min=18976, max=21718, avg=20953.56, stdev=330.27, samples=433
  lat (usec)   : 2=0.01%, 10=0.01%, 20=97.34%, 50=1.71%, 100=0.47%
  lat (usec)   : 250=0.42%, 500=0.01%, 750=0.01%, 1000=0.02%
  lat (msec)   : 2=0.02%, 4=0.01%
Most of the write I/Os finished in 20us. Compared to that, 100-250us is
too slow, which is beyond my expectation. Something is not working
properly.
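For reference, the numbers above look like a single-job, queue-depth-1,
4k random write with an fsync after every write; a command along these
lines should produce that kind of run (a reconstruction, the parameters
are my guess rather than the exact invocation):

fio --name=job0 --filename=/dev/bcache0 --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 --fsync=1 --size=18G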
With ioping it is also possible to notice a limitation: the latency of
the bcache0 device is around 1.5ms, while the same test on the raw
device (an NVMe partition) shows only 82.1us.
root@pve-20:~# ioping -c10 /dev/bcache0 -D -Y -WWW -s4k
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=1 time=1.52 ms (warmup)
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=2 time=1.60 ms
4 KiB >>> /dev/bcache0 (block device 931.5 GiB): request=3 time=1.55 ms
root@pve-20:~# ioping -c10 /dev/nvme0n1p2 -D -Y -WWW -s4k
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=1 time=81.2 us (warmup)
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=2 time=82.7 us
4 KiB >>> /dev/nvme0n1p2 (block device 300 GiB): request=3 time=82.4 us
Wow, almost 20x higher latency, sounds convincing that something is wrong.
A few things to try:
1. Try ioping without -Y. How does it compare?
2. Maybe this is an inter-socket latency issue. Is your server
multi-socket? If so, then as a first pass you could set the kernel
cmdline `isolcpus` for testing to limit all processes to a single
socket where the NVMe is connected (see `lscpu`). Check `hwloc-ls`
or your motherboard manual to see how the NVMe port is wired to your
CPUs.
If that helps then fine tune with `numactl -N <node> ioping` and
/proc/irq/<n>/smp_affinity_list (and `grep nvme /proc/interrupts`) to
make sure your NVMe IRQs are locked to the same socket.
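Roughly, assuming node 0 and placeholder device/IRQ names:

# Which NUMA node the NVMe's PCIe slot hangs off:
cat /sys/class/nvme/nvme0/device/numa_node
# Run the benchmark bound to that node's CPUs and memory:
numactl -N 0 -m 0 ioping -c10 /dev/bcache0 -D -Y -WWW -s4k
# Find the NVMe IRQ numbers, then pin each one to CPUs on the same node
# (this may be read-only on kernels that use managed IRQ affinity for NVMe):
grep nvme /proc/interrupts
echo 0-15 > /proc/irq/<n>/smp_affinity_list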
Wow, this is too slow...
Here are my performance numbers:
# ./ioping -c10 /dev/bcache0 -D -Y -WWW -s4k
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=1 time=144.3 us (warmup)
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=2 time=84.1 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=3 time=71.8 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=4 time=68.9 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=5 time=69.8 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=6 time=68.7 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=7 time=68.8 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=8 time=70.3 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=9 time=68.8 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=10 time=68.5 us
# ./ioping -c10 /dev/bcache0 -D -WWW -s4k
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=1 time=127.8 us (warmup)
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=2 time=67.8 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=3 time=60.3 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=4 time=46.9 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=5 time=52.6 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=6 time=43.8 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=7 time=52.7 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=8 time=44.3 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=9 time=52.0 us
4 KiB >>> /dev/bcache0 (block device 3.49 TiB): request=10 time=44.6 us
1.5ms is really far from my expectation; there must be something wrong...
[snipped]
Someone correct me if I'm wrong, but I don't think flush_journal=0 will
affect correctness unless there is a crash. If that /is/ the performance
problem then it would narrow the scope of this discussion.
4. I wonder if your 1.5ms `ioping` stats scale with CPU clock speed: can
you set your CPU governor to run at full clock speed and then slowest
clock speed to see if it is a CPU limit somewhere as we expect?
You can do `grep MHz /proc/cpuinfo` to see the active rate to make sure
the governor did its job.
If it scales with CPU then something in bcache is working too hard.
Maybe garbage collection? Other devs would need to chime in here to
steer the troubleshooting if that is the case.
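One quick way to flip the governor for that test (a sketch; it assumes
the cpupower tool is installed and the usual cpufreq interface is in use):

cpupower frequency-set -g performance   # full clock speed, then re-run ioping
cpupower frequency-set -g powersave     # lowest speed, re-run ioping again
grep MHz /proc/cpuinfo                  # confirm the governor actually took effect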
Maybe system memory is small? 1.5ms is too slow; I cannot imagine how it
can be so slow...
5. I'm not sure if garbage collection is the issue, but you might try
Mingzhe's dynamic incremental gc patch:
https://www.spinics.net/lists/linux-bcache/msg11185.html
6. Try dm-cache and see if its IO latency is similar to bcache: If it is
about the same then that would indicate an issue in the block layer
somewhere outside of bcache. If dm-cache is better, then that confirms
a bcache issue.
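A minimal way to stand up dm-cache for that comparison is through
lvmcache; a sketch, with placeholder device names and sizes:

# Backing LV on the spinning disk, cache LV on the NVMe partition.
pvcreate /dev/sda /dev/nvme0n1p1
vgcreate vgtest /dev/sda /dev/nvme0n1p1
lvcreate -n slow -l 100%PVS vgtest /dev/sda
lvcreate -n fast -L 200G vgtest /dev/nvme0n1p1
# Attach the fast LV as a writeback dm-cache in front of the slow LV.
lvconvert --type cache --cachevol fast --cachemode writeback vgtest/slow
# Then run the same test that was run against bcache0:
ioping -c10 /dev/vgtest/slow -D -Y -WWW -s4k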
Great idea.
The cache was configured directly on one of the NVMe partitions (in this
case, the first partition). I did several tests using fio and ioping: on
a partition of the NVMe device, on the raw block device without any
partition, on the first partition, on the second, with and without
bcache configured. I did all this to remove any doubt about the method.
The results of tests performed directly on the hardware device, without
going through bcache, are always fast and similar.
What are the performance numbers on the whole NVMe disk, without a
partition? The partition's start LBA might not be perfectly aligned to
some boundary...
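To check the alignment (a sketch; adjust the device and partition names):

# Partition start in 512-byte sectors; a multiple of 2048 (1MiB) is well aligned.
cat /sys/block/nvme0n1/nvme0n1p2/start
parted /dev/nvme0n1 align-check optimal 2
# Note: repeating the ioping write test (-WWW) against the whole,
# unpartitioned disk is destructive to anything stored on it.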
Could you share the hardware configuration and the NVMe SSD spec? Maybe
I can find a similar one around my location and give it a try if I am
lucky.
Thanks.
Coly Li