On Wed, May 11, 2022 at 12:58:48PM +0000, Adriano Silva wrote:
> Thank you for your answer!
>
> > bcache needs to do a lot of metadata work, resulting in a noticeable
> > write amplification. My testing with bcache (some years ago and only
> > with SATA SSDs) showed that bcache latency increases a lot with high
> > amounts of dirty data
>
> I'm testing with empty devices, no data.
>
> Wouldn't write amplification be noticeable in dstat? Because it doesn't
> seem significant during the tests, since I monitor reads and writes on
> all disks in dstat.

Yes, you are right, that would be visible. I was misled by the ~3k
writes to nvme (vs. ~1.5k writes from fio), but the same ~3k writes show
up on bcache as well.

> > I also found performance to increase slightly when a bcache device
> > was created with 4k block size instead of the default 512 bytes.
>
> Are you talking about changing the block size for the cache device or
> the backing device?

Neither - it was the "-w" argument to make-bcache. I found an old
logfile from my tests. Although both hdd and ssd showed up as
512b-sector devices, the command to create the bcache device was

    make-bcache --data_offset 2048 --wipe-bcache -w 4k -C /dev/sde1 -B /dev/sdb

/sys/block/bcacheX/queue/hw_sector_size then reports "4096".

> But when I remove the fsync flag in the test with fio, which tells the
> application to wait for the write response, the 4K write happens much
> faster, reaching 73.6 MB/s and 17k IOPS. This is half the device's
> performance, but it's more than enough for my case. The fsync flag
> makes no significant difference to the performance of my flash disk
> when testing directly on it. The fact that bcache speeds up when the
> fsync flag is removed makes me believe that bcache is not slow to
> write, but for some reason, bcache is taking a while to respond that
> the write is complete. I think that should be the point!

I can't claim to fully understand what fsync does (or how a block
device driver is supposed to handle it), but this might account for the
roughly doubled writes shown by dstat as opposed to the fio results.

From the name "journal-test" I guess you are trying something like
    https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
He uses very similar parameters, except with "--sync=1", not
"--fsync=1" (see the P.S. below for the invocation I mean). This is a
proper benchmark for the old ceph filestore journal, as that journal
was written linearly and, in the worst case, in chunks as small as 4k.

As you are using proxmox, I guess you want to use its ceph component.
That uses the modern ceph bluestore format, which has no journal
anymore. I don't know whether the bluestore WAL exhibits access
patterns similar to the old journal and whether this benchmark still
has real-world relevance.

But if you have enough NVMe disk space, you are advised to put the
bluestore WAL and ideally also the bluestore DB directly on NVMe, and
use bcache only for the bluestore data part. If you do so, make sure to
set rotational=1 on the bcache device before creating the OSD (see the
P.P.S. below for an example), or ceph will use unsuitable bluestore
parameters, possibly overwhelming the hdd:
    https://www.spinics.net/lists/ceph-users/msg71646.html

Matthias
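
P.S.: if you want to reproduce that journal-test benchmark, the fio
invocation from the blog post is, as far as I remember, roughly the
following (treat the exact parameters as an approximation, and replace
the device path with your own - it writes to the raw device, so it is
destructive):

    fio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4k \
        --numjobs=1 --iodepth=1 --runtime=60 --time_based \
        --group_reporting --name=journal-test

The relevant difference to your test is "--sync=1" (the device is
opened with O_SYNC, so every write itself waits for stable storage)
instead of "--fsync=1" (a separate fsync() is issued after each write).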
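
P.P.S.: for rotational=1, the approach discussed in that ceph-users
thread is, if I remember correctly, to set the sysfs attribute before
creating the OSD, e.g. (bcache0 is just an example device name):

    echo 1 > /sys/block/bcache0/queue/rotational

The setting does not survive a reboot, but as said above it matters at
OSD creation time.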