Thank you for your answer!

> bcache needs to do a lot of metadata work, resulting in a noticeable
> write amplification. My testing with bcache (some years ago and only with
> SATA SSDs) showed that bcache latency increases a lot with high amounts
> of dirty data

I'm testing with empty devices, no data. Wouldn't write amplification show up in dstat? It doesn't look significant during the tests, since I monitor reads and writes on all the disks with dstat.

> I also found performance to increase slightly when a bcache device
> was created with 4k block size instead of default 512bytes.

Are you talking about changing the block size of the cache device or of the backing device? I tried changing it on the cache device, but bcache then gave an error when I tried to attach the backing device afterwards; it only worked when I kept the default value (512). The only thing I managed to change when creating the cache was the bucket size, which I set to 16K (that is what I found in the information about my NVMe, though I don't even know whether it is correct), but unfortunately that changed neither the IOPS nor the latency.
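If I understand it correctly, the attach error probably comes from the mismatch: as far as I can tell, the backing device's block size cannot be smaller than the cache set's, so a cache formatted with a 4k block will not accept a backing device left at the default 512 bytes. Formatting both devices together with the same 4k block would look roughly like this (device names are from my setup, assuming the second NVMe partition as the cache; the option letters are my reading of the make-bcache man page, not something I have verified here):

# make-bcache -w 4k -b 16k -C /dev/nvme0n1p2 -B /dev/sdb

Formatting the cache and the backing device in the same make-bcache call should also attach them automatically, so both end up with the same block size from the start.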
> so I used to tune down writeback_percent, usually to 1,
> and used to keep the cache device size low at around 40GB.

I think it must be a matter of fine tuning. One curious thing I noticed is that writes always land on the flash, never on the spinning disk. That is expected, and it should give the same fast response as the flash device itself. However, that is not what happens when going through bcache.

But when I remove the fsync flag from the fio test (the flag that makes the application wait for each write to be acknowledged), the 4K writes happen much faster, reaching 73.6 MB/s and 17k IOPS (the fio job is sketched further below). That is half the device's performance, but it is more than enough for my case. The fsync flag makes no significant difference to the performance of my flash disk when I test directly on it. The fact that bcache speeds up when the fsync flag is removed makes me believe that bcache is not slow to write; for some reason it is slow to acknowledge that the write is complete. I think that should be the point!

Without fsync, the ioping tests also speed up, although less. In that case the latency drops to something around 600~700us. That is nothing compared to the 84us obtained when writing directly to the flash device (with or without fsync), but it is still much better than the 1.5ms I get from bcache when the fsync flag is added to wait for the write response.

In other words, it looks like there is a wait introduced by the bcache layer between the write being submitted to it, bcache waiting for the disk to respond, and the response being returned to the application. This increases latency and consequently reduces performance. I think it must be some fine tuning (or not?).

I think this tool (bcache) is not used much, at least not in this way, because I am having difficulty getting feedback on the Internet. I did not even know where to ask for help.

In fact, writing in small blocks with the fsync and direct flags is not very common. It is typical of database servers and other data center storage tools that need to make sure the data is physically written to the device immediately after each operation. These applications have to guarantee that the writes were actually performed, and disk caches are made of volatile memory, which guarantees nothing: a power failure can occur and the data that was only in the cache is lost.

That is why each operation requests that the data be written directly, bypassing the cache, and that the confirmation come back only once the data is actually on the device. This makes the operations inherently slow, and everything gets even slower when each operation is as small as 4K. In other words, every 4K write carries an instruction that the data must not stop in the disk cache (since the cache is presumed to be volatile memory) but be written immediately, with the confirmation coming from the device afterwards. This significantly increases latency.

That is why, in these environments, the usual recommendation is RAID cards with battery-backed cache: they ignore the direct and fsync instructions but still guarantee the data is preserved, even on power failure, precisely because of the batteries. Nowadays, though, with enterprise flash devices containing tantalum capacitors that act as a true built-in UPS, RAID arrays, besides being expensive, are no longer considered that fast. Flash devices with built-in supercapacitors likewise work by ignoring fsync flags while still guaranteeing the write, even on power failure. Writes to these devices become so fast that it does not even seem as if a physical write confirmation was requested for each operation. Operations are fast for databases, just as ordinary writes that would naturally land in the cache of a consumer flash disk are fast. But enterprise data center flash disks are very expensive!

So the idea was to use spinning disks for capacity and enterprise data center flash disks (NVMe) as a cache with bcache. In theory, bcache would always divert writes (especially small ones) straight to the NVMe drive, and I would benefit from the drive's low latency, high throughput and high IOPS on most writes and reads. Unfortunately, something is not working out as I imagined: something is limiting IOPS and greatly increasing latency. I think it may be something I am doing wrong in the configuration, or some fine tuning I do not know how to do.
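For reference, the 4K test I keep referring to is a random write with both the direct and the fsync flags set. I am not reproducing my exact command line here, but a fio job along these lines (the option values are only a reconstruction of the pattern) exercises the same workload against the bcache device:

# fio --name=test --filename=/dev/bcache0 --ioengine=libaio --direct=1 --fsync=1 --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based --group_reporting

Dropping --fsync=1 is the only change between the slow case (roughly 1500-1800 IOPS through bcache0) and the fast one (17k IOPS, 73.6 MB/s); run directly against the NVMe, the job is fast either way.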
Thank you! The search continues. If anyone else can help, I'd appreciate it!

On Wednesday, May 11, 2022, 03:20:18 BRT, Matthias Ferdinand <bcache@xxxxxxxxx> wrote:

On Tue, May 10, 2022 at 04:49:35PM +0000, Adriano Silva wrote:
> As we can see, the same test done on the bcache0 device only got 1548 IOPS and that yielded only 6.3 KB/s.
>
> This is much more than any spinning HDD could give me, but many times less than the result obtained by NVMe.

Hi,

bcache needs to do a lot of metadata work, resulting in a noticeable
write amplification. My testing with bcache (some years ago and only with
SATA SSDs) showed that bcache latency increases a lot with high amounts
of dirty data, so I used to tune down writeback_percent, usually to 1,
and used to keep the cache device size low at around 40GB.

I also found performance to increase slightly when a bcache device
was created with 4k block size instead of default 512bytes. Still quite
a decrease in iops.

Maybe you could monitor with iostat, it gives those _await columns,
there might be some hints.

Matthias

> I've noticed in several tests, varying the amount of jobs or increasing the size of the blocks, that the larger the size of the blocks, the more I approximate the performance of the physical device to the bcache device. But it always seems that the amount of IOPS is limited to somewhere around 1500-1800 IOPS (maximum). By increasing the amount of jobs, I get better results and more IOPS, but if you divide the total IOPS by the amount of jobs, you can see that the IOPS are always limited in the range 1500-1800 per job.
>
> The commands used to configure bcache were:
>
> # echo writeback > /sys/block/bcache0/bcache/cache_mode
> # echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
> ##
> ## Then I tried everything also with the commands below, but there was no improvement.
> ##
> # echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us
> # echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us
>
>
> Monitoring with dstat, it is possible to notice that when activating the fio command, the writing is all done in the cache device (a second partition of NVMe), until the end of the test. The spinning disk is only written after the time has passed and it is possible to see the read on the NVMe and the write on the spinning disk (which means the transfer of data in the background).
>
> --dsk/sdb---dsk/nvme0n1-dsk/bcache0 ---io/sdb----io/nvme0n1--io/bcache0 -net/total- ---load-avg--- --total-cpu-usage-- ---system-- ----system---- async
> read writ: read writ: read writ| read writ: read writ: read writ| recv send| 1m 5m 15m |usr sys idl wai stl| int csw | time | #aio
> 0 0 : 0 0 : 0 0 | 0 0 : 0 0 : 0 0 |8462B 8000B|0.03 0.15 0.31| 1 0 99 0 0| 250 383 |09-05 15:19:47| 0
> 0 0 :4096B 454k: 0 336k| 0 0 :1.00 184 : 0 170 |4566B 4852B|0.03 0.15 0.31| 2 2 94 1 0|1277 3470 |09-05 15:19:48| 1B
> 0 8192B: 0 8022k: 0 6512k| 0 2.00 : 0 3388 : 0 3254 |3261B 2827B|0.11 0.16 0.32| 0 2 93 5 0|4397 16k|09-05 15:19:49| 1B
> 0 0 : 0 7310k: 0 6460k| 0 0 : 0 3240 : 0 3231 |6773B 6428B|0.11 0.16 0.32| 0 1 93 6 0|4190 16k|09-05 15:19:50| 1B
> 0 0 : 0 7313k: 0 6504k| 0 0 : 0 3252 : 0 3251 |6719B 6201B|0.11 0.16 0.32| 0 2 92 6 0|4482 16k|09-05 15:19:51| 1B
> 0 0 : 0 7313k: 0 6496k| 0 0 : 0 3251 : 0 3250 |4743B 4016B|0.11 0.16 0.32| 0 1 93 6 0|4243 16k|09-05 15:19:52| 1B
> 0 0 : 0 7329k: 0 6496k| 0 0 : 0 3289 : 0 3245 |6107B 6062B|0.11 0.16 0.32| 1 1 90 8 0|4706 18k|09-05 15:19:53| 1B
> 0 0 : 0 5373k: 0 4184k| 0 0 : 0 2946 : 0 2095 |6387B 6062B|0.26 0.19 0.33| 0 2 95 4 0|3774 12k|09-05 15:19:54| 1B
> 0 0 : 0 6966k: 0 5668k| 0 0 : 0 3270 : 0 2834 |7264B 7546B|0.26 0.19 0.33| 0 1 93 5 0|4214 15k|09-05 15:19:55| 1B
> 0 0 : 0 7271k: 0 6252k| 0 0 : 0 3258 : 0 3126 |5928B 4584B|0.26 0.19 0.33| 0 2 93 5 0|4156 16k|09-05 15:19:56| 1B
> 0 0 : 0 7419k: 0 6504k| 0 0 : 0 3308 : 0 3251 |5226B 5650B|0.26 0.19 0.33| 2 1 91 6 0|4433 16k|09-05 15:19:57| 1B
> 0 0 : 0 6444k: 0 5704k| 0 0 : 0 2873 : 0 2851 |6494B 8021B|0.26 0.19 0.33| 1 1 91 7 0|4352 16k|09-05 15:19:58| 0
> 0 0 : 0 0 : 0 0 | 0 0 : 0 0 : 0 0 |6030B 7204B|0.24 0.19 0.32| 0 0 100 0 0| 209 279 |09-05 15:19:59| 0
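Regarding the iostat suggestion in the quoted mail: I take it to mean watching the extended statistics while the fio job runs, with something along these lines (device names as in the dstat output above):

# iostat -x sdb nvme0n1 bcache0 1

and then comparing the r_await / w_await columns of bcache0 against those of nvme0n1 to see where the extra latency appears.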