Thank you Eric, Matthias, Coly.

> Disk controllers seem to interpret FLUSH CACHE / FUA differently.
> If bcache would set FUA for cache device writes while running fio
> directly on the nvme device would not, that might explain the timing
> difference.

Matthias, thanks a lot for helping! I believe this test was not aimed at the Ceph context. Although my ultimate goal is to run Ceph (in this you are correct), Ceph is still off. Turning on Ceph will be my next step, after getting a solid cached device setup. These direct, synchronous disk tests are useful for Ceph, but they can also give an idea of how the setup will behave for other applications, such as an Oracle database, PostgreSQL, or other database engines.

On the other hand, I believe this result comes from the fact that an enterprise NVME with PLP (Power Loss Protection) is very fast for direct writes, more than expected from OS caching mechanisms. If I'm not mistaken, the test was about the OS caching mechanism.

Eric, I don't see big problems in creating the bcache using -w 4096. But there might be some situations where performance degrades when something tries to write in 512-byte units, as you said. Could this be a concern in a production environment?

Anyway, the performance even using -w 4096 was still way below the native NVME performance. Is this because of the metadata headers?

I noticed one thing with the dstat tool (it seems useful for checking the data flow and the flow of I/O operations to the devices in real time): each write of a 4K block to bcache results in a 16K write to the cache device (NVME). This seems to mean that bcache writes an extra 12 KB (three times the size of the 4K block) as some form of header, metadata, or other mapping information for each 4K block written. Is that correct?

If this is correct, it might explain why I still get only 1/4 of the NVME performance when writing 4 KB blocks, even if I format bcache with -w 4096.
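
To make the arithmetic behind this guess explicit, here is a minimal sketch. It assumes that write bandwidth on the cache device is the only bottleneck, which may not hold for small synchronous writes; the function names are made up for illustration, and the 4K/16K figures are the ones I read from dstat.

```python
def write_amplification(logical_bytes: int, device_bytes: int) -> float:
    """Bytes written to the cache device per byte written to bcache."""
    if logical_bytes <= 0:
        raise ValueError("logical_bytes must be positive")
    return device_bytes / logical_bytes


def expected_fraction_of_native(amplification: float) -> float:
    """Upper bound on throughput relative to the raw device.

    ASSUMPTION: device write bandwidth is the only limiting factor.
    """
    return 1.0 / amplification


KIB = 1024
# Figures observed with dstat: a 4K write to bcache shows up as 16K on the NVME.
amp = write_amplification(4 * KIB, 16 * KIB)
print(f"write amplification: {amp:.1f}x")
print(f"expected share of native throughput: {expected_fraction_of_native(amp):.0%}")
```

With the observed numbers this gives an amplification of 4.0x and an expected share of 25%, which matches the 1/4 I am seeing, but it does not show where the extra 12 KB comes from.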
If every 4 KB block I write to bcache requires writing four times that amount of data to the cache device, it's obvious that I'm only going to get 25% of the hardware performance. Is that it?

Another thing that intrigues me now is the difference in bcache performance from one server to the other. I believe this must be some configuration difference, because the hardware is identical, but I can't imagine which one. I even made the memory configuration the same on both machines, and even the SATA position of the disks, so there is no difference. But even so, the second machine insists on having half the performance of the first, just in the cache.

Again with dstat, I verified that zero bytes are written to or read from the backing device, while 4K blocks are written to the bcache device and the NVME hardware. That is correct, I think. But at the same time, dstat indicates that I/O operations are taking place on the backing device. This does not occur on the first server, only on the second. It seems clear to me that this behavior is halving the performance on the second server. But why? Why are there I/O operations destined for the backing device with "zero" bytes written or read? What kind of I/O operation could write or read zero bytes? And why would they occur?

This is one more thing to investigate. If anyone has an idea, I'd appreciate it.

Thank you all!

On Saturday, May 28, 2022 at 04:22:51 BRT, Matthias Ferdinand <bcache@xxxxxxxxx> wrote:

On Fri, May 27, 2022 at 06:27:53PM -0700, Eric Wheeler wrote:
> > I can say that the performance of tests after the write back command for
> > all devices greatly worsens the performance of direct tests on NVME
> > hardware. Below you can see this.
>
> I wonder what is going on there! I tried the same thing on my system and
> 'write through' is faster for me, too, so it would be worth investigating.

In Ceph context, it seems not unusual to disable SSD write back cache and
see much improved performance (or the other way round: see surprisingly
low performance with write back cache enabled):

    https://yourcmc.ru/wiki/Ceph_performance#Drive_cache_is_slowing_you_down

Disk controllers seem to interpret FLUSH CACHE / FUA differently.
If bcache would set FUA for cache device writes while running fio
directly on the nvme device would not, that might explain the timing
difference.

Regards
Matthias
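
To compare the two servers on this point, one could check what the kernel reports for each device's write cache mode and FUA support. The following is a minimal sketch, assuming the Linux sysfs attributes `queue/write_cache` and `queue/fua` are present under `/sys/block/<device>/`. The device names `nvme0n1` and `sda` are placeholders, not the names on the machines discussed here.

```python
from pathlib import Path
from typing import Dict, Iterable, Optional


def read_queue_attr(device: str, attr: str,
                    sysfs_root: str = "/sys/block") -> Optional[str]:
    """Return the content of /sys/block/<device>/queue/<attr>, or None if unreadable."""
    path = Path(sysfs_root) / device / "queue" / attr
    try:
        return path.read_text().strip()
    except OSError:
        return None


def cache_settings(devices: Iterable[str],
                   sysfs_root: str = "/sys/block") -> Dict[str, Dict[str, Optional[str]]]:
    """Collect the write cache mode and FUA flag for each device."""
    return {
        dev: {
            "write_cache": read_queue_attr(dev, "write_cache", sysfs_root),
            "fua": read_queue_attr(dev, "fua", sysfs_root),
        }
        for dev in devices
    }


if __name__ == "__main__":
    # Placeholder device names: replace with the cache and backing devices.
    for dev, settings in cache_settings(["nvme0n1", "sda"]).items():
        print(dev, settings)
```

Running this on both servers and comparing the output would show whether the devices are configured differently. It only reads the settings and does not by itself tell whether bcache sets FUA on its writes.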