On Mon, May 13, 2024 at 10:15:00PM -0700, Robert Pang wrote:
> Dear Coly,
>

Hi Robert,

Thanks for the email. Let me explain inline.

> Thank you for your dedication in reviewing this patch. I understand my
> previous message may have come across as urgent, but I want to
> emphasize the significance of this bcache operational issue as it has
> been reported by multiple users.
>

What I was concerned about was still the testing itself.

First of all, from the following information I can see quite a lot of
testing has been done. I do appreciate the effort, which makes me
confident in the quality of this patch.

> We understand the importance of thoroughness. To that end, we have
> conducted extensive, repeated testing on this patch across a range of
> cache sizes (375G/750G/1.5T/3T/6T/9TB) and CPU cores
> (2/4/8/16/32/48/64/80/96/128) for an hour-long run. We tested various
> workloads (read-only, read-write, and write-only) with 8kB I/O size.
> In addition, we did a series of 16-hour runs with 750GB cache and 16
> CPU cores. Our tests, primarily in writethrough mode, haven't revealed
> any issues or deadlocks.
>

An hour-long run is not enough for bcache. Normally, for stability
purposes, at least 12-36 hours of continuous I/O pressure is necessary.
Before Linux v5.3, bcache would run into out-of-memory after 10~12 hours
of heavy random write workload on the server hardware Lenovo sponsored me.

This patch tends to give the allocator higher priority than the gc
thread, so I'd like to see what happens when most of the cache space is
allocated.

My testing is still on the Lenovo SR650. The cache device is 512G Intel
Optane memory via the pmem driver, the backing device is a 4TB NVMe SSD,
and the system has two Intel Xeon processors with 48 cores and 160G DRAM.
An XFS file system with default configuration is created on the
writeback-mode bcache device, and the following fio job file is used:

[global]
direct=1
thread=1
lockmem=1
ioengine=libaio
random_generator=tausworthe64
group_reporting=1

[job0]
directory=/mnt/xfs/
readwrite=randwrite
numjobs=20
blocksize=4K/50:8K/30:16K/10:32K/10
iodepth=128
nrfiles=50
size=80G
time_based=1
runtime=36h

After around 10~12 hours, the cache space is almost exhausted, and all
I/Os bypass the cache and go directly to the backing device. At this
point, cache in use is around 96% (85% is dirty data; the rest might be
journal and btree nodes). This is as expected.

Then I stop the fio task and wait for the writeback thread to flush all
dirty data to the backing device. Now the cache space is occupied by
clean data and btree nodes.

When the fio write task is restarted, an unexpected behavior can be
observed: all I/Os still bypass the cache device and go directly to the
backing device, even though the cache contains only clean data.

The above behavior turns out to be a bug in the existing bcache code.
When more than 95% of the cache space is in use, all write I/Os bypass
the cache, so there is no chance to decrease the sectors counter to a
negative value and trigger garbage collection. The result is that clean
data occupies all the cache space but can never be collected and
re-allocated.

Before this patch, the above issue was a bit harder to reproduce. Since
this patch tends to give allocator threads more priority than gc threads,
with a very high write workload for quite a long time it is easier to
observe the above no-space issue.

Now I have fixed it, and the first 8-hour run looks fine. I am continuing
another 12-hour run on the same hardware configuration at this moment.
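To make the interplay described above easier to follow, here is a
simplified sketch of the two relevant pieces of the request path. This is
paraphrased from drivers/md/bcache/request.c with reduced conditions and
simplified function signatures; it is not the exact upstream code.

/*
 * Simplified sketch, paraphrased from the bcache request path;
 * conditions and signatures are reduced for illustration only.
 */

#define CUTOFF_CACHE_ADD	95	/* percent of cache space in use */

/* Write path: decide whether a bio should bypass the cache. */
static bool check_should_bypass(struct cache_set *c, struct bio *bio)
{
	/*
	 * Once more than 95% of the cache space is in use, every
	 * write is sent straight to the backing device.
	 */
	if (c->gc_stats.in_use > CUTOFF_CACHE_ADD)
		return true;

	/* ... other bypass conditions ... */
	return false;
}

/* Insert path: the place where the gc thread gets woken. */
static void bch_data_insert_start(struct cache_set *c, struct bio *bio)
{
	/*
	 * gc is woken when the sectors counter goes negative. But if
	 * every write bypasses the cache (see above), this path is not
	 * reached, the counter never drops, gc never runs, and the
	 * clean data occupying the cache is never reclaimed.
	 */
	if (atomic_sub_return(bio_sectors(bio), &c->sectors_to_gc) < 0)
		wake_up_gc(c);

	/* ... allocate buckets and insert the cached data ... */
}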
> We hope this additional testing data proves helpful. Please let us
> know if there are any other specific tests or configurations you would
> like us to consider.

The above testing information is very helpful. Bcache is now widely
deployed for business-critical workloads, so long-duration I/O pressure
testing is necessary; otherwise such a regression would escape our eyes.

Thanks.

[snipped]

-- 
Coly Li