On Wed, 15 May 2024, Coly Li wrote:

> On Mon, May 13, 2024 at 10:15:00PM -0700, Robert Pang wrote:
> > Dear Coly,
> >
>
> Hi Robert,
>
> Thanks for the email. Let me explain inline.
>
> > Thank you for your dedication in reviewing this patch. I understand my
> > previous message may have come across as urgent, but I want to
> > emphasize the significance of this bcache operational issue, as it has
> > been reported by multiple users.
> >
>
> What concerned me was still the testing itself. First of all, from the
> following information I see quite a lot of testing has been done. I
> appreciate the effort, which makes me confident in the quality of this
> patch.
>
> > We understand the importance of thoroughness. To that end, we have
> > conducted extensive, repeated testing on this patch across a range of
> > cache sizes (375G/750G/1.5T/3T/6T/9TB) and CPU core counts
> > (2/4/8/16/32/48/64/80/96/128) in hour-long runs. We tested various
> > workloads (read-only, read-write, and write-only) with 8kB I/O size.
> > In addition, we did a series of 16-hour runs with a 750GB cache and 16
> > CPU cores. Our tests, primarily in writethrough mode, haven't revealed
> > any issues or deadlocks.
> >
>
> An hour-long run is not enough for bcache. Normally, for stability
> purposes, at least 12-36 hours of continuous I/O pressure is necessary.
> Before Linux v5.3, bcache would run into out-of-memory after 10~12 hours
> of heavy random write workload on the server hardware Lenovo sponsored
> for me.

FYI: We have been running the v2 patch in production on 5 different
servers containing a total of 8 bcache volumes since April 7th this year,
applied to 6.6.25 and later kernels. Some servers run 4k sectors and
others 512-byte sectors for the data volume; all cache devices use
512-byte sectors. The backing storage on these servers ranges from 40 to
350 terabytes, and the cache sizes are in the 1-2 TB range. We log kernel
messages with netconsole to a centralized log server and have not had any
bcache issues.

--
Eric Wheeler

> This patch tends to give higher priority to the allocator than to the gc
> thread, so I'd like to see what happens when most of the cache space is
> allocated.
>
> My testing is still on the Lenovo SR650. The cache device is 512G Intel
> Optane memory via the pmem driver, the backing device is a 4TB NVMe SSD,
> and the system has 2-way Intel Xeon processors with 48 cores and 160G
> DRAM. An XFS with the default configuration is created on the
> writeback-mode bcache device, and the following fio job file is used:
>
> [global]
> direct=1
> thread=1
> lockmem=1
> ioengine=libaio
> random_generator=tausworthe64
> group_reporting=1
>
> [job0]
> directory=/mnt/xfs/
> readwrite=randwrite
> numjobs=20
> blocksize=4K/50:8K/30:16K/10:32K/10
> iodepth=128
> nrfiles=50
> size=80G
> time_based=1
> runtime=36h
>
> After around 10~12 hours, the cache space is almost exhausted, and all
> I/Os bypass the cache and go directly to the backing device. At this
> point the cache in use is around 96% (85% is dirty data; the rest is
> probably journal and btree nodes). This is as expected.
>
> Then I stop the fio task and wait for the writeback thread to flush all
> dirty data to the backing device. Now the cache space is occupied by
> clean data and btree nodes. When the fio writing task is restarted, an
> unexpected behavior can be observed: all I/Os still bypass the cache
> device and go directly to the backing device, even though the cache only
> contains clean data.
>
> The above behavior turns out to be a bug in the existing bcache code.
> When more than 95% of the cache space is used, all write I/Os bypass the
> cache, so there is no chance to decrease the sectors counter to a
> negative value and trigger garbage collection. The result is that clean
> data occupies all the cache space but can never be collected and
> re-allocated.
>
> Before this patch, the above issue was a bit harder to reproduce. Since
> this patch tends to give more priority to allocator threads than to gc
> threads, under a very heavy write workload running for a long time it is
> easier to observe the above no-space issue.
>
> Now I have fixed it, and the first 8-hour run looks fine. I am continuing
> with another 12-hour run on the same hardware configuration at this
> moment.
>
> > We hope this additional testing data proves helpful. Please let us
> > know if there are any other specific tests or configurations you would
> > like us to consider.
> >
>
> The above testing information is very helpful. Since bcache is now widely
> deployed on business-critical workloads, long-duration I/O pressure
> testing is necessary; otherwise such a regression would escape our eyes.
>
> Thanks.
>
> [snipped]
>
> --
> Coly Li
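
To make the failure mode Coly describes above concrete, here is a minimal,
self-contained C simulation of the interaction between the write-bypass
cutoff and the GC trigger. It is a sketch of the idea only, not actual
bcache code: the names CUTOFF_CACHE_ADD and sectors_to_gc mirror the real
identifiers in drivers/md/bcache, but everything around them is a
simplified stand-in.

/*
 * Toy model of the interaction described above -- NOT actual bcache code.
 * CUTOFF_CACHE_ADD and sectors_to_gc mirror the real names in
 * drivers/md/bcache; the surrounding logic is deliberately simplified.
 */
#include <stdbool.h>
#include <stdio.h>

#define CUTOFF_CACHE_ADD 95	/* writes bypass once the cache is >95% used */

struct toy_cache {
	int in_use;		/* percent of cache space allocated */
	long sectors_to_gc;	/* GC is woken when this drops below zero */
	bool gc_woken;
};

/* Simplified bypass decision: a nearly full cache sends writes around it. */
static bool should_bypass(const struct toy_cache *c)
{
	return c->in_use > CUTOFF_CACHE_ADD;
}

/*
 * Simplified insert path: only writes that enter the cache decrement
 * sectors_to_gc, and only a negative value wakes the gc thread.
 */
static void write_io(struct toy_cache *c, long sectors)
{
	if (should_bypass(c))
		return;		/* bypassed: the counter is never touched */

	c->sectors_to_gc -= sectors;
	if (c->sectors_to_gc < 0)
		c->gc_woken = true;
}

int main(void)
{
	/* Cache full of clean data after writeback finished: 96% in use. */
	struct toy_cache c = { .in_use = 96, .sectors_to_gc = 1L << 20 };

	for (int i = 0; i < 1000000; i++)
		write_io(&c, 8);	/* heavy write workload */

	/* GC never runs, so the clean space is never reclaimed. */
	printf("gc_woken = %s\n", c.gc_woken ? "true" : "false");
	return 0;
}

Compiled and run as-is, this prints "gc_woken = false": because in_use
stays above the cutoff, every write bypasses the cache, the counter never
goes negative, and garbage collection is never woken, matching the
observation that the clean space is never reclaimed.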