On Wed, 15 May 2024, Coly Li wrote:

> On Mon, May 13, 2024 at 10:15:00PM -0700, Robert Pang wrote:
> > Dear Coly,
> >
>
> Hi Robert,
>
> Thanks for the email. Let me explain inline.
>
> > Thank you for your dedication in reviewing this patch. I understand my
> > previous message may have come across as urgent, but I want to
> > emphasize the significance of this bcache operational issue, as it has
> > been reported by multiple users.
> >
>
> What concerned me was still the testing itself. First of all, from the
> following information I see quite a lot of testing has been done. I
> appreciate the effort, which makes me confident in the quality of this
> patch.
>
> > We understand the importance of thoroughness. To that end, we have
> > conducted extensive, repeated testing on this patch across a range of
> > cache sizes (375G/750G/1.5T/3T/6T/9TB) and CPU core counts
> > (2/4/8/16/32/48/64/80/96/128) in hour-long runs. We tested various
> > workloads (read-only, read-write, and write-only) with 8kB I/O size.
> > In addition, we did a series of 16-hour runs with a 750GB cache and 16
> > CPU cores. Our tests, primarily in writethrough mode, haven't revealed
> > any issues or deadlocks.
> >
>
> An hour-long run is not enough for bcache. Normally, for stability
> purposes, at least 12-36 hours of continuous I/O pressure is necessary.
> Before Linux v5.3, bcache would run into out-of-memory after 10~12 hours
> of heavy random write workload on the server hardware Lenovo sponsored
> for me.

FYI: We have been running the v2 patch in production on 5 different
servers containing a total of 8 bcache volumes since April 7th this year,
applied to 6.6.25 and later kernels. Some servers run 4k sectors and
others 512-byte sectors for the data volume; all cache devices use
512-byte sectors. The backing storage on these servers ranges from 40 to
350 terabytes, and the cache sizes are in the 1-2 TB range. We log kernel
messages with netconsole to a centralized log server and have not had any
bcache issues.

--
Eric Wheeler

> This patch tends to give higher priority to the allocator than to the gc
> thread, so I'd like to see what happens when most of the cache space is
> allocated.
>
> My testing is still on the Lenovo SR650. The cache device is 512G Intel
> Optane memory via the pmem driver, the backing device is a 4TB NVMe SSD,
> and the system has 2-way Intel Xeon processors with 48 cores and 160G
> DRAM. An XFS with the default configuration is created on the
> writeback-mode bcache device, and the following fio job file is used:
>
> [global]
> direct=1
> thread=1
> lockmem=1
> ioengine=libaio
> random_generator=tausworthe64
> group_reporting=1
>
> [job0]
> directory=/mnt/xfs/
> readwrite=randwrite
> numjobs=20
> blocksize=4K/50:8K/30:16K/10:32K/10
> iodepth=128
> nrfiles=50
> size=80G
> time_based=1
> runtime=36h
>
> After around 10~12 hours, the cache space is almost exhausted, and all
> I/Os bypass the cache and go directly to the backing device. At this
> point the cache in use is around 96% (85% is dirty data; the rest is
> probably journal and btree nodes). This is as expected.
>
> Then I stop the fio task and wait for the writeback thread to flush all
> dirty data to the backing device. Now the cache space is occupied by
> clean data and btree nodes. When the fio writing task is restarted, an
> unexpected behavior can be observed: all I/Os still bypass the cache
> device and go directly to the backing device, even though the cache only
> contains clean data.
>
> The above behavior turns out to be a bug in the existing bcache code.
> When more than 95% of the cache space is used, all write I/Os bypass the
> cache, so there is no chance to decrease the sectors counter to a
> negative value and trigger garbage collection. The result is that clean
> data occupies all the cache space but can never be collected and
> re-allocated.
>
> Before this patch, the above issue was a bit harder to reproduce. Since
> this patch tends to give more priority to allocator threads than to gc
> threads, under a very heavy write workload running for a long time it is
> easier to observe the above no-space issue.
>
> Now I have fixed it, and the first 8-hour run looks fine. I am continuing
> with another 12-hour run on the same hardware configuration at this
> moment.
>
> > We hope this additional testing data proves helpful. Please let us
> > know if there are any other specific tests or configurations you would
> > like us to consider.
> >
>
> The above testing information is very helpful. Since bcache is now widely
> deployed on business-critical workloads, long-duration I/O pressure
> testing is necessary; otherwise such a regression would escape our eyes.
>
> Thanks.
>
> [snipped]
>
> --
> Coly Li
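
To make the failure mode Coly describes above concrete, here is a minimal,
self-contained C simulation of the interaction between the write-bypass
cutoff and the GC trigger. It is a sketch of the idea only, not actual
bcache code: the names CUTOFF_CACHE_ADD and sectors_to_gc mirror the real
identifiers in drivers/md/bcache, but everything around them is a
simplified stand-in.

/*
 * Toy model of the interaction described above -- NOT actual bcache code.
 * CUTOFF_CACHE_ADD and sectors_to_gc mirror the real names in
 * drivers/md/bcache; the surrounding logic is deliberately simplified.
 */
#include <stdbool.h>
#include <stdio.h>

#define CUTOFF_CACHE_ADD 95	/* writes bypass once the cache is >95% used */

struct toy_cache {
	int in_use;		/* percent of cache space allocated */
	long sectors_to_gc;	/* GC is woken when this drops below zero */
	bool gc_woken;
};

/* Simplified bypass decision: a nearly full cache sends writes around it. */
static bool should_bypass(const struct toy_cache *c)
{
	return c->in_use > CUTOFF_CACHE_ADD;
}

/*
 * Simplified insert path: only writes that enter the cache decrement
 * sectors_to_gc, and only a negative value wakes the gc thread.
 */
static void write_io(struct toy_cache *c, long sectors)
{
	if (should_bypass(c))
		return;		/* bypassed: the counter is never touched */

	c->sectors_to_gc -= sectors;
	if (c->sectors_to_gc < 0)
		c->gc_woken = true;
}

int main(void)
{
	/* Cache full of clean data after writeback finished: 96% in use. */
	struct toy_cache c = { .in_use = 96, .sectors_to_gc = 1L << 20 };

	for (int i = 0; i < 1000000; i++)
		write_io(&c, 8);	/* heavy write workload */

	/* GC never runs, so the clean space is never reclaimed. */
	printf("gc_woken = %s\n", c.gc_woken ? "true" : "false");
	return 0;
}

Compiled and run as-is, this prints "gc_woken = false": because in_use
stays above the cutoff, every write bypasses the cache, the counter never
goes negative, and garbage collection is never woken, matching the
observation that the clean space is never reclaimed.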