On Mon, May 13, 2024 at 10:15:00PM -0700, Robert Pang wrote:
> Dear Coly,
>

Hi Robert,

Thanks for the email. Let me explain inline.

> Thank you for your dedication in reviewing this patch. I understand my
> previous message may have come across as urgent, but I want to
> emphasize the significance of this bcache operational issue as it has
> been reported by multiple users.
>

What I was concerned about was still the testing itself.

First of all, from the following information I can see quite a lot of
testing has been done. I do appreciate the effort, which makes me
confident in the quality of this patch.

> We understand the importance of thoroughness. To that end, we have
> conducted extensive, repeated testing on this patch across a range of
> cache sizes (375G/750G/1.5T/3T/6T/9TB) and CPU cores
> (2/4/8/16/32/48/64/80/96/128) for an hour-long run. We tested various
> workloads (read-only, read-write, and write-only) with 8kB I/O size.
> In addition, we did a series of 16-hour runs with 750GB cache and 16
> CPU cores. Our tests, primarily in writethrough mode, haven't revealed
> any issues or deadlocks.
>

An hour-long run is not enough for bcache. Normally, for stability
purposes, at least 12-36 hours of continuous I/O pressure is necessary.
Before Linux v5.3, bcache would run into out-of-memory after 10~12 hours
of heavy random write workload on the server hardware Lenovo sponsored me.

This patch tends to give the allocator higher priority than the gc
thread, so I'd like to see what happens when most of the cache space is
allocated.

My testing is still on the Lenovo SR650. The cache device is 512G Intel
Optane memory via the pmem driver, the backing device is a 4TB NVMe SSD,
and the system has two Intel Xeon processors with 48 cores and 160G DRAM.
An XFS file system with default configuration is created on the
writeback-mode bcache device, and the following fio job file is used:

[global]
direct=1
thread=1
lockmem=1
ioengine=libaio
random_generator=tausworthe64
group_reporting=1

[job0]
directory=/mnt/xfs/
readwrite=randwrite
numjobs=20
blocksize=4K/50:8K/30:16K/10:32K/10
iodepth=128
nrfiles=50
size=80G
time_based=1
runtime=36h

After around 10~12 hours, the cache space is almost exhausted, and all
I/Os bypass the cache and go directly to the backing device. At this
point, cache in use is around 96% (85% is dirty data; the rest might be
journal and btree nodes). This is as expected.

Then I stop the fio task and wait for the writeback thread to flush all
dirty data to the backing device. Now the cache space is occupied by
clean data and btree nodes.

When the fio write task is restarted, an unexpected behavior can be
observed: all I/Os still bypass the cache device and go directly to the
backing device, even though the cache contains only clean data.

The above behavior turns out to be a bug in the existing bcache code.
When more than 95% of the cache space is in use, all write I/Os bypass
the cache, so there is no chance to decrease the sectors counter to a
negative value and trigger garbage collection. The result is that clean
data occupies all the cache space but can never be collected and
re-allocated.

Before this patch, the above issue was a bit harder to reproduce. Since
this patch tends to give allocator threads more priority than gc threads,
with a very high write workload for quite a long time it is easier to
observe the above no-space issue.

Now I have fixed it, and the first 8-hour run looks fine. I am continuing
another 12-hour run on the same hardware configuration at this moment.
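To make the interplay described above easier to follow, here is a
simplified sketch of the two relevant pieces of the request path. This is
paraphrased from drivers/md/bcache/request.c with reduced conditions and
simplified function signatures; it is not the exact upstream code.

/*
 * Simplified sketch, paraphrased from the bcache request path;
 * conditions and signatures are reduced for illustration only.
 */

#define CUTOFF_CACHE_ADD	95	/* percent of cache space in use */

/* Write path: decide whether a bio should bypass the cache. */
static bool check_should_bypass(struct cache_set *c, struct bio *bio)
{
	/*
	 * Once more than 95% of the cache space is in use, every
	 * write is sent straight to the backing device.
	 */
	if (c->gc_stats.in_use > CUTOFF_CACHE_ADD)
		return true;

	/* ... other bypass conditions ... */
	return false;
}

/* Insert path: the place where the gc thread gets woken. */
static void bch_data_insert_start(struct cache_set *c, struct bio *bio)
{
	/*
	 * gc is woken when the sectors counter goes negative. But if
	 * every write bypasses the cache (see above), this path is not
	 * reached, the counter never drops, gc never runs, and the
	 * clean data occupying the cache is never reclaimed.
	 */
	if (atomic_sub_return(bio_sectors(bio), &c->sectors_to_gc) < 0)
		wake_up_gc(c);

	/* ... allocate buckets and insert the cached data ... */
}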
> We hope this additional testing data proves helpful. Please let us
> know if there are any other specific tests or configurations you would
> like us to consider.

The above testing information is very helpful. Bcache is now widely
deployed for business-critical workloads, so long-duration I/O pressure
testing is necessary; otherwise such a regression would escape our eyes.

Thanks.

[snipped]

-- 
Coly Li