Re: [PATCH v2] bcache: allow allocator to invalidate bucket in gc

Coly Li <colyli@xxxxxxx> · Sat, 4 May 2024 11:08:22 +0800

> On May 4, 2024, at 10:04, Robert Pang <robertpang@xxxxxxxxxx> wrote:
> 
> Hi Coly,
> 
> > May I know in which kernel version you tested the patch?
> 
> I tested in both Linux kernels 5.10 and 6.1.
> 
> > I didn’t observe an obvious performance advantage from this patch.
> 
> This patch doesn't improve bcache performance. Instead, it eliminates the IO stall in bcache that occurs when bch_allocator_thread() blocks waiting for a running GC to finish.
> 
> /*
> * We've run out of free buckets, we need to find some buckets
> * we can invalidate. First, invalidate them in memory and add
> * them to the free_inc list:
> */
> retry_invalidate:
> allocator_wait(ca, ca->set->gc_mark_valid &&  <--------
>        !ca->invalidate_needs_gc);
> invalidate_buckets(ca);
> 
> From what you showed, it looks like your rebase is good. As you already noticed, the original patch was based on 4.x kernel so the bucket traversal in btree.c needs to be adapted for 5.x and 6.x kernels. I attached the patch rebased to 6.9 HEAD for your reference.
> 
> But to observe the IO stall before the patch, please test with a read-write workload so that GC happens often enough (a read-only or read-mostly workload doesn't show the problem). I used the "fio" utility to generate a random read-write workload as follows.
> 
> # Pre-generate a 900GB test file
> $ truncate -s 900G test
> 
> # Run random read-write workload for 1 hour
> $ fio --time_based --runtime=3600s --ramp_time=2s --ioengine=libaio --name=latency_test --filename=test --bs=8k --iodepth=1 --size=900G  --readwrite=randrw --verify=0 --filename=fio --write_lat_log=lat --log_avg_msec=1000 --log_max_value=1 
> 
> We include the flags "--write_lat_log=lat --log_avg_msec=1000 --log_max_value=1" so that fio dumps the second-by-second max latency into a log file at the end of the test, letting us see when a stall happens and for how long:
> 

Copied. Thanks for the information. Let me try the above command lines on my local machine for a longer time.

> E.g.
> 
> $ more lat_lat.1.log
> (format: <time-ms>,<max-latency-ns>,,,)
> ...
> 777000, 5155548, 0, 0, 0
> 778000, 105551, 1, 0, 0
> 802615, 24276019570, 0, 0, 0 <---- stalls for 24s with no IO possible
> 802615, 82134, 1, 0, 0
> 804000, 9944554, 0, 0, 0
> 805000, 7424638, 1, 0, 0
> 
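A log like the one above can also be scanned for stalls automatically rather than by eye. The following is only a minimal sketch, assuming the first two comma-separated columns are the time in milliseconds and the max latency in nanoseconds as described above. The function name find_stalls and the 1-second threshold are illustrative assumptions, not part of fio or bcache.

```python
from typing import Iterable, List, Tuple

# Assumption for illustration: a per-second max latency of 1 s or more counts as a stall.
STALL_THRESHOLD_NS = 1_000_000_000


def find_stalls(lines: Iterable[str],
                threshold_ns: int = STALL_THRESHOLD_NS) -> List[Tuple[int, float]]:
    """Return (time_ms, latency_seconds) for each log row at or above the threshold."""
    stalls = []
    for line in lines:
        fields = [field.strip() for field in line.split(",")]
        # Skip blank lines, "..." elisions, and any row whose first two
        # columns are not plain non-negative integers.
        if len(fields) < 2 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        time_ms, latency_ns = int(fields[0]), int(fields[1])
        if latency_ns >= threshold_ns:
            stalls.append((time_ms, latency_ns / 1e9))
    return stalls
```

Fed the sample rows above, e.g. via find_stalls(open("lat_lat.1.log")), this would report only the row at 802615 ms, with a latency of about 24.3 seconds.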
> I used a 375 GB local SSD (cache device) and 1 TB of network-attached storage (backing device). In the 1-hour run, GC starts happening about 10 minutes into the run and then recurs at roughly 5-minute intervals. The stall duration ranges from a few seconds at the beginning to close to 40 seconds towards the end. Only about 1/2 to 2/3 of the cache is used by the end.
> 
> Note that this patch doesn't shorten the GC either. Instead, it just keeps GC from blocking the allocator thread by first sweeping the buckets and quickly marking the reclaimable ones at the beginning of GC, so the allocator can proceed while GC continues its actual job.
> 
> We are eagerly looking forward to this patch being merged in the coming merge window, which is expected to open in a week or two.

To avoid the no-space deadlock, normally around 10% of the space is never allocated. I need to look more closely at this patch.

Dongsheng Yang,

Could you please post a new version based on the current mainline kernel code?

Thanks.

Coly Li