Re: [PATCH v2] bcache: allow allocator to invalidate bucket in gc

Dear Coly,

Thank you for your dedication to reviewing this patch. I understand my
previous message may have come across as pushy, but I want to
emphasize the significance of this bcache operational issue, as it has
been reported by multiple users.

We understand the importance of thoroughness. To that end, we have
conducted extensive, repeated testing of this patch across a range of
cache sizes (375 GB/750 GB/1.5 TB/3 TB/6 TB/9 TB) and CPU core counts
(2/4/8/16/32/48/64/80/96/128) in hour-long runs. We tested various
workloads (read-only, read-write, and write-only) with an 8 kB I/O
size. In addition, we did a series of 16-hour runs with a 750 GB cache
and 16 CPU cores. Our tests, primarily in writethrough mode, have not
revealed any issues or deadlocks.
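
For reference, a representative invocation for the write-only case
would look much like the read-write fio command quoted below in the
thread, with randrw swapped for randwrite (illustrative only; the test
file name and size are placeholders):

$ fio --time_based --runtime=3600s --ramp_time=2s --ioengine=libaio --name=latency_test --filename=test --bs=8k --iodepth=1 --size=900G --readwrite=randwrite --verify=0 --write_lat_log=lat --log_avg_msec=1000 --log_max_value=1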

We hope this additional testing data proves helpful. Please let us
know if there are any other specific tests or configurations you would
like us to consider.

Thank you,
Robert


On Mon, May 13, 2024 at 12:43 AM Coly Li <colyli@xxxxxxx> wrote:
>
>
>
> > On May 12, 2024, at 13:43, Robert Pang <robertpang@xxxxxxxxxx> wrote:
> >
> > Hi Coly
> >
> > I see that Mingzhe has submitted the rebased patch [1]. Do you have a
> > chance to reproduce the stall and test the patch? Are we on track to
> > submit this patch upstream in the coming 6.10 merge window? Do you
> > need any help or more info?
> >
>
> Hi Robert,
>
> Please don’t push me. The first wave of bcache-6.10 is in linux-next now. For this patch, I need to do more pressure testing to be confident that a no-space deadlock won’t be triggered.
>
> The testing is simple: use a small I/O size (512 bytes to 4 KB) to do random writes on a writeback-mode cache for a long time (24-48 hours), and see whether any warning or deadlock happens.
>
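> To be concrete, the kind of fio job I mean looks roughly like this (just a sketch; the bcache device name and runtime are placeholders, and cache_mode must be set to writeback first):
>
> $ echo writeback > /sys/block/bcache0/bcache/cache_mode
> $ fio --name=pressure --filename=/dev/bcache0 --direct=1 --ioengine=libaio --rw=randwrite --bsrange=512-4k --iodepth=32 --time_based --runtime=172800s --verify=0
>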
> For me, my tests cover cache sizes of 256G/512G/1T/4T with 20-24 CPU cores. If you can help test on more machines and configurations, that would be helpful.
>
> I trust you and Zheming on the allocation latency measurement; now I need to confirm that giving allocation higher priority than GC won’t trigger a no-space deadlock in practice.
>
> Thanks.
>
> Coly Li
>
>
> >
> >
> > [1] https://lore.kernel.org/linux-bcache/1596418224.689.1715223543586.JavaMail.hmail@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/T/#u
> >
> >
> > On Tue, May 7, 2024 at 7:34 PM Dongsheng Yang
> > <dongsheng.yang@xxxxxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On Saturday, May 4, 2024 at 11:08 AM, Coly Li wrote:
> >>>
> >>>
> >>>> On May 4, 2024, at 10:04, Robert Pang <robertpang@xxxxxxxxxx> wrote:
> >>>>
> >>>> Hi Coly,
> >>>>
> >>>>> Can I know In which kernel version did you test the patch?
> >>>>
> >>>> I tested in both Linux kernels 5.10 and 6.1.
> >>>>
> >>>>> I didn’t observe obvious performance advantage of this patch.
> >>>>
> >>>> This patch doesn't improve bcache performance. Instead, it eliminates the I/O stall in bcache that happens whenever GC runs, because bch_allocator_thread() blocks waiting for GC to finish.
> >>>>
> >>>> /*
> >>>> * We've run out of free buckets, we need to find some buckets
> >>>> * we can invalidate. First, invalidate them in memory and add
> >>>> * them to the free_inc list:
> >>>> */
> >>>> retry_invalidate:
> >>>> allocator_wait(ca, ca->set->gc_mark_valid &&  <--------
> >>>>        !ca->invalidate_needs_gc);
> >>>> invalidate_buckets(ca);
> >>>>
> >>>> From what you showed, it looks like your rebase is good. As you already noticed, the original patch was based on a 4.x kernel, so the bucket traversal in btree.c needs to be adapted for 5.x and 6.x kernels. I attached the patch rebased onto the 6.9 HEAD for your reference.
> >>>>
> >>>> But to observe the I/O stall before the patch, please test with a read-write workload so that GC happens frequently enough (a read-only or read-mostly workload doesn't show the problem). I used the "fio" utility to generate a random read-write workload as follows.
> >>>>
> >>>> # Pre-generate a 900GB test file
> >>>> $ truncate -s 900G test
> >>>>
> >>>> # Run random read-write workload for 1 hour
> >>>> $ fio --time_based --runtime=3600s --ramp_time=2s --ioengine=libaio --name=latency_test --filename=test --bs=8k --iodepth=1 --size=900G  --readwrite=randrw --verify=0 --filename=fio --write_lat_log=lat --log_avg_msec=1000 --log_max_value=1
> >>>>
> >>>> We include the flags "--write_lat_log=lat --log_avg_msec=1000 --log_max_value=1" so that fio dumps the second-by-second max latency into a log file at the end of the test, letting us see when a stall happens and for how long:
> >>>>
> >>>
> >>> Copied. Thanks for the information. Let me try the above command lines on my local machine with a longer run time.
> >>>
> >>>
> >>>
> >>>> E.g.
> >>>>
> >>>> $ more lat_lat.1.log
> >>>> (format: <time-ms>,<max-latency-ns>,,,)
> >>>> ...
> >>>> 777000, 5155548, 0, 0, 0
> >>>> 778000, 105551, 1, 0, 0
> >>>> 802615, 24276019570, 0, 0, 0 <---- stalls for 24s with no IO possible
> >>>> 802615, 82134, 1, 0, 0
> >>>> 804000, 9944554, 0, 0, 0
> >>>> 805000, 7424638, 1, 0, 0
> >>>>
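> >>>> A quick way to pull the stall entries out of a log in this format is to filter on the second column (max latency in nanoseconds); the 1-second threshold here is arbitrary:
> >>>>
> >>>> $ awk -F, '$2 > 1000000000' lat_lat.1.log
> >>>>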
> >>>> I used a 375 GB local SSD (cache device) and 1 TB of network-attached storage (backing device). In the 1-hour run, GC starts happening about 10 minutes in and then recurs at roughly 5-minute intervals. The stall duration ranges from a few seconds at the beginning to close to 40 seconds towards the end. Only about 1/2 to 2/3 of the cache is used by the end.
> >>>>
> >>>> Note that this patch doesn't shorten GC either. Instead, it just prevents GC from blocking the allocator thread: at the beginning of GC it quickly sweeps the buckets and marks the reclaimable ones, so the allocator can proceed while GC continues its actual job.
> >>>>
> >>>> We are eagerly looking forward to this patch being merged in the coming merge window, which is expected to open in a week or two.
> >>>
> >>> In order to avoid the no-space deadlock, normally around 10% of the space is held back and not allocated out. I need to look more closely at this patch.
> >>>
> >>>
> >>> Dongsheng Yang,
> >>>
> >>> Could you please post a new version based on the current mainline kernel code?
> >>
> >> Hi Coly,
> >>        Mingzhe will send a new version based on mainline.
> >>
> >> Thanx
> >>>
> >>> Thanks.
> >>>
> >>> Coly Li
> >>>
> >>>
> >>>
>




