> On 4 May 2023, at 22:38, Benard Bsc <benard_bsc@xxxxxxxxxxx> wrote:
>
> I've tried to upgrade the system to the latest kernel available
> (5.4.0-139) and I have attempted to swap out the RAID controller in
> case it was faulty, but neither of those things has helped the
> situation. Once again, even within the same deployment, two nodes are
> experiencing this problem while another one is fine, despite the fact
> that they all appear to be identical.
>
> Is there anything else that can be checked or done besides upgrading
> the distro? Since this is an HWE kernel, I was under the impression
> that it should be supported until 2028. Are there any hidden
> configuration parameters which could perhaps have caused this issue?
>
> Regards
>
> Benard
>
>
> From: Benard Bsc <benard_bsc@xxxxxxxxxxx>
> Sent: 10 February 2023 10:44
> To: Andrea Tomassetti <andrea.tomassetti-opensource@xxxxxxxx>
> Cc: linux-bcache@xxxxxxxxxxxxxxx <linux-bcache@xxxxxxxxxxxxxxx>
> Subject: Re: Bcache btree cache size bug
>
> On Thu, 2023-02-09 at 17:07 +0100, Andrea Tomassetti wrote:
>> On Thu, Feb 9, 2023 at 1:22 PM <benard_bsc@xxxxxxxxxxx> wrote:
>>> I believe I have found a bug in bcache where the btree grows out of
>>> control and makes operations like garbage collection take a very
>>> large amount of time, affecting client IO. I can see recurring
>>> periods where bcache devices stop responding to client IO and the
>>> cache device starts doing a large amount of reads. To test the
>>> above I triggered gc manually using 'echo 1 > trigger_gc' and
>>> observed the cache set. Once again a large amount of reads started
>>> happening on the cache device and all the bcache devices of that
>>> cache set stopped responding. I believe this is because gc blocks
>>> all client IO while it is happening (I might be wrong). Checking
>>> the stats, I can see that 'btree_gc_average_duration_ms' is almost
>>> 2 minutes, which seems excessively large to me. Furthermore, doing
>>> something like reading bset_tree_stats will just hang and cause a
>>> similar performance impact.
>>>
>>> An interesting thing to note is that after garbage collection the
>>> number of btree nodes is lower but the btree cache actually grows
>>> in size.
>>>
>>> Example:
>>> /sys/fs/bcache/c_set# cat btree_cache_size
>>> 5.2G
>>> /sys/fs/bcache/c_set# cat internal/btree_nodes
>>> 28318
>>> /sys/fs/bcache/c_set# cat average_key_size
>>> 25.2k
>>>
>>> Just for reference, I have a similar environment (which is busier
>>> and has more data stored) which doesn't experience this issue; the
>>> numbers for the above are:
>>> /sys/fs/bcache/c_set# cat btree_cache_size
>>> 840.5M
>>> /sys/fs/bcache/c_set# cat internal/btree_nodes
>>> 3827
>>> /sys/fs/bcache/c_set# cat average_key_size
>>> 88.3k
>>>
>>> Kernel version: 5.4.0-122-generic
>>> OS version: Ubuntu 18.04.6 LTS
>> Hi Benard,
>> your Linux distro and kernel version are quite old. There is a good
>> chance that things have been fixed in the meantime. Would it be
>> possible for you to try to reproduce your bug with a newer kernel?
>>
>> Regards,
>> Andrea
>>> bcache-tools package: 1.0.8-2ubuntu0.18.04.1
>>>
>>> I am able to provide more info if needed.
>>> Regards
>>>
> Hi Andrea,
>
> Thank you very much for your email. Unfortunately, due to the nature
> of this system and the other software running on it, I am unable to
> upgrade the kernel/distro at the moment. I am also unsure that I will
> be able to reproduce this bug, as even on other deployments with the
> same version of bcache/kernel this problem does not seem to be
> happening. Is there some information I can gather from the existing
> environment without changing the software versions?
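To that last question: one low-impact option is to snapshot the cache
set's read-only sysfs statistics before and after a manual gc run. A
minimal sketch, assuming the sysfs paths shown earlier in the thread;
/sys/fs/bcache/c_set stands in for the real cache-set UUID, and the
internal/ location of the gc-duration file is an assumption:

  #!/bin/sh
  # Snapshot read-only bcache cache-set statistics; reading these files
  # does not change any state. Substitute the real cache-set UUID below.
  CSET=/sys/fs/bcache/c_set
  for f in btree_cache_size average_key_size internal/btree_nodes \
           internal/btree_gc_average_duration_ms; do
      # Skip any file this kernel's sysfs does not expose.
      [ -r "$CSET/$f" ] && printf '%s: %s\n' "$f" "$(cat "$CSET/$f")"
  done

Comparing two such snapshots taken around a gc run would show whether
btree_cache_size keeps growing while internal/btree_nodes shrinks,
which is the anomaly described above.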
The patches which limit the B+tree node in-memory cache were merged
into Linux v5.6. If the kernel you use doesn't have these patches
backported, the situation described above may still exist. For a
non-distro kernel, the latest upstream kernel might be a choice; for a
distro kernel, it helps if the distro kernel maintainers have the
bcache fixes in. A rough first-pass version check is sketched below.

Coly Li
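The version check could look something like the following; it is only
a first pass, since distro backports of the bcache fixes would not
change the kernel version string:

  # Compare the running kernel against v5.6, where the btree node cache
  # limiting was merged upstream. An older version string is not
  # conclusive on its own -- the distro may have backported the fixes.
  kver=$(uname -r | cut -d- -f1)
  if printf '5.6\n%s\n' "$kver" | sort -C -V; then
      echo "kernel $kver is v5.6 or newer: limiting patches should be upstream"
  else
      echo "kernel $kver predates v5.6: ask the distro whether the bcache fixes were backported"
  fi

On the reporter's 5.4.0-122 kernel this takes the second branch, which
is consistent with the explanation above.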