Re: Bcache btree cache size bug

Benard Bsc <benard_bsc@xxxxxxxxxxx> · Thu, 4 May 2023 14:38:28 +0000

I have upgraded the system to the latest kernel available (5.4.0-139) and swapped out the RAID controller in case it was faulty, but neither change has helped. Once again, even within the same deployment, two nodes are experiencing this problem while another one is fine, despite the fact that they all appear to be identical.

Is there anything else that can be checked or done besides upgrading the distro? Since this is an HWE kernel, I was under the impression that it should be supported until 2028. Are there any hidden configuration parameters that could have caused this issue?

Regards

Benard

From: Benard Bsc <benard_bsc@xxxxxxxxxxx>
Sent: 10 February 2023 10:44
To: Andrea Tomassetti <andrea.tomassetti-opensource@xxxxxxxx>
Cc: linux-bcache@xxxxxxxxxxxxxxx <linux-bcache@xxxxxxxxxxxxxxx>
Subject: Re: Bcache btree cache size bug 

On Thu, 2023-02-09 at 17:07 +0100, Andrea Tomassetti wrote:
> On Thu, Feb 9, 2023 at 1:22 PM <benard_bsc@xxxxxxxxxxx> wrote:
> > I believe I have found a bug in bcache where the btree grows out of
> > control and makes operations like garbage collection take a very
> > long time, affecting client IO. Periodically the bcache devices stop
> > responding to client IO and the cache device starts doing a large
> > number of reads. To test this I triggered gc manually using
> > 'echo 1 > trigger_gc' and observed the cache set. Once again a large
> > number of reads started on the cache device and all the bcache
> > devices of that cache set stopped responding. I believe this is
> > because gc blocks all client IO while it is running (I might be
> > wrong). Checking the stats, I can see that
> > 'btree_gc_average_duration_ms' is almost 2 minutes, which seems
> > excessively large to me. Furthermore, doing something like checking
> > bset_tree_stats will just hang and cause a similar performance
> > impact.
> > 
> > An interesting thing to note is that after garbage collection the
> > number of btree nodes is lower but the btree cache actually grows in
> > size.
> > 
> > Example:
> > /sys/fs/bcache/c_set# cat btree_cache_size
> > 5.2G
> > /sys/fs/bcache/c_set# cat internal/btree_nodes
> > 28318
> > /sys/fs/bcache/c_set# cat average_key_size
> > 25.2k
> > 
> > Just for reference, I have a similar environment (which is busier and
> > has more data stored) that doesn't experience this issue, and the
> > numbers for the above are:
> > /sys/fs/bcache/c_set# cat btree_cache_size
> > 840.5M
> > /sys/fs/bcache/c_set# cat internal/btree_nodes
> > 3827
> > /sys/fs/bcache/c_set# cat average_key_size
> > 88.3k
> > 
> > Kernel version: 5.4.0-122-generic
> > OS version: Ubuntu 18.04.6 LTS
> Hi Bernard,
> your Linux distro and kernel version are quite old. There is a good
> chance that things got fixed in the meantime. Would it be possible
> for you to try to reproduce your bug with a newer kernel?
> 
> Regards,
> Andrea
> > bcache-tools package: 1.0.8-2ubuntu0.18.04.1
> > 
> > I am able to provide more info if needed
> > Regards
> > 
Hi Andrea,

Thank you very much for your email. Unfortunately, due to the nature of
this system and the other software running on it, I am unable to upgrade
the kernel/distro at the moment. I am also unsure whether I will be able
to reproduce this bug, as even on other deployments with the same
version of bcache/kernel this problem does not seem to be happening. Is
there some information I can gather from the existing environment
without changing the software versions?
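
As a starting point, the counters quoted in the original report could be snapshotted on the affected and unaffected nodes and compared side by side. The following is only a minimal sketch: the function name snapshot_cset is hypothetical, the cache set directory is passed in by the caller, and the exact location of btree_gc_average_duration_ms (top level or under internal/) is an assumption, so both are tried. bset_tree_stats is deliberately left out because reading it hangs on the affected nodes.

```shell
#!/bin/sh
# Sketch: print selected bcache cache-set counters as name=value lines.
# $1 is the cache set directory, e.g. /sys/fs/bcache/<cset-uuid>.
snapshot_cset() {
    cset_dir=$1
    # The first three attributes appear in the original report. Where the
    # gc duration file lives is an assumption, so two candidates are tried.
    # Attributes that are missing or unreadable are skipped silently.
    for attr in \
        btree_cache_size \
        internal/btree_nodes \
        average_key_size \
        btree_gc_average_duration_ms \
        internal/btree_gc_average_duration_ms
    do
        if [ -r "$cset_dir/$attr" ]; then
            printf '%s=%s\n' "$attr" "$(cat "$cset_dir/$attr")"
        fi
    done
}
```

Calling snapshot_cset with the cache set directory on each node and saving the output with a timestamp before and after a gc run would give comparable numbers without touching the kernel or distro.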

Regards,

Benard