Re: Bcache is not caching anything. cache state=inconsistent, how to clear?

Kai Krakow <kai@xxxxxxxxxxx> · Wed, 24 Nov 2021 06:35:48 +0100

Hello!

On Tue, 23 Nov 2021 at 23:34, Tobiasz Karoń <unfa00@xxxxxxxxx> wrote:
>
> Thank you for your detailed reply and sharing your experience and solution.
>
> So it seems Bcache and Btrfs are fundamentally incompatible when it
> comes to caching writes? It has worked fine for 2 months, and then it
> just imploded. I'll stay in writearound mode to be safe.

No, they are not fundamentally incompatible, but losing writeback data
on btrfs is a much more visible catastrophic event than on other
filesystems (which write data in place, whereas btrfs writes
copy-on-write).

Even with other filesystems, bcache destroying itself in writeback
mode would severely damage your filesystem. On a classical filesystem
you usually end up with garbled files containing a mix of old and new
data, and maybe some fixable metadata errors. BUT: it is still a
catastrophic event, maybe even more so because the data loss can go
unnoticed and end up in your backups, only for you to find out later
that the data you're missing has already been rotated out of the
backups.

Don't use writeback if you cannot afford to recover from backup when
writeback fails. That's a property of how caching works, not a
property of btrfs or bcache. It's the same for any writeback cache you
might be using: RAID controllers come with writeback caches and
sometimes decide to throw their contents away, leaving you with
destroyed filesystems, so you usually turn that off (unless your
workload requires it and you can afford to throw lost data away). That
doesn't make them fundamentally incompatible with filesystems, right?

Your HDD comes with a write cache, too, which may destroy your
filesystem on power loss. You might want to turn that off, especially
when using btrfs (but also for better write latency behavior, and the
kernel has better IO scheduling anyway than the really small write
caches of HDDs): `hdparm -W0 /dev/HDDDEV`. HDD write caches are only
useful for operating systems that do not do proper write
ordering/merging (usually DOS, and maybe Windows), or when the HDD
firmware is buggy and cannot use async queueing, in which case write
caches may improve performance a lot. But usually, you want to keep
that setting off. That becomes even more important when you use bcache
in writeback mode (because HDD write caching may then break
assumptions of bcache).
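
To check the write cache setting discussed above before changing it,
something along these lines should work. It is only a sketch:
`SYSFS_ROOT` and the function name are made up for illustration, and
HDDDEV is a placeholder, as in the hdparm command above.

```shell
# Sketch: report the kernel's view of a disk's volatile write cache.
# SYSFS_ROOT and write_cache_mode are illustrative names, not a standard API.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

write_cache_mode() {
    # $1 is a kernel device name such as "sda" (without the /dev/ prefix)
    cat "$SYSFS_ROOT/block/$1/queue/write_cache"
}

# On a real system (as root), with HDDDEV as a placeholder:
#   write_cache_mode HDDDEV     # prints "write back" or "write through"
#   hdparm -W /dev/HDDDEV       # asks the drive for its write-caching flag
#   hdparm -W0 /dev/HDDDEV      # turns the drive's write cache off
```

On some drives the setting may not survive a power cycle, so it is
worth re-checking after a cold boot.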

> I've checked and my cache device has a block size of 512 bytes.

Yep, all my bcache systems using 512 bytes are affected by that 5.15.2
kernel bug. Use 4k and you should be okay. The problem seems to come
from page-unaligned writes - and using 4k (the page size of your CPU)
seems to work around that. Kernel 5.15.3 has most of the fix; another
fix is queued for one of the next releases. Another lesson learned:
Don't use a new kernel until it's in its x.y.{4,5,6} releases. This is
not the first time I've had catastrophic events with kernels in their
infancy. That's why I usually avoid .0 and .1 kernels. Seems I should
add .2 and .3 kernels to that list, too. Never do a major kernel
upgrade without creating a full backup first. Kernel components like
bcache are much less well-tested than other components, so they are
more likely to break on early kernel releases in some exotic use cases
(exotic because nobody who cares about their data uses writeback).
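
The rule of thumb above (avoid x.y.0 to x.y.3) can be turned into a
small check, for example in an update script. This is only a sketch:
the function name and the `MIN_PATCHLEVEL` variable are made up, and
the threshold of 4 is simply the rule of thumb from this mail, not an
official guideline.

```shell
# Sketch: succeed (exit 0) if a kernel version is an "early" stable release.
# MIN_PATCHLEVEL and kernel_is_early_release are illustrative names.
MIN_PATCHLEVEL="${MIN_PATCHLEVEL:-4}"

kernel_is_early_release() {
    # $1 is a version string such as the output of `uname -r`
    ver="${1%%-*}"                      # 5.15.2-arch1-1 -> 5.15.2
    patch="${ver#*.*.}"                 # 5.15.2 -> 2
    [ "$patch" = "$ver" ] && patch=0    # no third component, e.g. "5.15"
    patch="${patch%%[!0-9]*}"           # keep leading digits only
    patch="${patch:-0}"
    [ "$patch" -lt "$MIN_PATCHLEVEL" ]
}

# Usage:
#   if kernel_is_early_release "$(uname -r)"; then
#       echo "early stable release: take a full backup, avoid writeback"
#   fi
```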

> That's
> a strange value, as the backing device is a AF HDD (like all of them
> in the past decade or more), so the block size should be 4Kb.
> I guess this also works until it doesn't.

You won't have catastrophic events with writearound - and on btrfs
that's about as good as writeback (even better, because it won't
destroy the filesystem in case of a cache hiccup). Bcache can break
for any reason, due to bugs, like any other kernel component. And
bcache breaking in writeback mode usually means catastrophic results
for ANY filesystem attached to it - btrfs is just much more likely to
detect those events. Even if you COULD repair the logical structure of
the filesystem, it still means some data wasn't written - btrfs just
has a much better understanding of what should be on the disk, while
other filesystems silently accept the data loss after recovering from
structural errors. BTW: 4k should be safe, although there's another,
unrelated problem in bcache which still needs fixing.
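
For reference, the cache mode is switched per backing device through
sysfs. A minimal sketch, assuming the device is bcache0; `SYSFS_ROOT`
and the function names are made up for illustration:

```shell
# Sketch: inspect and change the bcache cache mode of a backing device.
# SYSFS_ROOT and the function names are illustrative, not a standard API.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

bcache_current_mode() {
    # $1 is the bcache device name, e.g. "bcache0".
    # The kernel lists all modes and marks the active one with brackets,
    # e.g. "writethrough [writeback] writearound none".
    sed -n 's/.*\[\(.*\)\].*/\1/p' "$SYSFS_ROOT/block/$1/bcache/cache_mode"
}

bcache_set_writearound() {
    echo writearound > "$SYSFS_ROOT/block/$1/bcache/cache_mode"
}

# On a real system (as root):
#   bcache_current_mode bcache0
#   bcache_set_writearound bcache0
#   cat /sys/block/bcache0/bcache/state        # cache state of the device
#   cat /sys/block/bcache0/bcache/dirty_data   # data not yet written back
```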

> Can I destroy and recreate the cache device on a live system (my root
> filesystem is on this bcache set). I guess I can't.

Yes, you can. Detaching the cache puts the backing devices into
pass-through mode: they stay available as /dev/bcache* even with no
caching device attached.

> This is probably what I've done wrong today - I did
> not unregister the whole cset before attempting to recreate the cache
> device.

Okay, unregistering is essential, but you don't need to reboot. Also,
I recommend using a new cset UUID so it cannot conflict with any stale
data that MAY be stored in the cache.
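
Put together, the sequence could look roughly like the sketch below.
It deliberately only prints the commands for review instead of running
them, since this is about a root filesystem. The function name is made
up, LABEL and NEWCSETUUID are placeholders, and the `bcache make` line
is the one from my earlier mail; adjust everything to your system.

```shell
# Sketch: print (not execute) the steps to recreate a bcache cache device.
# bcache_recreate_plan is an illustrative name; LABEL and NEWCSETUUID are
# placeholders that have to be filled in by hand.
bcache_recreate_plan() {
    bdev="$1"; old_cset="$2"; cpart="$3"
    cat <<EOF
# 1. detach the cache; the backing device stays usable as /dev/$bdev
echo 1 > /sys/block/$bdev/bcache/detach
# 2. repeat until this prints "no cache"
cat /sys/block/$bdev/bcache/state
# 3. unregister the old cache set
echo 1 > /sys/fs/bcache/$old_cset/unregister
# 4. recreate the cache with a 4k block size (should get a fresh cset UUID)
bcache make -C -w 4096 -l LABEL --force $cpart
# 5. register it (udev may already have done so), then look up the new UUID
echo $cpart > /sys/fs/bcache/register
ls /sys/fs/bcache/
# 6. attach the new cache set to the backing device
echo NEWCSETUUID > /sys/block/$bdev/bcache/attach
EOF
}

# Usage:
#   bcache_recreate_plan bcache0 CSETUUID /dev/CPART
```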

> I am honestly a little afraid to touch it, after what happened.

Well, as long as the cache is stopped or detached, it doesn't matter
what you do to it. Just don't use writeback for the next couple of
kernel releases (or maybe rather avoid it completely in the future).
Writeback really doesn't gain you a lot on btrfs: due to COW, btrfs is
already quite good at writing (because writes are usually going to be
sequential anyway), and it has become a lot better during the last few
kernel release cycles. I've been using writeback for a long time now,
but this is just another occasion that shows I should have been using
writearound instead (the other one being that sometimes on boot, my
SSD detaches from the bus, making bcache throw away all writeback data
and leaving me with a destroyed filesystem).

> I hope Bcachefs will eliminate these problems and provide a stable
> unified solution.

You're swapping one "experimental" FS (btrfs), which has matured a
great deal over at least the last 5 years, for another experimental
filesystem which is not yet battle-tested and performance-tuned.
bcachefs and bcache are two completely distinct products with
different use cases; they only share a similar name because the
fundamental inner structures are based on the same code and idea (and
probably because the author thought it's cool).

I'm not sure if you use device pooling with btrfs (multiple disks),
but on my system it proved useful NOT to use RAID-0 for btrfs data:
it's actually slower in normal desktop use, given the way btrfs
internally distributes data access across devices. I found that using
single-data mode, even with multiple disks, has better write behavior
and better read latency, and it makes better use of bcache. So maybe
it's worth a try if you fear that using writearound mode could degrade
your system responsiveness too much.
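
To see which data profile a filesystem currently uses and to convert
it, something like the following sketch could be used. The function
name is made up and MOUNTPOINT is a placeholder; the conversion
rewrites all data chunks, so expect it to take a while and have a
backup first.

```shell
# Sketch: extract the data profile from `btrfs filesystem df` output.
# btrfs_data_profile is an illustrative name; it reads the output on stdin.
btrfs_data_profile() {
    # Matches a line such as "Data, RAID0: total=2.00GiB, used=1.00GiB"
    sed -n 's/^Data, \([^:]*\):.*/\1/p'
}

# On a real system (as root), with MOUNTPOINT as a placeholder:
#   btrfs filesystem df MOUNTPOINT | btrfs_data_profile
#   btrfs balance start -dconvert=single MOUNTPOINT
```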

> Take care
> - unfa

Good luck
Kai

> On Tue, 23 Nov 2021 at 18:40, Kai Krakow <kai@xxxxxxxxxxx> wrote:
> >
> > Oops:
> >
> > > # echo 1 >/sys/fs/bcache/CSETUUID/unregister
> > > # bcache make -C -w 4096 -l LABEL --force /dev/BPART
> >
> > CPART of course!
> >
> > # bcache make -C -w 4096 -l LABEL --force /dev/CPART
> >
> > Bye
> > Kai
>
>
>
> --
> - Tobiasz 'unfa' Karoń
>
> www.youtube.com/unfa000