Re: Bcache is not caching anything. cache state=inconsistent, how to clear?

Tobiasz Karoń <unfa00@xxxxxxxxx> · Wed, 24 Nov 2021 13:41:20 +0100

On Wed, 24 Nov 2021 at 06:36, Kai Krakow <kai@xxxxxxxxxxx> wrote:
>
> Hello!
>
> On Tue, 23 Nov 2021 at 23:34, Tobiasz Karoń <unfa00@xxxxxxxxx> wrote:
> >
> > Thank you for your detailed reply and sharing your experience and solution.
> >
> > So it seems Bcache and Btrfs are fundamentally incompatible when it
> > comes to caching writes? It has worked fine for 2 months, and then it
> > just imploded. I'll stay in writearound mode to be safe.
>
> No, they are not fundamentally incompatible, but losing writeback data
> on btrfs is a much more visible catastrophic event than on other
> filesystems (which write data in place where btrfs writes COW).
My issue with Btrfs is that it seems to get trashed very easily. I
would expect a COW filesystem to be much more resilient to various
errors. It seems to me that sometimes a single bad sector can make the
filesystem unmountable and unrecoverable. Maybe I am just not handling
such events properly. I've definitely made mistakes in the past
(sometimes because I didn't have enough spare disks to take images
before messing around; I'm not going to do that again).
>
> Even with other filesystems, bcache destroying itself in writeback
> mode would cause severe damage to your filesystem (on a classical
> filesystem, you usually end up with garbled files having partially old
> and new data, maybe some fixable metadata errors). BUT: it is still a
> catastrophic event, maybe even more so because data loss could go
> silent, ending up in your backups, only for you to find later that
> you're missing data that has already been rotated out of the backup.
>
> Don't use writeback if you cannot afford to recover from backup when
> writeback fails. That's a property of how caching works, not a
> property of btrfs or bcache. It's the same for any writeback cache you
> might be using: RAID controllers come with writeback caches and
> sometimes decide to throw the contents away, leaving you with
> destroyed filesystems (so you usually turn that off unless your
> workload requires it and you can afford to throw lost data away). That
> doesn't make them fundamentally incompatible with filesystems, right?
>
> Your HDD comes with a write cache which may destroy your filesystem,
> too, on power loss. You might want to turn that off, especially when
> using btrfs (but also for better write latency behavior, and the
> kernel has better IO scheduling anyway than the really small write
> caches of HDDs): `hdparm -W0 /dev/HDDDEV`. HDD write caches are only
> useful for operating systems that do no proper write ordering/merging
> (usually DOS, and maybe Windows), or when HDD firmware is buggy and
> cannot use async queueing, in which case write caches may improve
> performance a lot. But usually, you want to keep that setting off.
> That becomes even more important when you use bcache in writeback mode
> (because HDD write caching may then break assumptions of bcache).

I've found out that hard drives I am using have a firmware bug that
can corrupt data when using write cache:
https://www.reddit.com/r/linux/comments/c59nry/btrfs_vs_write_caching_firmware_bugs_tldr_some/es1krq2/

I'm going to disable write cache on all of these drives. This could
explain some spontaneous collapses of Btrfs and Bcache on my system in
the past. But again: I'd expect a COW filesystem to be able to recover
from incomplete writes. I've been using Btrfs for about 3-4 years now.
Maybe I just don't know how to handle issues...
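
For future reference, a minimal sketch of how I'd turn the write cache
off on several drives at once, using the `hdparm -W0` command Kai
mentioned. The function name and the DRY_RUN switch are my own
invention, and the device names are placeholders. By default it only
prints what it would run. As far as I know the setting may not survive
a power cycle on some drives, so it would probably have to be reapplied
at boot.

```shell
#!/bin/sh
# Sketch: disable the on-drive write cache for a list of HDDs.
# DRY_RUN=1 (the default here) only prints the commands instead of
# running them; set DRY_RUN=0 and run as root to actually apply them.
DRY_RUN="${DRY_RUN:-1}"

disable_write_cache() {
    for dev in "$@"; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "hdparm -W0 $dev"
        else
            hdparm -W0 "$dev"
        fi
    done
}

# Example (device names are placeholders):
# disable_write_cache /dev/sda /dev/sdb
```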

I wonder if there's an option for me to update the firmware on my
existing drives without booting into Windows.
It seems that *some* HDD manufacturers have easy tools for Linux to do
that, but I don't know what they are, as that was redacted:
https://forum.corsair.com/forums/topic/77369-flashing-firmware-with-linux-hdparm-command/

I see that hdparm has an option called --fwdownload, though I'd
certainly not try that without being absolutely sure it'll work.

>
> > I've checked and my cache device has a block size of 512 bytes.
>
> Yep, all my bcache systems using 512 bytes are affected by that 5.15.2
> kernel bug. Use 4k and you should be okay. The problem seems to come
> from page-unaligned writes - and using 4k (the page size of your CPU)
> seems to work around that. Kernel 5.15.3 has most of the fix;
> another fix is queued for one of the next releases. Another lesson
> learned: Don't use a new kernel until it's in its x.y.{4,5,6}
> releases. This is not the first time I had catastrophic events with
> kernels in their infancy. That's why I usually avoid .0 and .1
> kernels. Seems I should add .2 and .3 kernels to that list, too. Never
> do a major kernel upgrade without creating a full backup first. Kernel
> components like bcache are much less well-tested than other
> components, so they likely break on early kernel releases for some
> exotic use-cases (exotic because nobody who cares about their data
> uses writeback).
I'm at kernel 5.15.3 right now. I think Arch Linux ships kernel
updates after they reach .3. The 5.15 came out like 2 weeks ago.
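
A small sketch of how Kai's rule of thumb could be checked before a
reboot. The threshold of 4 is taken from his "x.y.{4,5,6}" remark. The
function name is made up, and the parsing assumes a version string
shaped like "5.15.3-arch1-1".

```shell
#!/bin/sh
# is_early_point_release VERSION
# Exit status 0 if the third version component (patch level) is below 4,
# i.e. the kernel is still an "early" point release by the rule above.
is_early_point_release() {
    ver="${1%%-*}"    # drop a suffix such as "-arch1-1"
    # third dot-separated field, cut off at the first non-digit
    patch="$(printf '%s\n' "$ver" | cut -d. -f3 | sed 's/[^0-9].*//')"
    patch="${patch:-0}"    # "5.15" has no third field, treat it as .0
    [ "$patch" -lt 4 ]
}

if is_early_point_release "$(uname -r)"; then
    echo "kernel $(uname -r) is an early point release, maybe wait or back up first"
fi
```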

>
> > That's
> > a strange value, as the backing device is an AF (Advanced Format) HDD
> > (like all of them in the past decade or more), so the block size
> > should be 4 KiB. I guess this also works until it doesn't.
>
> You won't have catastrophic events with writearound - and that's as
> good as writeback on btrfs (and even better because it won't destroy
> the filesystem in case of a cache hiccup). Bcache can break for any
> reason, due to bugs, like any other kernel component. And bcache in
> writeback mode usually means catastrophic results for ANY file system
> attached to it - where btrfs is just much more likely to detect those
> events. Even if you COULD repair the file system logical structure, it
> still means some data wasn't written - btrfs just has a much better
> understanding about what should be on the disk while other filesystems
> silently accept the data loss after recovering from structural errors.
> BTW: 4k should be safe, there's another problem in bcache unrelated to
> this which still needs fixing.
>
> > Can I destroy and recreate the cache device on a live system (my root
> > filesystem is on this bcache set)? I guess I can't.
>
> Yes, you can. Detaching the cache makes the backing devices pass
> through, they are still available as /dev/bcache* even with no caching
> device.
>
> > This is probably what I've done wrong today - I did
> > not unregister the whole cset before attempting to recreate the cache
> > device.
>
> Okay, unregistering should be quite essential but you don't need to
> reboot. Also, I recommend using a new cset UUID so it cannot conflict
> with any stale data that MAY be stored in the cache.
Yeah, I reused the existing cset UUID. That has probably caused bcache
to write garbage and corrupt the cache...
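
To have it written down in one place, here is a sketch of the sequence
as I understand it from this thread: detach, unregister the old cset,
recreate the cache with 4k blocks, then attach the backing device to
the new cset. It only prints the commands, nothing is executed. The
function name and arguments are placeholders of mine. The
`bcache make` line is the one Kai posted, and the rest are the usual
bcache sysfs files. I'm assuming the new cset UUID is read from the
tool's output or from /sys/fs/bcache/ after registration, and that udev
may already have registered the new cache device, in which case the
manual register step is not needed.

```shell
#!/bin/sh
# print_recreate_plan OLD_CSET_UUID NEW_CSET_UUID CACHE_PART BCACHE_DEV
# Prints (does not run) the steps to replace a cache device.
# All four arguments are placeholders supplied by the caller.
print_recreate_plan() {
    old_uuid="$1"; new_uuid="$2"; cache_part="$3"; bdev="$4"
    cat <<EOF
echo 1 > /sys/block/$bdev/bcache/detach
echo 1 > /sys/fs/bcache/$old_uuid/unregister
bcache make -C -w 4096 -l LABEL --force $cache_part
echo $cache_part > /sys/fs/bcache/register
echo $new_uuid > /sys/block/$bdev/bcache/attach
echo writearound > /sys/block/$bdev/bcache/cache_mode
EOF
}

# Example with placeholder values:
# print_recreate_plan OLDUUID NEWUUID /dev/CPART bcache0
```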
>
> > I am honestly a little afraid to touch it, after what happened.
>
> Well, the cache backend is stopped or detached - it doesn't matter
> anyways. Just don't use writeback for the next couple of kernel
> releases (or maybe rather avoid it for the future completely).
> Writeback really doesn't gain you a lot on btrfs because, due to COW,
> btrfs is already quite good at writing (because writes are usually going
> to be sequential anyways), and it has become a lot better during the
> last few kernel release cycles. I've been using writeback for a long
> time now but this is just another occasion why I should not have been
> using writeback but writearound instead (the other one being that
> sometimes on boot, my SSD detaches from the bus, making bcache throw
> away all writeback data and leaving me with a destroyed filesystem).

Ok, I've booted into a live ISO and recreated the cache with 4K
blocks. I hope it's gonna spare me some adventures in the future.
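
To double-check the result afterwards, something like the following
sketch could read the relevant values back from sysfs. The function
name and the SYSFS_ROOT override are my own (the override only exists
so it can be tried against a fake directory tree). The `state`,
`cache_mode` and `block_size` files are the ones I understand bcache
exposes. The exact formatting of their contents may differ from what I
expect.

```shell
#!/bin/sh
# Sketch: print state, cache mode and cset block size for all bcache
# devices. SYSFS_ROOT is an assumption of this sketch for testing;
# on a real system it stays at /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

show_bcache_status() {
    for d in "$SYSFS_ROOT"/block/bcache*/bcache; do
        [ -d "$d" ] || continue    # glob did not match anything
        name="$(basename "$(dirname "$d")")"
        state="$(cat "$d/state" 2>/dev/null || echo unknown)"
        # cache_mode lists all modes, the active one is in [brackets]
        mode="$(cat "$d/cache_mode" 2>/dev/null || echo unknown)"
        echo "$name state=$state cache_mode=$mode"
    done
    for f in "$SYSFS_ROOT"/fs/bcache/*/block_size; do
        [ -f "$f" ] || continue
        echo "cset $(basename "$(dirname "$f")") block_size=$(cat "$f")"
    done
}

show_bcache_status
```

If the state line shows "inconsistent" or the block size is still 512,
the recreate step did not take effect the way I intended.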

>
> > I hope Bcachefs will eliminate these problems and provide a stable
> > unified solution.
>
> You're swapping one "experimental" FS (btrfs), which has matured a
> great deal during at least the last 5 years, for another experimental
> filesystem which is not yet battle-tested and performance-tuned.
> bcachefs and bcache are two completely distinct products with
> different use-cases, they only share a similar name because the
> fundamental inner structures are based on the same code and idea (and
> probably because the author thought it's cool).
Yeah, honestly I wish he renamed Bcachefs to something shorter.
Anyway - I'm not gonna use it until it reaches mainline kernel, and
then still only for experiments, not for production.

>
> I'm not sure if you use device pooling with btrfs (multiple disks) but
> for my system, it proved useful to NOT use RAID-0 for btrfs data: it's
> actually slower in normal desktop use because of the way btrfs
> internally distributes data access across devices. I found that using
> single-data mode even with multidisk has better write behavior and
> better read latency, and it makes better use of bcache. So maybe it's
> worth a try if you fear that using writearound mode could degrade your
> system responsiveness too much.
I am not using multiple devices in a single Btrfs filesystem at the moment.
I assumed using 2 drives in RAID1 would double the read speed (on
large files) since the extents can be read from two disks at once.
It's strange that it doesn't work like that...

>
> > Take care
> > - unfa
>
> Good luck
> Kai

Thank you so much for your insight!
That's all invaluable information you're sharing.

I hope these messages are going to be available publicly in some
mailing list archive for future reference when I inevitably encounter
the same problems in 5 years after I forgot what it was all about...

Thank you!
- unfa

>
>
> > On Tue, 23 Nov 2021 at 18:40, Kai Krakow <kai@xxxxxxxxxxx> wrote:
> > >
> > > Oops:
> > >
> > > > # echo 1 >/sys/fs/bcache/CSETUUID/unregister
> > > > # bcache make -C -w 4096 -l LABEL --force /dev/BPART
> > >
> > > CPART of course!
> > >
> > > # bcache make -C -w 4096 -l LABEL --force /dev/CPART
> > >
> > > Bye
> > > Kai

-- 
- Tobiasz 'unfa' Karoń

www.youtube.com/unfa000