Eric, your "from" mail (lists@xxxxxxxxxxxxxxxxxxx) does not exist: > DNS Error: DNS type 'mx' lookup of bcache.ewheeler.net responded with code NXDOMAIN Domain name not found: bcache.ewheeler.net Or is something messed up on my side? All others, please ignore. Doesn't add to the conversation. Thanks. :-) Am Di., 12. Sept. 2023 um 22:02 Uhr schrieb Eric Wheeler <lists@xxxxxxxxxxxxxxxxxxx>: > > On Tue, 12 Sep 2023, 邹明哲 wrote: > > From: Eric Wheeler <lists@xxxxxxxxxxxxxxxxxxx> > > Date: 2023-09-07 08:42:41 > > To: Coly Li <colyli@xxxxxxx> > > Cc: Kai Krakow <kai@xxxxxxxxxxx>,"linux-bcache@xxxxxxxxxxxxxxx" <linux-bcache@xxxxxxxxxxxxxxx>,"吴本卿(云桌面 福州)" <wubenqing@xxxxxxxxxxxxx>,Mingzhe Zou <mingzhe.zou@xxxxxxxxxxxx> > > Subject: Re: Dirty data loss after cache disk error recovery > > >+Mingzhe, Coly: please comment on the proposed fix below when you have a > > >moment: > > > > Hi, Eric > > > > This is an old issue, and it took me a long time to understand what > > happened. > > > > > > > >On Thu, 7 Sep 2023, Kai Krakow wrote: > > >> Wow! > > >> > > >> I call that a necro-bump... ;-) > > >> > > >> Am Mi., 6. Sept. 2023 um 22:33 Uhr schrieb Eric Wheeler > > >> <lists@xxxxxxxxxxxxxxxxxxx>: > > >> > > > >> > On Fri, 7 May 2021, Kai Krakow wrote: > > >> > > > >> > > > Adding a new "stop" error action IMHO doesn't make things better. When > > >> > > > the cache device is disconnected, it is always risky that some caching > > >> > > > data or meta data is not updated onto cache device. Permit the cache > > >> > > > device to be re-attached to the backing device may introduce "silent > > >> > > > data loss" which might be worse.... It was the reason why I didn't add > > >> > > > new error action for the device failure handling patch set. > > >> > > > > >> > > But we are actually now seeing silent data loss: The system f'ed up > > >> > > somehow, needed a hard reset, and after reboot the bcache device was > > >> > > accessible in cache mode "none" (because they have been unregistered > > >> > > before, and because udev just detected it and you can use bcache > > >> > > without an attached cache in "none" mode), completely hiding the fact > > >> > > that we lost dirty write-back data, it's even not quite obvious that > > >> > > /dev/bcache0 now is detached, cache mode none, but accessible > > >> > > nevertheless. To me, this is quite clearly "silent data loss", > > >> > > especially since the unregister action threw the dirty data away. > > >> > > > > >> > > So this: > > >> > > > > >> > > > Permit the cache > > >> > > > device to be re-attached to the backing device may introduce "silent > > >> > > > data loss" which might be worse.... > > >> > > > > >> > > is actually the situation we are facing currently: Device has been > > >> > > unregistered, after reboot, udev detects it has clean backing device > > >> > > without cache association, using cache mode none, and it is readable > > >> > > and writable just fine: It essentially permitted access to the stale > > >> > > backing device (tho, it didn't re-attach as you outlined, but that's > > >> > > more or less the same situation). > > >> > > > > >> > > Maybe devices that become disassociated from a cache due to IO errors > > >> > > but have dirty data should go to a caching mode "stale", and bcache > > >> > > should refuse to access such devices or throw away their dirty data > > >> > > until I decide to force them back online into the cache set or force > > >> > > discard the dirty data. 
> > >> > > Maybe devices that become disassociated from a cache due to IO
> > >> > > errors but still have dirty data should go to a caching mode
> > >> > > "stale", and bcache should refuse to access such devices or throw
> > >> > > away their dirty data until I decide to force them back online into
> > >> > > the cache set or force-discard the dirty data. Then at least I
> > >> > > would discover that something went badly wrong. Otherwise, I may
> > >> > > not detect that dirty data wasn't written. In the best case, that
> > >> > > makes my FS unmountable; in the worst case, some file data is
> > >> > > simply lost (aka silent data loss) - though both situations are
> > >> > > worst-case scenarios anyway.
> > >> > >
> > >> > > The whole situation probably comes from udev auto-registering
> > >> > > bcache backing devices again, while bcache has no record of why the
> > >> > > device was unregistered - it looks clean after such a situation.
> > >>
> > >> [...]
> > >>
> > >> > I think we hit this same issue from 2021. Here is that original
> > >> > thread from 2021:
> > >> > https://lore.kernel.org/all/2662a21d-8f12-186a-e632-964ac7bae72d@xxxxxxx/T/#m5a6cc34a043ecedaeb9469ec9d218e084ffec0de
> > >> >
> > >> > Kai, did you end up with a good patch for this? We are running a 5.15
> > >> > kernel with the many backported bcache commits that Coly suggested here:
> > >> > https://www.spinics.net/lists/linux-bcache/msg12084.html
> > >>
> > >> I'm currently running 6.1 with bcache on mdraid1 and device-level
> > >> write caching disabled. I didn't see this ever occur again.
> > >
> > >Awesome, good to know.
> > >
> > >> But as written above, I had bad RAM, and I have meanwhile upgraded to
> > >> kernel 6.1, and have had no issues with bcache since, even on power
> > >> loss.
> > >>
> > >> > Coly, is there already a patch to prevent complete dirty cache loss?
> > >>
> > >> This is probably still an issue. The cache attachment MUST NEVER EVER
> > >> automatically degrade to "none", which is what it did in the fail cases
> > >> I had back then. I don't know if this has changed meanwhile.
> > >
> > >I would rather that bcache went to a read-only mode in failure
> > >conditions like this. Maybe write-around would be acceptable, since
> > >bcache returns -EIO for any failed dirty cache reads. But if the cache
> > >is dirty and it gets an error, it _must_never_ read from the bdev, which
> > >is what appears to happen now.
> > >
> > >Coly, Mingzhe, would this be an easy change?
> >
> > First of all, we have never had this problem. We have had an nvme
> > controller failure, but at that point the cache could not be read or
> > written, so even unregister would not succeed.
> >
> > Coly once replied like this:
> >
> > """
> > There is an option to panic the system when the cache device fails. It
> > is in the errors file, with available options "unregister" and "panic".
> > This option defaults to "unregister"; if you set it to "panic", then
> > panic() will be called.
> > """
> >
> > I think "panic" is a better way to handle this situation. If the cache
> > returns an error, there may be more unknown errors if the operation
> > continues.
>
> Depending on how the block devices are stacked, the OS can continue if
> bcache fails (e.g., bcache under raid1, drbd, etc.). Returning IO requests
> with -EIO or setting bcache read-only would be better, because a panic
> would crash services that could otherwise proceed without noticing the
> bcache outage.
>
> If bcache has a critical failure, I would rather that it fail the IOs so
> upper layers in the block stack can compensate.
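For reference, the error action Coly and Mingzhe mention above is a
per-cache-set sysfs tunable. Assuming a registered cache set (the UUID below
is a placeholder), it can be inspected and changed at runtime:

	# currently accepted values are "unregister" and "panic";
	# the active one is shown in brackets
	cat /sys/fs/bcache/<set-uuid>/errors

	echo panic > /sys/fs/bcache/<set-uuid>/errors

A "readonly" value, as proposed below, would be a third entry in this list.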
> What if we extend /sys/fs/bcache/<uuid>/errors to include a "readonly"
> option, and make that the default setting? The gendisk(s) for the related
> /dev/bcacheX devices can be flagged BLKROSET in the error handler:
> https://patchwork.kernel.org/project/dm-devel/patch/20201129181926.897775-2-hch@xxxxxx/
>
> This would protect the data and keep the host online.
>
> --
> Eric Wheeler
>
> > >Here are the relevant bits:
> > >
> > >The allocator called btree_mergesort, which called bch_extent_invalid:
> > >	https://elixir.bootlin.com/linux/latest/source/drivers/md/bcache/extents.c#L480
> > >
> > >which called the `cache_bug` macro, which triggered bch_cache_set_error:
> > >	https://elixir.bootlin.com/linux/v6.5/source/drivers/md/bcache/super.c#L1626
> > >
> > >It then calls `bch_cache_set_unregister`, which shuts down the cache:
> > >	https://elixir.bootlin.com/linux/v6.5/source/drivers/md/bcache/super.c#L1845
> > >
> > >	bool bch_cache_set_error(struct cache_set *c, const char *fmt, ...)
> > >	{
> > >		...
> > >		bch_cache_set_unregister(c);
> > >		return true;
> > >	}
> > >
> > >Proposed solution:
> > >
> > >What if, instead of bch_cache_set_unregister(), this were called instead:
> > >	SET_BDEV_CACHE_MODE(&c->cache->sb, CACHE_MODE_WRITEAROUND)
> >
> > If cache_mode can be automatically modified, when will it be restored
> > to writeback? I think we need to be able to enable or disable this.
> >
> > >This would bypass the cache for future writes, and allow reads to
> > >proceed if possible, returning -EIO otherwise to let upper layers handle
> > >the failure.
> > >
> > >What do you think?
> >
> > If we switch to writearound mode, how do we ensure that the IO is
> > read-only? Writing IO may require invalidating dirty data. If the backing
> > write succeeds but the invalidate fails, how should we handle it?
> >
> > Maybe "panic" could be the default option. What do you think?
> >
> > >> But because bcache explicitly does not honor write barriers from
> > >> upstream writes for its own writeback (which is okay because it
> > >> guarantees to write back all data anyway and give a consistent view to
> > >> the upstream FS - well, unless it has to handle write errors), the
> > >> backed filesystem is guaranteed to be effed up in that case, and
> > >> allowing it to mount and write because bcache silently has fallen back
> > >> to "none" will only make matters worse.
> > >>
> > >> (HINT: I never used drbd personally, most of the following is
> > >> theoretical thinking without real-world experience)
> > >>
> > >> I see that you're using drbd? Did it fail due to networking issues?
> > >> I'm pretty sure it should be robust in that case, but maybe bcache
> > >> cannot handle the situation? Does drbd have a write log to replay
> > >> writes after network connection loss? It looks like it doesn't, and
> > >> thus bcache exploded.
> > >
> > >DRBD is _above_ bcache, not below it. In this case, DRBD hung because
> > >bcache hung, not the other way around, so DRBD is not the issue here.
> > >Here is our stack:
> > >
> > >bcache:
> > >	bdev:     /dev/sda hardware RAID5
> > >	cachedev: LVM volume from /dev/md0, which is /dev/nvme{0,1} RAID1
> > >
> > >And then bcache is stacked like so:
> > >
> > >	bcache <- dm-thin <- DRBD <- dm-crypt <- KVM
> > >	                      |
> > >	                      v
> > >	                [remote host]
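To make the "readonly" idea above concrete, here is a rough, untested sketch
of what such an error action might look like. The function name is invented,
the field names come from a casual reading of drivers/md/bcache around v6.5,
and locking is omitted, so treat it as pseudocode rather than a patch:

	/* Hypothetical alternative to bch_cache_set_unregister(): keep the
	 * cache set registered, but stop trusting it for writes and refuse
	 * new writes at the block layer (the in-kernel equivalent of
	 * BLKROSET). Intended to live in drivers/md/bcache/super.c so the
	 * bcache types and macros are in scope. */
	static void bch_cache_set_readonly(struct cache_set *c)
	{
		struct cached_dev *dc;

		/* walk every backing device attached to this cache set */
		list_for_each_entry(dc, &c->cached_devs, list) {
			/* flag /dev/bcacheX read-only so upper layers see an
			 * error instead of silently hitting a stale bdev */
			set_disk_ro(dc->disk.disk, true);

			/* bypass the failing cache for anything that still
			 * gets through */
			SET_BDEV_CACHE_MODE(&dc->sb, CACHE_MODE_WRITEAROUND);
		}

		pr_err("bcache: cache set error, devices flagged read-only\n");
	}

Whether dirty keys can still be served for reads after such an error, as
Mingzhe asks above, is the open question this sketch does not answer.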
> > >> Anyways, since your backing device seems to be on drbd, using metadata
> > >> allocation hinting is probably not an option. You could of course still
> > >> use drbd with bcache for metadata-hinted partitions, and then use
> > >> writearound caching only for that. At least, in the fail case, your
> > >> btrfs won't be destroyed. But your data chunks may have unreadable
> > >> files then. But it should be easy to select them and restore from
> > >> backup individually. Btrfs is very robust for that fail case: if
> > >> metadata is okay, data errors are properly detected and handled. If
> > >> you're not using btrfs, all of this doesn't apply, of course.
> > >>
> > >> I'm not sure if write-back caching for drbd backing is a wise decision
> > >> anyway. drbd is slow for writes, that's part of the design (and no
> > >> writeback caching could fix that).
> > >
> > >Bcache-backed DRBD provides a noticeable difference, especially with a
> > >10GbE link (or faster) and the same disk stack on both sides.
> > >
> > >> I would not rely on bcache writeback to fix that for you, because it is
> > >> not prepared for storage that may be temporarily unavailable;
> > >
> > >True, which is why we put drbd /on top/ of bcache, so bcache is unaware
> > >of DRBD's existence.
> > >
> > >> iow, it would freeze and continue when drbd is available again. I think
> > >> you should really use writearound/writethrough so your FS can be sure
> > >> data has been written, replicated and persisted. In the case of btrfs,
> > >> you could still split data and metadata as written above, and use
> > >> writeback for data, but reliable writes for metadata.
> > >>
> > >> So, concluding:
> > >>
> > >> 1. I'm now persisting metadata directly to disk with no intermediate
> > >> layers (no bcache, no md).
> > >>
> > >> 2. I'm using allocation-hinted data-only partitions with bcache
> > >> write-back, with bcache on mdraid1. If anything goes wrong, I have
> > >> file crc errors in btrfs files only, but the filesystem itself is
> > >> valid because no metadata is broken or lost. I have snapshots of
> > >> recently modified files. I have daily backups.
> > >>
> > >> 3. Your problem is that bcache can - by design - detect write errors
> > >> only when it's too late, with no chance of telling the filesystem. In
> > >> that case, writethrough/writearound is the correct choice.
> > >>
> > >> 4. Maybe bcache should know if the backing device is on storage that
> > >> may be temporarily unavailable and then freeze until the backing
> > >> storage is back online, similar to how iSCSI handles that.
> > >
> > >I don't think "temporarily unavailable" should be bcache's burden, as
> > >bcache is a local-only solution. If someone is using iSCSI under bcache,
> > >then good luck ;)
> > >
> > >> But otoh, maybe drbd should freeze until the replicated storage is
> > >> available again while writing (from what I've read, it's designed not
> > >> to do that but to let local storage get ahead of the replica, which is
> > >> btw incompatible with bcache-writeback assumptions).
> > >
> > >N/A for this thread, but FYI: DRBD will wait (hang) if it is disconnected
> > >and has no local copy for some reason. If local storage is available, it
> > >will use that and resync when its peer comes up.
> > >
> > >> Or maybe using async mirroring can fix this for you, but then the
> > >> mirror will be compromised if a hardware failure immediately follows a
> > >> previous drbd network connection loss. But it may still be an issue
> > >> with the local hardware (bit-flips etc.) because maybe just bcache
> > >> internals broke - Coly may have a better idea of that.
> > >
> > >This isn't DRBD's fault since it is above bcache. I only wish to address
> > >the bcache cache=none issue.
> > >
> > >-Eric
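Kai's point 3 above is actionable today without any kernel change: the cache
mode is a per-backing-device sysfs knob, so anything that cannot tolerate
lost writeback data can be pinned to writethrough or writearound. Assuming
/dev/bcache0 as an example device:

	# the active mode is shown in brackets
	cat /sys/block/bcache0/bcache/cache_mode

	# stop acknowledging writes before they reach the backing device
	echo writethrough > /sys/block/bcache0/bcache/cache_mode

The change is written to the backing device superblock, so it should persist
across reboots for that device.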
> > >> I think your main issue here is that bcache decouples write barriers
> > >> from the underlying backing storage - and you should just not use
> > >> writeback; it is incompatible by design with how drbd works: your
> > >> replica will be broken when you need it.
> > >>
> > >> > Here is our trace:
> > >> >
> > >> > [Sep 6 13:01] bcache: bch_cache_set_error() error on
> > >> > a3292185-39ff-4f67-bec7-0f738d3cc28a: spotted extent
> > >> > 829560:7447835265109722923 len 26330 -> [0:112365451 gen 48,
> > >> > 0:1163806048 gen 3: bad, length too big, disabling caching
> > >> >
> > >> > [ +0.001940] CPU: 12 PID: 2435752 Comm: kworker/12:0 Kdump: loaded Not tainted 5.15.0-7.86.6.1.el9uek.x86_64-TEST+ #7
> > >> > [ +0.000301] Hardware name: Supermicro Super Server/H11SSL-i, BIOS 2.4 12/27/2021
> > >> > [ +0.000809] Workqueue: bcache bch_data_insert_keys
> > >> > [ +0.000826] Call Trace:
> > >> > [ +0.000797]  <TASK>
> > >> > [ +0.000006]  dump_stack_lvl+0x57/0x7e
> > >> > [ +0.000755]  bch_extent_invalid.cold+0x9/0x10
> > >> > [ +0.000759]  btree_mergesort+0x27e/0x36e
> > >> > [ +0.000005]  ? bch_cache_allocator_start+0x50/0x50
> > >> > [ +0.000009]  __btree_sort+0xa4/0x1e9
> > >> > [ +0.000109]  bch_btree_sort_partial+0xbc/0x14d
> > >> > [ +0.000836]  bch_btree_init_next+0x39/0xb6
> > >> > [ +0.000004]  bch_btree_insert_node+0x26e/0x2d3
> > >> > [ +0.000863]  btree_insert_fn+0x20/0x48
> > >> > [ +0.000864]  bch_btree_map_nodes_recurse+0x111/0x1a7
> > >> > [ +0.004270]  ? bch_btree_insert_check_key+0x1f0/0x1e1
> > >> > [ +0.000850]  __bch_btree_map_nodes+0x1e0/0x1fb
> > >> > [ +0.000858]  ? bch_btree_insert_check_key+0x1f0/0x1e1
> > >> > [ +0.000848]  bch_btree_insert+0x102/0x188
> > >> > [ +0.000844]  ? do_wait_intr_irq+0xb0/0xaf
> > >> > [ +0.000857]  bch_data_insert_keys+0x39/0xde
> > >> > [ +0.000845]  process_one_work+0x280/0x5cf
> > >> > [ +0.000858]  worker_thread+0x52/0x3bd
> > >> > [ +0.000851]  ? process_one_work.cold+0x52/0x51
> > >> > [ +0.000877]  kthread+0x13e/0x15b
> > >> > [ +0.000858]  ? set_kthread_struct+0x60/0x52
> > >> > [ +0.000854]  ret_from_fork+0x22/0x2d
> > >> > [ +0.000854]  </TASK>
> > >>
> > >> Regards,
> > >> Kai