Re: Dirty data loss after cache disk error recovery

Kai Krakow <kai@xxxxxxxxxxx> · Wed, 28 Apr 2021 20:30:59 +0200

Hello!

Am Di., 20. Apr. 2021 um 05:24 Uhr schrieb 吴本卿(云桌面 福州)
<wubenqing@xxxxxxxxxxxxx>:
>
> Hi, Recently I found a problem in the process of using bcache. My cache disk was offline for some reasons. When the cache disk was back online, I found that the backend in the detached state. I tried to attach the backend to the bcache again, and found that the dirty data was lost. The md5 value of the same file on backend's filesystem is different because dirty data loss.
>
> I checked the log and found that logs:
> [12228.642630] bcache: conditional_stop_bcache_device() stop_when_cache_set_failed of bcache0 is "auto" and cache is dirty, stop it to avoid potential data corruption.

"stop it to avoid potential data corruption" is not what it actually
does: neither it stops it, nor it prevents corruption because dirty
data becomes thrown away.

> [12228.644072] bcache: cached_dev_detach_finish() Caching disabled for sdb
> [12228.644352] bcache: cache_set_free() Cache set 55b9112d-d52b-4e15-aa93-e7d5ccfcac37 unregistered
>
> I checked the code of bcache and found that a cache disk IO error will trigger __cache_set_unregister, which will cause the backend to be datach, which also causes the loss of dirty data. Because after the backend is reattached, the allocated bcache_device->id is incremented, and the bkey that points to the dirty data stores the old id.
>
> Is there a way to avoid this problem, such as providing users with options, if a cache disk error occurs, execute the stop process instead of detach.
> I tried to increase cache_set->io_error_limit, in order to win the time to execute stop cache_set.
> echo 4294967295 > /sys/fs/bcache/55b9112d-d52b-4e15-aa93-e7d5ccfcac37/io_error_limit
>
> It did not work at that time, because in addition to bch_count_io_errors, which calls bch_cache_set_error, there are other code paths that also call bch_cache_set_error. For example, an io error occurs in the journal:
> Apr 19 05:50:18 localhost.localdomain kernel: bcache: bch_cache_set_error() bcache: error on 55b9112d-d52b-4e15-aa93-e7d5ccfcac37:
> Apr 19 05:50:18 localhost.localdomain kernel: journal io error
> Apr 19 05:50:18 localhost.localdomain kernel: bcache: bch_cache_set_error() , disabling caching
> Apr 19 05:50:18 localhost.localdomain kernel: bcache: conditional_stop_bcache_device() stop_when_cache_set_failed of bcache0 is "auto" and cache is dirty, stop it to avoid potential data corruption.
>
> When an error occurs in the cache device, why is it designed to unregister the cache_set? What is the original intention? The unregister operation means that all backend relationships are deleted, which will result in the loss of dirty data.
> Is it possible to provide users with a choice to stop the cache_set instead of unregistering it.

I think the same problem hit me, too, last night.

My kernel choked because of a GPU error, and that somehow disconnected
the cache. I can only guess that there was some sort of timeout due to
blocked queues, and that introduced an IO error which detached the
caches.

Sadly, I only realized this after I already reformatted and started
restore from backup: During the restore I watched the bcache status
and found that the devices are not attached.

I don't know if I could have re-attached the devices instead of
formatting. But I think the dirty data would have been discarded
anyways due to incrementing bcache_device->id.

This really needs a better solution, detaching is one of the worst,
especially on btrfs this has catastrophic consequences because data is
not updated inline but via copy on write. This requires updating a lot
of pointers. Usually, cow filesystem would be robust to this kind of
data-loss but the vast amount of dirty data that is lost puts the tree
generations too far behind of what btrfs is expecting, making it
essentially broken beyond repair. If some trees in the FS are just a
few generations behind, btrfs can repair itself by using a backup tree
root, but when the bcache is lost, generation numbers usually lag
behind several hundred generations. Detaching would be fine if there'd
be no dirty data - otherwise the device should probably stop and
refuse any more IO.

@Coly If I patched the source to stop instead of detach, would it have
made anything better? Would there be any side-effects? Is it possible
to atomically check for dirty data in that case and take either the
one or the other action?

Thanks,
Kai