Re: Re: Dirty data loss after cache disk error recovery



On Tue, Oct 17, 2023 at 1:39 AM, Eric Wheeler wrote:
> On Wed, 11 Oct 2023, Kai Krakow wrote:
> > After a reboot it worked again, but of course there were still bad
> > blocks because bcache did writeback, so no blocks had been replaced
> > by the btrfs auto-repair-on-read feature. This time, the system
> > handled the situation a bit better, but files became inaccessible in
> > the middle of being written, which destroyed my Plasma desktop
> > configuration and Chrome profile (I restored them from the last
> > snapper snapshot successfully). Essentially, the file system was in
> > a read-only-like state: most requests failed with IO errors even
> > though btrfs didn't switch to read-only. Something is messed up in
> > the error path of userspace -> bcache -> btrfs -> device. Also,
> > btrfs was seeing the
> Do you mean userspace -> btrfs -> bcache -> device

Ehm.. Yes...

> > device somewhere in the limbo of not existing and not working - it
> > still tried to access it while bcache claimed the backing device was
> > missing. To me this looks like bcache error handling may need some
> > fine-tuning - it should not fail in that way, especially not with
> > btrfs-raid, but the system was still seeing IO errors and broken
> > files in the middle of writes.
> >
> > "bcache show" showed the backing device missing while "btrfs dev show"
> > was still seeing the attached bcache device, and the system threw IO
> > errors to user-space despite btrfs still having a valid copy of the
> > blocks.
> >
> > I've rebooted and have now switched the bad device from bcache
> > writeback to bcache none - and guess what: the system runs stably
> > now, and btrfs auto-repair does its thing. The above-mentioned
> > behavior (IO errors in user-space) no longer occurs. A final scrub
> > across the bad device repaired the bad blocks; I currently do not
> > see any more problems.
> >
> > It's probably better to replace that device, but this also shows
> > that switching bcache to "none" (or at least "writethrough") if the
> > backing device fails may be a better choice than doing some other
> > error handling. Alternatively, bcache should have been able to make
> > btrfs see the device as missing (which obviously did not happen).
> Noted.  Did bcache actually detach its cache in the failure scenario
> you describe?

It still seemed attached but was marked as "missing" by the bcache CLI tool.

> > Of course, if the cache device fails we have a completely different
> > situation. I'm not sure which situation Eric was seeing (I think the
> > caching device failed) but for me, the backing device failed - and
> > with bcache involved, the result was very unexpected.
> Ahh, so you are saying the cache continued to service requests even though
> the bdev was offline?  Was the bdev completely "unplugged" or was it just
> having IO errors?

smartctl was still seeing the device, so I think it "just" had IO errors.

> > So we probably need at least two error handlers: Handling caching
> > device errors, and handling backing device errors (for which bcache
> > doesn't currently seem to have a setting).
> I think it tries to write to the cache if the bdev dies.  Dirty or cached
> blocks are read from cache, and other IOs are passed to the bdev, which
> may end up returning an EIO.

Hmm, yes that makes sense... But it seems to confuse user-space a lot.

Except that in writeback mode, it won't (and cannot) return errors to
user-space, although the writes eventually fail later and the data does
not persist. So it may be better to turn writeback off as soon as bdev
IO errors are found, or to trigger an immediate writeback by temporarily
setting writeback_percent to 0. Usually, HDDs support self-healing -
which didn't work in this case because of the delayed writeback. After I
switched to "none", it worked. After some more experimenting, it looks
like even "writethrough" may lag behind and not bubble bdev IO errors
back up to user-space (or maybe it was due to writeback_percent=0; the
errors are gone, so I can no longer reproduce it). I would expect it to
do exactly that, though. I didn't test "writearound".
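For reference, the knobs I'm talking about live in the bcache sysfs tree;
a minimal sketch of the workaround, assuming the device is bcache0:

```shell
# Sketch only: paths assume the affected device is bcache0; adjust to your setup.

# Stop caching writes once the backing device starts throwing IO errors:
echo none > /sys/block/bcache0/bcache/cache_mode

# Alternatively, keep writeback mode but flush dirty data out as fast as
# possible by removing the dirty-data threshold:
echo 0 > /sys/block/bcache0/bcache/writeback_percent

# Watch how much dirty data is still waiting to be written back:
cat /sys/block/bcache0/bcache/dirty_data
```

Either way, the goal is the same: get the dirty blocks out of the cache
and onto the backing device (or stop buffering them) so errors surface
where btrfs can see them.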

Also, it looks like a failed delayed write of writeback dirty data may
not be retried by bcache. At least, I needed to run "btrfs scrub" with
bcache mode "none" to make it work properly and let the HDD heal
itself. OTOH, the HDD probably didn't fail writes but reads (except
when the situation got completely messed up and even writes returned
IO errors - but maybe btrfs was involved there).
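The repair step above can be sketched as follows; /mnt/data stands in
for wherever the btrfs filesystem is mounted:

```shell
# Sketch only: /mnt/data is a placeholder mount point for the btrfs filesystem.

# With the bad device's cache_mode set to "none", scrub so btrfs re-reads
# every block and rewrites bad copies from the good raid mirror:
btrfs scrub start -Bd /mnt/data   # -B: run in foreground, -d: per-device stats

# Afterwards, confirm the per-device error counters have stopped growing:
btrfs device stats /mnt/data
```

The rewrite triggered by the scrub is also what gives the HDD a chance
to remap the bad sectors, i.e. the "self-healing" mentioned above.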

BTW: The failed HDD has run fine for a few days now; I even switched
writeback on again. It properly healed itself. But still, it's time to
swap it sooner rather than later.

>  Coly, is this correct?
> -Eric

