On 11/07/21 09:53, BW wrote:
> I do believe I understand the timeout mismatch issue, at least at an
> overall level.

The mere fact you asked "why *doesn't* mdadm kick the disk out of the
array?" proves you don't understand, sorry. Because mdadm *does* kick
the disk out of the array.

> But it's true I don't understand the failfast patch and its influence
> on the timeout mismatch, if any?

I don't know that patch at all ...

> In regards to the timeout mismatch issue at a more detailed level (not
> covered "anywhere"):
> In case of a timeout mismatch the kernel driver resets the disk (or the
> whole controller at worst) and that alone is not the problem.

The kernel driver *doesn't* reset the disk, because it *can't*. It may
try ... (sending a reset command does absolutely nothing if the disk is
not listening!)

> The problem is mdadm still sees the drive as healthy/an active
> array member and perhaps starts to read/write to it again when it
> returns after a reset.

No. The problem is that mdadm *does not know* whether the drive is
healthy or not, and tries to write to it BEFORE it returns after a
kernel reset. In the case of a shingled drive, the drive is almost
certainly healthy, just "doing its own thing".

> And waiting up to 180 seconds, depending on config, is unnecessary
> and is an issue on its own (some might do a system reboot or reset
> something, not understanding what's going on). A service freezing for
> up to 2 minutes is a looooong time.

The problem is that the DISK has frozen. That 180-second config is
KERNEL config, nothing whatsoever to do with mdadm. It tells the kernel
how long to wait for a disk that has apparently frozen.

> What I'm saying is, if mdadm hasn't heard from the drive within 25
> seconds it should just "pretend" it got an error code, just as it would
> if the drive supported/had SCTERC enabled, and act on that (kick the
> drive out of the array/mark it failed).
> There might be a good reason for that; actually I find it strange if
> that's not the case, because the above is "too" obvious.

You are describing - sort of - what REALLY DOES happen, and what you
say "should happen" is exactly what "does happen and *should not*",
because that is exactly what does the damage!

The CORRECT sequence of events when something goes wrong is:

1 - raid tries to read/write to the disk.
2 - the disk times out, and the kernel recognises the failure.
3 - raid recalculates the data, rewrites it to the disk, and everything
    goes merrily on its way.

The WRONG sequence of events (which is what you are saying *should*
happen) is:

1 - raid tries to read/write to the disk.
2 - the kernel times out and returns a failure to raid.
3 - raid recalculates the data, rewrites it to the disk, AND THE DISK
    IS STILL FROZEN AND DOES NOT RESPOND.
4 - the disk gets kicked.

What you are describing as the sequence you think SHOULD happen is
exactly the sequence that DOES happen and TRASHES THE ARRAY.

Oh - and by the way, let's take raid out of the picture here. This
problem is real and it affects ALL systems, not just ones with raid.
It's just that the consequences for raid are rather more dramatic, as
one disk having problems will bring down an array like dominoes, rather
than just bringing down the file system on that disk. It will trash
btrfs or ZFS or any other setup just as effectively as it trashes
md-raid.

Cheers,
Wol
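To make the two sequences above concrete, here is a toy sketch. It is not from the original mail and is not actual md or kernel code; the function name `handle_bad_sector` and the example numbers (7 s SCT ERC, 30 s kernel timeout, ~120 s of internal drive recovery) are illustrative assumptions only. It only shows why the ordering of the two timeouts decides whether the array survives a bad sector.

```python
# Toy illustration (an assumption-laden sketch, not md/kernel code):
# the outcome depends on which timeout fires first.

KERNEL_TIMEOUT = 30  # seconds the kernel waits before giving up and resetting


def handle_bad_sector(drive_gives_up_after: float) -> str:
    """drive_gives_up_after: how long the drive spends before reporting the
    read error itself (a few seconds with SCT ERC, minutes without)."""
    if drive_gives_up_after <= KERNEL_TIMEOUT:
        # CORRECT sequence: the drive reports the failure while the kernel is
        # still waiting; md recomputes the data from the other disks and
        # rewrites it; the drive is listening, so the rewrite succeeds.
        return "error reported -> md rewrites from redundancy -> array OK"
    # WRONG sequence: the kernel gives up first while the drive is still busy
    # retrying internally; md's rewrite also hits the frozen drive, fails,
    # and the drive gets kicked out of the array.
    return "kernel timeout/reset -> rewrite hits frozen drive -> disk kicked"


print(handle_bad_sector(7))    # drive with SCT ERC set to 7 seconds
print(handle_bad_sector(120))  # desktop drive retrying internally for minutes
```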
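For anyone who wants to check their own drives for the mismatch discussed above, the following is a minimal sketch, not a definitive tool. It assumes the Linux sysfs path `/sys/block/<dev>/device/timeout` (value in seconds) and that `smartctl -l scterc` is installed and prints a line like `Read: 70 (7.0 seconds)`; the parsing of that output is an approximation and may need adjusting for your smartmontools version.

```python
#!/usr/bin/env python3
"""Sketch: compare the kernel's per-device command timeout with the drive's
SCT ERC setting to spot a timeout mismatch. Paths and smartctl output
format are assumptions; run as root and adjust for your system."""
import re
import subprocess
import sys
from typing import Optional


def kernel_timeout_seconds(dev: str) -> int:
    """Read the kernel SCSI command timeout (seconds) for e.g. dev='sda'."""
    with open(f"/sys/block/{dev}/device/timeout") as f:
        return int(f.read().strip())


def scterc_deciseconds(dev: str) -> Optional[int]:
    """Ask smartctl for the drive's SCT ERC read timeout (tenths of a second).
    Returns None if SCT ERC is unsupported, disabled, or unparseable."""
    out = subprocess.run(
        ["smartctl", "-l", "scterc", f"/dev/{dev}"],
        capture_output=True, text=True, check=False,
    ).stdout
    m = re.search(r"Read:\s+(\d+)\s+\(", out)  # e.g. "Read: 70 (7.0 seconds)"
    return int(m.group(1)) if m else None


if __name__ == "__main__":
    dev = sys.argv[1] if len(sys.argv) > 1 else "sda"
    kto = kernel_timeout_seconds(dev)
    erc = scterc_deciseconds(dev)
    if erc is None:
        # No SCT ERC: the drive may retry internally for minutes, so the
        # kernel timeout needs raising (e.g. to 180s) to avoid the mismatch.
        print(f"{dev}: no SCT ERC reported; kernel timeout is {kto}s")
    elif erc / 10 < kto:
        print(f"{dev}: OK - drive gives up after {erc / 10}s, kernel waits {kto}s")
    else:
        print(f"{dev}: mismatch - drive ERC {erc / 10}s >= kernel timeout {kto}s")
```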