On 4/3/19 9:18 PM, Dan Williams wrote:
On Thu, Mar 28, 2019 at 9:57 AM Song Liu <liu.song.a23@xxxxxxxxx> wrote:
On Thu, Mar 28, 2019 at 5:16 AM Nigel Croxon <ncroxon@xxxxxxxxxx> wrote:
On 3/26/19 7:33 PM, Song Liu wrote:
On Tue, Mar 26, 2019 at 9:08 AM Nigel Croxon <ncroxon@xxxxxxxxxx> wrote:
Changing state from check_state_check_result to
check_state_compute_result is not only unsafe but also doesn't
appear to serve a valid purpose. A raid6 check should only be
pushing out extra writes if doing repair and a mismatch occurs.
The stripe dev management will already try to do repair writes
for failing sectors.
Could you please give more details about the failure case seen
before this patch?
Thanks,
Song
Reproduced the issue with a modified scsi_debug module. Redefined OPT_MEDIUM_ERR_ADDR to 12000 to move the I/O error to where I wanted it on the disk. I loaded the modified scsi_debug with the parameter dev_size_mb=250 to make the device bigger and partitioned it to build a test raid6 device from:
Device     Boot   Start     End  Blocks  Id  System
/dev/sde1           2048   94207   46080  83  Linux
/dev/sde2          94208  186367   46080  83  Linux
/dev/sde3         186368  278527   46080  83  Linux
/dev/sde4         278528  511999  116736   5  Extended
/dev/sde5         280576  372735   46080  83  Linux
/dev/sde6         374784  466943   46080  83  Linux
The raid6 device was created with:
mdadm --create /dev/md127 --level=6 --raid-devices=5 /dev/sde1 /dev/sde2 /dev/sde3 /dev/sde5 /dev/sde6
With the array initially running, a check can be triggered and it works without issue. The medium error feature of scsi_debug can then be turned on with:
echo "2" >/sys/module/scsi_debug/parameters/opts
Sector 12000, which is part of sde1 and holds the q parity information for its stripe of the raid6 array, will now fail to read. Running a new check operation crashes the system at the same BUG_ON seen by the customer.
This patch makes the raid6 check_state_check_result handling
work more like raid5's. If there are somehow too many failures
for a check, just quit the check operation for the stripe. When
any checks pass, don't try to use check_state_compute_result for
a purpose it isn't needed for and where it is unsafe. Just mark
the stripe as in sync for passing its parity checks and let the
stripe dev read/write code and the bad blocks list do their job
of handling I/O errors.
+ Dan Williams. But I am not sure whether Dan still remembers this
work. :)
Dan, please feel free to ignore this. :)
Thanks Nigel and Xiao for these details. I will process the patch.
Regards,
Song
OriginalAuthor: David Jeffery <djeffery@xxxxxxxxxx>
Tested-by: David Jeffery <djeffery@xxxxxxxxxx>
Signed-off-by: David Jeffery <djeffery@xxxxxxxxxx>
Signed-off-by: Nigel Croxon <ncroxon@xxxxxxxxxx>
---
drivers/md/raid5.c | 19 ++++---------------
1 file changed, 4 insertions(+), 15 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c033bfcb209e..4204a972ecf2 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4223,26 +4223,15 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh,
         case check_state_check_result:
                 sh->check_state = check_state_idle;
 
+                if (s->failed > 1)
+                        break;
                 /* handle a successful check operation, if parity is correct
                  * we are done. Otherwise update the mismatch count and repair
                  * parity if !MD_RECOVERY_CHECK
                  */
                 if (sh->ops.zero_sum_result == 0) {
-                        /* both parities are correct */
-                        if (!s->failed)
-                                set_bit(STRIPE_INSYNC, &sh->state);
-                        else {
-                                /* in contrast to the raid5 case we can validate
-                                 * parity, but still have a failure to write
-                                 * back
-                                 */
-                                sh->check_state = check_state_compute_result;
-                                /* Returning at this point means that we may go
-                                 * off and bring p and/or q uptodate again so
-                                 * we make sure to check zero_sum_result again
-                                 * to verify if p or q need writeback
-                                 */
-                        }
+                        /* Any parity checked was correct */
+                        set_bit(STRIPE_INSYNC, &sh->state);
The details of why I did it this way have left my brain, but the
changelog does not address the concern raised in the comments about
needing to write back the computed result.
It also seems like a hack to just jump straight to the check-state
being idle without performing any writeback. My concern is that the
stripe is brought in-sync in memory, but not written to the on-disk
stripe.
Can you clarify which BUG_ON is being triggered?
The customer's systems were using 8-disk RAID6 arrays. The crashes
occurred with one failing disk while a scrub was running with the
"check" action. After the parity information for a stripe had been
checked for the scrubbing operation, the uptodate value was expected to
be 7 or 8 (at most 1 disk had failed), but the uptodate value was 6,
which triggered the BUG_ON(s->uptodate < disks - 1).
crash> struct stripe_head_state ffff8aba77affbf0
struct stripe_head_state {
syncing = 1,
...
uptodate = 6,
to_read = 0,
to_write = 0,
failed = 1,
...
For a stripe read into memory with a single failed disk, the failed
counter was 1 as expected, but the uptodate counter didn't consider all
7 healthy disks as uptodate. Looking at the individual disk states for
the stripe, 6 of the 8 devices had an r5dev flags value of 17, or
R5_Insync | R5_UPTODATE. The remaining working disk had a flags value of
only 16, or R5_Insync; the R5_UPTODATE flag was not set.
crash> p ((struct stripe_head *)0xffff8a9c83273280)->dev[5]
$8 = {
...
sector = 0,
flags = 16,
log_checksum = 0
}
This would appear to be why the uptodate count was lower than expected.
The last disk was the failed device, with flags R5_ReadNoMerge |
R5_ReadError | R5_ReWrite.
crash> p ((struct stripe_head *)0xffff8a9c83273280)->dev[6]
$9 = {
...
sector = 0,
flags = 1792,
log_checksum = 0
}
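For reference, here is a minimal stand-alone decoder for the flags values quoted above. The bit positions used (R5_UPTODATE = bit 0, R5_Insync = bit 4, R5_ReadNoMerge/R5_ReadError/R5_ReWrite = bits 8-10) are assumed from the values reported in this thread rather than quoted from the kernel's enum r5dev_flags:

#include <stdio.h>

/* Hypothetical stand-alone decoder for the r5dev flags values seen in the
 * crash dumps above. Bit positions are assumed from the values in this
 * thread: 17 = R5_Insync | R5_UPTODATE, 1792 = R5_ReadNoMerge |
 * R5_ReadError | R5_ReWrite.
 */
static const struct { unsigned bit; const char *name; } r5dev_bits[] = {
        {  0, "R5_UPTODATE"    },       /* page contains current data */
        {  4, "R5_Insync"      },       /* member was in sync at start */
        {  8, "R5_ReadNoMerge" },
        {  9, "R5_ReadError"   },
        { 10, "R5_ReWrite"     },
};

static void decode_r5dev_flags(unsigned long flags)
{
        printf("flags = %lu:", flags);
        for (size_t i = 0; i < sizeof(r5dev_bits) / sizeof(r5dev_bits[0]); i++)
                if (flags & (1UL << r5dev_bits[i].bit))
                        printf(" %s", r5dev_bits[i].name);
        printf("\n");
}

int main(void)
{
        decode_r5dev_flags(17);   /* the healthy, uptodate members */
        decode_r5dev_flags(16);   /* the "p" device: in sync but not uptodate */
        decode_r5dev_flags(1792); /* the failed "q" device */
        return 0;
}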
These two non-uptodate disks held the parity information for this RAID6
stripe.
crash> struct stripe_head ffff8a9c83273280
struct stripe_head {
...
pd_idx = 5,
qd_idx = 6,
...
The normal data disks of the stripe were as expected and the "q" parity
device had failed, but the "p" parity device was the problem: it was not
marked uptodate even though it was considered healthy.
The "p" parity disk uses the same parity format as the RAID5 code. When
the "p" disk is available but the "q" disk is failed, the RAID6 code
does not use the RAID6-specific function which can check both sets of
parity data. Instead, it uses the RAID5 parity check code which also
has the side effect of destroying the in-memory copy of the parity
information. This would be why the "p" parity disk was no longer marked
uptodate. The parity check destroyed its data.
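As a rough illustration of why that check is destructive, here is a simplified model of the xor "zero-sum" check referred to by the kernel comment quoted below, not the actual async_tx implementation: the data blocks are XORed into the parity buffer in place and the result is tested for zero, so the original copy of P is gone afterwards.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified model of an xor "zero-sum" parity check: XOR every data
 * block into the parity buffer and report whether the result is all
 * zeroes. Side effect: the parity buffer no longer holds the original
 * P data on return, which is why the real code clears R5_UPTODATE on
 * the P device before running the check.
 */
static bool xor_zero_sum_check(uint8_t *parity, uint8_t *const data[],
                               size_t ndata, size_t len)
{
        uint8_t acc = 0;

        for (size_t i = 0; i < ndata; i++)
                for (size_t j = 0; j < len; j++)
                        parity[j] ^= data[i][j];        /* P destroyed in place */

        for (size_t j = 0; j < len; j++)
                acc |= parity[j];

        return acc == 0;        /* true => parity matched the data */
}

int main(void)
{
        uint8_t d0[4] = { 1, 2, 3, 4 };
        uint8_t d1[4] = { 5, 6, 7, 8 };
        uint8_t p[4]  = { 1 ^ 5, 2 ^ 6, 3 ^ 7, 4 ^ 8 };  /* correct parity */
        uint8_t *data[] = { d0, d1 };

        printf("parity ok: %d\n", xor_zero_sum_check(p, data, 2, 4));
        printf("p[0] after check: %u\n", p[0]);  /* 0, original P value lost */
        return 0;
}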
handle_parity_checks6() itself cleared the R5_UPTODATE flag and
decreased the uptodate count to account for running the parity check
that will destroy the parity data:
        case check_state_idle:
                /* start a new check operation if there are < 2 failures */
                if (s->failed == s->q_failed) {
                        /* The only possible failed device holds Q, so it
                         * makes sense to check P (If anything else were failed,
                         * we would have used P to recreate it).
                         */
                        sh->check_state = check_state_run;
                }
...
                if (sh->check_state == check_state_run) {
                        /* async_xor_zero_sum destroys the contents of P */
                        clear_bit(R5_UPTODATE, &sh->dev[pd_idx].flags);
                        s->uptodate--;
                }
When the parity check succeeds, handle_parity_checks6() fails to account
for this difference in behavior. The parity check succeeds while a
failed device is present, and the raid6 code attempts to jump from the
"check_state_check_result" state to the "check_state_compute_result"
state without re-reading or re-computing the destroyed parity data. This
leaves the uptodate count too low for the "check_state_compute_result"
state, triggering the BUG_ON.
        case check_state_check_result:
                sh->check_state = check_state_idle;

                /* handle a successful check operation, if parity is correct
                 * we are done. Otherwise update the mismatch count and repair
                 * parity if !MD_RECOVERY_CHECK
                 */
                if (sh->ops.zero_sum_result == 0) {
                        /* both parities are correct */
                        if (!s->failed)
                                set_bit(STRIPE_INSYNC, &sh->state);
                        else {
                                /* in contrast to the raid5 case we can validate
                                 * parity, but still have a failure to write
                                 * back
                                 */
                                sh->check_state = check_state_compute_result;
                                /* Returning at this point means that we may go
                                 * off and bring p and/or q uptodate again so
                                 * we make sure to check zero_sum_result again
                                 * to verify if p or q need writeback
                                 */
                        }
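Putting the numbers from this report together (a rough walk-through of the accounting, not a quote of the kernel code):

/* 8-member array with one failed member, as in the dumps above:
 *
 *   disks = 8, s->failed = 1             =>  s->uptodate starts at 7
 *   check_state_run clears R5_UPTODATE
 *   on P and does s->uptodate--          =>  s->uptodate is now 6
 *   check_state_compute_result expects
 *   s->uptodate >= disks - 1 = 7         =>  6 < 7, the BUG_ON fires
 */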
To trigger this issue, you need not only a failed disk but also a
scrubbing operation in progress. The customer's systems both crashed
early Sunday morning, when the raid-check script is run by default from
cron.