On Thursday May 1, dan.j.williams@xxxxxxxxx wrote: > commit bd2ab67030e9116f1e4aae1289220255412b37fd "md: close a livelock > window in handle_parity_checks5" introduced a bug in handling 'repair' > operations. After a repair operation completes we clear the state bits > tracking this operation. However, they are cleared too early and this > results in the code deciding to re-run the parity check operation. Since > we have done the repair in memory the second check does not find a mismatch > and thus does not do a writeback. yes.... I must admit that I find that code fairly hard to make sense of, but I can see how it was failing before and how this fixes it, and testing confirms that, so I suspect it is right. I cannot help feeling that there must be some way to simplify all those .pending and .complete bits and make it somewhat clearer, but I haven't been able to figure out how :-( So: Acked-by: NeilBrown <neilb@xxxxxxx> I'm heading for a weekend, but feel free to send this to akpm. Thanks, NeilBrown > > Test results: > $ echo repair > /sys/block/md0/md/sync_action > $ cat /sys/block/md0/md/mismatch_cnt > 51072 > $ echo repair > /sys/block/md0/md/sync_action > $ cat /sys/block/md0/md/mismatch_cnt > 0 > > Cc: <stable@xxxxxxxxxx> > Reported-by: George Spelvin <linux@xxxxxxxxxxx> > Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx> > --- > > drivers/md/raid5.c | 21 +++++++++++---------- > 1 files changed, 11 insertions(+), 10 deletions(-) > > > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c > index 087eee0..da3390d 100644 > --- a/drivers/md/raid5.c > +++ b/drivers/md/raid5.c > @@ -2400,16 +2400,6 @@ static void handle_parity_checks5(raid5_conf_t *conf, struct stripe_head *sh, > canceled_check = 1; /* STRIPE_INSYNC is not set */ > } > > - /* check if we can clear a parity disk reconstruct */ > - if (test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.complete) && > - test_bit(STRIPE_OP_MOD_REPAIR_PD, &sh->ops.pending)) { > - > - clear_bit(STRIPE_OP_MOD_REPAIR_PD, &sh->ops.pending); > - clear_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.complete); > - clear_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.ack); > - clear_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending); > - } > - > /* start a new check operation if there are no failures, the stripe is > * not insync, and a repair is not in flight > */ > @@ -2424,6 +2414,17 @@ static void handle_parity_checks5(raid5_conf_t *conf, struct stripe_head *sh, > } > } > > + /* check if we can clear a parity disk reconstruct */ > + if (test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.complete) && > + test_bit(STRIPE_OP_MOD_REPAIR_PD, &sh->ops.pending)) { > + > + clear_bit(STRIPE_OP_MOD_REPAIR_PD, &sh->ops.pending); > + clear_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.complete); > + clear_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.ack); > + clear_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending); > + } > + > + > /* Wait for check parity and compute block operations to complete > * before write-back. If a failure occurred while the check operation > * was in flight we need to cycle this stripe through handle_stripe -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html