Re: [PATCH] md: fix raid5 'repair' operations

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thursday May 1, dan.j.williams@xxxxxxxxx wrote:
> commit bd2ab67030e9116f1e4aae1289220255412b37fd "md: close a livelock
> window in handle_parity_checks5" introduced a bug in handling 'repair'
> operations.  After a repair operation completes we clear the state bits
> tracking this operation.  However, they are cleared too early and this
> results in the code deciding to re-run the parity check operation.  Since
> we have done the repair in memory the second check does not find a mismatch
> and thus does not do a writeback.

yes....
I must admit that I find that code fairly hard to make sense of, but I
can see how it was failing before and how this fixes it, and testing
confirms that, so I suspect it is right.

I cannot help feeling that there must be some way to simplify all
those .pending and .complete bits and make it somewhat clearer, but I
haven't been able to figure out how :-(

So: Acked-by: NeilBrown <neilb@xxxxxxx>

I'm heading for a weekend, but feel free to send this to akpm.

Thanks,
NeilBrown


> 
> Test results:
> $ echo repair > /sys/block/md0/md/sync_action
> $ cat /sys/block/md0/md/mismatch_cnt
> 51072
> $ echo repair > /sys/block/md0/md/sync_action
> $ cat /sys/block/md0/md/mismatch_cnt
> 0
> 
> Cc: <stable@xxxxxxxxxx>
> Reported-by: George Spelvin <linux@xxxxxxxxxxx>
> Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>
> ---
> 
>  drivers/md/raid5.c |   21 +++++++++++----------
>  1 files changed, 11 insertions(+), 10 deletions(-)
> 
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 087eee0..da3390d 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -2400,16 +2400,6 @@ static void handle_parity_checks5(raid5_conf_t *conf, struct stripe_head *sh,
>  			canceled_check = 1; /* STRIPE_INSYNC is not set */
>  	}
>  
> -	/* check if we can clear a parity disk reconstruct */
> -	if (test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.complete) &&
> -		test_bit(STRIPE_OP_MOD_REPAIR_PD, &sh->ops.pending)) {
> -
> -		clear_bit(STRIPE_OP_MOD_REPAIR_PD, &sh->ops.pending);
> -		clear_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.complete);
> -		clear_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.ack);
> -		clear_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending);
> -	}
> -
>  	/* start a new check operation if there are no failures, the stripe is
>  	 * not insync, and a repair is not in flight
>  	 */
> @@ -2424,6 +2414,17 @@ static void handle_parity_checks5(raid5_conf_t *conf, struct stripe_head *sh,
>  		}
>  	}
>  
> +	/* check if we can clear a parity disk reconstruct */
> +	if (test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.complete) &&
> +		test_bit(STRIPE_OP_MOD_REPAIR_PD, &sh->ops.pending)) {
> +
> +		clear_bit(STRIPE_OP_MOD_REPAIR_PD, &sh->ops.pending);
> +		clear_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.complete);
> +		clear_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.ack);
> +		clear_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending);
> +	}
> +
> +
>  	/* Wait for check parity and compute block operations to complete
>  	 * before write-back.  If a failure occurred while the check operation
>  	 * was in flight we need to cycle this stripe through handle_stripe
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Linux RAID Wiki]     [ATA RAID]     [Linux SCSI Target Infrastructure]     [Linux Block]     [Linux IDE]     [Linux SCSI]     [Linux Hams]     [Device Mapper]     [Device Mapper Cryptographics]     [Kernel]     [Linux Admin]     [Linux Net]     [GFS]     [RPM]     [git]     [Yosemite Forum]


  Powered by Linux