Re: [PATCH] md/raid10: Fix raid10 replace hang when new added disk faulty

Shaohua Li <shli@xxxxxxxxxx> · Fri, 7 Sep 2018 11:20:09 -0700

On Fri, Sep 07, 2018 at 11:30:34AM +0800, alexwu wrote:
> From: Alex Wu <alexwu@xxxxxxxxxxxx>
> 
> In raid10_sync_request(), we expect issue a bio with callback
> end_sync_read(), and a bio with callback end_sync_write().
> 
> In normal case, we will add resync sectors into mddev->recovery_active
> when raid10_sync_request() returned, and sub resync sectors from
> mddev->recovery_active when end_sync_write() call end_sync_request().
> 
> If new added disk, which are replacing the old disk, is set faulty,
> there is a race issue:
>     1. In the first rcu protected section, resync thread did not detect
>        that mreplace is set faulty and pass the condition.
>     2. In the second rcu protected section, mreplace is set faulty.
>     3. But, resync thread will prepare the read object first, and then
>        check the write condition.
>     4. It will find that mreplace is set faulty and do not have to
>        prepare write object.
> This cause we add resync sectors but never sub.
> 
> This issue can be easily reproduced by the following steps:
>     mdadm -C /dev/md0 --assume-clean -l 10 -n 4 /dev/sd[abcd]
>     mdadm /dev/md0 -a /dev/sde
>     mdadm /dev/md0 --replace /dev/sdd
>     sleep 1
>     mdadm /dev/md0 -f /dev/sde
> 
> This issue can be fixed by checking the write condition before prepare
> the read object in the second rcu protected section.
> 
> Reported-by: Alex Chen <alexchen@xxxxxxxxxxxx>
> Reviewed-by: Allen Peng <allenpeng@xxxxxxxxxxxx>
> Reviewed-by: BingJing Chang <bingjingc@xxxxxxxxxxxx>
> Signed-off-by: Alex Wu <alexwu@xxxxxxxxxxxx>
> ---
>  drivers/md/raid10.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 9818980..593d193 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -3189,6 +3189,17 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
>  						continue;
>  					}
>  				}
> +
> +				/* make sure we need to write
> +				 * before we prepare read object
> +				 */
> +				if (test_bit(In_sync, &mrdev->flags) &&
> +					(mreplace == NULL ||
> +					 r10_bio->devs[1].repl_bio == NULL ||
> +					 test_bit(Faulty, &mreplace->flags))) {
> +					break;
> +				}
> +

Does this fix the problem completely? Faulty can be set at any time, for
example, after this check, then we will have a mismatch again. How about we
record the Faulty bit for mpreplace locally, then use it for both read/write
check.

Thanks,
Shaohua
>  				bio = r10_bio->devs[0].bio;
>  				bio->bi_next = biolist;
>  				biolist = bio;
> -- 
> 2.7.4
>