On 9/20/18 8:53 AM, alexwu wrote:
Guoqing Jiang wrote on 2018-09-19 16:12:
On 9/19/18 3:24 PM, Guoqing Jiang wrote:
On 9/19/18 1:15 PM, alexwu wrote:
Guoqing Jiang wrote on 2018-09-17 15:47:
On 9/10/18 3:10 PM, alexwu wrote:
Shaohua Li wrote on 2018-09-08 02:20:
On Fri, Sep 07, 2018 at 11:30:34AM +0800, alexwu wrote:
From: Alex Wu <alexwu@xxxxxxxxxxxx>
In raid10_sync_request(), we expect to issue a bio with callback
end_sync_read() and a bio with callback end_sync_write().
In the normal case, resync sectors are added to mddev->recovery_active
after raid10_sync_request() returns, and subtracted from
mddev->recovery_active when end_sync_write() calls end_sync_request().
If the newly added disk, which is replacing the old disk, is set faulty,
there is a race:
1. In the first rcu protected section, the resync thread does not detect
   that mreplace is set faulty and passes the condition.
2. In the second rcu protected section, mreplace is set faulty.
3. But the resync thread prepares the read object first, and only then
   checks the write condition.
4. It then finds that mreplace is set faulty and does not prepare the
   write object.
As a result, we add resync sectors but never subtract them.
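The imbalance can be pictured with a minimal userspace sketch (not kernel
code; recovery_active, sync_one_chunk() and write_completed() below are
stand-ins for mddev->recovery_active, the recovery loop and the
end_sync_write()/end_sync_request() completion path):

/* Minimal userspace sketch of the accounting imbalance described above.
 * The names mirror the kernel ones, but everything here is a stand-in.
 */
#include <stdbool.h>
#include <stdio.h>

static long recovery_active;		/* models mddev->recovery_active */
static bool replacement_faulty;		/* models test_bit(Faulty, &mreplace->flags) */

/* models end_sync_write() -> end_sync_request(): balances the earlier add */
static void write_completed(long sectors)
{
	recovery_active -= sectors;
}

/* models one recovery iteration for a replacement target */
static void sync_one_chunk(long sectors)
{
	/* first check: the replacement still looks healthy */
	if (replacement_faulty)
		return;

	/* the replacement is marked Faulty between the two checks */
	replacement_faulty = true;

	/* read object prepared; in the kernel the sectors are added to
	 * recovery_active in md_do_sync() after raid10_sync_request()
	 * returns */
	recovery_active += sectors;

	/* the second check now fails, so no write bio is issued and
	 * write_completed() never runs for this chunk */
	if (replacement_faulty)
		return;

	write_completed(sectors);
}

int main(void)
{
	sync_one_chunk(128);
	/* stays at 128 forever; md_do_sync() would then wait on it */
	printf("recovery_active = %ld\n", recovery_active);
	return 0;
}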
This issue can be easily reproduced by the following steps:
mdadm -C /dev/md0 --assume-clean -l 10 -n 4 /dev/sd[abcd]
mdadm /dev/md0 -a /dev/sde
mdadm /dev/md0 --replace /dev/sdd
sleep 1
mdadm /dev/md0 -f /dev/sde
This issue can be fixed by checking the write condition before preparing
the read object in the second rcu protected section.
Reported-by: Alex Chen <alexchen@xxxxxxxxxxxx>
Reviewed-by: Allen Peng <allenpeng@xxxxxxxxxxxx>
Reviewed-by: BingJing Chang <bingjingc@xxxxxxxxxxxx>
Signed-off-by: Alex Wu <alexwu@xxxxxxxxxxxx>
---
drivers/md/raid10.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 9818980..593d193 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3189,6 +3189,17 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 						continue;
 					}
 				}
+
+				/* make sure we need to write
+				 * before we prepare read object
+				 */
+				if (test_bit(In_sync, &mrdev->flags) &&
+				    (mreplace == NULL ||
+				     r10_bio->devs[1].repl_bio == NULL ||
+				     test_bit(Faulty, &mreplace->flags))) {
+					break;
+				}
+
Does this fix the problem completely? Faulty can be set at any time, for
example after this check, and then we will have a mismatch again. How
about recording the Faulty bit of mreplace locally, then using it for
both the read and write checks?
Thanks,
Shaohua
 				bio = r10_bio->devs[0].bio;
 				bio->bi_next = biolist;
 				biolist = bio;
--
2.7.4
I think you are right, this patch might not fix the problem completely.
I will try another approach and test it.
Thanks for your reply.
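For what it's worth, the local-snapshot pattern Shaohua suggests might
look roughly like this (a userspace sketch under that assumption only;
sync_one_chunk() and the prints are hypothetical stand-ins, not raid10
code):

/* Sketch of the "record the Faulty bit locally" idea: sample the
 * replacement's state once and use that one snapshot for both the
 * read-side and the write-side decisions, so they cannot disagree.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_bool replacement_faulty;	/* models test_bit(Faulty, &mreplace->flags) */

static void sync_one_chunk(void)
{
	/* one snapshot, taken up front */
	bool mreplace_faulty = atomic_load(&replacement_faulty);

	/* read-side decision uses the snapshot */
	if (mreplace_faulty) {
		printf("skip chunk: replacement already faulty\n");
		return;
	}

	/* even if the device goes Faulty right here ... */
	atomic_store(&replacement_faulty, true);

	/* ... the write-side decision still uses the same snapshot, so the
	 * chunk is submitted as a matched read+write pair and the
	 * accounting stays balanced */
	if (!mreplace_faulty)
		printf("submit read + write for this chunk\n");
}

int main(void)
{
	sync_one_chunk();
	return 0;
}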
Sorry for jumping in. I guess we can't increase recovery_active in case
the recovery is interrupted; also, raise_barrier() is called for the
recovery case but it seems no one calls lower_barrier(), so we need to
call it when the device is in the faulty state. Can you try the changes
below on your side?
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 63ceabb4e020..6a7054e92577 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8486,14 +8486,14 @@ void md_do_sync(struct md_thread *thread)
 			break;
 		}
 
+		if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
+			break;
+
 		if (!skipped) { /* actual IO requested */
 			io_sectors += sectors;
 			atomic_add(sectors, &mddev->recovery_active);
 		}
 
-		if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
-			break;
-
 		j += sectors;
 		if (j > max_sectors)
 			/* when skipping, extra large numbers can be returned. */
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 981898049491..411bc39b5d6a 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3239,8 +3239,10 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 				 * happy.
 				 */
 				if (mreplace == NULL || bio == NULL ||
-				    test_bit(Faulty, &mreplace->flags))
+				    test_bit(Faulty, &mreplace->flags)) {
+					lower_barrier(conf);
 					break;
+				}
 				bio->bi_next = biolist;
 				biolist = bio;
 				bio->bi_end_io = end_sync_write;
Thanks,
Guoqing
Thanks for joining the discussion.
I can try your changes when I modify the patch. But I think
lower_barrier() should not be called directly. lower_barrier() will be
called from put_buf() when the r10_bio has finished all its tasks. I
think the goal is to let the r10_bio finish its tasks once we have
allocated the r10_bio and called raise_barrier().
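For reference, put_buf() in drivers/md/raid10.c looks roughly like this
(paraphrased from memory; the exact body, e.g. whether the mempool is
embedded in the conf, varies between kernel versions):

static void put_buf(struct r10bio *r10_bio)
{
	struct r10conf *conf = r10_bio->mddev->private;

	/* return the resync/recovery r10_bio to its pool ... */
	mempool_free(r10_bio, &conf->r10buf_pool);

	/* ... and drop the barrier taken by raise_barrier() */
	lower_barrier(conf);
}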
In this specific case, the code would break before setting bi_end_io to
end_sync_write.

	if (mreplace == NULL || bio == NULL ||
	    test_bit(Faulty, &mreplace->flags))
		break;
Since raid10_sync_request() has already called raise_barrier(), we can't
expect the end_sync_write -> end_sync_request -> put_buf -> lower_barrier
path to decrease the barrier, no?
But we need to figure out where to free the r10_bio (as put_buf() does)
that was allocated before test_bit(Faulty, &mreplace->flags) returned
true, otherwise a memory leak may happen ...
Hmm, maybe it is possible to set biolist to NULL before the break, since
the recovery definitely can't continue in this case.
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index d6f7978b4449..f99893e78d08 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3239,8 +3239,10 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
 				 * happy.
 				 */
 				if (mreplace == NULL || bio == NULL ||
-				    test_bit(Faulty, &mreplace->flags))
+				    test_bit(Faulty, &mreplace->flags)) {
+					biolist = NULL;
 					break;
+				}
Then put_buf() would be called in the following section.

	if (biolist == NULL) {
		while (r10_bio) {
			struct r10bio *rb2 = r10_bio;
			r10_bio = (struct r10bio*) rb2->master_bio;
			rb2->master_bio = NULL;
			put_buf(rb2);
		}
Thanks,
Guoqing
I think we cannot always set biolist to NULL in that condition, because
the resync thread might be doing recovery without a replacement.
No, recovery without a replacement is already handled earlier. At this
point we are only trying to write to the replacement, and I don't think
recovery can write to both the replacement and another member device in
parallel. IOW, we don't need to consider other member devices which are
not in sync; you can also see it from the code:
	/* and maybe write to replacement */
	bio = r10_bio->devs[1].repl_bio;
	if (bio)
		bio->bi_end_io = NULL;
The bi_end_io of the write bio has been set to NULL for the normal case;
the following code then needs to reset bi_end_io etc. for the
replacement, but since the replacement is already in the faulty state we
definitely can't continue the process.
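For context, condensing the hunks quoted earlier in this thread, the
relevant recovery setup looks roughly like the following (an annotated
paraphrase, not verbatim kernel code; elided parts are marked /* ... */):

	/* read from a device that is still In_sync */
	bio = r10_bio->devs[0].bio;
	bio->bi_end_io = end_sync_read;
	/* ... */

	/* and maybe write to replacement */
	bio = r10_bio->devs[1].repl_bio;
	if (bio)
		bio->bi_end_io = NULL;		/* cleared up front */
	/* ... */
	if (mreplace == NULL || bio == NULL ||
	    test_bit(Faulty, &mreplace->flags))
		break;				/* replacement unusable: bi_end_io stays NULL */
	bio->bi_end_io = end_sync_write;	/* only reached when the write can proceed */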
Thanks,
Guoqing