Re: [PATCH] md/raid1: fix a race between removing rdev and access conf->mirrors[i].rdev

Now actually adding the list to Cc ..., sorry about that.

On 7/29/19 12:15 PM, Guoqing Jiang wrote:
Forgot to cc list ...

On 7/29/19 12:03 PM, Guoqing Jiang wrote:


On 7/27/19 4:54 AM, Yufen Yu wrote:


On 2019/7/26 18:21, Guoqing Jiang wrote:


On 7/26/19 8:00 AM, Yufen Yu wrote:
We get a NULL pointer dereference oops when testing raid1 as follows:

mdadm -CR /dev/md1 -l 1 -n 2 /dev/sd[ab]

mdadm /dev/md1 -f /dev/sda
mdadm /dev/md1 -r /dev/sda
mdadm /dev/md1 -a /dev/sda
sleep 5
mdadm /dev/md1 -f /dev/sdb
mdadm /dev/md1 -r /dev/sdb
mdadm /dev/md1 -a /dev/sdb

After a disk (/dev/sda) has been removed, we add the disk to the raid
array again, which triggers recovery. Since the rdev's current state
is 'spare', read/write bios can be issued to the disk.

Then we set the other disk (/dev/sdb) faulty. Since the raid array
is now in a degraded state and /dev/sdb is the only 'In_sync' disk,
raid1_error() will return without actually marking it faulty.

I don't think you can set sdb faulty since it is the last working
device in the raid1, per the comment below, unless you have already
made some internal change.

        /*
         * If it is not operational, then we have already marked it as dead
         * else if it is the last working disks, ignore the error, let the
         * next level up know.
         * else mark the drive as failed
         */


Yes, we cannot set sdb faulty *since the raid array is now in a degraded state
and /dev/sdb is the only In_sync disk*. And raid1_error() will return in the
following path, *without actually marking it faulty*:

     if (test_bit(In_sync, &rdev->flags)
         && (conf->raid_disks - mddev->degraded) == 1) {
         /*
          * Don't fail the drive, act as though we were just a
          * normal single drive.
          * However don't try a recovery from this drive as
          * it is very likely to fail.
          */
         conf->recovery_disabled = mddev->recovery_disabled;
         spin_unlock_irqrestore(&conf->device_lock, flags);
         return;
     }

Here, 'conf->recovery_disabled' is assigned the value of 'mddev->recovery_disabled',
and the 'MD_RECOVERY_INTR' flag is set for 'mddev' in md_error().
Then md_check_recovery() can get to remove_and_add_spares() and try to remove the disk.
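Roughly, the relevant part of md_error() looks like this (paraphrased, not a
verbatim quote; the exact code may differ between kernel versions):

     mddev->pers->error_handler(mddev, rdev);   /* raid1_error() here */
     if (mddev->degraded)
         set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
     set_bit(MD_RECOVERY_INTR, &mddev->recovery);
     set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
     md_wakeup_thread(mddev->thread);

So the md thread wakes up with MD_RECOVERY_INTR set, the running recovery is
interrupted, and md_check_recovery() can go on to remove_and_add_spares().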

raid1_remove_disk()
         if (test_bit(In_sync, &rdev->flags) ||
             atomic_read(&rdev->nr_pending)) {
             err = -EBUSY;
             goto abort;
         }
         /* Only remove non-faulty devices if recovery
          * is not possible.
          */
         if (!test_bit(Faulty, &rdev->flags) &&
             mddev->recovery_disabled != conf->recovery_disabled &&
             mddev->degraded < conf->raid_disks) {
             err = -EBUSY;
             goto abort;
         }
         p->rdev = NULL;

The function cannot 'goto abort', so it sets 'p->rdev = NULL', which causes the
race condition described in the patch.
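For reference, the window on the raid1_write_request() side looks roughly like
the following, heavily simplified and paraphrased from drivers/md/raid1.c (not
verbatim):

     /* first pass: choose targets under RCU and take a reference */
     rcu_read_lock();
     for (i = 0; i < disks; i++) {
         struct md_rdev *rdev = rcu_dereference(conf->mirrors[i].rdev);
         if (!rdev || test_bit(Faulty, &rdev->flags)) {
             r1_bio->bios[i] = NULL;
             continue;
         }
         atomic_inc(&rdev->nr_pending);
         r1_bio->bios[i] = bio;
     }
     rcu_read_unlock();

     /* second pass: conf->mirrors[i].rdev is dereferenced again; if
      * raid1_remove_disk() has set p->rdev = NULL in between, this is
      * a NULL pointer dereference
      */
     for (i = 0; i < disks; i++) {
         if (!r1_bio->bios[i])
             continue;
         ...
         mbio->bi_iter.bi_sector = (r1_bio->sector +
                    conf->mirrors[i].rdev->data_offset);
         ...
     }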

BTW, we have not changed any internal implementation of raid.



However, that can interrupt the recovery action, and
md_check_recovery() will then call remove_and_add_spares() to
remove the spare disk. The following race between
remove_and_add_spares() and raid1_write_request()
can cause a NULL pointer dereference on conf->mirrors[i].rdev:

raid1_write_request()   md_check_recovery    raid1_error()
rcu_read_lock()
rdev != NULL
!test_bit(Faulty, &rdev->flags)

                                             conf->recovery_disabled=
                                             mddev->recovery_disabled;
                                             return busy

                         remove_and_add_spares
                         raid1_remove_disk

I don't quite follow here; raid1_remove_disk() is only called under certain conditions:

                if ((this == NULL || rdev == this) &&
                    rdev->raid_disk >= 0 &&
                    !test_bit(Blocked, &rdev->flags) &&
                    ((test_bit(RemoveSynchronized, &rdev->flags) ||
                     (!test_bit(In_sync, &rdev->flags) &&
                      !test_bit(Journal, &rdev->flags))) &&
                    atomic_read(&rdev->nr_pending)==0))

So rdev->raid_disk should not be a negative value here, while a spare disk shown
as '(S)' should meet the raid_disk < 0 requirement:

                        if (rdev->raid_disk < 0)
                                seq_printf(seq, "(S)"); /* spare */



Maybe I did not express it clearly. I observe the 'spare' state via '/sys/block/md1/md/dev-sda/state':
state_show()
     if (!test_bit(Faulty, &flags) &&
         !test_bit(Journal, &flags) &&
         !test_bit(In_sync, &flags))
         len += sprintf(page+len, "spare%s", sep);

Hmm, so there are two different definitions of "spare" ...


When /dev/sda is added to the array again, hot_add_disk() just adds it to the mddev->disks list.
Then md_check_recovery() invokes remove_and_add_spares(), which actually adds the rdev to the
conf->mirrors[i] array.
After that, rdev->raid_disk is no longer negative and bios can be issued to the disk.
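The actual hook-up into conf->mirrors[] is done by the personality's
->hot_add_disk(), i.e. raid1_add_disk(), roughly like this (simplified, not
verbatim):

     /* raid1_add_disk(), simplified: pick a free slot in conf->mirrors */
     for (mirror = first; mirror <= last; mirror++) {
         p = conf->mirrors + mirror;
         if (!p->rdev) {
             p->head_position = 0;
             rdev->raid_disk = mirror;        /* no longer negative */
             rcu_assign_pointer(p->rdev, rdev);
             break;
         }
     }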


Thanks a lot for the detailed explanation!

For the issue, maybe a better way is to set "p->rdev = NULL" after synchronize_rcu().
Could you try the change below? Thanks.


diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 34e26834ad28..867808bed325 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1825,16 +1825,17 @@ static int raid1_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
                         err = -EBUSY;
                         goto abort;
                 }
-               p->rdev = NULL;
                 if (!test_bit(RemoveSynchronized, &rdev->flags)) {
                         synchronize_rcu();
+                       p->rdev = NULL;
                         if (atomic_read(&rdev->nr_pending)) {
                                 /* lost the race, try later */
                                 err = -EBUSY;
                                 p->rdev = rdev;
                                 goto abort;
                         }
-               }
+               } else
+                       p->rdev = NULL;
                 if (conf->mirrors[conf->raid_disks + number].rdev) {
                         /* We just removed a device that is being replaced.
                          * Move down the replacement.  We drain all IO before


And I think the change should apply to raid10 as well, but let's see whether the above works first.


It seems raid10_error() doesn't have the recovery_disabled handling, so it is not necessary for raid10.
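The last-working-disk branch there just returns without touching
recovery_disabled, roughly (paraphrased from drivers/md/raid10.c, may differ
between versions):

     spin_lock_irqsave(&conf->device_lock, flags);
     if (test_bit(In_sync, &rdev->flags)
         && !enough(conf, rdev->raid_disk)) {
         /*
          * Don't fail the drive, just return an IO error.
          */
         spin_unlock_irqrestore(&conf->device_lock, flags);
         return;
     }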

Thanks,
Guoqing


