Re: [PATCH] md/raid1: fix a race between removing rdev and access conf->mirrors[i].rdev

On 7/30/19 10:36 AM, Yufen Yu wrote:


On 2019/7/29 23:29, Guoqing Jiang wrote:


On 7/29/19 3:23 PM, Yufen Yu wrote:


On 2019/7/29 19:54, Guoqing Jiang wrote:


On 7/29/19 1:36 PM, Yufen Yu wrote:
I don't think this can fix the race condition completely.

-               p->rdev = NULL;
                  if (!test_bit(RemoveSynchronized, &rdev->flags)) {
                          synchronize_rcu();
+                       p->rdev = NULL;
                          if (atomic_read(&rdev->nr_pending)) {

If we access conf->mirrors[i].rdev (e.g. in raid1_write_request()) after the RCU grace period, synchronize_rcu() will not wait for that reader. Then it can also cause a NULL pointer dereference.

That is the reason why we add the new flag 'WantRemove'. It can prevent the reader from accessing
the 'rdev' after the RCU grace period.


Sorry for my wrong description. It should be ** after the RCU grace period starts **, not 'after the RCU grace period'.



How about moving it to the else branch?

@@ -1825,7 +1828,6 @@ static int raid1_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
                        err = -EBUSY;
                        goto abort;
                }
-               p->rdev = NULL;
                if (!test_bit(RemoveSynchronized, &rdev->flags)) {
                        synchronize_rcu();
                        if (atomic_read(&rdev->nr_pending)) {
@@ -1833,8 +1835,10 @@ static int raid1_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
                                err = -EBUSY;
                                p->rdev = rdev;
                                goto abort;
-                       }
-               }
+                       } else
+                               p->rdev = NULL;
+               } else
+                       p->rdev = NULL;

After the RCU grace period, nr_pending should not be zero in your case.


I also don't think this can work.

    +                                  +
    |                                  |
    |   +---> Reader                   |   +---> Reader nr_pending++
    |                                  |
    +                                  +
 start rcu period                    end rcu period
 (call synchronize_rcu())            (synchronize_rcu() returns)

If the reader tries to read conf->mirrors[i].rdev after the rcu period starts,
synchronize_rcu() will not wait for that reader. Assume the current
value of nr_pending is 0. Then raid1_remove_disk() will set 'p->rdev = NULL'.

After that, the reader increments 'nr_pending' to 1 and tries to access conf->mirrors[i].rdev,
which can cause a NULL pointer dereference.

Adding the new flag 'WantRemove' can prevent the reader from accessing
conf->mirrors[i].rdev after ** the rcu period starts **.
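
In other words, the reader would check WantRemove while still inside rcu_read_lock(),
roughly like this (simplified sketch, the actual hunk may differ):

        rcu_read_lock();
        rdev = rcu_dereference(conf->mirrors[i].rdev);
        if (!rdev || test_bit(WantRemove, &rdev->flags)) {
                /* device is gone or being removed, skip it */
                rcu_read_unlock();
                continue;
        }
        atomic_inc(&rdev->nr_pending);  /* pin the rdev before using it */
        rcu_read_unlock();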

I have to admit that I don't know RCU well, but adding an additional flag to address RCU-related
issues is not the right way, from my understanding ...

And your original patch still sets p->rdev to NULL in the end, which I think is not correct. I also wonder what will happen if "raid1_remove_disk sets WantRemove and p->rdev = NULL" runs between rcu_read_lock/unlock and the access to conf->mirrors[i].rdev.


We should not forget that there is a 'synchronize_rcu()' between setting 'WantRemove' and 'p->rdev = NULL'. If the scenario you are worried about were true, it would mean the reader called rcu_read_lock() before synchronize_rcu(). Then synchronize_rcu() will wait until the reader calls rcu_read_unlock(), which contradicts that scenario.
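
To spell out the ordering on the removal side (simplified sketch of the sequence,
not the exact patch text):

        set_bit(WantRemove, &rdev->flags);  /* new readers will see the flag    */
        synchronize_rcu();                  /* waits for every reader that did  */
                                            /* rcu_read_lock() before this call */
        p->rdev = NULL;                     /* only flag-aware readers remain   */

A reader that entered its read-side critical section before synchronize_rcu() is
waited for; a reader that enters afterwards sees WantRemove set and backs off, so
it never dereferences the cleared pointer.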

I could have misunderstood your description. For the write request, two parts of it need to dereference rdev: one has the protection of the RCU read lock while the other
does not.

1. raid1_write_request
part 1.a has rcu_read_lock, rcu_dereference(rdev) and rcu_read_unlock.

part 1.b:
dereferences rdev without the protection of the RCU read lock, such as
"conf->mirrors[i].rdev->data_offset" in your description.

Now raid1_remove_disk sets the WantRemove flag before rdev is set to NULL; let's say part 2.a includes these lines: "set_bit(WantRemove, &rdev->flags)", synchronize_rcu()
and "p->rdev = NULL".

If 2.a is called between 1.a and 1.b, the dereference issue still exists, no? So my
naive thinking is to add protection to part 1.b as well.


Anyway, I hope something like this can work.

gjiang@ls00508:/media/gjiang/opensource-tree/linux$ git diff drivers/md/raid1.c
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 34e26834ad28..62d0b3b69628 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1471,8 +1471,15 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,

        first_clone = 1;

+       rcu_read_lock();
        for (i = 0; i < disks; i++) {
                struct bio *mbio = NULL;
+               struct md_rdev *rdev = rcu_dereference(conf->mirrors[i].rdev);
+               if (!rdev || test_bit(Faulty, &rdev->flags))
+                       continue;
                if (!r1_bio->bios[i])
                        continue;

@@ -1500,7 +1507,6 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
                        mbio = bio_clone_fast(bio, GFP_NOIO, &mddev->bio_set);

                if (r1_bio->behind_master_bio) {
-                       struct md_rdev *rdev = conf->mirrors[i].rdev;

                        if (test_bit(WBCollisionCheck, &rdev->flags)) {
                                sector_t lo = r1_bio->sector;
@@ -1551,6 +1557,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
                        md_wakeup_thread(mddev->thread);
                }
        }
+       rcu_read_unlock();


This region may sleep waiting for memory allocation. So I don't think it is reasonable
to add rcu_read_lock/unlock() around it.

Thanks for the investigation, that is why rcu_read_lock/unlock are not called here.
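
For reference, the pattern the existing code relies on instead is roughly the
following (simplified sketch, not the exact raid1.c code): take the nr_pending
reference while still inside the read-side critical section, then drop the RCU
lock before anything that may sleep.

        rcu_read_lock();
        rdev = rcu_dereference(conf->mirrors[i].rdev);
        if (rdev && !test_bit(Faulty, &rdev->flags))
                atomic_inc(&rdev->nr_pending);  /* pin the rdev */
        else
                rdev = NULL;
        rcu_read_unlock();

        if (rdev) {
                /* safe to sleep here: raid1_remove_disk() re-checks  */
                /* nr_pending after synchronize_rcu() and bails out   */
                /* with -EBUSY while we still hold the reference      */
                /* ... build and submit the write using rdev ...      */
                rdev_dec_pending(rdev, mddev);
        }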


r1_bio_write_done(r1_bio);

@@ -1825,16 +1832,19 @@ static int raid1_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
                        err = -EBUSY;
                        goto abort;
                }
-               p->rdev = NULL;
                if (!test_bit(RemoveSynchronized, &rdev->flags)) {
                        synchronize_rcu();
                        if (atomic_read(&rdev->nr_pending)) {
                                /* lost the race, try later */
                                err = -EBUSY;
-                               p->rdev = rdev;
+                               rcu_assign_pointer(p->rdev, rdev);
                                goto abort;
+                       } else {
+                               RCU_INIT_POINTER(p->rdev, NULL);
+                               synchronize_rcu();

What is the purpose of adding 'synchronize_rcu()' here?

I just followed some existing code in the kernel, since I don't know RCU well, as I said.

Another way comes to mind (totally untested), and I think this way is more reasonable. BTW, I had no luck hitting the issue since the race happens in a short window, so I can't
verify the change.

@@ -8769,7 +8776,8 @@ static int remove_and_add_spares(struct mddev *mddev,
        int removed = 0;
        bool remove_some = false;

-       if (this && test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
+       if (this && (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
+                    (test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))))
                /* Mustn't remove devices when resync thread */
                return 0;


Thanks,
Guoqing


