Re: [PATCH] md/raid1: fix a race between removing rdev and access conf->mirrors[i].rdev

Song Liu <liu.song.a23@xxxxxxxxx> · Wed, 31 Jul 2019 14:01:09 -0700

On Wed, Jul 31, 2019 at 3:41 AM Guoqing Jiang
<guoqing.jiang@xxxxxxxxxxxxxxx> wrote:
>
>
>
> On 7/30/19 10:36 AM, Yufen Yu wrote:
> >
> >
> > On 2019/7/29 23:29, Guoqing Jiang wrote:
> >>
> >>
> >> On 7/29/19 3:23 PM, Yufen Yu wrote:
> >>>
> >>>
> >>> On 2019/7/29 19:54, Guoqing Jiang wrote:
> >>>>
> >>>>
> >>>> On 7/29/19 1:36 PM, Yufen Yu wrote:
> >>>>> I don't think this can fix the race condition completely.
> >>>>>
> >>>>> -               p->rdev = NULL;
> >>>>>                   if (!test_bit(RemoveSynchronized, &rdev->flags)) {
> >>>>>                           synchronize_rcu();
> >>>>> +                       p->rdev = NULL;
> >>>>>                           if (atomic_read(&rdev->nr_pending)) {
> >>>>>
> >>>>> If we access conf->mirrors[i].rdev (e.g. raid1_write_request())
> >>>>> after RCU grace period,
> >>>>> synchronize_rcu() will not wait the reader. Then, it also can
> >>>>> cause NULL pointer dereference.
> >>>>>
> >>>>> That is the reason why we add the new flag 'WantRemove'. It can
> >>>>> prevent the reader to access
> >>>>> the 'rdev' after RCU grace period.
> >>>>
> >>>
> >>> Sorry for my wrong description. It is  ** after RCU grace start **,
> >>> not 'after RCU grace period'.
> >>>
> >>>
> >>>>
> >>>> How about move it to the else branch?
> >>>>
> >>>> @@ -1825,7 +1828,6 @@ static int raid1_remove_disk(struct mddev
> >>>> *mddev, struct md_rdev *rdev)
> >>>>                         err = -EBUSY;
> >>>>                         goto abort;
> >>>>                 }
> >>>> -               p->rdev = NULL;
> >>>>                 if (!test_bit(RemoveSynchronized, &rdev->flags)) {
> >>>>                         synchronize_rcu();
> >>>>                         if (atomic_read(&rdev->nr_pending)) {
> >>>> @@ -1833,8 +1835,10 @@ static int raid1_remove_disk(struct mddev
> >>>> *mddev, struct md_rdev *rdev)
> >>>>                                 err = -EBUSY;
> >>>>                                 p->rdev = rdev;
> >>>>                                 goto abort;
> >>>> -                       }
> >>>> -               }
> >>>> +                       } else
> >>>> +                               p->rdev = NULL;
> >>>> +               } else
> >>>> +                       p->rdev = NULL;
> >>>>
> >>>> After rcu period, the nr_pending should be not zero in your case.
> >>>>
> >>>
> >>> I also don't think this can work.
> >>>
> >>>     +                              +
> >>>      |                                |
> >>>      |                                |
> >>>      |  +--->Reader         |          +--->Reader nr_pending++
> >>>      |                                |
> >>>      |                                |
> >>>      |                                |
> >>>     +                               +
> >>> start rcu period             end rcu period
> >>> call synchronize_rcu()    return synchronize_rcu()
> >>>
> >>> If the reader try to read conf->mirrors[i].rdev after rcu peroid start,
> >>> synchronize_rcu() will not wait the reader. We assume the current
> >>> value of nr_pending is 0. Then, raid1_remove_disk will set 'p->rdev
> >>> = NULL'.
> >>>
> >>> After that the reader add 'nr_pending' to 1 and try to access
> >>> conf->mirrors[i].rdev,
> >>> It can cause NULL pointer dereference.
> >>>
> >>> Adding the new flag 'WantRemove' can prevent the reader to access
> >>> conf->mirrors[i].rdev after ** start rcu period **.
> >>
> >> I have to admit that I don't know RCU well, but add an additional
> >> flag to address rcu related
> >> stuff is not a right way from my understanding ...
> >>
> >> And your original patch set p->rdev to NULL finally which is not
> >> correct I think, I also wonder
> >> what will happen if "raid1_remove_disk set WantRemove and p->rdev =
> >> NULL" happens between rcu_read_lock/unlock and access
> >> conf->mirrors[i].rdev.
> >>
> >
> > We should not forget there is 'synchronize_rcu()' between set
> > 'WantRemove'
> > and 'p->rdev = NULL'. If your worried scenes is true, it means that
> > the reader
> > call rcu_read_lock() before synchronize_rcu().  Then,
> > synchronize_rcu() will wait
> > until the reader call rcu_read_unlock(), which is inconsistent with
> > the worried scenes.
>
> I could misunderstood your description.  For write request, two parts in
> it need to
> dereference rdev, one has the protection of rcu lock while another
> didn't have the
> protection.
>
> 1. raid1_write_request
> part 1.a has rcu_read_lock, rcu_dereference(rdev) and rcu_read_unlock.
>
> part 1.b:
> use rdev to dereference without the protection of rcu read lock, such as
> "conf->mirrors[i].rdev->data_offset" in your description.
>
> Now raid1_remove_disk set WantRemove flag before rdev is set to NULL,
> let's say
> part 2.a includes those lines: "set_bit(WantRemove, &rdev->flags)",
> synchronize_rcu
> and "p->rdev = NULL".
>
> If 2.a is called between 1.a and 1.b, the dereference issue still
> exists, no? So my
> naive thinking is to add protection to part 1.b as well.
>
> >
> >> Anyway, I hope something like this can work.
> >>
> >> gjiang@ls00508:/media/gjiang/opensource-tree/linux$ git diff
> >> drivers/md/raid1.c
> >> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> >> index 34e26834ad28..62d0b3b69628 100644
> >> --- a/drivers/md/raid1.c
> >> +++ b/drivers/md/raid1.c
> >> @@ -1471,8 +1471,15 @@ static void raid1_write_request(struct mddev
> >> *mddev, struct bio *bio,
> >>
> >>         first_clone = 1;
> >>
> >> +       rcu_read_lock();
> >>         for (i = 0; i < disks; i++) {
> >>                 struct bio *mbio = NULL;
> >> +               struct md_rdev *rdev =
> >> rcu_dereference(conf->mirrors[i].rdev);
> >> +               if (!rdev || test_bit(Faulty, &rdev->flags))
> >> +                       continue;
> >>                 if (!r1_bio->bios[i])
> >>                         continue;
> >>
> >> @@ -1500,7 +1507,6 @@ static void raid1_write_request(struct mddev
> >> *mddev, struct bio *bio,
> >>                         mbio = bio_clone_fast(bio, GFP_NOIO,
> >> &mddev->bio_set);
> >>
> >>                 if (r1_bio->behind_master_bio) {
> >> -                       struct md_rdev *rdev = conf->mirrors[i].rdev;
> >>
> >>                         if (test_bit(WBCollisionCheck, &rdev->flags)) {
> >>                                 sector_t lo = r1_bio->sector;
> >> @@ -1551,6 +1557,7 @@ static void raid1_write_request(struct mddev
> >> *mddev, struct bio *bio,
> >>                         md_wakeup_thread(mddev->thread);
> >>                 }
> >>         }
> >> +       rcu_read_unlock();
> >>
> >
> > The region may sleep to wait memory allocate. So, I don't think it is
> > reasonable
> > to add rcu_read_lock/unlock().
>
> Thanks for the investigation, that is why rcu_read_lock/unlock are not
> called here.
>
> >
> >> r1_bio_write_done(r1_bio);
> >>
> >> @@ -1825,16 +1832,19 @@ static int raid1_remove_disk(struct mddev
> >> *mddev, struct md_rdev *rdev)
> >>                         err = -EBUSY;
> >>                         goto abort;
> >>                 }
> >> -               p->rdev = NULL;
> >>                 if (!test_bit(RemoveSynchronized, &rdev->flags)) {
> >>                         synchronize_rcu();
> >>                         if (atomic_read(&rdev->nr_pending)) {
> >>                                 /* lost the race, try later */
> >>                                 err = -EBUSY;
> >> -                               p->rdev = rdev;
> >> +                               rcu_assign_pointer(p->rdev, rdev);
> >>                                 goto abort;
> >> +                       } else {
> >> +                               RCU_INIT_POINTER(p->rdev, NULL);
> >> +                               synchronize_rcu();
> >
> > What is the purpose of adding 'synchronize_rcu()' here?
>
> Just follow some code in kernel since I don't know RCU well as I said.
>
> Another way comes to my mind (totally untested),  and I think this way
> is more reasonable.
> BTW, I had no luck to hit the issue since the race is happened in a
> short window, so I can't
> verify the change.
>
> @@ -8769,7 +8776,8 @@ static int remove_and_add_spares(struct mddev *mddev,
>          int removed = 0;
>          bool remove_some = false;
>
> -       if (this && test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
> +       if (this && (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
> +                    (test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))))
>                  /* Mustn't remove devices when resync thread */
>                  return 0;
>

Sorry for jumping in late. Thanks Guoqing for all the feedback.

Question: do we need a case of "WantRemove but Not Faulty"?

Thanks,
Song