On Wed, Jul 31, 2019 at 3:41 AM Guoqing Jiang <guoqing.jiang@xxxxxxxxxxxxxxx> wrote: > > > > On 7/30/19 10:36 AM, Yufen Yu wrote: > > > > > > On 2019/7/29 23:29, Guoqing Jiang wrote: > >> > >> > >> On 7/29/19 3:23 PM, Yufen Yu wrote: > >>> > >>> > >>> On 2019/7/29 19:54, Guoqing Jiang wrote: > >>>> > >>>> > >>>> On 7/29/19 1:36 PM, Yufen Yu wrote: > >>>>> I don't think this can fix the race condition completely. > >>>>> > >>>>> - p->rdev = NULL; > >>>>> if (!test_bit(RemoveSynchronized, &rdev->flags)) { > >>>>> synchronize_rcu(); > >>>>> + p->rdev = NULL; > >>>>> if (atomic_read(&rdev->nr_pending)) { > >>>>> > >>>>> If we access conf->mirrors[i].rdev (e.g. raid1_write_request()) > >>>>> after RCU grace period, > >>>>> synchronize_rcu() will not wait the reader. Then, it also can > >>>>> cause NULL pointer dereference. > >>>>> > >>>>> That is the reason why we add the new flag 'WantRemove'. It can > >>>>> prevent the reader to access > >>>>> the 'rdev' after RCU grace period. > >>>> > >>> > >>> Sorry for my wrong description. It is ** after RCU grace start **, > >>> not 'after RCU grace period'. > >>> > >>> > >>>> > >>>> How about move it to the else branch? > >>>> > >>>> @@ -1825,7 +1828,6 @@ static int raid1_remove_disk(struct mddev > >>>> *mddev, struct md_rdev *rdev) > >>>> err = -EBUSY; > >>>> goto abort; > >>>> } > >>>> - p->rdev = NULL; > >>>> if (!test_bit(RemoveSynchronized, &rdev->flags)) { > >>>> synchronize_rcu(); > >>>> if (atomic_read(&rdev->nr_pending)) { > >>>> @@ -1833,8 +1835,10 @@ static int raid1_remove_disk(struct mddev > >>>> *mddev, struct md_rdev *rdev) > >>>> err = -EBUSY; > >>>> p->rdev = rdev; > >>>> goto abort; > >>>> - } > >>>> - } > >>>> + } else > >>>> + p->rdev = NULL; > >>>> + } else > >>>> + p->rdev = NULL; > >>>> > >>>> After rcu period, the nr_pending should be not zero in your case. > >>>> > >>> > >>> I also don't think this can work. > >>> > >>> + + > >>> | | > >>> | | > >>> | +--->Reader | +--->Reader nr_pending++ > >>> | | > >>> | | > >>> | | > >>> + + > >>> start rcu period end rcu period > >>> call synchronize_rcu() return synchronize_rcu() > >>> > >>> If the reader try to read conf->mirrors[i].rdev after rcu peroid start, > >>> synchronize_rcu() will not wait the reader. We assume the current > >>> value of nr_pending is 0. Then, raid1_remove_disk will set 'p->rdev > >>> = NULL'. > >>> > >>> After that the reader add 'nr_pending' to 1 and try to access > >>> conf->mirrors[i].rdev, > >>> It can cause NULL pointer dereference. > >>> > >>> Adding the new flag 'WantRemove' can prevent the reader to access > >>> conf->mirrors[i].rdev after ** start rcu period **. > >> > >> I have to admit that I don't know RCU well, but add an additional > >> flag to address rcu related > >> stuff is not a right way from my understanding ... > >> > >> And your original patch set p->rdev to NULL finally which is not > >> correct I think, I also wonder > >> what will happen if "raid1_remove_disk set WantRemove and p->rdev = > >> NULL" happens between rcu_read_lock/unlock and access > >> conf->mirrors[i].rdev. > >> > > > > We should not forget there is 'synchronize_rcu()' between set > > 'WantRemove' > > and 'p->rdev = NULL'. If your worried scenes is true, it means that > > the reader > > call rcu_read_lock() before synchronize_rcu(). Then, > > synchronize_rcu() will wait > > until the reader call rcu_read_unlock(), which is inconsistent with > > the worried scenes. > > I could misunderstood your description. For write request, two parts in > it need to > dereference rdev, one has the protection of rcu lock while another > didn't have the > protection. > > 1. raid1_write_request > part 1.a has rcu_read_lock, rcu_dereference(rdev) and rcu_read_unlock. > > part 1.b: > use rdev to dereference without the protection of rcu read lock, such as > "conf->mirrors[i].rdev->data_offset" in your description. > > Now raid1_remove_disk set WantRemove flag before rdev is set to NULL, > let's say > part 2.a includes those lines: "set_bit(WantRemove, &rdev->flags)", > synchronize_rcu > and "p->rdev = NULL". > > If 2.a is called between 1.a and 1.b, the dereference issue still > exists, no? So my > naive thinking is to add protection to part 1.b as well. > > > > >> Anyway, I hope something like this can work. > >> > >> gjiang@ls00508:/media/gjiang/opensource-tree/linux$ git diff > >> drivers/md/raid1.c > >> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c > >> index 34e26834ad28..62d0b3b69628 100644 > >> --- a/drivers/md/raid1.c > >> +++ b/drivers/md/raid1.c > >> @@ -1471,8 +1471,15 @@ static void raid1_write_request(struct mddev > >> *mddev, struct bio *bio, > >> > >> first_clone = 1; > >> > >> + rcu_read_lock(); > >> for (i = 0; i < disks; i++) { > >> struct bio *mbio = NULL; > >> + struct md_rdev *rdev = > >> rcu_dereference(conf->mirrors[i].rdev); > >> + if (!rdev || test_bit(Faulty, &rdev->flags)) > >> + continue; > >> if (!r1_bio->bios[i]) > >> continue; > >> > >> @@ -1500,7 +1507,6 @@ static void raid1_write_request(struct mddev > >> *mddev, struct bio *bio, > >> mbio = bio_clone_fast(bio, GFP_NOIO, > >> &mddev->bio_set); > >> > >> if (r1_bio->behind_master_bio) { > >> - struct md_rdev *rdev = conf->mirrors[i].rdev; > >> > >> if (test_bit(WBCollisionCheck, &rdev->flags)) { > >> sector_t lo = r1_bio->sector; > >> @@ -1551,6 +1557,7 @@ static void raid1_write_request(struct mddev > >> *mddev, struct bio *bio, > >> md_wakeup_thread(mddev->thread); > >> } > >> } > >> + rcu_read_unlock(); > >> > > > > The region may sleep to wait memory allocate. So, I don't think it is > > reasonable > > to add rcu_read_lock/unlock(). > > Thanks for the investigation, that is why rcu_read_lock/unlock are not > called here. > > > > >> r1_bio_write_done(r1_bio); > >> > >> @@ -1825,16 +1832,19 @@ static int raid1_remove_disk(struct mddev > >> *mddev, struct md_rdev *rdev) > >> err = -EBUSY; > >> goto abort; > >> } > >> - p->rdev = NULL; > >> if (!test_bit(RemoveSynchronized, &rdev->flags)) { > >> synchronize_rcu(); > >> if (atomic_read(&rdev->nr_pending)) { > >> /* lost the race, try later */ > >> err = -EBUSY; > >> - p->rdev = rdev; > >> + rcu_assign_pointer(p->rdev, rdev); > >> goto abort; > >> + } else { > >> + RCU_INIT_POINTER(p->rdev, NULL); > >> + synchronize_rcu(); > > > > What is the purpose of adding 'synchronize_rcu()' here? > > Just follow some code in kernel since I don't know RCU well as I said. > > Another way comes to my mind (totally untested), and I think this way > is more reasonable. > BTW, I had no luck to hit the issue since the race is happened in a > short window, so I can't > verify the change. > > @@ -8769,7 +8776,8 @@ static int remove_and_add_spares(struct mddev *mddev, > int removed = 0; > bool remove_some = false; > > - if (this && test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) > + if (this && (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) || > + (test_bit(MD_RECOVERY_NEEDED, &mddev->recovery)))) > /* Mustn't remove devices when resync thread */ > return 0; > Sorry for jumping in late. Thanks Guoqing for all the feedback. Question: do we need a case of "WantRemove but Not Faulty"? Thanks, Song