----- Original Message ----- > From: "NeilBrown" <neilb@xxxxxxxx> > To: "Xiao Ni" <xni@xxxxxxxxxx>, "Shaohua Li" <shli@xxxxxxxxxx> > Cc: "linux-raid" <linux-raid@xxxxxxxxxxxxxxx>, "Nigel Croxon" <ncroxon@xxxxxxxxxx> > Sent: Thursday, August 30, 2018 8:30:35 AM > Subject: Re: Panic at BUG_ON(force && !conf->barrier); > > On Thu, Aug 16 2018, Xiao Ni wrote: > > > Hi Shaohua > > > > I encounter one panic recent in rhel7.6. The test is: > > 1. Create some VDO devices > > 2. Create raid10 device on 4 vdo devices > > 3. Reshape raid10 device to 6 vdo devices. > > > > When sector_nr <= last it needs to goto read_more. If the r10_bio > > containing the > > read_bio which is submitted before goto has freed and lower_barrier has > > called. > > It'll panic at BUG_ON(force && !conf->barrier) > > > > The possibility of this is decreased by c85ba1 (md: raid1/raid10: don't > > handle failure of bio_add_page()) > > In the test case bio_add_page fails after adding one page. It usually calls > > goto read_more. So the > > problem happens easily. > > > > But in upstream it still has the possibility to hit the BUG_ON. Because the > > max_sectors return from > > read_balance can let sector_nr <= last. > > > > Do you think it's the right way to fix this? > > No. > The right way to fix it is: > > 1/ before the read_more: label, put > raise_barrier(conf, 0); > 2/ in place of the existing raise_barrier() put > raise_barrier(conf, 1); > 3/ after the "goto read_more" put > lower_barrier(conf); Hi Neil Thanks for your reply. The argument force of raise_barrier tries to submit those reshape bios as close as it can. It's the target. So the speed of reshape can be more quick, right? Add raise_barrier(conf, 0) before read_more label, and call raise_barrier(conf, 1). The second call don't wait conf->nr_waiting to be zero. But the first call of raise_barrier(conf, 0), it still needs to wait for conf->nr_waiting to be zero. The time should be same if we call raise_barrier(conf, 0) directly in place of the existing raise_barrier(). And is there a possibility it calls goto read_more many times? Fox example: there are two near copies. And they all have bad sectors in different places. If so it calls raise_barrier(conf, 0) many times and it just calls lower_barrier one time. How about this way: diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index 35bd3a6..49ca735 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -4531,11 +4531,12 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, allow_barrier(conf); } + raise_barrier(conf, 0); + read_more: /* Now schedule reads for blocks from sector_nr to last */ r10_bio = raid10_alloc_init_r10buf(conf); r10_bio->state = 0; - raise_barrier(conf, sectors_done != 0); atomic_set(&r10_bio->remaining, 0); r10_bio->mddev = mddev; r10_bio->sector = sector_nr; @@ -4625,11 +4626,14 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, md_sync_acct_bio(read_bio, r10_bio->sectors); atomic_inc(&r10_bio->remaining); read_bio->bi_next = NULL; - generic_make_request(read_bio); sector_nr += nr_sectors; sectors_done += nr_sectors; - if (sector_nr <= last) + if (sector_nr <= last) { + raise_barrier(conf, 1); + generic_make_request(read_bio); goto read_more; + } else + generic_make_request(read_bio); /* Now that we have done the whole section we can * update reshape_progress Best Regards Xiao