On Thu, Aug 30, 2018 at 02:03:36PM +0800, Xiao Ni wrote:
> 
> On 08/30/2018 12:36 PM, NeilBrown wrote:
> > On Wed, Aug 29 2018, Xiao Ni wrote:
> > 
> > > ----- Original Message -----
> > > > From: "NeilBrown" <neilb@xxxxxxxx>
> > > > To: "Xiao Ni" <xni@xxxxxxxxxx>, "Shaohua Li" <shli@xxxxxxxxxx>
> > > > Cc: "linux-raid" <linux-raid@xxxxxxxxxxxxxxx>, "Nigel Croxon" <ncroxon@xxxxxxxxxx>
> > > > Sent: Thursday, August 30, 2018 8:30:35 AM
> > > > Subject: Re: Panic at BUG_ON(force && !conf->barrier);
> > > > 
> > > > On Thu, Aug 16 2018, Xiao Ni wrote:
> > > > 
> > > > > Hi Shaohua
> > > > > 
> > > > > I encountered a panic recently in RHEL 7.6. The test is:
> > > > > 1. Create some VDO devices
> > > > > 2. Create a raid10 device on 4 VDO devices
> > > > > 3. Reshape the raid10 device to 6 VDO devices
> > > > > 
> > > > > When sector_nr <= last it needs to goto read_more. If the r10_bio
> > > > > containing the read_bio submitted before the goto has already
> > > > > been freed and lower_barrier has been called, it panics at
> > > > > BUG_ON(force && !conf->barrier).
> > > > > 
> > > > > The likelihood of this is decreased by c85ba1 (md: raid1/raid10:
> > > > > don't handle failure of bio_add_page()). In the test case,
> > > > > bio_add_page fails after adding one page, which usually triggers
> > > > > the goto read_more, so the problem happens easily.
> > > > > 
> > > > > But upstream can still hit the BUG_ON, because the max_sectors
> > > > > returned from read_balance can leave sector_nr <= last.
> > > > > 
> > > > > Do you think this is the right way to fix it?
> > > > 
> > > > No.
> > > > The right way to fix it is:
> > > > 
> > > > 1/ before the read_more: label, put
> > > >      raise_barrier(conf, 0);
> > > > 2/ in place of the existing raise_barrier() put
> > > >      raise_barrier(conf, 1);
> > > > 3/ after the "goto read_more" put
> > > >      lower_barrier(conf);
> > > 
> > > Hi Neil
> > > 
> > > Thanks for your reply.
> > > 
> > > The force argument of raise_barrier tries to submit those reshape
> > > bios as close together as it can. That is its goal, so the reshape
> > > can run faster, right?
> > 
> > Wrong.  The goal is to avoid deadlock.  If you hold a lock (which
> > raise_barrier effectively is), it isn't safe to block while waiting
> > to take it again - even if it is a reader-lock.
> 
> Could you explain this in detail? How can the deadlock happen without
> force? Raid1 didn't have the force argument before the new barrier
> code either.

It's raise_barrier vs wait_barrier. Without force, the second
raise_barrier could deadlock if wait_barrier has been called.

> > > Add raise_barrier(conf, 0) before the read_more label, and call
> > > raise_barrier(conf, 1) in place of the existing one. The second
> > > call doesn't wait for conf->nr_waiting to become zero, but the
> > > first call, raise_barrier(conf, 0), still has to wait for
> > > conf->nr_waiting to be zero. So the time spent should be the same
> > > as calling raise_barrier(conf, 0) directly in place of the
> > > existing raise_barrier().
> > > 
> > > And is there a possibility that it does goto read_more many times?
> > > For example: there are two near copies, and they both have bad
> > > sectors in different places. If so, it calls raise_barrier(conf, 0)
> > > many times but only calls lower_barrier one time.
> > 
> > Yes, it could goto read_more many times.  Each time, it calls
> > raise_barrier() and also schedules a request.  When the request
> > completes, lower_barrier is called.  So the raise_barrier() calls
> > after read_more exactly match the lower_barrier calls made when the
> > requests complete.
> > 
> > The new raise_barrier call I introduced balances the new
> > lower_barrier() call I introduced.  The purpose of these two is to
> > keep ->barrier elevated until reshape_request can complete a set of
> > requests.
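For context, raise_barrier() in drivers/md/raid10.c looked roughly like
this at the time. This is a simplified paraphrase, not the exact
upstream code, and details vary between kernel versions:

static void raise_barrier(struct r10conf *conf, int force)
{
	/*
	 * 'force' is a promise that the caller already holds the
	 * barrier.  The reported panic is this assertion firing after
	 * a completing r10_bio dropped ->barrier back to zero.
	 */
	BUG_ON(force && !conf->barrier);
	spin_lock_irq(&conf->resync_lock);

	/*
	 * Unless 'force', wait until no normal I/O is queued up
	 * behind the barrier.  Those waiters sit in wait_barrier()
	 * until ->barrier drops to zero, so a thread that already
	 * holds the barrier and calls raise_barrier(conf, 0) again
	 * would wait here forever: the deadlock Neil describes.
	 */
	wait_event_lock_irq(conf->wait_barrier,
			    force || !conf->nr_waiting,
			    conf->resync_lock);

	/* Block any new I/O from starting. */
	conf->barrier++;

	/* Wait for all in-flight normal I/O to drain. */
	wait_event_lock_irq(conf->wait_barrier,
			    !atomic_read(&conf->nr_pending) &&
			    conf->barrier < RESYNC_DEPTH,
			    conf->resync_lock);

	spin_unlock_irq(&conf->resync_lock);
}

So force exists to make a nested raise_barrier() safe, and the BUG_ON
guards its precondition; the bug is that reshape_request() could reach
the forced call after the last r10_bio of the previous pass had already
completed and called lower_barrier().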
> Sorry, I had put the new raise_barrier you added in the wrong place.
> Now I understand the relationship. I'll send a new patch later.

The new patch looks good, will take it.

Thanks,
Shaohua
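Putting Neil's three steps together, the affected part of
reshape_request() ends up shaped roughly like this. This is a schematic
sketch of the suggestion above, not the literal patch that was merged:

	/*
	 * Step 1: take the barrier once before the label, so that
	 * ->barrier stays elevated across every pass of the loop even
	 * if an earlier r10_bio completes and calls lower_barrier().
	 */
	raise_barrier(conf, 0);

read_more:
	/*
	 * Step 2: replace the existing raise_barrier() here with a
	 * forced one.  'force' is now always safe, because the step-1
	 * barrier guarantees conf->barrier is non-zero.
	 */
	raise_barrier(conf, 1);

	/*
	 * ... pick a source via read_balance(), build the r10_bio and
	 * read_bio, add pages with bio_add_page(), submit the read;
	 * its completion path ends up calling lower_barrier(),
	 * balancing this raise ...
	 */

	/* more sectors still to handle? */
	if (sector_nr <= last)
		goto read_more;

	/*
	 * Step 3: drop the outer barrier taken at step 1, balancing
	 * the raise/lower counts no matter how many passes ran.
	 */
	lower_barrier(conf);

Each pass still pairs one raise_barrier() with the lower_barrier() run
at request completion; the outer pair simply guarantees conf->barrier
stays non-zero in between, so the precondition of the forced call
always holds.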