Re: Panic at BUG_ON(force && !conf->barrier);

Xiao Ni <xni@xxxxxxxxxx> · Thu, 30 Aug 2018 14:03:36 +0800

On 08/30/2018 12:36 PM, NeilBrown wrote:
On Wed, Aug 29 2018, Xiao Ni wrote:

----- Original Message -----
From: "NeilBrown" <neilb@xxxxxxxx>
To: "Xiao Ni" <xni@xxxxxxxxxx>, "Shaohua Li" <shli@xxxxxxxxxx>
Cc: "linux-raid" <linux-raid@xxxxxxxxxxxxxxx>, "Nigel Croxon" <ncroxon@xxxxxxxxxx>
Sent: Thursday, August 30, 2018 8:30:35 AM
Subject: Re: Panic at BUG_ON(force && !conf->barrier);

On Thu, Aug 16 2018, Xiao Ni wrote:

Hi Shaohua

I encounter one panic recent in rhel7.6. The test is:
1. Create some VDO devices
2. Create raid10 device on 4 vdo devices
3. Reshape raid10 device to 6 vdo devices.

When sector_nr <= last it needs to goto read_more. If the r10_bio
containing the
read_bio which is submitted before goto has freed and lower_barrier has
called.
It'll panic at BUG_ON(force && !conf->barrier)

The possibility of this is decreased by c85ba1 (md: raid1/raid10: don't
handle failure of bio_add_page())
In the test case bio_add_page fails after adding one page. It usually calls
goto read_more. So the
problem happens easily.

But in upstream it still has the possibility to hit the BUG_ON. Because the
max_sectors return from
read_balance can let sector_nr <= last.

Do you think it's the right way to fix this?
No.
The right way to fix it is:

1/ before the read_more: label, put
      raise_barrier(conf, 0);
2/ in place of the existing raise_barrier() put
      raise_barrier(conf, 1);
3/ after the "goto read_more" put
      lower_barrier(conf);
Hi Neil

Thanks for your reply.

The argument force of raise_barrier tries to submit those
reshape bios as close as it can. It's the target. So the
speed of reshape can be more quick, right?
Wrong.  The goal is to avoid deadlock.  If you hold a lock (which
raise_barrier effectively is), it isn't safe to block while waiting to
take it again - even if it is a reader-lock.

Could you explain this in detail? How the deadlock can happen if without 
force?
Raid1 didn't have the argument force before new barrier codes too.

Add raise_barrier(conf, 0) before read_more label, and call
raise_barrier(conf, 1). The second call don't wait conf->nr_waiting
to be zero. But the first call of raise_barrier(conf, 0), it still
needs to wait for conf->nr_waiting to be zero. The time should be
same if we call raise_barrier(conf, 0) directly in place of the
existing raise_barrier().

And is there a possibility it calls goto read_more many times? Fox example:
there are two near copies. And they all have bad sectors in different
places. If so it calls raise_barrier(conf, 0) many times and it just calls
lower_barrier one time.
Yes, it could goto read_more many times.  Every time it calls
raise_barrier() and also schedules a request.  When the request
completes, lower_barrier is called.  So the raise_barrier() calls after
read_more exactly match the lower_barrier calls when the request
completes.
The new raise_barrier call I introduced balances the new
lower_barrier() call I introduced.  The purpose of these two is to keep
->barrier elevated until reshape_request can completed a set of
requests.

Sorry. I put the new raise_barrier you added in the wrong place. Now I 
understand
the relationship.  I'll send a new patch later.

Best Regards
Xiao