Re: [PATCH 2/8] xfs: separate CIL commit record IO

Dave Chinner <david@xxxxxxxxxxxxx> · Fri, 5 Mar 2021 09:57:51 +1100

On Wed, Mar 03, 2021 at 10:22:05AM -0500, Brian Foster wrote:
> On Wed, Mar 03, 2021 at 11:41:19AM +1100, Dave Chinner wrote:
> > On Mon, Mar 01, 2021 at 10:19:36AM -0500, Brian Foster wrote:
> > > On Tue, Feb 23, 2021 at 02:34:36PM +1100, Dave Chinner wrote:
> > > You haven't addressed my feedback from the previous version. In
> > > particular the bit about whether it is safe to block on ->ic_force_wait
> > > from here considering some of our more quirky buffer locking behavior.
> > 
> > Sorry, first I've heard about this. I don't have any such email in
> > my inbox.
> > 
> 
> For reference, the last bit of this mail:
> 
> https://lore.kernel.org/linux-xfs/20210201160737.GA3252048@bfoster/
> 
> > I don't know what waiting on an iclog in the middle of a checkpoint
> > has to do with buffer locking behaviour, because iclogs don't use
> > buffers and we block waiting on iclog IO completion all the time in
> > xlog_state_get_iclog_space(). If it's not safe to block on iclog IO
> > completion here, then it's not safe to block on an iclog in
> > xlog_state_get_iclog_space(). That's obviously not true, so I'm
> > really not sure what the concern here is...
> > 
> 
> I think the broader question is not so much whether it's safe to block
> here or not, but whether our current use of async log forces might have
> a deadlock vector (which may or may not also include the
> _get_iclog_space() scenario, I'd need to stare at that one a bit). I
> referred to buffer locking because the buffer ->iop_unpin() handler can
> attempt to acquire a buffer lock.

There are none that I know of, and I'm not changing any of the log
write blocking rules. Hence if there is a problem, it's a zero-day
that we have never triggered nor have any awareness about at all.
Hence for the purposes of development and review, we can assume such
unknown design problems don't actually exist because there's
absolutely zero evidence to indicate there is a problem here...

> Looking again, that is the only place I see that blocks in iclog
> completion callbacks and it's actually an abort scenario, which means
> shutdown.

Yup. The AIL simply needs to abort writeback of such locked, pinned
buffers and then everything works just fine.
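
To illustrate the "skip, don't wait" idea in isolation, here is a
minimal userspace sketch. It is not the XFS AIL code: struct demo_buf
and demo_push() are hypothetical names invented for this example, and
the lock and pin state are modelled as plain flags.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer tracked by a writeback list. */
struct demo_buf {
	bool locked;	/* someone else holds the buffer lock */
	bool pinned;	/* still pinned by an uncommitted log item */
	bool written;	/* set once writeback has been issued */
};

/*
 * Issue writeback for every buffer that can be written without
 * blocking. Buffers that are pinned or locked are skipped rather
 * than waited on, so the pusher can never sleep on them.
 * Returns the number of buffers skipped.
 */
static size_t demo_push(struct demo_buf *bufs, size_t n)
{
	size_t skipped = 0;

	for (size_t i = 0; i < n; i++) {
		if (bufs[i].pinned || bufs[i].locked) {
			skipped++;
			continue;
		}
		bufs[i].written = true;
	}
	return skipped;
}
```

The point of the sketch is only that the pusher gives up on a buffer
it cannot make progress on instead of blocking behind whoever holds it.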

> I am slightly concerned that introducing more regular blocking in
> the CIL push might lead to more frequent async log forces that
> block on callback iclogs and thus exacerbate that issue (i.e.
> somebody might be able to now reproduce yet another shutdown
> deadlock scenario to track down that might not have been
> reproducible before, for whatever reason), but that's probably not
> a serious enough problem to block this patch and the advantages of
> the series overall.

And that's why I updated the log force stats accounting to capture
the async log forces and how we account for log forces that block.
That gives me direct visibility into the blocking behaviour while I'm
running tests. And even with this new visibility, I can't see any
change in the metrics that is above the noise floor...
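
As a rough illustration of that kind of accounting, here is a
userspace sketch. It is not the stats code from the series: the
structure, field names and helpers are hypothetical, and it only
shows the idea of counting sync and async forces separately from
the forces that actually had to block.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical counters; not the real XFS stats structure. */
struct log_force_stats {
	uint64_t sync_forces;	/* caller asked to wait for completion */
	uint64_t async_forces;	/* caller did not ask to wait */
	uint64_t blocked;	/* forces that actually had to sleep */
};

/* Record one log force and whether it ended up blocking. */
static void account_log_force(struct log_force_stats *st, bool sync,
			      bool had_to_block)
{
	if (sync)
		st->sync_forces++;
	else
		st->async_forces++;
	if (had_to_block)
		st->blocked++;
}

/* Share of all forces that blocked, in parts per thousand. */
static uint64_t blocked_permille(const struct log_force_stats *st)
{
	uint64_t total = st->sync_forces + st->async_forces;

	return total ? (st->blocked * 1000) / total : 0;
}
```

Comparing the blocked share before and after a change is one simple
way to tell whether a shift in blocking behaviour rises above
run-to-run noise.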

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx