Re: [PATCH] xfs: stabilize insert range start boundary to avoid COW writeback race

Brian Foster <bfoster@xxxxxxxxxx> · Fri, 13 Dec 2019 06:37:59 -0500

On Fri, Dec 13, 2019 at 07:48:51AM +1100, Dave Chinner wrote:
> On Thu, Dec 12, 2019 at 09:16:34AM -0500, Brian Foster wrote:
> > On Thu, Dec 12, 2019 at 07:52:30AM +1100, Dave Chinner wrote:
> > > On Wed, Dec 11, 2019 at 07:47:12AM -0500, Brian Foster wrote:
> > > > On Wed, Dec 11, 2019 at 08:41:00AM +1100, Dave Chinner wrote:
> > > > > On Tue, Dec 10, 2019 at 08:23:40AM -0500, Brian Foster wrote:
> > > > > I think insert/collapse need to be converted to work like a
> > > > > truncate operation instead of a series on individual write
> > > > > operations. That is, they are a permanent transaction that locks the
> > > > > inode once and is rolled repeatedly until the entire extent listi
> > > > > modification is done and then the inode is unlocked.
> > > > > 
> > > > 
> > > > Note that I don't think it's sufficient to hold the inode locked only
> > > > across the shift. For the insert case, I think we'd need to grab it
> > > > before the extent split at the target offset and roll from there.
> > > > Otherwise the same problem could be reintroduced if we eventually
> > > > replaced the xfs_prepare_shift() tweak made by this patch. Of course,
> > > > that doesn't look like a big problem. The locking is already elevated
> > > > and split and shift even use the same transaction type, so it's mostly a
> > > > refactor from a complexity standpoint. 
> > > 
> > > *nod*
> > > 
> > > > For the collapse case, we do have a per-shift quota reservation for some
> > > > reason. If that is required, we'd have to somehow replace it with a
> > > > worst case calculation. That said, it's not clear to me why that
> > > > reservation even exists.
> > > 
> > > I'm not 100% sure, either, but....
> > > 
> > > > The pre-shift hole punch is already a separate
> > > > transaction with its own such reservation. The shift can merge extents
> > > > after that point (though most likely only on the first shift), but that
> > > > would only ever remove extent records. Any thoughts or objections if I
> > > > just killed that off?
> > > 
> > > Yeah, I suspect that it is the xfs_bmse_merge() case freeing blocks
> > > the reservation is for, and I agree that it should only happen on
> > > the first shift because all the others that are moved are identical
> > > in size and shape and would have otherwise been merged at creation.
> > > 
> > > Hence I think we can probably kill the xfs_bmse_merge() case,
> > > though it might be wrth checking first how often it gets called...
> > > 
> > 
> > Ok, but do we need an up-front quota reservation for freeing blocks out
> > of the bmapbt? ISTM the reservation could be removed regardless of the
> > merging behavior. This is what my current patch does, at least, so we'll
> > see if anything explodes. :P
> 
> xfs_itruncate_extents() doesn't require an up front block
> reservation for quotas or transaction allocation, so I don't really
> see how the collapse would require it, even in the merge case...
> 

Ok, so I'm not completely crazy. :) I've seen no issues so far with the
reservation removed, anyways.

> > I agree on the xfs_bmse_merge() bit. I was planning to leave that as is
> > however because IIRC, even though it is quite rare, I thought we had a
> > few corner cases where it was possible for physically and logically
> > contiguous extents to track separately in a mapping tree. Max sized
> > extents that are subsequently punched out or truncated might be one
> > example. I thought we had others, but I can't quite put my finger on it
> > atm..
> 
> True, but is it common enough that we really need to care about it?
> If it's bad, just run xfs_fsr on the file/filesystem....
> 

Yeah probably not. Extent shift shouldn't really be a path out of that
problem if it exists. As you suggest, fsr or online repair should be
expected to fix up things like that. My thinking was more that it wasn't
necessary to change this code to achieve atomicity. IOW, I don't object
to changing it, but I'd probably do it in a separate patch.

In looking at this code again, it also looks like filtering out the
merge check for everything but the first left shift would result in more
code than what we have now. To me, the current code seems
straightforward and harmless in that it's implemented as sort of an
optimization of how to perform the shift. I.e., do we shift the existing
extent or can we fold it into an existing one? So I think I've kind of
lost track of what the suggestion was wrt to the merge code. What did
you mean exactly by "kill off the merge case" and what was the goal?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
>