Re: [PATCH] xfs: stabilize insert range start boundary to avoid COW writeback race

Brian Foster <bfoster@xxxxxxxxxx> · Wed, 11 Dec 2019 07:47:12 -0500

On Wed, Dec 11, 2019 at 08:41:00AM +1100, Dave Chinner wrote:
> On Tue, Dec 10, 2019 at 08:23:40AM -0500, Brian Foster wrote:
> > generic/522 (fsx) occasionally fails with a file corruption due to
> > an insert range operation. The primary characteristic of the
> > corruption is a misplaced insert range operation that differs from
> > the requested target offset. The reason for this behavior is a race
> > between the extent shift sequence of an insert range and a COW
> > writeback completion that causes a front merge with the first extent
> > in the shift.
> 
> How is the COW writeback completion modifying the extent list while
> an extent shift is modifying the extent list?  Both should be
> running under XFS_ILOCK_EXCL contexts so there shouldn't be a race
> condition here unless we've screwed up the extent list modification
> atomicity...
> 
> > 
> > The shift preparation function flushes and unmaps from the target
> > offset of the operation to the end of the file to ensure no
> > modifications can be made and page cache is invalidated before file
> > data is shifted. An insert range operation then splits the extent at
> > the target offset, if necessary, and begins to shift the start
> > offset of each extent starting from the end of the file to the start
> > offset. The shift sequence operates at extent level and so depends
> > on the preparation sequence to guarantee no changes can be made to
> > the target range during the shift.
> 
> Oh... shifting extents is not an atomic operation w.r.t. other
> inode modifications - both insert and collapse run individual
> modification transactions and lock/unlock the inode around each
> transaction. So, essentially, they aren't atomic when faced with
> other *metadata* modifications to the inode.
> 

Right..

> > If the block immediately prior to
> > the target offset was dirty and shared, however, it can undergo
> > writeback and move from the COW fork to the data fork at any point
> > during the shift. If the block is contiguous with the block at the
> > start offset of the insert range, it can front merge and alter the
> > start offset of the extent. Once the shift sequence reaches the
> > target offset, it shifts based on the latest start offset and
> > silently changes the target offset of the operation and corrupts the
> > file.
> 
> Yup, that's exactly the landmine that non-atomic, multi-transaction
> extent range operations have. It might be a COW operation, it might
> be something else that ends up manipulating the extent list. But
> because the ILOCK is not held across the entire extent shift,
> insert/collapse are susceptible to corruption when any other XFs
> code concurrently modifies the extent list.
> 
> I think insert/collapse need to be converted to work like a
> truncate operation instead of a series on individual write
> operations. That is, they are a permanent transaction that locks the
> inode once and is rolled repeatedly until the entire extent listi
> modification is done and then the inode is unlocked.
> 

Note that I don't think it's sufficient to hold the inode locked only
across the shift. For the insert case, I think we'd need to grab it
before the extent split at the target offset and roll from there.
Otherwise the same problem could be reintroduced if we eventually
replaced the xfs_prepare_shift() tweak made by this patch. Of course,
that doesn't look like a big problem. The locking is already elevated
and split and shift even use the same transaction type, so it's mostly a
refactor from a complexity standpoint. 

For the collapse case, we do have a per-shift quota reservation for some
reason. If that is required, we'd have to somehow replace it with a
worst case calculation. That said, it's not clear to me why that
reservation even exists. The pre-shift hole punch is already a separate
transaction with its own such reservation. The shift can merge extents
after that point (though most likely only on the first shift), but that
would only ever remove extent records. Any thoughts or objections if I
just killed that off?

> > To address this problem, update the shift preparation code to
> > stabilize the start boundary along with the full range of the
> > insert. Also update the existing corruption check to fail if any
> > extent is shifted with a start offset behind the target offset of
> > the insert range. This prevents insert from racing with COW
> > writeback completion and fails loudly in the event of an unexpected
> > extent shift.
> 
> It looks ok to avoid this particular symptom (backportable point
> fix), but I really think we should convert insert/collapse to be
> atomic w.r.t other extent list modifications....
> 

Ok, I think that approach is reasonable so long as we do it in two
phases as such to minimize backport churn and separate bug fix from
behavior change.

Unless there is other feedback on this patch, is there any objection to
getting this one reviewed/merged independently? I can start looking into
the shift rework today, but that is ultimately going to require more
involved testing than I'd prefer to block the bug fix on (whereas this
patch has now seen multiple days of fsx testing..).

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
>