Re: [PATCH v2] xfs: fix COW writeback race

"Darrick J. Wong" <darrick.wong@xxxxxxxxxx> · Tue, 17 Jan 2017 10:39:00 -0800

On Tue, Jan 17, 2017 at 12:14:51PM -0500, Brian Foster wrote:
> On Tue, Jan 17, 2017 at 03:37:21PM +0100, Christoph Hellwig wrote:
> > On Tue, Jan 17, 2017 at 08:44:21AM -0500, Brian Foster wrote:
> > > Any reason we don't try to address the core race rather than shake up
> > > the affected code to accommodate it?
> > 
> > I think there are two aspects to the whole thing.  One is the way
> > xfs_bmapi_write currently works is fundamentally wrong - if the
> > caller only needs a conversion from delalloc to real space trying
> > to allocate space is always wrong and we should catch it early.
> > The second is if we should do the eager conversion of the whole
> > found extent for either the data and/or the COW fork.
> > 
> 
> Makes sense. I agree that the former is probably the right thing to do,
> it just seems more like an error check than a solution for a race. The
> second is probably a bigger question, as I assume we do that to request
> as large allocations as possible.

I'm under the impression that yes, we do #2 to maximize request sizes.
AFAICT this patch preserves that behavior and gets rid of the behavior
where cow delalloc conversion creates real extents where there
previously were holes.

> > > I ask for a couple reasons: 1.) I'm
> > > not quite following the specific race from the description and 2.) I
> > > considered doing the exact same thing at first for the eofblocks i_size
> > > issue, but more digging rooted out the problem in the eofblocks code.
> > > This one may not be as straightforward a fix, of course... (but if not,
> > > the commit log should probably explain why).
> > 
> > My hope was that the long commit message explained the issue, but
> > I guess I need to go into even more details.
> > 
> > The writeback code (both COW and real) works like this
> > 
> > take ilock
> > read in an extent at offset O
> > drop ilock
> > 
> > if extent is delalloc:
> >   while !done with the whole extent
> >     take ilock
> >     convert partial extent to a real allocation
> >     drop ilock
> > 
> > But if multiple threads are doing writeback on pages next to
> > each other another thread might have already converted parts
> > of all of the extent found in the beginning to a real allocation.
> > That on it's own is not a problem because xfs_bmapi_write
> > handles a call to allocate an already real allocation as a no-op.
> > But with the COW code moving real extents from the COW fork to
> > the data fork after I/O completion this can become a problem,
> > because we now might not only have delalloc or real extents in
> > the area covered by extent returned from the inital xfs_bmapi_read
> > call, but also a hole in the COW work.  At which point it blows up.
> > 
> 
> Got it, thanks. So all of the writeback stuff is protected via
> page/buffer locks, and even if we still had those locks, it doesn't
> matter because the same extent is obviously covered by many page/buffer
> objects.
> 
> > As for why we're doing the eager conversion:  at least for the data
> > fork this was initentional to get better allocation patterns, I
> > remember a discussion with Dave on that a long time ago.  Maybe we
> > shouldn't do it for the COW for to avoid these sorts of races, but
> > then again without xfs_bmapi_write being stupid the race is harmless.
> > 
> 
> Yeah, and doing otherwise may break the assumption that larger delallocs
> produce larger physical allocs (re: cowextsz hint and potentially
> preallocation).

Yep.

> > > What happens in this case if eof is true? It looks like got could be
> > > bogus, yet we still carry on using it in the post-allocation part of the
> > > loop.
> > 
> > For an initial EOF lookup it could indeed be bogus.  To properly
> > work it would need something like the trick xfs_bmapi_read uses
> > for this case.
> > 
> 
> That seems reasonable so long as we skip the parts of the loop that are
> expecting a real (non-hole) startblock.
> 
> > > The fact that the allocation code breaks out of the loop if
> > > allocation doesn't occur is a bit of a red flag that the post-allocation
> > > code may very well expect to always have an allocated mapping.
> > 
> > The post-allocation cleanup code bust handle xfs_bmapi_allocate
> > returning an error before doing anything, and because of that it's
> > full of conditionals for everything that could or could not have
> > happened.
> > 
> 
> We have the following in the need_alloc block:
> 
>                         error = xfs_bmapi_allocate(&bma);
>                         if (error)
>                                 goto error0;
>                         if (bma.blkno == NULLFSBLOCK)
>                                 break;
> 
> ... which breaks out of the loop on error or allocation failure. The
> first call after that block is xfs_bmapi_trim_map(), which uses got
> without any consideration for holes that I can see.

Wait, what?  The break gets us out of the while loop, not the
"if (inhole || wasdelay)" clause that precedes the _trim_map.

The while loop ends at "*nmap = n", correct?  So the NULLFSBLOCK case
shouldn't be calling _trim_map with uninitialized got.

> > > That aside... if we do want to do something like this, I wonder whether
> > > it's more cleanly handled by the caller.
> > 
> > I don't see how it could be done in the caller - the caller wants
> > the bmap code to convert a delayed allocation and not allocate
> > entirely new blocks.  The best way to do so is to tell the bmapi
> > code not do allocate new blocks.  Now if you mean splitting up
> > xfs_bmapi_write into different functions for allocating real blocks,
> > converting delalloc blocks or just flipping the unwritten bit: that's
> > something I'd like to look into - the current interface is just
> > too confusing.
> 
> Things like the above had me thinking it might be more clear to
> explicitly read the extent and check for delalloc in the caller while
> under the appropriate lock (and if XFS_COW_FORK). That's kind of what I
> was alluding to above wrt to closing the race. That's just an idea,
> however, and doesn't necessarily improve the error handling in the way
> that this patch does (to avoid the transaction overrun). Given that, I'm
> not against what this patch is currently doing so long as we fix up the
> rest of the loop. Your idea of xfs_bmapi_convert() or some such sounds
> like a nice potential cleanup at some point too.

I'd wondered when I was writing all the cow code if it would make the
code easier to understand if the _bmapi_write was split into different
frontend wrappers of the underlying implementation.  It's a little weird
that for a remapping you have to stuff the new physical block in a
_fsblock_t and pass that in as a pointer argument.

--D

> 
> Brian
> 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html