Re: [PATCH v2] xfs: fix COW writeback race

Brian Foster <bfoster@xxxxxxxxxx> · Tue, 17 Jan 2017 13:58:34 -0500

On Tue, Jan 17, 2017 at 10:39:00AM -0800, Darrick J. Wong wrote:
> On Tue, Jan 17, 2017 at 12:14:51PM -0500, Brian Foster wrote:
> > On Tue, Jan 17, 2017 at 03:37:21PM +0100, Christoph Hellwig wrote:
> > > On Tue, Jan 17, 2017 at 08:44:21AM -0500, Brian Foster wrote:
> > > > Any reason we don't try to address the core race rather than shake up
> > > > the affected code to accommodate it?
> > > 
> > > I think there are two aspects to the whole thing.  One is the way
> > > xfs_bmapi_write currently works is fundamentally wrong - if the
> > > caller only needs a conversion from delalloc to real space trying
> > > to allocate space is always wrong and we should catch it early.
> > > The second is if we should do the eager conversion of the whole
> > > found extent for either the data and/or the COW fork.
> > > 
> > 
> > Makes sense. I agree that the former is probably the right thing to do,
> > it just seems more like an error check than a solution for a race. The
> > second is probably a bigger question, as I assume we do that to request
> > as large allocations as possible.
> 
> I'm under the impression that yes, we do #2 to maximize request sizes.
> AFAICT this patch preserves that behavior and gets rid of the behavior
> where cow delalloc conversion creates real extents where there
> previously were holes.
> 

Yeah, Christoph is just pointing out how that behavior contributes to
the race. I'm not suggesting that this patch changes that. Rather, I'm
agreeing that we probably don't want to go the route of changing that to
address this issue.

> > > > I ask for a couple reasons: 1.) I'm
> > > > not quite following the specific race from the description and 2.) I
> > > > considered doing the exact same thing at first for the eofblocks i_size
> > > > issue, but more digging rooted out the problem in the eofblocks code.
> > > > This one may not be as straightforward a fix, of course... (but if not,
> > > > the commit log should probably explain why).
> > > 
> > > My hope was that the long commit message explained the issue, but
> > > I guess I need to go into even more details.
> > > 
> > > The writeback code (both COW and real) works like this
> > > 
> > > take ilock
> > > read in an extent at offset O
> > > drop ilock
> > > 
> > > if extent is delalloc:
> > >   while !done with the whole extent
> > >     take ilock
> > >     convert partial extent to a real allocation
> > >     drop ilock
> > > 
> > > But if multiple threads are doing writeback on pages next to
> > > each other another thread might have already converted parts
> > > of all of the extent found in the beginning to a real allocation.
> > > That on it's own is not a problem because xfs_bmapi_write
> > > handles a call to allocate an already real allocation as a no-op.
> > > But with the COW code moving real extents from the COW fork to
> > > the data fork after I/O completion this can become a problem,
> > > because we now might not only have delalloc or real extents in
> > > the area covered by extent returned from the inital xfs_bmapi_read
> > > call, but also a hole in the COW work.  At which point it blows up.
> > > 
> > 
> > Got it, thanks. So all of the writeback stuff is protected via
> > page/buffer locks, and even if we still had those locks, it doesn't
> > matter because the same extent is obviously covered by many page/buffer
> > objects.
> > 
> > > As for why we're doing the eager conversion:  at least for the data
> > > fork this was initentional to get better allocation patterns, I
> > > remember a discussion with Dave on that a long time ago.  Maybe we
> > > shouldn't do it for the COW for to avoid these sorts of races, but
> > > then again without xfs_bmapi_write being stupid the race is harmless.
> > > 
> > 
> > Yeah, and doing otherwise may break the assumption that larger delallocs
> > produce larger physical allocs (re: cowextsz hint and potentially
> > preallocation).
> 
> Yep.
> 
> > > > What happens in this case if eof is true? It looks like got could be
> > > > bogus, yet we still carry on using it in the post-allocation part of the
> > > > loop.
> > > 
> > > For an initial EOF lookup it could indeed be bogus.  To properly
> > > work it would need something like the trick xfs_bmapi_read uses
> > > for this case.
> > > 
> > 
> > That seems reasonable so long as we skip the parts of the loop that are
> > expecting a real (non-hole) startblock.
> > 
> > > > The fact that the allocation code breaks out of the loop if
> > > > allocation doesn't occur is a bit of a red flag that the post-allocation
> > > > code may very well expect to always have an allocated mapping.
> > > 
> > > The post-allocation cleanup code bust handle xfs_bmapi_allocate
> > > returning an error before doing anything, and because of that it's
> > > full of conditionals for everything that could or could not have
> > > happened.
> > > 
> > 
> > We have the following in the need_alloc block:
> > 
> >                         error = xfs_bmapi_allocate(&bma);
> >                         if (error)
> >                                 goto error0;
> >                         if (bma.blkno == NULLFSBLOCK)
> >                                 break;
> > 
> > ... which breaks out of the loop on error or allocation failure. The
> > first call after that block is xfs_bmapi_trim_map(), which uses got
> > without any consideration for holes that I can see.
> 
> Wait, what?  The break gets us out of the while loop, not the
> "if (inhole || wasdelay)" clause that precedes the _trim_map.
> 
> The while loop ends at "*nmap = n", correct?  So the NULLFSBLOCK case
> shouldn't be calling _trim_map with uninitialized got.
> 

Precisely. As it stands, this shouldn't happen. With this patch,
however, it is now possible to get down to _trim_map() with a completely
uninitialized got. E.g., consider the first iteration of the loop if eof
is true (which means got is invalid) and XFS_BMAPI_DELALLOC is set.

(I suspect the whole 'if (xyz) { do_error_checks() }' pattern somewhat
obfuscates this..)

Brian

> > > > That aside... if we do want to do something like this, I wonder whether
> > > > it's more cleanly handled by the caller.
> > > 
> > > I don't see how it could be done in the caller - the caller wants
> > > the bmap code to convert a delayed allocation and not allocate
> > > entirely new blocks.  The best way to do so is to tell the bmapi
> > > code not do allocate new blocks.  Now if you mean splitting up
> > > xfs_bmapi_write into different functions for allocating real blocks,
> > > converting delalloc blocks or just flipping the unwritten bit: that's
> > > something I'd like to look into - the current interface is just
> > > too confusing.
> > 
> > Things like the above had me thinking it might be more clear to
> > explicitly read the extent and check for delalloc in the caller while
> > under the appropriate lock (and if XFS_COW_FORK). That's kind of what I
> > was alluding to above wrt to closing the race. That's just an idea,
> > however, and doesn't necessarily improve the error handling in the way
> > that this patch does (to avoid the transaction overrun). Given that, I'm
> > not against what this patch is currently doing so long as we fix up the
> > rest of the loop. Your idea of xfs_bmapi_convert() or some such sounds
> > like a nice potential cleanup at some point too.
> 
> I'd wondered when I was writing all the cow code if it would make the
> code easier to understand if the _bmapi_write was split into different
> frontend wrappers of the underlying implementation.  It's a little weird
> that for a remapping you have to stuff the new physical block in a
> _fsblock_t and pass that in as a pointer argument.
> 
> --D
> 
> > 
> > Brian
> > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html