On Tue, Jan 17, 2017 at 01:58:34PM -0500, Brian Foster wrote: > On Tue, Jan 17, 2017 at 10:39:00AM -0800, Darrick J. Wong wrote: > > On Tue, Jan 17, 2017 at 12:14:51PM -0500, Brian Foster wrote: > > > On Tue, Jan 17, 2017 at 03:37:21PM +0100, Christoph Hellwig wrote: > > > > On Tue, Jan 17, 2017 at 08:44:21AM -0500, Brian Foster wrote: > > > > > Any reason we don't try to address the core race rather than shake up > > > > > the affected code to accommodate it? > > > > > > > > I think there are two aspects to the whole thing. One is the way > > > > xfs_bmapi_write currently works is fundamentally wrong - if the > > > > caller only needs a conversion from delalloc to real space trying > > > > to allocate space is always wrong and we should catch it early. > > > > The second is if we should do the eager conversion of the whole > > > > found extent for either the data and/or the COW fork. > > > > > > > > > > Makes sense. I agree that the former is probably the right thing to do, > > > it just seems more like an error check than a solution for a race. The > > > second is probably a bigger question, as I assume we do that to request > > > as large allocations as possible. > > > > I'm under the impression that yes, we do #2 to maximize request sizes. > > AFAICT this patch preserves that behavior and gets rid of the behavior > > where cow delalloc conversion creates real extents where there > > previously were holes. > > > > Yeah, Christoph is just pointing out how that behavior contributes to > the race. I'm not suggesting that this patch changes that. Rather, I'm > agreeing that we probably don't want to go the route of changing that to > address this issue. Oh. So am I. I think we're /all/ in agreement then. :) > > > > > I ask for a couple reasons: 1.) I'm > > > > > not quite following the specific race from the description and 2.) I > > > > > considered doing the exact same thing at first for the eofblocks i_size > > > > > issue, but more digging rooted out the problem in the eofblocks code. > > > > > This one may not be as straightforward a fix, of course... (but if not, > > > > > the commit log should probably explain why). > > > > > > > > My hope was that the long commit message explained the issue, but > > > > I guess I need to go into even more details. > > > > > > > > The writeback code (both COW and real) works like this > > > > > > > > take ilock > > > > read in an extent at offset O > > > > drop ilock > > > > > > > > if extent is delalloc: > > > > while !done with the whole extent > > > > take ilock > > > > convert partial extent to a real allocation > > > > drop ilock > > > > > > > > But if multiple threads are doing writeback on pages next to > > > > each other another thread might have already converted parts > > > > of all of the extent found in the beginning to a real allocation. > > > > That on it's own is not a problem because xfs_bmapi_write > > > > handles a call to allocate an already real allocation as a no-op. > > > > But with the COW code moving real extents from the COW fork to > > > > the data fork after I/O completion this can become a problem, > > > > because we now might not only have delalloc or real extents in > > > > the area covered by extent returned from the inital xfs_bmapi_read > > > > call, but also a hole in the COW work. At which point it blows up. > > > > > > > > > > Got it, thanks. So all of the writeback stuff is protected via > > > page/buffer locks, and even if we still had those locks, it doesn't > > > matter because the same extent is obviously covered by many page/buffer > > > objects. > > > > > > > As for why we're doing the eager conversion: at least for the data > > > > fork this was initentional to get better allocation patterns, I > > > > remember a discussion with Dave on that a long time ago. Maybe we > > > > shouldn't do it for the COW for to avoid these sorts of races, but > > > > then again without xfs_bmapi_write being stupid the race is harmless. > > > > > > > > > > Yeah, and doing otherwise may break the assumption that larger delallocs > > > produce larger physical allocs (re: cowextsz hint and potentially > > > preallocation). > > > > Yep. > > > > > > > What happens in this case if eof is true? It looks like got could be > > > > > bogus, yet we still carry on using it in the post-allocation part of the > > > > > loop. > > > > > > > > For an initial EOF lookup it could indeed be bogus. To properly > > > > work it would need something like the trick xfs_bmapi_read uses > > > > for this case. > > > > > > > > > > That seems reasonable so long as we skip the parts of the loop that are > > > expecting a real (non-hole) startblock. > > > > > > > > The fact that the allocation code breaks out of the loop if > > > > > allocation doesn't occur is a bit of a red flag that the post-allocation > > > > > code may very well expect to always have an allocated mapping. > > > > > > > > The post-allocation cleanup code bust handle xfs_bmapi_allocate > > > > returning an error before doing anything, and because of that it's > > > > full of conditionals for everything that could or could not have > > > > happened. > > > > > > > > > > We have the following in the need_alloc block: > > > > > > error = xfs_bmapi_allocate(&bma); > > > if (error) > > > goto error0; > > > if (bma.blkno == NULLFSBLOCK) > > > break; > > > > > > ... which breaks out of the loop on error or allocation failure. The > > > first call after that block is xfs_bmapi_trim_map(), which uses got > > > without any consideration for holes that I can see. > > > > Wait, what? The break gets us out of the while loop, not the > > "if (inhole || wasdelay)" clause that precedes the _trim_map. > > > > The while loop ends at "*nmap = n", correct? So the NULLFSBLOCK case > > shouldn't be calling _trim_map with uninitialized got. > > > > Precisely. As it stands, this shouldn't happen. With this patch, > however, it is now possible to get down to _trim_map() with a completely > uninitialized got. E.g., consider the first iteration of the loop if eof > is true (which means got is invalid) and XFS_BMAPI_DELALLOC is set. > > (I suspect the whole 'if (xyz) { do_error_checks() }' pattern somewhat > obfuscates this..) Aha, yes, I see it now. Thanks for having a sharper pair of eyes! :) --D > > Brian > > > > > > That aside... if we do want to do something like this, I wonder whether > > > > > it's more cleanly handled by the caller. > > > > > > > > I don't see how it could be done in the caller - the caller wants > > > > the bmap code to convert a delayed allocation and not allocate > > > > entirely new blocks. The best way to do so is to tell the bmapi > > > > code not do allocate new blocks. Now if you mean splitting up > > > > xfs_bmapi_write into different functions for allocating real blocks, > > > > converting delalloc blocks or just flipping the unwritten bit: that's > > > > something I'd like to look into - the current interface is just > > > > too confusing. > > > > > > Things like the above had me thinking it might be more clear to > > > explicitly read the extent and check for delalloc in the caller while > > > under the appropriate lock (and if XFS_COW_FORK). That's kind of what I > > > was alluding to above wrt to closing the race. That's just an idea, > > > however, and doesn't necessarily improve the error handling in the way > > > that this patch does (to avoid the transaction overrun). Given that, I'm > > > not against what this patch is currently doing so long as we fix up the > > > rest of the loop. Your idea of xfs_bmapi_convert() or some such sounds > > > like a nice potential cleanup at some point too. > > > > I'd wondered when I was writing all the cow code if it would make the > > code easier to understand if the _bmapi_write was split into different > > frontend wrappers of the underlying implementation. It's a little weird > > that for a remapping you have to stuff the new physical block in a > > _fsblock_t and pass that in as a pointer argument. > > > > --D > > > > > > > > Brian > > > > > > > -- > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html