On Tue, Jan 17, 2017 at 08:44:21AM -0500, Brian Foster wrote: > Any reason we don't try to address the core race rather than shake up > the affected code to accommodate it? I think there are two aspects to the whole thing. One is the way xfs_bmapi_write currently works is fundamentally wrong - if the caller only needs a conversion from delalloc to real space trying to allocate space is always wrong and we should catch it early. The second is if we should do the eager conversion of the whole found extent for either the data and/or the COW fork. > I ask for a couple reasons: 1.) I'm > not quite following the specific race from the description and 2.) I > considered doing the exact same thing at first for the eofblocks i_size > issue, but more digging rooted out the problem in the eofblocks code. > This one may not be as straightforward a fix, of course... (but if not, > the commit log should probably explain why). My hope was that the long commit message explained the issue, but I guess I need to go into even more details. The writeback code (both COW and real) works like this take ilock read in an extent at offset O drop ilock if extent is delalloc: while !done with the whole extent take ilock convert partial extent to a real allocation drop ilock But if multiple threads are doing writeback on pages next to each other another thread might have already converted parts of all of the extent found in the beginning to a real allocation. That on it's own is not a problem because xfs_bmapi_write handles a call to allocate an already real allocation as a no-op. But with the COW code moving real extents from the COW fork to the data fork after I/O completion this can become a problem, because we now might not only have delalloc or real extents in the area covered by extent returned from the inital xfs_bmapi_read call, but also a hole in the COW work. At which point it blows up. As for why we're doing the eager conversion: at least for the data fork this was initentional to get better allocation patterns, I remember a discussion with Dave on that a long time ago. Maybe we shouldn't do it for the COW for to avoid these sorts of races, but then again without xfs_bmapi_write being stupid the race is harmless. > What happens in this case if eof is true? It looks like got could be > bogus, yet we still carry on using it in the post-allocation part of the > loop. For an initial EOF lookup it could indeed be bogus. To properly work it would need something like the trick xfs_bmapi_read uses for this case. > The fact that the allocation code breaks out of the loop if > allocation doesn't occur is a bit of a red flag that the post-allocation > code may very well expect to always have an allocated mapping. The post-allocation cleanup code bust handle xfs_bmapi_allocate returning an error before doing anything, and because of that it's full of conditionals for everything that could or could not have happened. > That aside... if we do want to do something like this, I wonder whether > it's more cleanly handled by the caller. I don't see how it could be done in the caller - the caller wants the bmap code to convert a delayed allocation and not allocate entirely new blocks. The best way to do so is to tell the bmapi code not do allocate new blocks. Now if you mean splitting up xfs_bmapi_write into different functions for allocating real blocks, converting delalloc blocks or just flipping the unwritten bit: that's something I'd like to look into - the current interface is just too confusing. -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html