On Wed, Oct 29, 2014 at 11:22:53AM +1100, Dave Chinner wrote: > On Mon, Oct 20, 2014 at 01:06:21PM -0400, Brian Foster wrote: > > The zero range operation is analogous to fallocate with the exception of > > converting the range to zeroes. E.g., it attempts to allocate zeroed > > blocks over the range specified by the caller. The XFS implementation > > kills all delalloc blocks currently over the aligned range, converts the > > range to allocated zero blocks (unwritten extents) and handles the > > partial pages at the ends of the range by sending writes through the > > pagecache. > > > > The current implementation suffers from several problems associated with > > inode size. If the aligned range covers an extending I/O, said I/O is > > discarded and an inode size update from a previous write never makes it > > to disk. Further, if an unaligned zero range extends beyond eof, the > > page write induced for the partial end page can itself increase the > > inode size, even if the zero range request is not supposed to update > > i_size (via KEEP_SIZE, similar to an fallocate beyond EOF). > > > > The latter behavior not only incorrectly increases the inode size, but > > can lead to stray delalloc blocks on the inode. Typically, post-eof > > preallocation blocks are either truncated on release or inode eviction > > or explicitly written to by xfs_zero_eof() on natural file size > > extension. If the inode size increases due to zero range, however, > > associated blocks leak into the address space having never been > > converted or mapped to pagecache pages. A direct I/O to such an > > uncovered range cannot convert the extent via writeback and will BUG(). > > For example: > > > > $ xfs_io -fc "pwrite 0 128k" -c "fzero -k 1m 54321" <file> > > ... > > $ xfs_io -d -c "pread 128k 128k" <file> > > <BUG> > > > > If the entire delalloc extent happens to not have page coverage > > whatsoever (e.g., delalloc conversion couldn't find a large enough free > > space extent), even a full file writeback won't convert what's left of > > the extent and we'll assert on inode eviction. > > > > Rework xfs_zero_file_space() to avoid buffered I/O for partial pages. > > Use the existing hole punch and prealloc mechanisms as primitives for > > zero range. This implementation is not efficient nor ideal as we > > writeback dirty data over the range and remove existing extents rather > > than convert to unwrittern. The former writeback, however, is currently > > the only mechanism available to ensure consistency between pagecache and > > extent state. Even a pagecache truncate/delalloc punch prior to hole > > punch has lead to inconsistencies due to racing with writeback. > > > > This provides a consistent, correct implementation of zero range that > > survives fsstress/fsx testing without assert failures. The > > implementation can be optimized from this point forward once the > > fundamental issue of pagecache and delalloc extent state consistency is > > addressed. > > > > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx> > > --- > > > > v5: > > - Further simplify to eliminate delalloc block punch. > > This now causes xfs/053 to fail, probably because it changes the > behaviour to actually write the file and hence change EOF. Can you > please check that this is working correctly, and if so submit > patches to change xfs/053 to expect the file size to change > and contain the correct data? Hmm, it appears that I tested the wrong "old kernel" when checking for regression. 3.18-rc2 also fails, so we need fixes for the xfs/053 test, not this patch.... -Dave. -- Dave Chinner david@xxxxxxxxxxxxx _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs