On Mon, Oct 20, 2014 at 01:06:21PM -0400, Brian Foster wrote: > The zero range operation is analogous to fallocate with the exception of > converting the range to zeroes. E.g., it attempts to allocate zeroed > blocks over the range specified by the caller. The XFS implementation > kills all delalloc blocks currently over the aligned range, converts the > range to allocated zero blocks (unwritten extents) and handles the > partial pages at the ends of the range by sending writes through the > pagecache. > > The current implementation suffers from several problems associated with > inode size. If the aligned range covers an extending I/O, said I/O is > discarded and an inode size update from a previous write never makes it > to disk. Further, if an unaligned zero range extends beyond eof, the > page write induced for the partial end page can itself increase the > inode size, even if the zero range request is not supposed to update > i_size (via KEEP_SIZE, similar to an fallocate beyond EOF). > > The latter behavior not only incorrectly increases the inode size, but > can lead to stray delalloc blocks on the inode. Typically, post-eof > preallocation blocks are either truncated on release or inode eviction > or explicitly written to by xfs_zero_eof() on natural file size > extension. If the inode size increases due to zero range, however, > associated blocks leak into the address space having never been > converted or mapped to pagecache pages. A direct I/O to such an > uncovered range cannot convert the extent via writeback and will BUG(). > For example: > > $ xfs_io -fc "pwrite 0 128k" -c "fzero -k 1m 54321" <file> > ... > $ xfs_io -d -c "pread 128k 128k" <file> > <BUG> > > If the entire delalloc extent happens to not have page coverage > whatsoever (e.g., delalloc conversion couldn't find a large enough free > space extent), even a full file writeback won't convert what's left of > the extent and we'll assert on inode eviction. > > Rework xfs_zero_file_space() to avoid buffered I/O for partial pages. > Use the existing hole punch and prealloc mechanisms as primitives for > zero range. This implementation is not efficient nor ideal as we > writeback dirty data over the range and remove existing extents rather > than convert to unwrittern. The former writeback, however, is currently > the only mechanism available to ensure consistency between pagecache and > extent state. Even a pagecache truncate/delalloc punch prior to hole > punch has lead to inconsistencies due to racing with writeback. > > This provides a consistent, correct implementation of zero range that > survives fsstress/fsx testing without assert failures. The > implementation can be optimized from this point forward once the > fundamental issue of pagecache and delalloc extent state consistency is > addressed. > > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx> > --- > > v5: > - Further simplify to eliminate delalloc block punch. This now causes xfs/053 to fail, probably because it changes the behaviour to actually write the file and hence change EOF. Can you please check that this is working correctly, and if so submit patches to change xfs/053 to expect the file size to change and contain the correct data? Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs