On Sat, Mar 29, 2014 at 11:14:19AM -0400, Brian Foster wrote:
> On Fri, Mar 21, 2014 at 09:11:46PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> >
> > If we fail a write beyond EOF and have to handle it in
> > xfs_vm_write_begin(), we truncate the inode back to the current inode
> > size. This doesn't take into account the fact that we may have
> > already made successful writes to the same page (in the case of block
> > size < page size) and hence we can truncate the page cache away from
> > blocks with valid data in them. If these blocks are delayed
> > allocation blocks, we now have a mismatch between the page cache and
> > the extent tree, and this will trigger - at minimum - a delayed
> > block count mismatch assert when the inode is evicted from the cache.
> > We can also trip over it when block mapping for direct IO - this is
> > the most common symptom seen from fsx and fsstress when run from
> > xfstests.
> >
> > Fix it by only truncating away the exact range we are updating state
> > for in this write_begin call.
> >
> > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > ---
> >  fs/xfs/xfs_aops.c | 10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index e810243..6b4ecc8 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -1609,12 +1609,18 @@ xfs_vm_write_begin(
> >  	status = __block_write_begin(page, pos, len, xfs_get_blocks);
> >  	if (unlikely(status)) {
> >  		struct inode	*inode = mapping->host;
> > +		size_t		isize = i_size_read(inode);
> >
> >  		xfs_vm_write_failed(inode, page, pos, len);
> >  		unlock_page(page);
> >
> > -		if (pos + len > i_size_read(inode))
> > -			truncate_pagecache(inode, i_size_read(inode));
>
> From what I can see, we have a write_begin/copy/write_end sequence per
> page and the inode size is updated down in the write_end path. If the
> write fails at write_begin time, then we haven't copied any data and
> the inode size hasn't changed.
>
> The intent of the original code appears to be to break off any blocks
> that might have been set up beyond EOF before the write happened to
> fail. Could you elaborate on how this can kill blocks that might have
> been written successfully? Doesn't this only affect post-EOF metadata?
>
> > +		/*
> > +		 * If the write is beyond EOF, we only want to kill blocks
> > +		 * allocated in this write, not blocks that were previously
> > +		 * written successfully.
> > +		 */
> > +		if (pos + len > isize)
> > +			truncate_pagecache_range(inode, pos, pos + len);
>
> So the cache truncate now covers the range of the write. What happens
> if part of the write is an overwrite? This seems similar to the problem
> you've described above... should the truncate start at isize instead?
>

I ran an experiment on this to confirm my suspicion here. I added a
quick little hack to fail any write_begin (set status=-1) for which
pos != 0. With that in place:

xfs_io -fc "pwrite -b 2k 0 2k" /mnt/file

# hexdump /mnt/file
0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
*
0000800

xfs_io -c "pwrite -b 2k 1k 2k" /mnt/file	(fails)

# hexdump /mnt/file
0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
*
0000400 0000 0000 0000 0000 0000 0000 0000 0000
*
0000800

The failed extending write trashes the data over the previously valid
region it overlaps: the 1k-2k range written successfully by the first
command is now zeroed.
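
If we want the failure path to preserve that data, I'd expect the
truncate to start at the current EOF rather than at pos. Something like
the following completely untested sketch (same variable names as the
hunk above) is roughly what I had in mind:

	/*
	 * Untested sketch only: clamp the start of the truncate to the
	 * current inode size so a failed extending write only tosses
	 * the blocks it instantiated beyond EOF, not data from earlier
	 * writes that it happens to overlap.
	 */
	if (pos + len > isize) {
		ssize_t		start = max_t(ssize_t, pos, isize);

		truncate_pagecache_range(inode, start, pos + len);
	}

That keeps the page cache intact over [pos, isize) while still killing
anything this write_begin call speculatively set up past EOF.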
Brian

> Brian
>
> >
> >  		page_cache_release(page);
> >  		page = NULL;
> > --
> > 1.9.0

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs