On Mon, Oct 13, 2014 at 01:11:25PM -0400, Brian Foster wrote: > The zero range operation is analogous to fallocate with the exception of > converting the range to zeroes. E.g., it attempts to allocate zeroed > blocks over the range specified by the caller. The XFS implementation > kills all delalloc blocks currently over the aligned range, converts the > range to allocated zero blocks (unwritten extents) and handles the > partial pages at the ends of the range by sending writes through the > pagecache. > > The current implementation suffers from several problems associated with > inode size. If the aligned range covers an extending I/O, said I/O is > discarded and an inode size update from a previous write never makes it > to disk. Further, if an unaligned zero range extends beyond eof, the > page write induced for the partial end page can itself increase the > inode size, even if the zero range request is not supposed to update > i_size (via KEEP_SIZE, similar to an fallocate beyond EOF). > > The latter behavior not only incorrectly increases the inode size, but > can lead to stray delalloc blocks on the inode. Typically, post-eof > preallocation blocks are either truncated on release or inode eviction > or explicitly written to by xfs_zero_eof() on natural file size > extension. If the inode size increases due to zero range, however, > associated blocks leak into the address space having never been > converted or mapped to pagecache pages. A direct I/O to such an > uncovered range cannot convert the extent via writeback and will BUG(). > For example: > > $ xfs_io -fc "pwrite 0 128k" -c "fzero -k 1m 54321" <file> > ... > $ xfs_io -d -c "pread 128k 128k" <file> > <BUG> > > If the entire delalloc extent happens to not have page coverage > whatsoever (e.g., delalloc conversion couldn't find a large enough free > space extent), even a full file writeback won't convert what's left of > the extent and we'll assert on inode eviction. > > Rework xfs_zero_file_space() to avoid buffered I/O for partial pages. > Use the existing hole punch and prealloc mechanisms as primitives for > zero range. We punch out the pagecache beforehand to eliminate > unnecessary writeback. The hole punch mechanism handles partial block > zeroing for us and facilitates the use of a single prealloc call over > the entire range, which increases the odds of contiguous allocation. > > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx> > --- This patch triggers the same bug pretty much straight away on generic/033 on all my test systems: [ 306.378041] XFS: Assertion failed: startblockval(del.br_startblock) > 0, file: fs/xfs/libxfs/xfs_bmap.c, line: 5279 [ 306.380694] ------------[ cut here ]------------ [ 306.381655] kernel BUG at fs/xfs/xfs_message.c:107! [ 306.382535] invalid opcode: 0000 [#1] SMP [ 306.383310] Modules linked in: [ 306.383889] CPU: 0 PID: 12151 Comm: xfs_io Not tainted 3.17.0-dgc+ #537 [ 306.384665] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 [ 306.384665] task: ffff88007597e2c0 ti: ffff880077644000 task.ti: ffff880077644000 [ 306.384665] RIP: 0010:[<ffffffff814dadb2>] [<ffffffff814dadb2>] assfail+0x22/0x30 [ 306.384665] RSP: 0018:ffff880077647c98 EFLAGS: 00010296 [ 306.384665] RAX: 0000000000000067 RBX: 0000000000000007 RCX: 0000000000000000 [ 306.384665] RDX: 0000000000000001 RSI: ffff88007fc0d258 RDI: ffff88007fc0d258 [ 306.384665] RBP: ffff880077647c98 R08: 000000000000000a R09: 0000000000000000 [ 306.384665] R10: 000000000000026b R11: ffff880077647946 R12: 0000000000000007 [ 306.384665] R13: ffff88006a35d300 R14: ffff880076669370 R15: ffff880077647db0 [ 306.384665] FS: 00007fc0b858e700(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000 [ 306.384665] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 306.384665] CR2: 000000000061dc40 CR3: 000000007592d000 CR4: 00000000000006f0 [ 306.384665] Stack: [ 306.384665] ffff880077647d88 ffffffff81491341 ffff880077647d40 ffff880077647db4 [ 306.384665] ffff880075997000 0000000100000007 ffff88006a35d340 0000000000000000 [ 306.384665] 0000000000000000 ffffffff00000000 0000000700000000 0000000000000000 [ 306.384665] Call Trace: [ 306.384665] [<ffffffff81491341>] xfs_bunmapi+0x781/0x1000 [ 306.384665] [<ffffffff814c0ad6>] xfs_bmap_punch_delalloc_range+0xf6/0x1a0 [ 306.384665] [<ffffffff814c1b13>] xfs_zero_file_space+0xf3/0x1d0 [ 306.384665] [<ffffffff814c8538>] xfs_file_fallocate+0xe8/0x2f0 [ 306.384665] [<ffffffff811aeb48>] ? __sb_start_write+0x58/0xf0 [ 306.384665] [<ffffffff811aa8b7>] do_fallocate+0x127/0x1c0 [ 306.384665] [<ffffffff811aa994>] SyS_fallocate+0x44/0x70 [ 306.384665] [<ffffffff81d01a29>] system_call_fastpath+0x16/0x1b [ 306.384665] Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 f1 41 89 d0 48 89 e5 48 89 fa 48 c7 c6 f0 67 15 82 31 ff 31 c0 e8 ce fb ff ff <0f> 0b 66 66 66 [ 306.384665] RIP [<ffffffff814dadb2>] assfail+0x22/0x30 [ 306.384665] RSP <ffff880077647c98> [ 306.418027] ---[ end trace 18ffcc2e14a50ab1 ]--- I'm running 3.17 + for-next + a handful of local patches, but this is the only patch that modifies anything in this area. I'll remove all the other patches I have just to check.... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs