On Wed, 2024-09-04 at 10:35 +1000, Dave Chinner wrote: Hi, Dave. Thanks for your review and comments. > On Tue, Sep 03, 2024 at 01:48:08PM +0800, Julian Sun wrote: > > Hi, all. > > > > Recently, syzbot reported a issue as following: > > > > WARNING: CPU: 1 PID: 5222 at fs/iomap/buffered-io.c:727 __iomap_write_begin fs/iomap/buffered-io.c:727 [inline] > > WARNING: CPU: 1 PID: 5222 at fs/iomap/buffered-io.c:727 iomap_write_begin+0x13f0/0x16f0 fs/iomap/buffered-io.c:830 > > CPU: 1 UID: 0 PID: 5222 Comm: syz-executor247 Not tainted 6.11.0-rc2-syzkaller-00111-gee9a43b7cfe2 #0 > > RIP: 0010:__iomap_write_begin fs/iomap/buffered-io.c:727 [inline] > > RIP: 0010:iomap_write_begin+0x13f0/0x16f0 fs/iomap/buffered-io.c:830 > > Call Trace: > > <TASK> > > iomap_unshare_iter fs/iomap/buffered-io.c:1351 [inline] > > iomap_file_unshare+0x460/0x780 fs/iomap/buffered-io.c:1391 > > xfs_reflink_unshare+0x173/0x5f0 fs/xfs/xfs_reflink.c:1681 > > xfs_file_fallocate+0x6be/0xa50 fs/xfs/xfs_file.c:997 > > vfs_fallocate+0x553/0x6c0 fs/open.c:334 > > ksys_fallocate fs/open.c:357 [inline] > > __do_sys_fallocate fs/open.c:365 [inline] > > __se_sys_fallocate fs/open.c:363 [inline] > > __x64_sys_fallocate+0xbd/0x110 fs/open.c:363 > > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > > do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 > > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > RIP: 0033:0x7f2d716a6899 > > > > syzbot constructed the following scenario: syzbot called the write() > > system call, passed an illegal pointer, and attempted to write 0x1017 > > bytes, resulting in 0 bytes written and returning EFAULT to user > > space. Then, it called the write() system call again, passed another > > illegal pointer, and attempted to write 0xfea7 bytes, resulting in > > 0xe00 bytes written. Finally called copy_file_range() sys call and > > fallocate() sys call with FALLOC_FL_UNSHARE_RANGE flag. > > > > What happened here is: during the first write, xfs_buffered_write_iomap_begin() > > used preallocated 512 blocks, inserted an extent with a length of 512 and > > reserved 512 blocks in the quota, with the iomap length being 1M. > > Why did XFS preallocate 512 blocks? The file was opened O_TRUNC, so > it should be zero length and a write of just over 4100 bytes will > not trigger speculative prealloc on a zero length file (threshold is > 64kB). Indeed, the first speculative prealloc will only be 64kB in > size... Perhaps my wording was misleading. This might not be actual preallocation, but rather it reached the following statement. if (xfs_has_allocsize(mp)) prealloc_blocks = mp->m_allocsize_blocks; And mp->m_allocsize_blocks is 512. > > Hence it's not immediately obvious what precondition is causing this > behaviour to occur. > > > However, when the write failed(0 byte was written), only 0x1017 bytes were > > passed to iomap_end() instead of the preallocated 1M bytes/512 blocks. > > This caused only 3 blocks to be unreserved for the quota in iomap_end(), > > instead of 512, and the corresponding extent information also only removed > > 3 blocks instead of 512. > > Why 3 blocks? what was the filesystem block size? 2kB? Exactly. GDB observed that sb_blocksize was 2048 and sb_blocklog was 11. > > This also smells of delayed allocation, because if it were > -preallocation- it would be unwritten extents and we don't punch > them out on write failures. See my first question.... > > Regardless, the behaviour is perfectly fine. The remainder of the > blocks are beyond EOF, and so have no impact on anything at this > point in time. > > > As a result, during the second write, the iomap length was 3 blocks > > instead of the expected 512 blocks, > > What was the mapping? A new delalloc extent from 0 to 6kB? If so, > that is exactly as expected, because there's a hole in the file > there. Yeah. There's a delalloc extent from 0 to 6KB. And below is the log of extent changes during the reproduction process before the panic. [ 24.466778][ T2491] write begin for ino 9292 offset 0 count 4119 offset_fsb 0 end_fsb 3 [ 24.467325][ T2491] insert extent pos:0 off:0 count:512 start:4503599627239434 [ 24.467909][ T2491] iomap->offset is 0 iomap->length is 1048576 [ 24.468603][ T2491] update extent before pos:0 off:0 count:512 start:4503599627239434 [ 24.469089][ T2491] update extent after pos:0 off:3 count:509 start:4503599627239434 write: Bad address [ 24.469734][ T2491] write begin for ino 9292 offset 0 count 65191 offset_fsb 0 end_fsb 32 [ 24.470206][ T2491] update extent before pos:0 off:3 count:509 start:4503599627239434 [ 24.470673][ T2491] update extent after pos:0 off:0 count:512 start:4503599627239434 [ 24.471110][ T2491] iomap->offset is 0 iomap->length is 6144 [ 24.471916][ T2491] update extent before pos:0 off:0 count:512 start:4503599627239434 [ 24.472382][ T2491] update extent after pos:0 off:0 count:2 start:4503599627239429 [ 24.472838][ T2491] insert extent pos:1 off:3 count:509 start:4503599627239430 [ 24.473270][ T2491] write begin for ino 9292 offset 3584 count 61607 offset_fsb 1 end_fsb 32 write ret is 0xe00 [ 24.474517][ T2491] update extent before pos:0 off:0 count:2 start:4503599627239429 [ 24.474994][ T2491] update extent after pos:0 off:0 count:2 start:13 [ 24.476451][ T8] update extent before pos:0 off:0 count:2 start:13 [ 24.476902][ T8] update extent after pos:0 off:0 count:2 start:13 [ 24.478880][ T2491] insert extent pos:0 off:0 count:2 start:13 copy_file_range ret is 0xe00 [ 24.481463][ T2491] write begin for ino 9292 offset 0 count 8192 offset_fsb 0 end_fsb 4 [ 24.481968][ T2491] insert extent pos:0 off:0 count:32 start:4503599627239430 [ 24.482422][ T2491] write begin for ino 9292 offset 4096 count 4096 offset_fsb 2 end_fsb 4 [ 24.482882][ T2491] write begin for ino 9292 offset 6144 count 2048 offset_fsb 3 end_fsb 4 Note the last two logs: fallocate() successfully unshared 4k bytes, then skipped the last 2k bytes because they were not being shared, and then attempted to unshare 2k bytes beyond EOF. > > > which ultimately triggered the > > issue reported by syzbot in the fallocate() system call. > > How? We wrote 3584 bytes, which means the first 2 blocks were > written and marked dirty, and the third block should have been > punched out by the same process that punched the delalloc blocks in > the first write. We end up with a file size of 3584 bytes, and > a delalloc mapping for those first two blocks followed by a hole, > followed by another delalloc extent beyond EOF. > > There is absolutely nothing incorrect about this state - this is > exactly how we want delalloc beyond EOF to be handled. i.e. it > remains in place until it is explicitly removed. An normal > application would fix the write buffer and rewrite the data, thereby > using the space beyond EOF that we've already preallocated. Got it. Thanks for your explanation. > > IOWs, this change in this patch is papering over the issue by > preventing short writes from leaving extents beyond EOF, rather than > working out why either CFR or UNSHARE is going wrong when there are > extents beyond EOF. Yeah, totally agreed. > > So, onto the copy_file_range() bit. > > Then CFR was called with a length of 0xffffffffa003e45bul, which is > almost 16EB in size. This should result in -EOVERFLOW, because that > is out of the range that an loff_t can represent. > > i.e. generic_copy_file_checks() does this: > > if (pos_in + count < pos_in || pos_out + count < pos_out) > return -EOVERFLOW; > > and pos_in is a loff_t which is signed. Hence (0 + 15EB) should > overflow to a large negative offset and be less than 0. Implicit > type casting rules always break my brain - this looks like it is > casting everything to unsigned, thereby not actually checking if > we're overflowing the max offset the kernel can operate on. Yes! The check was supposed to fail here, but it didn't. Maybe what we should do is like this: if (pos_in + (loff_t)count < pos_in || pos_out + (loff_t)count < pos_out) return -EOVERFLOW; My tests showed that CFR returned -EOVERFLOW with the change. A search for the keyword "Ensure offsets don't wrap" shows that there is also this kind of implicit conversion issue in generic_remap_checks(). And xfs_exchange_range_checks() uses check_add_overflow() to check for overflow. Clearly, the latter is a more elegant way of checking. I'm wondering if we have any tools or can we create a tool to check for such implicit conversions. GCC provides -Wconversion, but it reports too many issues, and people might get overwhelmed by such information. Or can we find a method to filter the warning that we really need to fix? > > This oversize length is not caught by a check against max file size > supported by the superblock, either,, because the count gets > truncated to EOF before the generic checks against supported maximum > file sizes are done. > > That seems ... wrong. Look at fallocate() - after checking for > overflow, it checks offset + len against inode->i_sb->s_maxbytes and > returns -EFBIG at that point. In this case, offset is 0 and len is 8192, so the check in fallocate() should passed. > > IOWs, I think CFR should be doing nothing but returning either > -EOVERFLOW of -EFBIG here because the length is way longer than > maximum supported file offset for any file operation in Linux. Then > unshare should do nothing because the file is not shared and should > not have the reflink flag set on it..... > > However, the strace indicates that: > > copy_file_range(6, [0], 5, NULL, 18446744072099193947, 0) = 3584 > > That it is copying 3584 bytes. i.e. the overflow check is broken. > It indicates, however, that the file size is as expected from the > the short writes that occurred. > > Hence the unshare operation should only be operating on the range of > 0-3584 bytes because that is the only possible range of the file > that is shared. > > However, the fallocate() call is for offset 0, length 0x2000 bytes > (8kB), and these ranges are passed directly from XFS to > iomap_file_unshare(). iomap_file_unshare() never checks the range > against EOF, either, and so we've passed a range beyond EOF to > iomap_file_unshare() for it to process. That seems ... wrong. > > Hence the unshare does: > > first map: 0 - 4k - unshare successfully. > second map: 4k - 6k - hole, skip. Beyond EOF. > third map: 6k - 8k - delalloc, beyond EOF so needs zeroing. > Fires warnings because UNSHARE. Yes. A slight difference: the second mapping was skipped because it wasn't marked with IOMAP_F_SHARED. > > IOWs, iomap_file_unshare() will walk beyond EOF blissfully unaware > it is trying to unshare blocks that cannot ever be shared first > place because reflink will not share blocks beyond EOF. And because > those blocks are beyond EOF, they will always trigger the "need > zeroing" case in __iomap_write_begin() and fire warnings. > > So, yeah, either xfs_file_fallocate() or iomap_file_unshare() need > to clamp the range being unshared to EOF. > > Hence this looks like a couple of contributing issues that need to > be fixed: > > - The CFR overflow checks look broken. > - The unshare range is never trimmed to EOF. Exactly. The first can be fixed with using check_add_overflow() macro. The latter one maybe can be fixed with the following patch: diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index f420c53d86ac..8898d5ec606f 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -1340,6 +1340,9 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter) /* don't bother with holes or unwritten extents */ if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) return length; + /* don't try to unshare any extents that beyond EOF. */ + if (pos > i_size_read(iter->inode)) + return length; do { struct folio *folio; > > but it's definitely not a bug caused by short writes leaving > delalloc extents beyond EOF. There are many, many ways that > we can create delalloc extents beyond EOF before running an UNSHARE > operation that can trip over them like this. > I see. Thanks again for your detailed and patient explanation. > -Dave. Thanks, -- Julian Sun <sunjunchao2870@xxxxxxxxx>