On Tue, Sep 03, 2024 at 01:48:08PM +0800, Julian Sun wrote: > Hi, all. > > Recently, syzbot reported a issue as following: > > WARNING: CPU: 1 PID: 5222 at fs/iomap/buffered-io.c:727 __iomap_write_begin fs/iomap/buffered-io.c:727 [inline] > WARNING: CPU: 1 PID: 5222 at fs/iomap/buffered-io.c:727 iomap_write_begin+0x13f0/0x16f0 fs/iomap/buffered-io.c:830 > CPU: 1 UID: 0 PID: 5222 Comm: syz-executor247 Not tainted 6.11.0-rc2-syzkaller-00111-gee9a43b7cfe2 #0 > RIP: 0010:__iomap_write_begin fs/iomap/buffered-io.c:727 [inline] > RIP: 0010:iomap_write_begin+0x13f0/0x16f0 fs/iomap/buffered-io.c:830 > Call Trace: > <TASK> > iomap_unshare_iter fs/iomap/buffered-io.c:1351 [inline] > iomap_file_unshare+0x460/0x780 fs/iomap/buffered-io.c:1391 > xfs_reflink_unshare+0x173/0x5f0 fs/xfs/xfs_reflink.c:1681 > xfs_file_fallocate+0x6be/0xa50 fs/xfs/xfs_file.c:997 > vfs_fallocate+0x553/0x6c0 fs/open.c:334 > ksys_fallocate fs/open.c:357 [inline] > __do_sys_fallocate fs/open.c:365 [inline] > __se_sys_fallocate fs/open.c:363 [inline] > __x64_sys_fallocate+0xbd/0x110 fs/open.c:363 > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 > entry_SYSCALL_64_after_hwframe+0x77/0x7f > RIP: 0033:0x7f2d716a6899 > > syzbot constructed the following scenario: syzbot called the write() > system call, passed an illegal pointer, and attempted to write 0x1017 > bytes, resulting in 0 bytes written and returning EFAULT to user > space. Then, it called the write() system call again, passed another > illegal pointer, and attempted to write 0xfea7 bytes, resulting in > 0xe00 bytes written. Finally called copy_file_range() sys call and > fallocate() sys call with FALLOC_FL_UNSHARE_RANGE flag. > > What happened here is: during the first write, xfs_buffered_write_iomap_begin() > used preallocated 512 blocks, inserted an extent with a length of 512 and > reserved 512 blocks in the quota, with the iomap length being 1M. Why did XFS preallocate 512 blocks? The file was opened O_TRUNC, so it should be zero length and a write of just over 4100 bytes will not trigger speculative prealloc on a zero length file (threshold is 64kB). Indeed, the first speculative prealloc will only be 64kB in size... Hence it's not immediately obvious what precondition is causing this behaviour to occur. > However, when the write failed(0 byte was written), only 0x1017 bytes were > passed to iomap_end() instead of the preallocated 1M bytes/512 blocks. > This caused only 3 blocks to be unreserved for the quota in iomap_end(), > instead of 512, and the corresponding extent information also only removed > 3 blocks instead of 512. Why 3 blocks? what was the filesystem block size? 2kB? This also smells of delayed allocation, because if it were -preallocation- it would be unwritten extents and we don't punch them out on write failures. See my first question.... Regardless, the behaviour is perfectly fine. The remainder of the blocks are beyond EOF, and so have no impact on anything at this point in time. > As a result, during the second write, the iomap length was 3 blocks > instead of the expected 512 blocks, What was the mapping? A new delalloc extent from 0 to 6kB? If so, that is exactly as expected, because there's a hole in the file there. > which ultimately triggered the > issue reported by syzbot in the fallocate() system call. How? We wrote 3584 bytes, which means the first 2 blocks were written and marked dirty, and the third block should have been punched out by the same process that punched the delalloc blocks in the first write. We end up with a file size of 3584 bytes, and a delalloc mapping for those first two blocks followed by a hole, followed by another delalloc extent beyond EOF. There is absolutely nothing incorrect about this state - this is exactly how we want delalloc beyond EOF to be handled. i.e. it remains in place until it is explicitly removed. An normal application would fix the write buffer and rewrite the data, thereby using the space beyond EOF that we've already preallocated. IOWs, this change in this patch is papering over the issue by preventing short writes from leaving extents beyond EOF, rather than working out why either CFR or UNSHARE is going wrong when there are extents beyond EOF. So, onto the copy_file_range() bit. Then CFR was called with a length of 0xffffffffa003e45bul, which is almost 16EB in size. This should result in -EOVERFLOW, because that is out of the range that an loff_t can represent. i.e. generic_copy_file_checks() does this: if (pos_in + count < pos_in || pos_out + count < pos_out) return -EOVERFLOW; and pos_in is a loff_t which is signed. Hence (0 + 15EB) should overflow to a large negative offset and be less than 0. Implicit type casting rules always break my brain - this looks like it is casting everything to unsigned, thereby not actually checking if we're overflowing the max offset the kernel can operate on. This oversize length is not caught by a check against max file size supported by the superblock, either,, because the count gets truncated to EOF before the generic checks against supported maximum file sizes are done. That seems ... wrong. Look at fallocate() - after checking for overflow, it checks offset + len against inode->i_sb->s_maxbytes and returns -EFBIG at that point. IOWs, I think CFR should be doing nothing but returning either -EOVERFLOW of -EFBIG here because the length is way longer than maximum supported file offset for any file operation in Linux. Then unshare should do nothing because the file is not shared and should not have the reflink flag set on it..... However, the strace indicates that: copy_file_range(6, [0], 5, NULL, 18446744072099193947, 0) = 3584 That it is copying 3584 bytes. i.e. the overflow check is broken. It indicates, however, that the file size is as expected from the the short writes that occurred. Hence the unshare operation should only be operating on the range of 0-3584 bytes because that is the only possible range of the file that is shared. However, the fallocate() call is for offset 0, length 0x2000 bytes (8kB), and these ranges are passed directly from XFS to iomap_file_unshare(). iomap_file_unshare() never checks the range against EOF, either, and so we've passed a range beyond EOF to iomap_file_unshare() for it to process. That seems ... wrong. Hence the unshare does: first map: 0 - 4k - unshare successfully. second map: 4k - 6k - hole, skip. Beyond EOF. third map: 6k - 8k - delalloc, beyond EOF so needs zeroing. Fires warnings because UNSHARE. IOWs, iomap_file_unshare() will walk beyond EOF blissfully unaware it is trying to unshare blocks that cannot ever be shared first place because reflink will not share blocks beyond EOF. And because those blocks are beyond EOF, they will always trigger the "need zeroing" case in __iomap_write_begin() and fire warnings. So, yeah, either xfs_file_fallocate() or iomap_file_unshare() need to clamp the range being unshared to EOF. Hence this looks like a couple of contributing issues that need to be fixed: - The CFR overflow checks look broken. - The unshare range is never trimmed to EOF. but it's definitely not a bug caused by short writes leaving delalloc extents beyond EOF. There are many, many ways that we can create delalloc extents beyond EOF before running an UNSHARE operation that can trip over them like this. -Dave. -- Dave Chinner david@xxxxxxxxxxxxx