Re: [PATCH] iomap: clean preallocated blocks in iomap_end() when 0 bytes was written.

Julian Sun <sunjunchao2870@xxxxxxxxx> · Wed, 04 Sep 2024 15:51:31 +0800

On Wed, 2024-09-04 at 10:35 +1000, Dave Chinner wrote:
Hi, Dave. Thanks for your review and comments. 
> On Tue, Sep 03, 2024 at 01:48:08PM +0800, Julian Sun wrote:
> > Hi, all.
> > 
> > Recently, syzbot reported a issue as following:
> > 
> > WARNING: CPU: 1 PID: 5222 at fs/iomap/buffered-io.c:727 __iomap_write_begin fs/iomap/buffered-io.c:727 [inline]
> > WARNING: CPU: 1 PID: 5222 at fs/iomap/buffered-io.c:727 iomap_write_begin+0x13f0/0x16f0 fs/iomap/buffered-io.c:830
> > CPU: 1 UID: 0 PID: 5222 Comm: syz-executor247 Not tainted 6.11.0-rc2-syzkaller-00111-gee9a43b7cfe2 #0
> > RIP: 0010:__iomap_write_begin fs/iomap/buffered-io.c:727 [inline]
> > RIP: 0010:iomap_write_begin+0x13f0/0x16f0 fs/iomap/buffered-io.c:830
> > Call Trace:
> >  <TASK>
> >  iomap_unshare_iter fs/iomap/buffered-io.c:1351 [inline]
> >  iomap_file_unshare+0x460/0x780 fs/iomap/buffered-io.c:1391
> >  xfs_reflink_unshare+0x173/0x5f0 fs/xfs/xfs_reflink.c:1681
> >  xfs_file_fallocate+0x6be/0xa50 fs/xfs/xfs_file.c:997
> >  vfs_fallocate+0x553/0x6c0 fs/open.c:334
> >  ksys_fallocate fs/open.c:357 [inline]
> >  __do_sys_fallocate fs/open.c:365 [inline]
> >  __se_sys_fallocate fs/open.c:363 [inline]
> >  __x64_sys_fallocate+0xbd/0x110 fs/open.c:363
> >  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> >  do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
> >  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > RIP: 0033:0x7f2d716a6899
> > 
> > syzbot constructed the following scenario: syzbot called the write()
> > system call, passed an illegal pointer, and attempted to write 0x1017
> > bytes, resulting in 0 bytes written and returning EFAULT to user
> > space. Then, it called the write() system call again, passed another
> > illegal pointer, and attempted to write 0xfea7 bytes, resulting in
> > 0xe00 bytes written. Finally called copy_file_range() sys call and
> > fallocate() sys call with FALLOC_FL_UNSHARE_RANGE flag.
> > 
> > What happened here is: during the first write, xfs_buffered_write_iomap_begin()
> > used preallocated 512 blocks, inserted an extent with a length of 512 and
> > reserved 512 blocks in the quota, with the iomap length being 1M.
> 
> Why did XFS preallocate 512 blocks? The file was opened O_TRUNC, so
> it should be zero length and a write of just over 4100 bytes will
> not trigger speculative prealloc on a zero length file (threshold is
> 64kB). Indeed, the first speculative prealloc will only be 64kB in
> size...

Perhaps my wording was misleading. This might not be actual preallocation, 
but rather it reached the following statement.
	if (xfs_has_allocsize(mp))
		prealloc_blocks = mp->m_allocsize_blocks;
And mp->m_allocsize_blocks is 512.
> 
> Hence it's not immediately obvious what precondition is causing this
> behaviour to occur.
> 
> > However, when the write failed(0 byte was written), only 0x1017 bytes were
> > passed to iomap_end() instead of the preallocated 1M bytes/512 blocks.
> > This caused only 3 blocks to be unreserved for the quota in iomap_end(),
> > instead of 512, and the corresponding extent information also only removed
> > 3 blocks instead of 512.
> 
> Why 3 blocks? what was the filesystem block size? 2kB?

Exactly. GDB observed that sb_blocksize was 2048 and sb_blocklog was 11.
> 
> This also smells of delayed allocation, because if it were
> -preallocation- it would be unwritten extents and we don't punch
> them out on write failures. See my first question....
> 
> Regardless, the behaviour is perfectly fine. The remainder of the
> blocks are beyond EOF, and so have no impact on anything at this
> point in time.
> 
> > As a result, during the second write, the iomap length was 3 blocks
> > instead of the expected 512 blocks,
> 
> What was the mapping? A new delalloc extent from 0 to 6kB? If so,
> that is exactly as expected, because there's a hole in the file
> there.

Yeah. There's a delalloc extent from 0 to 6KB.
And below is the log of extent changes during the reproduction 
process before the panic.

[   24.466778][ T2491] write begin for ino 9292 offset 0 count 4119 offset_fsb 0 end_fsb 3 
[   24.467325][ T2491] insert extent pos:0 off:0 count:512 start:4503599627239434
[   24.467909][ T2491] iomap->offset is 0 iomap->length is 1048576
[   24.468603][ T2491] update extent before pos:0 off:0 count:512 start:4503599627239434
[   24.469089][ T2491] update extent after  pos:0 off:3 count:509 start:4503599627239434
write: Bad address
[   24.469734][ T2491] write begin for ino 9292 offset 0 count 65191 offset_fsb 0 end_fsb 32 
[   24.470206][ T2491] update extent before pos:0 off:3 count:509 start:4503599627239434
[   24.470673][ T2491] update extent after  pos:0 off:0 count:512 start:4503599627239434
[   24.471110][ T2491] iomap->offset is 0 iomap->length is 6144
[   24.471916][ T2491] update extent before pos:0 off:0 count:512 start:4503599627239434
[   24.472382][ T2491] update extent after  pos:0 off:0 count:2 start:4503599627239429
[   24.472838][ T2491] insert extent pos:1 off:3 count:509 start:4503599627239430
[   24.473270][ T2491] write begin for ino 9292 offset 3584 count 61607 offset_fsb 1 end_fsb 32 
write ret is 0xe00
[   24.474517][ T2491] update extent before pos:0 off:0 count:2 start:4503599627239429
[   24.474994][ T2491] update extent after  pos:0 off:0 count:2 start:13
[   24.476451][    T8] update extent before pos:0 off:0 count:2 start:13
[   24.476902][    T8] update extent after  pos:0 off:0 count:2 start:13
[   24.478880][ T2491] insert extent pos:0 off:0 count:2 start:13
copy_file_range ret is 0xe00
[   24.481463][ T2491] write begin for ino 9292 offset 0 count 8192 offset_fsb 0 end_fsb 4 
[   24.481968][ T2491] insert extent pos:0 off:0 count:32 start:4503599627239430
[   24.482422][ T2491] write begin for ino 9292 offset 4096 count 4096 offset_fsb 2 end_fsb 4 
[   24.482882][ T2491] write begin for ino 9292 offset 6144 count 2048 offset_fsb 3 end_fsb 4

Note the last two logs: fallocate() successfully unshared 4k bytes, 
then skipped the last 2k bytes because they were not being shared, 
and then attempted to unshare 2k bytes beyond EOF.
> 
> > which ultimately triggered the
> > issue reported by syzbot in the fallocate() system call.
> 
> How? We wrote 3584 bytes, which means the first 2 blocks were
> written and marked dirty, and the third block should have been
> punched out by the same process that punched the delalloc blocks in
> the first write. We end up with a file size of 3584 bytes, and
> a delalloc mapping for those first two blocks followed by a hole,
> followed by another delalloc extent beyond EOF.
> 
> There is absolutely nothing incorrect about this state - this is
> exactly how we want delalloc beyond EOF to be handled. i.e. it
> remains in place until it is explicitly removed. An normal
> application would fix the write buffer and rewrite the data, thereby
> using the space beyond EOF that we've already preallocated.
Got it. Thanks for your explanation.
> 
> IOWs, this change in this patch is papering over the issue by
> preventing short writes from leaving extents beyond EOF, rather than
> working out why either CFR or UNSHARE is going wrong when there are
> extents beyond EOF.
Yeah, totally agreed. 
> 
> So, onto the copy_file_range() bit.
> 
> Then CFR was called with a length of 0xffffffffa003e45bul, which is
> almost 16EB in size. This should result in -EOVERFLOW, because that
> is out of the range that an loff_t can represent.
> 
> i.e. generic_copy_file_checks() does this:
> 
>         if (pos_in + count < pos_in || pos_out + count < pos_out)
>                 return -EOVERFLOW;
> 
> and pos_in is a loff_t which is signed. Hence (0 + 15EB) should
> overflow to a large negative offset and be less than 0. Implicit
> type casting rules always break my brain - this looks like it is
> casting everything to unsigned, thereby not actually checking if
> we're overflowing the max offset the kernel can operate on.
Yes! The check was supposed to fail here, but it didn't. Maybe what 
we should do is like this:
	if (pos_in + (loff_t)count < pos_in || pos_out + (loff_t)count < pos_out)
                return -EOVERFLOW;
My tests showed that CFR returned -EOVERFLOW with the change.

A search for the keyword "Ensure offsets don't wrap" shows that there is 
also this kind of implicit conversion issue in generic_remap_checks().
And xfs_exchange_range_checks() uses check_add_overflow() to check for 
overflow. Clearly, the latter is a more elegant way of checking. 

I'm wondering if we have any tools or can we create a tool to check 
for such implicit conversions. GCC provides -Wconversion, but it reports 
too many issues, and people might get overwhelmed by such information.
Or can we find a method to filter the warning that we really need to fix?
> 
> This oversize length is not caught by a check against max file size
> supported by the superblock, either,, because the count gets
> truncated to EOF before the generic checks against supported maximum
> file sizes are done.
> 
> That seems ... wrong. Look at fallocate() - after checking for
> overflow, it checks offset + len against inode->i_sb->s_maxbytes and
> returns -EFBIG at that point.

In this case, offset is 0 and len is 8192, so the check in fallocate()
should passed.
> 
> IOWs, I think CFR should be doing nothing but returning either
> -EOVERFLOW of -EFBIG here because the length is way longer than
> maximum supported file offset for any file operation in Linux.  Then
> unshare should do nothing because the file is not shared and should
> not have the reflink flag set on it.....
> 
> However, the strace indicates that:
> 
> copy_file_range(6, [0], 5, NULL, 18446744072099193947, 0) = 3584
> 
> That it is copying 3584 bytes. i.e. the overflow check is broken.
> It indicates, however, that the file size is as expected from the
> the short writes that occurred.
> 
> Hence the unshare operation should only be operating on the range of
> 0-3584 bytes because that is the only possible range of the file
> that is shared.
> 
> However, the fallocate() call is for offset 0, length 0x2000 bytes
> (8kB), and these ranges are passed directly from XFS to
> iomap_file_unshare().  iomap_file_unshare() never checks the range
> against EOF, either, and so we've passed a range beyond EOF to
> iomap_file_unshare() for it to process. That seems ... wrong.
> 
> Hence the unshare does:
> 
> first map:       0 - 4k - unshare successfully.
> second map:     4k - 6k - hole, skip. Beyond EOF.
> third map:      6k - 8k - delalloc, beyond EOF so needs zeroing.
>                           Fires warnings because UNSHARE.

Yes. A slight difference: the second mapping was skipped because it wasn't 
marked with IOMAP_F_SHARED. 
> 
> IOWs, iomap_file_unshare() will walk beyond EOF blissfully unaware
> it is trying to unshare blocks that cannot ever be shared first
> place because reflink will not share blocks beyond EOF. And because
> those blocks are beyond EOF, they will always trigger the "need
> zeroing" case in __iomap_write_begin() and fire warnings.
> 
> So, yeah, either xfs_file_fallocate() or iomap_file_unshare() need
> to clamp the range being unshared to EOF.
> 
> Hence this looks like a couple of contributing issues that need to
> be fixed:
> 
> - The CFR overflow checks look broken.
> - The unshare range is never trimmed to EOF.

Exactly. The first can be fixed with using check_add_overflow() macro.
The latter one maybe can be fixed with the following patch:

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index f420c53d86ac..8898d5ec606f 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1340,6 +1340,9 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter)
        /* don't bother with holes or unwritten extents */
        if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
                return length;
+       /* don't try to unshare any extents that beyond EOF. */
+       if (pos > i_size_read(iter->inode))
+               return length;
 
        do {
                struct folio *folio;

> 
> but it's definitely not a bug caused by short writes leaving
> delalloc extents beyond EOF. There are many, many ways that
> we can create delalloc extents beyond EOF before running an UNSHARE
> operation that can trip over them like this.
> 

I see. Thanks again for your detailed and patient explanation. 
> -Dave.

Thanks,
-- 
Julian Sun <sunjunchao2870@xxxxxxxxx>