Re: [PATCH 1/2] iomap: Do not unshare exents beyond EOF

Julian Sun <sunjunchao2870@xxxxxxxxx> · Tue, 10 Sep 2024 21:25:41 +0800

Brian Foster <bfoster@xxxxxxxxxx> 于2024年9月10日周二 20:29写道：
>
> On Tue, Sep 10, 2024 at 03:03:18PM +0800, Julian Sun wrote:
> > On Mon, 2024-09-09 at 15:29 -0400, Brian Foster wrote:
> > > On Tue, Sep 10, 2024 at 01:40:24AM +0800, Julian Sun wrote:
> > > > Brian Foster <bfoster@xxxxxxxxxx> 于2024年9月9日周一 21:27写道：
> > > > >
> > > > > On Mon, Sep 09, 2024 at 08:15:43PM +0800, Julian Sun wrote:
> > > > > > Hi Brian,
> > > > > >
> > > > > > Brian Foster <bfoster@xxxxxxxxxx> 于2024年9月7日周六 03:11写道：
> > > > > > >
> > > > > > > On Thu, Sep 05, 2024 at 06:24:24PM +0800, Julian Sun wrote:
> > > > > > > > Attempting to unshare extents beyond EOF will trigger
> > > > > > > > the need zeroing case, which in turn triggers a warning.
> > > > > > > > Therefore, let's skip the unshare process if extents are
> > > > > > > > beyond EOF.
> > > > > > > >
> > > > > > > > Reported-and-tested-by: syzbot+296b1c84b9cbf306e5a0@xxxxxxxxxxxxxxxxxxxxxxxxx
> > > > > > > > Closes: https://syzkaller.appspot.com/bug?extid=296b1c84b9cbf306e5a0
> > > > > > > > Fixes: 32a38a499104 ("iomap: use write_begin to read pages to unshare")
> > > > > > > > Inspired-by: Dave Chinner <david@xxxxxxxxxxxxx>
> > > > > > > > Signed-off-by: Julian Sun <sunjunchao2870@xxxxxxxxx>
> > > > > > > > ---
> > > > > > > >  fs/iomap/buffered-io.c | 3 +++
> > > > > > > >  1 file changed, 3 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > > > > > > index f420c53d86ac..8898d5ec606f 100644
> > > > > > > > --- a/fs/iomap/buffered-io.c
> > > > > > > > +++ b/fs/iomap/buffered-io.c
> > > > > > > > @@ -1340,6 +1340,9 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter)
> > > > > > > >       /* don't bother with holes or unwritten extents */
> > > > > > > >       if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> > > > > > > >               return length;
> > > > > > > > +     /* don't try to unshare any extents beyond EOF. */
> > > > > > > > +     if (pos > i_size_read(iter->inode))
> > > > > > > > +             return length;
> > > > > > >
> > > > > > > Hi Julian,
> > > > > > >
> > > > > > >
> > > > > > > > What about if pos starts within EOF and the operation extends beyond it?
> > > > > >
> > > > > > Extents within EOF will be unshared as usual. Details are below.
> > > > > > >
> > > > > > > > I ask because I think I've reproduced this scenario, though it is a bit
> > > > > > > > tricky and has dependencies...
> > > > > > > >
> > > > > > > > For one, it seems to depend on the cowblocks patch I recently posted
> > > > > > > > here [1] (though I don't think this is necessarily a problem with the
> > > > > > > > patch, it just depends on keeping COW fork blocks around after the
> > > > > > > > unshare). With that, I reproduce via fsx with unshare range support [2]
> > > > > > > > using the ops file appended below [3] on a -bsize=1k XFS fs.
> > > > > > > >
> > > > > > > > I haven't quite characterized the full sequence other than it looks like
> > > > > > > > the unshare walks across EOF with a shared data fork block and COW fork
> > > > > > > > delalloc, presumably finds the post-eof part of the folio !uptodate (so
> > > > > > > > iomap_adjust_read_range() doesn't skip it), and then trips over the
> > > > > > > > warning and error return associated with the folio zeroing in
> > > > > > > > __iomap_write_begin().
> > > > > >
> > > > > > The scenario has already been reproduced by syzbot[1]. The reproducer
> > > > > > provided by syzbot constructed the following extent maps for a file of
> > > > > > size 0xE00 before fallocate unshare:
> > > > > >
> > > > > > 0 - 4k: shared between two files
> > > > > > 4k - 6k: hole beyond EOF, not shared
> > > > > > 6k - 8k: delalloc extends
> > > > > >
> > > > > > Then the reproducer attempted to unshare the extent between 0 and
> > > > > > 0x2000 bytes, but the file size is 0xE00. This is likely the scenario
> > > > > > you were referring to?
> > > > > >
> > > > >
> > > > > Yes, sort of..
> > > > >
> > > > > > Eventually the unshare code does:
> > > > > > first map: 0 - 4k - unshare successfully.
> > > > > > second map: 4k - 6k - hole, skip. Beyond EOF.
> > > > > > third map: 6k - 8k - delalloc, beyond EOF so needs zeroing.
> > > > > > Fires warnings because UNSHARE.
> > > > > >
> > > > > > During the first call to iomap_unshare_iter(), iomap_length() returned
> > > > > > 4k, so 4k bytes were unshared.
> > > > > > See discuss here[2] for more details.
> > > > > > >
> > > > > > > This all kind of has me wondering.. do we know the purpose of this
> > > > > > > warning/error in the first place? It seems like it's more of an
> > > > > > > "unexpected situation" than a specific problem. E.g., assuming the same
> > > > > > > page were mmap'd, I _think_ the read fault path would do the same sort
> > > > > > > of zeroing such that the unshare would see a fully uptodate folio and
> > > > > > > carry on as normal. I added the mapread op to the opsfile below to give
> > > > > > > that a quick test (remove the "skip" text to enable it), and it seems to
> > > > > > > prevent the error, but I've not confirmed whether that theory is
> > > > > > > actually what's happening.
> > > > > > >
> > > > > > >
> > > > > > > > FWIW, I also wonder if another way to handle this would be to just
> > > > > > > > restrict the range of iomap_file_unshare() to within EOF. IOW if a
> > > > > > > > caller passes a range beyond EOF, just process whatever part of the
> > > > > > > > range falls within EOF. It seems iomap isn't responsible for the file
> > > > > > > > extending aspect of the fallocate unshare command anyways.
> > > > > >
> > > > > > It already does 'just process whatever part of the range falls within EOF'.
> > > > > > Check the above case.
> > > > > >
> > > > > > I'm not sure if I fully understand what you mean. This patch does not
> > > > > > prevent unsharing extents within the EOF. This patch checks if pos is
> > > > > > beyond EOF, instead of checking if pos + length is beyond EOF. So the
> > > > > > extents within EOF should be unshared as usual.
> > > > > >
> > > > >
> > > > > I'm not concerned about preventing unsharing. I'm concerned that this
> > > > > patch doesn't always prevent attempts to unshare post-eof ranges. I
> > > > > think the difference here is that in the variant I was hitting, we end
> > > > >
> > > > > > up with a mapping that starts within EOF and ends beyond EOF, whereas
> > > > > > the syzbot variant produces a scenario where the problematic mapping
> > > > > > always starts beyond EOF.
> > > >
> > > > This is not true. In the case above, the syzbot did indeed unshare the
> > > > extents between 0-4k, which were started within EOF and ended beyond
> > > > EOF. The specific variants here are: pos:0 len:0x1000 EOF: 0xE00. And
> > > > the unshare code successfully unshared extents between 0 and 4k.
> > > >
> > > > During the next loop in iomap_file_unshare(), the pos became 0x1000,
> > > > which is beyond EOF.  What this patch does is to skip the unshare
> > > > during the second loop.
> > > > Is there anything I misunderstand？
> > >
> > > Hmm, what block size? Does the associated mapping have at least one full
> > > block beyond EOF? If you have a map at offset 0, length 0x1000 and EOF
> > > at 0xE00, then unless you have 512b blocks it sounds like the EOF block
> > > actually starts within EOF.
> >
> > The block size here is 2k, and there isn't a full block beyond EOF within
> > this extent map.
>
> Ok. That likely explains the difference in behavior. The fsx variant has
> a mapping that starts within EOF and has at least one post-EOF block
> that unshare attempts to process.
>
> > >
> > > The variant I'm seeing is more like this.. consider a -bsize=1k fs, a
> > > file size of 0x3c400, and an EOF mapping of (offset 0x3c000, length
> > > 0x4000). The EOF folio in this case is 4k in size and starts at the same
> > > 0x3c000 offset as the EOF mapping.
> > >
> > > So with 1k blocks, the EOF mapping starts one block before EOF and
> > > extends well beyond it. What happens in the test case is that
> > > iomap_unshare_iter() is called with the EOF folio, pos 0x3c000, length
> > > 0x800, and where the block at offset 0x3c400 is not marked uptodate. pos
> > > is thus within EOF, but the while loop in __iomap_write_begin() walks
> > > past it and attempts to process one block beyond EOF.
> >
> > Ok, so the key point here is that there is a full block beyond EOF within
> > the associated extent map, which is different with the scenario reproduced
> > by syzbot.
> > According to the Dave's comments, the trimming behavior seems like should
> > be done in filesystem(e.g.,xfs), instead of iomap. I will reconsider this scenario.
> >
>
> Seems reasonable, but I don't agree that is a suitable fix for iomap. To
> be clear, it's perfectly fine for the fs to trim the range however it
> sees fit (i.e. no shared blocks beyond EOF in XFS), but we should also
> recognize that iomap is a generic layer and unshare is currently
> implemented to trip over itself, warn and fail if passed a post-eof
> range.
>
>
> > I still suspect that just making unshare work correctly around i_size
> > might be the more elegant long term solution, but that is not totally
> > clear. IMO, as long as unshare is written to intentionally trip over
> > itself for post-eof ranges, it should either trim the range or check for
> > valid parameters and document the limitation. Otherwise we just leave a
> > landmine for the next caller to have to work through the same problems,
> > which is particularly subtle since the higher level fallocate unshare
> > mode supports post-eof ranges. Just my .02.

Yeah, totally agreed. I prefer to do the trimming in the vfs layer
just like what generic_copy_file_checks() does, instead of a specific
file system. Maybe we need a helper function
generic_fallocate_checks(). But it requires more thought and testing.

>
> Brian
>
> > Thanks for your comments and review.
> > >
> > > Brian
> > >
> > > > >
> > > > > So IOW, this patch works for the syzbot variant because it happens to
> > > > > reproduce a situation where pos will be beyond EOF, but that is an
> > > > >
> > > > > > assumption that might not always be true. The fsx generated variant runs
> > > > > > a sequence that produces a mapping that spans across EOF, which means
> > > > > > that pos is within EOF at the start of unshare, so unshare proceeds to
> > > > > > walk across the EOF boundary, the corresponding EOF folio is not fully
> > > > > > uptodate, and thus write begin wants to do partial zeroing and
> > > > > > fails/warns.
> > > >
> > > > Yeah, it's exactly what the syzbot does.
> > > > >
> > > > > I suspect that if the higher level range were trimmed to be within EOF
> > > > > in iomap_file_unshare(), that would prevent this problem in either case.
> > > > > Note that this was on a -bsize=1k fs, so what I'm not totally sure about
> > > > > is whether skipping zeroing as such would be a problem with larger FSBs.
> > > > > My initial thinking was this might not be possible since the EOF folio
> > > > > should be fully uptodate in that case, but that probably requires some
> > > > > thought/review/testing.
> > > > >
> > > > > Brian
> > > > >
> > > > > > BTW, maybe the check here should be
> > > > > >                   if (pos >= i_size_read(iter->inode))
> > > > > >
> > > > > > If there is any misunderstanding, please let me know, thanks.
> > > > > >
> > > > > > [1]: https://lore.kernel.org/all/0000000000008964f1061f8c32b6@xxxxxxxxxx/T/
> > > > > > [2]: https://lore.kernel.org/all/20240903054808.126799-1-sunjunchao2870@xxxxxxxxx/
> > > > > >
> > > > > > >
> > > > > > > Thoughts?
> > > > > > >
> > > > > > > Brian
> > > > > > >
> > > > > > > [1] https://lore.kernel.org/linux-xfs/20240906114051.120743-1-bfoster@xxxxxxxxxx/
> > > > > > > [2] https://lore.kernel.org/fstests/20240906185606.136402-1-bfoster@xxxxxxxxxx/
> > > > > > > [3] fsx ops file:
> > > > > > >
> > > > > > > fallocate 0x3bc00 0x400 0
> > > > > > > write 0x3bc00 0x800 0x3c000
> > > > > > > clone_range 0x3bc00 0x400 0x0 0x3c400
> > > > > > > skip mapread 0x3c000 0x400 0x3c400
> > > > > > > fallocate 0x3bc00 0xc00 0x3c400 unshare
> > > > > > >
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > --
> > > > > > Julian Sun <sunjunchao2870@xxxxxxxxx>
> > > > > >
> > > > >
> > > >
> > > > Thanks,
> > > > --
> > > > Julian Sun <sunjunchao2870@xxxxxxxxx>
> > > >
> > >
> >
> > Thanks,
> > --
> > Julian Sun <sunjunchao2870@xxxxxxxxx>
> >
>

-- 
Julian Sun <sunjunchao2870@xxxxxxxxx>