Re: [PATCH 1/2] iomap: Do not unshare exents beyond EOF

Brian Foster <bfoster@xxxxxxxxxx> · Tue, 10 Sep 2024 08:30:08 -0400

On Tue, Sep 10, 2024 at 03:03:18PM +0800, Julian Sun wrote:
> On Mon, 2024-09-09 at 15:29 -0400, Brian Foster wrote:
> > On Tue, Sep 10, 2024 at 01:40:24AM +0800, Julian Sun wrote:
> > > Brian Foster <bfoster@xxxxxxxxxx> 于2024年9月9日周一 21:27写道：
> > > > 
> > > > On Mon, Sep 09, 2024 at 08:15:43PM +0800, Julian Sun wrote:
> > > > > Hi Brian,
> > > > > 
> > > > > Brian Foster <bfoster@xxxxxxxxxx> 于2024年9月7日周六 03:11写道：
> > > > > > 
> > > > > > On Thu, Sep 05, 2024 at 06:24:24PM +0800, Julian Sun wrote:
> > > > > > > Attempting to unshare extents beyond EOF will trigger
> > > > > > > the need zeroing case, which in turn triggers a warning.
> > > > > > > Therefore, let's skip the unshare process if extents are
> > > > > > > beyond EOF.
> > > > > > > 
> > > > > > > Reported-and-tested-by: syzbot+296b1c84b9cbf306e5a0@xxxxxxxxxxxxxxxxxxxxxxxxx
> > > > > > > Closes: https://syzkaller.appspot.com/bug?extid=296b1c84b9cbf306e5a0
> > > > > > > Fixes: 32a38a499104 ("iomap: use write_begin to read pages to unshare")
> > > > > > > Inspired-by: Dave Chinner <david@xxxxxxxxxxxxx>
> > > > > > > Signed-off-by: Julian Sun <sunjunchao2870@xxxxxxxxx>
> > > > > > > ---
> > > > > > >  fs/iomap/buffered-io.c | 3 +++
> > > > > > >  1 file changed, 3 insertions(+)
> > > > > > > 
> > > > > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > > > > > index f420c53d86ac..8898d5ec606f 100644
> > > > > > > --- a/fs/iomap/buffered-io.c
> > > > > > > +++ b/fs/iomap/buffered-io.c
> > > > > > > @@ -1340,6 +1340,9 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter)
> > > > > > >       /* don't bother with holes or unwritten extents */
> > > > > > >       if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> > > > > > >               return length;
> > > > > > > +     /* don't try to unshare any extents beyond EOF. */
> > > > > > > +     if (pos > i_size_read(iter->inode))
> > > > > > > +             return length;
> > > > > > 
> > > > > > Hi Julian,
> > > > > > 
> > > > > > 
> > > > > > > What about if pos starts within EOF and the operation extends beyond it?
> > > > > 
> > > > > Extents within EOF will be unshared as usual. Details are below.
> > > > > > 
> > > > > > > I ask because I think I've reproduced this scenario, though it is a bit
> > > > > > > tricky and has dependencies...
> > > > > > > 
> > > > > > > For one, it seems to depend on the cowblocks patch I recently posted
> > > > > > > here [1] (though I don't think this is necessarily a problem with the
> > > > > > > patch, it just depends on keeping COW fork blocks around after the
> > > > > > > unshare). With that, I reproduce via fsx with unshare range support [2]
> > > > > > > using the ops file appended below [3] on a -bsize=1k XFS fs.
> > > > > > > 
> > > > > > > I haven't quite characterized the full sequence other than it looks like
> > > > > > > the unshare walks across EOF with a shared data fork block and COW fork
> > > > > > > delalloc, presumably finds the post-eof part of the folio !uptodate (so
> > > > > > > iomap_adjust_read_range() doesn't skip it), and then trips over the
> > > > > > > warning and error return associated with the folio zeroing in
> > > > > > > __iomap_write_begin().
> > > > > 
> > > > > The scenario has already been reproduced by syzbot[1]. The reproducer
> > > > > provided by syzbot constructed the following extent maps for a file of
> > > > > size 0xE00 before fallocate unshare:
> > > > > 
> > > > > 0 - 4k: shared between two files
> > > > > 4k - 6k: hole beyond EOF, not shared
> > > > > 6k - 8k: delalloc extends
> > > > > 
> > > > > Then the reproducer attempted to unshare the extent between 0 and
> > > > > 0x2000 bytes, but the file size is 0xE00. This is likely the scenario
> > > > > you were referring to?
> > > > > 
> > > > 
> > > > Yes, sort of..
> > > > 
> > > > > Eventually the unshare code does:
> > > > > first map: 0 - 4k - unshare successfully.
> > > > > second map: 4k - 6k - hole, skip. Beyond EOF.
> > > > > third map: 6k - 8k - delalloc, beyond EOF so needs zeroing.
> > > > > Fires warnings because UNSHARE.
> > > > > 
> > > > > During the first call to iomap_unshare_iter(), iomap_length() returned
> > > > > 4k, so 4k bytes were unshared.
> > > > > See discuss here[2] for more details.
> > > > > > 
> > > > > > This all kind of has me wondering.. do we know the purpose of this
> > > > > > warning/error in the first place? It seems like it's more of an
> > > > > > "unexpected situation" than a specific problem. E.g., assuming the same
> > > > > > page were mmap'd, I _think_ the read fault path would do the same sort
> > > > > > of zeroing such that the unshare would see a fully uptodate folio and
> > > > > > carry on as normal. I added the mapread op to the opsfile below to give
> > > > > > that a quick test (remove the "skip" text to enable it), and it seems to
> > > > > > prevent the error, but I've not confirmed whether that theory is
> > > > > > actually what's happening.
> > > > > > 
> > > > > > 
> > > > > > > FWIW, I also wonder if another way to handle this would be to just
> > > > > > > restrict the range of iomap_file_unshare() to within EOF. IOW if a
> > > > > > > caller passes a range beyond EOF, just process whatever part of the
> > > > > > > range falls within EOF. It seems iomap isn't responsible for the file
> > > > > > > extending aspect of the fallocate unshare command anyways.
> > > > > 
> > > > > It already does 'just process whatever part of the range falls within EOF'.
> > > > > Check the above case.
> > > > > 
> > > > > I'm not sure if I fully understand what you mean. This patch does not
> > > > > prevent unsharing extents within the EOF. This patch checks if pos is
> > > > > beyond EOF, instead of checking if pos + length is beyond EOF. So the
> > > > > extents within EOF should be unshared as usual.
> > > > > 
> > > > 
> > > > I'm not concerned about preventing unsharing. I'm concerned that this
> > > > patch doesn't always prevent attempts to unshare post-eof ranges. I
> > > > think the difference here is that in the variant I was hitting, we end
> > > > 
> > > > > up with a mapping that starts within EOF and ends beyond EOF, whereas
> > > > > the syzbot variant produces a scenario where the problematic mapping
> > > > > always starts beyond EOF.
> > > 
> > > This is not true. In the case above, the syzbot did indeed unshare the
> > > extents between 0-4k, which were started within EOF and ended beyond
> > > EOF. The specific variants here are: pos:0 len:0x1000 EOF: 0xE00. And
> > > the unshare code successfully unshared extents between 0 and 4k.
> > > 
> > > During the next loop in iomap_file_unshare(), the pos became 0x1000,
> > > which is beyond EOF.  What this patch does is to skip the unshare
> > > during the second loop.
> > > Is there anything I misunderstand？
> > 
> > Hmm, what block size? Does the associated mapping have at least one full
> > block beyond EOF? If you have a map at offset 0, length 0x1000 and EOF
> > at 0xE00, then unless you have 512b blocks it sounds like the EOF block
> > actually starts within EOF.
> 
> The block size here is 2k, and there isn't a full block beyond EOF within 
> this extent map.

Ok. That likely explains the difference in behavior. The fsx variant has
a mapping that starts within EOF and has at least one post-EOF block
that unshare attempts to process.

> > 
> > The variant I'm seeing is more like this.. consider a -bsize=1k fs, a
> > file size of 0x3c400, and an EOF mapping of (offset 0x3c000, length
> > 0x4000). The EOF folio in this case is 4k in size and starts at the same
> > 0x3c000 offset as the EOF mapping.
> > 
> > So with 1k blocks, the EOF mapping starts one block before EOF and
> > extends well beyond it. What happens in the test case is that
> > iomap_unshare_iter() is called with the EOF folio, pos 0x3c000, length
> > 0x800, and where the block at offset 0x3c400 is not marked uptodate. pos
> > is thus within EOF, but the while loop in __iomap_write_begin() walks
> > past it and attempts to process one block beyond EOF.
> 
> Ok, so the key point here is that there is a full block beyond EOF within
> the associated extent map, which is different with the scenario reproduced 
> by syzbot.
> According to the Dave's comments, the trimming behavior seems like should 
> be done in filesystem(e.g.,xfs), instead of iomap. I will reconsider this scenario.
> 

Seems reasonable, but I don't agree that is a suitable fix for iomap. To
be clear, it's perfectly fine for the fs to trim the range however it
sees fit (i.e. no shared blocks beyond EOF in XFS), but we should also
recognize that iomap is a generic layer and unshare is currently
implemented to trip over itself, warn and fail if passed a post-eof
range.

I still suspect that just making unshare work correctly around i_size
might be the more elegant long term solution, but that is not totally
clear. IMO, as long as unshare is written to intentionally trip over
itself for post-eof ranges, it should either trim the range or check for
valid parameters and document the limitation. Otherwise we just leave a
landmine for the next caller to have to work through the same problems,
which is particularly subtle since the higher level fallocate unshare
mode supports post-eof ranges. Just my .02.

Brian

> Thanks for your comments and review.
> > 
> > Brian
> > 
> > > > 
> > > > So IOW, this patch works for the syzbot variant because it happens to
> > > > reproduce a situation where pos will be beyond EOF, but that is an
> > > > 
> > > > > assumption that might not always be true. The fsx generated variant runs
> > > > > a sequence that produces a mapping that spans across EOF, which means
> > > > > that pos is within EOF at the start of unshare, so unshare proceeds to
> > > > > walk across the EOF boundary, the corresponding EOF folio is not fully
> > > > > uptodate, and thus write begin wants to do partial zeroing and
> > > > > fails/warns.
> > > 
> > > Yeah, it's exactly what the syzbot does.
> > > > 
> > > > I suspect that if the higher level range were trimmed to be within EOF
> > > > in iomap_file_unshare(), that would prevent this problem in either case.
> > > > Note that this was on a -bsize=1k fs, so what I'm not totally sure about
> > > > is whether skipping zeroing as such would be a problem with larger FSBs.
> > > > My initial thinking was this might not be possible since the EOF folio
> > > > should be fully uptodate in that case, but that probably requires some
> > > > thought/review/testing.
> > > > 
> > > > Brian
> > > > 
> > > > > BTW, maybe the check here should be
> > > > >                   if (pos >= i_size_read(iter->inode))
> > > > > 
> > > > > If there is any misunderstanding, please let me know, thanks.
> > > > > 
> > > > > [1]: https://lore.kernel.org/all/0000000000008964f1061f8c32b6@xxxxxxxxxx/T/
> > > > > [2]: https://lore.kernel.org/all/20240903054808.126799-1-sunjunchao2870@xxxxxxxxx/
> > > > > 
> > > > > > 
> > > > > > Thoughts?
> > > > > > 
> > > > > > Brian
> > > > > > 
> > > > > > [1] https://lore.kernel.org/linux-xfs/20240906114051.120743-1-bfoster@xxxxxxxxxx/
> > > > > > [2] https://lore.kernel.org/fstests/20240906185606.136402-1-bfoster@xxxxxxxxxx/
> > > > > > [3] fsx ops file:
> > > > > > 
> > > > > > fallocate 0x3bc00 0x400 0
> > > > > > write 0x3bc00 0x800 0x3c000
> > > > > > clone_range 0x3bc00 0x400 0x0 0x3c400
> > > > > > skip mapread 0x3c000 0x400 0x3c400
> > > > > > fallocate 0x3bc00 0xc00 0x3c400 unshare
> > > > > > 
> > > > > 
> > > > > 
> > > > > Thanks,
> > > > > --
> > > > > Julian Sun <sunjunchao2870@xxxxxxxxx>
> > > > > 
> > > > 
> > > 
> > > Thanks,
> > > -- 
> > > Julian Sun <sunjunchao2870@xxxxxxxxx>
> > > 
> > 
> 
> Thanks,
> -- 
> Julian Sun <sunjunchao2870@xxxxxxxxx>
>