Re: [PATCH 1/2] iomap: Do not unshare exents beyond EOF

Brian Foster <bfoster@xxxxxxxxxx> · Mon, 9 Sep 2024 09:28:36 -0400

On Mon, Sep 09, 2024 at 08:15:43PM +0800, Julian Sun wrote:
> Hi Brian,
> 
> Brian Foster <bfoster@xxxxxxxxxx> 于2024年9月7日周六 03:11写道：
> >
> > On Thu, Sep 05, 2024 at 06:24:24PM +0800, Julian Sun wrote:
> > > Attempting to unshare extents beyond EOF will trigger
> > > the need zeroing case, which in turn triggers a warning.
> > > Therefore, let's skip the unshare process if extents are
> > > beyond EOF.
> > >
> > > Reported-and-tested-by: syzbot+296b1c84b9cbf306e5a0@xxxxxxxxxxxxxxxxxxxxxxxxx
> > > Closes: https://syzkaller.appspot.com/bug?extid=296b1c84b9cbf306e5a0
> > > Fixes: 32a38a499104 ("iomap: use write_begin to read pages to unshare")
> > > Inspired-by: Dave Chinner <david@xxxxxxxxxxxxx>
> > > Signed-off-by: Julian Sun <sunjunchao2870@xxxxxxxxx>
> > > ---
> > >  fs/iomap/buffered-io.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > >
> > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > index f420c53d86ac..8898d5ec606f 100644
> > > --- a/fs/iomap/buffered-io.c
> > > +++ b/fs/iomap/buffered-io.c
> > > @@ -1340,6 +1340,9 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter)
> > >       /* don't bother with holes or unwritten extents */
> > >       if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> > >               return length;
> > > +     /* don't try to unshare any extents beyond EOF. */
> > > +     if (pos > i_size_read(iter->inode))
> > > +             return length;
> >
> > Hi Julian,
> >
> >
> > > What about if pos starts within EOF and the operation extends beyond it?
> 
> Extents within EOF will be unshared as usual. Details are below.
> >
> > > I ask because I think I've reproduced this scenario, though it is a bit
> > > tricky and has dependencies...
> > >
> > > For one, it seems to depend on the cowblocks patch I recently posted
> > > here [1] (though I don't think this is necessarily a problem with the
> > > patch, it just depends on keeping COW fork blocks around after the
> > > unshare). With that, I reproduce via fsx with unshare range support [2]
> > > using the ops file appended below [3] on a -bsize=1k XFS fs.
> > >
> > > I haven't quite characterized the full sequence other than it looks like
> > > the unshare walks across EOF with a shared data fork block and COW fork
> > > delalloc, presumably finds the post-eof part of the folio !uptodate (so
> > > iomap_adjust_read_range() doesn't skip it), and then trips over the
> > > warning and error return associated with the folio zeroing in
> > > __iomap_write_begin().
> 
> The scenario has already been reproduced by syzbot[1]. The reproducer
> provided by syzbot constructed the following extent maps for a file of
> size 0xE00 before fallocate unshare:
> 
> 0 - 4k: shared between two files
> 4k - 6k: hole beyond EOF, not shared
> 6k - 8k: delalloc extends
> 
> Then the reproducer attempted to unshare the extent between 0 and
> 0x2000 bytes, but the file size is 0xE00. This is likely the scenario
> you were referring to?
> 

Yes, sort of..

> Eventually the unshare code does:
> first map: 0 - 4k - unshare successfully.
> second map: 4k - 6k - hole, skip. Beyond EOF.
> third map: 6k - 8k - delalloc, beyond EOF so needs zeroing.
> Fires warnings because UNSHARE.
> 
> During the first call to iomap_unshare_iter(), iomap_length() returned
> 4k, so 4k bytes were unshared.
> See discuss here[2] for more details.
> >
> > This all kind of has me wondering.. do we know the purpose of this
> > warning/error in the first place? It seems like it's more of an
> > "unexpected situation" than a specific problem. E.g., assuming the same
> > page were mmap'd, I _think_ the read fault path would do the same sort
> > of zeroing such that the unshare would see a fully uptodate folio and
> > carry on as normal. I added the mapread op to the opsfile below to give
> > that a quick test (remove the "skip" text to enable it), and it seems to
> > prevent the error, but I've not confirmed whether that theory is
> > actually what's happening.
> >
> >
> > > FWIW, I also wonder if another way to handle this would be to just
> > > restrict the range of iomap_file_unshare() to within EOF. IOW if a
> > > caller passes a range beyond EOF, just process whatever part of the
> > > range falls within EOF. It seems iomap isn't responsible for the file
> > > extending aspect of the fallocate unshare command anyways.
> 
> It already does 'just process whatever part of the range falls within EOF'.
> Check the above case.
> 
> I'm not sure if I fully understand what you mean. This patch does not
> prevent unsharing extents within the EOF. This patch checks if pos is
> beyond EOF, instead of checking if pos + length is beyond EOF. So the
> extents within EOF should be unshared as usual.
> 

I'm not concerned about preventing unsharing. I'm concerned that this
patch doesn't always prevent attempts to unshare post-eof ranges. I
think the difference here is that in the variant I was hitting, we end
up with a mapping that starts within EOF and ends beyond EOF, whereas
the syzbot variant produces a scenario where the problematic mapping
always starts beyond EOF.

So IOW, this patch works for the syzbot variant because it happens to
reproduce a situation where pos will be beyond EOF, but that is an
assumption that might not always be true. The fsx generated variant runs
a sequence that produces a mapping that spans across EOF, which means
that pos is within EOF at the start of unshare, so unshare proceeds to
walk across the EOF boundary, the corresponding EOF folio is not fully
uptodate, and thus write begin wants to do partial zeroing and
fails/warns.

I suspect that if the higher level range were trimmed to be within EOF
in iomap_file_unshare(), that would prevent this problem in either case.
Note that this was on a -bsize=1k fs, so what I'm not totally sure about
is whether skipping zeroing as such would be a problem with larger FSBs.
My initial thinking was this might not be possible since the EOF folio
should be fully uptodate in that case, but that probably requires some
thought/review/testing.

Brian

> BTW, maybe the check here should be
>                   if (pos >= i_size_read(iter->inode))
> 
> If there is any misunderstanding, please let me know, thanks.
> 
> [1]: https://lore.kernel.org/all/0000000000008964f1061f8c32b6@xxxxxxxxxx/T/
> [2]: https://lore.kernel.org/all/20240903054808.126799-1-sunjunchao2870@xxxxxxxxx/
> 
> >
> > Thoughts?
> >
> > Brian
> >
> > [1] https://lore.kernel.org/linux-xfs/20240906114051.120743-1-bfoster@xxxxxxxxxx/
> > [2] https://lore.kernel.org/fstests/20240906185606.136402-1-bfoster@xxxxxxxxxx/
> > [3] fsx ops file:
> >
> > fallocate 0x3bc00 0x400 0
> > write 0x3bc00 0x800 0x3c000
> > clone_range 0x3bc00 0x400 0x0 0x3c400
> > skip mapread 0x3c000 0x400 0x3c400
> > fallocate 0x3bc00 0xc00 0x3c400 unshare
> >
> 
> 
> Thanks,
> -- 
> Julian Sun <sunjunchao2870@xxxxxxxxx>
>