On Tue, 2024-09-10 at 09:56 -0400, Brian Foster wrote: > On Tue, Sep 10, 2024 at 09:25:41PM +0800, Julian Sun wrote: > > Brian Foster <bfoster@xxxxxxxxxx> 于2024年9月10日周二 20:29写道: > > > > > > On Tue, Sep 10, 2024 at 03:03:18PM +0800, Julian Sun wrote: > > > > On Mon, 2024-09-09 at 15:29 -0400, Brian Foster wrote: > > > > > On Tue, Sep 10, 2024 at 01:40:24AM +0800, Julian Sun wrote: > > > > > > Brian Foster <bfoster@xxxxxxxxxx> 于2024年9月9日周一 21:27写道: > > > > > > > > > > > > > > On Mon, Sep 09, 2024 at 08:15:43PM +0800, Julian Sun wrote: > > > > > > > > Hi Brian, > > > > > > > > > > > > > > > > Brian Foster <bfoster@xxxxxxxxxx> 于2024年9月7日周六 03:11写道: > > > > > > > > > > > > > > > > > > On Thu, Sep 05, 2024 at 06:24:24PM +0800, Julian Sun > > > > > > > > > wrote: > > > > > > > > > > Attempting to unshare extents beyond EOF will trigger > > > > > > > > > > the need zeroing case, which in turn triggers a > > > > > > > > > > warning. > > > > > > > > > > Therefore, let's skip the unshare process if extents > > > > > > > > > > are > > > > > > > > > > beyond EOF. > > > > > > > > > > > > > > > > > > > > Reported-and-tested-by: > > > > > > > > > > syzbot+296b1c84b9cbf306e5a0@xxxxxxxxxxxxxxxxxxxxxxxxx > > > > > > > > > > Closes: > > > > > > > > > > https://syzkaller.appspot.com/bug?extid=296b1c84b9cbf30 > > > > > > > > > > 6e5a0 > > > > > > > > > > Fixes: 32a38a499104 ("iomap: use write_begin to read > > > > > > > > > > pages to unshare") > > > > > > > > > > Inspired-by: Dave Chinner <david@xxxxxxxxxxxxx> > > > > > > > > > > Signed-off-by: Julian Sun <sunjunchao2870@xxxxxxxxx> > > > > > > > > > > --- > > > > > > > > > > fs/iomap/buffered-io.c | 3 +++ > > > > > > > > > > 1 file changed, 3 insertions(+) > > > > > > > > > > > > > > > > > > > > diff --git a/fs/iomap/buffered-io.c > > > > > > > > > > b/fs/iomap/buffered-io.c > > > > > > > > > > index f420c53d86ac..8898d5ec606f 100644 > > > > > > > > > > --- a/fs/iomap/buffered-io.c > > > > > > > > > > +++ b/fs/iomap/buffered-io.c > > > > > > > > > > @@ -1340,6 +1340,9 @@ static loff_t > > > > > > > > > > iomap_unshare_iter(struct iomap_iter *iter) > > > > > > > > > > /* don't bother with holes or unwritten extents > > > > > > > > > > */ > > > > > > > > > > if (srcmap->type == IOMAP_HOLE || srcmap->type == > > > > > > > > > > IOMAP_UNWRITTEN) > > > > > > > > > > return length; > > > > > > > > > > + /* don't try to unshare any extents beyond EOF. > > > > > > > > > > */ > > > > > > > > > > + if (pos > i_size_read(iter->inode)) > > > > > > > > > > + return length; > > > > > > > > > > > > > > > > > > Hi Julian, > > > > > > > > > > > > > > > > > > > > > > > > > > > > What about if pos starts within EOF and the operation > > > > > > > > > > extends beyond it? > > > > > > > > > > > > > > > > Extents within EOF will be unshared as usual. Details are > > > > > > > > below. > > > > > > > > > > > > > > > > > > > I ask because I think I've reproduced this scenario, > > > > > > > > > > though it is a bit > > > > > > > > > > tricky and has dependencies... > > > > > > > > > > > > > > > > > > > > For one, it seems to depend on the cowblocks patch I > > > > > > > > > > recently posted > > > > > > > > > > here [1] (though I don't think this is necessarily a > > > > > > > > > > problem with the > > > > > > > > > > patch, it just depends on keeping COW fork blocks > > > > > > > > > > around after the > > > > > > > > > > unshare). With that, I reproduce via fsx with unshare > > > > > > > > > > range support [2] > > > > > > > > > > using the ops file appended below [3] on a -bsize=1k > > > > > > > > > > XFS fs. > > > > > > > > > > > > > > > > > > > > I haven't quite characterized the full sequence other > > > > > > > > > > than it looks like > > > > > > > > > > the unshare walks across EOF with a shared data fork > > > > > > > > > > block and COW fork > > > > > > > > > > delalloc, presumably finds the post-eof part of the > > > > > > > > > > folio !uptodate (so > > > > > > > > > > iomap_adjust_read_range() doesn't skip it), and then > > > > > > > > > > trips over the > > > > > > > > > > warning and error return associated with the folio > > > > > > > > > > zeroing in > > > > > > > > > > __iomap_write_begin(). > > > > > > > > > > > > > > > > The scenario has already been reproduced by syzbot[1]. The > > > > > > > > reproducer > > > > > > > > provided by syzbot constructed the following extent maps > > > > > > > > for a file of > > > > > > > > size 0xE00 before fallocate unshare: > > > > > > > > > > > > > > > > 0 - 4k: shared between two files > > > > > > > > 4k - 6k: hole beyond EOF, not shared > > > > > > > > 6k - 8k: delalloc extends > > > > > > > > > > > > > > > > Then the reproducer attempted to unshare the extent between > > > > > > > > 0 and > > > > > > > > 0x2000 bytes, but the file size is 0xE00. This is likely > > > > > > > > the scenario > > > > > > > > you were referring to? > > > > > > > > > > > > > > > > > > > > > > Yes, sort of.. > > > > > > > > > > > > > > > Eventually the unshare code does: > > > > > > > > first map: 0 - 4k - unshare successfully. > > > > > > > > second map: 4k - 6k - hole, skip. Beyond EOF. > > > > > > > > third map: 6k - 8k - delalloc, beyond EOF so needs zeroing. > > > > > > > > Fires warnings because UNSHARE. > > > > > > > > > > > > > > > > During the first call to iomap_unshare_iter(), > > > > > > > > iomap_length() returned > > > > > > > > 4k, so 4k bytes were unshared. > > > > > > > > See discuss here[2] for more details. > > > > > > > > > > > > > > > > > > This all kind of has me wondering.. do we know the > > > > > > > > > purpose of this > > > > > > > > > warning/error in the first place? It seems like it's more > > > > > > > > > of an > > > > > > > > > "unexpected situation" than a specific problem. E.g., > > > > > > > > > assuming the same > > > > > > > > > page were mmap'd, I _think_ the read fault path would do > > > > > > > > > the same sort > > > > > > > > > of zeroing such that the unshare would see a fully > > > > > > > > > uptodate folio and > > > > > > > > > carry on as normal. I added the mapread op to the opsfile > > > > > > > > > below to give > > > > > > > > > that a quick test (remove the "skip" text to enable it), > > > > > > > > > and it seems to > > > > > > > > > prevent the error, but I've not confirmed whether that > > > > > > > > > theory is > > > > > > > > > actually what's happening. > > > > > > > > > > > > > > > > > > > > > > > > > > > > FWIW, I also wonder if another way to handle this would > > > > > > > > > > be to just > > > > > > > > > > restrict the range of iomap_file_unshare() to within > > > > > > > > > > EOF. IOW if a > > > > > > > > > > caller passes a range beyond EOF, just process whatever > > > > > > > > > > part of the > > > > > > > > > > range falls within EOF. It seems iomap isn't > > > > > > > > > > responsible for the file > > > > > > > > > > extending aspect of the fallocate unshare command > > > > > > > > > > anyways. > > > > > > > > > > > > > > > > It already does 'just process whatever part of the range > > > > > > > > falls within EOF'. > > > > > > > > Check the above case. > > > > > > > > > > > > > > > > I'm not sure if I fully understand what you mean. This > > > > > > > > patch does not > > > > > > > > prevent unsharing extents within the EOF. This patch checks > > > > > > > > if pos is > > > > > > > > beyond EOF, instead of checking if pos + length is beyond > > > > > > > > EOF. So the > > > > > > > > extents within EOF should be unshared as usual. > > > > > > > > > > > > > > > > > > > > > > I'm not concerned about preventing unsharing. I'm concerned > > > > > > > that this > > > > > > > patch doesn't always prevent attempts to unshare post-eof > > > > > > > ranges. I > > > > > > > think the difference here is that in the variant I was > > > > > > > hitting, we end > > > > > > > > > > > > > > > up with a mapping that starts within EOF and ends beyond > > > > > > > > EOF, whereas > > > > > > > > the syzbot variant produces a scenario where the > > > > > > > > problematic mapping > > > > > > > > always starts beyond EOF. > > > > > > > > > > > > This is not true. In the case above, the syzbot did indeed > > > > > > unshare the > > > > > > extents between 0-4k, which were started within EOF and ended > > > > > > beyond > > > > > > EOF. The specific variants here are: pos:0 len:0x1000 EOF: > > > > > > 0xE00. And > > > > > > the unshare code successfully unshared extents between 0 and > > > > > > 4k. > > > > > > > > > > > > During the next loop in iomap_file_unshare(), the pos became > > > > > > 0x1000, > > > > > > which is beyond EOF. What this patch does is to skip the > > > > > > unshare > > > > > > during the second loop. > > > > > > Is there anything I misunderstand? > > > > > > > > > > Hmm, what block size? Does the associated mapping have at least > > > > > one full > > > > > block beyond EOF? If you have a map at offset 0, length 0x1000 > > > > > and EOF > > > > > at 0xE00, then unless you have 512b blocks it sounds like the EOF > > > > > block > > > > > actually starts within EOF. > > > > > > > > The block size here is 2k, and there isn't a full block beyond EOF > > > > within > > > > this extent map. > > > > > > Ok. That likely explains the difference in behavior. The fsx variant > > > has > > > a mapping that starts within EOF and has at least one post-EOF block > > > that unshare attempts to process. > > > > > > > > > > > > > The variant I'm seeing is more like this.. consider a -bsize=1k > > > > > fs, a > > > > > file size of 0x3c400, and an EOF mapping of (offset 0x3c000, > > > > > length > > > > > 0x4000). The EOF folio in this case is 4k in size and starts at > > > > > the same > > > > > 0x3c000 offset as the EOF mapping. > > > > > > > > > > So with 1k blocks, the EOF mapping starts one block before EOF > > > > > and > > > > > extends well beyond it. What happens in the test case is that > > > > > iomap_unshare_iter() is called with the EOF folio, pos 0x3c000, > > > > > length > > > > > 0x800, and where the block at offset 0x3c400 is not marked > > > > > uptodate. pos > > > > > is thus within EOF, but the while loop in __iomap_write_begin() > > > > > walks > > > > > past it and attempts to process one block beyond EOF. > > > > > > > > Ok, so the key point here is that there is a full block beyond EOF > > > > within > > > > the associated extent map, which is different with the scenario > > > > reproduced > > > > by syzbot. > > > > According to the Dave's comments, the trimming behavior seems like > > > > should > > > > be done in filesystem(e.g.,xfs), instead of iomap. I will > > > > reconsider this scenario. > > > > > > > > > > Seems reasonable, but I don't agree that is a suitable fix for iomap. > > > To > > > be clear, it's perfectly fine for the fs to trim the range however it > > > sees fit (i.e. no shared blocks beyond EOF in XFS), but we should > > > also > > > recognize that iomap is a generic layer and unshare is currently > > > implemented to trip over itself, warn and fail if passed a post-eof > > > range. > > > > > > > > > > I still suspect that just making unshare work correctly around > > > > i_size > > > > might be the more elegant long term solution, but that is not > > > > totally > > > > clear. IMO, as long as unshare is written to intentionally trip > > > > over > > > > itself for post-eof ranges, it should either trim the range or > > > > check for > > > > valid parameters and document the limitation. Otherwise we just > > > > leave a > > > > landmine for the next caller to have to work through the same > > > > problems, > > > > which is particularly subtle since the higher level fallocate > > > > unshare > > > > mode supports post-eof ranges. Just my .02. > > > > Yeah, totally agreed. I prefer to do the trimming in the vfs layer > > just like what generic_copy_file_checks() does, instead of a specific > > file system. Maybe we need a helper function > > generic_fallocate_checks(). But it requires more thought and testing. > > > > I would agree with that if not for the subtle difference between > fallocate unshare and the iomap implementation. fallocate unshare is > basically just a behavior modifier for how shared blocks are handled by > mode=0 preallocation, so AIUI it fully supports post-eof ranges. It can > extend the file size or leave it unchanged (based on whether KEEP_SIZE > is set) and preallocate beyond EOF. > > IOW, it might be perfectly valid for a caller to run an fallocate > unshare across EOF that happens to unshare various shared blocks within > EOF, and then preallocate and extend the file size as part of the same > command. The functional responsibility just happens to be split between > iomap and the fs. Yeah, precisely. It seems that fallocate unshare do support post-eof ranges, but iomap_unshare_iter() doesn't expect any blocks beyond eof. Maybe the semantics here is that any blocks beyond eof shouldn't be shared, but I can not confirm that. Also, it's possible that blocks beyond eof which may belong to an extent acrossing eof may be passed to iomap_unshare_iter() in some cases. It seems we should figure out the purpose of this warning. Hi, Christoph. Do you remember why the warning was introduced? Looking forward to your reply after you return from vacation. > > Brian > > > > > > > Brian > > > > > > > Thanks for your comments and review. > > > > > > > > > > Brian > > > > > > > > > > > > > > > > > > > So IOW, this patch works for the syzbot variant because it > > > > > > > happens to > > > > > > > reproduce a situation where pos will be beyond EOF, but that > > > > > > > is an > > > > > > > > > > > > > > > assumption that might not always be true. The fsx generated > > > > > > > > variant runs > > > > > > > > a sequence that produces a mapping that spans across EOF, > > > > > > > > which means > > > > > > > > that pos is within EOF at the start of unshare, so unshare > > > > > > > > proceeds to > > > > > > > > walk across the EOF boundary, the corresponding EOF folio > > > > > > > > is not fully > > > > > > > > uptodate, and thus write begin wants to do partial zeroing > > > > > > > > and > > > > > > > > fails/warns. > > > > > > > > > > > > Yeah, it's exactly what the syzbot does. > > > > > > > > > > > > > > I suspect that if the higher level range were trimmed to be > > > > > > > within EOF > > > > > > > in iomap_file_unshare(), that would prevent this problem in > > > > > > > either case. > > > > > > > Note that this was on a -bsize=1k fs, so what I'm not totally > > > > > > > sure about > > > > > > > is whether skipping zeroing as such would be a problem with > > > > > > > larger FSBs. > > > > > > > My initial thinking was this might not be possible since the > > > > > > > EOF folio > > > > > > > should be fully uptodate in that case, but that probably > > > > > > > requires some > > > > > > > thought/review/testing. > > > > > > > > > > > > > > Brian > > > > > > > > > > > > > > > BTW, maybe the check here should be > > > > > > > > if (pos >= i_size_read(iter->inode)) > > > > > > > > > > > > > > > > If there is any misunderstanding, please let me know, > > > > > > > > thanks. > > > > > > > > > > > > > > > > [1]: > > > > > > > > https://lore.kernel.org/all/0000000000008964f1061f8c32b6@go > > > > > > > > ogle.com/T/ > > > > > > > > [2]: https://lore.kernel.org/all/20240903054808.126799-1- > > > > > > > > sunjunchao2870@xxxxxxxxx/ > > > > > > > > > > > > > > > > > > > > > > > > > > Thoughts? > > > > > > > > > > > > > > > > > > Brian > > > > > > > > > > > > > > > > > > [1] https://lore.kernel.org/linux- > > > > > > > > > xfs/20240906114051.120743-1-bfoster@xxxxxxxxxx/ > > > > > > > > > [2] > > > > > > > > > https://lore.kernel.org/fstests/20240906185606.136402-1- > > > > > > > > > bfoster@xxxxxxxxxx/ > > > > > > > > > [3] fsx ops file: > > > > > > > > > > > > > > > > > > fallocate 0x3bc00 0x400 0 > > > > > > > > > write 0x3bc00 0x800 0x3c000 > > > > > > > > > clone_range 0x3bc00 0x400 0x0 0x3c400 > > > > > > > > > skip mapread 0x3c000 0x400 0x3c400 > > > > > > > > > fallocate 0x3bc00 0xc00 0x3c400 unshare > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > -- > > > > > > > > Julian Sun <sunjunchao2870@xxxxxxxxx> > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > -- > > > > > > Julian Sun <sunjunchao2870@xxxxxxxxx> > > > > > > > > > > > > > > > > > > > Thanks, > > > > -- > > > > Julian Sun <sunjunchao2870@xxxxxxxxx> > > > > > > > > > > > > > -- > > Julian Sun <sunjunchao2870@xxxxxxxxx> > > > Thanks, -- Julian Sun <sunjunchao2870@xxxxxxxxx>