On Sun, Oct 30, 2022 at 09:51:00PM -0700, Andrew Morton wrote:
> On Fri, 28 Oct 2022 08:54:28 -0400 Brian Foster <bfoster@xxxxxxxxxx> wrote:
>
> > A call to file[map]_write_and_wait_range() with an end offset that
> > precedes the start offset but happens to land in the same page can
> > trigger writeback submission but fail to wait on the submitted page.
> > Writeback submission occurs because __filemap_fdatawrite_range()
> > passes both offsets down into write_cache_pages(), which rounds down
> > to page indexes before it starts processing writeback.
> > __filemap_fdatawait_range() immediately returns if the specified end
> > offset precedes the start offset, however.
> >
> > I suspect these checks are primarily intended to handle overflow
> > conditions. I happened to notice this behavior when investigating an
> > unrelated problem and observed that a filemap_write_and_wait_range()
> > call with unexpected parameters had seemingly unpredictable latency.
> > That latency turned out to be the submission path occasionally
> > waiting on writeback state of the page (i.e. from
> > write_cache_pages()) before issuing the currently requested
> > writepage and then unconditionally failing to wait on the latter via
> > __filemap_fdatawait_range().
> >
> > This could probably be reasonably fixed to either elide the
> > submission, as this patch does, or modify the fdatawait path to
> > check the page indexes instead of the unaligned offsets. After
> > poking around a bit, it seemed more consistent with various other
> > filemap interfaces to check the offsets in the write path and return
> > if the end offset is not >= the start. For example,
> > filemap_range_has_page() and filemap_range_has_writeback() both
> > include similar byte granularity checks.
> >
> > ...
> >
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -418,6 +418,9 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
> >  		.range_end = end,
> >  	};
> >
> > +	if (end < start)
> > +		return 0;
> > +
> >  	return filemap_fdatawrite_wbc(mapping, &wbc);
> >  }
>

Hi Andrew,

Sorry for the delay..

> Is there any way in which this condition can be triggered from
> userspace?  Or from any non-buggy kernelspace?
>

Hmm.. good question. Making a quick pass through the callers of
__filemap_fdatawrite_range(), I see the following situations:

- sync_file_range() includes a higher level end < start check that
  results in skipping the operation. This appears to cover the
  sync_file_range() family of syscalls.
- generic_fadvise(..., POSIX_FADV_DONTNEED) sets end = -1 if it is less
  than start. This essentially converts the range to cover through the
  end of the file.
- filemap_fdatawrite_range() and file[map]_write_and_wait_range() are
  called from quite a few places with non-deterministic inputs.

It wouldn't surprise me a ton if somehow it were possible for some of
these callers to do the end < start thing based on unsanitized input or
buggy logic; they may just not care, depending on whether they wait or
not. For example, gfs2_fsync() calls filemap_fdata[write|wait]_range()
separately, so in theory if called with end < start, the write could
submit but the wait could be skipped, similar to having called
filemap_write_and_wait_range() (assuming the conditional whole file
write in that path doesn't trigger), if the offsets land in the same
page.

> Should we have a WARN_ON() in there to detect this?
>

That might make sense. I suppose it depends on what the expected
behavior is. It certainly doesn't make much sense to write and then not
wait from the write_and_wait() variants, so callers would probably want
to know about that.
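
To make the mismatch concrete, here is a small userspace model of the
two checks described above. It is only a sketch and not kernel code: the
model_* names, the model_loff_t type and the 4 KiB page size are
assumptions made up for illustration, and the write side is reduced to
the "round down to page indexes" comparison.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages; the real PAGE_SHIFT depends on the architecture. */
#define MODEL_PAGE_SHIFT 12

/* Hypothetical stand-in for the kernel's loff_t; offsets are assumed >= 0. */
typedef int64_t model_loff_t;

/*
 * Write side: models write_cache_pages() rounding both byte offsets down
 * to page indexes, so work is submitted when start index <= end index.
 */
static bool model_write_submits(model_loff_t start, model_loff_t end)
{
	return (start >> MODEL_PAGE_SHIFT) <= (end >> MODEL_PAGE_SHIFT);
}

/*
 * Wait side: models __filemap_fdatawait_range() returning immediately
 * when the end byte offset precedes the start byte offset.
 */
static bool model_wait_runs(model_loff_t start, model_loff_t end)
{
	return end >= start;
}

/* True when a page would be submitted for writeback but never waited on. */
static bool model_write_without_wait(model_loff_t start, model_loff_t end)
{
	return model_write_submits(start, end) && !model_wait_runs(start, end);
}
```

In this model, start = 100 and end = 50 share page index 0, so the write
side submits while the wait side bails out. With start = 5000 and
end = 100 the page indexes are 1 and 0, so both sides skip.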

It might be hard to really audit all callsites to determine whether
anybody actually relies on the "end is before start but lands on the
same page" behavior for the write only case. We could alternatively
change fdatawait to compare the shifted page indexes to match fdatawrite
behavior, but that all seems a bit fragile because of 1. the various
higher level byte granularity checks and 2. I doubt anybody actually
checks for whether the range crosses a page boundary, which is the
difference between skipping just the wait or the write as well.

So I dunno, I could see various combinations of changes being considered
reasonable. Perhaps a good starting point would be to wrap the check in
this patch with a WARN_ON_ONCE() and let it soak in -next for a while?
That would avoid excessive noise from repetitive callers [1] but still
allow those callsites to be identified/fixed. If there is some really
weird fdatawrite-only caller that conflicts, the change could always be
loosened up from there (as unlikely as that seems).. Hm?

Brian

[1] The use case that identified this problem is a wonky call from the
XFS truncate path from a workload that makes this truncate call
repeatedly. A WARN_ON_ONCE() would have most definitely been useful IMO,
but an unconditional warning would spam the logs in this particular
case.
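
For reference, the "warn once, then skip" idea can be sketched in plain
userspace C as below. This is a rough illustration under stated
assumptions, not the proposed kernel change: sketch_warn_on_once() is a
made-up stand-in for the kernel's WARN_ON_ONCE() macro (it tracks a
single call site only), and the sketch_* counters and function names are
hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel's loff_t. */
typedef int64_t sketch_loff_t;

static int sketch_warnings;    /* how many times a warning was printed */
static int sketch_submissions; /* how many times writeback was "submitted" */

/*
 * Userspace stand-in for WARN_ON_ONCE(): reports only the first time the
 * condition is true, but always returns the condition so the caller can
 * still bail out on every bad call.
 */
static bool sketch_warn_on_once(bool cond, const char *what)
{
	static bool warned;

	if (cond && !warned) {
		warned = true;
		sketch_warnings++;
		fprintf(stderr, "WARNING (once): %s\n", what);
	}
	return cond;
}

/* Models the patched write path: skip submission when end < start. */
static int sketch_fdatawrite_range(sketch_loff_t start, sketch_loff_t end)
{
	if (sketch_warn_on_once(end < start, "end < start"))
		return 0;

	sketch_submissions++; /* placeholder for the real writeback submission */
	return 0;
}
```

A repetitive caller passing end < start gets one warning and no
submissions, while well-formed ranges are submitted as before.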